# Bootstrap Perception Under Hardware Depth Failure
*Six-panel grid showing the V9 (Lighthouse) production deployment on the 459-frame corridor evaluation set. Top row: RGB input · raw Femto Bolt ToF depth (note the dead-pixel pattern across the floor) · zero-shot DA3-Small reference depth. Bottom row: raw V9 inference · confidence-gated fusion of ToF and DA3 (the foundation-model baseline) · confidence-gated fusion of ToF and V9 (the production deployment realization, consumed directly by the local costmap).*
A mobile robot navigating an indoor corridor finds that its Orbbec Femto Bolt ToF camera returns valid depth on 20.3% of pixels (459-frame measurement, verified end-to-end inside the Docker container — see Docker). The polished floor, glass walls, and out-of-range surfaces kill the rest. The LiDAR still works, but it scans a 2D plane — it does not see chairs, tabletops, or torsos.
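The dead-pixel rate is a simple statistic to reproduce from raw frames. A minimal sketch of the measurement, assuming a metric depth array and a working range for the sensor (the function name, thresholds, and synthetic frame are illustrative, not the project's actual evaluation code):

```python
import numpy as np

def valid_pixel_rate(depth_m: np.ndarray,
                     near: float = 0.25, far: float = 5.5) -> float:
    """Fraction of pixels with a usable ToF return.

    A pixel counts as valid when the sensor reports a finite,
    non-zero range inside the camera's rated working band.
    """
    valid = np.isfinite(depth_m) & (depth_m > near) & (depth_m < far)
    return float(valid.mean())

# Synthetic 240x320 frame with the lower half zeroed out, mimicking
# the dropout pattern seen on polished floors and glass.
frame = np.full((240, 320), 2.0, dtype=np.float32)
frame[120:, :] = 0.0          # lower half: no return
print(f"{valid_pixel_rate(frame):.1%} valid")  # → 50.0% valid
```

On the corridor set, the same statistic averaged over 459 frames yields the 20.3% figure quoted above.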
This project asks whether a single RGB camera, running a learned monocular depth model, can fill the gap left by the dead ToF pixels. Not to replace structured-light depth — to make the robot’s costmap dense enough to navigate safely.
The position this work defends, in one sentence: monocular depth alone cannot replace ToF, but it is an unexpectedly strong fusion partner. Fusing LiDAR with monocular depth recovers 55% more occupied costmap cells in narrow corridors and turns a robot that cannot see most of its environment into one that can.
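The fusion-partner claim rests on a simple mechanism: keep every valid ToF pixel, and gate monocular fill-in by confidence so hallucinated depth stays out of the costmap. A hedged sketch of that per-pixel substitution (the function names and the validity/confidence tests are assumptions, not the production ROS node):

```python
import numpy as np

def fuse_depth(tof_m: np.ndarray, mono_m: np.ndarray,
               mono_conf: np.ndarray, conf_thresh: float = 0.5) -> np.ndarray:
    """Confidence-gated per-pixel substitution.

    Valid ToF pixels pass through untouched; dead pixels are filled
    from the monocular model only where its confidence clears the
    gate, so low-confidence predictions never reach the costmap.
    """
    tof_valid = np.isfinite(tof_m) & (tof_m > 0)
    fill = (~tof_valid) & (mono_conf >= conf_thresh)
    fused = np.where(tof_valid, tof_m, np.nan)   # NaN = still unknown
    fused[fill] = mono_m[fill]
    return fused

# Toy 2x2 example: one valid ToF pixel per row, one dead pixel each.
tof  = np.array([[1.0, 0.0], [0.0, 3.0]])
mono = np.array([[1.1, 2.0], [2.5, 2.9]])
conf = np.array([[0.9, 0.8], [0.2, 0.9]])
fused = fuse_depth(tof, mono, conf)
# (0,1) is filled from mono (conf 0.8 passes the gate);
# (1,0) stays NaN (conf 0.2 fails it).
```

In practice the monocular branch must first be scale-aligned to the ToF frame; that step is covered under Architecture and the Calibration Study.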
## What this site documents
| Page | Content |
|---|---|
| Architecture | Mermaid diagrams: two-repo split, student model, runtime fusion (with median-scale calibration), training loss |
| Concepts | Bootstrap perception, scale calibration, confidence-gated fusion, four-layer sensing, specification and deployment, knowledge distillation — the ideas the rest of the site assumes |
| Hardware | Femto Bolt, RPLiDAR S2, Jetson Orin Nano, Traxxas Maxx 4S — the deployment platform |
| Model lineage (V1 → V9) | One page per training iteration with what changed, why, results, and verdict |
| Training Pipeline | Loss functions, teacher ensemble, the V1 → V9 narrative |
| Evaluation | Corridor depth metrics, costmap ablation, FPR decomposition |
| Calibration Study | Reviewer-requested experiment on affine alignment sensitivity |
| Decisions and tradeoffs | Specification vs deployment realization, deferred APE evaluation, INT8 calibration scope, V9 specialist tradeoff |
| Deployment | ONNX export, TensorRT, Jetson benchmarks, ROS 2 integration |
| Datasets | Frame counts, formats, naming conventions, hosting status |
| Demo Videos | Comparison videos across all models and datasets |
| Docker | One-command reproducibility, verified build |
## Headline results
| Capability | Number | Source |
|---|---|---|
| Best general indoor model (V5) | 0.572 m NYU val RMSE | V5 |
| Best NYU model (V6) | 0.519 m NYU val RMSE | V6 |
| Production corridor specialist (V9) | 0.382 m LILocBench RMSE, 9 / 10 Gazebo success | V9 |
| DA3-Small zero-shot on Jetson | 218 FPS / 4.6 ms / 2.7 GB | TensorRT FP16 at 308×308 |
| Costmap recovery (L+D vs L-only) | +55% occupied cells | Evaluation |
| ToF sensor dead-pixel rate (corridor) | 79.7% (verified in container) | Docker |
| End-to-end Docker reproducibility | smoke + 459-frame eval | Docker |
## Honest caveats
- APE / SLAM evaluation is deferred. A preliminary measurement was confounded by mismatched rosbag playback rates; a matched-rate re-evaluation is future work. Details.
- V9 is a corridor specialist and scores worse than V3 on NYU. The tradeoff is intentional and disclosed. Details.
- The 5.2% FPR is not free. It decomposes into model hallucinations (49%), sensor-fill artifacts (35%), and inflation artifacts (18%). Details.
- INT8 calibration in `export_trt.py` is a stub that defaults to random noise. The FP16 numbers are real; the INT8 numbers are not validated. Details.
- The fusion pipeline has two complementary implementations. The formal specification (used for evaluation and reporting) computes affine alignment explicitly; the deployment realization (used for on-vehicle inference) applies the same principle via per-pixel substitution within the embedded compute budget. The two are operationally equivalent in the deployment regime. Details.
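To make the last caveat concrete, the specification-side affine alignment can be sketched as a least-squares scale-and-shift fit of the monocular prediction against valid ToF pixels, with a median-ratio scale as the robust single-factor variant. All names here are hypothetical; the project's actual calibration code may differ:

```python
import numpy as np

def align_scale_shift(mono: np.ndarray, tof: np.ndarray) -> tuple[float, float]:
    """Fit d_tof ≈ s * d_mono + t by least squares,
    using only pixels where the ToF return is valid."""
    m = np.isfinite(tof) & (tof > 0) & np.isfinite(mono)
    A = np.stack([mono[m], np.ones(m.sum())], axis=1)
    s, t = np.linalg.lstsq(A, tof[m], rcond=None)[0]
    return float(s), float(t)

def align_median_scale(mono: np.ndarray, tof: np.ndarray) -> float:
    """Robust single-factor variant: s = median(tof / mono)
    over valid pixels; insensitive to outlier returns."""
    m = np.isfinite(tof) & (tof > 0) & (mono > 0)
    return float(np.median(tof[m] / mono[m]))
```

The deployment realization skips the explicit fit and substitutes pre-scaled monocular pixels directly, which is why the two paths agree in the regime the vehicle actually operates in.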
## Quick start (Docker, verified)
The Dockerfile and compose stack were built and run end-to-end on a Linux host. The smoke test and 459-frame corridor evaluation both pass.
```bash
# Build (CPU image, ~6.3 GB; needs --network=host on networks with restricted DNS)
docker build --network=host -t ml-inference .

# Smoke — forward pass on random tensor, confirms model loads
docker compose run --rm smoke-test
# → "Model forward pass OK: depth (1, 1, 240, 320), seg (1, 6, 240, 320)"
# → "PyTorch 2.11.0+cu130, OpenCV 4.13.0, NumPy 2.2.6"

# Corridor depth evaluation — 459 frames, prints per-bin RMSE
docker compose run --rm eval-corridor
# → RMSE 1.366 m (raw, uncalibrated; consistent with the reported 1.418 m within run-to-run variance)
# → 79.7% sensor dead-pixel rate confirms the bootstrap-perception premise

# Calibration sensitivity sweep
docker compose run --rm calibration

# Comparison videos (CPU; takes ~20 min)
docker compose run --rm grid-videos
```
Model weights (hpc_outputs/*.pt) and evaluation data (corridor_eval_data/) are volume-mounted, not baked into the image. HuggingFace hosting is on the to-do list — see Datasets for the current status.
## Stack
PyTorch 2.0+ · timm (EfficientViT-B1) · ONNX · TensorRT FP16 · OpenCV-headless · Jetson Orin Nano 8GB · Orbbec Femto Bolt · RPLiDAR S2 · ROS 2 Humble · Nav2 · SLAM Toolbox · NYU Greene HPC (L40S)
A robotics research project. Author and contributor information is omitted from this site during the active review window and will be added once review completes.