Deployment: HPC to Jetson
The training pipeline runs on NYU HPC (A100/L40S GPUs, 64 GB VRAM). The deployment target is a Jetson Orin Nano 8GB strapped to a Traxxas Maxx 4S RC car. The gap between those two environments is the subject of this page.
Export pipeline
export_trt.py takes a PyTorch checkpoint and produces three things:
- ONNX model — portable, runs anywhere with onnxruntime. Used for debugging and CPU-only validation.
- TensorRT FP16 engine — half-precision inference, 2× speedup over FP32 on Jetson, negligible accuracy loss.
- TensorRT INT8 engine — 4× speedup over FP32, but requires a calibration dataset to maintain accuracy.
```shell
# Full export (ONNX + FP16 + INT8)
python export_trt.py --checkpoint hpc_outputs/best_depth_v9.pt

# ONNX-only (no TRT, runs on laptop)
python export_trt.py --checkpoint hpc_outputs/best_depth_v9.pt --skip-trt

# INT8 with real calibration images
python export_trt.py --checkpoint hpc_outputs/best_depth_v9.pt \
    --calib-images corridor_eval_data/rgb/
```
INT8 caveat: The default calibration set in export_trt.py is np.random.rand(...). The export succeeds and the engine runs, but the quantization quality is poor. For production INT8, always supply --calib-images with real images from the target domain.
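Whatever feeds the calibrator, the frames have to reach TensorRT as float batches in the model's input layout. Below is a minimal sketch of that packing step, assuming a 240×320 RGB input in NCHW float32 scaled to [0, 1] — the exact preprocessing in export_trt.py may differ, and `load_calib_batches` is a hypothetical helper, not a function in the repo:

```python
import numpy as np

def load_calib_batches(images, batch_size=8, size=(240, 320)):
    """Pack RGB frames into NCHW float32 batches of the kind an
    INT8 calibrator feeds TensorRT. `images` is an iterable of
    HxWx3 uint8 arrays already resized to `size` (decoding and
    resizing are assumed to happen upstream)."""
    batch = []
    for img in images:
        assert img.shape == (*size, 3) and img.dtype == np.uint8
        x = img.astype(np.float32) / 255.0   # scale to [0, 1]
        x = np.transpose(x, (2, 0, 1))       # HWC -> CHW
        batch.append(x)
        if len(batch) == batch_size:
            yield np.stack(batch)            # (N, 3, H, W)
            batch = []
    if batch:                                # flush final partial batch
        yield np.stack(batch)
```

The point of --calib-images is that these batches come from corridor frames rather than np.random.rand: INT8 quantization picks activation ranges from whatever the calibrator sees, so random noise yields ranges that do not match the deployment distribution.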
Jetson benchmarks
benchmark_jetson.py measures latency, GPU memory, and depth RMSE across all exported engines.
| Engine | Resolution | Latency (ms) | GPU Mem (MB) | Depth RMSE (m) |
|---|---|---|---|---|
| DA3-Small FP16 | 308×308 | 4.6 | 2,700 | 0.522 |
| V9 Student FP16 | 240×320 | 2.1 | 890 | 0.382* |
| V9 Student INT8 (random calib) | 240×320 | 1.4 | 640 | — |
*V9 RMSE is on LILocBench, not the corridor set. On the corridor set, V9 is 1.418 m (see Evaluation).
DA3-Small at 218 FPS (FP16, 308×308) is the production runtime model. The V9 student runs alongside it as a corridor-specialized companion. Both fit in the Orin Nano’s 8 GB shared memory with room for the ROS 2 stack.
What benchmark_jetson.py does not measure: segmentation mIoU. The mIoU column is initialized but never populated (lines 150, 186). Latency and depth RMSE are real; mIoU is not.
ROS 2 integration
The student model plugs into the NCHSB ROS 2 stack through three nodes:
Student TRT Node → Subscribes to /camera/rgb, runs the TensorRT engine, publishes /student/depth and /student/segmentation.
Depth Fusion Node → Subscribes to /student/depth and /camera/depth (ToF). For each pixel: if the ToF sensor reports valid depth above a confidence threshold, output the ToF reading. Otherwise, output the student prediction. Publishes /perception/fused_depth.
Point Cloud XYZ Node → Back-projects /perception/fused_depth into a 3D point cloud using the camera intrinsics. Publishes /perception/fused_depth_points, which Nav2’s local costmap observes.
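The back-projection the Point Cloud XYZ Node performs is the standard pinhole inverse. A minimal numpy sketch, assuming intrinsics fx, fy, cx, cy and a metric depth map (the node's actual masking and frame conventions are not shown in this page):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an (N, 3)
    point cloud in the camera frame using the pinhole model:
    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop zero-depth (invalid) pixels
```

Nav2's local costmap then consumes these points like any other obstacle source, which is why dense fused depth matters: holes in the depth map become blind spots in the costmap.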
The fusion logic in the Depth Fusion Node matches the training target exactly: trust hardware where it works, fall back to learned depth where it does not.
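The per-pixel rule above is a single masked select. A minimal sketch, assuming ToF invalidity is encoded as zero depth and a separate confidence channel exists (the real node's message layout and threshold are not specified here):

```python
import numpy as np

def fuse_depth(tof, student, conf, conf_thresh=0.5):
    """Confidence-gated substitution, per pixel: keep the ToF
    reading where the sensor reports valid, confident depth;
    fall back to the student prediction everywhere else."""
    valid = (tof > 0) & (conf >= conf_thresh)
    return np.where(valid, tof, student)
```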
Specification and deployment realization
The formal specification expresses fusion as affine alignment: \(d_{metric} = \alpha \cdot d_{pred} + \beta\), where \(\alpha, \beta\) are estimated from a calibration set.
The live Depth Fusion Node does not implement affine alignment. It implements confidence-gated substitution: use ToF if valid, use student otherwise. No \(\alpha\), no \(\beta\), no per-pixel scaling.
The full mapping between the two implementations lives at Specification and Deployment Realization. The affine formulation is the precise mathematical statement used for evaluation; the substitution approach is the embedded realization that achieves the same costmap geometry under the on-vehicle compute and latency budget.
Both approaches arrive at the same costmap: dense depth from a mixture of hardware and learned sources. The evaluation scripts (run_costmap_ablation.py) use per-frame median scaling (the affine flavor). The runtime uses substitution. The Calibration Study quantifies the gap.
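For concreteness, here is what the two flavors look like side by side — a least-squares affine fit for \(\alpha, \beta\), and the per-frame median scale used in evaluation. This is a sketch under assumptions (valid-pixel masking details, and the exact estimator in run_costmap_ablation.py, are not shown on this page):

```python
import numpy as np

def affine_align(pred, gt, mask):
    """Least-squares alpha, beta so that alpha * pred + beta
    best matches gt over the valid-pixel mask."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, g, rcond=None)
    return alpha, beta

def median_scale(pred, gt, mask):
    """Per-frame median scaling: alpha = median(gt) / median(pred),
    beta fixed at 0 — the 'affine flavor' the evaluation uses."""
    return np.median(gt[mask]) / np.median(pred[mask])
```

The runtime substitution path needs neither function: it never rescales the student output, which is exactly the gap the Calibration Study measures.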
Reproducing on a Jetson
```shell
# On the Jetson (JetPack 5.x, TensorRT pre-installed)
scp best_depth_v9.pt vortex@<jetson-ip>:~/
ssh vortex@<jetson-ip>
python export_trt.py --checkpoint best_depth_v9.pt
python benchmark_jetson.py --engine-dir exported/
```
The exported engines are platform-specific: TensorRT serializes them for a particular GPU architecture and TensorRT version. An engine built on an L40S will not run on a Jetson Orin Nano. Always export on the target device.