System architecture
Four diagrams. Each captures one slice of the system: the two-repository split, the student model, the runtime fusion pipeline, and the training loss. All four use a left-to-right reading direction; the page CSS allows horizontal scroll for wide diagrams when the viewport is narrow.
1. Two-repository split
The off-board training pipeline (this repository) produces a TensorRT engine that the on-vehicle runtime (sibling repository NCHSB) consumes. The .engine artifact is the only file that crosses the boundary.
DA3 + YOLO + SAM2"]:::hpc M["manifest.jsonl"]:::hpc TR["train.py
EfficientViT-B1
student"]:::hpc EX["export_trt.py
ONNX → TensorRT"]:::hpc AR[".engine
(boundary)"]:::pivot STN["Student TRT
Node"]:::jet DFN["Depth Fusion
Node"]:::jet PCN["PointCloud
XYZ Node"]:::jet N2["Nav2 local
costmap"]:::jet T --> M --> TR --> EX --> AR --> STN --> DFN --> PCN --> N2 classDef hpc fill:#e8f0ff,stroke:#999 classDef jet fill:#d4e7c5,stroke:#999 classDef pivot fill:#fff3cd,stroke:#666,stroke-width:2px
Blue nodes run on NYU Greene HPC (L40S 48 GB partitions). Green nodes run on the Jetson Orin Nano inside the ~30 Hz perception loop. The yellow .engine artifact is the only build product transferred between the two systems.
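The hand-off itself is small. Below is a sketch of the ONNX leg of export_trt.py, with a trivial placeholder module standing in for the trained student, followed by the usual trtexec invocation; the file names, opset, and flags here are assumptions, not the repository's exact export code.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the trained V9 student; the real model comes from train.py.
model = nn.Conv2d(3, 1, kernel_size=1).eval()
dummy = torch.zeros(1, 3, 240, 320)  # [B, 3, 240, 320], the student's input shape

torch.onnx.export(
    model, dummy, "student.onnx",
    opset_version=17,
    input_names=["rgb"], output_names=["depth"],
)
# TensorRT leg (run on the target Jetson so the engine matches its GPU), e.g.:
#   trtexec --onnx=student.onnx --saveEngine=student.engine --fp16
```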
2. Student model architecture
EfficientViT-B1 encoder (5.31 × 10⁶ parameters), 128-channel neck, two parallel upsample-convolution decoders. ImageNet normalization is applied within the forward pass for inference-time preprocessing parity.
[B, 3, 240, 320]"]:::input ENC["EfficientViT-B1
encoder
(5.31 × 10⁶ params)"]:::block NECK["Neck
Conv2d(256→128, 1×1)"]:::neck DEC_D["Depth decoder
3 × DecoderBlock
+ Conv2d(16,1) + ReLU"]:::block DEC_S["Seg decoder
3 × DecoderBlock
+ Conv2d(16,6)"]:::block OUT_D["Depth output
[B, 1, 240, 320]
metric (m)"]:::output OUT_S["Seg logits
[B, 6, 240, 320]"]:::output IN --> ENC --> NECK NECK --> DEC_D --> OUT_D NECK --> DEC_S --> OUT_S ENC -.skip features.-> DEC_D ENC -.skip features.-> DEC_S classDef input fill:#f5f5f5,stroke:#999 classDef block fill:#e8f0ff,stroke:#999 classDef neck fill:#fff3cd,stroke:#999 classDef output fill:#d4e7c5,stroke:#999
The encoder’s four stages produce feature maps at H/2, H/4, H/8, and H/16 resolutions with 40, 80, 160, and 256 channels. Each DecoderBlock(in, out) implements Upsample(2×) → Conv2d(in + skip, out, 3×3) → BatchNorm → ReLU. Skip connections route per-block: decoder block 1 receives stage-3 features, block 2 stage-2, block 3 stage-1. The dashed arrows represent this aggregate skip topology; per-block wiring is uniform across the two decoder paths.
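A minimal PyTorch sketch of the block just described, plus the in-forward ImageNet normalization mentioned above. The explicit skip_ch argument and the class names are conveniences of this sketch, not the repository's exact signatures:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Upsample(2x) -> Conv2d(in + skip, out, 3x3) -> BatchNorm -> ReLU, as described.
    # The doc writes DecoderBlock(in, out); skip_ch is made explicit here for clarity.
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.up(x), skip], dim=1)  # channel-concat the encoder skip features
        return torch.relu(self.bn(self.conv(x)))

class NormalizedInput(nn.Module):
    # ImageNet normalization inside forward(), so the exported engine and the
    # training graph apply identical preprocessing (the parity noted above).
    def __init__(self):
        super().__init__()
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:  # rgb in [0, 1]
        return (rgb - self.mean) / self.std
```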
3. Runtime depth fusion
Each frame on the Jetson runs three steps: student inference, per-frame median-scale calibration against surviving ToF pixels, and per-pixel confidence-gated substitution.
1280 × 720"]:::input TOF["ToF depth
+ confidence
(~22 % valid)"]:::input STUDENT["V9 student
TensorRT FP16
~5 ms"]:::block SCALE["Median-scale
s = median(d_ToF / d_V9)
over conf ≥ 0.5 pixels"]:::calib GATE{"Per-pixel:
conf ≥ 0.5 AND
0.05 ≤ d_ToF ≤ 10.0 m?"}:::calib FUSED["Fused depth
1280 × 720"]:::output NAV["Nav2 local
costmap"]:::output RGB --> STUDENT --> SCALE --> GATE TOF --> SCALE TOF --> GATE GATE -->|yes: d_ToF| FUSED GATE -->|no: s · d_V9| FUSED FUSED --> NAV classDef input fill:#f5f5f5,stroke:#999 classDef block fill:#e8f0ff,stroke:#999 classDef calib fill:#fff3cd,stroke:#999 classDef output fill:#d4e7c5,stroke:#999
Two operations per frame (both sketched below):

- Median-scale calibration — s = median(d_ToF / d_student) over pixels with confidence ≥ 0.5. One scalar per frame. See Scale Calibration.
- Confidence-gated substitution — per pixel, use the raw ToF reading where confident and in range; otherwise use s · d_student. See Confidence-Gated Fusion.
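A NumPy sketch of both operations, using the thresholds from the diagram. It assumes d_student has already been resampled to the 1280 × 720 ToF grid; the d_student > 0 guard is a sketch detail, not quoted from the runtime node:

```python
import numpy as np

CONF_MIN, D_MIN, D_MAX = 0.5, 0.05, 10.0  # gate thresholds from the diagram above

def fuse_depth(d_tof, conf, d_student):
    """Per-frame fusion sketch. All arrays are HxW floats on the ToF grid."""
    conf_ok = conf >= CONF_MIN
    # 1. Median-scale calibration: one scalar per frame over confident pixels.
    px = conf_ok & (d_student > 0)
    s = float(np.median(d_tof[px] / d_student[px])) if px.any() else 1.0
    # 2. Confidence-gated substitution: raw ToF where confident and in range,
    #    scaled student prediction everywhere else.
    gate = conf_ok & (d_tof >= D_MIN) & (d_tof <= D_MAX)
    return np.where(gate, d_tof, s * d_student), s
```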
The student segmentation output (not shown here) is consumed separately by the Class Costmap Node, which applies per-class inflation radii (glass = 0.20 m, person = 0.30 m, wall = 0.12 m). The layer is opt-in: it must be enabled in the Nav2 observation source list via configuration. See Specification and Deployment Realization.
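For illustration, a sketch of per-class inflation on a grid costmap using SciPy's dilation. The radii are the ones quoted above; the 0.05 m cell size and the class-id mapping are assumptions of this sketch, and cost ramps plus the Nav2 plumbing are omitted:

```python
import numpy as np
from scipy.ndimage import binary_dilation

CELL_M = 0.05                          # assumed costmap resolution
RADII_M = {3: 0.20, 1: 0.30, 2: 0.12}  # glass, person, wall (class ids assumed)

def disk(radius_m: float) -> np.ndarray:
    """Circular structuring element of the given metric radius."""
    r = int(round(radius_m / CELL_M))
    y, x = np.ogrid[-r:r + 1, -r:r + 1]
    return (x * x + y * y) <= r * r

def inflate(seg: np.ndarray) -> np.ndarray:
    """seg: HxW class ids projected into the costmap frame; returns the
    union of per-class inflated obstacle masks."""
    out = np.zeros(seg.shape, dtype=bool)
    for cls, radius in RADII_M.items():
        out |= binary_dilation(seg == cls, structure=disk(radius))
    return out
```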
4. Training loss composition
The multi-task loss combines a depth term (berHu), a segmentation term (cross-entropy), and an edge-aware smoothness regularizer, balanced by Kendall multi-task weighting.
+ per-frame targets"]:::input BERHU["berHu loss
(depth)"]:::loss CE["Cross-entropy
(segmentation)"]:::loss EDGE["Edge-aware
smoothness"]:::loss KENDALL["Kendall weighting
+ edge regularizer
(log σ² ∈ [-2, 2])"]:::weight TOTAL["Total scalar loss
(AdamW backprop)"]:::output INPUTS --> BERHU --> KENDALL INPUTS --> CE --> KENDALL INPUTS --> EDGE --> KENDALL KENDALL --> TOTAL classDef input fill:#f5f5f5,stroke:#999 classDef loss fill:#e8f0ff,stroke:#999 classDef weight fill:#fff3cd,stroke:#999 classDef output fill:#d4e7c5,stroke:#999
The hybrid depth target lives in models/losses.py:HybridDepthLoss. The important nuance: the training target chooses between DA3 and ToF at the frame level (does this frame have a DA3 label?), while the deployment fusion chooses between ToF and the student at the pixel level (does this pixel have valid ToF?). Both implement the same supervision principle — prefer hardware ground truth where available, fall back to the learned signal where not — at the granularity appropriate to each stage. See Decisions for the full mapping.
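The granularity contrast, as a sketch (field names like da3_depth are hypothetical; the real logic is HybridDepthLoss on the training side and the Depth Fusion Node above on the deployment side):

```python
import torch

def training_target(frame: dict) -> torch.Tensor:
    # Frame-level choice: prefer the DA3 teacher label when this frame has one,
    # otherwise fall back to raw ToF. Field names are assumptions of this sketch.
    return frame["da3_depth"] if frame.get("da3_depth") is not None else frame["tof_depth"]

def deployed_depth(d_tof, conf, d_student, s):
    # Pixel-level choice: trusted in-range ToF wins per pixel; the median-scaled
    # student prediction fills everything else (thresholds from the fusion diagram).
    gate = (conf >= 0.5) & (d_tof >= 0.05) & (d_tof <= 10.0)
    return torch.where(gate, d_tof, s * d_student)
```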