Training pipeline: V1 through V9

This page documents the full student lineage — nine training iterations across two backbone architectures, three dataset mixtures, and two loss formulations. It is written as a story, not a changelog, because the decisions make no sense without the context that produced them.

Per-version detail: every version below has a dedicated page with architecture, results, and a verdict. See the model lineage index for the full set, or jump to a specific version: V1 · V2 · V3 · V4 · V5 · V6 · V7 · V8 · V9.


The problem

We need a small network that takes an RGB image and outputs:

  1. Dense metric depth (240×320, meters)
  2. 6-class semantic segmentation (floor, wall, glass, person, furniture, other)

The depth output will fuse with whatever ToF pixels survived hardware failure. The segmentation output will modulate costmap inflation (glass gets wider berth than wall). Both heads share a single encoder — we cannot afford two networks on a Jetson Orin Nano at 30 FPS.
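A minimal sketch of that shared-encoder, two-head layout (the encoder here is a placeholder conv stack, not the MobileNetV3-Small or EfficientViT-B1 backbones discussed below; channel counts are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadStudent(nn.Module):
    """Illustrative skeleton only: one shared encoder, a depth head and a segmentation head."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        # Shared encoder (placeholder; the real students use MobileNetV3-Small or EfficientViT-B1).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Depth head: one channel of metric depth.
        self.depth_head = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )
        # Segmentation head: one logit per class.
        self.seg_head = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, rgb: torch.Tensor):
        feats = self.encoder(rgb)
        size = rgb.shape[-2:]
        depth = F.interpolate(self.depth_head(feats), size, mode="bilinear", align_corners=False)
        seg = F.interpolate(self.seg_head(feats), size, mode="bilinear", align_corners=False)
        return depth, seg

# A 240×320 input yields a 1×240×320 depth map and a 6×240×320 logit map.
depth, seg = TwoHeadStudent()(torch.randn(1, 3, 240, 320))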


Teacher ensemble

Training labels come from three foundation models, none of which run at real time on embedded hardware:

| Teacher | What it produces | Speed on HPC (A100) |
|---|---|---|
| DA3-Metric-Large | Dense metric depth from RGB; scale comes from focal · raw / 300 | ~12 FPS |
| YOLOv8-Large | Bounding boxes for person, furniture, background | ~30 FPS |
| SAM2-Large | Instance masks from YOLO-prompted boxes; fused with geometric floor/wall/glass heuristics | ~5 FPS |

The teacher pipeline runs on NYU HPC via SLURM. Per-frame outputs are written to $SCRATCH/nyu_teacher_data/ and linked into a manifest.jsonl that the training script reads.
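The manifest schema itself is not documented on this page; purely as an illustration, with hypothetical field names, each line pairs one frame with its teacher outputs and the training script loads it roughly like this:

import json

# Hypothetical manifest entry (field names are illustrative, not the actual schema).
example = {
    "rgb": "frames/000123.png",
    "teacher_depth": "depth/000123.npy",   # DA3-Metric-Large output, meters
    "teacher_seg": "seg/000123.png",       # fused 6-class mask
    "tof": "tof/000123.npy",               # sparse ToF depth, where available
}

# e.g. $SCRATCH/nyu_teacher_data/manifest.jsonl
with open("manifest.jsonl") as f:
    samples = [json.loads(line) for line in f]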


The loss function

The depth head trains on a hybrid target: for each pixel, if the ToF sensor reports valid depth above a confidence threshold, the target is the ToF reading. Otherwise, the target is the DA3-Metric-Large prediction. This is not a design choice made for convenience — it mirrors the runtime fusion policy exactly. The student learns to predict what the fused output should look like, not what either sensor alone reports.
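A minimal sketch of that per-pixel selection (tensor names and the confidence threshold are assumptions, not the repo's exact code):

import torch

def build_depth_target(tof_depth, tof_conf, teacher_depth, conf_thresh=0.5):
    """Per-pixel hybrid target: ToF where valid and confident, teacher depth elsewhere.
    tof_depth, teacher_depth: (H, W) meters; tof_conf: (H, W) confidence in [0, 1]."""
    use_tof = (tof_depth > 0) & (tof_conf > conf_thresh)
    return torch.where(use_tof, tof_depth, teacher_depth)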

The full loss (from models/losses.py):

\[\mathcal{L} = w_d \cdot \text{berHu}(d_{pred}, d_{target}) + w_s \cdot \text{CE}(s_{pred}, s_{target}) + w_e \cdot \text{EdgeSmooth}(d_{pred}, I_{rgb})\]

where \(w_d, w_s, w_e\) are either fixed or learned via Kendall uncertainty weighting (V3+). The berHu loss transitions from L1 to L2 at the 80th percentile of the per-batch residual, so small errors are penalized linearly and large errors quadratically.
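A sketch of the berHu term and the uncertainty weighting (the edge-aware smoothness term is omitted; names and the exact Kendall formulation here are illustrative, and the project's actual implementation lives in models/losses.py):

import torch
import torch.nn as nn

def berhu_loss(pred, target):
    """berHu: L1 below a threshold c, scaled L2 above it.
    Here c is the 80th percentile of the per-batch absolute residual."""
    resid = (pred - target).abs()
    c = torch.quantile(resid, 0.8).clamp(min=1e-6)
    l2 = (resid ** 2 + c ** 2) / (2 * c)
    return torch.where(resid <= c, resid, l2).mean()

class KendallWeighting(nn.Module):
    """One common form of homoscedastic uncertainty weighting (Kendall et al.):
    each task loss is scaled by exp(-s) and regularized by s, with s learned."""
    def __init__(self, num_tasks=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * loss + s
        return total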


V1: initial distillation baseline

Backbone: MobileNetV3-Small. Teacher: DA2-Large.

RMSE on NYU val: 75.37 m. Not a typo.

DA2 outputs relative depth — values in [0, 1] with no metric scale. The training loop expected metric depth in meters. Every prediction was off by orders of magnitude. The model learned to predict normalized values, and the RMSE metric computed the gap against meter-scale targets.

The fix was not complicated: switch to DA3-Metric-Large, which has a proper scale anchor. But the lesson was important — relative depth models need an explicit alignment step before you can use them as supervision.
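For reference, that alignment step is a per-image least-squares fit of scale and shift against a metric anchor (sparse ToF pixels or ground-truth depth). A minimal sketch, assuming dense tensors and a boolean validity mask:

import torch

def align_relative_depth(rel_depth, metric_depth, valid_mask):
    """Fit scale s and shift t so that s * rel_depth + t matches the metric anchor
    on valid pixels (closed-form least squares)."""
    x = rel_depth[valid_mask]
    y = metric_depth[valid_mask]
    A = torch.stack([x, torch.ones_like(x)], dim=1)        # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution   # (2, 1): [scale, shift]
    s, t = sol[0, 0], sol[1, 0]
    return s * rel_depth + t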


V2: loss-weighting diagnostic

Experiments with Kendall uncertainty clamping on V1’s architecture. The backbone was still MobileNetV3-Small with DA2 targets. Not a useful result — just confirming that loss weighting cannot fix fundamentally wrong supervision.


V3: recipe rewrite with metric-scale teacher

Backbone: MobileNetV3-Small. Teacher: DA3-Large.

Changes from V1: berHu loss (replaces MSE), Kendall uncertainty weighting (learns task weights), two-LR optimizer (backbone and decoders at different rates).

NYU RMSE: 1.160 m.

This was the first iteration where the numbers started to make physical sense. A meter-scale error on indoor depth (typical range 1–6 m) is meaningful but not great. The model was underfitting — MobileNetV3-Small’s encoder did not have enough capacity to capture the spatial structure the depth head needed.
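The two-LR optimizer introduced here amounts to two parameter groups, one for the pretrained backbone and one for the freshly initialized decoders. A sketch with illustrative learning rates (the module and attribute names are placeholders, not the repo's):

import torch
from torch import nn

class Student(nn.Module):
    """Stand-in with the encoder/decoder split the two-LR optimizer needs."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 64, 3, padding=1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.seg_head = nn.Conv2d(64, 6, 1)

model = Student()

# Pretrained backbone at a lower LR than the randomly initialized decoders
# (values illustrative, not the repo defaults).
optimizer = torch.optim.AdamW(
    [
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": list(model.depth_head.parameters()) + list(model.seg_head.parameters()), "lr": 1e-4},
    ],
    weight_decay=1e-4,
)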


V4: EfficientViT-B1 encoder substitution

Backbone: EfficientViT-B1. Teacher: DA3-Large.

NYU RMSE: 0.774 m. Femto Bolt corridor RMSE: 1.373 m.

A 33% RMSE reduction just from swapping the encoder. EfficientViT-B1 has 5.31M parameters (vs ~2.5M for MobileNetV3-Small) but runs at comparable latency on the Jetson because its attention mechanism is hardware-friendly. The lesson: once the loss and supervision are right, the bottleneck shifts to backbone capacity. This is obvious in retrospect, but V1-V3 were too broken to expose it.


V5: deployment-targeted augmentation pipeline

Backbone: EfficientViT-B1. Teacher: DA3-Large.

Changes from V4: deployment augmentations (color jitter, horizontal flip, random crop).

NYU RMSE: 0.572 m. Femto Bolt corridor RMSE: 2.186 m.

The biggest single-version improvement: -26% NYU RMSE from augmentations alone. But corridor RMSE increased from 1.373 m to 2.186 m. The augmentations helped the model generalize on NYU’s diverse indoor scenes but hurt on the narrow domain distribution of a single corridor. This is the augmentation-specialization tradeoff — it shows up again in V7/V9.
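The augmentations above carry one non-obvious constraint: geometric ops (flip, crop) must be applied identically to the RGB frame, the depth target, and the segmentation mask, while photometric jitter touches only the RGB. A sketch under that assumption (transform choices and magnitudes are illustrative, not the repo's settings):

import random
import torchvision.transforms.functional as TF

def augment(rgb, depth, seg, crop_size=(240, 320)):
    """Deployment-style augmentation: shared geometric ops, RGB-only color jitter.
    rgb: (3, H, W) float tensor; depth: (1, H, W) meters; seg: (H, W) class indices."""
    # Horizontal flip applied to all three, so geometry stays consistent.
    if random.random() < 0.5:
        rgb, depth, seg = TF.hflip(rgb), TF.hflip(depth), TF.hflip(seg)

    # Random crop with shared coordinates.
    _, h, w = rgb.shape
    ch, cw = crop_size
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    rgb = TF.crop(rgb, top, left, ch, cw)
    depth = TF.crop(depth, top, left, ch, cw)
    seg = TF.crop(seg, top, left, ch, cw)

    # Photometric jitter on the RGB only; depth and labels are untouched.
    rgb = TF.adjust_brightness(rgb, 1.0 + 0.2 * (random.random() - 0.5))
    rgb = TF.adjust_contrast(rgb, 1.0 + 0.2 * (random.random() - 0.5))
    return rgb, depth, seg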

V5 is the best general-purpose indoor student in the lineage.


V6: multi-domain pretraining with NYU fine-tuning

Backbone: EfficientViT-B1. Teacher: DA3-Large.

Changes from V5: pretrained on SUN RGB-D + DIODE (diverse indoor/outdoor scenes), then fine-tuned on NYU.

NYU RMSE: 0.519 m. Femto Bolt corridor RMSE: 2.158 m.

Another step down on NYU. The diverse pretraining gave the encoder better feature representations before seeing NYU-specific supervision. Corridor RMSE stayed roughly where V5 left it — the model is more capable overall, but not corridor-specialized.

V6 is the best NYU depth model and the recommended starting point for further specialization.


V7: single-domain fine-tuning from V5

Backbone: EfficientViT-B1. Teacher: DA3-Large.

Changes from V5: fine-tuned on LILocBench corridor frames.

NYU RMSE: 1.315 m. LILocBench RMSE: 0.445 m. Femto Bolt RMSE: 1.982 m.

NYU RMSE more than doubled — a classical catastrophic-forgetting outcome under single-domain fine-tuning. The encoder traded general indoor representations for corridor-specific accuracy. LILocBench RMSE improved sharply in return. The corridor is a narrow distribution (one building, one lighting regime, one floor type), and the model specializes to it quickly under fine-tuning.


V8: joint-domain training ablation

Backbone: EfficientViT-B1. Teacher: DA3-Large.

Idea: mix NYU + LILocBench frames during training to get the best of both worlds.

NYU RMSE: 0.592 m. Femto Bolt RMSE: 2.266 m.

Worse than V5 on corridor, barely better than V5 on NYU. The domain gap between NYU (diverse scenes, varying depth ranges) and LILocBench (one corridor, 1-6 m range) is too large for naive mixing to bridge. Multi-task training with explicit domain balancing might help, but it was not pursued — the deployment objective required a corridor specialist, not a universal model.
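For reference, one form that explicit balancing could take, not run in this work, is a weighted sampler that gives each domain equal draw probability despite the size imbalance. A sketch with placeholder datasets (sizes and the 50/50 split are illustrative):

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder datasets standing in for the NYU and LILocBench manifests.
nyu = TensorDataset(torch.zeros(4845, 1))
corridor = TensorDataset(torch.zeros(500, 1))
mixed = ConcatDataset([nyu, corridor])

# Give each domain equal total probability mass, regardless of frame count.
weights = torch.cat([
    torch.full((len(nyu),), 0.5 / len(nyu)),
    torch.full((len(corridor),), 0.5 / len(corridor)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler)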


V9: corridor-specialized student via sequential adaptation

Backbone: EfficientViT-B1. Teacher: DA3-Large.

Changes from V6: fine-tuned on LILocBench corridor frames (same protocol as V7, better starting point).

NYU RMSE: 1.553 m. LILocBench RMSE: 0.382 m. Femto Bolt RMSE: 1.589 m.

Best corridor specialist. Starting from V6 (pretrained on SUN+DIODE) instead of V5 (augmented NYU only) gave a better LILocBench result (0.382 vs 0.445). The Gazebo closed-loop validation with V9’s TensorRT engine achieved 9/10 success rate — matching the ground-truth depth reference run on the same seeds.

V9 is the production corridor checkpoint for this work.
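Engine building is not covered on this page; as a rough sketch, the checkpoint is exported to ONNX at the deployment resolution and then built into an FP16 engine with TensorRT's trtexec on the Jetson. The module below is a stand-in at the student's input/output shapes, not the real network (for loading the actual checkpoint, see the sketch after the checkpoint table):

import torch
from torch import nn

class StandIn(nn.Module):
    """Placeholder with the student's I/O layout: RGB in, depth and seg logits out."""
    def __init__(self):
        super().__init__()
        self.depth = nn.Conv2d(3, 1, 1)
        self.seg = nn.Conv2d(3, 6, 1)
    def forward(self, rgb):
        return self.depth(rgb), self.seg(rgb)

model = StandIn().eval()
dummy = torch.randn(1, 3, 240, 320)   # deployment resolution
torch.onnx.export(
    model, dummy, "student_v9.onnx",
    input_names=["rgb"], output_names=["depth", "seg"],
    opset_version=17,
)
# On the Jetson, build an FP16 engine with the stock TensorRT CLI, e.g.:
#   trtexec --onnx=student_v9.onnx --saveEngine=student_v9.engine --fp16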


Summary: when to use which checkpoint

| Checkpoint | Best for | RMSE to expect |
|---|---|---|
| best_depth_v5_vivek.pt | General indoor scenes | 0.572 m (NYU) |
| best_depth_v6.pt | NYU depth, or as a pretrain base | 0.519 m (NYU) |
| best_depth_v7.pt | Corridor (from V5 base) | 0.445 m (LILocBench) |
| best_depth_v9.pt | Corridor specialist (from V6 base, production checkpoint) | 0.382 m (LILocBench) |

All checkpoints are EfficientViT-B1, 5.31M params, ImageNet normalization, 240×320 input.
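A loading and preprocessing sketch consistent with that note (only the normalization convention and input size come from this page; the checkpoint layout and the student constructor are assumptions):

import torch
import torchvision.transforms.functional as TF
from PIL import Image

# Assumed: the checkpoint is a plain state_dict, and build_student() is a placeholder
# for however the repo constructs the EfficientViT-B1 two-head model.
model = build_student()
model.load_state_dict(torch.load("hpc_outputs/best_depth_v9.pt", map_location="cpu"))
model.eval()

img = Image.open("frame.png").convert("RGB")
x = TF.to_tensor(TF.resize(img, (240, 320)))                        # 240×320 input
x = TF.normalize(x, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])   # ImageNet stats
with torch.no_grad():
    depth, seg_logits = model(x.unsqueeze(0))
seg = seg_logits.argmax(dim=1)   # 6-class label map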


How to retrain

# Full training from scratch (V4-style)
python train.py --epochs 100 --batch-size 16 --device cuda \
    --manifest $SCRATCH/nyu_teacher_data/manifest.jsonl

# Fine-tune on corridor data (V9-style)
python train.py --epochs 50 --batch-size 8 --device cuda \
    --manifest corridor_eval_data/manifest.jsonl \
    --resume hpc_outputs/best_depth_v6.pt \
    --lr 1e-5

Training on a single L40S GPU takes ~45 minutes per 100 epochs on NYU (4,845 frames). LILocBench fine-tuning converges in ~15 minutes.