V6: Multi-Domain Pretraining Followed by NYU Fine-Tuning
V6 introduces a two-stage training schedule: multi-domain pretraining on SUN RGB-D and DIODE Indoor, followed by NYU Depth V2 fine-tuning under the V5 augmentation pipeline. The configuration achieves 0.519 m NYU validation RMSE (the lowest NYU result in the lineage) and serves as the fine-tuning base for the corridor specialist V9.
Configuration
| Property | Value |
|---|---|
| Architecture | EfficientViT-B1 (unchanged from V4) |
| Stage 1 corpus | SUN RGB-D (~10 × 10³ frames) + DIODE Indoor (~8 × 10³ frames) |
| Stage 2 corpus | NYU Depth V2 (1,159 train frames) with V5 augmentation pipeline |
| Loss formulation | berHu + cross-entropy + edge-aware smoothness, Kendall-weighted |
| Optimizer | AdamW; encoder LR 3 × 10⁻⁵, decoder LR 3 × 10⁻⁴ |
| Stage 1 HPC job | 3093046 (38 / 50 epochs, terminated by scheduler — see Operational History) |
| Stage 2 HPC job | 3098656 (154 / 200 epochs, terminated at walltime) |
| Failed sibling jobs | 3084108, 3088530 |
| NYU val RMSE | 0.519 m (lowest NYU result in lineage) |
| NYU val mIoU (6-class) | 48.5 % (regression vs V5’s 63.7 %; see Mixed-Supervision Effects) |
| Femto Bolt corridor RMSE | 2.158 m (statistically equivalent to V5’s 2.186 m) |
| HuggingFace identifier | NishantPushparaju/vortex-depth-v6-pretrained |
| Codename | Cornerstone (the foundation other specialists build on) |
| Checkpoint | hpc_outputs/best_depth_v6.pt |
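Two rows of the table translate directly into training code: the per-module learning rates and the Kendall-style uncertainty weighting over the three loss terms. The sketch below shows one way these could be wired up; the `encoder` parameter-name prefix, the simplified exp(−s)·L + s weighting form, and the omission of weight decay are assumptions for illustration, not the repository's actual optimizer or `MultiTaskLoss` code.

```python
import torch
from torch import nn


def build_optimizer(model: nn.Module, encoder_lr: float = 3e-5, decoder_lr: float = 3e-4):
    """AdamW with separate learning rates for the encoder and everything else.

    Assumes encoder parameters are named with an "encoder" prefix; the real
    attribute layout of the EfficientViT-B1 student may differ.
    """
    enc, rest = [], []
    for name, p in model.named_parameters():
        (enc if name.startswith("encoder") else rest).append(p)
    return torch.optim.AdamW([
        {"params": enc, "lr": encoder_lr},   # encoder LR 3e-5 from the table
        {"params": rest, "lr": decoder_lr},  # decoder LR 3e-4 from the table
    ])


class KendallWeighting(nn.Module):
    """Homoscedastic uncertainty weighting over N loss terms (Kendall et al.),
    in the common simplified form: total = sum_i( exp(-s_i) * L_i + s_i ),
    where s_i = log(sigma_i^2) is learned per term."""

    def __init__(self, num_terms: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_terms))  # berHu, CE, smoothness

    def forward(self, *losses: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack(losses)
        return (torch.exp(-self.log_vars) * stacked + self.log_vars).sum()
```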
Method
The configuration applies a two-stage training schedule designed to broaden the encoder’s prior over indoor depth structure before specialization on the primary training corpus.
```mermaid
flowchart TB
    subgraph Stage1["Stage 1: Multi-domain pretraining"]
        direction TB
        SUN["SUN RGB-D
        ~10 × 10³ frames
        (no segmentation labels)"]
        DIODE["DIODE Indoor
        ~8 × 10³ frames
        (no segmentation labels)"]
        NYU1["NYU Depth V2
        1.159 × 10³ frames
        (with 6-class segmentation)"]
        S1MIX["Combined ~19 × 10³ frames
        Segmentation loss skipped
        where labels absent"]
        SUN --> S1MIX
        DIODE --> S1MIX
        NYU1 --> S1MIX
        S1OUT["Job 3093046
        38 / 50 epochs
        (scheduler termination)"]
        S1MIX --> S1OUT
    end
    subgraph Stage2["Stage 2: NYU fine-tuning"]
        direction TB
        NYU2["NYU Depth V2
        + V5 augmentation pipeline"]
        S2OUT["Job 3098656
        154 / 200 epochs
        (walltime termination)"]
        NYU2 --> S2OUT
    end
    Stage1 --> CKPT["V6 pretrain checkpoint
    (encoder prior established)"]
    CKPT --> Stage2
    Stage2 --> RESULT["V6 final checkpoint
    NYU val RMSE: 0.519 m
    NYU val mIoU: 48.5 %"]
    FAIL1["Job 3084108
    NaN loss
    (ignore-index edge case)"]
    FAIL2["Job 3088530
    GPU utilization kill
    (dataloader bottleneck)"]
    FAIL1 -.->|"informed loss-function fix"| Stage1
    FAIL2 -.->|"informed dataloader profiling"| Stage1
    style Stage1 fill:#e8f0ff
    style Stage2 fill:#fff3cd
    style RESULT fill:#d4e7c5
    style FAIL1 fill:#fde2e2
    style FAIL2 fill:#fde2e2
```
Diagram source: assets/diagrams/models/v6-pretrain-finetune-timeline.mmd.
Stage 1: Multi-domain pretraining
SUN RGB-D and DIODE Indoor are folded into the training mixture alongside NYU Depth V2. Both auxiliary datasets provide metric depth ground truth that does not require teacher inference. The objective is to expose the encoder to indoor texture and lighting variance not represented in NYU’s apartment-scale scenes.
| Dataset | Approximate frame count | Segmentation labels | Role |
|---|---|---|---|
| SUN RGB-D | 10 × 10³ | None (all pixels = ignore index 255) | Texture diversity |
| DIODE Indoor | 8 × 10³ | None (all pixels = ignore index 255) | Texture diversity |
| NYU Depth V2 | 1.159 × 10³ | 6-class | Primary supervision |
The auxiliary datasets do not provide segmentation supervision; pixel labels are uniformly set to the ignore index. The segmentation head receives no gradient on auxiliary-dataset batches, requiring the cross-entropy loss to handle all-ignore conditions safely (see Loss-Function Edge Case below).
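One way to fold the auxiliary datasets into the mixture while keeping a uniform batch schema is to wrap them so every sample carries an all-ignore segmentation map. The sketch below is illustrative; the wrapper and the dataset constructor names are assumptions, not the repository's data-loading code.

```python
import torch
from torch.utils.data import ConcatDataset, Dataset

IGNORE_INDEX = 255  # matches nn.CrossEntropyLoss(ignore_index=255)


class NoSegWrapper(Dataset):
    """Wrap a depth-only dataset (SUN RGB-D, DIODE Indoor) so each sample
    carries an all-ignore segmentation map, matching the NYU batch schema."""

    def __init__(self, base: Dataset):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        rgb, depth = self.base[idx]  # assumed (rgb, depth) pairs from the base dataset
        seg = torch.full(depth.shape[-2:], IGNORE_INDEX, dtype=torch.long)
        return rgb, depth, seg


# Hypothetical constructors; the repository's dataset classes may differ.
# stage1_corpus = ConcatDataset([
#     NoSegWrapper(SunRGBDDataset(...)),
#     NoSegWrapper(DiodeIndoorDataset(...)),
#     NYUDepthV2Dataset(...),  # already yields (rgb, depth, seg)
# ])
```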
Stage 2: NYU fine-tuning
The auxiliary datasets are removed. Training continues on NYU Depth V2 alone with the V5 augmentation pipeline (horizontal flip + ColorJitter + random crop / resize). The pretrain establishes a richer encoder prior; the fine-tune adapts the encoder to NYU’s scene statistics and re-engages the segmentation head.
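A paired-transform sketch of that augmentation pipeline, assuming torchvision: geometric operations (flip, crop/resize) are mirrored across RGB, depth, and segmentation, while ColorJitter touches only the RGB. The crop fraction and jitter strengths are illustrative values, not the documented V5 settings.

```python
import random

import torchvision.transforms.functional as TF
from torchvision.transforms import ColorJitter, InterpolationMode


def v5_style_augment(rgb, depth, seg, out_size=(480, 640)):
    """Horizontal flip + ColorJitter + random crop / resize, applied consistently.

    Expects rgb as a float tensor (3, H, W) in [0, 1], depth as (1, H, W),
    seg as a long tensor (1, H, W). Parameter values are illustrative.
    """
    # 1. Horizontal flip with a shared coin toss.
    if random.random() < 0.5:
        rgb, depth, seg = TF.hflip(rgb), TF.hflip(depth), TF.hflip(seg)

    # 2. Photometric jitter on the RGB only; depth and seg stay untouched.
    rgb = ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)(rgb)

    # 3. Shared random crop, then resize back to the training resolution.
    h, w = rgb.shape[-2:]
    ch, cw = int(h * 0.8), int(w * 0.8)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    rgb = TF.resized_crop(rgb, top, left, ch, cw, out_size, InterpolationMode.BILINEAR)
    depth = TF.resized_crop(depth, top, left, ch, cw, out_size, InterpolationMode.NEAREST)
    # Nearest interpolation keeps class IDs exact; cast via float for interpolate support.
    seg = TF.resized_crop(seg.float(), top, left, ch, cw, out_size, InterpolationMode.NEAREST).long()
    return rgb, depth, seg
```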
Quantitative Results
| Metric | V6 | V5 | Δ |
|---|---|---|---|
| NYU val RMSE | 0.519 m | 0.572 m | −9.3 % |
| NYU val mIoU (6-class) | 48.5 % | 63.7 % | −15.2 pp |
| Femto Bolt corridor RMSE | 2.158 m | 2.186 m | −1.3 % (statistically equivalent) |
The NYU val RMSE improvement of 9.3 % at fixed architecture is attributable to the multi-domain pretraining stage. The improvement is smaller than V4 → V5 (26 %) but is the second-largest improvement on the NYU validation metric in the lineage. The corridor RMSE is statistically unchanged from V5: the multi-domain pretrain does not improve corridor performance directly, since the auxiliary datasets share the apartment-scale geometry of NYU rather than the corridor-scale geometry of the deployment environment.
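For reference, the RMSE metric over valid depth pixels and the relative deltas reported in the table can be expressed as below; the validity-mask convention (gt > 0) is an assumption rather than the evaluation code's documented behaviour.

```python
import torch


def depth_rmse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """RMSE in metres over pixels with valid ground truth (assumed gt > 0)."""
    mask = gt > 0
    return torch.sqrt(torch.mean((pred[mask] - gt[mask]) ** 2))


# Relative deltas as reported above:
# NYU val:        (0.519 - 0.572) / 0.572  ≈ -9.3 %
# Femto corridor: (2.158 - 2.186) / 2.186  ≈ -1.3 %
```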
Operational History
The V6 result of 0.519 m NYU val RMSE was produced by a training program that included two scheduler-terminated runs, one walltime termination, and one NaN-loss failure. The full sequence:
| HPC job | Outcome | Cause | Effect |
|---|---|---|---|
| 3084108 | Terminated | NaN loss propagation | Triggered investigation; produced the loss-function edge-case fix |
| 3088530 | Terminated by scheduler | GPU utilization below partition threshold | Identified as dataloader bottleneck on parallel filesystem |
| 3093046 | Terminated by scheduler at 38 / 50 epochs | Same GPU utilization issue | Snapshotted as Stage 1 pretrain checkpoint |
| 3098656 | Walltime timeout at 154 / 200 epochs | 12-hour partition limit | Snapshotted as Stage 2 (V6 final) checkpoint |
The final V6 checkpoint is therefore built from two terminated runs whose snapshots were nonetheless usable artifacts. Subsequent training programs include explicit dataloader profiling at job submission time and partition-aware walltime budgeting to reduce reliance on terminated-run snapshots.
Loss-Function Edge Case
Job 3084108 was terminated by NaN loss propagation. The cause was a corner case in PyTorch’s nn.CrossEntropyLoss(ignore_index=255): when every pixel in a batch is set to the ignore index, the reduction divides by zero and returns NaN. The NaN propagates through the multi-task loss and corrupts the optimizer state within a small number of iterations.
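A minimal reproduction of the edge case (illustrative, not repository code): with every target pixel set to the ignore index, the mean reduction has zero valid pixels and the loss comes back NaN.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss(ignore_index=255)
logits = torch.randn(2, 6, 8, 8, requires_grad=True)   # 6-class segmentation logits (N, C, H, W)
labels = torch.full((2, 8, 8), 255, dtype=torch.long)  # every pixel carries the ignore index
loss = ce(logits, labels)                               # mean reduction divides by zero valid pixels
print(loss)                                             # tensor(nan, grad_fn=...)
```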
The condition occurs whenever a batch consists entirely of frames from datasets without segmentation supervision (SUN RGB-D, DIODE), which becomes likely under randomized batch sampling. The fix:
```python
# Within MultiTaskLoss.forward
has_valid_seg = (seg_target != 255).any()
if has_valid_seg:
    L_seg = self.ce(pred_seg, seg_target)
else:
    L_seg = pred_seg.sum() * 0.0  # zero loss with gradient connectivity preserved
```
The guard is now applied across all multi-domain training configurations.
Mixed-Supervision Effects
The 15.2 pp regression in NYU val mIoU (V6: 48.5 % vs V5: 63.7 %) is attributable to mixed-supervision effects during Stage 1 pretraining. The segmentation head receives effective gradient on only the NYU subset of pretrain batches (~6 % of frames). By the start of Stage 2 fine-tuning, the segmentation decoder has drifted toward zero-output predictions on the auxiliary-dataset distribution and must re-learn the 6-class structure on NYU. The 154-epoch fine-tune (truncated by walltime) does not fully recover the V5 mIoU.
This is a documented tradeoff: the depth head benefits from the pretrain (richer encoder representations transfer); the segmentation head pays for the auxiliary-dataset supervision gap. The configuration accepts the segmentation regression because depth is the load-bearing output for the costmap fusion pipeline. For deployments where segmentation accuracy is primary, V5 is the preferred checkpoint.
Comparative Performance as Fine-Tuning Base
V6’s primary value lies in its role as a fine-tuning base for domain-specialized configurations. The V6 vs V5 comparison as initialization for corridor specialization:
| Specialist | Initialization | LILocBench corridor RMSE | NYU val RMSE after fine-tune |
|---|---|---|---|
| V7 | V5 | 0.445 m | 1.315 m |
| V9 | V6 | 0.382 m | 1.553 m |
Initializing from V6 rather than V5 produces a 14 % relative improvement in LILocBench corridor RMSE at fixed fine-tuning protocol. The richer encoder prior established by the V6 pretrain transfers to corridor specialization more effectively than V5’s NYU-only training.
Demonstration
Six-panel video sequence over the 459-frame corridor_eval set. Top row (raw inputs and reference): RGB input · raw Femto Bolt ToF depth · zero-shot DA3-Small reference depth (median-scale aligned). Bottom row (V6-specific predictions and fusions): V6 student raw inference · confidence-gated fusion of ToF and DA3 · confidence-gated fusion of ToF and V6. V6’s prediction structure on the corridor distribution is comparable to V5 (both general-purpose models with similar Femto Bolt corridor RMSE: 2.158 m and 2.186 m, respectively). V6 (Cornerstone) is the recommended fine-tuning base for additional corridor specialists rather than the recommended deployment checkpoint for corridor environments — see V9 for the production corridor model.
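The median-scale alignment and confidence-gated fusion referenced above can be sketched as follows; the validity mask, the confidence input, and the gating threshold are assumptions about the fusion pipeline rather than its documented parameters.

```python
import torch


def median_scale_align(relative_depth: torch.Tensor, tof_depth: torch.Tensor) -> torch.Tensor:
    """Scale an up-to-scale prediction so its median matches the ToF median
    over pixels where the ToF sensor returned a valid reading (assumed > 0)."""
    valid = tof_depth > 0
    scale = tof_depth[valid].median() / relative_depth[valid].median().clamp(min=1e-6)
    return relative_depth * scale


def confidence_gated_fusion(tof_depth: torch.Tensor,
                            model_depth: torch.Tensor,
                            confidence: torch.Tensor,
                            threshold: float = 0.5) -> torch.Tensor:
    """Keep ToF where it is valid and trusted; fall back to the model elsewhere.
    The confidence map and threshold are illustrative, not the pipeline's gate."""
    use_tof = (tof_depth > 0) & (confidence > threshold)
    return torch.where(use_tof, tof_depth, model_depth)
```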
Findings
V6 supports two findings on training-program design:
- Multi-domain pretraining transfers to downstream specialization quality. The 14 % LILocBench corridor RMSE improvement between V7 (V5-initialized) and V9 (V6-initialized) under fixed fine-tuning protocol isolates the effect of the pretraining stage on the specialization endpoint.
- Mixed-dataset training requires explicit handling of supervision gaps. Datasets without complete annotation coverage produce loss-function edge cases (NaN propagation under ignore-index reductions) that surface only at runtime. Training programs combining sources with heterogeneous annotation must include explicit guards against zero-supervision conditions per loss term.
→ V7 reports the corridor specialization initialized from V5, establishing the baseline against which V9 is compared.
→ V9 reports the corridor specialization initialized from V6, the recommended deployment checkpoint for corridor environments.