V4: Encoder Substitution at Fixed Training Recipe
V4 substitutes EfficientViT-B1 for MobileNetV3-Small as the encoder, holding the V3 training recipe constant. The configuration achieves 0.774 m NYU val RMSE, a 33 % relative reduction from V3, and is the first configuration evaluated against the deployment camera (Femto Bolt corridor RMSE: 1.373 m). All subsequent configurations (V5 through V9) inherit the V4 encoder.
Configuration
| Property | Value |
|---|---|
| Architecture | EfficientViT-B1 encoder, dual-head decoder (depth + segmentation) |
| Trainable parameters | 5.31 × 10⁶ (3.5 × increase over V3) |
| Teacher | DA3-Metric-Large (unchanged from V3) |
| Loss formulation | berHu + cross-entropy + edge-aware smoothness, Kendall-weighted (unchanged from V3) |
| Optimizer | AdamW; encoder LR 3 × 10⁻⁵, decoder LR 3 × 10⁻⁴ (unchanged from V3) |
| Schedule | Cosine annealing, 200 epochs |
| Batch size | 16 |
| Training corpus | NYU Depth V2 |
| HPC job ID | 3043912 |
| NYU val RMSE | 0.774 m (down from V3’s 1.160 m, −33 %) |
| NYU val mIoU (6-class) | 51.0 % (up from V3’s 39.3 %, +11.7 pp) |
| Femto Bolt corridor RMSE | 1.373 m (first deployment-camera measurement) |
| Codename | Pivot (the architectural turning point of the lineage) |
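The Kendall-weighted composite loss named in the table can be sketched in a few lines. This is a minimal illustration, assuming the standard homoscedastic-uncertainty weighting of Kendall et al. (2018) and the conventional berHu threshold of 0.2 × the maximum residual; the exact per-term constants and smoothness formulation of the V3 recipe are assumptions, not confirmed details.

```python
import torch

def berhu_loss(pred, target):
    """Reverse Huber: L1 below threshold c, scaled L2 above.
    c = 0.2 * max residual is the conventional choice (assumed here)."""
    resid = (pred - target).abs()
    c = (0.2 * resid.max()).clamp(min=1e-6)
    return torch.where(resid <= c, resid, (resid**2 + c**2) / (2 * c)).mean()

def edge_aware_smoothness(depth, image):
    """Penalize depth gradients, down-weighted where the image has edges."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

class KendallWeightedLoss(torch.nn.Module):
    """Combine task losses with learned log-variances s_i:
    total = sum_i exp(-s_i) * L_i + s_i."""
    def __init__(self, n_tasks=3):
        super().__init__()
        self.log_vars = torch.nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        s = self.log_vars
        return sum(torch.exp(-s[i]) * L + s[i] for i, L in enumerate(losses))
```

In use, the three task losses (berHu depth, cross-entropy segmentation, edge-aware smoothness) are passed as a list to `KendallWeightedLoss`, and the learned log-variances are optimized jointly with the network weights.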
Method
The configuration differs from V3 by a single experimental variable: the encoder backbone. All other components — teacher selection, loss formulation, augmentation pipeline, optimizer settings, and training schedule — are held constant, isolating the effect of the encoder substitution.
EfficientViT-B1 (Cai et al. 2023) is a hybrid attention-convolution architecture with linear-complexity multi-head self-attention. The architectural difference relative to V3’s encoder is shown below.
Diagram summary: V3's MobileNetV3-Small (~1.5 × 10⁶ params) is a purely convolutional stack whose receptive field grows linearly with depth, limiting its capacity for long-range structure. V4's EfficientViT-B1 (5.31 × 10⁶ params) takes input [B, 3, 240, 320] through four stages — 40 ch at H/2, 80 ch at H/4, 160 ch at H/8, 256 ch at H/16 — each combining convolution with linear-complexity multi-head attention, capturing long-range dependencies such as corridor lines and perspective. Trained under the identical V3 recipe, the encoder swap moves NYU val RMSE from 1.160 m to 0.774 m (−33 %) and mIoU from 39.3 % to 51.0 % (+11.7 pp).
Diagram source: assets/diagrams/models/v4-encoder-comparison.mmd.
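The linear-complexity attention that distinguishes EfficientViT from softmax-attention ViTs can be illustrated directly. The sketch below is a simplified single-head ReLU linear attention (kernelized attention with φ = ReLU), not the exact EfficientViT-B1 module; the ε stabilizer and tensor layout are illustrative assumptions.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention: softmax(QK^T)V is replaced by
    phi(Q)(phi(K)^T V) / (phi(Q) phi(K)^T 1) with phi = ReLU.
    Cost is O(N * d^2) in sequence length N, vs O(N^2 * d) for softmax.
    q, k, v: [B, N, d]."""
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum('bnd,bne->bde', k, v)        # [B, d, d] summary of K^T V
    num = torch.einsum('bnd,bde->bne', q, kv)      # [B, N, d] numerator
    den = torch.einsum('bnd,bd->bn', q, k.sum(1))  # [B, N] normalizer
    return num / (den.unsqueeze(-1) + eps)
```

Because the `[d, d]` summary `kv` is computed once and reused for every query, cost scales linearly with the number of tokens — the property that makes attention affordable at 240 × 320 feature resolutions.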
Capacity and selection rationale relative to the neighboring encoder options:
| Encoder | Parameters | Rationale |
|---|---|---|
| MobileNetV3-Small (V3) | ~1.5 × 10⁶ | Insufficient capacity for fine spatial structure at depth boundaries |
| EfficientViT-B1 (V4) | 5.31 × 10⁶ | Selected; meets Jetson Orin Nano latency budget at ~5 ms inference |
| EfficientViT-B2 | ~15.4 × 10⁶ | Evaluated in train_iter7b_b2.slurm; marginal accuracy gain insufficient to justify ~3 × inference cost |
| EfficientNet-B3 | ~12 × 10⁶ | Considered; rejected on inference latency |
EfficientViT-B1 occupies the operational sweet spot: sufficient capacity to capture long-range structure for indoor depth estimation, and inference latency compatible with the 30 Hz perception loop on the deployment hardware.
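The latency budget can be sanity-checked with a simple timing harness. The module below is a stand-in (the real check would load the V4 model on the Jetson Orin Nano); the 30 Hz / ~5 ms figures follow the text, while the harness parameters and stand-in network are illustrative.

```python
import time
import torch

def median_latency_ms(model, input_shape=(1, 3, 240, 320), warmup=5, iters=50):
    """Median wall-clock latency per forward pass, in milliseconds."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):   # warm up caches and lazy initialization
            model(x)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            model(x)
            times.append((time.perf_counter() - t0) * 1e3)
    times.sort()
    return times[len(times) // 2]

# Stand-in network; on the deployment hardware this would be the V4 model.
toy = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU())
latency = median_latency_ms(toy)
# A 30 Hz perception loop leaves ~33 ms per frame; the encoder's ~5 ms
# inference budget is a fraction of that frame time.
```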
Architectural Detail
The EfficientViT-B1 encoder produces feature maps at four resolution stages with channel widths [40, 80, 160, 256]. The decoder consumes these via skip connections to a 128-channel neck, with two parallel transposed-convolution paths producing depth and segmentation outputs at the input resolution.
Each DecoderBlock(in, out) implements:
Upsample(2×) → Conv2d(in + skip, out, 3×3) → BatchNorm → ReLU
The skip connection is drawn from the encoder stage of matching spatial resolution and concatenated with the upsampled feature map prior to convolution. ImageNet normalization is applied within the forward pass to ensure preprocessing parity between training and inference.
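The decoder block and in-forward normalization described above can be sketched as follows. Channel bookkeeping (encoder-matched skip widths feeding a 128-channel neck) follows the text; module naming, the bilinear upsampling mode, and other details are assumptions.

```python
import torch
import torch.nn as nn

# ImageNet statistics, applied inside forward() so training and inference
# share identical preprocessing.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize(x):
    """Applied at the top of the model's forward pass."""
    return (x - IMAGENET_MEAN) / IMAGENET_STD

class DecoderBlock(nn.Module):
    """Upsample(2x) -> concat skip -> Conv3x3 -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        # Skip is drawn from the encoder stage at the matching resolution
        # and concatenated before the convolution.
        x = self.up(x)
        return self.block(torch.cat([x, skip], dim=1))
```

For a 240 × 320 input, the deepest block would consume the 256-channel H/16 feature map and the 160-channel H/8 skip, e.g. `DecoderBlock(256, 160, 128)`.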
Quantitative Results
| Metric | V4 | V3 | Δ |
|---|---|---|---|
| NYU val RMSE | 0.774 m | 1.160 m | −33 % |
| NYU val mIoU (6-class) | 51.0 % | 39.3 % | +11.7 pp |
| Femto Bolt corridor RMSE | 1.373 m | not measured | — |
The 33 % NYU RMSE reduction is attributable to the encoder substitution under fixed-recipe conditions. The 11.7 pp mIoU improvement is similarly encoder-attributable: segmentation typically benefits from increased encoder capacity, since richer feature representations yield sharper class boundaries.
The Femto Bolt corridor RMSE (1.373 m) substantially exceeds the NYU val RMSE. This gap is attributable to distribution shift: V4 was trained on clean teacher labels over NYU’s apartment-scale scenes, and the deployment recordings include exposure dynamics, motion blur, and corridor geometry not represented in the training distribution. This gap motivated the augmentation work introduced at V5.
Concurrent Contribution Pattern
The V4 result reflects the joint effect of two parallel contributions: the encoder substitution and the V3 recipe rewrite. Either contribution in isolation produces a substantially weaker result. The same recipe applied to MobileNetV3-Small produced V3’s 1.160 m RMSE; the same encoder applied to V1’s MSE-against-relative-depth supervision would inherit V1’s unit-space failure regardless of architectural improvements.
The V4 result is therefore not attributable to a single component but to the interaction between the encoder and the recipe — neither sufficient alone.
Findings
V4 establishes the architectural template inherited by all subsequent configurations. The encoder, decoder structure, neck dimensionality, skip-connection topology, and forward-pass normalization remain unchanged through V9. All variation in V5 through V9 is restricted to the data pipeline (augmentation, dataset mixture) and the training schedule (pretraining stages, fine-tuning protocols).
→ V5 reports the augmentation pipeline that closes the train/test distribution gap and produces the largest single-step accuracy improvement in the lineage.