Phase 3 — Progress dashboard

Standing snapshot of what’s in flight, what’s blocked, and what can run in parallel without me. Updated after each working session.

Last update: 2026-04-27 — Phase-4 sweep complete. Structural ceiling confirmed; no single config is “the headline” — the curve is the story.

  • Six tracker variants run on RELLIS 2847 frames. Each trades the false-merge tail against distinct-ID count; none simultaneously hits the [150, 200] distinct / [250, 320] lifetime target window, because DBSCAN’s per-frame cluster instability on forest geometry (3 m centroid jitter; cluster count changes across 96% of frame pairs) is the structural floor.
  • Best false-merge result: K=3 eps=0.5 — 0 cascade revivals at > 20 m world drift (vs K=1 Mahal-v2’s 11), at the cost of cluster-fragmentation distinct-ID inflation (242 → 307).
  • Best distinct-ID result: K=1 Mahal-v2 — 242 distinct with 11 surviving false-merges.
  • Decision: ship neither as a single headline; ship the SWEEP as the M13.5 + Phase-4 production story. The curve is what survives interview scrutiny: “we measured the ceiling, here’s the trade-off shape, here’s why it sits where it does (DBSCAN cluster jitter on RELLIS forest is the floor), here’s the next step (learned 3D detector).”
  • Phase-4 status: complete — Phase-5 (learned detector) is the documented Phase-3 stretch. Animation render deferred to HPC. Currently running: K=3 eps=0.7 (recover distinct count) and min_hits=3 (already done — null result on distinct).
  • Full sweep + ceiling diagnosis in docs/m10-debug-log.md “Phase-4 sweep — the curve and the wall”.


Test count

| Suite | Count | Status |
| --- | --- | --- |
| test_kalman | 5 | ✅ |
| test_hungarian | 8 | ✅ |
| test_sort_tracker | 7 | ✅ |
| test_dbscan | 4 | ✅ |
| test_imm | 7 | ✅ (caveat: lock-in bias documented) |
| test_appearance_encoder | 4 | ✅ (4/4 — PyTorch ↔ Eigen match within 1e-5; ≥95% of triplets satisfy d_pos < d_neg) |
| Total | 35 | 35 passing |

Latest RELLIS metrics (M12 IMM, identical detections)

CV-baseline reproducibility — RESOLVED

The earlier “regression flag” was a misattribution by me, not a code regression. Today’s post-refactor code reproduces the M10 blog’s 979 number exactly when run with the correct config:

| Config | CV distinct | CV mean lifetime | Rows |
| --- | --- | --- | --- |
| min_hits=3, process_noise=0.5 (matches M10 blog) | 979 | 17.35 | 16,990 |
| min_hits=1, process_noise=2.0 (current script default) | 2131 | 22.23 | 47,375 |
| min_hits=1, process_noise=0.5 | 2021 | 23.44 | 47,375 |

Finding: The M10 blog body claims “tuning min_hits: 3 → 1 knocked the count down to 979.” That’s backwards — 979 is the min_hits=3 config; loosening to min_hits=1 increases distinct IDs (more one-frame ghost tracks get published instantly). The blog narrative is wrong; the numbers are real.
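The direction of the min_hits effect is mechanical; a toy sketch (not the project code, population numbers illustrative) of the publication rule:

```python
# Minimal sketch of why loosening min_hits inflates distinct-ID counts:
# a track is only published once it has accumulated min_hits consecutive
# matched frames, so min_hits=1 publishes every one-frame DBSCAN ghost
# immediately, while min_hits=3 suppresses them.

def count_published_tracks(track_hit_streaks, min_hits):
    """track_hit_streaks: per-track count of consecutive hits achieved."""
    return sum(1 for hits in track_hit_streaks if hits >= min_hits)

# Toy population: many 1-frame ghosts, fewer real multi-frame tracks.
streaks = [1] * 1000 + [2] * 100 + [5] * 50 + [40] * 30

print(count_published_tracks(streaks, min_hits=1))  # 1180: every ghost published
print(count_published_tracks(streaks, min_hits=3))  # 80: ghosts suppressed
```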

The Track refactor (KalmanFilter2D kf → std::unique_ptr<IFilter> filter) is verified algorithmically clean.

Headline IMM-vs-CV at script default (min_hits=1, process_noise=2.0)

| Metric | CV | IMM | Δ |
| --- | --- | --- | --- |
| Distinct IDs | 2131 | 1678 | −21% |
| Mean lifetime (frames) | 22.2 | 28.2 | +27% |
| Total publishable rows | 47,375 | 47,375 | identical |

IMM lifetime distribution (script default config)

| Lifetime band | Count | % |
| --- | --- | --- |
| exactly 1 frame (DBSCAN ghost) | 404 | 24.1% |
| 2–5 frames | 399 | 23.8% |
| 6–15 frames | 269 | 16.0% |
| 16–30 frames | 172 | 10.3% |
| 31–100 frames (3–10 s) | 295 | 17.6% |
| >100 frames (>10 s) | 139 | 8.3% |
| Total | 1678 | |

Max lifetime: 461 frames (46.1 s).

Reading: 48% of tracks live ≤ 0.5 s — upstream DBSCAN noise; M13’s appearance encoder cannot fix this band (the tracks die before any embedding update runs on them). 26% live > 3 s — these are the genuinely-tracked physical objects, plus the over-segmented or briefly-occluded fragments. M13 attacks the 2–30 frame band by re-associating fragments that DBSCAN over-segmented or that gating dropped after brief occlusion. Target: distinct 1678 → ≤ 250, mean lifetime ≥ 60.
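The band breakdown reduces to a single bucketing pass over per-track lifetimes; a minimal sketch (band edges match the table above, the input list is illustrative, not the RELLIS data):

```python
# Bucket per-track lifetimes (in frames, 10 Hz) into the bands used in the
# dashboard table. Input here is a toy list, not real tracker output.

def lifetime_bands(lifetimes):
    bands = {"1": 0, "2-5": 0, "6-15": 0, "16-30": 0, "31-100": 0, ">100": 0}
    for lt in lifetimes:
        if lt == 1:
            bands["1"] += 1          # one-frame DBSCAN ghost
        elif lt <= 5:
            bands["2-5"] += 1        # <= 0.5 s: upstream noise band
        elif lt <= 15:
            bands["6-15"] += 1
        elif lt <= 30:
            bands["16-30"] += 1
        elif lt <= 100:
            bands["31-100"] += 1     # 3-10 s: genuinely tracked
        else:
            bands[">100"] += 1       # > 10 s
    return bands

demo = [1, 1, 3, 8, 20, 45, 150]
print(lifetime_bands(demo))
# {'1': 2, '2-5': 1, '6-15': 1, '16-30': 1, '31-100': 1, '>100': 1}
```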

IMM at the M10-blog baseline config — DONE

| Config: min_hits=3, process_noise=0.5 | CV | IMM | Δ |
| --- | --- | --- | --- |
| Distinct track IDs | 979 | 808 | −17.5% |
| Mean lifetime (frames) | 17.35 | 19.87 | +14.5% |
| Total publishable rows | 16,990 | 16,056 | −5.5% |

Verdict: M12 ships. The IMM win is real and reproducible against the M10 blog’s published number. The < 800 hard floor was missed by 8 IDs (808); the original ≤ 600 target was not hit — that’s a Phase-3.5 motivation for M13.

M13 SHIPS — final config and ablation — DONE

Final config (production): IMM filter, λ=0.2, max_misses=300, min_hits=3, process_noise=0.5, meas_noise=0.3, max_dist=5.0

Headline RELLIS numbers (2849 frames, ~285 sec drive):

| Stack | Distinct IDs | Mean lifetime | Δ vs M4 baseline | Note |
| --- | --- | --- | --- | --- |
| M4 (CV, max_misses=10) | 979 | 17.35 | — | |
| M12 (IMM, max_misses=10) | 808 | 19.87 | −17.5% / +14.5% | |
| M13 (IMM + app λ=0.2, max_misses=300) | 237 | 57.79 | −75.8% / +233% | |
| M13.5 (IMM + cascade, max_age=30, ego anchor) — Fix A defensive | 202 | ~80 | −79.4% / +361% | Retired once Fix B shipped. |
| M13.5 (IMM + cascade, max_age=300, WORLD anchor) — Fix B + audit caveat | 127 | 373.0 | −87.0% / +2050% | Shipped but audit-incomplete: 95 cascade revivals across the drive still have > 20 m world drift, leaking through kLostPosGateScale=5.0 (25 m gate). Numbers PARTIALLY inflated by false-merges. |
| M13.5 (IMM + cascade, max_age=300, gate=10 m) — Fix C ATTEMPTED, NOT SHIPPED | 299 | 158.4 | −69.5% / +813% | Fixed-gate tightening 5.0 → 2.0. Killed all > 20 m drift false-merges (0 stationary-window > 10 m drift) BUT over-rejected DBSCAN-noisy legitimate revivals (5–15 m drift on partial-tree clusters). Worse than M13 cascade-off (237). Reverted; the data showed a fixed-distance gate can’t separate “noisy stationary” from “different object”. |
| M13.5 Mahalanobis-v1 (combined IMM cov, χ²=5.99) — TESTED, SUPERSEDED | 207 | 228.9 | −78.9% / +1219% | First Mahalanobis attempt. The IMM combined covariance balloons via the inter-mode spread term — admits 16 cascade revivals @ > 20 m world drift on stationary ego segments. Combined cov is correct for estimation, wrong for gating. |
| M13.5 Mahalanobis-v2 (per-mode IMM cov, χ²=5.99) — knee of trade-off curve | 242 | 195.8 | −75.3% / +1027% | Picks the more confident sub-model’s P_position for gating. 11 cascade revivals @ > 20 m drift remain (vs 95 for Fix B) — an 8.6× false-merge reduction. Sits at the trade-off knee on the tracker dimension. |
| M13.5 Mahalanobis-v3 (per-mode IMM cov, χ²=2.28) — TESTED, OVER-TIGHT | 384 | 123.4 | −60.8% / +611% | 2.6× tighter χ² (1σ ellipse instead of 95% confidence). Drops false-merges to 4 @ > 20 m, BUT distinct IDs balloon to 384 (worse than M13 cascade-off) and lifetime collapses. Empirical proof that the gate-tuning knob has no sweet spot — DBSCAN noise dictates the ceiling. |
| Phase-4: K=3 eps=0.5 + Mahal-v2 — knee of trade-off curve on the detector dimension | 307 | 238.2 | −68.6% / +1273% | Multi-frame point cloud accumulation: compose K=3 obstacle clouds in world frame using the SLAM ego pose, re-DBSCAN. 0 cascade revivals @ > 20 m world drift (vs 11 for K=1 Mahal-v2). Halves the > 10 m drift bucket (248 → 130). Cluster fragmentation pushes detections per frame 16.6 → 25.7, so distinct count rises. Lifetime up 22%. Cleaner per-cluster jitter (gap=5: 3.07 → 1.96 m, −35%). |
| Phase-4: K=3 eps=0.5 + min_hits=3 — TESTED, NULL RESULT | 307 | 145.4 | — | Stricter publication threshold. Did not reduce distinct count (the K=3 fragmented clusters are stable, not 1-frame ghosts). Strictly worse on lifetime than min_hits=1. Confirms the cluster fragmentation is real, not transient. |
| Phase-4: K=3 eps=0.7 + Mahal-v2 — combined knee | 272 | 194.8 | −72.2% / +1023% | Looser DBSCAN re-merges the K=3 accumulation fragmentation. 0 cascade revivals @ > 20 m drift preserved (vs K=1’s 11); the drive-wide > 10 m count holds at 132 (vs K=3 eps=0.5’s 130). Distinct count recovers from 307 → 272 (36% above the target window’s upper bound, vs K=3 eps=0.5’s 53% above). Lifetime drops from 238 → 195 because looser DBSCAN merges fragments → fewer cluster instances → the cascade has fewer revival anchors per physical object. The closest single point to the target window — but still NOT inside it. No DBSCAN-paradigm config simultaneously hits [150, 200] distinct + [250, 320] lifetime + 0 false-merges. The structural ceiling is fully mapped. |
| M13.5 max_age=300, ego anchor | 99 | 163.66 | −89.9% | ARTIFACT: ~50% of long-gap revivals were false-merges of different physical objects sharing ego-relative position after ego motion (see docs/m10-debug-log.md “False revivals” STORY). |
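A sketch of the Mahalanobis-v2 gating idea (per-mode covariance instead of the IMM combined covariance); the covariance matrices and positions here are illustrative, not the C++ implementation:

```python
import numpy as np

# Sketch of the Mahalanobis-v2 revival gate: instead of the IMM *combined*
# covariance (which balloons via the inter-mode spread term), gate on the
# more confident sub-model's 2x2 position covariance. chi-squared with 2 dof
# at 95% confidence is 5.99. Matrix values below are illustrative.

CHI2_95_2DOF = 5.99

def mahalanobis_gate(det_xy, track_xy, P_cv, P_ct, threshold=CHI2_95_2DOF):
    """Accept a cascade revival only if d^2 <= threshold under the more
    confident (smaller-determinant) sub-model's position covariance."""
    P = P_cv if np.linalg.det(P_cv) <= np.linalg.det(P_ct) else P_ct
    d = np.asarray(det_xy, float) - np.asarray(track_xy, float)
    d2 = float(d @ np.linalg.solve(P, d))
    return d2 <= threshold, d2

P_cv = np.diag([0.5, 0.5])   # confident constant-velocity sub-model
P_ct = np.diag([4.0, 4.0])   # diffuse coordinated-turn sub-model

ok, d2 = mahalanobis_gate([1.0, 0.5], [0.0, 0.0], P_cv, P_ct)
print(ok, round(d2, 2))  # True 2.5 -> gated on P_cv: (1.0^2 + 0.5^2) / 0.5
```

The χ²=2.28 variant (v3) is the same code with a tighter threshold, which is exactly why it has no sweet spot: the knob only rescales the ellipse, it cannot tell noisy-stationary apart from different-object.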

docs/m10-debug-log.md carries the full ablation tables and the “unsexy parameter mattered most” STORY entry.

Ablation summary — what actually moved the needle

Two independent variables manipulated:

(a) max_misses (track-memory length): 10 → 300 → 500. Going from M4’s max_misses=10 to max_misses=300 alone moved distinct-IDs from 808 to ~244 (IMM-only). This was the dominant lever.

(b) λ (appearance weight in cost matrix): 0.0 → 0.2 → 0.8. At max_misses=10: λ=0.1 gave −2% over IMM-only. At max_misses=300: λ=0.2 gave −3% over IMM-only (244 → 237). Encoder-only contribution is small but real. λ ≥ 0.5 regresses badly at any max_misses.
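The λ blend itself is a one-liner over the two cost matrices; a minimal sketch (matrix values illustrative, not tracker output):

```python
import numpy as np

# Sketch of the Deep SORT-style lambda blend in the association cost matrix:
# cost = (1 - lam) * position_cost + lam * appearance_cost.
# The ablation sweeps lam; the matrices below are toy values.

def blended_cost(pos_cost, app_cost, lam=0.2):
    pos_cost = np.asarray(pos_cost, float)
    app_cost = np.asarray(app_cost, float)
    return (1.0 - lam) * pos_cost + lam * app_cost

pos = np.array([[1.0, 8.0],
                [6.0, 2.0]])   # e.g. gated position distance, tracks x dets
app = np.array([[0.1, 0.9],
                [0.8, 0.2]])   # e.g. cosine distance between embeddings

print(blended_cost(pos, app, lam=0.2))
# [[0.82 6.58]
#  [4.96 1.64]]
```

At λ ≥ 0.5 the appearance term starts overriding well-separated position costs, which is consistent with the observed regression.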

Why max_misses=10 was the wrong default (rebirth-gap analysis)

Of M12’s 808 distinct IDs, 666 (82%) had a “parent track” within 2 m of their birth position. Distribution of the death-to-birth gap:

| Gap band | Count | % | Regime |
| --- | --- | --- | --- |
| 1–10 frames | 54 | 8.1% | catchable by max_misses=10 |
| 11–60 frames | 175 | 26.3% | catchable by max_misses=60 |
| 61–300 frames | 310 | 46.6% | catchable by max_misses=300 |
| 301+ frames | 127 | 19.1% | needs full re-detection / cascade |

Conclusion: M4’s max_misses=10 (= 1 second of patience at 10 Hz) was throwing away 73% of the recoverable signal. The encoder couldn’t help because the relevant tracks were already pruned before re-detection arrived.
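The rebirth-gap analysis above amounts to a parent search over death/birth events; a minimal sketch (record layout and the nearest-in-time tie-break are assumptions, not the actual analysis script):

```python
import math

# Sketch of the rebirth-gap analysis: for each new track birth, look for a
# recently dead "parent" track within 2 m of the birth position and record
# the death-to-birth gap in frames.

def rebirth_gaps(deaths, births, radius_m=2.0):
    """deaths/births: lists of (frame, x, y). Returns the frame gap for each
    birth that has a parent death within radius_m at an earlier frame."""
    gaps = []
    for bf, bx, by in births:
        candidates = [bf - df for df, dx, dy in deaths
                      if df < bf and math.hypot(bx - dx, by - dy) <= radius_m]
        if candidates:
            gaps.append(min(candidates))   # nearest-in-time parent
    return gaps

deaths = [(100, 0.0, 0.0), (100, 50.0, 50.0)]
births = [(130, 1.0, 0.5),    # 30-frame gap: catchable by max_misses=60
          (500, 0.2, 0.1),    # 400-frame gap: needs the cascade
          (140, 20.0, 20.0)]  # no parent within 2 m: genuinely new object
print(rebirth_gaps(deaths, births))  # [30, 400]
```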

Encoder ROI assessment

3 hours of hand-labeling + ~15 min HPC training + days of plumbing produced an encoder that contributes ~1–3% additional distinct-ID reduction beyond what max_misses alone delivers. The encoder is fully functional (4/4 unit tests, < 1e-5 PyTorch ↔ Eigen agreement, 99% val acc on hand labels) but in this regime the IMM’s position-uncertainty growth + 5m gating already discriminates most matches; the encoder’s tiebreaker contribution is small.
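The parity-test pattern behind the 1e-5 PyTorch ↔ Eigen agreement can be sketched without either framework; here both forward passes are numpy stand-ins with random weights, purely to show the shape of the check (the 8 → 64 → 32 → 32 architecture comes from the pipeline notes; everything else is an assumption):

```python
import numpy as np

# Sketch of the EncoderMatchesPyTorchReference pattern: push the same input
# through two independently written forward passes of the 8->64->32->32 MLP
# and require agreement within 1e-5. Weights are random stand-ins.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 8)) * 0.1, rng.normal(size=64) * 0.1
W2, b2 = rng.normal(size=(32, 64)) * 0.1, rng.normal(size=32) * 0.1
W3, b3 = rng.normal(size=(32, 32)) * 0.1, rng.normal(size=32) * 0.1

def forward_single(x):
    """Per-sample path: mirrors an Eigen-style C++ implementation."""
    h = np.maximum(0.0, W1 @ x + b1)
    h = np.maximum(0.0, W2 @ h + b2)
    e = W3 @ h + b3
    return e / np.linalg.norm(e)           # L2-normalized 32-dim embedding

def forward_batch(X):
    """Batched path: mirrors the PyTorch reference implementation."""
    H = np.maximum(0.0, X @ W1.T + b1)
    H = np.maximum(0.0, H @ W2.T + b2)
    E = H @ W3.T + b3
    return E / np.linalg.norm(E, axis=1, keepdims=True)

X = rng.normal(size=(5, 8))
ref = forward_batch(X)
reimpl = np.stack([forward_single(x) for x in X])
print(np.max(np.abs(ref - reimpl)) < 1e-5)  # True: the two paths agree
```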

Honest interview framing: built end-to-end Deep SORT-style pipeline to spec; ran a proper ablation isolating max_misses from λ; found that the upstream pipeline parameter dominated. The encoder’s small win is real but not headline-grade in this configuration.

Phase-3.5 cascade matching — DONE

Implemented Deep SORT-style cascade matching: tracks transition Live → Lost when misses > max_misses (instead of immediate erase), Lost tracks freeze their position and stay in a retired pool for max_age frames, two-stage match (Live first, then Lost on unmatched dets with relaxed position gating). Lost track revives to Live on a successful match (filter re-init at the new measurement, embedding preserved).
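A minimal Python sketch of the Live → Lost → revive lifecycle described above (state and field names are illustrative; the real implementation is the C++ tracker):

```python
# Sketch of the Phase-3.5 cascade lifecycle: a Live track that exceeds
# max_misses becomes Lost instead of being erased; a Lost track freezes its
# position for up to max_age frames and can revive on a relaxed-gate match.

LIVE, LOST, DEAD = "live", "lost", "dead"

class Track:
    def __init__(self, pos):
        self.pos, self.state = pos, LIVE
        self.misses, self.lost_age = 0, 0

    def mark_missed(self, max_misses, max_age):
        if self.state == LIVE:
            self.misses += 1
            if self.misses > max_misses:
                self.state = LOST          # freeze position, keep embedding
        elif self.state == LOST:
            self.lost_age += 1
            if self.lost_age > max_age:
                self.state = DEAD          # retire for good

    def revive(self, measurement):
        assert self.state == LOST
        self.pos, self.state = measurement, LIVE  # filter re-init at measurement
        self.misses = self.lost_age = 0

t = Track(pos=(0.0, 0.0))
for _ in range(12):                        # 12 consecutive misses, max_misses=10
    t.mark_missed(max_misses=10, max_age=300)
print(t.state)                             # lost
t.revive((0.5, 0.2))
print(t.state)                             # live
```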

Final RELLIS result: 99 distinct IDs, 163.66 mean lifetime.

| max_age | Distinct | Mean lifetime |
| --- | --- | --- |
| 0 (cascade off) | 794 | 20.57 |
| 30 | 202 | 80.47 |
| 100 | 129 | 126.03 |
| 300 | 99 | 163.66 |
| 500 | 97 | 168.07 |
| 1000 | 95 | 172.35 |

Diminishing returns past max_age=300 (~30 sec). Default in scripts/run_tracker_on_rellis.sh set to 300.

37/37 unit tests green (added two cascade tests: CascadeRevivesAfterLongOcclusion, CascadeRespectsMaxAgeBudget).

Phase-4 follow-ups (deferred, prioritized)

  • Fix B — ego-motion compensation for cascade lost_pos. SHIPPED 2026-04-27. Stored as Track.lost_pos_world; SLAM ego pose threaded through tracker via the --ego-poses CLI flag and a per-update Eigen::Isometry2f T_world_ego argument with default Identity (legacy / synthetic test compat). On Live→Lost the freeze position is mapped to world frame; on cascade match it’s projected back into current-ego before gating. Result on RELLIS: 127 distinct IDs, 373-frame mean lifetime at max_age=300 — inside the predicted 120–150 window from the false-revivals analysis. Regression test: CascadeRevivalSurvivesEgoMotion (and the paired CascadeRevivalWithoutEgoMotionAnchorsToDecoy to pin the bug in place). 2 hr 15 min including doc updates and full-RELLIS validation, vs the 2-3 hr estimate.

  • Upstream DBSCAN tuning. Even at the honest M13.5 ~202 distinct, ~24% are 1-frame ghost tracks. A stricter min_samples would directly reduce the upstream noise; a KD-tree neighbor query would cut DBSCAN runtime.

  • Multi-frame point cloud accumulation. Stabilizes detector inputs (less DBSCAN flicker → fewer Lost transitions to recover from). Needs SLAM odometry production-ready (which we have from M2 — same prerequisite as Fix B above).
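Fix B and the accumulation item share the same SLAM-pose plumbing: anchor once in world frame via T_world_ego, re-project into the current ego frame when needed. A minimal 2D-isometry sketch (poses and positions illustrative, not the C++ Eigen::Isometry2f code):

```python
import numpy as np

# Sketch of world-frame anchoring: SLAM gives T_world_ego per frame. A Lost
# track's freeze position is stored in world frame and re-projected into the
# *current* ego frame before gating (Fix B); obstacle clouds are composed in
# world frame before re-clustering (accumulation).

def make_T(theta, tx, ty):
    """3x3 homogeneous 2D isometry: rotation theta, translation (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0, 1.0]])

def to_world(T_world_ego, p_ego):
    return (T_world_ego @ np.append(p_ego, 1.0))[:2]

def to_ego(T_world_ego, p_world):
    return (np.linalg.inv(T_world_ego) @ np.append(p_world, 1.0))[:2]

T_death = make_T(0.0, 0.0, 0.0)         # ego pose when the track went Lost
T_now = make_T(np.pi / 2, 10.0, 0.0)    # ego has driven and turned since

lost_pos_world = to_world(T_death, np.array([5.0, 0.0]))  # anchor once, in world
print(np.round(to_ego(T_now, lost_pos_world), 3))  # ~[0, 5] in the current ego frame
```

Anchoring in the ego frame instead (Fix A) would compare against [5, 0] here, which is exactly the false-revival failure mode the “False revivals” STORY documents.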


In flight

| What | Owner | Where | Blocking |
| --- | --- | --- | --- |
| ✅ M12 IMM RELLIS validation (808 vs 979 baseline at the M10-blog config) | done | results_m4/ablation_g/ | nothing |
| ✅ M13 hand-labeling (500 pairs, 99% prior agreement) | done | python/appearance/labels.csv | nothing |
| ✅ M13 training pipeline + HPC submission + checkpoint | done | python/appearance/checkpoints/encoder.pt | nothing |
| ✅ M13 weight export to C++ + reference test data | done | include/appearance_model_weights.hpp + tests/data/*.csv | nothing |
| M13 cost-matrix integration in sort_tracker.cpp:79-85 | user | TBD | end-to-end appearance run + λ sweep |

Parallelizable RIGHT NOW (no dependencies)

These can run in any terminal, in any order, without coordination.

User-side

  • Hand-label ~50 cluster pairs. Tk UI ready. ~1–2 hr.
    python python/appearance/build_pairs.py \
        --clusters-dir /media/nishant/SeeGayt2/terra_perceive/m4_perframe/clusters_sweetspot \
        --num-pairs 100 --out python/appearance/pair_candidates.csv
    
    python python/appearance/label_pairs_cli.py \
        --pair-csv     python/appearance/pair_candidates.csv \
        --clusters-dir /media/nishant/SeeGayt2/terra_perceive/m4_perframe/clusters_sweetspot \
        --camera-dir   /media/nishant/SeeGayt2/terra_perceive/m4_perframe/extracted_frames_camera \
        --out          python/appearance/labels.csv
    

    Resumable, auto-skips already-labeled pairs. Goal: ≥ 50 non-skip labels. Triggers M13 training.

  • Read Bar-Shalom §11.6.1 + §11.6.6 (~30 pages). Was Tue 04-28’s reading; can do anytime. Feeds m12-imm.md blog.

  • Read Wojke 2017 §3 + Hermans 2017 §4. ~40 pages combined. Required reading before M13 day 2 (cost-matrix integration).

My-side (queued, can scaffold whenever you ping)

  • python/appearance/extract_features.py — per-cluster 8-dim feature extractor. Reads clusters_sweetspot/, writes features_NNNNNN.csv. ~80 lines of glue. Pure mine.
  • python/appearance/train.py — PyTorch + batch-hard triplet loss. ~150 lines. Pure mine.
  • slurm/train_appearance.slurm — HPC submission script. ~30 lines.
  • python/appearance/torch_to_eigen_check.py — round-trip dump: trained weights → appearance_model_weights.hpp (overwrites placeholder) + tests/data/appearance_reference.csv + tests/data/appearance_triplets.csv.
  • scripts/run_phase3_endtoend.sh — M14 wrapper.
  • scripts/render_4panel_comparison.py — extends animate_tracker_vs_dbscan.py.

Blocked / sequential

These have hard prerequisites — can’t start until upstream lands.

| Item | Blocked on | Unblocks when |
| --- | --- | --- |
| Track-refactor CV regression | RESOLVED — was a config attribution issue | — |
| IMM run at M10-blog baseline config | DONE — IMM 808 / 19.87 vs CV 979 / 17.35; M12 ships | — |
| M10 blog narrative correction (one-paragraph fix in m10-sort-tracker.md re: which config produced 979 — the “min_hits 3→1” line is backwards) | — | — |
| m12-imm.md blog draft | M10 narrative fix | numbers are stable; can start drafting now in parallel |
| HPC training submission | extract_features.py + train.py + labels.csv (≥ 50) | all three ready |
| Real appearance_reference.csv | training run completes | weights dumped on HPC |
| EncoderMatchesPyTorchReference test activation | reference CSV exists | torch_to_eigen_check.py runs |
| EmbeddingDistanceMonotoneOnAugmentation activation | triplets CSV exists | same |
| Cost-matrix integration in sort_tracker.cpp:79-85 | encoder validated against PyTorch reference | test green |
| RELLIS appearance run (--use-appearance) | cost-matrix integration done | tests still green |
| M14 4-panel re-render | M12 + M13 ship | metrics tables ready |

Recently shipped

| Date | Item |
| --- | --- |
| 2026-04-27 | M13 trained encoder lands — 99.0% val acc (epoch 18) on 499 hand-labeled held-out pairs. ~1 min wall-clock on L40S. C++ Eigen forward pass matches PyTorch < 1e-5. 35/35 tests green. |
| 2026-04-27 | M13 hand-labeling: 500 pairs, 99% agreement with prior heuristic, 1 skip. Re-reviewed 4 disagreements; kept as-is (defensible DBSCAN over-segmentation of long structures). |
| 2026-04-27 | M13 HPC submission: 5 corrections fixed in sequence (data symlink rsync, slurm partition l40s_public, REPO=$SCRATCH path, slim conda env on scratch, /tmp tmpfs → $TMPDIR=$SCRATCH/tmp). All logged in m10-debug-log.md. |
| 2026-04-27 | M13 pipeline written: extract_features.py (8 features/cluster, corpus stats), train.py (8→64→32→32 MLP + batch-hard triplet, tqdm progress), slurm/train_appearance.slurm, torch_to_eigen_check.py (weight export + reference CSVs). |
| 2026-04-27 | M12 SHIPS: IMM 808 / 19.87 vs CV 979 / 17.35 at M10-blog config (min_hits=3, p_n=0.5). −17.5% distinct IDs. Earlier “regression” alarm was a config attribution mistake. |
| 2026-04-27 | M12 day 5: Track refactor → unique_ptr<IFilter>, --filter cv|imm flag |
| 2026-04-27 | M13 day 1 scaffold: appearance_encoder + 4 test scaffolds + placeholder weights |
| 2026-04-27 | M13 labeling pipeline: build_pairs.py + label_pairs_cli.py |
| 2026-04-27 | docs/m10-debug-log.md M12 section appended (5 entries + open items) |
| 2026-04-27 | scripts/run_tracker_on_rellis.sh: --filter plumbed via TP_M4_FILTER env |
| 2026-04-26 | M12 day 1-4: IMM filter implementation, 7/7 tests, Occam’s-razor diagnosis |
| 2026-04-26 | HPC data/ symlink fix in sync_to_hpc.sh + docs/hpc-access.md |

Maintenance rule

Update this file at the end of every working session. Keep it under ~150 lines so it loads fast. Recently-shipped older than ~10 days drops off.

If a row in the “blocked” table unblocks: move it to “in flight” or “parallelizable” depending on who’s the doer.

If something breaks: log to docs/m10-debug-log.md (existing log), don’t duplicate here. This file is forward-looking; the log is post-hoc.