Phase 3 — Tracking Robustness on Real Data
Why this exists as its own phase (and not as M4 add-ons)
Phase 2’s tracker (P2-M4) was scoped to “implement SORT from scratch and prove algorithmic correctness via unit + integration tests.” That scope shipped. What Phase 2 did NOT scope: making the tracker robust enough on real RELLIS LiDAR data that a viewer can identify persistent objects across frames.
This wasn’t a blind oversight in M4 — plan_p2m4.md:222 listed IMM, Deep SORT, and KD-tree as “what I’d improve” footer items. But running the M4 tracker end-to-end on RELLIS revealed that those footer items are actually load-bearing for a portfolio-grade demonstration:
M4-on-RELLIS quantitative finding: 979 distinct track IDs assigned over 285 seconds of driving. Mean track lifetime: 17.4 frames. 43% of tracks lived ≤5 frames. DBSCAN cluster count varied from 3 to 26 per frame on stationary scenes.
The algorithm itself is correct (20/20 tests pass). The instability comes from upstream clustering and from the velocity-discontinuity edge case. Phase 3 closes that gap with techniques the literature has already canonicalised (IMM, multi-frame accumulation, learned association). With Phase 3 in place, the same tracker on the same data should produce on the order of 50–80 distinct IDs — close to the count of physical objects in the scene.
Phase 3 is also where the blog post’s “production-grade tracking” claim becomes defensible. Phase 2 demonstrates I can build SORT; Phase 3 demonstrates I can make it work on real data.
Bridge from Phase 2
What Phase 3 inherits:
- `tracker::SORTTracker` class with greedy + Munkres dispatchers (P2-M4).
- `tracker::dbscan` clusterer with brute-force O(N²) neighbor search (P2-M4).
- `obstacle_extractor` + `dbscan_cli` CLI binaries (P2-M4 / Ablation G).
- Per-frame `clusters_NNNNNN.csv` and `tracks.csv` artifacts on `/media/.../m4_perframe/` (Ablation G output, ~13 GB).
- `animate_tracker_vs_dbscan.py` visualisation harness with the side-by-side flicker→stable layout (P2-M4 closing hero).
- The 979-track baseline for measuring improvement.
What Phase 3 does NOT depend on:
- Phase 2’s M5 (YOLO + 3D-lift + safety + NATS) — Phase 3 can run in parallel with M5 because it operates on the LiDAR pipeline, not the camera pipeline.
- nuScenes (M7) — Phase 3 measures against RELLIS to keep the comparison fair to the M4 baseline. nuScenes evaluation lives in M8.
Phase 3 milestones
P3-M12 — IMM Kalman filter (Interacting Multiple Model)
Goal: Replace the single constant-velocity Kalman filter inside `Track` with an IMM filter that runs constant-velocity (CV) and constant-position (CP) models in parallel and switches between them per frame. Solves the M4 stationary-segment flicker directly.
Sub-tasks:
- `include/imm_filter.hpp` + `src/imm_filter.cpp`: two-mode IMM with a Markov mode-transition matrix, per-mode predict/update, mode-probability update via Bayes’ rule, mixed-output state estimate (a minimal sketch follows this list).
- `tests/cpp/test_imm.cpp`: at least 5 cases — `ConstantVelocityTrackedByCV`, `StationaryTrackedByCP`, `ModeSwitchOnDeceleration`, `MixingProbabilitiesSumToOne`, `BothModelsAgreeOnSteadyState`.
- Modify the `Track` struct: replace `KalmanFilter2D kf` with `IMMFilter kf` (or feature-flag both for ablation).
- Re-run M4’s RELLIS closing-hero pipeline with IMM. Measure the new distinct-ID count; expect ≤200 (vs the M4 baseline of 979).
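To make the moving parts concrete, here is a minimal, hedged sketch of one IMM cycle on a 1-D position measurement: mixing, per-mode Kalman predict/update, and the Bayes mode-probability update. It assumes Eigen 3 is available; the class name, matrix choices, and noise values are illustrative, not the project’s actual API.

```cpp
// Minimal two-mode IMM sketch (CV + CP) on a 1-D position measurement.
// Assumes Eigen 3; all names and noise values here are illustrative.
#include <Eigen/Dense>
#include <array>
#include <cmath>

class IMMFilter {
public:
    explicit IMMFilter(double dt, double x0 = 0.0) {
        F_[0] << 1, dt, 0, 1;   // mode 0: constant velocity
        F_[1] << 1, 0,  0, 0;   // mode 1: constant position (velocity zeroed)
        Q_[0] = Eigen::Matrix2d::Identity() * 1e-2;
        Q_[1] = Eigen::Matrix2d::Identity() * 1e-4;
        H_ << 1, 0;             // only position is measured
        T_ << 0.95, 0.05,       // Markov mode-transition matrix
              0.05, 0.95;
        mu_ << 0.5, 0.5;        // initial mode probabilities
        for (int j = 0; j < 2; ++j) {
            x_[j] << x0, 0.0;
            P_[j] = Eigen::Matrix2d::Identity();
        }
    }

    // One IMM cycle: mix -> per-mode KF predict/update -> mode-prob update.
    void step(double z) {
        // 1. Mixing: blend mode-conditioned estimates with Markov priors.
        const Eigen::Vector2d cbar = T_.transpose() * mu_;
        std::array<Eigen::Vector2d, 2> x0;
        std::array<Eigen::Matrix2d, 2> P0;
        for (int j = 0; j < 2; ++j) {
            x0[j].setZero();
            for (int i = 0; i < 2; ++i)
                x0[j] += T_(i, j) * mu_(i) / cbar(j) * x_[i];
            P0[j].setZero();
            for (int i = 0; i < 2; ++i) {
                const Eigen::Vector2d d = x_[i] - x0[j];
                P0[j] += T_(i, j) * mu_(i) / cbar(j) *
                         (P_[i] + d * d.transpose());
            }
        }
        // 2. Per-mode Kalman predict + update, collecting likelihoods.
        Eigen::Vector2d lik;
        for (int j = 0; j < 2; ++j) {
            const Eigen::Vector2d xp = F_[j] * x0[j];
            const Eigen::Matrix2d Pp =
                F_[j] * P0[j] * F_[j].transpose() + Q_[j];
            const double y = z - H_.dot(xp);        // innovation
            const double S = H_.dot(Pp * H_) + R_;  // innovation variance
            const Eigen::Vector2d K = Pp * H_ / S;  // Kalman gain
            x_[j] = xp + K * y;
            P_[j] = (Eigen::Matrix2d::Identity() - K * H_.transpose()) * Pp;
            lik(j) = std::exp(-0.5 * y * y / S) / std::sqrt(kTwoPi * S);
        }
        // 3. Mode-probability update via Bayes' rule.
        mu_ = (lik.array() * cbar.array()).matrix();
        mu_ /= mu_.sum();
    }

    // Mixed output: mode-probability-weighted state estimate.
    Eigen::Vector2d state() const { return mu_(0) * x_[0] + mu_(1) * x_[1]; }
    Eigen::Vector2d modeProbabilities() const { return mu_; }

private:
    static constexpr double kTwoPi = 6.283185307179586;
    std::array<Eigen::Matrix2d, 2> F_, Q_, P_;
    std::array<Eigen::Vector2d, 2> x_;
    Eigen::Vector2d H_;
    double R_ = 0.05;  // measurement noise variance (illustrative)
    Eigen::Matrix2d T_;
    Eigen::Vector2d mu_;
};
```

On stationary segments the CP mode’s tighter process noise should win the mode probability, which is exactly the behaviour `StationaryTrackedByCP` would assert.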
Exit criterion: RELLIS distinct-track-IDs reduced by ≥50% vs M4 baseline. Stationary-segment flicker (frames 1750–1830) shows the same SORT track ID throughout for at least 5 visible trees.
Reading:
- Bar-Shalom, Li, Kirubarajan, Estimation with Applications to Tracking and Navigation (2001), Ch. 11 — IMM derivation.
- Blackman, Popoli, Design and Analysis of Modern Tracking Systems (1999), Ch. 8 — multi-model approaches.
Wall-clock: 4–6 days.
P3-M13 — Multi-frame point cloud accumulation with ego-motion compensation
Goal: Smooth DBSCAN cluster centroids by accumulating LiDAR over a short sliding window, registered into a common frame using SLAM odometry from P2-M2. Reduces cluster fragmentation from sensor angular aliasing.
Sub-tasks:
- New `include/cloud_accumulator.hpp` + `src/cloud_accumulator.cpp`: ring buffer of the N most-recent clouds, SE(3) transforms via the same pose source used in the M3 accumulator (SLAM by default); see the sketch after this list.
- Pre-DBSCAN stage in `obstacle_extractor`: instead of clustering one frame’s obstacle points, cluster the union of N=3 frames’ points reprojected into the current frame.
- `tests/cpp/test_cloud_accumulator.cpp`: ring-buffer correctness, transform consistency, cluster-centroid stability test.
- Re-run M4’s RELLIS pipeline with N ∈ {1, 3, 5} and measure:
  - DBSCAN per-frame cluster-count variance (M4 baseline: 4.55 stddev).
  - Mean cluster centroid drift between adjacent frames (currently 10–30 cm).
  - Distinct track IDs over the recording.
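A minimal sketch of the ring-buffer-plus-reprojection idea, assuming Eigen for SE(3) poses; `CloudAccumulator`, `Cloud`, and the pose convention are illustrative names, not the project’s real interface.

```cpp
// Ring-buffer accumulator sketch, assuming Eigen for SE(3) poses.
// CloudAccumulator / Cloud and the pose convention are illustrative.
#include <Eigen/Geometry>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Cloud = std::vector<Eigen::Vector3d>;

class CloudAccumulator {
public:
    explicit CloudAccumulator(std::size_t window) : window_(window) {}

    // Push a frame's cloud with its world-from-sensor pose (e.g. SLAM odometry).
    void push(Cloud cloud, const Eigen::Isometry3d& world_from_sensor) {
        frames_.push_back({std::move(cloud), world_from_sensor});
        if (frames_.size() > window_) frames_.pop_front();  // drop the oldest
    }

    // Union of the buffered clouds, reprojected into the newest sensor frame.
    Cloud accumulated() const {
        Cloud out;
        if (frames_.empty()) return out;
        const Eigen::Isometry3d current_from_world =
            frames_.back().world_from_sensor.inverse();
        for (const auto& f : frames_) {
            // current_from_world * world_from_sensor maps an older frame's
            // points into the current frame's coordinates.
            const Eigen::Isometry3d T =
                current_from_world * f.world_from_sensor;
            for (const auto& p : f.cloud) out.push_back(T * p);
        }
        return out;
    }

private:
    struct Frame { Cloud cloud; Eigen::Isometry3d world_from_sensor; };
    std::size_t window_;
    std::deque<Frame> frames_;
};
```

The accumulated union then feeds DBSCAN unchanged, which is what lets the N ∈ {1, 3, 5} ablation vary only one knob.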
Exit criterion: DBSCAN per-frame cluster-count standard deviation reduced by ≥40%. Mean cluster centroid drift reduced by ≥50%.
Reading:
- Yin, Du, “Dynamic Point-Cloud Accumulation for Robust Object Detection” (IROS 2020).
- Engelmann, Hess, “3D Object Tracking with LiDAR Point-Cloud Accumulation” (2021).
Wall-clock: 3–5 days.
P3-M14 — KD-tree neighbor search for DBSCAN
Goal: Drop `region_query` from O(N) per call to O(log N) so DBSCAN scales to ≥50k-point clouds (currently choking at 10k in Debug mode).
Sub-tasks:
- Pick an implementation: nanoflann (single-header, BSD license) is the standard for LiDAR-domain projects. Vendor it in `third_party/nanoflann`.
- Rewrite `region_query` to query a `KDTreeEigenAdapter`. Build the tree ONCE per call to `dbscan()` (not per query); see the sketch after this list.
- Re-run the `LargeClusterPerformance` case in `test_dbscan` at N = 50,000. Tighten the wall-clock assertion accordingly.
- Add a benchmark: brute-force vs KD-tree at N ∈ {1k, 10k, 50k, 100k} with a chart for the blog.
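A hedged sketch of the nanoflann-backed `region_query`, using the classic dataset-adaptor interface (`kdtree_get_point_count` / `kdtree_get_pt`). `PointCloudAdaptor` and `KDTree` are illustrative names, and the `SearchParams` spelling follows nanoflann v1.x examples (newer releases renamed it `SearchParameters`). Note the gotcha in the comment: the L2 adaptor expects a squared radius.

```cpp
// nanoflann-backed region_query sketch. PointCloudAdaptor / KDTree are
// illustrative names; API spellings follow nanoflann v1.x examples.
#include <nanoflann.hpp>
#include <array>
#include <cstddef>
#include <utility>
#include <vector>

struct PointCloudAdaptor {
    const std::vector<std::array<double, 3>>& pts;
    // nanoflann dataset interface:
    std::size_t kdtree_get_point_count() const { return pts.size(); }
    double kdtree_get_pt(std::size_t i, std::size_t dim) const {
        return pts[i][dim];
    }
    template <class BBOX> bool kdtree_get_bbox(BBOX&) const { return false; }
};

using KDTree = nanoflann::KDTreeSingleIndexAdaptor<
    nanoflann::L2_Simple_Adaptor<double, PointCloudAdaptor>,
    PointCloudAdaptor, 3>;

// Build the tree ONCE per dbscan() call, then reuse it for every query:
//   PointCloudAdaptor adaptor{points};
//   KDTree tree(3, adaptor, nanoflann::KDTreeSingleIndexAdaptorParams(10));
//   tree.buildIndex();
std::vector<std::size_t> region_query(const KDTree& tree,
                                      const std::array<double, 3>& q,
                                      double eps) {
    // Gotcha: nanoflann's L2 adaptor expects the SQUARED radius.
    std::vector<std::pair<std::size_t, double>> matches;
    tree.radiusSearch(q.data(), eps * eps, matches,
                      nanoflann::SearchParams());
    std::vector<std::size_t> out;
    out.reserve(matches.size());
    for (const auto& m : matches) out.push_back(m.first);
    return out;
}
```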
Exit criterion: DBSCAN on a 50k-point cloud completes in <1 s in Release. The brute-force baseline takes >5 s at the same N.
Reading:
- Friedman, Bentley, Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time” (TOMS 1977) — original KD-tree paper.
- nanoflann documentation.
Wall-clock: 2 days.
P3-M15 — Cluster appearance features in matching cost
Goal: Augment SORT’s Euclidean-distance cost matrix with cluster-size, point-density, and bounding-box features. A 200-point tree shouldn’t match a 15-point fragment that happens to be 0.4 m away.
Sub-tasks:
- Extend the `Track` struct with running estimates of cluster size + bbox dimensions from past matches.
- Modify the `SORTTracker::match()` cost matrix: `cost(i, j) = α * euclidean(i, j) + β * size_diff(i, j) + γ * density_diff(i, j)`, with α, β, γ tuned on the RELLIS dataset; see the sketch after this list.
- New ablation: sweep (α, β, γ) and measure ID stability + false-merge rate.
- Tests: feature-normalization correctness, feature-cost-matrix shape, fallback when a track has no size estimate yet.
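One plausible shape for the blended cost, sketched under the assumption that tracks and clusters expose position, size, and density scalars; the `Features` struct and the normalisation choice are hypothetical, not the tracker’s real fields.

```cpp
// Blended association cost sketch. The Features struct is hypothetical;
// the real running estimates live on the Track / cluster types.
#include <algorithm>
#include <cmath>

struct Features { double x, y, size, density; };

// cost(i, j) = alpha * euclidean + beta * size_diff + gamma * density_diff.
// Feature diffs are normalised to [0, 1] so beta and gamma are
// scale-comparable with the metric position term (one plausible choice).
double blended_cost(const Features& track, const Features& cluster,
                    double alpha, double beta, double gamma) {
    const double pos = std::hypot(track.x - cluster.x, track.y - cluster.y);
    const double size_diff =
        std::abs(track.size - cluster.size) /
        std::max({track.size, cluster.size, 1e-9});
    const double density_diff =
        std::abs(track.density - cluster.density) /
        std::max({track.density, cluster.density, 1e-9});
    return alpha * pos + beta * size_diff + gamma * density_diff;
}
```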
Exit criterion: ID-stability metric (MOTA-like) improves by ≥10% vs P3-M12 baseline (IMM-only, no appearance).
Wall-clock: 3–4 days.
P3-M16 — Deep SORT-style learned appearance embeddings
Goal: Train a small neural network that embeds each cluster’s point distribution into a fixed-length appearance vector. Use cosine similarity between embeddings as an additional term in the matching cost. This is the production-grade technique used in AV stacks.
Sub-tasks:
- Encoder: PointNet-style architecture (or DGCNN), embedding dim 64. Train on a contrastive loss over RELLIS cluster pairs (same physical object across frames = positive; different objects = negative).
- Generate the contrastive training set from M4’s RELLIS tracker output: tracks with lifetime ≥30 frames are reliable positives.
- Cost matrix: weighted sum of (Euclidean position cost) + (cosine embedding cost), with the weight tuned by ablation; a cosine-cost sketch follows this list.
- Compare to non-learned appearance (P3-M15) on the same RELLIS recording.
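A minimal sketch of the cosine term on the tracker side, assuming the encoder delivers a 64-d embedding per cluster; `Embedding`, `cosine_cost`, and the weight handling are illustrative names.

```cpp
// Cosine-embedding cost sketch; names and blending are illustrative.
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kEmbedDim = 64;  // embedding dim per this milestone
using Embedding = std::array<float, kEmbedDim>;

// Cost in [0, 2]: 0 when embeddings align, 2 when they oppose.
float cosine_cost(const Embedding& a, const Embedding& b) {
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (std::size_t i = 0; i < kEmbedDim; ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return 1.f - dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
}

// Association cost: Euclidean position term plus the weighted cosine term;
// w is the embedding weight found by the ablation sweep.
float match_cost(float euclidean, const Embedding& a, const Embedding& b,
                 float w) {
    return euclidean + w * cosine_cost(a, b);
}
```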
Exit criterion: Distinct-track-IDs reduced by an additional ≥30% vs P3-M15 baseline. Re-identification across occlusion ≥10 frames working on at least one demonstrable case in the recording.
Reading:
- Wojke, Bewley, Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric” (ICIP 2017) — Deep SORT paper.
- Qi et al., “PointNet” (CVPR 2017) — point cloud encoder.
- Wang et al., “Dynamic Graph CNN for Learning on Point Clouds” (TOG 2019) — DGCNN; better for irregular-density LiDAR clusters.
Wall-clock: 8–14 days. The training data curation is the slow part, not the model code.
P3-M17 — End-to-end re-validation + numbers for the blog
Goal: Final re-run of the M4 closing-hero pipeline with all Phase 3 improvements stacked. Produce the before/after comparison figures that ship in the M10 blog post AND a new Phase-3 blog post.
Sub-tasks:
- Re-render `sort_vs_dbscan.mp4` with the full Phase 3 stack. Same visual layout, but the right panel should show stable IDs throughout.
- A 4-panel comparison figure: M4 baseline, + IMM, + accumulation, + appearance.
- Quantitative table for the blog:
| Pipeline stage | Distinct track IDs | Mean lifetime (frames) | Stationary-segment flicker |
|---|---|---|---|
| M4 baseline | 979 | 17.4 | severe |
| + IMM | (target ≤200) | (target ≥80) | minimal |
| + accumulation | (target ≤120) | (target ≥150) | none |
| + learned appearance | (target ≤80) | (target ≥250) | none |
- Save the original M4 closing-hero animation as `results_m4/ablation_g/sort_vs_dbscan_baseline.mp4` for the comparison blog figures. (Don’t overwrite — the baseline IS data.)
Exit criterion: A blog-ready figure demonstrating each Phase 3 improvement with numerical evidence. The story “I diagnosed the failure mode on real data and fixed it stage by stage” is the headline.
Wall-clock: 2 days.
Phase 3 exit criteria
- Distinct-track-IDs on RELLIS reduced from 979 (M4 baseline) to ≤80 (P3-M16 target).
- Mean track lifetime increased from 17.4 frames to ≥250 frames.
- Stationary segment 1750–1830 shows persistent IDs on visible trees throughout, no flicker.
- KD-tree DBSCAN handles 50k-point clouds in <1 s in Release.
- One Phase-3 blog post (`docs/phase3-tracking-robustness.md`) drafted with the 4-panel before/after figure.
Why this is a portfolio-defining sequence
Phase 2 demonstrates “I can implement classical algorithms from scratch and prove they work via tests.” That’s necessary but commoditised. LeetCode covers it.
Phase 3 demonstrates “I diagnosed a real-data failure mode that the unit tests didn’t catch, escalated through three production-grade fixes, and measured the improvement at each stage.” That’s the engineering signal that distinguishes a portfolio piece from a homework submission.
The before/after figure from P3-M17 — flickering 979-track baseline next to a stable 80-track Phase-3 result — is the single most defensible demonstration in the whole project for an AV / robotics interview.
References (consolidated)
- Bar-Shalom, Li, Kirubarajan, Estimation with Applications to Tracking and Navigation (Wiley 2001), Ch. 11.
- Blackman, Popoli, Design and Analysis of Modern Tracking Systems (Artech House 1999), Ch. 8.
- Wojke, Bewley, Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric” (ICIP 2017).
- Yin, Du, “Dynamic Point-Cloud Accumulation for Robust Object Detection” (IROS 2020).
- Friedman, Bentley, Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time” (TOMS 1977).
- Qi et al., “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation” (CVPR 2017).
- Wang et al., “Dynamic Graph CNN for Learning on Point Clouds” (ACM TOG 2019).
- nanoflann library — github.com/jlblancoc/nanoflann.
Wall-clock summary
| Milestone | Days |
|---|---|
| P3-M12 IMM Kalman | 4–6 |
| P3-M13 Multi-frame accumulation | 3–5 |
| P3-M14 KD-tree DBSCAN | 2 |
| P3-M15 Appearance features (geometric) | 3–4 |
| P3-M16 Deep SORT (learned embeddings) | 8–14 |
| P3-M17 Re-validation + figures | 2 |
| Total | 22–33 days |
This is roughly 4–6 weeks of work, runnable in parallel with Phase 2 M5+ on different worktrees.