Closing the perception loop: NATS, a pipe-fed C++ tracker, and the question I left for nuScenes

This milestone closes the perception loop. The earlier milestones each produced one algorithm in isolation: a ground segmenter, a traversability grid, a cam-LiDAR projector, a kinematic safety rule set, a multi-object tracker. None of them spoke to each other on a real bus. This post is the integration. A pretrained 2D detector pulls camera frames; a small Python sidecar bridges those detections through a long-running C++ tracker over a Unix pipe; the tracker’s outputs feed a kinematic safety supervisor; every safety event lands on a durable JetStream subject that survives a broker restart. (A sidecar is the deployment pattern of running a small auxiliary process next to a main process to handle one cross-cutting concern. Here that concern is the network: the C++ tracker has no NATS client of its own, so a Python process subscribes on its behalf, pipes payloads in on stdin, reads results back off stdout, and republishes them on NATS.) The deliverable is the joined pipeline and the numbers it produces on a known segment of RELLIS-3D, plus an explicit list of the questions this dataset cannot answer that the next milestone (nuScenes) is wired to answer.
The integration choices are shaped by what the previous milestone (m10-sort-tracker) taught me, and by the place where I broke from it. M10 reached its conclusion by trial. I built a tracker variant, ran it, read the metric, found the audit was wrong, ran it again. The arc was test-then-reason: the data taught me the geometry of the problem one iteration at a time, and the post is an honest reconstruction of what I learned in which order. The cost of that loop on a single algorithm was already real, and a five-component integration would multiply it. So the methodology this time is inverted. Before any code was written I named the data-fitness questions the integration was about to assume answers to, designed the smallest measurement that could answer each, ran those measurements, and only then committed to the wiring. Reason-then-test. The change is in the working order, not in the algorithms.
The way the previous milestone’s lesson lands here is concrete, not stylistic. M10 ended with the explicit hand-off that the residual identity flicker was the detector’s job, not the tracker’s. So this milestone keeps the M10 tracker as a black-box dependency on the boundary, refuses to add tracker layers, and instead asks whether a pretrained detector is good enough to drive it on this data at all. The first pre-flight measurement below is exactly that question, run before any tracker code was touched.
The clip at the top is the headline of what came out the other end. A pretrained 2D detector trained on COCO, with no fine-tuning, holds a single worker’s bounding box across all 158 frames of the demo segment, missing only two of the 79 frames that carry ground-truth annotations. The right panel paints the same worker red using the dataset’s own pixel-wise person class. The precision, recall, F1, and AP@0.5 underneath are computed against those exact frames and are reproduced in the detector-evaluation section below. The rest of the post walks the integration in order: the pieces I came in with, the pre-flight measurements, the architectural decision to put the C++ tracker on a Unix pipe instead of a ROS2 node, the in-loop tracker behavior on the demo segment, the empty safety-event log and why empty is the right answer, the detector evaluation against ground truth, the audit-trail replay, and the clean list of what gets validated on nuScenes next rather than here.
The pieces this milestone actually wires together
The integration is not a survey of the project. Five earlier pieces are imported as-is, and the work is in how they talk to each other. Each one is named here only because the integration depends on the specific behavior it ships with. The M10 tracker is the central dependency. The previous milestone closed with the explicit conclusion that the residual identity flicker is on the detector, not the tracker, so the present milestone keeps the M10 tracker on the boundary of the integration without modification and lets the detector question be answered by data. The M4 cam-LiDAR projector is the geometry the present milestone uses to lift a 2D bounding box into a 3D position in the LiDAR frame. The calibration T_cam_lidar and the pinhole intrinsics computed in M4 are loaded directly. The M5 kinematic safety rules (the Phase-1 milestone confusingly also called “M5”) are the rules the supervisor in this pipeline consumes once tracks exist in 3D. Stopping distance and time-to-collision are read from there without re-derivation. Ego speed in the demo window is read from the M7/M8 pose pipeline: the SLAM trajectory CSV is what makes the dynamics measurements later in this post reproducible. NATS as the data plane was bootstrapped in M3, and that bootstrap’s deferral pattern (the C++ binary does not speak NATS itself; a small Python sidecar owns the NATS sockets on its behalf and bridges them to the binary’s stdin/stdout) is the architectural choice the present milestone inherits rather than re-litigates. The Unix-pipe-vs-ROS2 section below is where that pattern’s consequences land.
Pre-flight: measure the data before wiring the loop
The previous milestone’s failure mode was iteration. The dataset surprised me at every step, and the only honest response was to keep measuring and keep adjusting. Reasonable response to a research problem; expensive response to a system-integration problem. For this milestone I wanted to invert that: before any code was written, I wanted enough numerical evidence to shape the integration’s choices. The questions worth asking ahead of any wiring were essentially three. Each one had a stated hypothesis, a stated end-result-of-interest, a method that could answer it cheaply, and a decision that depended on the answer.
Will pretrained YOLO fire often enough on this data to drive a tracker?
The question I needed to settle before touching any integration code was whether a pretrained off-the-shelf 2D detector, YOLOv8n trained on COCO, was a credible scaffold for the tracker on RELLIS imagery at all. RELLIS is an off-road forest sequence; COCO is mostly urban and indoor; a tracker with no detections to consume is a non-starter, and the safety supervisor downstream of the tracker has nothing to reason about either. The decision tree behind the question was concrete. If pretrained YOLO fires on most frames where a worker is present, the rest of the milestone is a wiring exercise: lift bounding boxes into the LiDAR frame through the M4 projector, feed them to the M10 tracker, hand tracks to the safety supervisor, and the integration’s behavior end-to-end can be measured on the dataset I already have. If it fires only occasionally, the wiring is still the right milestone but every later number gets compressed into whatever short window of activity exists, and I would need a pivot plan: either fine-tune YOLO on RELLIS person crops before going further (a multi-day detour to label, train, and validate, with its own failure modes around small-sample overfit), or substitute the pretrained backbone with a different family that has been trained on outdoor data closer to RELLIS (for example a SegFormer person mask threshold to bounding boxes, reusing the M4 inference path), or accept a degraded detector and validate the integration on a synthetic injected-detection stream to keep the safety and audit work meaningful while the detector question moves to a later milestone. Knowing which of those branches I was on before writing any wiring was what this measurement was for.
The cheapest measurement that could answer it was a sparse scan: two hundred frames evenly sampled from the 2,847-frame RELLIS-3D sequence 00, run through YOLOv8n at confidence 0.25 (loose, well below the operating point), and a count of how many of those frames had any detection of class person, bicycle, car, motorcycle, or truck clearing confidence 0.5 (the operating threshold the rest of the pipeline uses). That run finished in a few minutes on the laptop and surfaced something more interesting than a single number. Of the 200 sampled frames, 23 cleared the threshold (about 11.5%), but the hits were not uniformly distributed. Frames near the start of the sequence saw a person fire about 72% of the time; frames late in the sequence fired under 5% of the time, and the hits sat in a contiguous early portion of the sequence. The pattern is not a detector failure: the worker is present during the early portion of the recording and is no longer in the scene afterward, so the dropout in firing rate is the dataset itself rather than COCO-vs-RELLIS domain mismatch on a present human. That is a useful distinction. It says the pretrained detector is not silently breaking on RELLIS-style imagery, it is honestly reporting that there is no person to detect, and that anchors the next question (locating the contiguous window where a worker is actually present) as the right next step rather than a workaround.
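The scan itself is a few lines against the ultralytics API. A minimal sketch, with the frame paths and sampling details illustrative rather than the exact script:

```python
# Sparse pre-flight scan: 200 evenly sampled frames, loose detector confidence,
# count frames where any target class clears the 0.5 operating threshold.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained COCO checkpoint, no fine-tuning
TARGET_CLASSES = {0, 1, 2, 3, 7}    # COCO ids: person, bicycle, car, motorcycle, truck

frame_ids = np.linspace(0, 2846, 200, dtype=int)
hits = 0
for fid in frame_ids:
    result = model(f"frames/frame_{fid:06d}.jpg", conf=0.25, verbose=False)[0]
    if any(int(c) in TARGET_CLASSES and float(conf) >= 0.5
           for c, conf in zip(result.boxes.cls, result.boxes.conf)):
        hits += 1

print(f"{hits}/200 sampled frames clear the operating threshold")
```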
Where does the contiguous person-firing window actually start and end?
Next I wanted to know the exact frame range that contains a continuous run of person detections. The tracker needs frame-to-frame coherence to do its job; a sparse window with intermittent firings would not exercise the tracker meaningfully. Of course one could eyeball this in a video player, but the pipeline needs frame numbers. The method was a dense YOLO scan over frames 0..500 (the first quarter of the sequence, where the sparse-sampling step had localized the hits) at the same operating threshold, logging the count of class-0 (person) detections at confidence 0.5 or higher for every frame. The result was a 158-frame stretch, frames 83 through 240 inclusive, with at least one person detection on every frame; it is the longest such run anywhere in the broader sequence. Cartographer’s pose trajectory across the same range gave the ego speed: mean 0.60 meters per second, peak 1.10. The robot is moving slowly while the worker walks ahead. The slow ego speed is a notable property worth flagging because the kinematic safety supervisor’s stopping distance scales with the square of velocity, so on a window where ego is moving at 0.6 m/s the stopping distance will be small (more on this below). Every later run uses [83, 241) as the frame range, and the Cartographer pose CSV across that range is the source of v_ego for the safety pass.
Does the broker tolerate the message sizes the pipeline will publish?
The question here was whether the existing NATS broker config from M3 (which raised the default 1 MB max payload to 16 MB to fit the BEV grid) would accommodate the message sizes about to be added, or whether further config tuning was needed. Message-size limits are a silent failure mode: a publish that exceeds the broker’s max payload raises an exception on the publisher and the subscriber sees nothing, so I wanted the answer ahead of any wiring. The measurement was a size audit rather than a benchmark. The protobuf message types DetectionList and TrackList are 50–500 bytes for typical scenes, and SafetyEvent is around 200 bytes. The largest published message in the project remains the BEVGrid from M3 at roughly 4 MB. The broker’s existing 16 MB max payload covers the new traffic by orders of magnitude. No tuning needed.
Why the tracker is a Unix pipe and not a ROS2 node
Block diagram of the joined pipeline. Upper grey lane is the NATS Core data plane; lower amber lane is the JetStream audit plane. The C++ tracker is the only non-Python process and it talks to NATS only through the bridge’s stdin and stdout pipes.
The C++ tracker that ships into this milestone (the one from the previous milestone) is a class with a public update_with_features method, an interacting-multiple-model Kalman filter underneath, a Mahalanobis cascade gate, and an ego-pose-aware world-frame anchor on the Lost state. It runs cleanly on this project’s data and requires little additional work. The natural way to wrap a class like that into the project’s ROS2 stack would be a rclcpp node that subscribes to a detection topic, calls the class, and publishes a track topic. The project already has six other ROS2 node stubs in ros2_nodes/ (ground segmentation, traversability, projection, safety, the costmap layer, and the tracker), and I plan to bring all six of them up together in a later ROS2-live-pipeline milestone where the entire stack runs as a node graph. Wrapping just the tracker now would split that work across two milestones and leave the ROS2 milestone with five-and-a-half nodes to wire instead of six. The right place for rclcpp is the milestone where the whole node graph goes up at once.
The pattern this milestone follows instead came from M3. M3’s accumulator runner (src/accumulator_runner.cpp) is a C++ binary that produces protobuf-serialized BEV grids and publishes them to NATS. The C++ side of that binary did not implement NATS publishing directly. M3 made the explicit decision to defer C++ NATS clients (the cnats library is a real option but adds a system dependency) in favor of a Python sidecar that subscribes from the C++ binary’s output and publishes onto NATS subjects. The M3 post called this “shells out to a Python worker” and pinned the architectural choice for any later C++ binary that needed to talk to NATS. The present milestone applies the same pattern. The C++ binary that runs the tracker ros2_nodes/tracker_node.cpp has one stdin and one stdout. The wire format is a 4-byte big-endian unsigned length, then that many bytes of a protobuf message (DetectionList on stdin, TrackList on stdout). The Python sidecar (transport/tracker_bridge.py) connects that pipe to the two NATS subjects.
The framing helpers live in a single Python module (transport/frame_io.py) and a parallel C++ block inside the binary. Both use the same wire format described in one place. Centralizing the framing matters because the corruption mode of mismatched framing is silent: a misaligned reader successfully decodes garbage into a malformed protobuf, which the parser may then accept or reject in ways that look plausible until the downstream tracker output starts producing nonsense field values. Keeping the framing in one place makes any future fix land in one place. The protobuf bindings (perception.pb.h for C++, perception_pb2.py for Python) are auto-generated by protoc from a single .proto file. The build script runs protoc once with --cpp_out and once with --python_out. The CMake target for tracker_node links against a perception_proto static library that wraps the generated .pb.cc files. The net result is a tracker that runs as a long-running C++ process whose only contract with the rest of the stack is two pipes, with the NATS plumbing isolated to a Python sidecar that knows nothing about the tracker’s internals. The contract is small enough to test directly: feed five synthetic detection frames into the binary’s stdin and read five track frames from its stdout, assert the same track id appears across the run. That test is tests/python/test_tracker_node_integration.py and it runs in three seconds without needing a NATS broker.
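For concreteness, a minimal sketch of the framing contract as the Python side sees it; the real helpers live in transport/frame_io.py and the function names here are illustrative:

```python
# Wire format: 4-byte big-endian unsigned length, then that many protobuf bytes.
import struct
from typing import BinaryIO, Optional

def write_frame(stream: BinaryIO, payload: bytes) -> None:
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()  # the child blocks on stdin, so flush every frame

def read_frame(stream: BinaryIO) -> Optional[bytes]:
    header = stream.read(4)
    if len(header) < 4:
        return None  # clean EOF: the other end closed the pipe
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("pipe closed mid-frame")  # truncation is a hard error
    return payload
```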
Inside tracker_node: the per-frame loop and the bbox-to-3D step
The C++ binary’s main loop is one read, one transform, one write. This is where the cam-LiDAR projection from M4 earns its keep. For each DetectionList frame the binary parses the protobuf to recover the frame_id (a string like frame_000084) and the list of Detection2D messages. The numeric tail of the frame_id is the LiDAR scan index. The binary opens the matching .bin file from data/RELLIS-3D/Rellis_3D_os1_cloud_node_kitti_bin/Rellis-3D/00000/os1_cloud_node_kitti_bin/000084.bin, reads the KITTI binary point format (each point is four float32 values: x, y, z, intensity), and produces a std::vector<Eigen::Vector3f> of points in the LiDAR frame. The cam-LiDAR projection from M4 takes a LiDAR point and applies four transformations: a 4x4 SE(3) rigid transform from LiDAR to camera frame using the calibrated extrinsic from transforms.yaml, a depth check that drops points behind the camera (Z_cam less than or equal to zero), a perspective divide through the 3x3 intrinsic matrix from camera_info.txt, and a bounds check against the 1920x1200 image dimensions. About 7% of a typical 130k-point LiDAR scan survives all four filters; the rest are behind the robot or outside the camera’s narrow horizontal field of view (the Basler camera at the focal length the dataset uses sees about 36 degrees horizontally).
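The same four-step chain in numpy form, as a sketch (the production path is C++ with Eigen; function names here are illustrative):

```python
import numpy as np

def load_kitti_bin(path: str) -> np.ndarray:
    # KITTI binary format: N x 4 float32 (x, y, z, intensity); keep xyz.
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)[:, :3]

def project_to_pixels(pts_lidar, T_cam_lidar, K, width=1920, height=1200):
    # 1) 4x4 SE(3) rigid transform into the camera frame (homogeneous coords).
    pts_h = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1), dtype=np.float32)])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    # 2) Depth check: drop points behind the camera (Z_cam <= 0).
    front = cam[:, 2] > 0.0
    # 3) Perspective divide through the 3x3 intrinsic matrix.
    uv = (K @ cam[front].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    # 4) Bounds check against the image dimensions.
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    keep = np.flatnonzero(front)[ok]   # indices back into pts_lidar
    return uv[ok], keep
```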
For each Detection2D bounding box the binary loops over the surviving projected pixels and collects the LiDAR-frame (x, y) of every point that falls inside the bbox. Three things can happen. If zero points fall inside, the binary drops the detection. The 2D bounding box without any LiDAR points has no defensible 3D position. The alternatives (NaN propagation, fallback constant depth) either push the problem downstream or risk creating phantom tracks. A counter (n_dets_dropped_no_depth) tracks how often this happens in a run. If one or more points fall inside, the binary computes the median x and median y independently using std::nth_element, an O(N) selection algorithm in the C++ standard library. The median is robust to long-tail LiDAR returns at bounding-box edges, where a single distant background return can drag a mean by meters but does not move the median. A handful of inside-bbox points is enough to localize the worker in 3D. The pair (median_x, median_y) becomes the tracker’s input for that detection. The binary builds parallel arrays of detection positions and class ids, and calls the tracker’s update method (the no-feature path, since the milestone does not exercise the appearance-encoder path on this segment). The tracker returns a std::vector<Track> of confirmed live tracks. Each track has an id, a position() accessor that returns the Kalman filter’s posterior estimate, a velocity() accessor for velocity components, a passthrough class_id, a passthrough z_3d, and a hits count. The binary serializes those fields into a TrackList protobuf, frames it with a 4-byte length prefix, and writes it to stdout. The bridge picks it up and publishes it on perception.objects.tracks. The whole loop sustains about 2.7 frames per second end-to-end on the laptop CPU. The bottleneck is YOLO; the C++ tracker side runs in microseconds.
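Continuing the sketch above, the bbox-to-3D step reduces to a masked median per axis, with numpy's median standing in for the C++ std::nth_element selection:

```python
def bbox_to_3d(bbox, uv, pts_lidar, keep):
    # bbox = (x1, y1, x2, y2) in pixels; uv/keep are the projection outputs.
    x1, y1, x2, y2 = bbox
    inside = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
              (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
    if not inside.any():
        return None  # no depth evidence: drop the detection (n_dets_dropped_no_depth)
    pts = pts_lidar[keep[inside]]
    # Median per axis: robust to long-tail background returns at bbox edges.
    return float(np.median(pts[:, 0])), float(np.median(pts[:, 1]))
```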
In-loop tracker behavior across the 158-frame demo run. Download MP4 (5.2 MB, higher quality).
When zero events is the right answer
I wired the safety supervisor on the consumer side. The supervisor (built in m5-safety) takes four scalar inputs (vehicle velocity, distance to nearest worker, worker approach speed, terrain traversability score) and produces an intervention rule string and a velocity scale factor. The rule strings are "d_worker < d_stop", "TTC < 2s", "TTC < 5s", and "TTC >= 5s" (no intervention). The first three are the cases where the supervisor fires.
The consumer takes a TrackList frame, picks the nearest track in the forward arc of the robot, computes the four inputs from the track’s position and velocity, the ego speed from the Cartographer pose CSV, and a fixed traversability score of 0.5 (mixed terrain default), runs the rules, and publishes a SafetyEvent if the rule is anything other than "TTC >= 5s". The events go to safety.events, which is the JetStream durable subject. Live-published events would also have been queryable through the broker.
The first version of the forward-arc filter was wrong, and the way it was wrong is worth reconstructing because the data, not a code review, is what surfaced it. I had written if track.x <= 0: continue, intended to mean “skip any track whose 3D position is in the half-space opposite the direction of travel, since the forward-collision rules only reason about objects the robot can hit by moving forward.” This has nothing to do with people physically standing behind the platform; it is a purely geometric filter on track coordinates. The buggy assumption was that on this rig “forward” lies along positive x in the frame the tracker emits. That convention is true on most LiDAR rigs but not on the RELLIS Warthog. M4’s blog (m4-fusion) had documented the camera mounting two milestones earlier: the camera’s optical axis points toward −X in the LiDAR frame.
The Warthog camera’s forward direction maps to the LiDAR frame’s negative X. A worker visible to the camera (which is the only way YOLO sees them) lands at negative track.x in the SORT output. My filter was rejecting every actually-forward-arc track. The first run produced zero SafetyEvents, which I read as a sign of the supervisor not firing. It was actually a sign of my filter eating every input. The fix dropped the directional assumption entirely. Distance is hypot(track.x, track.y); radial velocity outward is the dot product of the track’s velocity vector with the unit vector from the robot to the worker, both of which are sign-agnostic. In sketch form (track attribute names illustrative):
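```python
import math

def forward_geometry(track):
    # track carries a LiDAR-frame position (x, y) and velocity (vx, vy);
    # attribute names are illustrative. No axis is assumed to be "forward".
    d_worker = math.hypot(track.x, track.y)           # range to the worker
    ux, uy = track.x / d_worker, track.y / d_worker   # unit vector robot -> worker
    v_radial_out = track.vx * ux + track.vy * uy      # positive when receding
    return d_worker, v_radial_out
```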
The supervisor’s sign convention (worker_approach_speed positive when receding, per the m5-safety post’s TTC table) lines up with the radial-velocity sign as defined.
I re-ran. The supervisor still fired zero events. The second run was the one that mattered, because the bug was gone and the data was now telling me something real.
The discipline at this point was to trace the dynamics rather than tweak parameters. I sampled seven evenly spaced frames across the window and wrote down what each of the four supervisor inputs actually was:
| Frame | d_worker | v_ego | v_radial_out | v_relative | d_stop |
|---|---|---|---|---|---|
| 83 | 8.72 m | 0.07 m/s | -0.00 m/s | +0.07 m/s | 0.014 m |
| 100 | 10.28 m | 0.23 m/s | +0.96 m/s | -0.73 m/s | 0.051 m |
| 130 | 13.25 m | 0.48 m/s | +1.05 m/s | -0.57 m/s | 0.116 m |
| 160 | 16.23 m | 0.96 m/s | +1.48 m/s | -0.52 m/s | 0.276 m |
| 190 | 17.17 m | 1.00 m/s | +0.52 m/s | +0.48 m/s | 0.293 m |
| 220 | 19.47 m | 0.44 m/s | +0.63 m/s | -0.19 m/s | 0.107 m |
| 239 | 20.62 m | 0.96 m/s | +0.60 m/s | +0.36 m/s | 0.277 m |
The worker is walking faster than the robot is catching up. Most of the time v_relative is negative (worker pulling away, gap growing). On the few frames where it goes positive, the resulting time-to-collision works out to between 35 and 125 seconds, which is well above the supervisor’s 5-second threshold for any intervention. The robot’s stopping distance is in the centimeter range because the robot is moving so slowly that there is almost no kinetic energy for the brakes to dissipate before the platform comes to rest. The supervisor cannot fire because the physics does not warrant any intervention.
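Reading the fastest-closing row off the table makes the margin concrete, assuming the straightforward TTC = d_worker / v_relative on closing frames:

```python
d_worker, v_relative = 17.17, 0.48   # frame 190, the fastest-closing sample above
ttc = d_worker / v_relative          # ~35.8 s, versus the supervisor's 5 s bound
```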
Zero events on this segment is the correct number, and there is nothing surprising about it once the dynamics table is on the page. A 0.6 m/s ego following a worker who walks faster than it does has no kinematic reason to fire stopping-distance or TTC rules; the kinetic energy is small and the gap is widening on most frames. The next logical move to verify the supervisor was therefore to check its wiring and the audit trail (the next section) against synthetic events that do trigger every rule.
How accurate is the detector when nothing else is going on?
The integration headline does not say anything about how good the detector is on its own. The 11.5% pre-flight number was the broader truth about pretrained YOLOv8n on the full RELLIS sequence. For the demo segment specifically, the relevant question is detection precision and recall on the 158-frame window.
RELLIS-3D ships pixel-wise semantic annotations for 6,235 of its frames. Sequence 00 is annotated at every other camera frame (the dataset paper [1] downsamples camera annotations to 5 Hz; RELLIS captures at 10 Hz). For the demo window (frames 83 through 240) that cadence gives 79 annotated frames out of 158. The annotation format is a single-channel PNG where each pixel value is the class id. Class 17 is person; class 8 is vehicle; the other 18 classes cover terrain types and other off-road categories that COCO does not have an analog for.
The evaluator (scripts/run_b_ext_gt_eval.py) runs the following per annotated frame: load the PNG mask, threshold to class 17, drop connected components smaller than 200 pixels as noise, run OpenCV’s connectedComponentsWithStats to extract bounding boxes per remaining component, run YOLOv8n on the matching camera frame and keep class-0 detections at confidence 0.5 or higher, then match predictions to ground truth greedily by intersection-over-union (IoU) at threshold 0.5. A matched pair is a true positive; an unmatched prediction is a false positive; an unmatched ground-truth box is a false negative.
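A sketch of the GT-side box extraction and the IoU the matcher greedily maximizes (the real evaluator is scripts/run_b_ext_gt_eval.py; helper names here are illustrative):

```python
import cv2
import numpy as np

def gt_boxes_from_mask(mask: np.ndarray, class_id: int = 17, min_area: int = 200):
    # Bounding boxes of connected components of one class in a RELLIS id-mask.
    binary = (mask == class_id).astype(np.uint8)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):  # label 0 is background
        x, y, w, h, area = stats[i]
        if area >= min_area:  # drop tiny components as annotation noise
            boxes.append((x, y, x + w, y + h))
    return boxes

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```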
The aggregate over the 79 annotated frames:
| Metric | Value |
|---|---|
| Frames evaluated | 79 |
| TP / FP / FN | 78 / 0 / 2 |
| Precision | 1.000 |
| Recall | 0.975 |
| F1 | 0.987 |
| AP@0.5 (Pascal-VOC 11-point) | 0.909 |
| Mean YOLO max-confidence per frame | 0.852 |
The two false negatives are worth tracing back to the pixels rather than guessing at, because the cause in each case is informative about what kind of error this evaluator actually surfaces.
Frame 90 is an annotation artifact, not a missed person. The ground-truth mask for class 17 has two connected components above the 200-pixel area filter: a large silhouette at bbox (0, 372, 213, 914) with area 63,638 (the worker), and a tiny 22×22 blob at (63, 916, 85, 938) with area 237, sitting two pixels directly below the large bbox’s bottom edge. Inspecting the frame confirms the small blob is a stray patch of pixels in the wet grass that the labeler accidentally tagged as person class 17; it does not correspond to a body part or a second person. YOLO on that frame returned one person box at (1, 368, 217, 957) with confidence 0.846, correctly covering the worker. The evaluator counts the spurious GT blob as an unmatched FN because the greedy IoU matcher does not know the GT itself is wrong. There is no actual person YOLO missed.

Frame 90, worker-feet crop. Green: YOLO bbox (conf 0.846). Red: GT components above the 200-pixel filter. The small red box on the wet ground is the spurious component the FN count blames YOLO for.
Frame 240 is a real confidence-drop case but not the boundary one I had assumed. The GT mask has one component at (563, 515, 665, 763), fully inside the 1920×1200 image (the worker is not at the edge). YOLO returned a person at (563, 515, 665, 760), almost an exact bounding-box match, but at confidence 0.356, which is well below the 0.5 operating threshold rather than just under it. The localization was right; the calibrated confidence on that crop was the issue.

Frame 240, worker crop. Green: YOLO bbox (conf 0.356, below the 0.5 cutoff). Red: GT component. Bbox alignment is essentially exact; the FN is a confidence drop on this specific crop, not a localization or framing failure.
Across the 79 frames the mean max-confidence on the matched detection was 0.852, so the operating threshold has plenty of margin on a typical frame and the frame-240 confidence drop is the outlier rather than the rule. The honest takeaway is that YOLO’s true error rate on this segment is even better than the 0.975 recall headline suggests, because the frame-90 miss is a label-quality artifact rather than a detection failure. Recomputed against only the legitimate GT components, recall is 1.000 (78 of 78) and precision stays at 1.000.
The 100% precision wants the right caveat. The window has a single dominant person and no other COCO-recognizable scene content. YOLO’s chances of hallucinating a wildly wrong “person” bounding box on a forest frame with no ambiguous shapes are small to begin with, so a false-positive count of zero on a single-person window is not as informative as it sounds. The recall figure (0.975) is the more interesting number; across 79 independent frames YOLO essentially never drops the worker, with no fine-tuning on RELLIS or anything else.
Recall is the metric that tells me whether the detector is gating the integration. On the demo window it is 0.975 (or 1.000 if you exclude the labeling misfire on frame 90), so the detector is not what is limiting the pipeline here. The 11.5% pre-flight figure is from a different question on a different slice: how often any COCO class fires anywhere in the 2,847-frame sequence, including the long tail of frames where the worker has left the scene. The two numbers don’t disagree; they answer different questions.
The audit trail that survives a broker restart
The transport layer M3 set up has two pieces. NATS Core handles ephemeral pub/sub for the data plane: detections flowing into the tracker, tracks flowing out. NATS JetStream is the durable-stream extension of the same broker, and it handles audit-trail subjects whose contents have to survive a broker restart. M3 enabled JetStream out of the box but did not declare any streams. The first stream gets declared here.
The stream is safety_events. Subject filter safety.events. Retention LIMITS, max_bytes 1 GB, max_age 0 (which in JetStream’s vocabulary means no time-based eviction; only the byte cap matters), storage FILE (durable). The provisioning script (transport/init_streams.py) is idempotent: invoking it against an existing matching stream returns no_change; against an existing mismatching stream it issues an update_stream to restore the desired config; against a missing stream it issues add_stream. The unit test (test_init_streams.py) spawns a transient nats-server -js on a free port, runs the provisioner, asserts the second invocation reports no_change, and cross-checks that the broker’s reported StreamConfig matches the spec field by field.
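The idempotent pattern, sketched with the nats-py client. The real provisioner (transport/init_streams.py) compares the full StreamConfig field by field; this sketch checks only two fields to keep the shape visible:

```python
import nats
from nats.js.api import StreamConfig, RetentionPolicy, StorageType
from nats.js.errors import NotFoundError

SPEC = StreamConfig(
    name="safety_events",
    subjects=["safety.events"],
    retention=RetentionPolicy.LIMITS,
    max_bytes=1 << 30,          # 1 GB byte cap
    max_age=0,                  # no time-based eviction
    storage=StorageType.FILE,   # durable across broker restarts
)

async def ensure_stream(url: str = "nats://127.0.0.1:4222") -> str:
    nc = await nats.connect(url)
    js = nc.jetstream()
    try:
        info = await js.stream_info(SPEC.name)
        if (info.config.subjects == SPEC.subjects
                and info.config.max_bytes == SPEC.max_bytes):
            result = "no_change"
        else:
            await js.update_stream(SPEC)   # restore the desired config
            result = "updated"
    except NotFoundError:
        await js.add_stream(SPEC)          # first run: declare the stream
        result = "created"
    await nc.close()
    return result
```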
The replay harness (scripts/replay_safety_events.py) pull-subscribes to the durable stream, deserializes each persisted SafetyEvent protobuf, parses the input parameters out of the event’s details field (the producer encodes them as a JSON object with keys v, d, v_w, trav so the replay can reconstruct the supervisor’s inputs), runs them through a Python port of the supervisor’s rules (transport/safety_replay.py), and prints a per-event PASS/FAIL/SKIP table. PASS means the rule string the original event recorded matches what the current rules produce on the same inputs. FAIL means the current rules disagree (which would happen if the rule constants changed between recording and replay), or that the event is structurally malformed. SKIP means the event lacks the input encoding needed to re-evaluate.
The decision to port the supervisor’s rules to Python rather than calling the existing C++ class through pybind11 was a trade-off. Pybind11 would give the replay harness access to the canonical rules without duplication, but it adds a build dependency and constrains the harness to environments where the C++ shared library is present. The Python port is fifty lines and lets the replay harness be a self-contained tool. The risk is rule drift between the C++ class and the Python port, which a unit test (test_safety_replay_known_vectors) anchors against three hand-checked decision vectors: a safe-at-distance scenario, a stopping-distance-exceeded scenario, and a receding-worker scenario. If the C++ rules change, the test catches the drift.
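For the shape of the port, a sketch under loud assumptions: the deceleration constant, the handling of the traversability input, and the exact TTC arithmetic are placeholders here; the real constants live in the M5 C++ class and its anchored Python port.

```python
def evaluate(v: float, d: float, v_w: float, trav: float) -> str:
    # v: ego speed [m/s]; d: distance to nearest worker [m];
    # v_w: worker radial speed, positive when receding [m/s];
    # trav: traversability score in [0, 1] (unused in this sketch).
    A_BRAKE = 1.5                      # assumed deceleration [m/s^2]
    d_stop = v * v / (2.0 * A_BRAKE)   # stopping distance scales with v^2
    if d < d_stop:
        return "d_worker < d_stop"
    v_closing = v - v_w                # positive when the gap is shrinking
    if v_closing <= 0.0:
        return "TTC >= 5s"             # gap growing: no finite time-to-collision
    ttc = d / v_closing
    if ttc < 2.0:
        return "TTC < 2s"
    if ttc < 5.0:
        return "TTC < 5s"
    return "TTC >= 5s"
```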
The integration run produced zero live events because the dynamics were benign (per the previous section), so the durable stream from this milestone is empty and the demo replay against it would print an empty table. The replay path itself is verified by a separate test (test_replay_safety_events.py) that publishes 5 well-formed scenarios spanning all four rule strings plus 1 deliberately malformed event with an unknown rule string, runs the harness, and asserts 5 PASS plus 1 FAIL with reason unknown rule. The audit trail is real even when the live demo does not exercise it.
What gets validated on nuScenes next, not here
The fair criticism of everything above is that the demo segment is gentle. One worker, walking away from a robot that is barely moving, on a clear forest trail. The integration runs end-to-end on this segment and the numbers are clean, but most of what would actually stress the pipeline is just not in the data. That gap is intentional, not a gap I’d paper over by hunting for a worse RELLIS window or splicing in a second dataset to manufacture excitement. The plan has always been that the pieces this segment can’t exercise get exercised on nuScenes [2] in the next milestone, and there are concrete reasons it has to be nuScenes specifically, not “more RELLIS” or “any harder driving log.”
On the supervisor side, the more interesting thing is that the audit log finally fills up. RELLIS gives an empty safety.events stream because the dynamics are benign, and a JetStream replay over an empty stream is not a real test of anything. nuScenes routinely puts vehicles a few meters from pedestrians at speeds where the kinematic stopping rule has something to say, so the same supervisor code I tested against synthetic events here gets to fire on real ones, and the replay harness gets to produce a non-trivial PASS/FAIL table when I perturb a rule constant. The work in this milestone is what makes that possible: the supervisor is wired, the protobuf is in the durable stream, and the replay loop is byte-equal to the live loop. Without the empty-stream baseline I have today, I would not be able to tell whether a non-empty stream tomorrow was firing because the rules are right or because the wiring drifted.
The tracker’s M10 machinery has a similar character. The cascade matching, the Mahalanobis gate, the world-frame anchor on the Lost state — those exist because urban driving needs them. On a 0.6 m/s ego following a worker who never disappears for more than a frame, the world-frame anchor and the ego-frame anchor differ by centimeters across an occlusion, so the cascade essentially never fires its harder branches. I see this as a feature of the present demo, not a flaw: the integration is verified against a regime where the tracker is doing easy work, and the harder regime that exercises the design choices made in M10 is the natural job of the next milestone.
Where I would take this next
The two most natural follow-ups are the YOLOv8 fine-tune and the cross-dataset milestone.
The fine-tune path is direct. RELLIS-3D has pixel-wise person and vehicle annotations on disk for sequences 00 through 04. Converting them to YOLO-format bounding boxes per frame is a connected-components pass followed by a tight bounding-box extraction (the same operation the detector evaluator already does, applied across the full annotated set rather than just the 79 frames in the demo window). The resulting train/val split feeds a YOLOv8n fine-tune on NYU’s HPC. The output is a domain-fine-tuned detector that should lift the 11.5% target-hit fraction substantially, and the integration pipeline this milestone built is the natural test harness.
The cross-dataset milestone is the larger effort. The mechanical work (write a nuScenes loader, point the cam-LiDAR projector at a different calibration YAML, run the same pipeline on the new sensor rig) is straightforward. The interesting work is wiring a learned 3D detector into the place where the present milestone projects 2D boxes into a LiDAR cloud, evaluating the tracker against nuScenes ground-truth tracks, and producing the first non-trivial JetStream replay run.
The tracker built in the previous milestone (interacting-multiple-model filter, cascade machinery, world-frame anchor, Mahalanobis gate, multi-frame accumulation) and the integration built in the present milestone (Unix-pipe contract, NATS bridge, JetStream audit, replay harness) all carry over to the cross-dataset milestone without modification. The detector is the only piece that changes.
What ships from this milestone
The code that lands in the repo from this milestone is small relative to the surface area of the integration; most of the work is in deciding where the boundaries go and which previous milestone owns each side. On the C++ side, tracker_node is one binary that reads detection protobufs off stdin, does cam-LiDAR projection through the M4 calibration, computes the median (x, y) of the LiDAR returns inside each YOLO bbox, hands those positions to the M10 tracker, and writes the resulting tracks back out on stdout. The binary has no NATS dependency at all. On the Python side, tracker_bridge.py owns the two NATS connections and pipes payloads in and out of the binary’s standard streams; a 4-byte big-endian length-prefixed framing helper is the contract on the wire and is shared by both ends. There is also a typed pack/unpack codec for the three protobuf message types the milestone introduces, and an idempotent JetStream stream provisioner for the durable safety_events subject so the broker config survives a fresh start without manual intervention.
The supervisor side has a small but important asymmetry. The kinematic rules from the original Phase 1 M5 are C++; the replay harness needs to re-evaluate persisted events long after the C++ supervisor has moved on, possibly with different rule constants. So the harness is Python and the rules have a Python port that I anchored against the C++ class with a small set of hand-checked decision vectors — same numerical inputs, same numerical outputs to within float tolerance. That is what makes the replay loop trustworthy instead of just re-running on the same events that produced the same outcomes the first time around.
The runner on top of all of this drives the full pipeline across the 158-frame demo window and writes out three things: a per-frame tracks NDJSON, a per-frame safety-events NDJSON (empty on this segment, for the kinematic reasons covered above), and a metrics JSON with the headline numbers. A separate detector-evaluation runner produces the YOLOv8n-vs-RELLIS-GT precision/recall/F1/AP table on the 79 annotated frames inside the same window. The animation renderer that produces the two blog clips lives behind a TP_M5_RENDER=1 flag, so the assets only get regenerated at the end of the milestone once the metrics they depict are stable.
Tests grew by 12 in this milestone, bringing the project to 31 Python tests total and all of them passing. The new ones cover the detection node (2), the proto codec (4), a single integration test that pipes five synthetic detection frames through the C++ binary and asserts the same track id survives every frame, two for the stream provisioner that spawn a transient broker and verify both idempotency and the resulting stream config, two for the replay harness that cover the PASS/FAIL/SKIP semantics on synthetic events as well as the Python-vs-C++ rule-port correctness on the hand-checked vectors, and one end-to-end test that wires a real broker, the bridge, the C++ child, and a consumer together over a live NATS connection. The integration test is the one I trust most because it exercises every join in the pipeline at once with no mocks.
The headline numbers on the demo segment, all sourced from tool runs preserved at results_m5/phase_b/:
| Metric | Value |
|---|---|
| Distinct track IDs over 158 frames | 1 |
| Longest track lifetime | 157 frames |
| Mean track lifetime | 157.0 frames |
| Live safety events fired | 0 (dynamics benign; documented above) |
| Pipeline end-to-end FPS (CPU) | 2.7 |
| YOLOv8n-COCO precision vs GT (79 frames) | 1.000 |
| YOLOv8n-COCO recall vs GT (79 frames) | 0.975 |
| F1 vs GT | 0.987 |
| AP@0.5 (Pascal-VOC 11-point) | 0.909 |
Two clips for the blog at docs/assets/m5/: a tracker-overlay animation showing the SORT track id on the worker across the 158-frame demo segment, and a continuous side-by-side animation across the same 158 frames showing YOLOv8n predictions on the left versus the dataset’s pixel-wise ground-truth person silhouettes (forward-filled between annotation frames) on the right, with the false-negative frames bordered in red.
References
[1] Jiang, P., Osteen, P., Wigness, M., Saripalli, S. (2021). RELLIS-3D Dataset: Data, Benchmarks and Analysis. IEEE Robotics and Automation Letters / IROS. The dataset paper. Source for the camera-LiDAR calibration (transforms.yaml), intrinsics (camera_info.txt), and the pixel-wise person/vehicle annotations used in the detector evaluator. The paper documents the camera annotations as 5 Hz downsampled from the 10 Hz capture rate, which is why only 79 of the 158 demo-window frames are annotated.
[2] Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O. (2020). nuScenes: A multimodal dataset for autonomous driving. IEEE/CVF Conference on Computer Vision and Pattern Recognition. The cross-dataset evaluation partner for the next milestone.
[3] Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O. (2019). PointPillars: Fast Encoders for Object Detection from Point Clouds. IEEE/CVF Conference on Computer Vision and Pattern Recognition. The first widely adopted learned 3D LiDAR detector. The bbox-to-3D step in this milestone is the integration scaffold; PointPillars (or CenterPoint, [4]) is the production-quality replacement.
[4] Yin, T., Zhou, X., Krähenbühl, P. (2021). Center-based 3D Object Detection and Tracking. IEEE/CVF Conference on Computer Vision and Pattern Recognition. CenterPoint, an alternative learned 3D detector that ships its own simple tracker alongside the detector.
[5] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B. (2016). Simple Online and Realtime Tracking. IEEE International Conference on Image Processing. The SORT paper. The tracker the present milestone integrates against was built directly from this paper across the previous milestones.
[6] ultralytics. YOLOv8 documentation. https://docs.ultralytics.com/models/yolov8/. The pretrained COCO checkpoint used as the integration-scaffold detector across this milestone.
[7] NATS Project. JetStream concepts. https://docs.nats.io/nats-concepts/jetstream. Source for the durable-stream semantics and the StreamConfig field set used in transport/init_streams.py.
[8] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., Luo, P. (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Advances in Neural Information Processing Systems. The pretrained semantic segmentation model used in m4-fusion. Cited here because the present milestone inherits M4’s projector class.