Hardware
The deployment platform. The student model and runtime fusion exist because of these specific sensors and this specific compute budget; knowing what they are and why they were chosen makes the rest of the documentation easier to follow.
This page covers the on-board hardware. The off-board training hardware (NYU Greene HPC, L40S GPUs) is documented in training.
The robot — Traxxas Maxx 4S
A 1/10 scale Ackermann-steered RC chassis. The body of the robot.
| Property | Value |
|---|---|
| Scale | 1/10 |
| Steering | Ackermann (front-wheel) |
| Wheelbase | 0.187 m |
| Front track | 0.137 m |
| Rear track | 0.145 m |
| Wheel radius | 0.055 m |
| Max steering angle | 0.489 rad (~28°) |
| Footprint | [[0.20, 0.08], [0.20, -0.08], [-0.10, -0.08], [-0.10, 0.08]] |
The chassis isn’t a background detail. Its geometry shapes the navigation and control stack:
- The Ackermann turning constraint forces the planner to produce drivable paths (no in-place rotation), which is why we use SmacPlannerHybrid + MPPI rather than the typical Nav2 differential-drive defaults
- The footprint defines the inflation distances the costmap needs to respect
- The wheelbase + max steering angle determine the minimum turning radius (~0.65 m), which sets what corridors the robot can actually negotiate (see the kinematics sketch after this list)
- The drive motor responsiveness (0-60 in ~3 s with the stock 4S setup) is more than enough for indoor use, but it means we run with conservative MPPI velocity limits to keep behavior predictable
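For a quick sanity check of these numbers, the bicycle-model relation R = L / tan(δ) ties the wheelbase and max steering angle to a turning radius. The sketch below is illustrative, not project code; it assumes base_link sits at the rear axle center, and treating the ~0.65 m planner figure as a deliberately conservative setting with margin beyond the raw kinematic minimum is my assumption, not something stated on this page.

```python
import math

# Chassis numbers from the table above
WHEELBASE = 0.187   # m
MAX_STEER = 0.489   # rad (~28 deg)
FOOTPRINT = [(0.20, 0.08), (0.20, -0.08), (-0.10, -0.08), (-0.10, 0.08)]  # m

# Bicycle-model turning radius at the rear axle: R = L / tan(delta_max)
r_axle = WHEELBASE / math.tan(MAX_STEER)

# Radius swept by the outermost footprint corner. The turning center sits
# r_axle to the side of the rear axle, so a corner at (x, y) sweeps
# sqrt(x^2 + (r_axle + |y|)^2) on the outside of the turn.
r_swept = max(math.hypot(x, r_axle + abs(y)) for x, y in FOOTPRINT)

print(f"kinematic radius at rear axle: {r_axle:.2f} m")   # ~0.35 m
print(f"outer swept radius:            {r_swept:.2f} m")  # ~0.48 m
```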
The platform was chosen for three reasons: enough space and payload to mount real sensors and compute, fast and sturdy enough to handle the indoor test environment, and a geometry that maps naturally to F1TENTH-style autonomy thinking. Not just “some RC car” — a deliberate physical commitment to treating the project like a small autonomous vehicle.
The depth camera — Orbbec Femto Bolt
The sensor whose failure mode gave the project its identity. RGB-D camera with Time-of-Flight depth and an onboard IMU.
| Stream | Resolution | Rate | Format | Topic |
|---|---|---|---|---|
| RGB | 1280 × 720 | 30 Hz | Image | /camera/color/image_raw |
| ToF depth | 640 × 576 | 30 Hz | Image (16UC1, mm) | /camera/depth/image_raw |
| Confidence | 640 × 576 | 30 Hz | Image | /camera/depth/confidence |
| IMU | — | 200 Hz | Imu | /camera/imu |
The ToF channel fails on 77.79% of pixels in our corridor (polished floors, glass walls, glossy furniture). This is the central failure mode the project was built to address — see Bootstrap Perception and ToF Failure Modes for the full discussion.
We kept this sensor despite the failure rate because the alternative isn’t “perfect sensor vs broken sensor.” It’s “no depth camera at all, vs a sensor that’s still highly valuable where it works and informative even where it doesn’t.” The Femto Bolt provides metric depth where valid, a confidence map that tells us which pixels to trust, RGB for the learned fallback, and IMU for the broader stack. That combination is what makes bootstrap perception possible.
The driver is the orbbec_camera ROS 2 package. Connection is USB 3.0. Camera intrinsics are available on /camera/depth/camera_info.
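As a hedged illustration of how the 16UC1 millimetre depth and those intrinsics are typically consumed, here is a minimal rclpy sketch that caches CameraInfo and back-projects the center depth pixel into a 3D point in the camera frame. The topic names come from the tables on this page; the node name is made up, and the sketch assumes the depth image is tightly packed (step = width × 2 bytes).

```python
import numpy as np
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import CameraInfo, Image


class DepthBackproject(Node):
    """Sketch: back-project the center ToF pixel to a 3D point (camera frame)."""

    def __init__(self):
        super().__init__("depth_backproject_example")
        self.k = None  # 3x3 intrinsics from CameraInfo
        self.create_subscription(CameraInfo, "/camera/depth/camera_info", self.on_info, 10)
        self.create_subscription(Image, "/camera/depth/image_raw", self.on_depth, 10)

    def on_info(self, msg: CameraInfo):
        self.k = np.array(msg.k).reshape(3, 3)

    def on_depth(self, msg: Image):
        if self.k is None:
            return
        # 16UC1: one uint16 per pixel, depth in millimetres, 0 = no return
        depth = np.frombuffer(msg.data, dtype=np.uint16).reshape(msg.height, msg.width)
        u, v = msg.width // 2, msg.height // 2
        z_mm = int(depth[v, u])
        if z_mm == 0:
            return  # invalid ToF pixel (the failure mode discussed above)
        z = z_mm / 1000.0
        fx, fy = self.k[0, 0], self.k[1, 1]
        cx, cy = self.k[0, 2], self.k[1, 2]
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        self.get_logger().info(f"center pixel -> ({x:.3f}, {y:.3f}, {z:.3f}) m")


def main():
    rclpy.init()
    rclpy.spin(DepthBackproject())


if __name__ == "__main__":
    main()
```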
The LiDAR — RPLiDAR S2
2D rotating laser scanner. The reliable backbone of navigation. Always works, but only sees a single horizontal plane.
| Property | Value |
|---|---|
| Range | 18 m |
| Field of view | 360° |
| Scan rate | ~10 Hz |
| Scan mode | DenseBoost |
| Driver | sllidar_ros2 |
| Connection | USB serial @ 1,000,000 baud |
| Frame | lidar_link |
| Topic | /scan (sensor_msgs/LaserScan) |
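For reference, a launch sketch wiring those values into the driver might look like the following. The parameter names follow the upstream sllidar_ros2 launch examples and are an assumption here, and the /dev/ttyUSB2 path is a placeholder rather than the actual device assignment.

```python
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    # Sketch of an sllidar_ros2 launch matching the table above.
    return LaunchDescription([
        Node(
            package="sllidar_ros2",
            executable="sllidar_node",
            name="sllidar_node",
            parameters=[{
                "serial_port": "/dev/ttyUSB2",   # placeholder device path
                "serial_baudrate": 1000000,      # from the table
                "frame_id": "lidar_link",
                "scan_mode": "DenseBoost",
                "inverted": False,
                "angle_compensate": True,
            }],
            output="screen",
        ),
    ])
```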
Used by Nav2’s local costmap (primary obstacle source), Nav2’s global costmap (only source — no depth at the global scale), AMCL (localization), and SLAM Toolbox (mapping).
The LiDAR’s limitation isn’t quality — it’s dimensionality. It sees a thin slice of the world at scanner mount height. That’s excellent for wall geometry, planar occupancy, and localization against a 2D map. It’s weak for:
- Anything above or below the scan plane (chairs, tabletops, torsos)
- Glass walls (the laser passes through)
- Low obstacles that miss the scan plane entirely
This is exactly why the Femto Bolt depth and learned depth from the student model are needed — they fill the vertical gap the LiDAR can’t see. See Four-Layer Sensing Hierarchy.
The mental model: RPLiDAR S2 is the reliable skeleton of the robot’s geometry understanding. Bootstrap perception doesn’t replace that skeleton. It adds missing body volume where the 2D scan is inherently incomplete.
The compute — Jetson Orin Nano 8GB
The robot’s main computer. The board that runs ROS 2, all perception nodes, navigation, and the hardware bridge.
| Property | Value |
|---|---|
| RAM | 8 GB (shared between CPU and GPU) |
| OS | Ubuntu 22.04 (JetPack 6) |
| ROS 2 | Humble |
| GPU | NVIDIA Ampere (1024 CUDA cores, 32 Tensor Cores), used for TensorRT FP16 inference |
What runs on it:
| Component | What it does |
|---|---|
| Ackermann HW Node | Motor / steering control via Teensy serial bridge |
| Student TRT Node | EfficientViT-B1 inference (V9 student), ~5 ms |
| Depth Fusion Node | Confidence-gated fusion of ToF + student depth (sketched after this table) |
| YOLO TRT Node | YOLOv8 detection, ~6 ms |
| Class Costmap Node | Class-aware obstacle inflation (when wired) |
| Nav2 | Full navigation stack (planner, controller, costmap) |
| EKF (robot_localization) | Odometry fusion |
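The Depth Fusion Node in the table is, at its core, a per-pixel switch: trust the metric ToF value where the sensor returned something and the confidence map clears a threshold, fall back to the student's prediction elsewhere. The NumPy sketch below shows that idea, not the deployed node; the threshold value, the normalized-confidence assumption, and the function name are all illustrative.

```python
import numpy as np


def fuse_depth(tof_mm: np.ndarray,
               confidence: np.ndarray,
               student_m: np.ndarray,
               conf_threshold: float = 0.5) -> np.ndarray:
    """Confidence-gated fusion sketch (not the deployed node).

    tof_mm     : uint16 ToF depth in millimetres, 0 = no return
    confidence : per-pixel confidence, assumed normalized to [0, 1]
    student_m  : student network depth in metres, same shape
    """
    tof_m = tof_mm.astype(np.float32) / 1000.0
    trust_tof = (tof_mm > 0) & (confidence >= conf_threshold)
    # Metric ToF where it is trustworthy, learned depth everywhere else
    return np.where(trust_tof, tof_m, student_m)
```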
USB allocation:
- USB0 → VESC MINI (drive ESC)
- USB1 → Teensy 4.1 (steering, encoders)
- USB2 → RPLiDAR S2
- USB3 → Femto Bolt
Runtime perception performance:
- DA3-Small zero-shot: 218 FPS, 4.6 ms, 2.7 GB RAM (TensorRT FP16, 308 × 308 input)
- EfficientViT-B1 student: ~5 ms (TensorRT FP16, 240 × 320 input)
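As a rough frame-budget check (simple arithmetic under the assumption that the two networks run back to back on the GPU, not a measurement):

```python
frame_budget_ms = 1000.0 / 30          # ~33.3 ms between 30 Hz camera frames
student_ms, yolo_ms = 5.0, 6.0         # latencies from the list above
remaining_ms = frame_budget_ms - (student_ms + yolo_ms)
print(f"~{remaining_ms:.1f} ms of each frame left for fusion, costmaps, etc.")
```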
The Jetson matters because it’s where the project’s ambitions hit the deployment budget. It’s easy to design a system that works on a desktop GPU. It’s much harder to make the whole stack — perception, navigation, localization, and real sensors — fit on an embedded platform that still has to leave room for ROS 2. That budget is why the student model exists in the first place.
Known issue: TensorRT runtime
TensorRT on the Jetson uses a ctypes CUDA backend instead of pycuda, because pycuda is incompatible with JetPack 6. This is wired up in trt_utils.py and works correctly, but it is a non-standard setup that confused us early in deployment. If you're porting the runtime to a different Jetson configuration, this is the main pitfall to watch for.
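To make the pitfall concrete, here is a minimal sketch of the ctypes pattern: loading libcudart directly and moving NumPy buffers to and from device memory without pycuda. This illustrates the approach, not the contents of trt_utils.py; the helper names are made up, and on some installs the library may need to be loaded by a versioned name such as libcudart.so.12.

```python
import ctypes
import numpy as np

# Load the CUDA runtime directly; no pycuda required.
_cudart = ctypes.CDLL("libcudart.so")

CUDA_MEMCPY_HOST_TO_DEVICE = 1
CUDA_MEMCPY_DEVICE_TO_HOST = 2


def _check(err: int) -> None:
    if err != 0:
        raise RuntimeError(f"CUDA runtime error {err}")


def device_alloc(nbytes: int) -> ctypes.c_void_p:
    """cudaMalloc a raw device buffer and return its pointer."""
    ptr = ctypes.c_void_p()
    _check(_cudart.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(nbytes)))
    return ptr


def copy_to_device(dev_ptr: ctypes.c_void_p, host: np.ndarray) -> None:
    _check(_cudart.cudaMemcpy(dev_ptr,
                              ctypes.c_void_p(host.ctypes.data),
                              ctypes.c_size_t(host.nbytes),
                              ctypes.c_int(CUDA_MEMCPY_HOST_TO_DEVICE)))


def copy_from_device(host: np.ndarray, dev_ptr: ctypes.c_void_p) -> None:
    _check(_cudart.cudaMemcpy(ctypes.c_void_p(host.ctypes.data),
                              dev_ptr,
                              ctypes.c_size_t(host.nbytes),
                              ctypes.c_int(CUDA_MEMCPY_DEVICE_TO_HOST)))

# The raw device pointers are what get handed to the TensorRT execution
# context as input/output bindings, in place of pycuda's allocations.
```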
What this hardware setup forces
The hardware budget is tight enough that several design choices weren’t really choices:
- The student must be small. ~5 MB ONNX, ~5 ms inference. EfficientViT-B1 sits at the upper end of what fits.
- Teachers can’t run on the robot. DA3-Metric-Large alone exceeds the 8 GB RAM budget. Distillation on HPC is the only path.
- Perception has to fit alongside Nav2. The student plus YOLO plus the fusion node plus class costmap is the full perception budget. Adding a fourth network would push the system into swap.
- 2D LiDAR is the navigation backbone. A 3D LiDAR would solve the vertical-extent problem but would cost more than the rest of the robot combined and wouldn’t fit the form factor.
- The depth camera matters even when it fails. ToF gives metric anchor where it works and confidence info everywhere — both are load-bearing for the bootstrap pipeline.
If any of these constraints relaxed (more RAM, faster GPU, 3D LiDAR), the system architecture would look different. Bootstrap perception is a response to the specific shape of this hardware budget.