NYU Torch HPC — access notes

Quick reference for connecting to NYU Torch and the layout we use for Terra Perceive.

SSH

Aliases live in ~/.ssh/config (already configured):

Alias   Host                        Purpose
torch   login.torch.hpc.nyu.edu    interactive login — compile, sbatch, monitor
dtn     dtn.torch.hpc.nyu.edu      data-transfer node — rsync/scp of large datasets

Auth is Microsoft device login (no SSH keys). ControlMaster keeps the master socket alive ~24h after last activity so MFA only fires once per host.
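
The ControlMaster stanza looks roughly like this (a sketch of the existing config, not a copy of it; the ControlPath and ControlPersist values are assumptions):

# in ~/.ssh/config; the sockets dir must exist (mkdir -p ~/.ssh/sockets)
Host torch
    HostName login.torch.hpc.nyu.edu
    User np3129
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 24h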

ssh torch                # interactive
ssh dtn                  # bulk data
ssh -O check torch       # is the master socket alive?
ssh -O exit torch        # force re-auth

User: np3129.

On-HPC layout

/scratch/np3129/
├── terra-perceive-p2m4/         # repo (rsynced from laptop)
│   ├── data/
│   │   └── RELLIS-3D -> /scratch/np3129/data/RELLIS-3D   (symlink)
│   ├── third_party -> /scratch/np3129/third_party        (symlink)
│   └── ...
├── data/RELLIS-3D/              # bag files (5 × ~6 GB)
│   ├── 00000_00.bag
│   ├── 00000_01.bag
│   ├── 00000_02.bag
│   ├── 00000_03.bag
│   └── 00000_04.bag
├── conda_envs/terra_perceive_m4/   # conda prefix env
├── conda_pkgs/                  # cache
├── third_party/                 # tinycolormap, stb headers
└── m4_perframe/                 # ablation outputs

Why $SCRATCH for everything: home quota is small; $SCRATCH is the only place big enough for bags + build artifacts + conda envs. $SCRATCH is not backed up — treat it as cache.

Sync flows

# Laptop → HPC (code + bags). Bags use rsync --partial; resumable (see the sketch below).
bash scripts/sync_to_hpc.sh

# HPC → laptop (results, plots, logs).
bash scripts/sync_from_hpc.sh
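
For reference, the bag half of the laptop → HPC sync boils down to an rsync like this (a sketch; the real script's flags, excludes, and paths may differ):

rsync -avh --partial --progress \
    data/RELLIS-3D/*.bag \
    dtn:/scratch/np3129/data/RELLIS-3D/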

The sync script self-heals a broken data/ entry on HPC (a plain file or dangling symlink where the directory should be); the manual repair block at the end of these notes applies the same fix by hand.

One-time setup

ssh torch
cd /scratch/$USER/terra-perceive-p2m4
bash scripts/setup_hpc_p2m4.sh   # builds conda prefix env, links third_party

Conda env

source /scratch/np3129/conda_envs/terra_perceive_m4/etc/profile.d/conda.sh
conda activate /scratch/np3129/conda_envs/terra_perceive_m4
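
Quick check that activation took (nothing assumed here beyond the prefix path above):

which python     # expect /scratch/np3129/conda_envs/terra_perceive_m4/bin/python
python -V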

⚠️ Quota-safe install — DO THIS FIRST on a fresh login

Conda’s package cache defaults to $HOME/.conda/pkgs. NYU $HOME quota is 0.05 TB / 30k inodes — installing anything mid-sized (matplotlib + opencv + ffmpeg) blows the inode quota with [Errno 122] Disk quota exceeded. This hit twice (2026-04-26 and 2026-04-27); both times the caches had to be cleaned and redirected.

Permanent fix (idempotent; safe to re-run on any fresh login):

# Redirect conda + pip caches to scratch (where there's 5 TB)
cat > ~/.condarc <<'EOF'
pkgs_dirs:
  - /scratch/np3129/conda_pkgs
  - /home/np3129/.conda/pkgs
envs_dirs:
  - /scratch/np3129/conda_envs
  - /home/np3129/.conda/envs
EOF

mkdir -p /scratch/np3129/conda_pkgs /scratch/np3129/pip_cache

export CONDA_PKGS_DIRS=/scratch/np3129/conda_pkgs
export PIP_CACHE_DIR=/scratch/np3129/pip_cache

# Persist across sessions
grep -q CONDA_PKGS_DIRS ~/.bashrc || echo 'export CONDA_PKGS_DIRS=/scratch/np3129/conda_pkgs' >> ~/.bashrc
grep -q PIP_CACHE_DIR  ~/.bashrc || echo 'export PIP_CACHE_DIR=/scratch/np3129/pip_cache'   >> ~/.bashrc

Recovery if you’ve already hit the quota error:

conda clean -a -y
rm -rf ~/.conda/pkgs/*
rm -rf ~/.cache/pip/*
myquota   # confirm $HOME is back under 80% inodes

Verify the redirect is active before installing:

conda config --show pkgs_dirs    # first entry must be /scratch/...
echo $CONDA_PKGS_DIRS            # must be /scratch/np3129/conda_pkgs
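
The pip side can be checked the same way (pip cache dir needs pip ≥ 20.1):

pip cache dir                    # should print /scratch/np3129/pip_cache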

Common env recipes

Slim training/animation env (matplotlib + cv2 + ffmpeg + numpy/pandas — no ROS2):

mamba create -p /scratch/np3129/conda_envs/<env-name> \
    -c conda-forge -y \
    python=3.11 numpy matplotlib opencv ffmpeg pillow pyyaml pandas

terra_perceive_m13 was rebuilt from this recipe on 2026-04-27 to handle the Phase-4 animation render; the install takes ~3 min.
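
A quick smoke test after such an install (assumes conda is already on PATH; versions will vary):

conda activate /scratch/np3129/conda_envs/<env-name>
python -c "import cv2, matplotlib, pandas; print(cv2.__version__, matplotlib.__version__)"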

For the C++ tracker_runner build (needs ROS2 / colcon), use the heavier setup_hpc_p2m4.sh flow, which provisions the ROS2 Humble env separately.

Slurm

sbatch slurm/run_ablation_g.slurm
squeue -u $USER
scancel <jobid>
tail -f /scratch/$USER/p2m4_logs/<jobid>.out
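
For orientation, a minimal sketch of the shape such a batch script takes (resource numbers and the job body are assumptions, not the contents of run_ablation_g.slurm; only the log path is chosen to match the tail command above):

#!/bin/bash
#SBATCH --job-name=ablation_g
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=/scratch/%u/p2m4_logs/%j.out

source /scratch/np3129/conda_envs/terra_perceive_m4/etc/profile.d/conda.sh
conda activate /scratch/np3129/conda_envs/terra_perceive_m4
cd /scratch/$USER/terra-perceive-p2m4
# actual ablation command goes here (see the checked-in slurm/ scripts)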

If the sync script’s auto-repair ever fails, run on HPC:

cd /scratch/$USER/terra-perceive-p2m4
ls -la data 2>/dev/null            # diagnose: file? broken symlink? dir?
[ -e data ] && [ ! -d data ] && rm -f data         # plain file where the dir should be
[ -L data ] && [ ! -e data ] && rm -f data         # dangling symlink
mkdir -p data
[ -L data/RELLIS-3D ] && [ ! -e data/RELLIS-3D ] && rm -f data/RELLIS-3D   # dangling bag-dir symlink
[ ! -L data/RELLIS-3D ] && ln -s /scratch/$USER/data/RELLIS-3D data/RELLIS-3D
ls -la data/

Verify: ls data/RELLIS-3D/*.bag should list the 5 bags.