Staff GPU Systems Engineer, Space Computing

15 hours, 44 minutes ago
Full-time
Senior
DevOps and Infrastructure
Relativity Space

Relativity Space is a cutting-edge rocket company using 3D printing and AI to provide cost-effective reusable rockets for commercial launches, with a vision to advance industrial capabilities on Earth and Mars.

Aerospace & Defense
251-1K employees
Founded 2015
$1,333M raised

Description

  • Own the GPU compute environment, including setup, driver integration, container runtime, job scheduling, and performance optimization.
  • Profile and optimize compute performance across the full stack, including GPU utilization, memory bandwidth, I/O throughput, and storage interface performance.
  • Build power- and thermal-aware compute scheduling that aligns batch workloads with orbital constraints.
  • Develop compute health monitoring and upset recovery mechanisms such as checkpoint/restart, GPU fault detection, and automated recovery.
  • Integrate GPU drivers with the payload Linux image in coordination with the Platform RE team.
  • Manage the container runtime for compute workloads.
  • Ensure the platform reliably runs ML frameworks and SAR processing pipelines maintained by the broader operations team.

Requirements

  • BS or MS in Computer Science or Electrical Engineering.
  • 5+ years of relevant experience.
  • Hands-on experience with GPU programming and compute frameworks such as CUDA, ROCm, or OpenCL.
  • Demonstrated experience profiling and optimizing GPU workloads.
  • Strong Linux systems administration and performance tuning skills.
  • Experience with container technologies such as Docker, Podman, or lightweight alternatives.
  • Experience with HPC job scheduling concepts.
  • Working proficiency in Python for tooling, scripting, and ML framework integration.
  • C/C++ skills for performance-critical system components.
  • Experience with HPC cluster administration, ML infrastructure, or cloud GPU compute platforms at scale is preferred.
  • Deep familiarity with ML framework runtime requirements, including PyTorch or TensorFlow deployment, model serving, and inference optimization, is preferred.
  • Knowledge of GPU compute architectures at the hardware level is preferred.
  • Experience with high-throughput data movement and storage I/O optimization is preferred.
  • Background in power-managed computing, including duty cycling, thermal throttling, and workload scheduling under variable power constraints, is preferred.
  • Experience designing checkpoint/restart or fault-tolerant batch processing systems is preferred.

Benefits

  • Competitive salary with a hiring range of $181,000 to $248,500 USD.
  • Equity compensation.
  • Generous PTO and sick leave policy.
  • Parental leave.
  • Annual learning and development stipend.
  • Additional benefits and perks available through the company benefits program.
  • Reasonable accommodation support during the hiring process.
