Staff Software Engineer - AI Research Infrastructure

2 weeks, 4 days ago
Full-time
Lead
Software Development
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4,450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build services that schedule, orchestrate, and observe large-scale training and inference workloads across thousands of GPUs.
  • Create abstractions for job submission, scheduling, and monitoring so researchers can move from idea to experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Build workflows that reduce iteration time while preserving reliability, efficiency, and security.
  • Partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.
  • Influence the long-term roadmap for research computation and model development.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.

Requirements

  • BS/MS or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience, including substantial work on large-scale distributed systems or infrastructure.
  • Deep experience building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience building or contributing to cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move fast, be pragmatic, and drive systems from prototype to stable, well-owned services.
  • Strong communication skills with both researchers and engineers.
  • Experience translating between research needs and infrastructure realities.
  • Experience with operational excellence in production systems.

Benefits

  • Local pay range of $190,000 to $270,000 USD.
  • Eligibility for an annual performance bonus.
  • Equity may be included in the total compensation package.
  • Comprehensive benefits and perks offered to meet employee needs.
  • Compensation is reviewed based on experience, certifications, training, and work location.
