Staff Software Engineer - AI Research Infrastructure

2 days, 8 hours ago
Full-time
Senior
DevOps and Infrastructure
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build abstractions for job submission, scheduling, and monitoring so researchers can launch large-scale experiments quickly.
  • Develop tooling to improve research productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Create workflows that reduce iteration time while maintaining reliability, efficiency, and security.
  • Partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.
  • Influence the long-term roadmap for research computation and how models are trained, evaluated, and shipped.
  • Mentor and support other engineers working on compute, infrastructure, and AI systems.
  • Develop and run the research stack that powers Databricks AI Research.

Requirements

  • BS/MS or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience, including substantial work on large-scale distributed systems or infrastructure.
  • Deep experience building and operating distributed systems, data pipelines, or large-scale backend services.
  • Experience with GPUs, clusters, or major cloud providers is strongly preferred.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience contributing to cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move quickly and pragmatically while maintaining operational excellence.
  • Experience taking complex systems from prototype to stable, well-owned services.
  • Strong communication skills with the ability to translate between research needs and infrastructure realities.

Benefits

  • Local pay range of $199,000 to $270,000 USD.
  • Eligibility for annual performance bonus.
  • Equity as part of the total compensation package.
  • Comprehensive benefits and perks offered to employees, with region-specific details available.
  • Fair and equitable compensation practices based on skills, experience, certifications, training, and location.
