Staff Software Engineer - AI Research Infrastructure

1 hour, 32 minutes ago
Full-time
Lead
Software Development
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build services that schedule, orchestrate, and observe large-scale training and inference workloads across thousands of GPUs.
  • Create abstractions for job submission, scheduling, and monitoring so researchers can move from idea to experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Build workflows that reduce iteration time while preserving reliability, efficiency, and security.
  • Partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.
  • Influence the long-term roadmap for research computation and model development.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.

Requirements

  • BS/MS or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience, including substantial work on large-scale distributed systems or infrastructure.
  • Deep experience building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience building or contributing to cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move fast, be pragmatic, and drive systems from prototype to stable, well-owned services.
  • Strong communication skills with both researchers and engineers.
  • Experience translating between research needs and infrastructure realities.
  • Experience with operational excellence in production systems.

Benefits

  • Local pay range of $190,000 to $270,000 USD.
  • Eligibility for an annual performance bonus.
  • Equity may be included in the total compensation package.
  • Comprehensive benefits and perks offered to meet employee needs.
  • Compensation is reviewed based on experience, certifications, training, and work location.
