Staff Software Engineer - AI Research Infrastructure

2 days, 15 hours ago
Full-time
Lead
Software Development
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4.45B raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build services that schedule, orchestrate, and observe large-scale training and inference workloads across thousands of GPUs.
  • Create abstractions for job submission, scheduling, and monitoring so researchers can move from idea to experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Build workflows that reduce iteration time while preserving reliability, efficiency, and security.
  • Partner with research scientists, ML engineers, and platform teams to turn experimental workloads into robust, repeatable pipelines.
  • Influence the long-term roadmap for research computation and model development.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.

Requirements

  • BS/MS or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience, including substantial work on large-scale distributed systems or infrastructure.
  • Deep experience building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience building or contributing to cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move fast, be pragmatic, and drive systems from prototype to stable, well-owned services.
  • Strong communication skills with both researchers and engineers.
  • Experience translating between research needs and infrastructure realities.
  • Experience with operational excellence in production systems.

Benefits

  • Local pay range of $190,000 to $270,000 USD.
  • Eligibility for an annual performance bonus.
  • Equity may be included in the total compensation package.
  • Comprehensive benefits and perks offered to meet employee needs.
  • Compensation is reviewed based on experience, certifications, training, and work location.
