Staff Software Engineer - AI Research Infrastructure

4 days, 8 hours ago
Full-time
Senior
DevOps and Infrastructure
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4,450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build abstractions for job submission, scheduling, and monitoring that help researchers move from idea to large-scale experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Turn experimental workloads into robust, repeatable pipelines in partnership with research scientists, ML engineers, and platform teams.
  • Improve reliability, efficiency, and security across research infrastructure and development workflows.
  • Influence the long-term roadmap for research computation, including how models are trained, evaluated, and shipped to customers.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.
  • Develop and run the research stack that supports Databricks AI Research workloads at scale.

Requirements

  • BS, MS, or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience.
  • Deep experience building and operating large-scale distributed systems, infrastructure, data pipelines, or backend services.
  • Experience with GPUs, clusters, or major cloud providers is strongly preferred.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience with cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move quickly and pragmatically while maintaining operational excellence.
  • Strong communication skills and ability to translate between research needs and infrastructure realities.

Benefits

  • Annual performance bonus eligibility.
  • Equity eligibility as part of total compensation.
  • Base salary range of $190,000 to $270,000 USD.
  • Comprehensive benefits and perks package.
  • Support for fair and equitable compensation practices.
  • Global company with offices around the world.
