Staff Software Engineer - AI Research Infrastructure

2 weeks, 5 days ago
Full-time
Senior
DevOps and Infrastructure
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build abstractions for job submission, scheduling, and monitoring that help researchers move from idea to large-scale experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Turn experimental workloads into robust, repeatable pipelines in partnership with research scientists, ML engineers, and platform teams.
  • Improve reliability, efficiency, and security across research infrastructure and development workflows.
  • Influence the long-term roadmap for research computation, including how models are trained, evaluated, and shipped to customers.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.
  • Develop and run the research stack that supports Databricks AI Research workloads at scale.

Requirements

  • BS, MS, or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience.
  • Deep experience building and operating large-scale distributed systems, infrastructure, data pipelines, or backend services.
  • Experience with GPUs, clusters, or major cloud providers is strongly preferred.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience with cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move quickly and pragmatically while maintaining operational excellence.
  • Strong communication skills and ability to translate between research needs and infrastructure realities.

Benefits

  • Annual performance bonus eligibility.
  • Equity eligibility as part of total compensation.
  • Base salary range of $190,000 to $270,000 USD.
  • Comprehensive benefits and perks package.
  • Support for fair and equitable compensation practices.
  • Global company with offices around the world.
