Staff Software Engineer - AI Research Infrastructure

12 minutes ago
Full-time
Senior
DevOps and Infrastructure
Databricks

Databricks is the pioneering data intelligence platform, empowering organizations worldwide to solve complex data challenges with AI-driven analytics solutions.

IT Services
1K-5K
Founded 2013
$4450M raised

Description

  • Design and implement infrastructure for large-scale experiments, data processing, and model training across HPC clusters, GPU fleets, and cloud-based systems.
  • Build abstractions for job submission, scheduling, and monitoring that help researchers move from idea to large-scale experiment quickly.
  • Develop tooling that improves research developer productivity, including experiment management systems and CI/testing infrastructure for research code.
  • Turn experimental workloads into robust, repeatable pipelines in partnership with research scientists, ML engineers, and platform teams.
  • Improve reliability, efficiency, and security across research infrastructure and development workflows.
  • Influence the long-term roadmap for research computation, including how models are trained, evaluated, and shipped to customers.
  • Serve as a technical mentor and force multiplier for engineers working on compute, infrastructure, and AI systems.
  • Develop and run the research stack that supports Databricks AI Research workloads at scale.

Requirements

  • BS, MS, or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience.
  • Deep experience building and operating large-scale distributed systems, infrastructure, data pipelines, or backend services.
  • Experience with GPUs, clusters, or major cloud providers is strongly preferred.
  • Proficiency in one or more systems programming languages such as C++, Rust, Go, Java, or Scala.
  • Experience with cluster schedulers, resource managers, or large-scale job orchestration systems such as Kubernetes, Slurm, Ray, or custom internal systems.
  • Understanding of modern ML training and inference workflows, including distributed training, model parallelism, fine-tuning, and evaluation.
  • Ability to move quickly and pragmatically while maintaining operational excellence.
  • Strong communication skills and ability to translate between research needs and infrastructure realities.

Benefits

  • Annual performance bonus eligibility.
  • Equity eligibility as part of total compensation.
  • Base salary range of $190,000 to $270,000 USD.
  • Comprehensive benefits and perks package.
  • Support for fair and equitable compensation practices.
  • Global company with offices around the world.
