The Voleon Group

The Voleon Group focuses on the development and application of advanced machine learning technologies to enhance investment management, utilizing data-driven techniques and flexible statistical models for financial prediction.

Capital Markets

Financials

51-250 (65)

Founded 2007

24 open positions

Senior Cluster Site Reliability Engineer

2 hours, 1 minute ago

United States

Full-time

Senior

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Ansible, Apache Airflow, Apache Spark, AWS, Docker, GCP, Grafana, Kubeflow, Kubernetes, Machine Learning, OpenTelemetry, Podman, Prometheus, Python, PyTorch, Ruby, TensorFlow, Terraform

Description

Serve as a first responder for cluster outages and urgent operational issues, triaging and resolving problems as they arise.
Ensure high cluster uptime and define, track, and manage SLAs for reliability.
Diagnose recurring system issues and implement targeted fixes in collaboration with engineering teams.
Develop and maintain metrics, telemetry, and observability tooling for cluster health.
Build custom observability mechanisms when existing tools are insufficient.
Help software and research teams define fair cluster usage policies and enforcement mechanisms.
Forecast cluster growth and help choose appropriate scale-up strategies.
Optimize cluster operations for cost and usability while maintaining reliability.
Support both on-prem and cloud infrastructure for the research compute environment.
Engineer systemic and architectural improvements to prevent repeat operational issues.

Requirements

5+ years of experience in SRE or DevOps roles, preferably as a senior engineer or tech lead.
Knowledge of HPC or batch compute frameworks such as Slurm, Kueue, AWS Batch, or GCP Batch.
Knowledge of machine learning training systems such as Kubeflow, MLflow, or Horovod.
Ability to develop scripts and utilities of moderate complexity in a common scripting language such as Python or Ruby.
Experience with infrastructure-as-code and configuration management tools such as Terraform and Ansible.
Experience with cloud infrastructure, preferably AWS or GCP.
Experience designing and implementing observability stacks such as Prometheus, Grafana, Loki, ELK, or OpenTelemetry.
Experience with distributed storage technologies such as Lustre, Ceph, or S3.
A systems engineering mindset with a systematic, automation-oriented approach.
Bachelor’s degree in computer science.
Hands-on experience with HPC frameworks such as Slurm or Grid Engine.
Experience with Kubernetes-based job orchestrators such as Airflow, Kueue, or Kubeflow Pipelines.
Experience with distributed computing frameworks such as Ray, Modin, Dask, or Spark.
Familiarity with ML frameworks such as PyTorch, TensorFlow, JAX, Horovod, or DeepSpeed.
Experience with hybrid or on-prem environments.
Experience with containerization tools such as Docker, Podman, or Singularity, especially in HPC environments.
Experience with HPC networking such as InfiniBand or RDMA.
Solid security and IAM foundations, including identity management systems, AWS/GCP IAM, or Zero Trust.

Benefits

Base salary range of $205,000 to $235,000.
Medical, dental, and vision coverage.
Life and AD&D insurance.
20 days of paid time off.
9 sick days.
401(k) plan with company match.
$15,000 referral bonus through the Friends of Voleon Candidate Referral Program, subject to eligibility and terms.

Similar Roles

Site Reliability Engineer - Backstage

Spotify Media

Site Reliability Engineer for Spotify’s Backstage team in New York City, focused on building and operating cloud infrastructure for an external developer portal and internal AI-driven coding workflows.

United States Full-time Mid Level Site Reliability Engineer (SRE)

$133k-$190k

AWS, GCP, Go, Java, LLM, Microservices, Python, React, Terraform, TypeScript

17 minutes ago

Blockchain Site Reliability Engineer

InfStones 51-250 Internet Software & Services

InfStones is hiring a remote Blockchain Site Reliability Engineer in Dallas to ensure the reliability, availability, and performance of its blockchain node infrastructure.

United States Contract Senior Site Reliability Engineer (SRE)

Docker, Ethereum, Go, Grafana, JavaScript, Kubernetes, Linux, Prometheus, Python, Rust, Solana

1 hour, 1 minute ago

Lead Engineer - Platform Performance & Reliability

HighLevel 251-1K Internet Software & Services

HighLevel is hiring a Lead Engineer for its Platform Performance & Reliability team to improve the speed, stability, and operational health of a high-traffic global SaaS platform.

India Full-time Senior Backend Engineer Site Reliability Engineer (SRE)

AWS, ClickHouse, Firestore, GCP, Grafana, Kubernetes, Microservices, MongoDB, MySQL, Node.js, OpenTelemetry, PostgreSQL, Prometheus, Redis

1 hour, 47 minutes ago

Infrastructure Reliability Engineer (Ingénieur fiabilité des infrastructures)

Tecsys 251-1K Air Freight & Logistics

Tecsys is looking for an infrastructure reliability engineer for its NOC to ensure the reliability, performance, and evolution of its critical SaaS platforms on AWS and Kubernetes.

Canada Full-time Senior Site Reliability Engineer (SRE)

AWS, Datadog, Kubernetes, Terraform

2 hours, 17 minutes ago
