The Voleon Group

The Voleon Group develops and applies advanced machine learning technologies to investment management, using data-driven techniques and flexible statistical models for financial prediction.

Capital Markets
51-250 employees
Founded 2007

Description

  • Serve as a first responder for cluster outages and urgent operational issues, triaging and resolving problems as they arise.
  • Ensure high cluster uptime and define, track, and manage SLAs for reliability.
  • Diagnose recurring system issues and implement targeted fixes in collaboration with engineering teams.
  • Develop and maintain metrics, telemetry, and observability tooling for cluster health.
  • Build custom observability mechanisms when existing tools are insufficient.
  • Help software and research teams define fair cluster usage policies and enforcement mechanisms.
  • Forecast cluster growth and help choose appropriate scale-up strategies.
  • Optimize cluster operations for cost and usability while maintaining reliability.
  • Support both on-prem and cloud infrastructure for the research compute environment.
  • Engineer systemic and architectural improvements to prevent repeat operational issues.

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably as a senior engineer or tech lead.
  • Knowledge of HPC or batch compute frameworks such as Slurm, Kueue, AWS Batch, or GCP Batch.
  • Knowledge of machine learning training systems such as Kubeflow, MLflow, or Horovod.
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language such as Python or Ruby.
  • Experience with infrastructure-as-code and configuration management tools such as Terraform and Ansible.
  • Experience with cloud infrastructure, preferably AWS or GCP.
  • Experience designing and implementing observability stacks such as Prometheus, Grafana, Loki, ELK, or OpenTelemetry.
  • Experience with distributed storage technologies such as Lustre, Ceph, or S3.
  • A systems engineer's mindset, with a systematic, automation-oriented approach.
  • Bachelor’s degree in computer science.
  • Hands-on experience with HPC frameworks such as Slurm or Grid Engine.
  • Experience with Kubernetes-based job orchestrators such as Airflow, Kueue, or Kubeflow Pipelines.
  • Experience with distributed computing frameworks such as Ray, Modin, Dask, or Spark.
  • Familiarity with ML frameworks such as PyTorch, TensorFlow, JAX, Horovod, or DeepSpeed.
  • Experience with hybrid or on-prem environments.
  • Experience with containerization tools such as Docker, Podman, or Singularity, especially in HPC environments.
  • Experience with HPC networking such as InfiniBand or RDMA.
  • Solid security and IAM foundations, including identity management systems, AWS/GCP IAM, or Zero Trust.

Benefits

  • Base salary range of $205,000 to $235,000.
  • Medical, dental, and vision coverage.
  • Life and AD&D insurance.
  • 20 days of paid time off.
  • 9 sick days.
  • 401(k) plan with company match.
  • $15,000 referral bonus through the Friends of Voleon Candidate Referral Program, subject to eligibility and terms.
