LLM Pre-training & Distributed Engineer (AI Infrastructure)

3 weeks, 4 days ago
Lead
DevOps and Infrastructure
Hyphen Connect

Hyphen Connect is a Web3 and AI talent agency that provides recruitment and staffing for the blockchain and artificial intelligence sectors. The company connects talent with businesses in these industries, drawing on its Web3 expertise to meet specific project needs. Its services include recruitment and headhunting tailored to Web3 and AI roles, recruitment process outsourcing (RPO), talent sourcing from a curated network of professionals, HR consulting for early-stage startups, and career development programs that help individuals succeed in the AI and Web3 fields. The company also helps organizations build strong employer brands and improve internal engagement.

staffing & recruiting
1-10
Founded 2024

Description

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking and memory management to prevent out-of-memory errors during training.
  • Automate checkpointing and failure recovery for month-long training runs.
  • Work on distributed infrastructure that supports large-scale machine learning training.
  • Ensure training processes are efficient and reliable across GPU clusters.

Requirements

  • Deep expertise in 3D parallelism, including data, tensor, and pipeline parallelism.
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background with C++, CUDA, and Python.
  • Deep understanding of GPU clusters.
  • Extensive experience in systems engineering for large-scale training environments.


