LLM Pre-training & Distributed Engineer (AI Infrastructure)

2 hours, 33 minutes ago
Lead
DevOps and Infrastructure
Hyphen Connect

Hyphen Connect is a Web3 and AI talent agency that provides recruitment and staffing for the blockchain and artificial intelligence sectors, drawing on its Web3 expertise to match talent to specific project needs. Its services include recruitment and headhunting tailored to Web3 and AI roles, recruitment process outsourcing (RPO), talent sourcing from a curated network of professionals, HR consulting for early-stage startups, and career development programs to help individuals succeed in the AI and Web3 fields. The company also helps organizations build strong employer brands and improve internal engagement.

staffing & recruiting
1-10 employees
Founded 2024

Description

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking and memory management to sustain training throughput and prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.
  • Support large-scale machine learning training infrastructure and GPU cluster operations.
  • Ensure training processes are efficient and reliable across distributed systems.

Requirements

  • Deep expertise in 3D parallelism, including data, tensor, and pipeline parallelism.
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background with C++, CUDA, and Python.
  • Deep understanding of GPU clusters and distributed infrastructure.
  • Extensive experience in systems engineering for large-scale training environments.
