LLM Pre-training & Distributed Engineer (AI Infrastructure)

9 hours, 36 minutes ago
Senior
DevOps and Infrastructure
Hyphen Connect

Hyphen Connect is a talent agency that provides recruitment and staffing for the Web3 (blockchain) and artificial intelligence sectors, drawing on its Web3 expertise to match talent to specific project needs. Its services include recruitment and headhunting tailored to Web3 and AI roles, recruitment process outsourcing (RPO), talent sourcing from a curated network of professionals, HR consulting for early-stage startups, and career development programs for people building careers in AI and Web3. The company also helps organizations build strong employer brands and improve internal engagement.

Industry: Staffing & recruiting
Company size: 1-10 employees
Founded: 2024

Description

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking and memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.
  • Support large-scale machine learning training infrastructure and reliability.
  • Work on GPU cluster orchestration and distributed systems optimization.

Requirements

  • Deep expertise in 3D parallelism, including data, tensor, and pipeline parallelism.
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background in C++, CUDA, and Python.
  • Deep understanding of GPU clusters and distributed infrastructure.
  • Extensive experience ensuring efficient and reliable training processes.
