LLM Pre-training & Distributed Engineer (AI Infrastructure)

Lead
DevOps and Infrastructure
Hyphen Connect

Hyphen Connect is a Web3 and AI talent agency that provides recruitment and staffing services for the blockchain and artificial intelligence sectors. Drawing on its Web3 expertise, the agency connects talent with businesses in these industries and matches candidates to specific project needs. Its services include recruitment and headhunting tailored to Web3 and AI roles, recruitment process outsourcing (RPO), talent sourcing from a curated network of professionals, HR consulting for early-stage startups, and career development programs to help individuals succeed in the AI and Web3 fields. The company also helps organizations build strong employer brands and improve internal engagement.

Industry: Staffing & recruiting
Company size: 1-10 employees
Founded: 2024

Description

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking and memory management to prevent out-of-memory errors during training.
  • Automate checkpointing and failure recovery for month-long training runs.
  • Work on distributed infrastructure that supports large-scale machine learning training.
  • Ensure training processes are efficient and reliable across GPU clusters.

Requirements

  • Deep expertise in 3D parallelism, including data, tensor, and pipeline parallelism.
  • Experience managing SLURM- or Kubernetes-based GPU clusters.
  • Strong systems engineering background with C++, CUDA, and Python.
  • Deep understanding of GPU clusters.
  • Extensive systems engineering experience in large-scale training environments.
