LLM Pre-training & Distributed Engineer (AI Infrastructure)

1 hour, 36 minutes ago
Lead
DevOps and Infrastructure
Hyphen Connect

Hyphen Connect is a Web3 and AI talent agency that provides recruitment and staffing for the blockchain and artificial intelligence sectors, drawing on its Web3 expertise to match talent to specific project needs. Its services include recruitment and headhunting tailored to Web3 and AI roles, recruitment process outsourcing (RPO), talent sourcing from a curated network of professionals, HR consulting for early-stage startups, and career development programs for people building careers in AI and Web3. The company also helps organizations build their employer brand and improve internal engagement.

Staffing & recruiting
1-10 employees
Founded 2024

Description

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking and memory management to avoid communication bottlenecks and out-of-memory errors.
  • Automate checkpointing and failure recovery for long-running training jobs.
  • Support efficient and reliable training processes on large-scale distributed infrastructure.

Requirements

  • Deep expertise in 3D parallelism, including data, tensor, and pipeline parallelism.
  • Experience managing GPU clusters with SLURM or Kubernetes.
  • Strong systems engineering background with C++, CUDA, and Python.
  • Deep understanding of GPU clusters and distributed infrastructure.
  • Extensive systems engineering experience with machine learning training workloads.


