Earth Species Project

Earth Species Project is a nonprofit organization that uses artificial intelligence to decode animal communication, aiming to enhance understanding of non-human languages and foster a transformative relationship with nature while contributin...

Professional Services
11-50
Founded 2018

Description

  • Design and optimize high-performance data pipelines for distributed training and storage.
  • Build and maintain scalable data storage layers for diverse species data.
  • Improve low-level system performance, including latency, throughput, reliability, and GPU utilization.
  • Develop monitoring and visualization tools for data quality, pipeline performance, and experiments.
  • Optimize distributed AI workloads for reliability, latency, and efficiency.
  • Collaborate closely with researchers, engineers, and external partners on complex AI workflows.
  • Scope and supervise projects so interns, PhD students, and post-docs can contribute effectively.
  • Support recruiting efforts and help shape the growth of the infrastructure team.

Requirements

  • 5+ years of backend or infrastructure engineering experience.
  • Strong Python programming skills; lower-level languages are a plus.
  • Experience with distributed systems and cloud platforms such as AWS, GCP, or Azure.
  • Hands-on experience with Docker, Kubernetes, and Terraform.
  • Experience building or supporting ML/AI infrastructure in production.
  • Experience with high-performance data tools such as DuckDB, Apache Spark, or Delta Lake.
  • Experience with GPU orchestration and large-scale model training.
  • Familiarity with ML platforms such as SageMaker or Vertex AI and frameworks such as PyTorch or JAX.
  • Experience mentoring junior engineers, interns, or researchers and breaking down complex projects into manageable tasks.
  • Experience participating in technical hiring processes and evaluating candidates.
  • Deep knowledge of training architectures, CUDA programming, or TPU optimization is preferred.
  • Full-stack development experience with frameworks like React is preferred.
  • Experience managing HPC infrastructure with tools like Slurm or Kubernetes clusters is preferred.
  • Background in monitoring stacks such as Prometheus or Grafana for ML pipeline observability is preferred.

Benefits

  • $225,500 - $235,500 annual salary.
  • Medical, dental, and vision insurance with 100% of premiums covered by ESP.
  • 401(k) plan with match for U.S.-based employees.
  • $2,000 home office stipend.
  • Unlimited paid time off with a recommended minimum of three weeks per year.
  • Flexible working hours.
  • Regular team retreats around the world.
