Intetics

Intetics is a top custom software development company with 28 years of experience, offering high-quality software applications and AI/ML integration. They excel in various industries and provide global talent solutions for exceptional project outcomes.

Industry: Internet Software & Services
Company size: 1K-5K employees
Founded: 1995

Description

  • Build, operate, and improve the infrastructure powering the distributed inference platform.
  • Own reliability, scalability, and operational excellence across AWS-based control planes and the multi-provider GPU fleet.
  • Design and maintain the networking layer connecting control planes, Kubernetes clusters, and geographically distributed GPU hosts.
  • Operate and improve Kubernetes-based inference orchestration, primarily on EKS.
  • Manage deployments and infrastructure changes using Helm, FluxCD, and Terraform.
  • Improve observability using metrics, logs, traces, dashboards, and alerting with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.
  • Tune alerts, improve runbooks, and strengthen operational readiness as the system scales.
  • Respond to production issues, perform root cause analysis, and implement durable fixes.
  • Collaborate with engineers across time zones using clear asynchronous communication and handoff practices.
  • Help expand Europe-based infrastructure coverage to support operations outside US business hours.

Requirements

  • 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering.
  • Strong production experience with networking and Kubernetes.
  • Experience operating AWS infrastructure in production, especially EKS.
  • Strong hands-on experience managing Linux hosts, clusters, and distributed systems in less-abstracted cloud or hybrid environments.
  • Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.
  • Experience with deployment and GitOps workflows using tools such as Helm and FluxCD.
  • Experience with infrastructure as code, ideally Terraform.
  • Familiarity with alert tuning, runbook development, and practical incident management in production systems.
  • Strong operational judgment with the ability to troubleshoot independently and respond calmly to incidents.
  • Comfort working in a fast-moving startup and communicating effectively in an async environment.
  • Experience with AI inference, ML infrastructure, or adjacent high-performance distributed systems (nice to have).
  • Experience operating heterogeneous GPU fleets, bare-metal infrastructure, or multi-provider compute environments (nice to have).
  • Experience using AI tools productively in engineering workflows (nice to have).
