Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Their API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Industry: Aerospace & Defense
Company size: 51-250 employees
Founded: 2017
Funding: $44M raised

Description

  • Own and evolve AWS infrastructure to improve platform performance, availability, and support future enterprise deployment models.
  • Operate EKS clusters across production regions, including node pool strategy, AMI lifecycle management, autoscaling, and workload health.
  • Support and manage the GitOps deployment pipeline using infrastructure-as-code across multiple clusters.
  • Design and maintain complex networking components, including VPCs, cross-region connectivity, DNS, and load balancing.
  • Lead infrastructure deprecation and migration efforts with minimal disruption to services.
  • Own SLO measurement infrastructure and enable proactive issue triage before customer impact occurs.
  • Lead incident investigations, root cause analysis, and postmortems to drive systemic reliability fixes.
  • Design and improve automated remediation systems to reduce mean time to recovery.
  • Review platform architecture decisions through a security-conscious lens and own cloud IAM governance across accounts and services.
  • Support compliance-adjacent work, including audit readiness, partner certification requirements, and customer security questionnaires.

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Strong infrastructure-as-code experience, including Terraform modules, state management, and multi-environment patterns.
  • Deep AWS experience, including EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA (IAM Roles for Service Accounts).
  • Strong Kubernetes expertise, including cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, and node autoscaling.
  • Experience with GitOps workflows and CI/CD tooling such as ArgoCD, GitHub Actions, or Jenkins.
  • Solid networking fundamentals, including CIDR design, security groups, DNS, load balancing, VPNs, and cross-region connectivity.
  • Experience with monitoring and observability tools such as Prometheus, Grafana, and Elasticsearch.
  • Comfort with Python and Bash for tooling and automation.
  • Familiarity with Linux and Windows environments; operational experience with Windows Server is a meaningful advantage.
  • Preferred: experience with Karpenter, Windows-based workloads on EKS, GPU workloads on Kubernetes, NVIDIA and DirectX device plugins, AWS Storage Gateway or Transfer Family, Envoy Gateway, container-optimized OS images such as Bottlerocket, image-building tools such as Packer, and cloud cost optimization at scale.

Benefits

  • Remote full-time work arrangement.
  • Opportunity to work on high-impact infrastructure for customer-critical autonomous vehicle simulation workloads.
  • High-trust, high-autonomy role with real influence over infrastructure architecture and cross-team process.
  • Work on technically challenging systems such as multi-region GPU scheduling and Windows workloads on Kubernetes.
