Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Its API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Aerospace & Defense
51-250 employees
Founded 2017
$44M raised

Description

  • Design, build, and maintain multi-region AWS infrastructure using Terraform.
  • Operate and scale EKS clusters across production regions, including autoscaling, node lifecycle management, and workload health.
  • Manage networking across environments, including VPC design, DNS, load balancing, and cross-region connectivity.
  • Support infrastructure changes, migrations, and expansions into new regions.
  • Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
  • Build and run incident management processes, including severity definitions, escalation paths, and on-call practices.
  • Lead incident response, debugging, and root-cause analysis, and write postmortems that drive reliability improvements.
  • Improve observability through metrics, logging, tracing, and dashboards.
  • Support GPU and batch workloads running on Kubernetes.
  • Own cloud IAM governance across accounts and services, and support security and compliance-related requests.

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Experience operating production systems across multiple regions.
  • Strong Terraform experience, including modules, state management, and multi-environment patterns.
  • Solid AWS experience across VPC, IAM, EKS, S3, and CloudWatch.
  • Kubernetes expertise, including cluster operations, autoscaling, RBAC, and Helm.
  • Experience with CI/CD and GitOps tooling such as GitHub Actions and Argo CD.
  • Networking fundamentals including CIDR, DNS, load balancing, VPN, and cross-region connectivity.
  • Experience with observability tools such as Prometheus and Grafana.
  • Comfort with Python and Bash for tooling and automation.
  • Working knowledge of both Linux and Windows environments; Windows-based workload support is a meaningful advantage.
  • Experience with Windows node pools, Windows AMIs, or GPU-adjacent components on Kubernetes (preferred).
  • Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration (preferred).
  • Experience supporting simulation, ML, or rendering workloads in cloud infrastructure (preferred).
  • Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family (preferred).
  • Familiarity with service proxy or service mesh patterns (preferred).
  • Experience with container-optimized OS images such as Bottlerocket, or with image-building tools such as Packer (preferred).
  • Experience with cloud cost optimization at scale (preferred).

Benefits

  • Base salary range of CAD $145,000–$185,000.
  • Equity package.
  • Full health, dental, and vision coverage.
  • Learning stipend.
  • Generous vacation.
  • Remote-friendly work arrangement across Canada and the U.S. Pacific Northwest.

Similar Roles

Senior Manager, Engineering

Sumo Logic · 251-1K employees · Internet Software & Services

Sumo Logic is hiring a Senior Manager, Engineering for Application Security to lead global programs that improve product security, reliability, and operational efficiency across its cloud platform.

Agile, AWS, C++, Docker, GCP, Java, Kafka, Kubernetes, OWASP, Ruby, Scala, SIEM

Staff Software Engineer - Databases SRE | Sweden | Remote

Grafana · 1K-5K employees · IT Services

Grafana Labs is hiring a Staff Software Engineer, SRE to improve the reliability and scalability of Grafana Cloud’s database products for high-value customers across AWS, GCP, and Azure.

AWS, Azure, GCP, Go, Helm, Java, Kubernetes, Linux, Microservices, Python, Terraform

Senior Site Reliability Engineer (SRE)

Oowlish · 51-250 employees · Internet Software & Services

Oowlish is hiring a Senior Site Reliability Engineer to own the reliability and operational excellence of business-critical production systems for international clients in a remote, collaborative environment.

AWS, Datadog, Go, Heroku, Kubernetes, PostgreSQL, Python, SQL Server, TypeScript

Staff Software Engineer - Databases SRE | Spain | Remote

Grafana · 1K-5K employees · IT Services

Grafana Labs is hiring a Staff Software Engineer - SRE to strengthen the reliability of its cloud database products for high-SLA customers across AWS, GCP, and Azure.

AWS, Azure, GCP, Go, Helm, Java, Kubernetes, Linux, Python, Terraform