Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Their API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Aerospace & Defense
51-250 employees
Founded 2017
$44M raised

Description

  • Design, build, and maintain multi-region AWS infrastructure using Terraform.
  • Operate and scale EKS clusters across production regions, including autoscaling, node lifecycle management, and workload health.
  • Manage networking across environments, including VPC design, DNS, load balancing, and cross-region connectivity.
  • Support infrastructure changes, migrations, and expansions into new regions.
  • Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
  • Build and run incident management processes, including severity definitions, escalation paths, and on-call practices.
  • Lead incident response, debugging, and root-cause analysis, and write postmortems that drive reliability improvements.
  • Improve observability through metrics, logging, tracing, and dashboards.
  • Support GPU and batch workloads running on Kubernetes.
  • Own cloud IAM governance across accounts and services, and support security and compliance-related requests.

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Experience operating production systems across multiple regions.
  • Strong Terraform experience, including modules, state management, and multi-environment patterns.
  • Solid AWS experience across VPC, IAM, EKS, S3, and CloudWatch.
  • Kubernetes expertise, including cluster operations, autoscaling, RBAC, and Helm.
  • Experience with CI/CD and GitOps workflows such as GitHub Actions and ArgoCD.
  • Networking fundamentals including CIDR, DNS, load balancing, VPN, and cross-region connectivity.
  • Experience with observability tools such as Prometheus and Grafana.
  • Comfort with Python and Bash for tooling and automation.
  • Working knowledge of both Linux and Windows environments; Windows-based workload support is a meaningful advantage.
  • Experience with Windows node pools, Windows AMIs, or GPU-adjacent components on Kubernetes (preferred).
  • Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration (preferred).
  • Experience supporting simulation, ML, or rendering workloads in cloud infrastructure (preferred).
  • Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family (preferred).
  • Familiarity with service proxy or service mesh patterns (preferred).
  • Experience with container-optimized OS images such as Bottlerocket, or with image-build tooling such as Packer (preferred).
  • Experience with cloud cost optimization at scale (preferred).

Benefits

  • Base salary range of CAD $145,000–$185,000.
  • Equity package.
  • Full health, dental, and vision coverage.
  • Learning stipend.
  • Generous vacation.
  • Remote-friendly work arrangement across Canada and the U.S. Pacific Northwest.
