Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Their API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Aerospace & Defense
51-250 employees
Founded 2017
$44M raised

Description

  • Design, build, and maintain multi-region AWS infrastructure using Terraform.
  • Operate and scale EKS clusters across production regions, including autoscaling, node lifecycle management, and workload health.
  • Manage networking across environments, including VPC design, DNS, load balancing, and cross-region connectivity.
  • Support infrastructure changes, migrations, and expansions into new regions.
  • Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
  • Build and run incident management processes, including severity definitions, escalation paths, and on-call practices.
  • Lead incident response, debugging, and root-cause analysis, and write postmortems that drive reliability improvements.
  • Improve observability through metrics, logging, tracing, and dashboards.
  • Support GPU and batch workloads running on Kubernetes.
  • Own cloud IAM governance across accounts and services, and support security and compliance-related requests.

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Experience operating production systems across multiple regions.
  • Strong Terraform experience, including modules, state management, and multi-environment patterns.
  • Solid AWS experience across VPC, IAM, EKS, S3, and CloudWatch.
  • Kubernetes expertise, including cluster operations, autoscaling, RBAC, and Helm.
  • Experience with CI/CD and GitOps tooling such as GitHub Actions and Argo CD.
  • Networking fundamentals including CIDR, DNS, load balancing, VPN, and cross-region connectivity.
  • Experience with observability tools such as Prometheus and Grafana.
  • Comfort with Python and Bash for tooling and automation.
  • Working knowledge of both Linux and Windows environments; Windows-based workload support is a meaningful advantage.
  • Experience with Windows node pools, Windows AMIs, or GPU-adjacent components on Kubernetes (preferred).
  • Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration (preferred).
  • Experience supporting simulation, ML, or rendering workloads in cloud infrastructure (preferred).
  • Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family (preferred).
  • Familiarity with service proxy or service mesh patterns (preferred).
  • Experience with container-optimized OS images such as Bottlerocket, or with image-build tooling such as Packer (preferred).
  • Experience with cloud cost optimization at scale (preferred).

Benefits

  • Base salary range of CAD $145,000–$185,000.
  • Equity package.
  • Full health, dental, and vision coverage.
  • Learning stipend.
  • Generous vacation.
  • Remote-friendly work arrangement across Canada and the U.S. Pacific Northwest.
