Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Their API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Aerospace & Defense
51-250
Founded 2017
$44M raised

Description

  • Design, build, and maintain multi-region AWS infrastructure using Terraform.
  • Operate and scale EKS clusters across production regions, including autoscaling, node lifecycle management, and workload health.
  • Manage networking across environments, including VPC design, DNS, load balancing, and cross-region connectivity.
  • Support infrastructure changes, migrations, and expansions into new regions.
  • Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
  • Build and run incident management processes, including severity definitions, escalation paths, and on-call practices.
  • Lead incident response, debugging, and root-cause analysis, and write postmortems that drive reliability improvements.
  • Improve observability through metrics, logging, tracing, and dashboards.
  • Support GPU and batch workloads running on Kubernetes.
  • Own cloud IAM governance across accounts and services, and support security and compliance-related requests.

Requirements

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
  • Experience operating production systems across multiple regions.
  • Strong Terraform experience, including modules, state management, and multi-environment patterns.
  • Solid AWS experience across VPC, IAM, EKS, S3, and CloudWatch.
  • Kubernetes expertise, including cluster operations, autoscaling, RBAC, and Helm.
  • Experience with CI/CD and GitOps tooling such as GitHub Actions and Argo CD.
  • Networking fundamentals including CIDR, DNS, load balancing, VPN, and cross-region connectivity.
  • Experience with observability tools such as Prometheus and Grafana.
  • Comfort with Python and Bash for tooling and automation.
  • Working knowledge of both Linux and Windows environments; Windows-based workload support is a meaningful advantage.
  • Experience with Windows node pools, Windows AMIs, or GPU-adjacent components on Kubernetes (preferred).
  • Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration (preferred).
  • Experience supporting simulation, ML, or rendering workloads in cloud infrastructure (preferred).
  • Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family (preferred).
  • Familiarity with service proxy or service mesh patterns (preferred).
  • Experience with container-optimized OS images such as Bottlerocket, or with image-building tools such as Packer (preferred).
  • Experience with cloud cost optimization at scale (preferred).

Benefits

  • Base salary range of CAD $145,000–$185,000.
  • Equity package.
  • Full health, dental, and vision coverage.
  • Learning stipend.
  • Generous vacation.
  • Remote-friendly work arrangement across Canada and the U.S. Pacific Northwest.
