Parallel Domain

Parallel Domain is a synthetic data platform that helps machines see the world through 3D simulation and generative AI. Their API offers flexibility in data capture, enabling the development, training, and testing of autonomous systems efficiently and ...

Aerospace & Defense

Industrials

51-250 (85)

Founded 2017

$44M raised

5 open positions

Senior Site Reliability Engineer

1 month, 4 weeks ago

United States, Canada

Full-time

Senior

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Active Directory, Argo CD, AWS, Bash, DNS, Docker, GitHub Actions, Grafana, Helm, Kubernetes, Linux, Load Balancing, Packer, Prometheus, Python, Terraform

Description

Design, build, and maintain multi-region AWS infrastructure using Terraform.
Operate and scale EKS clusters across production regions, including autoscaling, node lifecycle management, and workload health.
Manage networking across environments, including VPC design, DNS, load balancing, and cross-region connectivity.
Support infrastructure changes, migrations, and expansions into new regions.
Improve GitOps-based deployment workflows using GitHub Actions, Helm, and Kustomize.
Build and run incident management processes, including severity definitions, escalation paths, and on-call practices.
Lead incident response, debugging, and root-cause analysis, and write postmortems that drive reliability improvements.
Improve observability through metrics, logging, tracing, and dashboards.
Support GPU and batch workloads running on Kubernetes.
Own cloud IAM governance across accounts and services, and support security and compliance-related requests.

Requirements

5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Experience operating production systems across multiple regions.
Strong Terraform experience, including modules, state management, and multi-environment patterns.
Solid AWS experience across VPC, IAM, EKS, S3, and CloudWatch.
Kubernetes expertise, including cluster operations, autoscaling, RBAC, and Helm.
Experience with CI/CD and GitOps workflows, using tools such as GitHub Actions and Argo CD.
Networking fundamentals including CIDR, DNS, load balancing, VPN, and cross-region connectivity.
Experience with observability tools such as Prometheus and Grafana.
Comfort with Python and Bash for tooling and automation.
Working knowledge of both Linux and Windows environments; Windows-based workload support is a meaningful advantage.
Experience with Windows node pools, Windows AMIs, or GPU-adjacent components on Kubernetes (preferred).
Familiarity with GPU scheduling on Kubernetes, including NVIDIA device plugin configuration (preferred).
Experience supporting simulation, ML, or rendering workloads in cloud infrastructure (preferred).
Exposure to AWS Storage Gateway, Active Directory integrations, or AWS Transfer Family (preferred).
Familiarity with service proxy or service mesh patterns (preferred).
Experience with container-optimized OS images such as Bottlerocket, or with image-build tooling such as Packer (preferred).
Experience with cloud cost optimization at scale (preferred).

Benefits

Base salary range of CAD $145,000–$185,000.
Equity package.
Full health, dental, and vision coverage.
Learning stipend.
Generous vacation.
Remote-friendly work arrangement across Canada and the U.S. Pacific Northwest.
