Site Reliability Engineer

6 hours, 4 minutes ago
Senior
DevOps and Infrastructure
Stack AV

Stack AV is a Pittsburgh-based autonomous trucking company developing AI-powered self-driving technology for the freight and logistics industry. Founded in 2023 by a team with extensive experience in autonomous systems, Stack AV employs around 150 people and operates in 15 states. Its offerings include self-driving truck technology built on AI, robotics, and machine learning, as well as supply chain optimization solutions aimed at improving the efficiency and reliability of freight transportation. The company prioritizes safety and operates with transparency and accountability in serving customers in the transportation sector. Stack AV is backed financially by SoftBank Group Corp.

information technology & services
201-500
Founded 2023
$1000M raised

Description

  • Instrument systems that schedule and execute large-scale batch workloads across Kubernetes clusters.
  • Diagnose and triage job failures for internal customers.
  • Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
  • Increase the reliability and velocity of systems and processes through automation.
  • Document operational actions and build runbooks as a knowledge base and foundation for automation.
  • Participate in an on-call rotation to uphold production service SLOs and SLAs.
  • Contribute to platform tooling, automation, and CI/CD workflows.

Requirements

  • Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems.
  • Strong experience with Kubernetes and container orchestration in production-grade environments.
  • Ability to understand engineering design limitations and guide teams on scaling services within budget and performance goals.
  • Strong experience implementing and debugging cloud-native and open source tools such as Kubernetes, etcd, Prometheus, and OpenTelemetry.
  • Strong communication skills and the ability to work effectively in a diverse and distributed team.
  • Ability to work in a role that may be subject to U.S. national security, residence, citizenship, and export control requirements.
  • Experience supporting high-scale batch compute systems and workflow orchestration systems is preferred.
  • Experience working at the intersection of infrastructure, distributed systems, and developer experience is preferred.
