Zscaler

Zscaler is a cybersecurity company that provides CASB and SASE solutions through a cloud-based internet security platform protecting users worldwide.

Industry: Internet Software & Services
Company size: 1K-5K employees
Founded: 2007

Description

  • Own the reliability of a large-scale cloud service across Linux/BSD, bare metal, Kubernetes, custom load balancing, and SD-WAN.
  • Partner with Engineering and Network teams early to define requirements, conduct operability reviews, and contribute code and design documentation for resilience.
  • Develop and operate end-to-end observability, including metrics, logs, traces, dashboards, alerting, and incident tooling.
  • Manage SLOs and error budgets while reducing alert noise and improving the detection and diagnosis of system issues.
  • Participate in on-call rotation and lead full-cycle incident response.
  • Perform deep cross-stack troubleshooting across operating systems, networking, distributed systems, packet captures, and core dumps.
  • Drive permanent software fixes and convert incident learnings into runbooks and tests.
  • Build and maintain infrastructure and service lifecycle automation using everything-as-code.
  • Drive provisioning, configuration, release automation, canary deployments, and rollout/rollback workflows.
  • Improve platform hygiene through OS and application upgrades, patching, capacity tuning, performance tuning, and CI/CD validation before production rollouts.

Requirements

  • US citizenship is required due to the nature of assigned customers.
  • 5+ years of industry experience in software engineering, infrastructure software, and/or platform engineering.
  • Proficiency in at least one programming language such as Python, Bash, or Go.
  • Demonstrated ability to write production-quality code, including testing, code reviews, CI, and maintainable design.
  • Strong Linux/Unix systems fundamentals, including processes, memory, filesystems, networking basics, and debugging/performance troubleshooting.
  • Solid understanding of networking protocols and concepts such as HTTP, DNS, TCP/IP, ICMP, the OSI model, subnetting, and load balancing.
  • Proven experience operating production services, including incident response, troubleshooting, and reducing toil.
  • Ability to participate in on-call rotations and support occasional after-hours or weekend deployments.
  • Experience managing BSD in production and driving systemic fixes through platform engineering.
  • Preferred: proven expertise in operating Kubernetes at scale.
  • Preferred: deep experience with Prometheus and OpenTelemetry, including golden signals, SLOs, and alert tuning.

Benefits

  • Base salary range of $119,000 to $170,000 USD.
  • Eligibility for commission, bonus, and equity, if applicable.
  • Various health plans.
  • Time off plans for vacation and sick time.
  • Parental leave options.
  • Retirement options.
  • Education reimbursement.
  • In-office perks and hybrid/remote work flexibility.
