Zeta Global

Zeta Global provides an AI-powered marketing cloud that enables enterprises to acquire, grow, and retain customers through precision marketing, leveraging data science, advanced analytics, and machine learning to create optimized customer experiences.

Media

Consumer Discretionary

1K-5K (1434)

Founded 2007

16 open positions

Senior Site Reliability Engineer

1 hour, 55 minutes ago

United States

Full-time

Senior

Site Reliability Engineer (SRE)

DevOps and Infrastructure

Argo CD, AWS, Docker, GitOps, Go, Grafana, Honeycomb, Jenkins, Kubernetes, Microservices, OpenTelemetry, Prometheus, Python, Terraform

Description

Design, implement, and manage SLIs, SLOs, and error budgets to align reliability with user expectations and business objectives.
Develop production-grade software that improves system reliability and reduces manual toil through automation.
Implement and optimize observability solutions using tools such as OpenTelemetry, focusing on high-cardinality metrics, distributed tracing, and actionable insights.
Drive postmortem processes and lead root cause analyses for incidents to prevent recurrence.
Define and monitor MTTx metrics, including MTTA, MTTR, and MTTF, to measure reliability progress.
Design and participate in Chaos Engineering exercises.
Collaborate with engineering teams to build reliable and scalable systems using capacity planning, resiliency patterns, and deployment strategies such as Canary and Blue-Green.
Lead design reviews for alerting strategies to improve signal-to-noise ratio in monitoring and incident management.
Advocate for and implement best practices in incident response and system design to improve uptime and performance.

Requirements

4+ years of experience as an SRE or in a similar role with hands-on coding.
3+ years of software development experience in Python or Golang, with a focus on maintainable, production-quality code.
Ability to code confidently in Python or Golang and solve real-world problems through automation, not only scripting.
Hands-on experience implementing SLIs, SLOs, and distributed tracing in production.
Deep understanding of SRE principles, including SLIs, SLOs, error budgets, and their real-world application.
Hands-on experience with postmortems, observability at scale, and chaos engineering exercises.
Expertise in designing and implementing observability solutions using OpenTelemetry, Prometheus, Grafana, or Honeycomb.
3+ years of experience with AWS and proficiency in Kubernetes and Infrastructure as Code (IaC) tools such as Terraform.
Strong understanding of distributed systems, microservices architectures, and containerization technologies such as Docker and Kubernetes.
Hands-on experience with CI/CD tools and practices such as Jenkins, Argo CD, and GitOps, plus familiarity with incident management and operational automation tools.
Knowledge of modern deployment strategies such as Canary and Blue-Green, as well as resiliency patterns like circuit breakers and retries.
Strong analytical skills for statistical analysis of metrics to identify and resolve performance bottlenecks.
Experience with chaos engineering and anomaly detection is preferred.

Benefits

Unlimited PTO.
Excellent medical, dental, and vision coverage.
Employee equity and stock purchase plan.
Employee discounts, virtual wellness classes, and pet insurance.
Compensation range of $140,000 to $170,000, depending on location and experience.

Similar Roles

Site Reliability Engineer II (Santo Domingo)

InvestorFlow, 51-250, Capital Markets

InvestorFlow is hiring a Site Reliability Engineer II in Santo Domingo to improve the reliability and operational readiness of its cloud-native, Salesforce-based platform and digital portals for alternative asset firms.

Dominican Republic, Full-time, Senior, Site Reliability Engineer (SRE)

Azure, Grafana, OpenTelemetry, Prometheus, Salesforce, Terraform

10 minutes ago

Site Reliability Engineer (SRE)

ProArch, 251-1K, Internet Software & Services

ProArch is seeking a Site Reliability Engineer to help ensure the reliability, availability, and performance of production systems and services for global clients.

India, Full-time, Lead, Site Reliability Engineer (SRE)

Agile, AWS, Azure, Bash, CloudFormation, ELK Stack, GCP, GitLab CI, Go, Grafana, Jenkins, Kubernetes, Microservices, Prometheus, Python, Snowflake, Terraform

25 minutes ago

Senior Site Reliability Engineer

Multi Media, 51-250, Internet Software & Services

Multi Media, LLC, the company behind Chaturbate, is hiring a remote Senior Site Reliability Engineer to strengthen the resilience, performance, and scalability of its high-traffic live streaming platform.

United States, Full-time, Senior, Site Reliability Engineer (SRE)

$180k-$215k

Ansible, Bash, C, C++, Django, Docker, Flask, Go, Java, Kubernetes, Laravel, Linux, Python, Rust, Terraform

40 minutes ago

Senior Site Reliability Engineer

Teikametrics, 251-1K, Media

Teikametrics is hiring a Senior Site Reliability Engineer in Bengaluru to build and maintain cloud infrastructure and DevOps systems that support its retail AI platform across AWS and third-party services.

India, Full-time, Senior, Site Reliability Engineer (SRE)

AWS, Bash, CI/CD, CircleCI, Databricks, Datadog, Docker, GCP, GitHub, Java, JavaScript, Kafka, Kubernetes, OpenSearch, PostgreSQL, Python, Terraform

54 minutes ago
