Zeta Global

Zeta Global provides an AI-powered marketing cloud that enables enterprises to acquire, grow, and retain customers through precision marketing, leveraging data science, advanced analytics, and machine learning to create optimized customer experiences.

Industry: Media
Company size: 1K-5K employees
Founded: 2007

Description

  • Design, implement, and manage SLIs, SLOs, and error budgets to align reliability with user expectations and business objectives.
  • Develop production-grade software that improves system reliability and reduces manual toil through automation.
  • Implement and optimize observability solutions using tools such as OpenTelemetry, focusing on high-cardinality metrics, distributed tracing, and actionable insights.
  • Drive postmortem processes and lead root cause analyses for incidents to prevent recurrence.
  • Define and monitor MTTx metrics, including MTTA, MTTR, and MTTF, to measure reliability progress.
  • Design and participate in Chaos Engineering exercises.
  • Collaborate with engineering teams to build reliable and scalable systems using capacity planning, resiliency patterns, and deployment strategies such as Canary and Blue-Green.
  • Lead design reviews for alerting strategies to improve signal-to-noise ratio in monitoring and incident management.
  • Advocate for and implement best practices in incident response and system design to improve uptime and performance.

Requirements

  • 4+ years of experience as an SRE or in a similar role with hands-on coding.
  • 3+ years of software development experience in Python or Golang, with a focus on maintainable, production-quality code.
  • Ability to code confidently in Python or Golang and solve real-world problems through automation, not only scripting.
  • Hands-on experience implementing SLIs, SLOs, and distributed tracing in production.
  • Deep understanding of SRE principles, including SLIs, SLOs, error budgets, and their real-world application.
  • Hands-on experience with postmortems, observability at scale, and chaos engineering exercises.
  • Expertise in designing and implementing observability solutions using OpenTelemetry, Prometheus, Grafana, or Honeycomb.
  • 3+ years of experience with AWS and proficiency in Kubernetes, Terraform, and other Infrastructure as Code (IaC) tools.
  • Strong understanding of distributed systems, microservices architectures, and containerization technologies such as Docker and Kubernetes.
  • Hands-on experience with CI/CD tools and practices such as Jenkins, ArgoCD, and GitOps, plus familiarity with incident management and operational automation tools.
  • Knowledge of modern deployment strategies such as Canary and Blue-Green, as well as resiliency patterns like circuit breakers and retries.
  • Strong analytical skills for statistical analysis of metrics to identify and resolve performance bottlenecks.
  • Experience with chaos engineering and anomaly detection is preferred.

Benefits

  • Unlimited PTO.
  • Excellent medical, dental, and vision coverage.
  • Employee equity and stock purchase plan.
  • Employee discounts, virtual wellness classes, and pet insurance.
  • Compensation range of $140,000 to $170,000, depending on location and experience.
