Site Reliability Engineer

2 weeks, 6 days ago
Full-time
Mid Level
DevOps and Infrastructure
DEUNA

DEUNA is a payment orchestrator that optimizes transaction acceptance, boosts conversion rates, and minimizes fraud, offering over 80 payment methods through a single integration.

Diversified Financial Services
51-250
Founded 2020

Description

  • Design, define, and maintain observability and monitoring for AWS infrastructure.
  • Define and track SLIs, SLOs, and SLAs for critical systems.
  • Improve system uptime, latency, and fault tolerance across the platform.
  • Provide internal libraries and toolsets to developers for diagnostics and debugging.
  • Manage scaling, performance, and resilience efforts related to system reliability.
  • Collaborate with technical teams on capacity planning, load testing, and scaling policies.
  • Improve production operations by defining and evolving deployment strategies.
  • Conduct disaster recovery testing and failure drills to validate system resilience.

Requirements

  • Experience with observability tools such as Prometheus, Grafana, OpenTelemetry, or AWS CloudWatch.
  • Experience designing dashboards, alerts, and log aggregation pipelines.
  • Deep understanding of AWS services including ECS, Lambda, RDS, and CodePipeline.
  • Strong proficiency in the Go programming language.
  • Skilled at defining SLIs, SLOs, error budgets, and improving MTTR.
  • Experience conducting failure drills with tools such as Chaos Monkey or Gremlin.
  • Excellent communication and collaboration skills.
  • Adaptability to thrive in dynamic, fast-paced environments.
  • Strong time management and task prioritization skills.
  • Proficiency in English.

Benefits

  • Vacation and additional PTO.
  • Remote work from anywhere.
  • Financial support for health insurance, internet, and a cell phone line.
  • Stock options.
  • Learning and development platform.
  • Multidisciplinary, diverse, and dynamic team.
  • Career path and growth opportunities.
