Senior Site Reliability Engineer (Azure)

2 weeks, 1 day ago
Full-time
Senior
DevOps and Infrastructure
MLabs

MLabs is a Haskell, Rust, Blockchain, and AI consultancy specializing in mission-critical software development, cross-team collaboration, and cutting-edge value delivery for fintech, blockchain, and information technology sectors.

Internet Software & Services
11-50 employees
Founded 2018

Description

  • Architect and deploy secure, scalable Azure infrastructure for production-grade distributed systems.
  • Develop and maintain Terraform-based infrastructure as code for repeatable multi-environment deployments.
  • Translate ambiguous product and customer requirements into technical architecture and execution plans.
  • Build and optimize platform services, APIs, and integrations to extend core system capabilities.
  • Partner with engineering, security, and product teams to deliver enterprise-ready infrastructure solutions.
  • Drive improvements in reliability, observability, and incident response.
  • Provide Tier 2 infrastructure support for customer deployments.
  • Establish operational excellence for a greenfield Azure environment.
  • Help achieve feature parity between Azure and the organization’s other cloud environments.

Requirements

  • Extensive experience designing and building production-grade systems on Azure.
  • Ability to transform high-level requirements into scalable, delivered systems.
  • Strong technical communication skills with both engineering and non-technical stakeholders.
  • High-ownership mindset with a strong bias for action and accountability.
  • Deep knowledge of Azure networking, compute, identity, security, and storage.
  • Advanced proficiency with Terraform at production scale.
  • Professional experience in Go and/or Python.
  • Background in distributed systems, high-availability architectures, or platform engineering.
  • Experience with automation tooling across the full infrastructure lifecycle and CI/CD.
  • Hands-on experience with Kubernetes and container orchestration (preferred).
  • Familiarity with observability tools such as Prometheus and Grafana (preferred).
  • Experience with workflow/orchestration platforms like Argo or Spacelift (preferred).

Benefits

  • Compensation of $150K–$200K.
  • Equity and tokens tied to long-term project growth.
  • Annual performance bonuses based on individual and company milestones.
  • Comprehensive health insurance for US-based employees.
  • 401(k) plan for US-based employees.
  • Remote, full-time role covering US hours; Europe-based candidates are considered if their working hours overlap with EST.
