Field Reliability Engineer - LATAM

4 hours, 2 minutes ago
Full-time
Senior
DevOps and Infrastructure
Honeycomb.io

Honeycomb.io provides a comprehensive observability platform designed for engineers to effectively debug and monitor distributed services, including microservices and serverless applications, facilitating collaborative problem-solving and enhancing ove...

Internet Software & Services
51-250 employees
Founded 2016
$149M raised

Description

  • Own and operate customer-facing managed infrastructure, including Refinery as a Service (RaaS) and Honeycomb Private Cloud (HnyPC) deployments across multiple AWS accounts and regions.
  • Build and maintain Terraform modules, Helm charts, and deployment automation for customer EKS clusters, collector pools, and Refinery instances.
  • Design and implement monitoring, alerting, and observability for managed service infrastructure.
  • Manage scaling, upgrades, incident response, capacity planning, and cost optimization for customer deployments.
  • Build autonomous deployment and management tooling for field-operated managed services.
  • Serve as the senior technical escalation point for complex customer incidents, collector configurations, Refinery tuning, and architecture reviews.
  • Diagnose and resolve infrastructure and observability issues across distributed systems, Kubernetes clusters, AWS networking, and polyglot service meshes.
  • Partner with customer SRE, platform, and engineering teams to troubleshoot real-time production issues.
  • Participate in an on-call rotation for managed services and provide Tier 2 escalation support.
  • Build SOPs, runbooks, diagnostic frameworks, internal tools, and UIs to improve operational efficiency and speed up incident resolution.

Requirements

  • Experience owning and operating production infrastructure in AWS and Kubernetes environments.
  • Hands-on experience with Terraform, Helm, and deployment automation.
  • Experience troubleshooting distributed systems, Kubernetes clusters, and AWS networking components such as ALBs, PrivateLink, NLBs, and VPCs.
  • Experience with observability tooling and monitoring production services.
  • Ability to handle senior-level customer escalation work and architecture reviews.
  • Experience supporting incident response and on-call operations.
  • Familiarity with OpenTelemetry distributions, collectors, exporters, and instrumentation libraries.
  • Experience building or contributing to open-source projects such as Refinery or collector distributions is preferred.
  • Experience working with customer SRE, platform, or engineering teams in production environments.
  • Ability to work in a fully distributed, remote-first company; visa sponsorship and visa transfers are not available.

Benefits

  • Generous equity with an employee-friendly stock program.
  • Transparent pay based on levels relative to experience.
  • Unlimited PTO.
  • Distributed-first remote culture.
  • Home office, co-working, and internet stipend.
  • Full benefits coverage for employees, with additional coverage available for dependents.
  • Up to 16 weeks of paid parental leave, regardless of path to parenthood.
  • Annual development allowance.
