hatch I.T.

hatch I.T. is a specialized technology recruiting firm that supports emerging tech startups in growing their engineering, data, and product teams. They focus on candidate-centric recruiting, bridging the gap between startups and the local tech communit...

Professional Services
11-50
Founded 2012

Description

  • Ensure high availability, scalability, and performance of production systems.
  • Implement and maintain SLIs, SLOs, and SLAs for critical services.
  • Conduct capacity planning and performance tuning.
  • Automate infrastructure provisioning using infrastructure-as-code tools such as Terraform, Terragrunt, and Ansible.
  • Develop automation to reduce manual operations and improve deployment workflows.
  • Build and maintain CI/CD pipelines to support rapid and reliable deployments.
  • Design and maintain monitoring, logging, and alerting systems using Datadog.
  • Participate in on-call rotations and lead incident response efforts.
  • Perform root-cause analysis and write postmortems to prevent recurring issues.
  • Manage cloud infrastructure (AWS, Azure) and container orchestration platforms (Kubernetes, ECS).
  • Optimize system architecture for reliability and fault tolerance.
  • Implement best practices for security, networking, and service resilience.
  • Work with development teams to design reliable microservices and distributed systems.
  • Advocate for SRE principles and operational excellence across engineering teams.
  • Mentor engineers on reliability practices, tooling, and automation strategies.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
  • 3–7 years of experience in SRE, DevOps, or Systems Engineering roles.
  • Strong proficiency with Linux systems and shell scripting.
  • Experience with cloud platforms such as AWS and Azure.
  • Hands-on experience with Kubernetes/ECS and container technologies such as Docker.
  • Proficiency in at least one programming language, such as Python or Java.
  • Experience with CI/CD pipelines and DevOps tooling.
  • Strong understanding of distributed systems, networking, and security fundamentals.
  • Strong analytical and problem-solving skills.
  • Excellent communication and cross-team collaboration skills.
  • Ability to thrive in fast-paced, high-stakes environments.
  • A mindset focused on continuous improvement and operational excellence.
  • Experience with observability stacks such as OpenTelemetry (preferred).
  • Knowledge of database management, especially PostgreSQL (preferred).
  • Experience with configuration management tools such as Ansible, Chef, or Puppet (preferred).
  • Familiarity with zero-downtime deployments and chaos engineering practices (preferred).

Benefits

  • Competitive pay.
  • Medical, dental, and vision insurance.
  • 401(k) plan with company match for benefit-eligible employees.
  • PTO (Personal Time Off) for full-time employees.
  • Sick time for full-time employees.
  • Remote work environment.

Interested in this position?

Apply directly on the company website

Similar Roles

[Job-28557] Senior SRE, Brazil

CI&T 5K-10K Internet Software & Services

CI&T is hiring a Senior SRE in Brazil to support a cloud-based application project with a strong focus on reliability, observability, and proactive operational ownership.

Android AWS Datadog Docker GitHub GitHub Actions Go Google Analytics Grafana iOS Java Jenkins Kubernetes Linux Prometheus Python Splunk Terraform

Director of Cloud Operations

Firstup 251-1K Professional Services

Firstup is hiring a Director of Cloud Operations to lead the reliability, scalability, and efficiency of its globally distributed SaaS cloud platform across AWS, while partnering with engineering, security, and product teams.

AWS CI/CD CircleCI Datadog Kubernetes Microservices .NET Serverless Terraform

Staff Site Reliability Engineer

Caseware 251-1K Internet Software & Services

Caseware is hiring a Staff Site Reliability Engineer in Romania to help build and scale its AI platform by keeping AWS, Kubernetes, and GitOps-based production systems reliable, observable, and automated.

AWS AWS CDK CI/CD Docker GitHub GitHub Actions GitOps Kubernetes Linux Load Balancing Microservices Terraform

Senior Infrastructure Engineer - Postgres

ClickHouse 51-250 IT Services

ClickHouse is hiring a Senior SRE / Senior Infrastructure Engineer to own reliability, automation, and operations for its multi-cloud Postgres integration and cloud data platform as it scales globally.

AWS Azure CI/CD GCP Go Grafana Kubernetes OpenTelemetry PostgreSQL Prometheus Terraform