Kaseya

Kaseya provides integrated IT management and security solutions for MSPs and SMBs, enabling centralized IT operations, remote management, cybersecurity, and automation.

IT Services
1K-5K employees
Founded 2000
$567M raised

Description

  • Set, monitor, and enforce SLOs, SLIs, and error budgets to maintain service reliability.
  • Lead incident response, troubleshooting, and blameless postmortems that drive permanent fixes.
  • Build and maintain automated deployment, configuration management, and infrastructure provisioning using Infrastructure as Code.
  • Manage cloud and hybrid infrastructure with Terraform or CloudFormation, balancing cost, scalability, and resilience.
  • Improve observability through proactive monitoring, alerting, and dashboards that surface issues early.
  • Partner with development teams to embed reliability into the SDLC, including deployment automation, capacity planning, and chaos engineering.
  • Reduce operational toil through automation and self-healing systems.
  • Support containerized and serverless workloads to keep production systems highly available and fault tolerant.
  • Stay current on SRE, cloud, and observability practices and bring improvements back to the team.

Requirements

  • 4 to 5 years of AWS production experience.
  • Experience owning infrastructure as code with Terraform or CloudFormation, including state management.
  • AWS ECS production experience, or a strong Kubernetes background with willingness to ramp up on ECS.
  • Active on-call rotation experience, including leading incidents and writing postmortems.
  • Working fluency with SLOs, SLIs, and error budgets in production.
  • Kubernetes production experience preferred.
  • Experience with observability tools such as Datadog, Dynatrace, CloudWatch, or Elasticsearch/Kibana preferred.
  • Experience with chaos engineering preferred.
  • Experience with AWS Lambda or other serverless workloads preferred.
  • Experience with Ansible, Chef, or Puppet preferred.
  • DevSecOps experience, including vulnerability scanning, secrets management, SOC2, or ISO 27001, preferred.
  • Production database support experience with RDS, PostgreSQL, or MySQL preferred.
  • Open source contributions or a public technical portfolio preferred.

Benefits

  • Annual base salary of CAD $115,000 to CAD $130,000.
  • Final offer determined by experience, skills, and internal equity.
  • Equal employment opportunity across all protected characteristics.
