Caseware

CaseWare International Inc. provides cutting-edge software solutions for accounting firms, corporations, and governments, enabling users worldwide to work smarter and transform insights into impact.

Internet Software & Services
251-1K
Founded 1988

Description

  • Maintain reliable, high-performing AWS production systems.
  • Manage EKS clusters for configuration, scaling, and workload stability.
  • Set up and support Istio service mesh for traffic control and security.
  • Oversee GitOps workflows to ensure secure and consistent infrastructure changes.
  • Create automation tools and platform enhancements.
  • Design, implement, and manage monitoring, logging, and tracing solutions across AI workloads, microservices, and data pipelines.
  • Respond to incidents, analyze root causes, and recommend lasting fixes.
  • Work with developers and platform teams to improve deployments and system operations.
  • Support Nx-based monorepos to enable scalable developer workflows.
  • Participate in an on-call rotation.

Requirements

  • Experience as a Site Reliability Engineer with solid software engineering skills.
  • Deep understanding of AWS services used in production, including EKS, EC2, IAM, networking, and load balancing.
  • Professional experience with Kubernetes, including autoscaling, networking, RBAC, and cluster operations.
  • Hands-on experience with Istio service mesh.
  • Expertise with GitHub, GitHub Actions, and modern CI/CD workflows.
  • Experience working with monorepos, especially Nx.
  • Understanding of GitOps practices, preferably with Flux CD.
  • Strong grasp of Linux systems, networking, containers, and Docker.
  • Familiarity with infrastructure as code tools such as CDK and Terraform.
  • Knowledge of SLOs, error budgets, incident management, and production readiness best practices.
  • Strong English communication and collaboration skills.
  • Strong analytical thinking, problem-solving skills, and an ownership mindset.
  • Fully remote position based in Colombia.

Benefits

  • Indefinite-term employment contract (contrato a término indefinido) with all legal benefits.
  • Prepaid medicine and life insurance, plus funeral assistance.
  • Internet allowance and home office stipend.
  • Competitive compensation above market average.
  • 100% remote work environment with excellent work-life balance.
  • Budget for training and professional growth.
  • 5 personal PTO days per year, plus a sick leave top-up from day 3 through day 90.
  • Recognition award with additional paid time off, plus upgraded vacation starting at 5 years of service.


