Tecsys

Tecsys is a global provider of transformative supply chain solutions, offering innovative warehouse management software and end-to-end visibility for various industries. With a focus on advancing supply chain management since 1983, Tecsys delivers effi...

Air Freight & Logistics
251-1K
Founded 1983
$52M raised

Description

  • Collaborate with engineering teams on system design consulting, capacity planning, launch reviews, and pre-production support.
  • Maintain live services by monitoring availability, latency, and overall system health.
  • Own and improve observability through Datadog dashboards, SLOs/SLIs, alerting, logging, and SLA reporting.
  • Develop and improve internal tooling, infrastructure-as-code, and CI/CD pipelines to reduce manual work and enable self-healing systems.
  • Drive automation and scale systems sustainably to improve reliability and delivery velocity.
  • Participate in on-call coverage and practice sustainable incident response and blameless postmortems.
  • Act as Incident Commander during incidents and coordinate cross-team response, communications, and service restoration.
  • Lead post-incident reviews and implement long-term fixes that improve stability, reliability, and developer experience.
  • Create and maintain technical documentation and mature SRE best practices.
  • Work cross-functionally with platform engineering, internal teams, and vendors to support global growth and high service performance.

Requirements

  • 5+ years of experience in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
  • Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
  • Hands-on experience managing AWS cloud infrastructure, including multi-account environments, VPC, EC2, and EKS, at scale.
  • Strong experience with Kubernetes at scale.
  • Strong hands-on experience with infrastructure as code and automation tools such as Terraform and Ansible.
  • Familiarity with CI/CD pipelines and release automation; GitLab is preferred and Jenkins is acceptable.
  • Deep understanding of monitoring and observability tools such as Datadog, including metric design, log pipelines, alerting, and dashboards.
  • Experience with incident management, on-call participation, escalation, and structured postmortems.
  • Scripting skills in Python, Bash, Java, or equivalent for automation and diagnostics.
  • Basic knowledge of Java- or .NET-based development.
  • Strong written and spoken English communication skills.
  • Experience with FedRAMP compliance is a strong asset.
  • Must be a Canadian citizen, permanent resident of Canada, or hold a valid Canadian work permit.
  • Availability for an escalation on-call rotation.
  • Ability to travel occasionally (less than 10% of the time), including quarterly offsites and conferences.

Benefits

  • Flexible digital-first and remote-friendly work environment.
  • Conveniently located offices and collaborative workspaces for in-person work when desired.
  • Opportunity to work on mission-critical SaaS infrastructure with real ownership over uptime and reliability.
  • Continuous learning opportunities in a fast-growing technology company.
  • Exposure to automation, resilience engineering, and platform reliability work at scale.
  • Inclusive, diverse, and equal-opportunity workplace with accommodation available during the interview process.
