Veeam Software

Veeam Software is a global leader in backup and modern data protection, offering solutions for virtual environments, enterprises, small businesses, and service providers worldwide.

Internet Software & Services
1K-5K
Founded 2006
$500M raised

Description

  • Own the reliability, performance, and operability of complex, business-critical production services and workflows.
  • Own escalated production issues from support and drive long-term fixes through code, configuration, and architecture changes.
  • Identify systemic risks during troubleshooting and convert them into long-term engineering improvements.
  • Lead production efficiency initiatives and maintain processes, runbooks, and knowledge base integrity across services or domains.
  • Define, build, and maintain production monitoring systems for critical services with strong visibility into system health and user experience.
  • Improve alerting, runbooks, SLIs/SLOs, and error budget usage to guide operational and product decisions.
  • Turn manual processes into robust automation and promote automation patterns and tooling adoption across teams.
  • Own the post-mortem review process and follow-up actions from incidents to drive measurable reliability improvements.
  • Collaborate with support, development, product managers, and security professionals to keep services production-ready, performant, and fault-tolerant.
  • Lead or contribute to design reviews, safe rollout practices, and documentation that reduce manual intervention in production.

Requirements

  • 5–8+ years of experience in software engineering, site reliability, production engineering, or senior technical support roles operating distributed systems.
  • Experience with log analysis and advanced troubleshooting in complex production environments.
  • Strong programming experience in JS, Go, TypeScript, Java, or C#.
  • Experience deploying and troubleshooting systems on public cloud platforms, with Azure preferred.
  • Strong familiarity with observability tools such as Elastic, Prometheus, Grafana, and OpenTelemetry.
  • Solid understanding of distributed systems, networking, automation, and CI/CD.
  • Prior on-call or incident response experience, including leading significant incidents or problem-management efforts, preferred.
  • Background in automation, performance testing, or service scalability at significant scale, preferred.
  • Familiarity with compliance or security best practices and applying them in production design and operations, preferred.

Benefits

  • Competitive compensation and benefits tailored to local markets in the US, Czechia, India, and Australia.
  • 18 paid vacation days plus 4 extra global VeeaMe Days for self-care.
  • 24 paid volunteer hours annually through Veeam Cares.
  • Private medical coverage for employees and up to four dependents.
  • Life, accident, and disability insurance with enhanced coverage.
  • Annual flexible wellbeing allowance for physical and mental wellness.
  • Free confidential counseling and coaching through the Employee Assistance Program, including legal and financial advice.
  • Meal, fuel, and transportation benefits, plus daycare reimbursement and a safe cab facility for eligible employees.
  • Access to professional development resources including internal mentorship, technical training platforms, workshops, learning events, and on-demand libraries such as LinkedIn Learning and O’Reilly.
