Strategic Ops Engineer III

1 hour, 36 minutes ago
Full-time
Senior
Artificial Intelligence and Machine Learning
Backblaze

Backblaze is a pioneer in robust, scalable, low-cost cloud backup and storage services, offering enterprise hot storage as well as low-cost backup and archive solutions. With the easiest way to back up all files, Backblaze provides unlimited, unthrottled, and unc...

IT Services
251-1K employees
Founded 2007

Description

  • Lead and govern the end-to-end incident management lifecycle, including detection, triage, escalation, and resolution.
  • Drive major incident management processes and communications.
  • Improve mean time to resolution through automation and process optimization.
  • Establish and maintain incident response playbooks and runbooks.
  • Maintain and improve AI/ML-powered heatmaps to identify recurring technical themes and prioritize long-term remediation.
  • Use observability data and AI to perform trend analysis and proactively identify problems.
  • Track and manage problem records through closure.
  • Govern change management processes, including leading the Change Advisory Board (CAB) for safe and compliant deployments.
  • Define and enforce change policies, risk assessments, and approval workflows.
  • Partner with engineering teams to improve system resilience, performance, and monitoring quality.
  • Leverage AI/ML for anomaly detection, predictive alerting, and automated root cause analysis.
  • Analyze large-scale operational data to identify patterns and recommend operational improvements.

Requirements

  • 5+ years of experience in IT Operations, SRE, or similar roles.
  • Strong expertise in incident, problem, and change management using ITIL or similar frameworks.
  • Proven experience governing and optimizing operational processes.
  • Strong knowledge of AI/ML concepts, including anomaly detection, predictive analytics, and data modeling.
  • Hands-on experience with AIOps platforms or building AI-driven operational solutions, including event correlation and alert prioritization.
  • ITIL certification (Foundation or higher) is preferred.
  • Proficiency with tools such as Jira, ServiceNow (SNOW), FireHydrant, or Moogsoft is preferred.
  • Experience working in high-availability, large-scale environments is preferred.
  • Strong analytical and problem-solving skills.
  • Excellent stakeholder communication and leadership skills.

Benefits

  • Expected salary range of $123,000 - $175,000.
  • Commitment to learning, development, and growth as part of the company culture.
  • Inclusive, belonging-focused workplace culture.
  • Equal Opportunity Employer status.
  • Transparency around compensation benchmarking and offer determination.


