AHEAD

AHEAD accelerates the impact of technology on clients by engineering customized data, developer, and infrastructure platforms that improve IT operations. By weaving together cloud infrastructure, intelligent operations, and modern applications, we help...

IT Services

Information Technology

1K-5K (2250)

$43M raised

83 open positions

Principal Observability & Reliability Architect

3 weeks, 2 days ago

United States

Full-time

Lead

Site Reliability Engineer (SRE)

DevOps and Infrastructure

CI/CD Datadog Elasticsearch Grafana Kafka Kubernetes New Relic OpenTelemetry Opsgenie PagerDuty Prometheus Splunk

Description

Lead client discovery, architecture workshops, and solution design for observability, telemetry, reliability, and operational intelligence initiatives.
Design enterprise observability architectures across monitoring, logging, metrics, tracing, alerting, event correlation, service visibility, and platform integrations.
Define standards for telemetry onboarding, naming, tagging, RBAC, service ownership, dashboards, alert governance, runbooks, and operational handoff.
Advise on telemetry governance, including data quality, retention, access control, sampling, cardinality, and cost optimization.
Lead modernization efforts such as tool rationalization, dashboard and alert rationalization, telemetry strategy, and migration from legacy monitoring platforms.
Guide SRE practices including SLIs, SLOs, error budgets, production readiness, and incident response maturity.
Design integration patterns across ITSM, CMDB, event management, and automation platforms.
Support pursuits by shaping solution strategy, validating scope, informing estimates, and building client-facing technical narratives.
Serve as a senior escalation point and provide architecture governance during delivery.
Build reusable reference architectures, playbooks, and accelerators while mentoring architects, consultants, and offshore teams.

Requirements

10+ years of experience in observability, monitoring, APM, platform operations, SRE, or related enterprise technology domains.
5+ years leading architecture and delivery strategy for enterprise observability or reliability initiatives.
Hands-on experience designing and implementing monitoring, logging, metrics, tracing, telemetry collection, and pipeline patterns in hybrid and multi-cloud environments.
Strong knowledge of telemetry governance, including routing, transformation, normalization, enrichment, retention, access control, and cost management.
Experience defining enterprise standards for dashboards, alerts, tagging, naming, service ownership, RBAC, and operating model adoption.
Strong command of incident response, event correlation, alert strategy, service health, and business-service visibility.
Applied SRE experience with SLIs, SLOs, error budgets, and production readiness.
Ability to lead executive and technical workshops and translate business needs into actionable architecture and delivery plans.
Consulting or professional services experience with client-facing communication, estimation, risk management, and cross-functional leadership.
Preferred experience with platforms such as Dynatrace, Splunk, Grafana, LogicMonitor, Datadog, New Relic, AppDynamics, Elastic, Prometheus, or OpenTelemetry.
Preferred experience with telemetry pipeline tools such as OpenTelemetry Collector, Grafana Alloy, Fluent Bit, Kafka, Cribl, or Vector, along with familiarity with cloud, Kubernetes, CI/CD, and infrastructure as code.
Preferred experience integrating with ServiceNow, Jira Service Management, PagerDuty, Opsgenie, BigPanda, or xMatters.
Preferred experience developing reusable consulting assets such as reference architectures, governance models, playbooks, POVs, and accelerators.
Relevant cloud, SRE, ITIL, or FinOps certifications are a plus.

Benefits

Compensation includes on-target earnings (OTE) with base salary plus any applicable target bonus, varying by experience, qualifications, and geography.
Medical, dental, and vision insurance.
401(k) plan.
Paid company holidays.
Paid time off.
Paid parental and caregiver leave.
Cross-department training and development opportunities.
Sponsorship for certifications and credentials for continued learning.

Similar Roles

Senior Site Reliability Engineer

Counterpart Health 51-200 hospital & health care

Counterpart Health is hiring a Senior Site Reliability and Infrastructure Engineer to support and evolve the technology platform behind its primary care tool and maintain reliable infrastructure for domestic and international workloads.

United States Full-time Senior Site Reliability Engineer (SRE)

$160k-$208k

AWS Azure CI/CD Containerd DNS Docker GCP Go gRPC Helm Kubernetes Linux Load Balancing Prometheus Python Shell Scripting TCP/IP

19 hours, 2 minutes ago

Senior Test Platform & Reliability Engineer - Star Trek Fleet Command

Scopely 1K-5K Internet Software & Services

Scopely is hiring a Senior Test Platform & Reliability Engineer in Ireland to build validation, reliability, and developer enablement platforms for Star Trek Fleet Command’s large-scale live-service backend systems.

Ireland Full-time Senior SDET (Software Development Engineer in Test) Site Reliability Engineer (SRE)

AWS Bash CI/CD Docker GitLab Go Python Terraform

19 hours, 17 minutes ago

Senior Software Engineer - Databases, SRE | Canada | Remote

Grafana 1K-5K IT Services

Grafana Labs is hiring a Senior Software Engineer for its remote SRE team to improve reliability and operability of Grafana Cloud database services for high-SLA customers across AWS, GCP, and Azure.

Canada Full-time Senior Site Reliability Engineer (SRE) Software Engineer

$108k-$130k

AWS Azure GCP Go Helm Java Kubernetes Linux Microservices Python Terraform

1 day, 18 hours ago

Senior Site Reliability Engineer

Semios 51-250 Food Products

Semios Group is hiring a Senior Site Reliability Engineer to help scale, secure, and improve the reliability of its global agricultural technology platform.

Canada Full-time Senior Site Reliability Engineer (SRE)

$140k-$160k

AWS Azure Bash Buildkite CI/CD Datadog Docker Envoy GCP Git GitHub GitHub Actions GitLab Go Jenkins Kubernetes Linux NATS New Relic Prometheus Python Ruby Splunk Terraform

1 day, 19 hours ago
