Senior Observability & Telemetry Engineer - Radian Arc

3 hours, 23 minutes ago
Full-time
Senior
DevOps and Infrastructure
Submer

Submer offers end-to-end immersion cooling solutions for next-generation datacenters, optimizing operations and minimizing energy consumption.

IT Services
51-250
Founded 2015

Description

  • Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure.
  • Architect observability systems that ingest high-cardinality telemetry from thousands of nodes and services.
  • Build and operate telemetry storage systems for large-scale time-series and event data.
  • Contribute to observability standards across metrics, tracing, logging, and SLO implementation.
  • Instrument GPU clusters, inference workloads, and distributed training environments across compute, storage, and networking layers.
  • Develop visibility into infrastructure degradation such as GPU throttling, network congestion, storage latency, and hardware issues.
  • Build dashboards, monitoring tools, and performance analysis tools for internal teams and customers.
  • Develop and maintain network observability platforms and telemetry collectors/exporters.
  • Design alerting, anomaly detection, and automated detection systems, and integrate them with incident management workflows.
  • Collaborate with platform, networking, storage, compute, and operations teams.
  • Participate in on-call support and mentor others on observability best practices.

Requirements

  • Proven experience operating large distributed infrastructure platforms.
  • Strong background in observability systems and telemetry pipelines.
  • Experience building metrics, logging, tracing, alerting, and dashboards at production scale.
  • Strong programming skills in Go, Python, or Rust.
  • Experience with large-scale time-series data platforms.
  • Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure.
  • Experience monitoring AI workloads such as training or inference clusters.
  • Deep understanding of distributed systems observability.
  • Familiarity with cloud-native infrastructure and tooling such as Kubernetes, automation, and CI/CD.
  • Experience monitoring complex networking environments and integrating network and system telemetry into centralized monitoring platforms.
  • Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry.
  • Strong data analysis capabilities and ability to diagnose performance issues across distributed systems.

Benefits

  • Attractive compensation package reflecting your expertise and experience.
  • Friendly, international, flexible work environment with a hybrid-friendly approach.
  • Remote work in EMEA.
  • Permanent, full-time contract.
  • Opportunity to join a fast-growing scale-up with career growth potential.
  • Equal opportunity employer with a diverse and inclusive environment.
