Submer

Submer offers end-to-end immersion cooling solutions for next-generation datacenters, optimizing operations and minimizing energy consumption.

IT Services

Information Technology

51-250 (101)

Founded 2015

6 open positions

Senior Observability & Telemetry Engineer - Radian Arc

3 hours, 23 minutes ago

Europe, Middle East, Africa

Full-time

Senior

Site Reliability Engineer (SRE)

DevOps and Infrastructure

CI/CD ClickHouse Go Grafana Kubernetes Linux OpenTelemetry Prometheus Python Rust WAF

Description

Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure.
Architect observability systems that ingest high-cardinality telemetry from thousands of nodes and services.
Build and operate telemetry storage systems for large-scale time-series and event data.
Contribute to observability standards across metrics, tracing, logging, and SLO implementation.
Instrument GPU clusters, inference workloads, and distributed training environments across compute, storage, and networking layers.
Develop visibility into infrastructure degradation such as GPU throttling, network congestion, storage latency, and hardware issues.
Build dashboards, monitoring tools, and performance analysis tools for internal teams and customers.
Develop and maintain network observability platforms and telemetry collectors/exporters.
Design alerting, anomaly detection, and automated detection systems, and integrate them with incident management workflows.
Collaborate with platform, networking, storage, compute, and operations teams, participate in on-call support, and mentor others on observability best practices.

Requirements

Proven experience operating large distributed infrastructure platforms.
Strong background in observability systems and telemetry pipelines.
Experience building metrics, logging, tracing, alerting, and dashboards at production scale.
Strong programming skills in Go, Python, or Rust.
Experience with large-scale time-series data platforms.
Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure.
Experience monitoring AI workloads such as training or inference clusters.
Deep understanding of distributed systems observability.
Familiarity with cloud-native infrastructure (such as Kubernetes), automation, and CI/CD.
Experience monitoring complex networking environments and integrating network and system telemetry into centralized monitoring platforms.
Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry.
Strong data analysis capabilities and ability to diagnose performance issues across distributed systems.

Benefits

Attractive compensation package reflecting your expertise and experience.
Friendly, international, flexible work environment with a hybrid-friendly approach.
Remote work in EMEA.
Permanent, full-time contract.
Opportunity to join a fast-growing scale-up with career growth potential.
Equal opportunity employer with a diverse and inclusive environment.
