CertifID

CertifID is a digital identity verification solution that prevents wire fraud by validating credentials and securely sharing bank details, offering up to $1 million in insurance coverage on protected wires.

Industry: Diversified Financial Services
Company size: 51-250 employees
Founded: 2017
Funding: $36M raised

Description

  • Own and improve the reliability, availability, and performance of production systems, including defining and operationalizing SLIs, SLOs, and error budgets.
  • Design and implement scalable infrastructure patterns and Infrastructure-as-Code (IaC) to support production services.
  • Design and implement autonomous and semi-autonomous AI agents that use multi-source observability data (metrics, logs, traces) to monitor distributed systems.
  • Build automated workflows that eliminate manual operational work and extend runbook automation.
  • Participate in and help lead on-call rotations, serve as an escalation point for major incidents, and facilitate blameless postmortems.
  • Improve observability by enhancing metrics, logs, traces, and alerting to reduce noise and increase signal, using tools like Datadog or Prometheus/Grafana.
  • Partner with application and engineering teams to embed reliability best practices into system design and delivery.
  • Mentor and coach junior engineers to foster knowledge sharing and a culture of operational excellence.

Requirements

  • 5+ years of experience in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
  • Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
  • Strong Linux, networking, and distributed-systems troubleshooting skills.
  • Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
  • Expertise with Infrastructure-as-Code, with Terraform strongly preferred.
  • Proficient scripting/programming skills in Python, Go, Bash, or C#/.NET.
  • Hands-on experience with observability tooling such as Datadog, Prometheus/Grafana, or OpenTelemetry.
  • Experience with on-call incident response, escalation, and conducting blameless postmortems; experience defining SLIs/SLOs and managing error budgets.
  • Comfort in startup environments, with ability to move fast, influence technical direction, and mentor others; experience with AI/automation for operations is a plus.

Benefits

  • Flexible vacation
  • 12 company-paid holidays
  • 10 paid sick days
  • Health, dental, and vision insurance (including a $0 option)
  • 401(k) with matching and no waiting period
  • Equity (stock)
  • Generous paid parental leave
  • Wellness reimbursement ($300/year), remote worker reimbursement ($300/year), and professional development reimbursement


