CertifID

CertifID is a digital identity verification solution that prevents wire fraud by validating credentials and securely sharing bank details, offering up to $1 million in insurance coverage on protected wires.

Industry: Diversified Financial Services
Company size: 51-250 employees
Founded: 2017
Funding: $36M raised

Description

  • Own and improve the reliability, availability, and performance of production systems, including defining and operationalizing SLIs, SLOs, and error budgets.
  • Design and implement scalable infrastructure patterns and Infrastructure-as-Code (IaC) to support production services.
  • Design and implement autonomous and semi-autonomous AI agents that consume multi-source observability data (metrics, logs, traces) for monitoring distributed systems.
  • Build automated workflows that eliminate manual operational work and extend runbook automation.
  • Participate in and help lead on-call rotations, serve as an escalation point for major incidents, and facilitate blameless postmortems.
  • Improve observability by enhancing metrics, logs, traces, and alerting to reduce noise and increase signal, using tools such as Datadog or Prometheus/Grafana.
  • Partner with application and engineering teams to embed reliability best practices into system design and delivery.
  • Mentor and coach junior engineers to foster knowledge sharing and a culture of operational excellence.

Requirements

  • 5+ years of experience in SRE, DevOps, Platform Engineering, or Infrastructure Engineering.
  • Proven experience supporting production SaaS systems in Azure (preferred), AWS, or GCP.
  • Strong Linux, networking, and distributed-systems troubleshooting skills.
  • Strong experience with containers and orchestration (Kubernetes/EKS/AKS).
  • Expertise with Infrastructure-as-Code, with Terraform strongly preferred.
  • Proficiency in scripting or programming with Python, Go, Bash, or C#/.NET.
  • Hands-on experience with observability tooling such as Datadog, Prometheus/Grafana, or OpenTelemetry.
  • Experience with on-call incident response, escalation, and conducting blameless postmortems; experience defining SLIs/SLOs and managing error budgets.
  • Comfort in startup environments, with ability to move fast, influence technical direction, and mentor others; experience with AI/automation for operations is a plus.

Benefits

  • Flexible vacation
  • 12 company-paid holidays
  • 10 paid sick days
  • Health, dental, and vision insurance (including a $0 option)
  • 401(k) with matching and no waiting period
  • Equity (stock)
  • Generous paid parental leave
  • Wellness reimbursement ($300/year), remote worker reimbursement ($300/year), and professional development reimbursement
