Site Reliability Engineer I

2 hours, 37 minutes ago
Full-time
Lead
DevOps and Infrastructure
Zafin

Zafin is a provider of relationship banking software solutions to the financial services industry. Their transformative solutions range from core modernization to innovative platforms in billing, analytics, and rates & fees to quote to cash, empowering...

Internet Software & Services
251-1K
Founded 2002
$47M raised

Description

  • Manage the resolution of complex technical issues involving Zafin’s products and Azure cloud environment.
  • Design and implement operational enhancements to improve resiliency and system reliability.
  • Conduct root cause analysis for high-severity incidents and reduce repeat failures.
  • Represent the organization on external client escalation calls and provide expert guidance and solutions.
  • Optimize cloud infrastructure for performance, scalability, and cost-effectiveness.
  • Provide leadership in managing and scaling container orchestration platforms such as AKS and OpenShift.
  • Implement advanced monitoring solutions and use predictive analytics for proactive issue resolution.
  • Develop and execute automation strategies for operational workflows and incident response.
  • Create and maintain documentation for cloud architectures, processes, and incident management strategies.
  • Mentor and coach junior engineers while collaborating with cross-functional teams on strategic initiatives.

Requirements

  • Bachelor’s degree in computer science, engineering, or a related field; master’s degree preferred.
  • 8+ years of experience in cloud support, operations, or a related role.
  • Advanced expertise in Microsoft Azure or an equivalent cloud platform.
  • Experience designing and scaling container orchestration systems such as AKS or OpenShift.
  • Proven leadership in managing automated deployment pipelines, including Azure DevOps.
  • Experience with enterprise monitoring platforms such as Azure Insights and Grafana, plus predictive analytics tools.
  • Advanced scripting skills with PowerShell, Python, or similar languages.
  • Extensive experience in incident management and defining SLAs for global production environments.
  • In-depth knowledge of database management, particularly Postgres.
  • Preferred: advanced cloud certifications such as Azure Solutions Architect Expert.
  • Preferred: experience with ITSM tools and processes such as ServiceNow.
  • Preferred: strong understanding of security and compliance in cloud environments.
  • Strong analytical and problem-solving abilities.
  • Strong leadership, mentoring, communication, and collaboration skills.

Benefits

  • Competitive salaries.
  • Annual bonus potential.
  • Generous paid time off.
  • Paid volunteering days.
  • Wellness benefits.
  • Robust opportunities for professional growth and career advancement.
  • Accommodations available for candidates with disabilities during the selection process.


