Site Reliability Engineer I

4 weeks ago
Full-time
Lead
DevOps and Infrastructure
Zafin

Zafin is a provider of relationship banking software solutions to the financial services industry. Their transformative solutions range from core modernization to innovative platforms in billing, analytics, and rates & fees to quote to cash, empowering...

Internet Software & Services
251-1K
Founded 2002
$47M raised

Description

  • Manage the resolution of complex technical issues involving Zafin’s products and Azure cloud environment.
  • Design and implement operational enhancements to improve resiliency and system reliability.
  • Conduct root cause analysis for high-severity incidents and reduce repeat failures.
  • Represent the organization on external client escalation calls and provide expert guidance and solutions.
  • Optimize cloud infrastructure for performance, scalability, and cost-effectiveness.
  • Provide leadership in managing and scaling container orchestration platforms such as AKS and OpenShift.
  • Implement advanced monitoring solutions and use predictive analytics for proactive issue resolution.
  • Develop and execute automation strategies for operational workflows and incident response.
  • Create and maintain documentation for cloud architectures, processes, and incident management strategies.
  • Mentor and coach junior engineers while collaborating with cross-functional teams on strategic initiatives.

Requirements

  • Bachelor’s degree in computer science, engineering, or a related field; master’s degree preferred.
  • 8+ years of experience in cloud support, operations, or a related role.
  • Advanced expertise in Microsoft Azure or an equivalent cloud platform.
  • Experience designing and scaling container orchestration systems such as AKS or OpenShift.
  • Proven leadership managing automated deployment pipelines, including Azure DevOps.
  • Experience with enterprise monitoring platforms such as Azure Insights and Grafana, plus predictive analytics tools.
  • Advanced scripting skills with PowerShell, Python, or similar languages.
  • Extensive experience in incident management and defining SLAs for global production environments.
  • In-depth knowledge of database management, particularly Postgres.
  • Preferred: advanced cloud certifications such as Azure Solutions Architect Expert.
  • Preferred: experience with ITSM tools and processes such as ServiceNow.
  • Preferred: strong understanding of security and compliance in cloud environments.
  • Strong analytical and problem-solving abilities.
  • Strong leadership, mentoring, communication, and collaboration skills.

Benefits

  • Competitive salaries.
  • Annual bonus potential.
  • Generous paid time off.
  • Paid volunteering days.
  • Wellness benefits.
  • Robust opportunities for professional growth and career advancement.
  • Accommodations available for candidates with disabilities during the selection process.
