Margo Bank

Unlock excellence with MARGO Consulting: where ambition, expertise, and innovation drive the most complex tech challenges.

Professional Services
$8M raised

Description

  • Build and support large-scale AI infrastructure, including monitoring, diagnosing, and remediating production incidents.
  • Troubleshoot high-impact production issues in collaboration with other engineering teams.
  • Participate in an on-call rotation to handle incidents and ensure service continuity.
  • Implement and maintain observability solutions to monitor AI infrastructure and application health.
  • Contribute to AI infrastructure lifecycle management across different environments and countries.
  • Promote and apply best practices for stability, resiliency, scalability, and security.
  • Maintain clear technical documentation for tools and procedures.
  • Contribute to the evolution of systems and tools based on production feedback.
  • Collaborate closely with development teams to ensure infrastructure readiness.
  • Participate in team rituals and knowledge-sharing initiatives.

Requirements

  • Experience with Go or Python.
  • Strong scripting skills in Bash and Python.
  • Hands-on experience with Linux systems, especially Ubuntu/Debian.
  • Hands-on experience with GPU and HPC infrastructure (preferred).
  • Knowledge of networking concepts such as VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, and IPv6.
  • Familiarity with monitoring and logging tools such as Prometheus, Grafana, and Elastic.
  • Comfort with Infrastructure-as-Code tools such as Ansible, Salt, and AWX.
  • Experience managing relational databases, especially MariaDB.
  • Understanding of CI/CD pipelines, especially GitLab.
  • Comfortable communicating in English, both written and spoken.
  • Proactive and solution-oriented mindset.
  • Passion for automation and continuous improvement.
  • Strong collaboration and communication skills.
  • Ability to work independently and as part of a team.
  • Willingness to mentor others and share knowledge.

Benefits

  • Remote work arrangement.
  • Permanent contract or B2B contract option.
  • Hourly rate of 200 zł to 250 zł.
