HPC Cluster Architect

7 hours, 58 minutes ago
Full-time
Senior
DevOps and Infrastructure
NexGen Cloud

NexGen Cloud

NexGen Cloud is Europe's leading sustainable cloud Infrastructure as a Service (IaaS) provider, specializing in high-performance computing (HPC) and GPU infrastructure. With a focus on sustainability and innovation, NexGen Cloud offers GPU as a Service...

IT Services
11-50
Founded 2020

Description

  • Own end-to-end cluster architecture for large-scale NVIDIA GPU deployments from customer requirements through production handover.
  • Design high-performance network fabrics across compute, storage, and WAN, including topology, oversubscription, and scaling strategies.
  • Define rack layouts, BOMs, and power and cooling designs for production-ready deployments.
  • Engage with OEMs and vendors to validate hardware configurations, review quotes, and optimize designs commercially and technically.
  • Provide technical oversight during deployment and bring-up, including hardware validation and performance testing.
  • Act as the escalation point for complex integration issues during implementation.
  • Serve as a senior technical leader across Solutions Architecture, Cloud Engineering, and data centre partners.
  • Contribute to standardised reference designs and help build out the HPC engineering function.

Requirements

  • Proven experience designing and delivering GPU-based HPC or AI clusters at scale across the full lifecycle from design through procurement, deployment, and validation.
  • Deep hands-on knowledge of NVIDIA GPU platforms such as H100, H200, or B-series, and NVIDIA reference architectures.
  • Strong design experience with InfiniBand/RDMA and high-performance Ethernet fabrics, including topology and performance tuning.
  • Solid grounding in Linux systems, PCIe topology, NUMA alignment, and server-level performance considerations.
  • Background from an OEM, hyperscaler, neo-cloud, or enterprise/research HPC environment with full design-to-deployment exposure.
  • Confidence engaging with customers, vendors, OEMs, and internal engineering teams as a technical authority.
  • Experience with Spectrum-X or next-generation Ethernet fabrics is preferred.
  • Prior involvement in large-scale cluster deployments of 1,000+ GPUs and benchmarking with NCCL or MLPerf is preferred.
  • Exposure to air-cooled and liquid-cooled HPC environments, automation, or infrastructure-as-code is preferred.

Benefits

  • Competitive salary and annual discretionary bonus scheme.
  • Employee wellbeing benefits.
  • 25 days of holiday plus public holidays.
  • Flexible working arrangements, including remote or hybrid work depending on role and location.
  • Real ownership and autonomy with the trust to take initiative and experiment.
  • Clear career progression and growth opportunities in a fast-growing company.
  • A collaborative, international culture built on trust, transparency, and ownership.


