HPC Cluster Architect

3 weeks, 3 days ago
Full-time
Senior
DevOps and Infrastructure
NexGen Cloud

NexGen Cloud is Europe's leading sustainable cloud Infrastructure as a Service (IaaS) provider, specializing in high-performance computing (HPC) and GPU infrastructure. With a focus on sustainability and innovation, NexGen Cloud offers GPU as a Service...

IT Services
11-50
Founded 2020

Description

  • Own end-to-end cluster architecture for large-scale NVIDIA GPU deployments from customer requirements through production handover.
  • Design high-performance network fabrics across compute, storage, and WAN, including topology, oversubscription, and scaling strategies.
  • Define rack layouts, BOMs, and power and cooling designs for production-ready deployments.
  • Engage with OEMs and vendors to validate hardware configurations, review quotes, and optimize designs commercially and technically.
  • Provide technical oversight during deployment and bring-up, including hardware validation and performance testing.
  • Act as the escalation point for complex integration issues during implementation.
  • Serve as a senior technical leader across Solutions Architecture, Cloud Engineering, and data centre partners.
  • Contribute to standardised reference designs and help build out the HPC engineering function.

Requirements

  • Proven experience designing and delivering GPU-based HPC or AI clusters at scale across the full lifecycle from design through procurement, deployment, and validation.
  • Deep hands-on knowledge of NVIDIA GPU platforms such as H100, H200, or B-series, and NVIDIA reference architectures.
  • Strong design experience with InfiniBand/RDMA and high-performance Ethernet fabrics, including topology and performance tuning.
  • Solid grounding in Linux systems, PCIe topology, NUMA alignment, and server-level performance considerations.
  • Background from an OEM, hyperscaler, neo-cloud, or enterprise/research HPC environment with full design-to-deployment exposure.
  • Confidence engaging with customers, vendors, OEMs, and internal engineering teams as a technical authority.
  • Experience with Spectrum-X or next-generation Ethernet fabrics is preferred.
  • Prior involvement in large-scale cluster deployments of 1,000+ GPUs and benchmarking with NCCL or MLPerf is preferred.
  • Exposure to air-cooled and liquid-cooled HPC environments, automation, or infrastructure-as-code is preferred.

Benefits

  • Competitive salary and annual discretionary bonus scheme.
  • Employee wellbeing benefits.
  • 25 days of holiday plus public holidays.
  • Flexible working arrangements, including remote or hybrid work depending on role and location.
  • Real ownership and autonomy with the trust to take initiative and experiment.
  • Clear career progression and growth opportunities in a fast-growing company.
  • A collaborative, international culture built on trust, transparency, and ownership.
