Principal HPC Platform Engineer – Digital Infrastructure

A fast-growing company in the digital infrastructure domain is seeking a Principal Engineer to guide the development of its orchestration and provisioning capabilities. This team builds systems that simplify how teams access and manage high-performance GPU compute environments, with a focus on reliability, security, and multi-tenant operations. The role is suited for an engineer who enjoys shaping foundational platform components that support modern HPC and AI workloads.
Role
You will lead the design and evolution of an orchestration layer that operates above HPC schedulers, enabling users to provision clusters, manage environments, and run workloads through streamlined, API-driven workflows. The scope includes developing control-plane services, automation pipelines, policy frameworks, and governance models that span identity, tenancy, quota management, and usage attribution. The position involves working closely with systems that integrate deeply with Slurm, supporting operational automation, scheduling policies, and multi-cluster scenarios. You will also contribute to operational practices such as SLO definition, observability, incident response, and lifecycle management. You will help set engineering standards, mentor peers, and collaborate with product, security, and infrastructure teams.
Requirements
Candidates should bring over a decade of experience across infrastructure, platform engineering, or distributed systems, with strong hands-on proficiency. Practical expertise operating and extending Slurm in production (covering policies, accounting, quotas, troubleshooting, and automation) is essential. You should have a background in building self-serve platform capabilities, including APIs, guardrails, and provisioning workflows. A solid foundation in Linux, networking, system performance, storage, and observability is expected, along with experience in areas such as Kubernetes operators, workflow engines, service meshes, identity systems, or multi-cluster control planes. The role requires an understanding of secure, multi-tenant design principles and the ability to communicate architectural decisions clearly. Familiarity with GPU environments, HPC networking, and ecosystem tools is beneficial.
To Apply
To apply, please submit your resume to Yien Quek at yq@kerryconsulting.com. We regret that only shortlisted candidates will be notified. Licence No: 16S8060 | Registration no: R1109830
