Principal HPC Platform Engineer - Digital Infrastructure Jobs in Singapore

    Principal HPC Platform Engineer – Digital Infrastructure

      Consultant:
      Job Reference No.
      Registration No.
      1109830
      Licence No.
      16S8060
      Function
      Software Engineering, Architecture, DevOps & SRE
      Industry
      IT & Telco

      A fast-growing company in the digital infrastructure domain is seeking a Principal Engineer to guide the development of its orchestration and provisioning capabilities. This team builds systems that simplify how teams access and manage high-performance GPU compute environments, with a focus on reliability, security, and multi-tenant operations. The role is suited for an engineer who enjoys shaping foundational platform components that support modern HPC and AI workloads.

      Role

      You will lead the design and evolution of an orchestration layer that operates above HPC schedulers, enabling users to provision clusters, manage environments, and run workloads through streamlined, API-driven workflows. The scope includes developing control-plane services, automation pipelines, policy frameworks, and governance models that span identity, tenancy, quota management, and usage attribution. The position involves working closely with systems that integrate deeply with Slurm, supporting operational automation, scheduling policies, and multi-cluster scenarios. You will also contribute to operational practices such as SLO definition, observability, incident response, and lifecycle management. You will help set engineering standards, mentor peers, and collaborate with product, security, and infrastructure teams.

      Requirements

      Candidates should bring over a decade of experience across infrastructure, platform engineering, or distributed systems, with strong hands-on proficiency. Practical expertise operating and extending Slurm in production (covering policies, accounting, quotas, troubleshooting, and automation) is essential. You should have a background in building self-serve platform capabilities, including APIs, guardrails, and provisioning workflows. A solid foundation in Linux, networking, system performance, storage, and observability is expected, along with experience in areas such as Kubernetes operators, workflow engines, service meshes, identity systems, or multi-cluster control planes. The role requires an understanding of secure, multi-tenant design principles and the ability to communicate architectural decisions clearly. Familiarity with GPU environments, HPC networking, and ecosystem tools is beneficial.

      To Apply

      To apply, please submit your resume to Yien Quek at yq@kerryconsulting.com. We regret that only shortlisted candidates will be notified. Licence No: 16S8060 | Registration No: R1109830
