Senior Site Reliability Engineer - AI Infrastructure Jobs in Singapore

    Senior Site Reliability Engineer – AI Infrastructure

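      Consultant:
      Job reference
      Registration no.
      R1109830
      Licence no.
      16S8060
      Function
      Software Engineering, Architecture, DevOps and SRE
      Industry
      Information Technology and Telecommunications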

      Our client operates large-scale GPU cloud platforms across Asia-Pacific. As part of their expansion, they are looking for experienced platform engineers to build and scale their next-generation data center operations. This role offers direct impact in a well-funded technology company working at the forefront of sustainable AI infrastructure.

      Role

      You’ll drive the technical foundation for MLOps capabilities and platform infrastructure supporting cutting-edge NVIDIA GPU clusters. This position demands expertise in designing and operating Kubernetes environments for high-performance computing, implementing Infrastructure-as-Code frameworks, and building world-class observability platforms. You’ll collaborate directly with founders and engineering leadership to establish DevOps standards, enhance CI/CD pipelines, and integrate enterprise-grade monitoring across distributed systems. The role requires ownership of incident response, active participation in on-call rotation, and leading root cause analysis to elevate operational maturity. You’ll work with technologies including Terraform, Ansible, Prometheus, Grafana, Loki, and OpenTelemetry while managing infrastructure supporting thousands of servers across multiple data centers.

      Requirements

      We seek candidates with 7+ years of platform engineering, SRE, or DevOps experience who have built observability and infrastructure platforms from first principles. Deep proficiency with containerization, Kubernetes cluster management, Infrastructure-as-Code tools, and the LGTM observability stack (Loki, Grafana, Tempo, Prometheus/Thanos) is essential. You must demonstrate hands-on expertise with Linux internals, networking stacks, distributed storage, and scripting languages such as Python, Go, or Bash. Experience with telemetry solutions (Redfish, gNMI, SNMP, eBPF) and compliance frameworks (SOC 2, ISO 27001) is highly valued. A Bachelor’s degree in Computer Science or a related field is required.

      To Apply

      To apply, please submit your resume to Yien Quek at yq@kerryconsulting.com. We regret that only shortlisted candidates will be notified. Licence No: 16S8060 | Registration No: R1109830
