Senior Release Engineer, AI Infrastructure

Our client operates within the AI infrastructure space, supporting large-scale, GPU-accelerated environments that power advanced machine learning workloads. Their platforms are built to run distributed training and inference across high-performance compute clusters, enabling scalable and reliable delivery of AI features. This role focuses on strengthening release engineering practices to support consistent, secure, and efficient deployments in compute-intensive environments.
Role
As a Senior Release Engineer, AI Infrastructure, you will lead the development of release engineering capabilities, with a focus on CI/CD pipelines and deployment workflows designed for GPU-accelerated, distributed systems. You will design and manage automated pipelines, testing gates, and release processes that support multi-node workloads and distributed ML training use cases. This includes building test infrastructure for long-running, compute-intensive jobs and ensuring deployment practices align with GPU scheduling and orchestration standards within Kubernetes environments. You will implement structured release practices such as GitOps workflows, automated rollback mechanisms, and change controls, while defining standards for code quality, repository management, and dependency governance. Working closely with infrastructure teams, you will ensure deployments meet security, compliance, and performance requirements across large-scale GPU clusters. The role also involves mentoring engineers on deployment safety and incident response, and embedding consistent release practices across the organization.
Requirements
You should bring 5-7 years of experience in DevOps or release engineering, with hands-on exposure to GPU-accelerated or high-performance computing environments. Strong expertise in CI/CD tools such as GitHub Actions, GitLab CI, Jenkins, or ArgoCD is required, including experience designing multi-stage pipelines with robust testing gates. Proficiency in scripting languages such as Python, Go, or Bash is essential for automation and orchestration. A solid understanding of Kubernetes is critical, particularly in managing and troubleshooting GPU-enabled workloads, including scheduling, scaling, and rollout strategies in distributed environments. Experience with configuration and secret management, secure deployment practices, and artifact/version control is important. You should also have experience integrating testing frameworks into pipelines, including validation of long-running or resource-intensive workloads. Familiarity with optimizing reliability, performance, and cost efficiency in GPU-accelerated clusters will be advantageous.
To Apply
To apply, please submit your resume to Yien Quek at yq@kerryconsulting.com. We regret that only shortlisted candidates will be notified. Licence No: 16S8060 | Registration No: R1109830
