Project Director - HPC / AI / GPU Infrastructure Deployment
Larsen Toubro Vyoma
Job Description
Role Overview
We are seeking a seasoned Project Director - HPC / AI Infrastructure Deployment to lead large-scale, high-density compute programs involving GPU clusters, HPC workloads, and AI infrastructure. The role demands end-to-end ownership of deploying 10+ MW IT load data center environments, ensuring delivery of high-performance GPU-based compute platforms with cutting-edge networking and storage architectures.
Roles & Responsibilities
- Lead and deliver large-scale HPC / AI GPU cluster deployments (e.g., NVIDIA B200 / B300 GPU platforms) within defined timelines and budgets
- Drive execution of AI stack deployment (e.g., NVIDIA NVAIE) across hybrid/cloud/on-prem environments
- Manage multi-vendor ecosystems including OEMs, SI partners, and hyperscale technology providers
- Deploy and scale high-density GPU racks with liquid/air-cooled thermal strategies
- Design and oversee InfiniBand (IB) and high-speed Ethernet networks
- Experience with NVIDIA/Mellanox InfiniBand fabrics
- Configuration and optimization using UFM (Unified Fabric Manager)
- Strong understanding of BCM (Broadcom Ethernet switching) platforms
- Architect and implement Leaf-Spine network topology for ultra-low latency AI workloads
- Ensure effective integration of storage systems (parallel file systems, NVMe-based storage)
- Oversee deployment of Kubernetes-based GPU orchestration platforms
- Experience with containerized AI workloads and distributed training clusters
- Exposure to NVIDIA AI Enterprise (NVAIE), CUDA, and GPU virtualization frameworks
- Manage data center design, build, and repurposing for HPC workloads
- Oversee MEP (Mechanical, Electrical, Plumbing) systems implementation
- Ensure optimized thermal management (liquid cooling, rear door heat exchangers, immersion cooling where applicable)
- Ensure optimized power density (kW/rack) planning
- Ensure optimized energy efficiency (PUE optimization)
- Establish robust governance frameworks aligned to:
  a. HLD/LLD design validation
  b. SOP adherence
  c. Quality assurance benchmarks
- Implement risk mitigation strategies for large-scale deployments (supply chain, OEM dependencies, technology integration risks)
- Monitor program milestones and ensure SLA-based deliveries
- Drive structured cabling design (fiber-heavy HPC fabric, spine-leaf connectivity)
Qualifications & Experience
- B.E/B.Tech in Electrical / Electronics / Computer Science Engineering
- 15-25 years of experience in data center infrastructure deployment, HPC / AI workload environments, and large-scale IT infrastructure programs
Mandatory / Preferred Certifications
- PMP / PRINCE2 (mandatory for program governance)
- CDCP / CDCS / CDCPM certifications
- Strongly preferred:
  - NVIDIA AI Infrastructure / DGX / AI Factory certifications
  - OEM certifications (Dell, HPE, Lenovo HPC systems)