Job Description
NVIDIA is hiring a Senior AI Infrastructure Engineer to design, build, and operate large-scale DGX Cloud platforms. The role involves distributed systems, automation, Kubernetes, and GPU infrastructure, ensuring reliability, performance, and scalability for AI training and inference services across production environments globally at enterprise scale.
Main Duties
- Design and build internal tools to manage AI training and inference systems running on cloud infrastructure.
- Run performance tests to improve system efficiency across multiple GPUs and full multi-node clusters.
- Own services end to end, from development and deployment through maintenance and ongoing improvement.
- Support teams before launch with system design consulting, tooling development, and capacity assessment.
- Establish health monitoring that tracks system uptime, response time, and performance.
- Automate infrastructure processes to improve system capacity, stability, and operational efficiency.
- Participate in incident management, including emergency response and post-incident reviews.
Essential Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or related technical discipline.
- 5+ years of professional experience in infrastructure engineering or distributed systems.
- Experience with infrastructure automation and large-scale cloud systems in production environments.
- Programming experience in Python, Go, C/C++, or Java.
- Strong knowledge of Linux, networking, storage systems, and container technologies.
- Experience with public cloud platforms, Infrastructure as Code (IaC), and Terraform.
- Hands-on distributed systems experience.
Preferred Qualifications:
- Experience designing distributed systems that span multiple locations and building solutions that handle system failures.
- Strong problem-solving abilities combined with an ownership mentality and communication skills.
- Debugging skills for analyzing system performance, and experience automating recurring tasks.
- Operational experience managing private and public cloud environments with Kubernetes and Slurm.