Inference Optimization Architect, Speech AI, NVIDIA

June 17, 2026

Job Description

NVIDIA is seeking an Inference Optimization Architect for Speech AI to enhance speech models and develop scalable systems that improve real-time conversational AI capabilities. The role centers on reducing inference latency, increasing processing efficiency, and maximizing GPU utilization across large-scale AI deployments. The architect will work with researchers and engineers to turn advanced research models into efficient, production-grade systems.

Main Duties:

  • Optimize inference performance through batching strategies, caching, and multi-threaded pipeline improvements.
  • Implement model compression techniques including quantization, pruning, and knowledge distillation.
  • Profile and benchmark models using GPU tools to identify and eliminate performance bottlenecks.
  • Develop hardware-accelerated solutions using CUDA, TensorRT, and custom kernel optimizations.
  • Design scalable infrastructure and optimize deployment across data center and edge GPU platforms.

Essential Qualifications:

  • 10 years of experience in deep learning and 5 years dedicated to optimizing inference systems.
  • Knowledge of inference pipelines for large language models, speech recognition, and speech synthesis systems.
  • Practical skills in CUDA programming, memory management, and parallel computing.
  • Experience with model serving tools, including Triton, TorchServe, TensorRT, and vLLM.
  • Thorough understanding of model architectures, including Transformers, CNNs, and RNNs.

Preferred Skills:

  • Experience contributing to open-source projects such as PyTorch, JAX, and Triton.
  • Expertise in embedded systems and in deploying AI models to edge devices.
  • Ability to build automated systems for model optimization and deployment.
  • Effective teamwork skills for collaborating with international teams across departments.
  • Expertise in managing resource usage and reducing costs for production inference operations.