Principal Engineer – High-Performance AI Infrastructure
San Jose, CA
Apply for this job

As a Principal Engineer for HPC and AI Infrastructure, you’ll take a lead role in designing the low-level systems that maximize GPU utilization across large, mission-critical workloads.

Working within our GPU Runtime & Systems team, you’ll focus on device drivers, kernel-level optimizations, and runtime performance to ensure GPU clusters deliver the highest throughput, lowest latency, and greatest reliability possible. Your work will directly accelerate workloads across deep learning, high-performance computing, and real-time simulation.

This position sits at the intersection of systems programming, GPU architecture, and HPC-scale computing, offering a unique opportunity to shape infrastructure used by developers and enterprises worldwide.

Key Responsibilities

  • Build and optimize device drivers and runtime components for GPUs and high-speed interconnects.
  • Collaborate with kernel and platform teams to design efficient memory pathways (pinned memory, peer-to-peer, unified memory).
  • Improve data transfers across NVLink, InfiniBand, PCIe, and RDMA to reduce latency and boost throughput.
  • Enhance GPU memory operations with NUMA-aware strategies and hardware-coherent optimizations.
  • Implement telemetry and observability tools to monitor GPU performance with minimal runtime overhead.
  • Contribute to internal debugging/profiling tools for GPU workloads.
  • Mentor engineers on best practices for GPU systems development and participate in peer design/code reviews.
  • Stay ahead of evolving GPU and interconnect architectures to influence future infrastructure design.

Minimum Qualifications

  • Bachelor’s degree in a technical field (STEM), with 10+ years in systems programming, including 5+ years in GPU runtime or driver development.
  • Experience developing kernel-space modules or runtime libraries (e.g., CUDA, ROCm, OpenCL).
  • Deep familiarity with NVIDIA GPUs, CUDA toolchains, and profiling tools (Nsight, CUPTI, etc.).
  • Proven ability to optimize workloads across NVLink, PCIe, Unified Memory, and NUMA systems.
  • Hands-on background in RDMA, InfiniBand, GPUDirect, and related communication frameworks (UCX).
  • Strong C/C++ programming skills with systems-level expertise (memory management, synchronization, cache coherency).

Preferred Qualifications

  • Expertise in HPC workload optimization and GPU compute/memory tradeoffs.
  • Knowledge of pinned memory, peer-to-peer transfers, zero-copy, and GPU memory lifetimes.
  • Strong grasp of multithreaded and asynchronous programming patterns.
  • Familiarity with AI frameworks (PyTorch, TensorFlow) and Python scripting.
  • Understanding of low-level CUDA/PTX assembly for debugging or performance tuning.
  • Experience with storage and I/O offload technologies (NVMe, Intel I/OAT, DPDK) or DMA-based acceleration.
  • Proficiency with system profiling/debugging tools (Valgrind, cuda-memcheck, gdb, Nsight Compute/Systems, perf, eBPF).
  • An advanced degree (PhD) with research in GPU systems, compilers, or HPC is a plus.

