As a Principal Engineer for HPC and AI Infrastructure, you’ll take a lead role in designing the low-level systems that maximize GPU utilization across large, mission-critical workloads.
Working within our GPU Runtime & Systems team, you’ll focus on device drivers, kernel-level optimizations, and runtime performance to ensure GPU clusters deliver the highest throughput, lowest latency, and greatest reliability possible. Your work will directly accelerate workloads across deep learning, high-performance computing, and real-time simulation.
This position sits at the intersection of systems programming, GPU architecture, and HPC-scale computing—a unique opportunity to shape infrastructure used by developers and enterprises worldwide.
Key Responsibilities
- Build and optimize device drivers and runtime components for GPUs and high-speed interconnects.
- Collaborate with kernel and platform teams to design efficient memory pathways (pinned memory, peer-to-peer, unified memory); a short illustrative sketch follows this list.
- Improve data transfers over NVLink, PCIe, and InfiniBand, including RDMA paths, to reduce latency and boost throughput.
- Enhance GPU memory operations with NUMA-aware strategies and hardware-coherent optimizations.
- Implement telemetry and observability tools to monitor GPU performance with minimal runtime overhead (see the NVML sketch after this list).
- Contribute to internal debugging/profiling tools for GPU workloads.
- Mentor engineers on best practices for GPU systems development and participate in peer design/code reviews.
- Stay ahead of evolving GPU and interconnect architectures to influence future infrastructure design.
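To give candidates a concrete flavor of the memory-pathway work above, here is a minimal sketch, illustrative only and not a component of our stack: pinned (page-locked) host memory feeding an asynchronous copy on a CUDA stream. The buffer size and variable names are assumptions for the example.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;                      // 1M floats, arbitrary size
    float *host_buf = nullptr, *dev_buf = nullptr;

    // Page-locked allocation: the GPU's DMA engine can read it directly,
    // skipping the staging copy that pageable memory would require.
    cudaHostAlloc((void **)&host_buf, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&dev_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous host-to-device copy: returns immediately, so the host can
    // enqueue kernels or further copies that overlap with this transfer.
    cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);                 // block until the copy lands
    printf("transfer complete\n");

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```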
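The telemetry bullet is about sampling device counters without perturbing the workload. Below is a minimal sketch using NVML, assuming nvml.h is available and the binary links against -lnvml; again illustrative only, not our production tooling.

```cpp
#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;      // bring up the NVML library

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlUtilization_t util;
        // One cheap query: percent of time over the last sample period that
        // the SMs (util.gpu) and the memory controller (util.memory) were busy.
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS) {
            printf("GPU busy: %u%%, memory busy: %u%%\n",
                   util.gpu, util.memory);
        }
    }

    nvmlShutdown();
    return 0;
}
```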
Minimum Qualifications
- Bachelor’s degree in a technical field (STEM), with 10+ years in systems programming, including 5+ years in GPU runtime or driver development.
- Experience developing kernel-space modules or runtime libraries (CUDA, ROCm, OpenCL).
- Deep familiarity with NVIDIA GPUs, CUDA toolchains, and profiling tools (Nsight, CUPTI, etc.).
- Proven ability to optimize workloads across NVLink, PCIe, Unified Memory, and NUMA systems.
- Hands-on background in RDMA, InfiniBand, GPUDirect, and related communication frameworks (UCX).
- Strong C/C++ programming skills with systems-level expertise (memory management, synchronization, cache coherency).
Preferred Qualifications
- Expertise in HPC workload optimization and GPU compute/memory tradeoffs.
- Knowledge of pinned memory, peer-to-peer transfers, zero-copy, and GPU memory lifetimes.
- Strong grasp of multithreaded and asynchronous programming patterns.
- Familiarity with AI frameworks (PyTorch, TensorFlow) and Python scripting.
- Understanding of low-level CUDA/PTX assembly for debugging or performance tuning.
- Experience with storage and data-movement offloads (NVMe, Intel I/OAT, DPDK) or other DMA-based acceleration.
- Proficiency with system profiling/debugging tools (Valgrind, cuda-memcheck, gdb, Nsight Compute/Systems, perf, eBPF).
- An advanced degree (PhD) with research in GPU systems, compilers, or HPC.