LLM Inference Kernel Engineer (MLA)
Location: Remote, United States
A high-growth, venture-backed AI innovator is pushing the boundaries of large-scale model performance, focusing on next-generation inference systems that operate at the intersection of model architecture and GPU execution. This organization is tackling some of the most complex challenges in modern AI, including optimizing trillion-parameter-scale systems and redefining how attention mechanisms perform in real-world environments.
They are seeking a deeply technical LLM Inference Kernel Engineer to play a pivotal role in advancing cutting-edge attention architectures, specifically within Multi-Head Latent Attention (MLA) frameworks. This is a high-impact opportunity to directly influence performance breakthroughs that will shape product delivery timelines, investor milestones, and the future of scalable AI systems.
What You Will Do
- Design and implement high-performance GPU kernels tailored for large language model inference workloads
- Optimize CUDA kernels with a focus on memory efficiency, execution speed, and latency reduction
- Enhance token generation performance, KV cache utilization, and decoding efficiency in large-scale models
- Collaborate on integrating optimized kernels into modern inference serving frameworks such as vLLM
- Work closely with a small, highly technical team to rapidly prototype, test, and deploy performance improvements
- Apply advanced techniques such as kernel fusion, tiling strategies, and warp-level optimization to improve throughput (a brief sketch follows this list)
- Translate complex attention mechanisms into production-ready, scalable GPU implementations
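To give a flavor of the kernel-fusion and warp-level work named above, here is a minimal illustrative sketch in CUDA. It is not drawn from the employer's codebase: it shows a warp-level row softmax, the kind of primitive that gets fused into attention kernels, and it assumes each row fits in a single 32-lane warp purely to keep the example short.

```cuda
#include <cstdio>
#include <math.h>
#include <cuda_runtime.h>

__global__ void warp_row_softmax(const float* in, float* out, int rows, int cols) {
    int row  = blockIdx.x;   // one 32-thread block (a single warp) per row
    int lane = threadIdx.x;  // lane index within the warp
    if (row >= rows) return;

    float v = (lane < cols) ? in[row * cols + lane] : -INFINITY;

    // Warp-shuffle max reduction: the row max never leaves registers.
    float m = v;
    for (int offset = 16; offset > 0; offset >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, offset));

    // Exponentiate against the max for numerical stability, then reduce the sum.
    float e = (lane < cols) ? expf(v - m) : 0.0f;
    float s = e;
    for (int offset = 16; offset > 0; offset >>= 1)
        s += __shfl_xor_sync(0xffffffffu, s, offset);

    if (lane < cols) out[row * cols + lane] = e / s;
}

int main() {
    const int rows = 4, cols = 32;
    float h_in[rows * cols], h_out[rows * cols];
    for (int i = 0; i < rows * cols; ++i) h_in[i] = (float)(i % cols);

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_row_softmax<<<rows, 32>>>(d_in, d_out, rows, cols);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) sum += h_out[c];
    printf("row 0 softmax sum: %f (expect ~1.0)\n", sum);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The shuffle-based reductions keep the row max and sum entirely in registers, avoiding shared-memory round trips, which is exactly the kind of latency win that kernel fusion targets.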
What You Bring
- Strong experience developing GPU kernels using CUDA C or C++ in performance-critical environments
- Hands-on experience optimizing inference workloads for large language models rather than purely research-based modeling
- Solid understanding of attention mechanisms, with exposure to advanced implementations such as fused attention or similar approaches
- Familiarity with modern inference stacks and serving frameworks
- Deep knowledge of GPU architecture, including memory hierarchy, bandwidth constraints, and latency tradeoffs (see the tiling sketch after this list)
- Ability to operate in a fast-paced, highly iterative environment with minimal oversight
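As a concrete, hypothetical illustration of the memory-hierarchy knowledge described above, the sketch below shows the classic shared-memory tiling idiom, assuming a square TILE x TILE blocking and matrix sizes that divide evenly by TILE. Production inference kernels are far more elaborate, but the bandwidth-versus-reuse tradeoff it demonstrates is the same.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled matrix multiply: each global load is reused TILE
// times from on-chip storage instead of being re-fetched from DRAM.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Coalesced loads: adjacent threads touch adjacent global addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 64;  // kept a multiple of TILE to avoid boundary handling
    const size_t bytes = n * n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    tiled_matmul<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %.1f (expect %.1f)\n", hC[0], 2.0f * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Each element of A and B is fetched from global memory once per tile and reused TILE times from shared memory, cutting DRAM traffic by roughly a factor of TILE; that bandwidth-for-reuse trade is the core of the memory-hierarchy reasoning this role calls for.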
Preferred Qualifications
- Experience working with advanced attention techniques such as latent attention or similar architectures
- Exposure to large-scale or distributed model inference environments, including mixture-of-experts systems
- Contributions to performance optimization projects, open-source kernels, or inference tooling
- Familiarity with GPU profiling and performance analysis tools
- Background that bridges model architecture, systems engineering, and deployment layers
Why This Role Stands Out
This is not a traditional machine learning engineering position. The work sits at one of the most performance-critical layers in the AI stack, where low-level optimization directly impacts real-world model capability. You will have the opportunity to shape how advanced models operate at scale, contributing to meaningful innovations in inference performance and system efficiency.
About Blue Signal:
Blue Signal is an award-winning executive search firm. Our recruiters have a proven track record of placing top-tier talent across industry verticals and professional disciplines. Learn more at bit.ly/46Gs4yS