
LLM Inference Kernel Engineer (MLA)

Location: Remote, United States


A high-growth, venture-backed AI innovator is pushing the boundaries of large-scale model performance, focusing on next-generation inference systems that operate at the intersection of model architecture and GPU execution. This organization is tackling some of the most complex challenges in modern AI, including optimizing trillion-parameter-scale systems and redefining how attention mechanisms perform in real-world environments.


They are seeking a deeply technical LLM Inference Kernel Engineer to play a pivotal role in advancing cutting-edge attention architectures, specifically within Multi-Head Latent Attention (MLA) frameworks. This is a high-impact opportunity to directly influence performance breakthroughs that will shape product delivery timelines, investor milestones, and the future of scalable AI systems.


What You Will Do

  • Design and implement high-performance GPU kernels tailored for large language model inference workloads
  • Optimize CUDA kernels with a focus on memory efficiency, execution speed, and latency reduction
  • Enhance token generation performance, KV cache utilization, and decoding efficiency in large-scale models
  • Collaborate on integrating optimized kernels into modern inference serving frameworks such as vLLM or similar systems
  • Work closely with a small, highly technical team to rapidly prototype, test, and deploy performance improvements
  • Apply advanced techniques such as kernel fusion, tiling strategies, and warp-level optimization to improve throughput
  • Translate complex attention mechanisms into production-ready, scalable GPU implementations


What You Bring

  • Strong experience developing GPU kernels using CUDA C/C++ in performance-critical environments
  • Hands-on experience optimizing inference workloads for large language models rather than purely research-based modeling
  • Solid understanding of attention mechanisms, with exposure to advanced implementations such as fused attention or similar approaches
  • Familiarity with modern inference stacks and serving frameworks
  • Deep knowledge of GPU architecture, including memory hierarchy, bandwidth constraints, and latency tradeoffs
  • Ability to operate in a fast-paced, highly iterative environment with minimal oversight

Preferred Qualifications

  • Experience working with advanced attention techniques such as latent attention or similar architectures
  • Exposure to large-scale or distributed model inference environments, including mixture-of-experts systems
  • Contributions to performance optimization projects, open-source kernels, or inference tooling
  • Familiarity with GPU profiling and performance analysis tools
  • Background that bridges model architecture, systems engineering, and deployment layers


Why This Role Stands Out

This is not a traditional machine learning engineering position. The work sits at one of the most performance-critical layers in the AI stack, where low-level optimization directly impacts real-world model capability. You will have the opportunity to shape how advanced models operate at scale, contributing to meaningful innovations in inference performance and system efficiency.


About Blue Signal:  

Blue Signal is an award-winning executive search firm. Our recruiters have a proven track record of placing top-tier talent across industry verticals, with deep expertise in a wide range of professional specialties. Learn more at bit.ly/46Gs4yS


