LLM Inference Kernel Engineer (MLA)
Location: Remote, United States
A high-growth, venture-backed AI innovator is pushing the boundaries of large-scale model performance, focusing on next-generation inference systems that operate at the intersection of model architecture and GPU execution. This organization is tackling some of the most complex challenges in modern AI, including optimizing trillion-parameter-scale systems and redefining how attention mechanisms perform in real-world environments.
They are seeking a deeply technical LLM Inference Kernel Engineer to play a pivotal role in advancing cutting-edge attention architectures, specifically within Multi-Head Latent Attention (MLA) frameworks. This is a high-impact opportunity to directly influence performance breakthroughs that will shape product delivery timelines, investor milestones, and the future of scalable AI systems.
What You Will Do
- Design and implement high-performance GPU kernels tailored for large language model inference workloads
- Optimize CUDA kernels with a focus on memory efficiency, execution speed, and latency reduction
- Enhance token generation performance, KV cache utilization, and decoding efficiency in large-scale models
- Collaborate on integrating optimized kernels into modern inference serving frameworks such as vLLM
- Work closely with a small, highly technical team to rapidly prototype, test, and deploy performance improvements
- Apply advanced techniques such as kernel fusion, tiling strategies, and warp-level optimization to improve throughput (a brief sketch follows this list)
- Translate complex attention mechanisms into production-ready, scalable GPU implementations
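To give a flavor of the kernel-fusion and warp-level work named above, here is a minimal illustrative sketch in CUDA. It is not drawn from the employer's codebase: it shows a warp-level row softmax, the kind of primitive that gets fused into attention kernels, and it assumes each row fits in a single 32-lane warp purely to keep the example short.

```cuda
#include <cstdio>
#include <math.h>
#include <cuda_runtime.h>

__global__ void warp_row_softmax(const float* in, float* out, int rows, int cols) {
    int row  = blockIdx.x;   // one 32-thread block (a single warp) per row
    int lane = threadIdx.x;  // lane index within the warp
    if (row >= rows) return;

    float v = (lane < cols) ? in[row * cols + lane] : -INFINITY;

    // Warp-shuffle max reduction: the row max never leaves registers.
    float m = v;
    for (int offset = 16; offset > 0; offset >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, offset));

    // Exponentiate against the max for numerical stability, then reduce the sum.
    float e = (lane < cols) ? expf(v - m) : 0.0f;
    float s = e;
    for (int offset = 16; offset > 0; offset >>= 1)
        s += __shfl_xor_sync(0xffffffffu, s, offset);

    if (lane < cols) out[row * cols + lane] = e / s;
}

int main() {
    const int rows = 4, cols = 32;
    float h_in[rows * cols], h_out[rows * cols];
    for (int i = 0; i < rows * cols; ++i) h_in[i] = (float)(i % cols);

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warp_row_softmax<<<rows, 32>>>(d_in, d_out, rows, cols);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    float sum = 0.0f;
    for (int c = 0; c < cols; ++c) sum += h_out[c];
    printf("row 0 softmax sum: %f (expect ~1.0)\n", sum);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The shuffle-based reductions keep the row max and sum entirely in registers, avoiding shared-memory round trips, which is exactly the kind of latency win that kernel fusion targets.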
What You Bring
- Strong experience developing GPU kernels using CUDA C or C++ in performance-critical environments
- Hands-on experience optimizing inference workloads for large language models rather than purely research-based modeling
- Solid understanding of attention mechanisms, with exposure to advanced implementations such as fused attention or similar approaches
- Familiarity with modern inference stacks and serving frameworks
- Deep knowledge of GPU architecture, including memory hierarchy, bandwidth constraints, and latency tradeoffs (see the tiling sketch after this list)
- Ability to operate in a fast-paced, highly iterative environment with minimal oversight
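As a concrete, hypothetical illustration of the memory-hierarchy knowledge described above, the sketch below shows the classic shared-memory tiling idiom, assuming a square TILE x TILE blocking and matrix sizes that divide evenly by TILE. Production inference kernels are far more elaborate, but the bandwidth-versus-reuse tradeoff it demonstrates is the same.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled matrix multiply: each global load is reused TILE
// times from on-chip storage instead of being re-fetched from DRAM.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Coalesced loads: adjacent threads touch adjacent global addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 64;  // kept a multiple of TILE to avoid boundary handling
    const size_t bytes = n * n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    tiled_matmul<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %.1f (expect %.1f)\n", hC[0], 2.0f * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Each element of A and B is fetched from global memory once per tile and reused TILE times from shared memory, cutting DRAM traffic by roughly a factor of TILE; that bandwidth-for-reuse trade is the core of the memory-hierarchy reasoning this role calls for.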
Preferred Qualifications
- Experience working with advanced attention techniques such as latent attention or similar architectures
- Exposure to large-scale or distributed model inference environments, including mixture-of-experts systems
- Contributions to performance optimization projects, open-source kernels, or inference tooling
- Familiarity with GPU profiling and performance analysis tools
- Background that bridges model architecture, systems engineering, and deployment layers
Why This Role Stands Out
This is not a traditional machine learning engineering position. The work sits at one of the most performance-critical layers in the AI stack, where low-level optimization directly impacts real-world model capability. You will have the opportunity to shape how advanced models operate at scale, contributing to meaningful innovations in inference performance and system efficiency.
About Blue Signal:
Blue Signal is an award-winning executive search firm. Our recruiters have a proven track record of placing top-tier talent across industry verticals and professional disciplines. Learn more at bit.ly/46Gs4yS