As a Principal AI Infrastructure Abstraction Engineer, you will design and implement the foundational systems that make shared AI compute environments scalable, secure, and developer-friendly. Your work will focus on creating abstractions that hide hardware complexity while providing predictable, cloud-native interfaces for AI workloads.
This position bridges infrastructure and applied AI, turning raw GPUs and accelerators into programmable, elastic, and multi-tenant resources for both internal developers and enterprise clients.
Key Responsibilities
- Architect abstractions that map logical compute constructs (vGPUs, GPU pools, workload queues) to physical devices.
- Build APIs, services, and control planes that expose GPU and accelerator resources with strong isolation and quality-of-service guarantees.
- Develop mechanisms for secure GPU sharing, including time-slicing, partitioning, and namespace isolation.
- Work with orchestration and scheduling systems to map workloads to resources based on utilization, priority, and network topology.
- Define policies for quotas, fair allocation, and resource elasticity in shared environments.
- Integrate with AI/ML frameworks (PyTorch, TensorFlow, Triton, etc.) to optimize model training and inference workflows.
- Deliver observability and monitoring capabilities that trace resource usage from logical abstractions to hardware.
- Partner with platform security teams to strengthen access controls, onboarding processes, and tenant isolation.
- Support internal developer adoption of abstraction APIs while maintaining high performance and low overhead.
- Contribute to long-term compute platform strategy with a focus on modularity, abstraction, and scale.
Minimum Qualifications
- Bachelor’s degree with 15+ years of experience, Master’s with 12+ years, or PhD with 8+ years.
- Proven track record building production-grade infrastructure systems, preferably in Go, Python, or C++.
- Strong experience with containerization and orchestration platforms (Kubernetes, Docker, KubeVirt).
- Background in designing logical abstractions for compute, storage, or networking in multi-tenant systems.
- Familiarity with integrating machine learning platforms and frameworks (e.g., PyTorch, TensorFlow, Triton, MLflow).
Preferred Qualifications
- Hands-on experience with GPU sharing, scheduling, or isolation (MIG, MPS, vGPUs, time-slicing, or device plugin models).
- Deep knowledge of resource management: quotas, prioritization, fairness, elasticity.
- Strong ability to think across hardware/software boundaries and design abstractions that scale.