Principal Machine Learning Engineer
This role reports to the Director of MLE and works closely with Engineering, Data Science, Product, and the Principal SRE. You will influence cross-team platform standards and help elevate engineering rigor across ML and infrastructure. In addition to system design, you will mentor engineers on ML reliability, architecture decision-making, and operational excellence.
What “Principal” Means
You will:
- Own ML systems architecture
- Define ML lifecycle standards
- Drive event-driven ML integration
- Design model packaging and deployment strategy
- Introduce systemic improvements
- Reduce architectural and data debt
- Establish testing and QA standards across ML workflows
You think in systems, not experiments.
Mission
Build a resilient, scalable ML platform that:
- Trains distributed models at scale
- Supports event-driven feature computation
- Enables portable model deployment (internal + external)
- Standardizes ML lifecycle across products
- Aligns infrastructure to product usage patterns
What You’ll Own
1. ML Platform Architecture
Define and evolve:
- Training orchestration standards
- Batch vs. streaming inference strategy
- Feature store direction
- State store patterns and tooling
- CPU/GPU scaling strategy
- When to extend current tooling, and when to replace it
You will establish architectural principles that outlive individual projects.
2. Event-Driven ML Integration
- Design feature pipelines as first-class ML system components
- Integrate queuing and event systems with ML workflows
- Build reactive retraining triggers
- Define model drift detection and automated response systems
- Ensure retraining pipelines are reproducible and fault tolerant
3. Model Packaging & Distribution
They are enabling customer-deployable ML systems.
You will define:
- Model artifact standardization
- Deterministic builds
- Dependency isolation
- Runtime configuration injection
- Security constraints
- Version compatibility contracts
This is ML productization, not experimentation.
4. ML Observability, Testing & Reliability Standards
Define:
- Model performance SLIs
- Drift detection frameworks
- Data freshness guarantees
- Latency SLOs
- Model failure modes
Establish standards for:
- Automated testing of feature pipelines
- Training pipeline validation
- Model artifact verification
- CI/CD workflows for ML systems
- Safe promotion from experiment to production
Work closely with the Principal SRE to integrate telemetry and operational standards across the full stack.
5. Operational Excellence & On-Call
You will help define and operate a sustainable ML on-call model in partnership with Engineering and SRE.
This includes:
- Clear ownership boundaries between ML systems and infrastructure
- Incident classification and severity alignment
- Runbooks for model failures and data drift
- Postmortem processes focused on systemic improvement
- Reducing operational toil through automation
You are comfortable being accountable for production ML systems, and equally comfortable designing systems that make firefighting rare.
6. Reduce Data Architecture Debt
- Evaluate how the service landscape aligns with product usage
- Improve or redefine streaming feature architecture
- Reduce batch rigidity
- Recommend infrastructure simplifications
Ideal Background
- 10–15+ years of experience building and operating production systems
- Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field — or equivalent practical experience. Advanced degrees are welcome but not required.
- Deep production experience with distributed ML systems
- Strong PyTorch and large-scale data engineering expertise
- Experience with Ray or comparable distributed frameworks
- Experience operating ML systems in production at scale
- Exposure to event-driven architectures
- Experience improving testing and CI/CD practices for ML workflows
- Adtech experience preferred but not required
- Strong architectural opinions backed by real production experience
Current Technology Environment
- ML Frameworks: PyTorch, Ray (Train, Tune, Datasets), PySpark ML
- Data Platform: Databricks (Delta Lake, Unity Catalog), Snowflake, AWS (S3, EC2)
- MLOps: MLflow (experiment tracking, model registry), GitHub Actions
- Observability: Prometheus, Grafana, Datadog
- Languages: Python, SQL, JavaScript/TypeScript
- External LLM integrations (AWS Bedrock and OpenAI)
They are actively evolving toward event-driven integration and portable ML deployment models.
What They’re Looking For
They want someone who:
- Has designed ML systems from zero
- Has migrated or rebuilt broken ML infrastructure
- Has owned production model failures
- Understands cost implications of ML design
- Challenges architectural assumptions constructively
Anticipated Interview Process
1. Conversational + Architecture Discussion: A live discussion focused on past systems, tradeoffs, and a collaborative diagramming / troubleshooting exercise.
2. Take Home GitHub Exercise: A practical ML systems exercise evaluating structure, testing, reproducibility, and clarity.
3. DS/MLE Deep Dive: Technical and strategic discussion around platform evolution and leadership approach.
4. CEO Conversation: Focused on long term platform direction and company alignment.
Job-3547897
#LI-MG1
#LI-Remote