Principal Machine Learning Engineer
This role reports to the Director of MLE and works closely with Engineering, Data Science, Product, and the Principal SRE. You will influence cross-team platform standards and help elevate engineering rigor across ML and infrastructure. In addition to system design, you will mentor engineers on ML reliability, architecture decision-making, and operational excellence.
What “Principal” Means
You will:
- Own ML systems architecture
- Define ML lifecycle standards
- Drive event-driven ML integration
- Design model packaging and deployment strategy
- Introduce systemic improvements
- Reduce architectural and data debt
- Establish testing and QA standards across ML workflows
You think in systems, not experiments.
Mission
Build a resilient, scalable ML platform that:
- Trains distributed models at scale
- Supports event-driven feature computation
- Enables portable model deployment (internal + external)
- Standardizes ML lifecycle across products
- Aligns infrastructure to product usage patterns
What You’ll Own
1. ML Platform Architecture
Define and evolve:
- Training orchestration standards
- Batch vs. streaming inference strategy
- Feature store direction
- State store patterns and tooling
- CPU/GPU scaling strategy
- When to extend current tooling, and when to replace it
You will establish architectural principles that outlive individual projects.
2. Event-Driven ML Integration
- Design feature pipelines as first-class ML system components
- Integrate queuing and event systems with ML workflows
- Build reactive retraining triggers
- Define model drift detection and automated response systems
- Ensure retraining pipelines are reproducible and fault tolerant
3. Model Packaging & Distribution
They are enabling customer-deployable ML systems.
You will define:
- Model artifact standardization
- Deterministic builds
- Dependency isolation
- Runtime configuration injection
- Security constraints
- Version compatibility contracts
This is ML productization, not experimentation.
4. ML Observability, Testing & Reliability Standards
Define:
- Model performance SLIs
- Drift detection frameworks
- Data freshness guarantees
- Latency SLOs
- Model failure modes
Establish standards for:
- Automated testing of feature pipelines
- Training pipeline validation
- Model artifact verification
- CI/CD workflows for ML systems
- Safe promotion from experiment to production
Work closely with the Principal SRE to integrate telemetry and operational standards across the full stack.
5. Operational Excellence & On-Call
You will help define and operate a sustainable ML on-call model in partnership with Engineering and SRE.
This includes:
- Clear ownership boundaries between ML systems and infrastructure
- Incident classification and severity alignment
- Runbooks for model failures and data drift
- Postmortem processes focused on systemic improvement
- Reducing operational toil through automation
You are comfortable being accountable for production ML systems, and equally comfortable designing systems that make firefighting rare.
6. Reduce Data Architecture Debt
- Evaluate how the service landscape aligns with product usage
- Improve or redefine streaming feature architecture
- Reduce batch rigidity
- Recommend infrastructure simplifications
Ideal Background
- 10–15+ years of experience building and operating production systems
- Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field — or equivalent practical experience. Advanced degrees are welcome but not required.
- Deep production experience with distributed ML systems
- Strong PyTorch and large-scale data engineering expertise
- Experience with Ray or comparable distributed frameworks
- Experience operating ML systems in production at scale
- Exposure to event-driven architectures
- Experience improving testing and CI/CD practices for ML workflows
- Adtech experience preferred but not required
- Strong architectural opinions backed by real production experience
Current Technology Environment
- ML Frameworks: PyTorch, Ray (Train, Tune, Datasets), PySpark ML
- Data Platform: Databricks (Delta Lake, Unity Catalog), Snowflake, AWS (S3, EC2)
- MLOps: MLflow (experiment tracking, model registry), GitHub Actions
- Observability: Prometheus, Grafana, Datadog
- Languages: Python, SQL, JavaScript/TypeScript
- External LLM integrations (AWS Bedrock and OpenAI)
They are actively evolving toward event-driven integration and portable ML deployment models.
What They’re Looking For
They want someone who:
- Has designed ML systems from zero
- Has migrated or rebuilt broken ML infrastructure
- Has owned production model failures
- Understands cost implications of ML design
- Challenges architectural assumptions constructively
Anticipated Interview Process
1. Conversational + Architecture Discussion: A live discussion focused on past systems, tradeoffs, and a collaborative diagramming / troubleshooting exercise.
2. Take Home GitHub Exercise: A practical ML systems exercise evaluating structure, testing, reproducibility, and clarity.
3. DS/MLE Deep Dive: Technical and strategic discussion around platform evolution and leadership approach.
4. CEO Conversation: Focused on long term platform direction and company alignment.
Job-3547897
#LI-MG1
#LI-Remote