Principal Machine Learning Engineer - Profitable AI Programmatic Advertising Platform
USA


Principal Machine Learning Engineer

This role reports to the Director of MLE and works closely with Engineering, Data Science, Product, and the Principal SRE. You will influence cross-team platform standards and help elevate engineering rigor across ML and infrastructure. In addition to system design, you will mentor engineers on ML reliability, architectural decision-making, and operational excellence.


What “Principal” Means

You will:

  • Own ML systems architecture
  • Define ML lifecycle standards
  • Drive event-driven ML integration
  • Design model packaging and deployment strategy
  • Introduce systemic improvements
  • Reduce architectural and data debt
  • Establish testing and QA standards across ML workflows


You think in systems, not experiments.


Mission

Build a resilient, scalable ML platform that:

  • Trains distributed models at scale
  • Supports event-driven feature computation
  • Enables portable model deployment (internal + external)
  • Standardizes ML lifecycle across products
  • Aligns infrastructure to product usage patterns


What You’ll Own

1. ML Platform Architecture


Define and evolve:

  • Training orchestration standards
  • Batch vs. streaming inference strategy
  • Feature store direction
  • State store patterns and tooling
  • CPU/GPU scaling strategy
  • When to extend current tooling, and when to replace it


You will establish architectural principles that outlive individual projects.


2. Event-Driven ML Integration

  • Design feature pipelines as first-class ML system components
  • Integrate queuing and event systems with ML workflows
  • Build reactive retraining triggers
  • Define model drift detection and automated response systems
  • Ensure retraining pipelines are reproducible and fault tolerant


3. Model Packaging & Distribution

They are enabling customer-deployable ML systems.


You will define:

  • Model artifact standardization
  • Deterministic builds
  • Dependency isolation
  • Runtime configuration injection
  • Security constraints
  • Version compatibility contracts


This is ML productization, not experimentation.


4. ML Observability, Testing & Reliability Standards

Define:

  • Model performance SLIs
  • Drift detection frameworks
  • Data freshness guarantees
  • Latency SLOs
  • Model failure modes


Establish standards for:

  • Automated testing of feature pipelines
  • Training pipeline validation
  • Model artifact verification
  • CI/CD workflows for ML systems
  • Safe promotion from experiment to production


Work closely with the Principal SRE to integrate telemetry and operational standards across the full stack.


5. Operational Excellence & On-Call

You will help define and operate a sustainable ML on-call model in partnership with Engineering and SRE.


This includes:

  • Clear ownership boundaries between ML systems and infrastructure
  • Incident classification and severity alignment
  • Runbooks for model failures and data drift
  • Postmortem processes focused on systemic improvement
  • Reducing operational toil through automation


You are comfortable being accountable for production ML systems, and equally focused on designing systems that make firefighting rare.


6. Reduce Data Architecture Debt

  • Evaluate how the service landscape aligns with product usage
  • Improve or redefine streaming feature architecture
  • Reduce batch rigidity
  • Recommend infrastructure simplifications


Ideal Background

  • 10–15+ years of experience building and operating production systems
  • Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience. Advanced degrees are welcome but not required.
  • Deep production experience with distributed ML systems
  • Strong PyTorch and large-scale data engineering expertise
  • Experience with Ray or comparable distributed frameworks
  • Experience operating ML systems in production at scale
  • Exposure to event-driven architectures
  • Experience improving testing and CI/CD practices for ML workflows
  • Adtech experience preferred but not required
  • Strong architectural opinions backed by real production experience


Current Technology Environment

  • ML Frameworks: PyTorch, Ray (Train, Tune, Datasets), PySpark ML
  • Data Platform: Databricks (Delta Lake, Unity Catalog), Snowflake, AWS (S3, EC2)
  • MLOps: MLflow (experiment tracking, model registry), GitHub Actions
  • Observability: Prometheus, Grafana, Datadog
  • Languages: Python, SQL, JavaScript/TypeScript
  • External LLM integrations (AWS Bedrock and OpenAI)


They are actively evolving toward event-driven integration and portable ML deployment models.


What They’re Looking For

They want someone who:

  • Has designed ML systems from zero
  • Has migrated or rebuilt broken ML infrastructure
  • Has owned production model failures
  • Understands cost implications of ML design
  • Challenges architectural assumptions constructively


Anticipated Interview Process

1. Conversational + Architecture Discussion: A live discussion focused on past systems, tradeoffs, and a collaborative diagramming/troubleshooting exercise.

2. Take-Home GitHub Exercise: A practical ML systems exercise evaluating structure, testing, reproducibility, and clarity.

3. DS/MLE Deep Dive: Technical and strategic discussion around platform evolution and leadership approach.

4. CEO Conversation: Focused on long term platform direction and company alignment.



Job-3547897


#LI-Remote
