Principal Site Reliability Engineer
This role reports directly to the CTO and works cross-functionally with Engineering, Data Science, Machine Learning, and Product. The Principal SRE sits at the intersection of infrastructure, ML systems, and platform governance, with broad influence across teams.
The company is evolving toward:
- Event-driven system design
- Container deployments to customer and partner infrastructure
- Reduced architectural rigidity
- Strong internal platform standards
What “Principal” Means
The Principal Site Reliability Engineer role focuses more on architectural leverage than operational work. This role will influence infrastructure, ML operations, governance, and platform strategy.
You will:
- Recommend architectural direction
- Reduce systemic complexity
- Introduce durable patterns
- Identify architectural risk early
- Retire services when necessary
- Define reliability standards across teams
- Shape how the AI platform is delivered to customers
Mission
As Principal Site Reliability Engineer, you will define and operate the architectural backbone of the AI platform. You will report directly to the CTO and work cross-functionally with Engineering, Data Science, Machine Learning, and Product. This role sits at the intersection of infrastructure, ML systems, and platform governance, with broad influence across teams.
In addition to building systems, you will mentor and elevate other engineers in infrastructure best practices, operational rigor, and architectural thinking. You will help establish a culture of reliability, ownership, and continuous improvement across the organization.
You will design and operate a scalable, event-driven, multi-tenant ML infrastructure platform that supports:
- Distributed ML training (Databricks + Ray)
- Containerized product delivery to external customers
- Internal event-driven services across AWS
- Centralized, state-store-driven orchestration
- Governance across adtech integrations and third-party APIs
This role is responsible not only for technical execution, but for shaping how the company thinks about reliability, infrastructure standards, and long-term platform evolution.
What You’ll Own
1. Event-Driven Platform Architecture
You will define the architectural direction for the company’s event-driven platform and lead the build-out of a scalable SRE function to support it. While you will be hands-on in early design and critical implementations, you will not be operating alone. This role is expected to shape and grow the SRE capability over time.
You will:
- Design and implement AWS event-driven systems using:
  - EventBridge / RabbitMQ (where appropriate)
  - MSK / Kafka
  - Kinesis
  - Lambda / Fargate
  - SQS / SNS / Step Functions
- Architect centralized state stores (DynamoDB, Redis, Postgres) that:
  - React to signals
  - Trigger downstream services
  - Maintain system integrity
- Establish architectural standards for:
  - Idempotency
  - Replay safety
  - Event schema governance
  - Operational clarity and traceability
As the platform evolves, you will help build and mentor an SRE team capable of operating and extending these systems. You will define what good looks like in architecture, reliability, and operational excellence and ensure that standards scale beyond any single individual.
Ideally, you have participated in at least one event-driven system build in production and have experience evolving infrastructure from early-stage patterns to durable, team-supported systems.
2. Kubernetes & Control Plane Ownership
- Operate multi-cluster Kubernetes environments in production
- Understand and tune:
  - API server scaling
  - etcd performance
  - RBAC architecture
  - Admission controllers
- Implement:
  - GitOps patterns
  - Progressive delivery
  - Cluster-level security policies
  - Multi-tenant isolation
Bonus if you’ve:
- Built internal developer platforms
- Managed customer-facing container workloads
- Operated ML workloads in Kubernetes
3. Infrastructure as Code, Governance & CI/CD Evolution
- Define Terraform module standards
- Create reusable infrastructure primitives
- Enforce GitHub guardrails (branch protections, CI gates)
- Evolve and standardize CI/CD pipelines to support:
  - Automated infrastructure testing
  - Policy validation (SSO, networking, security posture)
  - Progressive deployment patterns
  - Rollback and release safety mechanisms
- Introduce automated testing standards across:
  - Infrastructure code
  - Deployment pipelines
  - Event schema changes
  - Platform-level integrations
- Standardize:
  - Federated identity management
  - Azure SSO
  - API authentication patterns
  - Key management
  - Resource isolation
You will define how changes move safely from development to production, reducing human error and increasing deployment confidence.
4. External Model Serving Architecture
They are enabling customers to run the models in their own environments, without sending their data to the company.
You will define:
- Model packaging standards (OCI images)
- Secret injection patterns
- Network isolation models
- Telemetry back to the company
- Upgrade and compatibility strategy
- Runtime configuration contracts
This is effectively building a deployable AI product platform.
5. Observability & Reliability Standards
- Define SLIs, SLOs, and error budgets
- Separate ML reliability from infrastructure reliability
- Implement distributed tracing
- Design golden signals for:
  - Event pipeline health
  - Data freshness
  - Model serving reliability
  - Control plane stability
- Define and implement a sustainable operational model, including:
  - On-call structure and rotation design
  - Escalation paths and severity definitions
  - Incident response standards
  - Postmortem processes focused on systemic improvement
  - Clear ownership boundaries between infrastructure, ML systems, and application teams
They use Datadog for observability, but you will define what “good” looks like.
6. Databricks as a Platform
- SSO implementation and maintenance
- Authentication and provisioning
- Terraform-based deployments
- Cluster policies
- Unity Catalog governance
You own Databricks as infrastructure, not as a notebook environment.
Ideal Background
- 10–15+ years of experience designing and operating production infrastructure at scale.
- Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent practical experience. Advanced degrees are welcome but not required.
- Deep AWS architecture expertise across networking, SSO, compute, storage, and event-driven services.
- Experience managing Kubernetes control planes in production environments.
- Experience building or migrating event-driven systems at scale.
- Experience in ML-heavy or data-heavy environments.
- Has replaced or eliminated legacy infrastructure components and simplified system design.
- Has owned real production failures and led postmortems that resulted in systemic improvements.
- Strong architectural judgment and the ability to challenge assumptions constructively.
- Certifications (AWS, Databricks, Kubernetes, etc.) are considered a plus but are not required.
What They’re Looking For
The ideal candidate:
- Has refactored/replaced legacy architectures in the past
- Has migrated systems safely
- Has designed something from zero
- Challenges leadership constructively
Anticipated Interview Process
They highly value architectural roles and aim to evaluate real-world systems thinking rather than trivia or syntax knowledge. The process is designed to be practical and respectful of your time.
- Conversational + Architecture Discussion: A live discussion focused on your past systems, decision-making frameworks, and architectural tradeoffs. This includes a collaborative diagramming exercise centered on distributed systems and platform design.
- Take-Home Systems Exercise (GitHub-Based): You’ll complete a practical take-home exercise delivered via a private GitHub repository. This is designed to evaluate how you think about structure, reliability, testing, and operational clarity.
- CTO Deep Dive: A technical and strategic discussion focused on platform evolution, organizational design, and how you approach infrastructure leadership at scale.
- CEO Conversation: A final conversation centered on company vision, long-term platform direction, and cultural alignment.
Job-3547897