Staff Site Reliability Engineer

New York City, NY

We are looking for a Staff Site Reliability Engineer to help shape and scale a rapidly growing platform. In this role you will act as a senior individual contributor and work closely with engineering and product teams to design, build, and operate systems that are reliable, observable, and developer-friendly.

You will play a key role in defining infrastructure strategy, improving how software is delivered, and elevating operational excellence across the organization. This is a high-impact opportunity for someone who enjoys tackling complex systems challenges, influencing architecture, and raising the reliability bar.

What you will own:

You will lead the direction and evolution of cloud infrastructure while modernizing compute and runtime environments. You will own CI/CD systems with a focus on speed, safety, automation and strong developer experience. You will establish and improve observability across metrics, logs and tracing, including alerting strategy, dashboards and service level objectives.
You will define and evolve reliability standards such as SLIs, SLOs and error budgets, while strengthening incident response processes and postmortems to build a strong reliability culture. You will lead high-severity incidents and ensure clear, actionable follow-ups.
In close partnership with engineering teams, you will design resilient and scalable systems and build automation that reduces operational overhead and risk. You will also mentor engineers and help drive best practices across teams.

Qualified candidates will have:

10+ years of experience in site reliability, infrastructure, or backend engineering.
Proven experience operating production systems in cloud environments and driving platform-level improvements.
Strong software engineering skills in one or more modern programming languages.
Demonstrated expertise running distributed systems at scale.
Deep familiarity with cloud platforms, observability tooling, and CI/CD systems.
A systems mindset with the ability to assess risk, design rollback strategies, and manage blast radius and feedback loops.
Experience defining and improving reliability practices such as SLIs, SLOs, and error budgets.
Passion for building CI/CD pipelines and environments that are fast, reliable, and self-service.
Ability to influence through clarity and collaboration rather than control.
Pragmatic approach with a focus on long-term system health and scalability.
Strong focus on learning from failure and continuously improving processes.
Clear communicator who works effectively across teams.
Comfortable navigating ambiguity and setting direction in fast-paced environments.

This is an onsite position in NYC offering competitive base, equity and benefits.
