Senior Platform Engineer
Cambridge, MA

Senior Platform/Infrastructure Engineer

Location: Fully remote (HQ Cambridge, MA)

Hours: 9–5 EST, with 2-day on-site visits every 6 weeks


You’ll be responsible for designing, scaling, and maintaining the infrastructure and internal developer platforms that power a real-time learning AI at a seed-stage startup. The role blends infrastructure ownership with platform engineering to enable AI/product teams to ship quickly and reliably.


Key Responsibilities - Infrastructure

  • Maintain production health: performance, reliability, cost efficiency, and security.
  • Manage GCP Kubernetes clusters (GKE), networking, storage, and compute resources.
  • Handle scaling, resource allocation, and high availability for growing customer demand.
  • Refine observability: logs, traces, metrics, dashboards, and alerts.
  • Perform security hardening and cost optimization.


Key Responsibilities - Platform Engineering

  • Build internal tooling and abstractions for developer productivity.
  • Design CI/CD pipelines using GitHub Workflows and ArgoCD.
  • Provide self-service environments, internal portals, and deployment systems.


Collaboration & Communication

  • Work closely with AI and full-stack teams to optimize system architecture.
  • Explain technical concepts and trade-offs clearly to engineers and non-engineers.
  • Troubleshoot issues across multiple systems (Python, JavaScript, SQL).


Requirements

  • 5+ years in production cloud environments at scale.
  • Strong familiarity with GCP (primary) and some AWS experience.
  • Experience with Kubernetes (GKE), node pools, and memory-intensive jobs.
  • Working knowledge of CI/CD systems (GitHub Workflows + ArgoCD).
  • Exposure to observability tools (Datadog), databases (Cloud SQL, ClickHouse, Bigtable), and cloud services.


Skills & Qualities

  • Strong analytical and problem-solving ability.
  • Clear, collaborative communication.
  • Curiosity and ownership mentality.
  • Fluency in reading and debugging code across Python, JavaScript, and SQL.


Technical Stack

  • Cloud: GCP (primary), AWS (secondary)
  • Kubernetes: GKE, multiple node pools
  • CI/CD: GitHub Workflows + ArgoCD
  • Data: Cloud SQL, ClickHouse, Bigtable, GCS, Dataflow
  • Networking: Cloudflare Workers, Durable Objects, WebSocket communication
  • Monitoring: Datadog
  • Environments: Production, Staging, Integration, Development 

