Senior Platform/Infrastructure Engineer
Location: Fully remote (HQ Cambridge, MA)
Hours: 9–5 EST, with 2-day on-site visits every 6 weeks
You’ll design, scale, and maintain the infrastructure and internal developer platforms that power a real-time learning AI at a seed-stage startup. The role blends infrastructure ownership with platform engineering so that AI and product teams can ship quickly and reliably.
Key Responsibilities - Infrastructure
- Maintain production health: performance, reliability, cost efficiency, and security.
- Manage GCP Kubernetes clusters (GKE), networking, storage, and compute resources.
- Handle scaling, resource allocation, and high availability for growing customer demand.
- Refine observability: logs, traces, metrics, dashboards, and alerts.
- Perform security hardening and cost optimization.
Key Responsibilities - Platform Engineering
- Build internal tooling and abstractions for developer productivity.
- Design CI/CD pipelines using GitHub Workflows and ArgoCD.
- Provide self-service environments, internal portals, and deployment systems.
Collaboration & Communication
- Work closely with AI and full-stack teams to optimize system architecture.
- Explain technical concepts and trade-offs clearly to engineers and non-engineers.
- Troubleshoot issues across multiple systems (Python, JavaScript, SQL).
Requirements
- 5+ years of experience operating production cloud environments at scale.
- Strong familiarity with GCP (primary) and some AWS experience.
- Experience with Kubernetes (GKE), node pools, and memory-intensive jobs.
- Working knowledge of CI/CD systems (GitHub Workflows + ArgoCD).
- Exposure to observability tools (Datadog), databases (Cloud SQL, ClickHouse, Bigtable), and cloud services.
Skills & Qualities
- Strong analytical and problem-solving ability.
- Clear, collaborative communication.
- Curiosity and ownership mentality.
- Fluent in reading and debugging code across Python, JavaScript, and SQL.
Technical Stack
- Cloud: GCP (primary), AWS (secondary)
- Kubernetes: GKE, multiple node pools
- CI/CD: GitHub Workflows + ArgoCD
- Data: Cloud SQL, ClickHouse, Bigtable, GCS, Dataflow
- Networking: Cloudflare Workers, Durable Objects, WebSocket communication
- Monitoring: Datadog
- Environments: Production, Staging, Integration, Development