Senior Site Reliability Engineer (Hiring Immediately)
Akuity
Job Description
About Akuity
With the move to the cloud, Kubernetes has become widely adopted by DevOps and Platform Engineering teams, but it has also added complexity. While scaling Kubernetes at Intuit, the Akuity founders started building Argo CD in order to streamline the adoption of Kubernetes. Argo CD helps developers own, understand and deploy their K8s deployments via GitOps.
Today, Argo CD is the third most popular project in the CNCF (Cloud Native Computing Foundation) and is used by 70% of companies who are using Kubernetes in production. The list of Argo CD users includes companies like Intuit, BlackRock, Tesla, Major League Baseball, Peloton, and many more. The team founded Akuity in 2021 to enable enterprises to ship software faster and more reliably with modern GitOps best practices.
The Akuity Platform enables teams to manage development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane. Trusted by top companies around the globe, the Akuity Platform provides the only end-to-end GitOps platform for enterprises. Our mission is to simplify the software delivery process so that DevOps and Platform Engineering teams can move fast and deploy code effortlessly without the fear of breaking things.

The Role
We are looking for a Senior SRE to help us keep the Akuity platform running at the level our enterprise customers expect.
This is a high-ownership role; you won't just respond to incidents, you'll shape how we define and defend reliability across the entire platform. You'll work closely with engineering, infrastructure, and product to build the systems and culture that let us scale with confidence.

What You'll Own

Platform Reliability & SLAs
Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
Partner with engineering teams to build reliability into new features before they ship to production

On-Call & Incident Response
Participate in an on-call rotation and act as incident commander for high-severity production events
Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
Drive improvements to alerting fidelity; reduce noise, increase signal, eliminate toil
Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items

What We're Looking For

Required
5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
Solid scripting and automation skills; Go, Python, Bash, or similar; you automate what you touch
Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
Live within US time zones (Pacific through Eastern), including Canada and other regions

Strong Advantage
Experience with Argo CD, Kargo, or GitOps-based delivery workflows
Familiarity with multi-region, multi-cluster Kubernetes deployments
Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS)
Background operating infrastructure for other platform or developer tooling companies

Our Stack
Kubernetes (EKS): multi-region, enterprise-grade clusters serving Argo CD and Kargo workloads
AWS: primary cloud provider across all production and DR environments
Argo CD & Kargo: GitOps delivery tools we build and run ourselves
Prometheus, Grafana, and OpenTelemetry for observability
Terraform and GitOps-driven infrastructure management

What We Offer
Competitive compensation, commensurate with experience
Equity participation in a well-funded, growing company
Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions
Home office stipend and equipment budget
Flexible time off and a culture that respects it
Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here

US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.