Engineer II - DevOps
Neighborly
Job Description
About Neighborly: Neighborly is a local network of home service brands that connects you with vetted local experts. Our family of service professionals works to rigorous quality standards to repair, maintain, and enhance your home. With pros living in your community, scheduling is quick and convenient.
Engineer II (DevOps Engineer) Role Summary: We are seeking an Engineer II (SRE) to join our Platform Engineering team. This role emphasizes reliability, observability, and automation while contributing to a shared internal platform that enables product teams to deploy and operate services safely and efficiently. You will work at the intersection of cloud infrastructure, Kubernetes, CI/CD, DevSecOps, and observability, helping define and operate a reliable platform using SRE principles such as SLOs, error budgets, and blameless incident response. This is an excellent role for someone early in their SRE/Platform career who wants to grow from keeping systems running into engineering for reliability at scale.

Key Responsibilities

Reliability & SRE Practices:
Operate and improve platform reliability using SRE concepts (SLIs, SLOs, error budgets).
Support incident response, participate in on-call rotations, and contribute to blameless postmortems.
Identify recurring reliability risks and help drive remediation through automation and design improvements.
Track and improve service health indicators such as latency, availability, and error rates.

Observability & Production Visibility:
Configure and maintain Datadog for: APM (tracing and performance analysis); centralized logging; infrastructure and Kubernetes monitoring; dashboards, alerts, and SLOs.
Help define meaningful alerts focused on customer impact rather than raw infrastructure noise.
Partner with application teams to improve instrumentation and production visibility.

Platform & Cloud Engineering:
Support the operation of cloud workloads on AWS, following reliability and security best practices.
Assist in managing Kubernetes clusters, including deployment patterns, scaling behavior, and failure handling.
Contribute to platform capabilities that provide golden paths for application teams.

Infrastructure as Code & Automation:
Build and maintain cloud and platform infrastructure using Terraform.
Help automate environment provisioning, configuration drift detection, and operational tasks.
Contribute to reusable platform modules that enable consistent and reliable environments.

CI/CD & Safe Delivery (DevSecOps):
Support Azure DevOps pipelines for build, test, and deployment automation.
Integrate DevSecOps controls into delivery workflows (security scanning, secrets management, policy checks).
Help enable safer deployments through practices such as progressive delivery, rollback automation, and validation gates.

Collaboration & Platform Enablement:
Work closely with application teams to improve service reliability and operational readiness.
Contribute to platform documentation, runbooks, operational standards, and self-service guides.
Participate in sprint planning, reliability reviews, and continuous improvement initiatives.

Required Skills & Experience:
Experience with AWS (e.g., VPC, EKS, EC2, IAM, S3).
Hands-on experience with Kubernetes, including deployments and basic troubleshooting.
Practical experience using Terraform for infrastructure provisioning.
Experience with Azure DevOps for CI/CD pipelines.
Experience with Datadog for monitoring, logging, alerting, or APM.
Understanding of DevSecOps concepts and secure software delivery.
Familiarity with Linux systems and basic networking concepts.
Basic scripting skills (Bash, PowerShell, or Python).

Preferred Qualifications:
Exposure to SRE practices (SLOs, SLIs, error budgets, incident management).
Hands-on experience with Datadog SLOs, alerts, and dashboards.
Familiarity with Kubernetes or cloud reliability patterns (autoscaling, health checks, graceful degradation).
Experience working in multi-environment or multi-tenant platforms.
Exposure to monitoring tools beyond Datadog (e.g., CloudWatch, Prometheus).
Relevant certifications (AWS, Kubernetes, Terraform, Datadog).

What Success Looks Like in This Role:
Clear, actionable observability across the platform using Datadog.
Improved detection and faster resolution of production incidents (lower MTTD/MTTR).
Reliable, repeatable infrastructure and application deployments.
Growing adoption of SRE practices across engineering teams.
A platform that enables teams to ship quickly without sacrificing reliability.