[LPS] Site Reliability Engineer
LPS
Job Description
Job Summary
The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.

The role operates as a centralised reliability function across application, infrastructure, and vendor support layers, governing operational activities to ensure that incidents, changes, and patching activities are executed without impacting service stability or SLA commitments.

The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.
Key Responsibilities
1. Reliability & Service Assurance
   - Own end-to-end service reliability, including availability, performance, and system stability
   - Define and track reliability metrics (e.g., uptime, latency, error rates)
   - Ensure SLA compliance through proactive monitoring and operational governance
   - Establish service health indicators and early warning mechanisms
2. Monitoring & Observability
   - Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
   - Define alert thresholds and reduce alert noise to improve signal quality
   - Develop dashboards and reporting for real-time visibility of system health
   - Continuously enhance observability coverage across services
3. Incident Management & RCA
   - Lead major incident management (P1/P2) as incident commander
   - Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
   - Coordinate with Application Engineers and Vendors for issue resolution
   - Drive preventive and corrective actions to reduce incident recurrence
4. Change & Patch Governance
   - Assess operational risks associated with changes, releases, and patching activities
   - Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
   - Perform pre- and post-change validation to ensure system stability
   - Govern Go/No-Go decisions and support rollback planning in case of service degradation
5. Performance & Capacity Management
   - Monitor and optimise system performance across application and infrastructure layers
   - Conduct capacity planning and forecasting to ensure scalability and resilience
   - Identify and address performance bottlenecks proactively
6. Automation & Continuous Improvement
   - Drive automation of operational processes, including monitoring, recovery, and validation
   - Implement self-healing and resilience mechanisms where applicable
   - Develop and maintain operational runbooks and automation scripts
   - Continuously improve system reliability through engineering practices
7. Collaboration & Governance
   - Work closely with Application Engineers (L1.5) for execution of operational activities
   - Collaborate with Vendors (L2) for defect resolution and product-level fixes
   - Ensure compliance with governance, security, and audit requirements
   - Support service reviews, reporting, and continuous improvement initiatives
Requirements
Core Requirements
- Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
- Strong understanding of cloud platforms (preferably AWS)
- Experience with monitoring and observability tools
- Strong troubleshooting capability across application and infrastructure layers
- Experience in incident management and root cause analysis
- Familiarity with ITIL processes (incident, problem, change management)
Preferred
- Experience in a system integrator or managed services (Day 2 operations) environment
- Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
- Experience with automation and scripting (Python, PowerShell, etc.)
- Knowledge of performance tuning and capacity planning
Key Competencies
- Strong analytical and problem-solving skills
- Ability to lead during high-pressure incidents
- Structured and governance-driven mindset
- Proactive approach to reliability and continuous improvement
- Strong stakeholder coordination across internal teams and vendors