About the Role
We are seeking an experienced Site Reliability Engineer (SRE) / Infrastructure Engineer to join our Platform Engineering team. This role requires a hands-on technologist with deep expertise in cloud infrastructure, Kubernetes, DevOps, and SRE practices to ensure the performance, availability, scalability, and security of mission-critical platforms.
Responsibilities:
- Design, implement, and maintain highly available, scalable, and secure infrastructure across AWS, Azure, and GCP.
- Build and automate CI/CD pipelines using Azure DevOps, Jenkins, Ansible Tower, and Terraform.
- Manage containerized applications using Kubernetes, Docker, AKS, EKS, and GKE.
- Develop and enforce SRE best practices including monitoring, incident response, capacity planning, and reliability automation.
- Implement Infrastructure as Code (IaC) using Terraform, Bicep, ARM templates, and CloudFormation.
- Collaborate with development, QA, and security teams to integrate DevSecOps pipelines.
- Use observability tools (e.g., ELK, Kibana) for metrics, logging, and alerting.
- Manage machine identities and the certificate and key lifecycle using Venafi and PKI-based TLS automation.
- Lead root cause analysis and provide reliable fixes for complex infrastructure issues.
- Participate in architectural reviews, security audits, and disaster recovery planning.
Must-Have:
- 10+ years of experience in infrastructure, DevOps, or SRE roles within enterprise-grade environments.
- Proven experience with AWS, Azure, and GCP cloud services.
- Hands-on expertise in Kubernetes (AKS/EKS/GKE), Helm, and Docker.
- Strong scripting skills in Python, Bash, and PowerShell.
- Experience with Terraform and Ansible.
- Familiarity with CI/CD tools: Jenkins, Azure DevOps, Octopus Deploy, and GitHub Actions.
- In-depth knowledge of Linux, Windows Server, and hybrid cloud environments.
- Solid understanding of networking, load balancing (NGINX, F5, ELB), and firewalls.
- Knowledge of security best practices and tools (e.g., IAM, TLS, PKI, SIEM, WAF, DAST/SAST).
Nice-to-Have:
- Experience with Apache Airflow, Snowflake, and big data pipelines.
- Familiarity with SRE maturity models and service level objectives, indicators, and agreements (SLOs, SLIs, SLAs).