Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities.
Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together.
We are seeking an accomplished DevOps Engineer to design, build, and operate secure, scalable, and automated platforms that support advanced AI/ML and Generative AI workloads across Azure and AWS, with solid capability to interoperate with GCP. You will own CI/CD, infrastructure-as-code, container orchestration, observability, and reliability engineering, partnering with Data Science and Security teams to deliver responsible, reliable AI services for healthcare analytics.
Role Summary
We're looking for a DevOps Engineer to design, build, and operate secure, scalable, and cost-efficient platform capabilities for AI/ML and GenAI workloads on Azure and AWS.
- Manage and operate cloud infrastructure to ensure reliability, scalability, and cost efficiency of applications and AI services
- Plan and execute CI/CD pipelines across the lifecycle: plan → code → build → test → stage → release → config → monitor
- Onboard applications to the DevOps toolchain; standardize golden paths and reusable Terraform modules and Helm charts
- Automate testing and deployments end-to-end; enforce trunk-based development and automated quality gates
- Collaborate with developers to integrate application code with OS/runtime and production infrastructure (container images, base OS hardening, dependencies)
- Provide timely support on DevOps tooling; resolve incidents and requests within SLAs and follow the escalation matrix; perform RCA and implement durable fixes
Primary Responsibilities:
Platform, Automation & Reliability
- Design, provision, and operate production-grade AKS (Azure) and EKS (AWS) clusters; implement autoscaling, multi-AZ/region topologies, and safe upgrades
- Implement Infrastructure-as-Code with Terraform/Terragrunt and Helm; enforce GitOps with Argo CD or Flux for declarative, auditable changes
- Build CI/CD with GitHub Actions and Azure DevOps; also support Jenkins, GitLab CI/CD; manage artifact provenance and deployment strategies (blue/green, canary)
- Establish observability using OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch, Azure Monitor, and CloudWatch; define SLOs/SLIs
- Engineer networking and traffic controls: ingress controllers, API gateways (NGINX/Envoy/Kong), service mesh (Istio/Linkerd), and WAFs; implement rate limiting and DDoS protections

AI/ML & GenAI Enablement
- Operate AI training/inference platforms on Azure Machine Learning and Amazon SageMaker; manage model and data artifacts with MLflow/registries
- Operationalize RAG/LLM services with Azure OpenAI and AWS Bedrock; standardize serving via KServe or managed endpoints; integrate vector databases
- Implement data/model lineage, drift detection, shadow testing, and automated rollback based on health and evaluation signals

Security, Compliance & Governance
- Apply Zero-Trust and least-privilege access (Azure AD, AWS IAM); implement RBAC, workload identity, network segmentation, and pod security standards
- Centralize secrets with Azure Key Vault and AWS Secrets Manager/Parameter Store; implement rotation and access auditing
- Maintain SBOMs and image signing with attestations; prevent deployment of non-compliant artifacts; automate compliance evidence collection

Operations & Support
- Run on-call and incident response with playbooks and blameless postmortems; drive MTTR reduction and reliability improvements
- Provide timely support across multiple platforms; ensure customer satisfaction and SLA adherence; follow escalation matrix for complex cases
- Implement Infrastructure-as-Code with Terraform and Deployment Manager
- Build CI/CD pipelines with GitHub Actions (and Cloud Build where applicable)
- Containerize and deploy applications using Docker and Kubernetes (GKE)
- Automate operational tasks using Linux, Bash, and Python scripting
- Monitor systems with Prometheus, Grafana, Splunk, and Kibana

DevOps & SRE Competencies
- Monitoring and logging solutions: Prometheus, Grafana, ELK/Elastic Stack, OpenSearch, Splunk, Kibana
- Understanding of security best practices and compliance automation integrated into pipelines

- Comply with the terms and conditions of the employment contract, company policies and procedures, and any and all directives (such as, but not limited to, transfer and/or re-assignment to different work locations, change in teams and/or work shifts, policies in regards to flexibility of work benefits and/or work environment, alternative work arrangements, and other decisions that may arise due to the changing business environment). The Company may adopt, vary or rescind these policies and directives in its absolute discretion and without any limitation (implied or otherwise) on its ability to do so.

Required Qualifications:
- Graduate degree or equivalent experience
- 6+ years in DevOps/Platform/SRE with 3+ years of operating Kubernetes in production
- Hands-on depth with Azure and AWS: AKS/EKS, Azure ML/SageMaker, ACR/ECR, IAM/Key Vault/Secrets Manager, and observability (Azure Monitor/CloudWatch)
- Hands-on experience in DevOps and CI/CD with a solid track record of successful project delivery
- Experience applying SRE principles, including SLOs/SLIs, error budgets, and availability management
- Deep knowledge of containerization (Docker) and orchestration (Kubernetes)
- Expertise in Infrastructure-as-Code with Terraform (and Ansible where applicable)
- Solid scripting and automation skills: Python and Bash
- Proficiency with Terraform/Terragrunt, Helm, GitOps (Argo CD/Flux); CI/CD with GitHub Actions and Azure DevOps; exposure to Jenkins/GitLab/CircleCI
- Proficiency with CI/CD tools: Jenkins, GitHub Actions, Azure DevOps, GitLab CI/CD, CircleCI
- Proven solid troubleshooting, root-cause analysis, and platform ownership; excellent communication skills
Preferred Qualifications:
- Cloud certifications (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer)
- LLMOps/RAG experience with Azure OpenAI and AWS Bedrock; vector databases; evaluation pipelines
- Knowledge of service mesh (Istio/Linkerd), API gateways (NGINX/Envoy/Kong), and streaming (Kafka/MSK/Event Hubs)
- Healthcare data privacy/compliance familiarity; audit evidence automation
- Knowledge of Representative Tech Stack:
  - Azure: AKS, Azure Machine Learning, Azure OpenAI, ACR, Key Vault, Azure Monitor/Application Insights, App Gateway, Data Factory
  - AWS: EKS, SageMaker, Bedrock, ECR, IAM/KMS, Secrets Manager, CloudWatch, ALB/NLB
  - GCP: GKE, Compute Engine, VPC, Cloud IAM, Cloud Run, Cloud Functions, Cloud DNS, Cloud Monitoring, MIGs
  - DevOps & Infra: Terraform/Terragrunt, Helm, Argo CD/Flux, Docker, KServe, NGINX/Envoy
  - CI/CD: GitHub Actions, Azure DevOps, Jenkins, GitLab CI/CD, CircleCI
  - Security: OPA/Gatekeeper, Kyverno, Trivy, Snyk, Checkov, SonarQube, Cosign/SBOM (SPDX/CycloneDX)
  - Observability: OpenTelemetry, Prometheus, Grafana, ELK/Elastic Stack, OpenSearch, Azure Monitor, CloudWatch, Splunk, Kibana
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone - of every race, gender, sexuality, age, location and income - deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes.
We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes - an enterprise priority reflected in our mission. #AIForBetterHealthcare