DevOps Engineer
VuMedi Inc.
Job Description
About Vumedi: Vumedi is the largest video education platform for doctors worldwide, dedicated to advancing medical education through innovative video-based learning. Our mission is to empower healthcare professionals by providing them with access to the latest clinical knowledge and surgical techniques from experts around the globe. We curate a vast library of high-quality educational content, enabling users to enhance their skills, stay informed about industry trends, and improve patient outcomes.
We are headquartered in Oakland, CA, and have additional offices in Minneapolis, MN, and Zagreb, Croatia. We're hiring a Senior/Staff/Principal DevOps Engineer to lead the development of our digital platform and products at this critical stage of Vumedi's growth.

Why join Vumedi right now?
- Build technology that matters in a fast-scaling Silicon Valley digital healthcare company: Your work directly impacts how doctors across the world learn and make decisions that save lives.
- Grow as we grow: Be part of a company in an accelerated growth phase, where expanding teams, products, and markets create real opportunities for ownership, leadership, and career progression.
- Build with AI: Work on applied LLM systems - from intelligent search to AI-driven content agents - and shape how AI transforms medical knowledge delivery.
- Own your craft end-to-end: Take full responsibility for building systems that scale globally and power mission-critical workflows.
- Collaborate globally: Join a world-class team of passionate engineers on a modern tech stack that will further drive your career development.
- Have real product impact: Influence the direction of product development by collaborating closely with product and leadership teams.
About the role: We are looking for a DevOps Engineer to join our engineering team and take ownership of our infrastructure, deployment processes, and overall platform reliability. You will work closely with backend and data teams to support a growing video and data platform used by millions of healthcare professionals worldwide. In this role, you will focus on improving our CI/CD pipelines, system reliability, and developer experience, while helping scale our cloud infrastructure in a secure and cost-efficient way.
You will work extensively with AWS services (compute, storage, networking, IAM, monitoring) and help ensure our systems are reliable, observable, and well-architected. You'll also support and enable emerging AI/ML and LLM-powered systems used for large-scale medical content processing, helping build and operate the infrastructure required for these workloads. This includes improving data pipelines, optimizing resource usage, and ensuring production-grade reliability of AI-driven services.
This is a high-impact role with a broad scope, from supporting production systems and data pipelines to driving long-term improvements in how we build, deploy, and operate our platform, with strong ownership and autonomy in shaping DevOps practices.

What you will do:
- Own and improve our infrastructure, CI/CD pipelines, and deployment processes across multiple environments
- Work with AWS services (compute, storage, networking, IAM, monitoring) to ensure scalable, secure, and reliable systems
- Collaborate closely with backend and data teams to support production systems, data pipelines, and overall platform reliability
- Continuously improve developer experience by streamlining workflows, reducing friction, and enabling faster, safer deployments
- Contribute to improving security practices, access control, and compliance of our infrastructure
- Automate infrastructure and workflows using Python
- Improve observability by implementing and maintaining monitoring, logging, and alerting systems
- Troubleshoot production issues, participate in incident response, and implement long-term fixes to improve system stability
- Identify and drive improvements in performance, scalability, and cost efficiency across the platform
- Support and scale AI/ML and LLM-based systems, ensuring reliable infrastructure for data processing and content classification workloads

Who you are:
- You have 5+ years of experience in DevOps, SRE, or infrastructure engineering, with a strong focus on cloud-native environments (preferably AWS)
- You have managed cloud infrastructure (networking, IAM, compute, storage) with a strong understanding of security best practices and cost optimization
- You have experience building and maintaining CI/CD pipelines to support rapid, reliable software delivery across multiple environments
- You are comfortable writing Python for automation, scripting, and building internal tooling to improve infrastructure and developer workflows
- You have a strong understanding of monitoring, logging, and observability (e.g., Datadog, Prometheus, CloudWatch), and you proactively identify and resolve issues
- You are comfortable debugging production issues across systems and collaborating with engineering teams to resolve them
- You are proactive, take ownership, and enjoy working in environments with high autonomy and evolving processes
- You communicate clearly and collaborate effectively with engineers, product managers, and other stakeholders
- You are curious and motivated to learn, especially in areas like AI/ML infrastructure and large-scale systems

Required Qualifications:
- 5+ years of experience in DevOps, Site Reliability Engineering, or infrastructure-focused roles
- Proven experience designing and operating scalable, reliable, and secure cloud infrastructure (preferably AWS) in production environments
- Strong understanding of cloud security best practices (IAM, network security, secrets management), preferably within AWS
- Proficiency in Python for automation, scripting, and tooling
- Hands-on experience building and maintaining CI/CD pipelines
- Experience with monitoring, logging, and alerting tools (e.g., Datadog, CloudWatch, Prometheus)
- Experience working in a Linux-based environment
- Ability to drive infrastructure and DevOps strategy, balancing scalability, reliability, and cost
- Experience working cross-functionally and influencing engineering teams on best practices and architectural decisions
- Strong ownership mindset with the ability to operate autonomously in ambiguous environments

Preferred Qualifications:
- Experience supporting or scaling AI/ML or LLM-based systems in production
- Experience with containerized applications (Docker) and familiarity with orchestration concepts (Kubernetes or ECS is a plus)
- Familiarity with Infrastructure as Code principles (e.g., Terraform) and experience implementing Infrastructure as Code from scratch in existing environments
- Experience working with or supporting backend systems and data platforms (e.g., Postgres; Airflow is a plus)
- Background in backend engineering or software development
- Experience working in a fast-paced startup or scale-up environment
- Experience leading and mentoring engineers, while contributing to team-wide best practices

This is a hybrid role, working 3 days a week (Monday, Wednesday, and Friday) in our Oakland office.