ABOUT THE TEAM
Zalando Research is at the forefront of driving innovation in fashion e-commerce. We are a dynamic and diverse group of scientists and ML engineers dedicated to solving complex challenges through cutting-edge machine learning and AI. Our work directly impacts the experience of millions of Zalando customers and empowers internal teams with state-of-the-art tools and capabilities. We foster a collaborative environment where ambitious research ideas are transformed into impactful solutions. As a Senior Principal MLOps Engineer, you will play a pivotal role in supercharging the productivity and innovation potential of our applied scientists by architecting and delivering world-class MLOps infrastructure. You will be the most senior engineer in this area. As we raise the bar in ML research, we are evolving our infrastructure to eliminate friction and let our scientists focus on pushing the science forward by making MLOps seamless, reproducible, and future-proof.
WHERE YOUR EXPERTISE IS NEEDED
Ensure Persistent, Secure and Reproducible R&D Environments: Tackle bottlenecks and improve scalability, resilience, and cost-effectiveness in distributed training workloads across our research teams. Guarantee scientists can resume work across sessions and share exact research setups, enabling robust experiment tracking and ease of collaboration.
Curate the R&D ML Stack: Evaluate, select, and integrate the best-in-class technologies for our end-to-end R&D ML stack, ensuring our scientists have access to the most powerful tools all while hardening the security of our cloud setup.
Enable Advanced Visualization: Implement and manage streamlined setup processes for 3D GPU-backed remote desktops in the cloud with persistent storage and seamless RDP/VNC experiences, providing scientists with powerful interactive research environments backed by the latest GPUs.
Innovate with LLMs: Stay at the cutting edge of Large Language Model (LLM) advancements and spearhead their integration into the Applied Scientists' UX.
WHAT WE ARE LOOKING FOR
Proven MLOps Leadership: Extensive experience (6+ years) in designing, building, and maintaining scalable, reliable, and performant MLOps infrastructure, particularly on AWS with a strong focus on GPU-accelerated compute clusters.
Passion for Empowering Scientists: Always looking for ways to save users’ time, eliminate skill barriers, and amplify scientific impact.
HPC & GPU Optimization Expert: Deep understanding of HPC architectures, job scheduling, GPU utilization, and cost optimization strategies in a cloud environment.
Containerization, Orchestration & Technology Expert: Strong hands-on experience with Docker, EC2, AMIs, EFS, Lustre, S3, JupyterHub, SQL, Superset, Databricks, SageMaker, Slurm, Ray, Kubeflow, Kubernetes (EKS), Nix, Devbox, and other containerization, environment isolation, and orchestration technologies for ML workloads.
Infrastructure as Code (IaC) and Automation-First Mindset: Proficiency with IaC tools like CloudFormation, CRDs, and Terraform to automate infrastructure provisioning and management, along with strong skills in CI/CD.
Champion of Reproducibility: A passion for building systems that ensure experimental reproducibility, environment consistency, and end-to-end automation of ML workflows. Experience with tools like MLflow, Weights & Biases, or similar for tracking, sharing, and deployment. You’re able to provide both ephemeral and persistent ML environments, depending on the use case.
Excellent Communicator & Collaborator: Ability to articulate complex technical concepts clearly to diverse audiences and work effectively with research scientists, engineers, heads, directors and product managers to understand their needs and drive solutions.
Able to understand ML-related scientific challenges and translate them into ergonomic, reliable MLOps solutions for diverse user groups.
Problem Solver & Strategic Thinker: A proactive approach to identifying pain points, devising innovative solutions, and thinking strategically about the long-term evolution of the MLOps landscape at Zalando Research.
PERKS AT WORK
Culture of trust, empowerment and constructive feedback, open source commitment, meetups, game nights, 70+ internal technical and fun guilds, knowledge sharing through tech talks, internal tech academy and blogs, product demos, parties & events.
Competitive salary, employee share shop, 40% Zalando shopping discount, discounts from external partners, centrally located offices, public transport discounts, municipality services, great IT equipment, flexible working times, additional holidays and volunteering time off, free beverages and fruits, diverse sports and health offerings.
Extensive onboarding, mentoring and personal development opportunities and an international team of experts.
Relocation assistance for internationals, PME family service and parent & child rooms* (*available in selected locations).
We celebrate diversity and are committed to building teams that represent a variety of backgrounds, perspectives and skills. All employment is decided on the basis of qualifications, merit and business need.