Writer

Platform engineer, MLOps

Posted 9 Days Ago

Remote

2 Locations

Senior level

As a Platform Engineer in MLOps, you will design CI/CD pipelines, manage Kubernetes clusters, optimize performance, and support large-scale software applications.

📐 About this role

As a Platform Engineer, MLOps, you will deploy and manage the cutting-edge infrastructure that underpins our AI/ML operations. You will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. You will also set up and maintain monitoring, logging, and alerting systems for extensive training runs and client-facing APIs. You will ensure that training environments are available and efficiently managed across multiple clusters, and you will improve our containerization and orchestration systems built on tools such as Docker and Kubernetes.

This role demands a proactive approach to maintaining large Kubernetes clusters, optimizing system performance, and providing operational support for our suite of software solutions. If you are driven by challenges and motivated by the continuous pursuit of innovation, this role offers the opportunity to make a significant impact in a dynamic, fast-paced environment.

🦸🏻‍♀️ Your responsibilities:

Work closely with AI/ML engineers and researchers to design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.
Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.
Ensure training environments are consistently available and prepared across multiple clusters.
Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.
Operate and oversee large Kubernetes clusters with GPU workloads.
Improve reliability, quality, and time-to-market of our suite of software solutions.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
Provide primary operational support and engineering for multiple large-scale distributed software applications.

⭐️ Is this you?

You have professional experience with:
- Model training
- Hugging Face Transformers
- PyTorch
- vLLM
- TensorRT
- Infrastructure as code tools like Terraform
- Scripting languages such as Python or Bash
- Cloud platforms such as Google Cloud, AWS or Azure
- Git and GitHub workflows
- Tracing and Monitoring
You are familiar with high-performance, large-scale ML systems
You have a knack for troubleshooting complex systems and enjoy solving challenging problems
You are proactive in identifying problems, performance bottlenecks, and areas for improvement
You take pride in building and operating scalable, reliable, secure systems
You are familiar with monitoring tools such as Prometheus, Grafana, or similar
You are comfortable with ambiguity and rapid change

Preferred skills and experience:

5+ years building core infrastructure
Experience running inference clusters at scale
Experience operating orchestration systems such as Kubernetes at scale

🍩 Benefits & perks (US Full-time employees)

Generous PTO, plus company holidays
Medical, dental, and vision coverage for you and your family
Paid parental leave for all parents (12 weeks)
Fertility and family planning support
Early-detection cancer testing through Galleri
Flexible spending account and dependent FSA options
Health savings account for eligible plans with company contribution
Annual work-life stipends for:
- Home office setup, cell phone, internet
- Wellness stipend for gym, massage/chiropractor, personal training, etc.
- Learning and development stipend
Company-wide off-sites and team off-sites
Competitive compensation, company stock options, and 401(k)

Writer is an equal-opportunity employer and is committed to diversity. We don't make hiring or employment decisions based on race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other basis protected by applicable local, state or federal law. Under the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

By submitting your application on the application page, you acknowledge and agree to Writer's Global Candidate Privacy Notice.

Top Skills

AWS, Azure, Bash, Docker, GCP, Grafana, Hugging Face Transformers, Kubernetes, Prometheus, Python, PyTorch, TensorRT, Terraform, vLLM
