Roadie

Lead Site Reliability Engineer

Posted 8 Days Ago

Remote

Senior level

Lead and mentor SRE teams to enhance platform reliability, optimize software delivery, and manage Kubernetes infrastructure while resolving incidents and driving operational excellence.

The summary above was generated by AI

Roadie, a UPS company, is a leading logistics and delivery platform that helps businesses tackle the complexities of modern retail with unmatched delivery coverage, flexibility and visibility. Reaching 97% of U.S. households across more than 30,000 zip codes — from urban hubs to rural communities — Roadie provides seamless, scalable solutions that meet a variety of delivery needs.

With a network of more than 310,000 independent drivers nationwide, Roadie offers flexible delivery solutions that make complex logistics challenges easy, including solutions for local same-day delivery, delivery of big and bulky items, ship-from-store and DC-to-door.

Roadie is seeking a Lead Site Reliability Engineer to join our growing Technical Operations Team. We are looking for a leader with a proven track record of managing high-performing SRE teams in high-availability, mission-critical environments. The ideal candidate is a strategic problem solver with deep expertise in site reliability best practices, DevOps principles, AWS and GCP, Kubernetes, and automation. You will play a key role in driving reliability, scalability, and operational excellence across our platform.

What You'll Do

Lead and mentor teams focused on enhancing platform reliability, optimizing uptime, and improving software delivery, observability, and infrastructure operations
Architect, maintain, and optimize production and non-production Kubernetes clusters (EKS), as well as Elasticsearch (ES), MSK, RDS, and ElastiCache (Redis) clusters
Design, deploy, and manage monitoring and logging solutions using Prometheus, Loki, Thanos, Grafana, OpenTelemetry, and New Relic
Strategize and collaborate with cross-functional teams to proactively identify bottlenecks, optimize resource utilization, and prevent system failures
Define and enforce Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to drive reliability improvements
Automate and streamline operational tasks, reducing toil and increasing efficiency across engineering teams
Plan and forecast service capacity and demand, optimize costs, and fine-tune system performance
Lead troubleshooting initiatives and post-mortems, and resolve production and non-production incidents to ensure high availability and performance
Participate in and manage a 24/7 on-call rotation, responding to incidents and driving post-mortem improvements
Work non-standard hours on occasion to facilitate production upgrades or deployments

Technology We're Using Now

Python, Ruby on Rails, Golang
React/Redux, Objective-C and Swift, Android
Postgres, Redshift, Redis, Kafka
AWS/GCP
Docker/Kubernetes
OpenTelemetry/Prometheus/Thanos/Loki/Grafana/New Relic/Sentry
Git/CircleCI
ArgoCD

What You Bring

6+ Years in various SRE roles
6+ Years in various DevOps/System Engineering roles
3+ Years in leading and managing SRE teams
6+ Years of experience building and managing production Kubernetes infrastructure
7+ Years of experience with popular scripting languages (Python, Ruby, Bash, etc.)
Experience with Infrastructure as code such as Terraform or Crossplane
Experience with CI/CD Development tools (CircleCI, etc.)
Experience with GitOps tools (ArgoCD)
Experience using a broad range of AWS technologies (RDS, Elasticsearch, VPC, EKS, S3, CloudFront, MSK, ElastiCache, CloudWatch, etc.)
Experience developing and maintaining YAML templating systems (Helm charts, Kustomize, etc.)
Must be able to work independently, be self-motivated and handle multiple priorities
Comfortable working in a fast-paced agile environment

Finally, a willingness to admit what you don’t know, and learn what you need to learn quickly.

Why Roadie?

Competitive compensation packages
100% covered health insurance premiums for yourself
401k with company match
Tuition and student loan repayment assistance (that’s right - Roadie will contribute directly to your existing student loans!)
Flexible work schedule with unlimited PTO
Monthly 3-day weekends
Monthly WFH stipend
Paid sabbatical leave: tenured team members are given time to rest, relax, and explore
The technology you need to get the job done

This role is not eligible for Visa sponsorship. Applicants must be authorized to work for any employer in the U.S.

Top Skills

Android, ArgoCD, AWS, CircleCI, Docker, GCP, Git, Grafana, Kafka, Kubernetes, Loki, New Relic, Objective-C, OpenTelemetry, Postgres, Prometheus, Python, React/Redux, Redis, Redshift, Ruby on Rails, Sentry, Swift, Terraform, Thanos
