The Baldwin Group

Observability (SRE) Engineer

Sorry, this job was removed at 06:31 p.m. (CST) on Tuesday, May 06, 2025
Remote
Hiring Remotely in US

Similar Jobs

5 Days Ago
In-Office or Remote
2 Locations
184K-357K Annually
Senior level
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The role involves architecting and operating large-scale observability systems, designing resilient telemetry pipelines, automating operations, and leading incident responses while collaborating with various teams.
Top Skills: Elasticsearch, Flink, Go, Jaeger, Kafka, Loki, Mimir, OpenSearch, OpenTelemetry, Prometheus, Python, Spark, Tempo, Thanos
11 Days Ago
Remote
USA
140K-170K Annually
Senior level
Hardware • Machine Learning • Security • Software
The Site Reliability Engineer will manage software deployment for IoT devices, improve observability, maintain dashboards, automate processes, and collaborate on incident responses.
Top Skills: Ansible, AWS, Bash, C/C++, Datadog, Grafana, Groovy, Java, JavaScript, NoSQL, Postgres, Prometheus, Python, R, Sigma, SQL, Terraform
8 Days Ago
Easy Apply
Remote or Hybrid
7 Locations
127K-249K Annually
Senior level
Big Data • Cloud • Software • Database
This role involves building and maintaining observability services, ensuring service reliability, and collaborating with other teams on best practices.
Top Skills: AWS, Fluent Bit, GCP, Jaeger, Kubernetes, Azure, Quickwit, Splunk, Vector, VictoriaMetrics

The Baldwin Group is an award-winning entrepreneur-led and inspired insurance brokerage firm delivering expertly crafted Commercial Insurance and Risk Management, Private Insurance and Risk Management, Employee Benefits and Benefit Administration, Asset and Income Protection, and Risk Mitigation strategies to clients wherever their passions and businesses take them throughout the U.S. and abroad. The Baldwin Group has award-winning industry expertise, colleagues, competencies, insurers, and most importantly, a highly differentiated culture that our clients consider an invaluable expansion of their business. The Baldwin Group (NASDAQ: BWIN) takes a holistic and tailored approach to insurance and risk management.

We’re looking for a highly motivated, practical and responsible Observability/Site Reliability Engineer who is excited to play a critical role in our rapidly growing Platform team. The Observability Engineer role will make significant contributions to our Observability, APM, Monitoring and Logging strategy, be integral to our day-to-day operations, and be an advocate for designing and implementing Site Reliability Engineering principles within the company.

The successful candidate will have experience with CI/CD, Observability, APM, Monitoring, Logging, Infrastructure-as-Code, and On-Call Support. An understanding of Cloud (AWS/Azure), SRE practices, version control, configuration management, and automation is also required.

Principal Responsibilities:

  • Develop and maintain comprehensive observability solutions for infrastructure, applications, and services, and implement APM tools and frameworks to monitor application performance, user experience, and system health.
  • Implement and maintain tools and systems that provide insights into the health and performance of applications and infrastructure, including metrics, logs, and traces to monitor system behavior.
  • Proactively analyze performance metrics and logs to identify bottlenecks, failures, and areas for improvement, ensuring systems are consistently reliable, highly available, and optimally performing by addressing potential issues before they impact users.
  • Strategically assess system capacity requirements and plan for future growth to ensure seamless scalability, working closely with development and operations teams to implement robust and effective scaling strategies.
  • Create automated solutions for monitoring, deployment, scaling, and recovery operations, and develop custom tools and scripts to enhance observability and monitoring capabilities.
  • Collaborate closely with software engineers, QA teams, and operations staff to integrate observability and reliability best practices into the development lifecycle, providing expert guidance and support for instrumenting code and services with comprehensive monitoring and logging solutions.
  • Develop and maintain incident response plans, including alerting, escalation, and communication protocols, and lead efforts to resolve production incidents, minimizing downtime and ensuring thorough root cause analysis and post-mortem reviews.

Education, Experience, Skills and Abilities Requirements:

  • 3+ years of experience in an Observability or Site Reliability Engineer role.
  • Experience with cloud infrastructure platforms such as AWS or Azure.
  • Proven experience administering observability and monitoring tools (Datadog or similar).
  • Experience with containerized and serverless compute technology (Docker, ECS, Kubernetes, Lambda, etc.).
  • Experience with DevOps and CI/CD processes and tools (GitHub, Terraform, Ansible, etc.).
  • Experience integrating DevOps, SRE, and testing tools to generate DORA metrics and reports and to create dashboards.
  • Understanding of SRE principles, including SLOs, SLIs, KPIs, metrics, logging, and tracing.
  • Proficient in writing scripts (Bash, PowerShell) and programming in one or more languages (Python, JavaScript, Go, Java, or similar).
  • Experience in capacity planning and scaling resource requirements based on traffic patterns and performance metrics.
  • Experience in preparing, executing, and improving incident response plans.
  • Strong understanding of on-call rotation practices and incident escalation processes.
  • Knowledge of security best practices and compliance standards relevant to observability and monitoring (e.g., GDPR, HIPAA).
  • Datadog or other relevant certifications preferred.
  • Highly self-motivated, highly available, and driven to exceed colleague expectations.
  • Ability to think critically and logically under pressure.
  • Strong technical experience with a proven history of troubleshooting complex, cross-segment, cross-office, and cross-team problems.
  • Demonstrates the organization’s core values, exuding behavior that is aligned with the firm’s culture.

Click here for some insight into our culture!

The Baldwin Group will not accept unsolicited resumes from any source other than directly from a candidate who applies on our career site. Any unsolicited resumes sent to The Baldwin Group, including unsolicited resumes sent via any source from an Agency, will not be considered and are not subject to any fees for any placement resulting from the receipt of an unsolicited resume.

What you need to know about the Austin Tech Scene

Austin has a diverse and thriving tech ecosystem thanks to home-grown companies like Dell and major campuses for IBM, AMD and Apple. The state’s flagship university, the University of Texas at Austin, is known for its engineering school, and the city is known for its annual South by Southwest tech and media conference. Austin’s tech scene spans many verticals, but it’s particularly known for hardware, including semiconductors, as well as AI, biotechnology and cloud computing. Its food and music scene, low taxes and favorable climate have also made the city a destination for tech workers from across the country.

Key Facts About Austin Tech

  • Number of Tech Workers: 180,500; 13.7% of overall workforce (2024 CompTIA survey)
  • Major Tech Employers: Dell, IBM, AMD, Apple, Alphabet
  • Key Industries: Artificial intelligence, hardware, cloud computing, software, healthtech
  • Funding Landscape: $4.5 billion in VC funding in 2024 (Pitchbook)
  • Notable Investors: Live Oak Ventures, Austin Ventures, Hinge Capital, Gigafund, KdT Ventures, Next Coast Ventures, Silverton Partners
  • Research Centers and Universities: University of Texas, Southwestern University, Texas State University, Center for Complex Quantum Systems, Oden Institute for Computational Engineering and Sciences, Texas Advanced Computing Center
