Job Description

We are looking for a motivated Site Reliability Engineer (SRE) to drive operational excellence in our software development teams by ensuring the availability, performance and scalability of our production systems. You will work closely with one or more software development teams to raise the bar on their observability practices, enhance incident response capabilities and reduce operational toil through automation.

Key Responsibilities

  • Implement and manage the observability stack (metrics, logs, traces and alerts) to ensure optimal performance and availability
  • Analyze observability data to proactively identify performance bottlenecks and drive reliability improvements
  • Define, track and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services

  • Identify, develop and implement automation tools to red...

Ready to Apply?

Take the next step in your AI career. Submit your application to TVH today.
