Job Description

  • 7+ years of experience as a System Engineer, Site Reliability Engineer (SRE), or in IT Operations
  • Proficiency with monitoring and observability tools such as Grafana, Datadog, and Splunk
  • Strong understanding of logging frameworks and best practices (e.g., Fluentd, Logstash, Loki)
  • Hands-on knowledge of infrastructure monitoring (CPU, memory, network, IOPS) and application performance monitoring (APM)
  • Proficiency in scripting and automation (Bash, PowerShell, or Python) for automating incident response and log parsing
  • Familiarity with Docker-based containerized environments and with monitoring their performance
  • Experience integrating monitoring and alerting tools into CI/CD pipelines
  • Understanding of non-functional requirements (NFRs) such as performance, reliability, availability, and observability
  • Ability to work within cross-functional teams, integrating monitoring and incident management into daily workflows
  • Nice to ha...

    Ready to Apply?

    Take the next step in your AI career. Submit your application to instinctools today.
