Job Description
- 7+ years of experience as a System Engineer, Site Reliability Engineer (SRE), or in IT Operations
- Proficiency with monitoring and observability tools such as Grafana, Datadog, Splunk
- Strong understanding of logging frameworks and best practices (e.g., Fluentd, Logstash, Loki)
- Hands-on knowledge of infrastructure monitoring (CPU, memory, network, IOPS) and application performance monitoring (APM)
- Proficiency in scripting and automation (Bash, PowerShell, or Python) for automating incident responses and log parsing
- Familiarity with containerized environments on Docker and monitoring their performance
- Experience integrating monitoring and alerting tools into CI/CD pipelines
- Understanding of non-functional requirements (NFRs) such as performance, reliability, availability, and observability
- Ability to work within cross-functional teams, integrating monitoring and incident management into daily workflows

Nice to ha...
Ready to Apply?
Take the next step in your AI career. Submit your application to instinctools today.