Job Description

Job Description
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run production systems. SRE ensures that August 99's services—both our internally critical and our externally-visible systems, e.g. GitLab/developer tooling and hosted client sites for Agent Image—have reliability and uptime appropriate to users needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.
General Responsibilities
Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation, and refinement.
Install new / rebuild existing servers and configure hardware, services, settings, directories, storage, etc., following standards and project/operational requirements.
Perform daily system monitoring, verifying the integrity and availability of all hardware, server resources, systems, and key processes, reviewing system and application logs, ...

Ready to Apply?

Take the next step in your AI career. Submit your application to Confidential today.

Submit Application