Job Description

Job Responsibilities

  • Design, build, and maintain highly available, scalable, and reliable production systems.
  • Ensure system uptime, performance, and reliability by proactively monitoring, troubleshooting, and resolving incidents.
  • Implement and manage monitoring, alerting, and observability solutions (metrics, logs, traces).
  • Automate operational tasks to reduce manual effort and improve system reliability.
  • Lead incident response, root cause analysis (RCA), and post-incident reviews.
  • Collaborate with development teams to define SLIs, SLOs, and error budgets.
  • Improve CI/CD pipelines to enable safe, fast, and reliable deployments.
  • Manage capacity planning, performance tuning, and cost optimization.
  • Ensure security best practices across infrastructure and application layers.
  • Participate in on-call rotations and provide production support...

Ready to Apply?

Take the next step in your AI career. Submit your application to Net2Source (N2S) today.