Site Reliability Engineer (SRE)
Contractor posted 2 weeks ago in Technology (Software, IT, AI, Internet)
Job ID 12002
Job Description
Job Title: Site Reliability Engineer
Department: Infrastructure / Platform Engineering
Reports To: Head of Engineering / Cloud Architect
Role Overview
We are hiring a Site Reliability Engineer to keep our production environments stable, performant, observable, and resilient. The role focuses on uptime, incident response, automation, performance tuning, and service reliability.
Key Responsibilities
- Monitor system availability, reliability, latency, and performance
- Build and improve observability across logs, metrics, and tracing
- Participate in incident management, root cause analysis, and recovery planning
- Improve system resilience, failover readiness, and operational maturity
- Define SLOs, SLIs, and reliability metrics
- Partner with DevOps and development teams to improve production readiness
- Automate operational tasks and reduce manual toil
- Support capacity planning and performance optimization
Required Qualifications
- Strong experience with Linux systems, cloud environments, and production operations
- Experience with monitoring and observability tools
- Knowledge of incident response, reliability practices, and service health monitoring
- Experience with scripting or coding for automation
- Strong troubleshooting and systems-thinking skills
Preferred Qualifications
- Experience with Kubernetes, Prometheus, Grafana, ELK, Datadog, or similar platforms
- Familiarity with distributed systems and high-availability design
- Experience working in always-on or client-critical production environments


