Position Overview:
The Site Reliability Engineer (SRE) will play a crucial role in enhancing the reliability, availability, and performance of systems while implementing automation and best practices for operational excellence.
Key Responsibilities:
Monitor system performance and reliability, responding promptly to incidents and alerts.
Develop and implement automation scripts and tools to streamline operations and improve efficiency.
Collaborate with development and operations teams to design systems that are resilient and scalable.
Conduct root cause analysis of incidents and implement corrective measures to prevent recurrence.
Create and maintain documentation of systems, processes, and procedures to ensure clarity and compliance.
Participate in on-call rotations to provide support during incidents and emergencies.
Drive continuous improvement initiatives, focusing on enhancing system reliability and performance.
Stay up to date with industry best practices, tools, and technologies relevant to site reliability engineering.