ControlUp optimizes the digital experience, captured through real-time observation, to deliver best-in-class employee productivity.
We are looking for a Site Reliability Engineer to join our team. The candidate will help drive the next generation of ControlUp innovation using cutting-edge technology. This role is an excellent opportunity to learn and develop new technologies and methodologies.
Responsibilities:
- Develop automation scripts and tools to streamline operational processes
- Implement and maintain infrastructure practices
- Design and implement monitoring solutions to identify and address potential issues proactively
- Participate in on-call rotations to respond to incidents and troubleshoot system outages
- Conduct performance analysis and optimize system performance
- Identify bottlenecks and implement solutions to improve overall system efficiency
- Collaborate with teams to plan and forecast resource requirements
- Ensure the scalability of systems to accommodate growing user demands
- Promote and implement SRE best practices within the organization
- Drive initiatives to improve system reliability and reduce operational overhead
- Conduct post-incident reviews to analyse root causes and prevent future incidents
- Document and share findings with relevant teams to improve overall system reliability
- Stay informed about industry trends, emerging technologies, and best practices
- Actively participate in knowledge-sharing activities within the SRE community
Requirements:
- Proven experience in a Site Reliability Engineering or related role
- Familiarity with Windows and Linux
- Programming knowledge in languages such as C# or Python
- Excellent troubleshooting and problem-solving skills
- Strong communication and collaboration skills and a “can do” attitude
Advantages:
- Experience with containerization and orchestration tools (EKS, AKS, Docker for local machines)
- Familiarity with cloud platforms such as AWS or Azure
- Familiarity with infrastructure automation tools (Terraform, Ansible)