Why Join Us?
Cyberint, a market leader in External Risk Management, empowers global organizations to detect, respond to, and remediate external threats efficiently. Now part of Check Point, Cyberint continues to grow and innovate at the intersection of cybersecurity and cloud-native SaaS technologies.
Join Our Operations Team
We are seeking a proactive, experienced Site Reliability Engineer (SRE) to join our dynamic Operations team. You’ll be working on a cutting-edge SaaS solution that runs on AWS (EKS-based Kubernetes infrastructure), supporting an architecture with many moving parts. If you're driven by reliability engineering, love automation, and want to make an impact on mission-critical platforms, this role is for you.
What You’ll Do
As an SRE at Cyberint, you will be instrumental in ensuring the observability, stability, and scalability of our platform. You will develop automated solutions and monitoring tools to proactively detect and respond to incidents, improve system resilience, and collaborate with engineering teams across the company to embed operational excellence into our product lifecycle.
Additionally, you will help evolve our AI-driven operational and monitoring tooling, including our on-call assistant bot, which leverages AI technologies to streamline incident resolution, automate repetitive tasks, and support real-time decision-making for engineers.
Key Responsibilities
- Design, implement, and maintain monitoring and alerting systems (e.g., Prometheus, Grafana) to detect and prevent reliability issues.
- Develop tools and automation (Python, Bash, etc.) for improving infrastructure reliability and operational efficiency.
- Collaborate with R&D and Product teams to embed reliability-first principles into every stage of the development process.
- Participate in and improve incident response processes, including running blameless postmortems and implementing preventive measures.
- Enhance our Infrastructure-as-Code (IaC) and CI/CD practices to streamline deployments and reduce risk.
- Maintain and extend internal AI-driven tools, such as bots that support SRE workflows (on-call management, triaging, etc.).
- Document infrastructure, playbooks, and operational procedures to facilitate onboarding and knowledge sharing.
Qualifications
- 3+ years of experience in an SRE, DevOps, or similar role in a SaaS/cloud-native environment.
- Strong experience with Kubernetes, AWS, and cloud-based distributed systems.
- Hands-on experience building or maintaining monitoring stacks such as Prometheus, Grafana, ELK, etc.
- Proficiency in Python, Bash, or similar scripting languages.
- Experience with Infrastructure as Code tools (Terraform, Helm, etc.).
- Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, ArgoCD).
- Solid analytical and problem-solving skills with a passion for operational excellence.
- Exposure to AI-based tooling (e.g., OpenAI API, LLM-based bots) to automate operations or enhance incident response processes.
Nice to Have
- Experience with incident management platforms (e.g., PagerDuty).
- Security-first mindset and experience in the cybersecurity industry.
- Experience with service mesh, zero-downtime deployments, or chaos engineering.
- Contributions to AI-assisted SRE initiatives or platform operations & monitoring automation.