Power the Future with Us!
SolarEdge (NASDAQ: SEDG) is a global leader in high-performance smart energy technology, with over 4,000 employees, offices in 34 countries, and millions of products installed in more than 133 countries. Our diverse product offering includes intelligent solar inverters, battery storage, backup systems, EV charging, and complete home energy management ecosystems. With world-class engineering and a relentless focus on innovation, we are creating a world where clean, green energy from the sun powers our homes, businesses, and communities.
We are looking for a Senior Observability & SRE Engineer to lead the evolution of our observability and reliability practices across a complex, distributed system landscape. This is a high-impact role that will shape how engineering teams monitor, trace, and improve the performance and reliability of our systems at scale.
Responsibilities:
- Champion observability best practices across engineering teams and heterogeneous system architectures.
- Lead the design and implementation of distributed tracing across hundreds of microservices, including asynchronous communication patterns (e.g., Kafka).
- Collaborate with service owners to define SLA/SLO targets and implement effective monitoring, alerting, and dashboards using tools like Prometheus, Grafana, and Elastic Stack.
- Operate, maintain, and enhance our observability stack, with a focus on:
  - Elastic Stack (Elasticsearch, Logstash, Kibana) for logging
  - Grafana for visualization
  - Prometheus for metrics
- Integrate applications with APM solutions, with a strong preference for Elastic APM.
- Use programming skills (e.g., Python, Go, Java, scripting languages) to:
  - Develop custom tooling and integrations to enhance observability and automate SRE workflows.
  - Build advanced dashboards and data aggregation pipelines for deep system insights.
- Collaborate with development teams to embed observability and reliability into the SDLC, establishing standards and best practices.
- Participate in and help mature our incident response processes, including leading post-mortems.
- Proactively identify and resolve performance bottlenecks and system inefficiencies.
- Mentor engineers on SRE principles and observability techniques.
Qualifications (Must-Haves):
- Proven experience in Platform Engineering, SRE, or a similar role with a strong focus on observability in distributed environments.
- Demonstrated success in driving observability adoption across multiple teams.
- Deep knowledge of Kubernetes operations, including Helm-based deployments, monitoring, and troubleshooting.
- Hands-on experience with distributed tracing in asynchronous systems (Kafka, etc.).
- Expertise in defining and tracking SLOs/SLIs and building alerting and dashboarding strategies.
- Proficiency with observability tools such as Elastic Stack, Grafana, and Prometheus.
- Strong experience with APM tools, especially Elastic APM.
- Solid programming/scripting skills (e.g., Java/Kotlin, Python, JavaScript/TypeScript).
- Familiarity with diverse application infrastructures and build systems (Java/Maven, Python, C#).
- Experience with CI/CD pipelines (GitLab preferred), code analysis tools (e.g., SonarQube), and artifact management and scanning tools (e.g., Artifactory/Xray).
- Excellent communication and collaboration skills.
Qualifications (Highly Desirable):
- Experience contributing to broader platform engineering initiatives (e.g., service catalog portals).
- Proven success in implementing self-service infrastructure provisioning.
- Experience with standardization initiatives and creating code templates/scaffolding.
- Familiarity with public cloud platforms, especially AWS.