Appcharge is a leading direct-to-consumer platform for app publishers.
We help publishers sell digital goods directly to their users, bypassing traditional app store commissions and owning the full payment and commerce experience.
The result: higher margins, greater flexibility, and a more direct relationship with end users.
Backed by $89M from leading investors and operators, we’ve grown 14x in the past year and now process over half a billion dollars annually. We’re building the infrastructure that lets modern app businesses take back control of monetization – at scale.
As a Site Reliability Engineer (SRE), you will own the reliability and availability of Appcharge’s production platform.
If you are passionate about building resilient systems, driving incident management standards, and shaping reliability culture across the company, Appcharge is where you want to be.
Day to day, you will collaborate with DevOps, FinOps, and all R&D teams to design, review, and harden services, lead incidents, and build self-service reliability tooling.
Responsibilities
- Define and maintain SLOs, SLIs, and error budgets for critical user journeys and platform services, and track them on shared dashboards.
- Lead production incident management: participate in the on-call rotation, run high-severity incidents as incident commander, and drive postmortems with clear action items.
- Standardize incident runbooks and automate common mitigations using our existing CI/CD, GitOps, and infrastructure tooling.
- Provide SRE input for new service designs and golden-path templates so services ship with health checks, readiness/liveness probes, timeouts, retries, and graceful degradation patterns by default.
- Partner with DevOps to embed reliability guardrails into CI/CD (canary and blue/green strategies, rollback policies, SLO/error-budget checks before and after deployment).
- Own the quality of production observability: define the minimal telemetry contract (logs, metrics, traces), continuously improve alerting, and reduce alert noise without degrading MTTR.
- Build and maintain self-service reliability tooling such as load and stress tests, chaos experiments, dependency maps, and DR/failover drills for engineering teams.
- Collaborate with FinOps to align reliability with cloud cost: support capacity planning, autoscaling and rightsizing strategies, and ensure golden-path services expose the tags and metrics needed for unit-economics KPIs.
- Analyze incident and reliability trends, reduce operational toil, and drive cross-team initiatives that lower change failure rate and increase system resilience.
- Champion reliability best practices across R&D through documentation, training, and close collaboration with backend and product teams.
Requirements
- 5+ years of experience in SRE / Infrastructure Engineering roles running customer-facing, cloud-native systems in production.
- Strong hands-on experience with Kubernetes and production observability stacks (for example, Prometheus, Grafana, OpenTelemetry, or similar).
- Proven experience defining and working with SLOs/SLIs, error budgets, and structured incident management processes (on-call, incident command, postmortems).
- Solid understanding of cloud platforms (AWS and/or GCP), networking fundamentals, and reliability patterns such as circuit breakers, timeouts, retries, and graceful degradation.
- Practical experience with CI/CD and GitOps concepts, and experience collaborating closely with teams that own ArgoCD, Helm, Terraform, and GitHub Actions or similar tools.
- Experience with at least one of: load testing, capacity planning, chaos engineering, or DR/failover exercises in production or large-scale staging environments.
- Awareness of security and compliance practices and their impact on reliability (IAM, MFA, audit trails, vulnerability management).
- Strong communication skills, the ability to lead during high-pressure incidents, and a collaborative mindset when working with DevOps, FinOps, and product engineering teams.