About the Opportunity

This organization believes computers should handle more work — so humans can focus on creativity and innovation. Their mission is to make automation and AI accessible for everyone by developing intelligent, scalable platforms that help millions of businesses grow. You’ll join a forward-thinking team that values collaboration, cutting-edge tools, and customer success.

Why This Role Matters

This role goes beyond infrastructure maintenance. As a Reliability Engineer, you’ll directly shape the resilience and observability of complex distributed systems. Your work will ensure that customers experience fast, reliable, and consistent services, even as the organization scales globally.

Key Responsibilities

Build and enhance platform tooling for observability and system reliability.
Collaborate with product teams to improve incident detection and response.
Manage and evolve core observability systems — logging, metrics, alerting, dashboards.
Participate in the on-call rotation and contribute to the continuous improvement of incident response processes.
Write code (Go, Python, etc.) to automate operations and reduce manual tasks.
Strengthen infrastructure reliability using AWS, Kubernetes, and Terraform.
Establish and advocate best practices for monitoring and alerting.
Share knowledge through mentoring, documentation, and collaboration.
Explore and apply AI-powered reliability tools for debugging, alert correlation, and performance insights.

About You

4+ years of experience in systems, infrastructure, or backend engineering (preferably SaaS or cloud-native).
Skilled in Go or Python, with strong experience in Terraform, AWS, and Kubernetes.
Solid understanding of observability systems (metrics, logging, dashboards, alerting).
Hands-on experience diagnosing incidents and improving system resilience.
Proactive, analytical mindset, with a genuine enjoyment of solving complex system challenges.
Excellent communication and collaboration skills in a remote setting.
Open-minded about using AI tools to enhance reliability and productivity.

Our Tools & Technologies

Cloud & Infrastructure: AWS, Kubernetes, Terraform, Redis, Kafka
Observability: Grafana, Datadog, Prometheus, OpenSearch, Sentry
Programming Languages: Go, Python, TypeScript
CI/CD: GitLab, Argo CD

What Success Looks Like

You deliver stable, maintainable reliability improvements to critical systems.
You elevate how teams detect, respond to, and learn from incidents.
Product teams gain stronger confidence in service performance under your guidance.
You promote thoughtful, customer-focused reliability practices.
You grow professionally while contributing to a supportive, inclusive, and feedback-driven culture.
You successfully introduce AI tools that reduce noise, accelerate debugging, and enhance decision-making.

Site Reliability Engineer