Lead Site Reliability Engineer

RO-Remote

Req #: 16988

Type: Regular

Avalara, Inc.

Overview:
Role Summary
As Avalara continues to scale its global SaaS platform and accelerate toward an AI-first operating model, we must fundamentally transform how reliability, deployment, and operational excellence are engineered.

This role exists to design and lead an enterprise-grade, AI-driven reliability ecosystem, enabling self-healing systems, intelligent observability, and low-risk deployment practices across multi-cloud environments. This includes modernizing release engineering (RELE) practices through automation, feature flag-driven deployments, and AI-powered operational workflows.

This is a high-impact individual contributor role responsible for reducing operational risk, eliminating manual toil, improving system resilience, and enabling faster, safer product delivery at scale.
How This Role Elevates Avalara
This role strengthens Avalara's reliability engineering and platform operations capability by introducing AI-driven, automation-first reliability practices.

This Lead Site Reliability Engineer will:

* Improve platform stability and reduce incidents through AIOps, predictive monitoring, and self-healing systems
* Accelerate deployment velocity and reduce risk through progressive delivery, feature flag strategies, and CI/CD optimization
* Enhance customer experience by improving availability, performance, and recovery times
* Increase operational efficiency by eliminating manual processes and introducing intelligent automation workflows
* Advance Avalara's AI-first strategy by embedding agentic AI into observability, incident response, and reliability engineering

Responsibilities:
Bar Raiser Expectations
As a Bar Raiser, this role is expected to elevate the performance of the entire reliability engineering function:

* Hold high standards for availability, reliability, automation, and operational excellence
* Use metrics such as MTTR, SLI/SLO/SLA adherence, deployment success rate, and incident reduction to drive decisions
* Simplify complex distributed systems into scalable, resilient, and automated platforms
* Mentor engineers and raise technical rigor, automation maturity, and AI adoption
* Challenge assumptions and drive measurable improvements in system reliability and deployment safety
* Leave every system, process, and platform more resilient and scalable than before

This role does not just operate systems; it redefines how reliability is engineered at scale.
Reliability Engineering & Platform Leadership
* Own the end-to-end reliability strategy for distributed SaaS systems across multi-cloud environments
* Design and implement AI-driven operations (AIOps) including anomaly detection, predictive failure analysis, and automated root cause identification
* Build and scale observability platforms using Prometheus, Grafana, OpenTelemetry, and ML-based analytics
* Architect self-healing systems and automation frameworks to eliminate manual operational toil
* Lead modernization of deployment practices through feature flags, progressive delivery, and safe rollout strategies
* Drive reliability improvements across Kubernetes-based container platforms
Platform Ownership & Deployment Engineering
* Own reliability of CI/CD pipelines and infrastructure as code (Terraform/Pulumi)
* Design deployment strategies that reduce risk, including:
  * Feature flag-based releases
  * Canary and progressive rollout models
  * Automated rollback and kill-switch capabilities
* Improve deployment observability and traceability across environments
* Ensure high availability, scalability, and fault tolerance of production systems
Observability, Automation & AI Integration
* Implement advanced monitoring, logging, and tracing systems across services
* Integrate agentic AI workflows into incident detection, triage, and resolution
* Build automation pipelines using Go, Python, and modern workflow tools
* Enable AI-assisted observability, including:
  * Intelligent alerting
  * Automated diagnostics
  * Performance optimization insights
* Drive adoption of automation-first and AI-first operational practices
Operational Excellence & Incident Management
* Lead incident response and on-call readiness for production systems
* Improve incident resolution time and system recovery through automation
* Conduct post-incident reviews and implement systemic improvements
* Communicate clearly with stakeholders and customers during incidents
12-Month Success Signals
Within the first 12 months, this role will have:

* Reduced MTTR by 30-50% through automation and AI-driven diagnostics
* Decreased production incidents and customer impact events
* Implemented AI-driven observability and alerting systems across core platforms
* Enabled feature flag-based deployment strategies across engineering teams
* Delivered self-healing automation workflows that significantly reduce manual intervention
* Increased deployment frequency with lower failure and rollback rates
* Elevated team capability through mentorship, standards, and AI adoption
AI Expectations
As an AI-first company, Avalara expects this role to embed AI into reliability engineering practices. This role will:

* Design and implement AI-driven operational workflows for incident detection and resolution
* Use AI to predict failures, analyze system behavior, and optimize performance
* Build or integrate AI-powered observability assistants and diagnostics tools
* Identify high-value AI use cases tied to reliability, efficiency, and customer impact
* Apply AI responsibly with strong governance, security, and data considerations
* Elevate AI adoption across teams by sharing best practices and driving measurable outcomes

This role must demonstrate applied AI impact, not just familiarity.

Qualifications:
What You Bring
* B.S. in Computer Science or Engineering
* 10+ years of experience in SaaS, distributed systems, or reliability engineering
* Strong programming experience in Go, Java, and Python
* Deep expertise in observability tools (Prometheus, Grafana, OpenTelemetry, etc.)
* Experience with multi-cloud platforms (AWS, GCP, Azure/OCI)
* Strong knowledge of Kubernetes, Docker, and container orchestration
* Advanced understanding of Linux systems, networking (TCP/IP, DNS), and cloud-native architecture
* Experience with Infrastructure as Code and CI/CD pipelines
* Familiarity with AI/ML-driven operations and automation workflows
* Proven ability to operate as a self-starter and drive complex initiatives independently
* Strong communication and documentation skills
* Willingness to participate in on-call rotation for production systems
