Sr. Site Reliability Engineer (Site Reliability Engineer IV)
BR-Remote
Brazil Careers
Req #: 15933
Type: Regular
Overview:
Avalara is an AI-first company. We expect every engineer, manager, and leader to actively leverage AI to enhance productivity, quality, innovation, and customer value. AI is embedded in our workflows, decision-making, and products - and success at Avalara requires embracing AI as an essential capability, not an optional tool.

As a member of our Reliability Engineering Product SRE team, you will be responsible for building production applications with the highest level of MVRs and SMMs, ensuring customer satisfaction through your expertise in SRE domain skills. We are seeking an individual who is passionate about automation, efficiency, and operational excellence. You will be using a bundled tech stack to provide deep visibility into customer, product, and infrastructure interactions. You will have a keen eye for SLOs, SLIs, SLAs, and the golden metrics that drive reliability. You will programmatically approach MVRs using coding and scripting languages, while also leveraging AI/ML-driven insights where applicable.

Responsibilities:
* Build products with MVRs and reliability standards, ensuring system resilience and scalability.
* Set up and operate observability tools across multiple cloud providers, incorporating AI-powered anomaly detection to enhance monitoring.
* Assist development teams in defining SLO/SLI dashboards and alerts, optimizing alerting signals with ML-based noise reduction techniques.
* Use Go, Python, or Terraform to automate operational tasks and build self-healing mechanisms.
* Manage and administer Grafana, Prometheus, Loki, and other observability tools, integrating predictive analytics where beneficial.
* Troubleshoot and support production environments, using AI-assisted diagnostics where applicable for faster root cause identification.
* Automate incident response workflows, leveraging AIOps to reduce manual toil and improve MTTR.
Qualifications:

Experience
* Minimum 8 years of experience in a SaaS environment
* Bachelor's degree in computer science or equivalent
* Ability to participate in an on-call rotation

Qualifications
* AI-powered: Interest in AI-powered automation, including AIOps tools, ML-based alert tuning, and predictive maintenance.
* Networking: Strong understanding of the OSI model, TCP/IP, and DNS, particularly as they relate to cloud environments.
* Linux Fundamentals: Solid experience with the administration, security hardening, and performance tuning of one or more Linux distributions.
* Troubleshooting: A passion for tracking down the technical root causes of problems in distributed systems and software.
* Observability: Experience with developing service level indicators and objectives, instrumenting software, and building alerts. ML-based anomaly detection is a plus.
* Software Engineering: An understanding of software engineering fundamentals, with experience developing software with a team of engineers.
* Automation: A strong desire to automate all of the things and eliminate toil.
* Containers: A solid understanding of the underpinnings of container technology, such as cgroups and namespaces.
* Container Orchestration Systems: Experience with the operations, administration, and development of orchestration systems such as Kubernetes, ECS, Mesos, and Nomad.
* IaC: Experience with deploying and maintaining infrastructure as code with tools such as Terraform and Pulumi.
* Technical Writing: Most of the services we develop are greenfield, and you will need to build documentation and diagrams for other engineering teams.
* Customer Satisfaction: Keen eye for customer satisfaction (our customers are other engineering teams and Avalara customers).
* Passion for Learning: Interest in the broader technology space with a constant desire to expand your understanding.
* Adaptability: Experience working on a variety of projects. In short, we want people with T-shaped skills.
* Tools & Technologies we are looking at as part of the skillset: Terraform, Grafana, Prometheus, Loki, Alertmanager, Pushgateway, Prometheus exporters & client libraries, PromQL, LogQL, Fluentd, Fluent Bit, Sumo Logic, Splunk, Tempo, Jaeger, OpenTelemetry, Cortex, etc.
* Other Common Tools & Technologies expected: AWS, GCP, Oracle Cloud, Azure, Terraform, Pulumi, GitLab, Artifactory, Atlassian suite, Git, Kubernetes, Go, C#, Python, Bash, PowerShell, Docker, Windows, Linux, etc.

Preferred Qualifications
* Programming Languages: Go and Python
* Distributed Computing: Experience architecting, developing, and deploying distributed services across regions and clouds.
* GitLab: Experience working with, managing, and deploying GitLab.
* Artifactory: Experience working with, managing, and deploying Artifactory.
* Open Source: Side projects of your own or contributions to open-source projects.