We are hiring a Site Reliability Engineer to join the LatAm portion of our globally distributed DPoD SRE team. Our team operates on a 24x5 follow-the-sun model, with the LatAm region covering business hours for the Americas and contributing to weekend and holiday on-call rotations. This role is ideal for an engineer who thrives in production environments, enjoys solving complex reliability challenges on cloud-native infrastructure, and wants to help shape the operational excellence of a platform used by both internal teams and external customers.
Responsibilities
- Operate, monitor and troubleshoot production workloads running on Azure, including AKS clusters, virtual machines, networking and storage components
- Respond to incidents during shift hours and on-call rotations, drive resolution, lead post-incident reviews and implement preventive measures
- Build and maintain CI/CD pipelines in Azure DevOps to support reliable, repeatable deployments
- Design, implement and maintain observability solutions including dashboards, alerts, log pipelines and SLI/SLO metrics that improve service reliability and operational visibility
- Automate repetitive operational tasks ("toil") using scripting languages such as Python and Bash
- Collaborate with engineering, product and support teams across regions to improve system reliability, scalability and performance
- Contribute to runbooks, knowledge base articles and operational documentation
- Participate actively in continuous improvement of the team's processes, tooling and incident management practices
Requirements
- 2+ years of experience in DevOps or Site Reliability Engineering
- Hands-on experience operating workloads in Microsoft Azure (compute, networking, identity, storage)
- Practical experience with Azure DevOps for CI/CD pipelines and repository management
- Strong Linux administration and troubleshooting skills in production environments
- Proficiency in at least one scripting language for automation purposes
- Demonstrated experience applying Site Reliability Engineering principles (SLIs/SLOs, error budgets, toil reduction, automation, blameless postmortems)
- Strong systematic troubleshooting skills across application, infrastructure and network layers
- English proficiency at B2 level or higher
Nice to have
- Skills in Bash scripting for systems automation and operational tooling
- Hands-on experience with Azure Kubernetes Service (AKS) in production
- Background in the Elastic Stack (Elasticsearch, Kibana) for logging and observability
- Familiarity with formal Incident Management practices and ITSM frameworks (e.g., ITIL)
- Expertise in configuring and managing NFS or other network-attached storage solutions
- Proficiency in Python development for automation, observability tooling or platform integrations
- Capability to define and manage Service Level Indicators and Service Level Objectives in a customer-facing service