Site Reliability Engineer

EPAM Systems, Inc.
Remote (work from home)

Full job description

We are hiring a Site Reliability Engineer to join the LatAm portion of our globally distributed DPoD SRE team. Our team operates on a 24x5 follow-the-sun model, with the LatAm region covering business hours for the Americas and contributing to weekend and holiday on-call rotations. This role is ideal for an engineer who thrives in production environments, enjoys solving complex reliability challenges on cloud-native infrastructure, and wants to help shape the operational excellence of a platform used by both internal teams and external customers.

Responsibilities

Operate, monitor and troubleshoot production workloads running on Azure, including AKS clusters, virtual machines, networking and storage components
Respond to incidents during shift hours and on-call rotations, drive resolution, lead post-incident reviews and implement preventive measures
Build and maintain CI/CD pipelines in Azure DevOps to support reliable, repeatable deployments
Design, implement and maintain observability solutions including dashboards, alerts, log pipelines and SLI/SLO metrics that improve service reliability and operational visibility
Automate repetitive operational tasks ("toil") using scripting languages such as Python and Bash
Collaborate with engineering, product and support teams across regions to improve system reliability, scalability and performance
Contribute to runbooks, knowledge base articles and operational documentation
Participate actively in continuous improvement of the team's processes, tooling and incident management practices

Requirements

2+ years of experience in DevOps or Site Reliability Engineering
Hands-on experience operating workloads in Microsoft Azure (compute, networking, identity, storage)
Practical experience with Azure DevOps for CI/CD pipelines and repository management
Strong Linux administration and troubleshooting skills in production environments
Proficiency in at least one scripting language for automation purposes
Demonstrated experience applying Site Reliability Engineering principles (SLIs/SLOs, error budgets, toil reduction, automation, blameless postmortems)
Strong systematic troubleshooting skills across application, infrastructure and network layers
English proficiency at B2 level or higher

Nice to have

Skills in Bash scripting for systems automation and operational tooling
Hands-on experience with Azure Kubernetes Service (AKS) in production
Background in the Elastic Stack (Elasticsearch, Kibana) for logging and observability
Familiarity with formal Incident Management practices and ITSM frameworks (e.g., ITIL)
Expertise in configuring and managing NFS or other network-attached storage solutions
Proficiency in Python development for automation, observability tooling or platform integrations
Ability to define and manage Service Level Indicators and Service Level Objectives for a customer-facing service
