Senior Site Reliability Engineer

EPAM Systems, Inc. - Remote

Full job description

We are seeking a Senior Site Reliability Engineer with substantial expertise in enhancing the reliability, availability, performance and scalability of production environments. The right candidate will bring a strong software engineering mindset paired with deep operational know-how, cloud expertise, automation capabilities and practical incident management experience.

This position centers on constructing dependable systems, cutting down operational toil, strengthening observability and supporting engineering teams in delivering services that meet established reliability targets.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Responsibilities

Architect and deploy solutions that enhance system reliability, availability and performance
Establish and track SLIs, SLOs and error budgets
Develop automation that minimizes manual operational effort and repetitive activities
Enhance monitoring, logging, tracing and alerting capabilities
Take part in incident response, root cause analysis and postmortems
Partner with development teams to strengthen service resilience and operability
Maintain production systems and assist in resolving complex technical problems
Contribute toward capacity planning, performance tuning and disaster recovery strategies
Advocate for reliability engineering practices throughout teams

Requirements

Substantial experience in SRE, DevOps, Platform Engineering or Production Engineering positions
Practical experience maintaining production systems at scale
Familiarity with cloud platforms including AWS, Azure or GCP
Deep knowledge of observability tools used for monitoring, logging, tracing and alerting
Background in incident management, postmortems and root cause analysis
Solid scripting or programming abilities in Python, Go, Bash or similar languages
Familiarity with Linux systems, networking and distributed systems fundamentals
Working knowledge of containers and orchestration platforms like Docker and Kubernetes
Sound understanding of CI/CD, automation and Infrastructure as Code
Excellent problem-solving abilities and capacity to perform under pressure

Nice to have

Background defining SLIs, SLOs and error budgets
Familiarity with Prometheus, Grafana, Datadog, New Relic, Splunk, ELK or comparable tools
Hands-on use of Terraform or other IaC tools
Exposure to chaos engineering or resilience testing
Background with high-availability systems and disaster recovery planning
Certifications in cloud technologies or Kubernetes

We offer

Connectivity Bonus (ARS 25,000, paid at the end of each month and itemized on the pay slip as a non-salary item).
Private Health Insurance (Medicina Prepaga; covers the employee and their immediate family).
Paternity Leave (two days in addition to the statutory leave, for a total of 4 days).
Discount card.
English Training (English lessons, twice per week).
Training Program (Access to multiple customized training plans according to the needs of each role within the company).
Marriage Bonus (the company doubles the statutory allowance paid by ANSES).
Referral Program (a bonus is paid when a candidate referred by an employee joins the company).
External Agreements and Discounts.
Vacation: 14 calendar days per year.

By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.
