Lead DevOps Engineer

EPAM Systems, Inc. - Work from home

Job details

Posted 1 day ago

Full job description

We are hiring a Lead DevOps Engineer to own and evolve the AWS platform behind a custom VDI solution and cloud playtesting/streaming services. You will drive infrastructure-as-code, ECS/EKS operations, AWS Lambda automation, and GitHub Actions CI/CD standards while optimizing GPU EC2 cost/performance and leading incident response across the platform. Apply now to help keep the platform reliable, efficient, and scalable.

Responsibilities

Design, build, and maintain AWS infrastructure with Terraform
Manage Terraform workflows and remote state through HashiCorp Cloud Platform (HCP)
Own the end-to-end infrastructure lifecycle, including provisioning, upgrades, decommissioning, and operational hygiene
Operate ECS clusters to deploy and run microservices that support the platforms
Administer EKS clusters that host and enable GitHub Actions runners, including necessary platform customizations
Optimize and right-size GPU-enabled EC2 capacity to meet user experience goals under strict cloud cost controls
Assess scaling behavior continuously, monitor utilization, and identify performance bottlenecks
Implement and maintain AWS Lambda functions that automate cleanup tasks, on-demand provisioning, and operational workflows
Standardize and improve GitHub Actions pipelines for Terraform plan/apply workflows, infrastructure releases, and container image build/publish/deploy processes
Lead troubleshooting and service restoration for platform-wide degradations such as VDI session drops, authentication issues, and machine/storage failures
Coordinate incident resolution across teams by driving investigation, mitigation, and follow-up actions
Create and keep current runbooks, operational documentation, and onboarding materials

Requirements

7+ years of proven experience in DevOps or platform engineering roles
Deep expertise in AWS infrastructure architecture, provisioning, and full lifecycle management
Hands-on proficiency with Terraform and HashiCorp Cloud Platform (HCP)
Solid experience operating container orchestration using ECS and EKS
Strong knowledge of GPU-enabled EC2 right-sizing, cloud cost management, and performance tuning
Practical competency with AWS Lambda for event-driven automation
Demonstrated background in standardizing CI/CD with GitHub Actions pipelines
Proven track record leading reliability engineering, troubleshooting, and incident resolution
High ownership and accountability with the ability to work independently without close supervision
Strong troubleshooting and systems thinking, staying calm and methodical during incidents
Clear communication skills with both technical and non-technical stakeholders
Effective prioritization in a Kanban workflow, balancing planned work with urgent interruptions
English proficiency at B2 (Upper-Intermediate) level or higher

Nice to have

Familiarity with Amazon GameLift Streams
Understanding of streaming and playtesting platform needs
Ability to triage urgent ad-hoc requests that fall outside the standard Kanban flow

We offer

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
