We are looking for a Lead DevOps Engineer to own and evolve the AWS platform behind a custom VDI solution and cloud playtesting/streaming services. You will drive infrastructure-as-code, ECS/EKS operations, AWS Lambda automation, and GitHub Actions CI/CD standards, while optimizing GPU EC2 cost and performance and leading incident response across the platform. Apply now to help keep the platform reliable, efficient, and scalable.
Responsibilities
- Design, build, and maintain AWS infrastructure with Terraform
- Manage Terraform workflows and remote state through HashiCorp Cloud Platform (HCP)
- Own the end-to-end infrastructure lifecycle, including provisioning, upgrades, decommissioning, and operational hygiene
- Operate ECS clusters to deploy and run microservices that support the platforms
- Administer EKS clusters that host GitHub Actions runners, including necessary platform customizations
- Optimize and right-size GPU-enabled EC2 capacity to meet user experience goals under strict cloud cost controls
- Continuously assess scaling behavior, monitor utilization, and identify performance bottlenecks
- Implement and maintain AWS Lambda functions that automate cleanup tasks, on-demand provisioning, and operational workflows
- Standardize and improve GitHub Actions pipelines for Terraform plan/apply workflows, infrastructure releases, and container image build/publish/deploy processes
- Lead troubleshooting and service restoration for platform-wide degradations such as VDI session drops, authentication issues, and machine/storage failures
- Coordinate incident resolution across teams by driving investigation, mitigation, and follow-up actions
- Create and keep current runbooks, operational documentation, and onboarding materials
Requirements
- 7+ years of proven experience in DevOps or platform engineering roles
- Deep expertise in AWS infrastructure architecture, provisioning, and full lifecycle management
- Hands-on proficiency with Terraform and HashiCorp Cloud Platform (HCP)
- Solid experience operating container orchestration with ECS and EKS
- Strong knowledge of GPU-enabled EC2 right-sizing, cloud cost management, and performance tuning
- Practical competency with AWS Lambda for event-driven automation
- Demonstrated background in standardizing CI/CD with GitHub Actions pipelines
- Proven track record of leading reliability engineering, troubleshooting, and incident resolution
- High ownership and accountability, with the ability to work independently without close supervision
- Strong troubleshooting and systems thinking, staying calm and methodical during incidents
- Clear communication skills with both technical and non-technical stakeholders
- Effective prioritization in a Kanban workflow, balancing planned work with urgent interruptions
- English proficiency at B2 (Upper-Intermediate) level or higher
Nice to have
- Familiarity with Amazon GameLift Streams
- Understanding of streaming and playtesting platform needs
- Ability to triage urgent ad-hoc requests that fall outside the standard Kanban flow
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn