We are seeking a Lead DevOps Engineer to design, operate, and continuously improve the AWS platform that powers a custom VDI platform and a cloud playtesting/streaming platform. This is primarily an individual contributor role that requires strong ownership and the ability to work independently while collaborating with one other team member and customer stakeholders. You will be responsible for infrastructure as code, container platforms, automation, CI/CD standardization, cost/performance optimization (including GPU instances), and leading troubleshooting during platform-wide degradations.
Responsibilities
- Design, build, and maintain AWS infrastructure using Terraform
- Manage Terraform workflows and remote state using HashiCorp Cloud Platform (HCP)
- Own the infrastructure lifecycle, including provisioning, upgrades, decommissioning and operational hygiene
- Operate ECS clusters that run the microservices supporting the platforms
- Operate EKS clusters that host GitHub Actions runners, including required platform customizations
- Right-size and tune GPU-enabled EC2 capacity to balance user experience with strict cloud cost controls
- Continuously assess scaling behavior, utilization and performance bottlenecks
- Implement and maintain AWS Lambda functions for automation such as cleanup tasks, on-demand provisioning and operational workflows
- Standardize and optimize GitHub Actions pipelines for Terraform plan/apply workflows, infrastructure releases and container image build/publish/deploy processes
- Lead troubleshooting and restoration efforts for platform-wide issues such as VDI session drops, authentication issues and machine/storage failures
- Coordinate incident resolution across teams through investigation, mitigation and follow-up actions
- Create and maintain runbooks, operational documentation and onboarding materials
Requirements
- 5+ years of experience in DevOps or platform engineering roles
- Expertise in AWS infrastructure design, provisioning and lifecycle management
- Proficiency in Terraform and HashiCorp Cloud Platform (HCP)
- Skills in container orchestration with ECS and EKS
- Knowledge of GPU-enabled EC2 capacity right-sizing, cost management and performance tuning
- Competency in AWS Lambda for event-driven automation
- Background in CI/CD standardization with GitHub Actions pipelines
- Capability to lead reliability engineering, troubleshooting and incident resolution
- High ownership and accountability with the ability to work independently and deliver without close supervision
- Strong troubleshooting and systems thinking, remaining calm and structured during incidents
- Clear communication with both technical and non-technical stakeholders
- Practical prioritization in a Kanban environment balancing planned work and urgent interruptions
- English proficiency at B2 level or higher
Nice to have
- Familiarity with Amazon GameLift Streams
- Understanding of streaming and playtesting platform needs
- Skills in triaging urgent ad-hoc requests outside the standard Kanban flow
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn