We are seeking a Site Reliability Engineer with a strong programming background to join our Cloud Security and Infrastructure (CSI) team.
CSI provides a single point of entry for identity, branding and compliance, as well as a single point of management for provisioning, monitoring, security and operational support. The ideal candidate will bring hands-on expertise in containerization, orchestration and observability to help build and maintain reliable, scalable systems.
Responsibilities
- Create and manage applications, containerize them and run them using open-source container management tools such as Docker or Podman
- Interpret container logs and trace specific events for troubleshooting purposes
- Create and manage Kubernetes resource manifests for deployment into Kubernetes clusters (e.g., a local kind cluster, or GKE/AKS in a cloud provider)
- Deploy Prometheus agents to monitor infrastructure and application behavior
- Raise and manage alerts based on observability data
- Support provisioning, monitoring, security and operational tasks across distributed systems
- Implement and maintain CI/CD pipelines and GitOps-based continuous deployment workflows
- Collaborate with cross-functional teams to ensure system reliability and performance
Requirements
- At least 2 years of hands-on programming experience
- Proficiency in at least one scripting language
- Hands-on expertise in Kubernetes and Linux
- Knowledge of at least one cloud provider, with experience in Microsoft Azure
- Familiarity with Prometheus or a similar monitoring agent and strong fundamentals of observability
- Skills in Azure DevOps CI/CD pipelines and/or GitOps packaging and continuous deployment tools such as Helm and ArgoCD
- Capability to troubleshoot distributed systems
- Background in Terraform for infrastructure as code
- Fluent communication skills in English at a B2+ level
Nice to have
- Familiarity with Azure DevOps
- Knowledge of Google Cloud Platform
- Expertise in Istio
- Proficiency in Prometheus