Design, implement, and support scalable, reliable infrastructure to power
production and development environments.
Manage and enhance our container orchestration systems, with a focus on
Kubernetes (EKS), while maintaining a balanced view of other critical AWS
services such as EC2, ALB, IAM, and VPC networking.
Build and maintain automation for application and infrastructure deployment,
scaling, and lifecycle management.
Partner with software engineering teams to improve build, release, and
deployment processes across CI/CD pipelines.
Monitor and improve system availability, latency, and performance across the full
stack, from cloud infrastructure to backend services.
Develop internal tools and scripts to enhance operational efficiency, resilience,
and security.
Play a key role in incident response efforts, including root cause analysis and
long-term remediation.
Participate in architecture reviews and help guide decisions on infrastructure
design, resilience, and observability.
Stay informed on industry trends in reliability engineering, cloud-native tooling,
and DevOps practices, and integrate improvements into our operational
playbook.
Champion security, scalability, and cost-efficiency in all infrastructure decisions.
5+ years of experience in a DevOps, SRE, or infrastructure engineering role
supporting production systems at scale.
Hands-on experience managing containerized applications using Kubernetes,
preferably AWS EKS, along with an understanding of broader infrastructure
ecosystems.
Strong knowledge of AWS services and how they integrate to support modern
cloud architectures.
Proficiency with Infrastructure as Code (IaC) tools such as Terraform, and
configuration management tools.
Experience designing and supporting CI/CD pipelines (e.g., Jenkins, GitHub
Actions, ArgoCD).
Scripting or programming skills in Python, Go, or similar languages, used for
automation and tooling.
Deep understanding of systems observability, including logging, metrics, and
tracing (e.g., Prometheus, Grafana, CloudWatch).
Ability to diagnose and troubleshoot complex issues across distributed systems,
including performance bottlenecks and availability challenges.
Familiarity with security best practices for cloud and containerized environments.
Clear and proactive communicator, comfortable working cross-functionally in a
fast-paced environment.