Responsibilities
- Design, implement, and support scalable, reliable infrastructure to power
production and development environments.
- Manage and enhance our container orchestration systems, with a focus on
Kubernetes (EKS), while maintaining a balanced view of other critical AWS
services such as EC2, ALB, IAM, and VPC networking.
- Build and maintain automation for application and infrastructure deployment,
scaling, and lifecycle management.
- Partner with software engineering teams to improve build, release, and
deployment processes across CI/CD pipelines.
- Monitor and improve system availability, latency, and performance across the full
stack, from cloud infrastructure to backend services.
- Develop internal tools and scripts to enhance operational efficiency, resilience,
and security.
- Play a key role in incident response efforts, including root cause analysis and
long-term remediation.
- Participate in architecture reviews and help guide decisions on infrastructure
design, resilience, and observability.
- Stay informed on industry trends in reliability engineering, cloud-native tooling,
and DevOps practices, and integrate improvements into our operational
playbook.
- Champion security, scalability, and cost-efficiency in all infrastructure decisions.
Qualifications
- 5+ years of experience in a DevOps, SRE, or infrastructure engineering role
supporting production systems at scale.
- Strong knowledge of AWS services and how they integrate to support modern
cloud architectures.
- Proficiency with Infrastructure as Code (IaC) tools such as Terraform, and
with configuration management tools.
- Experience designing and supporting CI/CD pipelines (e.g., Jenkins, GitHub
Actions, ArgoCD).
- Scripting or programming skills in Python, Go, or similar languages, used for
automation and tooling.
- Deep understanding of systems observability, including logging, metrics, and
tracing (e.g., Prometheus, Grafana, CloudWatch).
- Ability to diagnose and troubleshoot complex issues across distributed systems,
including performance bottlenecks and availability challenges.
- Familiarity with security best practices for cloud and containerized environments.
- Clear and proactive communicator, comfortable working cross-functionally in a
fast-paced environment.