Job Description
Cisco ThousandEyes is seeking a Lead Site Reliability Engineer. The engineer will establish and maintain reliability, scalability, and security standards for the cloud and big-data platforms that support AI and machine learning operations. The engineer will work with engineering, product, operations, and security teams to create enterprise-ready software-as-a-service (SaaS) solutions.
Job ID: 1440127
Date Posted: February 25, 2026
Expiration Date: NA
Main Duties
- Design, build, and optimize cloud and data infrastructure for high availability and reliability of AI/ML systems.
- Implement SRE principles including monitoring, alerting, error budgets, and fault analysis.
- Collaborate with software development, product management, and security teams to support ML/AI workloads.
- Troubleshoot production issues, conduct root cause analysis, and drive performance improvements.
- Lead the architectural vision, define the technical roadmap, and balance immediate and long-term goals.
- Mentor engineering teams and foster a culture of operational excellence.
- Engage with stakeholders to understand use cases and influence enterprise solutions.
- Develop strategic roadmaps and automation for deploying software at scale.
Essential Qualifications
- Extensive practical experience with cloud technologies, especially Amazon Web Services (AWS).
- Advanced knowledge of Infrastructure as Code with Terraform, together with Kubernetes and EKS.
- Experience building AI and machine learning infrastructure with the Hadoop ecosystem and related tools, including Spark, Hive, HDFS, Gobblin, Airflow, EMR, and SageMaker.
- Programming proficiency in Python, Go, and one other programming language.
- Ability to develop solutions that scale efficiently and maintain high testing standards.
Preferred Qualifications
- Knowledge of Unix and Linux operating systems, client-server networking protocols, and observability tools such as Prometheus, Grafana, and the ELK stack.
- CKA, CKAD, and AWS DevOps Engineer certifications.
- Experience in designing software solutions and infrastructure systems for large-scale enterprise environments.