Lead Site Reliability Engineer, Network Assurance Data Platform, Cisco ThousandEyes, Cisco

March 30, 2026

Job Description

Cisco ThousandEyes is seeking a Lead Site Reliability Engineer to establish and maintain reliability, scalability, and security standards for the cloud and big-data platforms that support AI and machine learning operations. The engineer will work with engineering, product, operations, and security teams to build enterprise-ready software-as-a-service (SaaS) solutions.

Job ID: 1440127

Date Posted: February 25, 2026

Expiration Date: NA

Main Duties

  • Design, build, and optimize cloud and data infrastructure for high availability and reliability of AI/ML systems.
  • Implement SRE principles including monitoring, alerting, error budgets, and fault analysis.
  • Collaborate with software development, product management, and security teams to support ML/AI workloads.
  • Troubleshoot production issues, conduct root cause analysis, and drive performance improvements.
  • Lead architectural vision, define technical roadmap, and balance immediate and long-term goals.
  • Mentor engineering teams and foster a culture of operational excellence.
  • Engage with stakeholders to understand use cases and influence enterprise solutions.
  • Develop strategic roadmaps and automation for deploying software at scale.

Essential Qualifications

  • Extensive practical experience with cloud technologies, especially Amazon Web Services (AWS).
  • Advanced knowledge of Infrastructure as Code, including Terraform, along with Kubernetes and EKS.
  • Experience building AI and machine learning infrastructure with the Hadoop ecosystem and related tools, including Spark, Hive, HDFS, Gobblin, Airflow, EMR, and SageMaker.
  • Programming proficiency in Python, Go, and one other programming language.
  • Ability to develop solutions that scale efficiently and maintain high testing standards.

Preferred Qualifications

  • Knowledge of Unix and Linux operating systems, client-server networking protocols, and observability tools such as Prometheus, Grafana, and the ELK stack.
  • CKA, CKAD, and AWS DevOps Engineer certifications.
  • Experience in designing software solutions and infrastructure systems for large-scale enterprise environments.