Staff Engineer – Agentic AIOps (MCP, Context Engineering, LLM Automation)

 

Notice: Equinix is aware of scams involving fake employment offers.

  • JR-159647
  • Hybrid
  • Bengaluru
  • Technology
  • Full time
Who are we?

Equinix is the world’s digital infrastructure company®, shortening the path to connectivity to enable the innovations that enrich our work, life and planet. 
 

We are a place where bold ideas are welcomed, human connection is valued, and everyone has the opportunity to shape their future.

Help us challenge assumptions, uncover bias, and remove barriers—because progress starts with fresh ideas. You’ll find belonging, purpose, and a team that welcomes you—because when you feel valued, you’re empowered to do your best work.

Job Summary

We are looking for a highly skilled Staff Engineer (AIOps) to design, build, and scale the next generation of intelligent operational platforms for our ecosystem. This engineer will work at the intersection of SRE, machine learning, LLMs, observability, and automation, enabling predictive, autonomous operations across a globally distributed environment.

In this role, you will architect and implement AIOps capabilities such as intelligent incident routing, anomaly detection, operational copilots, ChatOps workflows, and automated remediation. You will partner closely with SRE, platform engineering, service management, and product teams to embed intelligence into operational workflows and redefine how digital operations are run.

This is a highly technical, hands-on role requiring strong depth in applied ML/LLMs, operational systems, automation frameworks, and observability data structures.

Responsibilities

AIOps Platform & Intelligence Development

  • Design and build AIOps models (LLMs or classical ML) for anomaly detection, correlation, root-cause identification, and intelligent event clustering.

  • Develop operational copilots and chatbots capable of responding to incidents, surfacing insights, and driving automation through natural language.

  • Build and maintain feature pipelines using telemetry, logs, metrics, traces, and runtime state for operational intelligence use cases.

  • Implement use cases for predictive and preventive operations—capacity forecasting, early warning systems, noisy neighbor detection, etc.

LLM Engineering & Applied AI

  • Build knowledge-grounding systems for operational copilots using runbooks, incident data, historical patterns, service maps, and topology.

  • Integrate LLM-based reasoning into observability and automation platforms.

  • Develop embeddings, retrieval systems (RAG), and intent classification for operational queries.

Automation & Intelligent Remediation

  • Build automated workflows for incident triage, diagnostics, collaboration, and remediation.

  • Architect closed-loop automation patterns connecting alerts → insights → action → verification.

  • Develop reusable automation modules with integration to unified observability, cloud platforms, and orchestration systems.

Data, Observability & Integration

  • Integrate AIOps models with observability platforms (logs, metrics, traces, events, topology).

  • Design real-time inference systems for high-volume telemetry streams.

  • Partner with SRE and platform teams to ensure pipelines, data contracts, and instrumentation support future AIOps workloads.

Operational Excellence & Collaboration

  • Work with transformation teams to define AIOps onboarding patterns, enablement models, and implementation guidelines.

  • Drive AIOps adoption across multiple products/platforms, ensuring reliability, scalability, and continuous improvement.

  • Participate in architecture reviews, data modeling discussions, and SRE transformation initiatives.

Qualifications

  • 8+ years of experience in SRE, platform engineering, ML engineering, data engineering, or AIOps-oriented roles.

  • Strong hands-on experience building ML or LLM-based systems with Python, PyTorch/TensorFlow, or modern LLM frameworks.

  • Experience building automation workflows using tools like StackStorm, Rundeck, Airflow, Jenkins, or cloud-native orchestration.

  • Deep understanding of observability data (logs, metrics, traces) and platforms like Datadog, Splunk, Prometheus, Grafana, ELK.

  • Experience designing and deploying RAG pipelines, embeddings, intent models, or operational chatbots.

  • Strong experience architecting streaming or event-driven systems (Kafka, Kinesis, Pub/Sub).

  • Familiarity with cloud-native systems, Kubernetes, microservices, and modern deployment patterns.

  • Excellent problem-solving skills with the ability to translate operational challenges into ML-based or automation-based solutions.

  • Ability to collaborate across SRE, platform, service management, and engineering teams.

Must Have Skills

  • Hands-on experience with at least one of Claude Code, Codex, or GitHub Copilot

  • Good understanding of context engineering

  • Understanding of agentic harness frameworks

  • Have built at least one MCP server

Equinix is committed to ensuring that our employment process is open to all individuals, including those with a disability.  If you are a qualified candidate and need assistance or an accommodation, please let us know by completing this form.

Equinix is an Equal Employment Opportunity and, in the U.S., an Affirmative Action employer.  All qualified applicants will receive consideration for employment without regard to unlawful consideration of race, color, religion, creed, national or ethnic origin, ancestry, place of birth, citizenship, sex, pregnancy / childbirth or related medical conditions, sexual orientation, gender identity or expression, marital or domestic partnership status, age, veteran or military status, physical or mental disability, medical condition, genetic information, political / organizational affiliation, status as a victim or family member of a victim of crime or abuse, or any other status protected by applicable law. 

We use artificial intelligence in our hiring process.