Leveraging AI for efficient incident response
Abstract
The article discusses Meta's efforts to advance its investigation tooling using AI, with a focus on improving root cause analysis for issues in its web monorepo. It describes a system that combines heuristic-based retrieval with large language model (LLM)-based ranking to provide AI-assisted root cause analysis. In backtesting, the system achieved 42% accuracy in identifying root causes.
Q&A
[01] Investigation Tools and Root Cause Analysis
1. What are the key challenges in investigating issues in systems dependent on monolithic repositories?
- The growing number of changes made across many teams makes investigations hard to scale
- Responders need to build context on the investigation, such as what is broken, which systems are involved, and who might be impacted
2. How does Meta's AI-based system aim to address these challenges?
- The system incorporates a heuristics-based retriever to reduce the search space from thousands of changes to a few hundred
- It then uses an LLM-based ranker to identify the root cause among these changes
3. What was the key factor in achieving 42% accuracy in the root cause analysis?
- Fine-tuning a Llama 2 (7B) model using historical investigations for which the underlying root cause was known
4. What measures are taken to mitigate the risks of the AI-based system?
- Prioritizing closed feedback loops and explainability of results
- Relying on confidence measurement methodologies to detect low-confidence answers and avoid recommending them
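The retrieve, rank, and filter flow described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Meta's implementation: the `Change` type, the directory-overlap heuristic, the `keyword_score` stand-in for the fine-tuned LLM ranker, and the `min_confidence` threshold are all hypothetical names and choices introduced here for the example.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical record for one code change in the monorepo."""
    change_id: str
    touched_dirs: FrozenSet[str]
    summary: str


def retrieve_candidates(
    changes: Sequence[Change], impacted_dirs: FrozenSet[str]
) -> List[Change]:
    """Heuristic retrieval (assumed rule): keep only changes that touch
    at least one directory owned by the impacted system."""
    return [c for c in changes if c.touched_dirs & impacted_dirs]


def rank_candidates(
    candidates: Sequence[Change],
    score_fn: Callable[[Change], float],
    top_k: int = 5,
) -> List[Tuple[Change, float]]:
    """Score every candidate and return the top_k, highest score first.
    In a real system score_fn would wrap a call to the LLM ranker."""
    scored = [(c, score_fn(c)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def recommend(
    ranked: Sequence[Tuple[Change, float]], min_confidence: float = 0.5
) -> List[Tuple[Change, float]]:
    """Suppress low-confidence answers instead of showing them to responders."""
    return [(c, s) for c, s in ranked if s >= min_confidence]


def keyword_score(change: Change, symptom_terms: FrozenSet[str]) -> float:
    """Toy stand-in for the LLM ranker: fraction of symptom terms that
    appear in the change summary."""
    if not symptom_terms:
        return 0.0
    words = set(change.summary.lower().split())
    return len(words & symptom_terms) / len(symptom_terms)


if __name__ == "__main__":
    changes = [
        Change("D1", frozenset({"www/feed"}), "refactor feed ranking cache"),
        Change("D2", frozenset({"www/ads"}), "update ads cache"),
        Change("D3", frozenset({"www/feed", "www/common"}), "rename logging helper"),
    ]
    candidates = retrieve_candidates(changes, frozenset({"www/feed"}))
    scorer = partial(keyword_score, symptom_terms=frozenset({"feed", "cache"}))
    for change, score in recommend(rank_candidates(candidates, scorer)):
        print(change.change_id, score)
```

Keeping the scorer behind a plain callable means the toy keyword matcher could be swapped for a model call without touching the retrieval or confidence-filter steps.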
[02] Future Developments
1. What are the future plans for expanding the capabilities of the AI-based investigation tools?
- Enabling the systems to autonomously execute full workflows and validate their results
- Utilizing AI to detect potential incidents prior to code push, proactively mitigating risks before they arise