Observability at scale with Neural Networks: A more proactive approach

Tech Triveni speaker Keshav Peswani

Keshav Peswani

Senior Software Development Engineer

Expedia Group

About Keshav Peswani

Keshav Peswani is a Senior Software Engineer at Expedia Group, focusing on technology and innovation across various platform initiatives. He is involved in building a neural network-based anomaly detection model as part of Expedia's adaptive alerting system, an open-source project for anomaly detection. He is also a core contributor to Haystack, Expedia's open-source distributed tracing project, which facilitates the detection and remediation of problems in a service-oriented architecture. Keshav started his career at D.E. Shaw & Co. and has since worked on projects spanning deep learning (particularly recurrent neural networks), monolithic systems, distributed systems, and big data processing. He is a fast learner and is passionate about deep learning and event-driven architecture.


Session

We at Expedia work on a mission of connecting people to places through the power of technology. To accomplish this, we build and run hundreds of micro-services that together serve every single customer request, generating billions of events. What happens when one or more services fail at the same time? To improve observability in our system, we need to connect these failure points across our distributed topology and so reduce mean time to detect (MTTD) and mean time to know (MTTK).

In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open-source solution. We will do a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data (around 8 TB/day) in production, with a peak throughput of over 550,000 spans/second across hundreds of micro-services.

Beyond our primary use case of distributed tracing, we use this data to trend service errors, latencies, and rates, to perform anomaly detection on the aggregated trends, and to build service-dependency and network-latency graphs.

As these numbers grew, we felt the need for a real-time, intelligent alerting and monitoring system to move towards 24/7 reliability. We will talk about how we use neural networks to perform anomaly detection on trends, including a deep dive into the architecture of the automated training pipeline and of cost-effective online compute using streams.
