When the Cloud Fails: Debugging the "Undocumented" - Dhruv Jain, Gojek (GoTo Group) Indonesia
Jun 3, 2026
Video Overview
Video Details
Published: 2 weeks ago
Duration: 20:59
Video ID: 13lvqYYdpYg
Language: en
Category: Science & Technology
Privacy: Public
Made for Kids: No
Video Type: Regular Video
Performance Metrics
Views: 54
Likes: 0
Comments: 0
Engagement Rate: 0.00%
Likes per 100 views: 0.00
Comments per 1K views: 0.00
Description
What happens when a system degrades under high load while all internal metrics remain “green”? At hyperscale, supporting on-demand services across Southeast Asia’s most populous countries, a team observed up to a 7% drop in message delivery. The root cause was not application code, messaging brokers, or load balancers, but a hidden limitation deep within a cloud provider’s firewall.
This war-story session presents a forensic investigation into a managed cloud load balancer and its interaction with connection-tracking tables. The talk walks through the production cutover that triggered the issue and the targeted load testing that ultimately isolated the failure to cloud infrastructure behavior invisible to standard monitoring.
Beyond root cause analysis, the session focuses on outcomes: how sustained, evidence-based debugging led the cloud provider to acknowledge the issue—initially labeled a “limitation”—and introduce a new observability metric, firewall/connections_tracked. Attendees will leave with a practical framework for debugging black-box cloud failures and identifying the node-level metrics needed to detect silent network drops before they impact users.
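As a rough illustration of the kind of node-level signal the abstract refers to, the sketch below computes how full a Linux host's own connection-tracking table is. It is not the method from the talk: the failure described there sat in the cloud provider's firewall, which these node files cannot see, and the provider-side firewall/connections_tracked metric is only named, not specified, in the description. The function names and the 0.8 warning threshold are assumptions made for this example. The two /proc paths are the standard netfilter sysctls and exist only when the nf_conntrack module is loaded.

```python
from pathlib import Path

# Standard Linux netfilter sysctl directory (present only if nf_conntrack is loaded).
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")

# Assumed alerting threshold for this sketch; tune it for the actual workload.
WARN_THRESHOLD = 0.8


def conntrack_utilization(count: int, maximum: int) -> float:
    """Return the fraction of the connection-tracking table currently in use."""
    if maximum <= 0:
        raise ValueError("conntrack maximum must be positive")
    return count / maximum


def conntrack_status(count: int, maximum: int, threshold: float = WARN_THRESHOLD) -> str:
    """Return "WARN" when utilization reaches the threshold, otherwise "OK"."""
    return "WARN" if conntrack_utilization(count, maximum) >= threshold else "OK"


def read_conntrack(base: Path = CONNTRACK_DIR) -> tuple[int, int]:
    """Read (current entries, table capacity) from the netfilter sysctl files."""
    count = int((base / "nf_conntrack_count").read_text().strip())
    maximum = int((base / "nf_conntrack_max").read_text().strip())
    return count, maximum


if __name__ == "__main__":
    if (CONNTRACK_DIR / "nf_conntrack_count").exists():
        count, maximum = read_conntrack()
        print(f"{count}/{maximum} tracked connections: {conntrack_status(count, maximum)}")
    else:
        print("nf_conntrack is not loaded on this host")
```

A table that fills up typically drops new connections without any application-level error, which is why a utilization ratio like this is worth exporting alongside the usual request metrics.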