Keras Tutorial: Checkpointing distributed models with Orbax
Mar 6, 2026•Channel
Video Details
Published: 2 months ago
Duration: 7:34
Video ID: DF9cvunST58
Language: en
Category: Science & Technology
Privacy: Public
Made for Kids: No
Video Type: Regular Video
Performance Metrics
Views: 630
Likes: 31
Comments: 0
Engagement Rate: 4.92%
Likes per 100 views: 4.92
Comments per 1K views: 0.00
Description
Don't let device failures or power outages ruin your training runs. In this tutorial, Yufeng Guo demonstrates how to use Keras with the Orbax checkpointing library. Learn how to implement a custom checkpoint manager and Keras callbacks to ensure your model state is always safely stored.
0:00 Introduction to Orbax & Keras Integration
0:39 Exploring Keras Checkpointing
1:11 Why Extend Keras for Multi-Host Environments?
1:48 What is Orbax?
2:29 Building Utility Classes: KerasOrbaxCheckpointManager & OrbaxCheckpointCallback
2:57 Deep Dive into KerasOrbaxCheckpointManager
3:45 Coding the Get, Save, and Restore State Functions
4:37 Implementing the OrbaxCheckpointCallback
5:12 Protecting Against Device Failures & Preemption
5:31 Implementation Details & Model.fit Integration
6:07 Checkpointing in Action: File Directory Walkthrough
6:56 Summary & Final Tips
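The chapters above describe a manager with get, save, and restore functions, plus a callback that triggers saves during training. The sketch below illustrates that general pattern only. It is not the code from the video and does not use the Orbax or Keras APIs: JSON files stand in for Orbax checkpoints, and the names SimpleCheckpointManager, CheckpointCallback, get_state, and run_demo are hypothetical. The on_epoch_end hook is named to mirror the Keras callback hook of the same name, but the class does not subclass anything from Keras.

```python
import json
import tempfile
from pathlib import Path


class SimpleCheckpointManager:
    """Hypothetical stand-in for a checkpoint manager (not the Orbax API)."""

    def __init__(self, directory, max_to_keep=2):
        if max_to_keep < 1:
            raise ValueError("max_to_keep must be at least 1")
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_to_keep = max_to_keep

    def all_steps(self):
        # Checkpoints are stored as step_<N>.json; return the step numbers in order.
        return sorted(
            int(p.stem.split("_")[1]) for p in self.directory.glob("step_*.json")
        )

    def latest_step(self):
        steps = self.all_steps()
        return steps[-1] if steps else None

    def save(self, step, state):
        # Write to a temporary file and rename it, so an interrupted write
        # never leaves a partial file under the final checkpoint name.
        tmp = self.directory / f"step_{step}.tmp"
        tmp.write_text(json.dumps(state))
        tmp.replace(self.directory / f"step_{step}.json")
        # Delete everything except the newest max_to_keep checkpoints.
        for old in self.all_steps()[:-self.max_to_keep]:
            (self.directory / f"step_{old}.json").unlink()

    def restore(self, step=None):
        # Restore the requested step, or the latest one if none is given.
        step = self.latest_step() if step is None else step
        if step is None:
            return None
        return json.loads((self.directory / f"step_{step}.json").read_text())


class CheckpointCallback:
    """Hypothetical callback that saves state every `save_every` epochs."""

    def __init__(self, manager, get_state, save_every=1):
        self.manager = manager
        self.get_state = get_state  # callable returning JSON-serializable state
        self.save_every = save_every

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.save_every == 0:
            self.manager.save(epoch, {"state": self.get_state(), "logs": logs or {}})


def run_demo():
    # Fake training loop: "weights" is a plain dict updated once per epoch.
    with tempfile.TemporaryDirectory() as tmp:
        manager = SimpleCheckpointManager(tmp, max_to_keep=2)
        weights = {"w": 0.0}
        callback = CheckpointCallback(manager, get_state=lambda: dict(weights))
        for epoch in range(4):
            weights["w"] += 1.0
            callback.on_epoch_end(epoch, logs={"loss": 1.0 / (epoch + 1)})
        return manager.all_steps(), manager.restore()
```

Calling run_demo() runs four fake epochs and returns the retained step numbers together with the state restored from the latest checkpoint. In a real multi-host setup, the developer guide listed under Resources is the place to look for the actual Orbax-based manager and callback.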
Resources:
Orbax checkpointing in Keras - Developer guide → https://goo.gle/40T2LI8
ModelCheckpoint - Keras 3 API documentation → https://goo.gle/3PkAlEq
Subscribe to Google for Developers → https://goo.gle/developers
Speaker: Yufeng Guo
Products Mentioned: Google AI