Keras Tutorial: Checkpointing distributed models with Orbax

Mar 6, 2026
Data from YouTube Data API v3

Video Overview

Video Details

Published: 2 months ago
Duration: 7:34
Video ID: DF9cvunST58
Language: en
Category: Science & Technology
Privacy: Public
Made for Kids: No
Video Type: Regular Video

Performance Metrics

Views: 630
Likes: 31
Comments: 0
Engagement Rate: 4.92%
Likes per 100 views: 4.92
Comments per 1K views: 0.00

Description

Don't let device failures or power outages ruin your training runs. In this tutorial, Yufeng Guo demonstrates how to use Keras with the Orbax checkpointing library. Learn how to implement a custom checkpoint manager and Keras callbacks to ensure your model state is always safely stored.

0:00 Introduction to Orbax & Keras Integration
0:39 Exploring Keras Checkpointing
1:11 Why Extend Keras for Multi-Host Environments?
1:48 What is Orbax?
2:29 Building Utility Classes: KerasOrbaxCheckpointManager & OrbaxCheckpointCallback
2:57 Deep Dive into KerasOrbaxCheckpointManager
3:45 Coding the Get, Save, and Restore State Functions
4:37 Implementing the OrbaxCheckpointCallback
5:12 Protecting Against Device Failures & Preemption
5:31 Implementation Details & Model.fit Integration
6:07 Checkpointing in Action: File Directory Walkthrough
6:56 Summary & Final Tips

Resources:
Orbax checkpointing in Keras - Developer guide → https://goo.gle/40T2LI8
ModelCheckpoint - Keras 3 API documentation → https://goo.gle/3PkAlEq
Subscribe to Google for Developers → https://goo.gle/developers

Speaker: Yufeng Guo
Products Mentioned: Google AI
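
The description names a checkpoint manager with get, save, and restore state functions, plus protection against preemption. The page does not show the video's code, so the following is only a minimal sketch of that general pattern using the Python standard library. It is not the Orbax or Keras API: SimpleCheckpointManager, its parameters (save_interval_steps, max_to_keep), the JSON file layout, and the toy train loop are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


class SimpleCheckpointManager:
    """Hypothetical stand-in for a checkpoint manager; NOT the Orbax API."""

    def __init__(self, directory, save_interval_steps=1, max_to_keep=2):
        # max_to_keep is assumed to be at least 1.
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.save_interval_steps = save_interval_steps
        self.max_to_keep = max_to_keep

    def all_steps(self):
        # Each checkpoint is stored as a file named "<step>.json".
        return sorted(int(p.stem) for p in self.directory.glob("*.json"))

    def latest_step(self):
        steps = self.all_steps()
        return steps[-1] if steps else None

    def save(self, step, state):
        # Skip steps that do not fall on the save interval.
        if step % self.save_interval_steps != 0:
            return False
        # Write to a temporary file, then rename, so an interrupted save
        # never leaves a half-written checkpoint under the final name.
        tmp = self.directory / f"{step}.tmp"
        tmp.write_text(json.dumps(state))
        tmp.replace(self.directory / f"{step}.json")
        # Delete the oldest checkpoints beyond the retention limit.
        for old in self.all_steps()[:-self.max_to_keep]:
            (self.directory / f"{old}.json").unlink()
        return True

    def restore(self, step=None):
        # Default to the most recent checkpoint; None means nothing saved yet.
        step = self.latest_step() if step is None else step
        if step is None:
            return None
        return json.loads((self.directory / f"{step}.json").read_text())


def train(manager, total_steps):
    # Toy training loop: resume from the latest checkpoint if one exists.
    state = manager.restore() or {"step": 0, "weights": [0.0]}
    for step in range(state["step"] + 1, total_steps + 1):
        state = {"step": step, "weights": [w + 0.5 for w in state["weights"]]}
        manager.save(step, state)
    return state


ckpt_dir = tempfile.mkdtemp()
manager = SimpleCheckpointManager(ckpt_dir, save_interval_steps=2, max_to_keep=2)
train(manager, total_steps=5)            # simulated interrupted run
resumed = train(manager, total_steps=8)  # restarts from the step-4 checkpoint
```

In the simulated restart, the second call to train picks up from the last saved step instead of step 0, which is the behavior a checkpoint callback is meant to provide inside a real training loop. A real multi-host setup has concerns this sketch ignores, such as coordinating writes across hosts and saving sharded arrays.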
