Machine Learning Project For Beginners with XGBoost + NVIDIA GPU 🤖🧠
Oct 19, 2025•Channel
Video Overview
Video Details
Published: 8 months ago
Duration: 34:08
Video ID: F_8RKstP2X8
Language: en
Category: Science & Technology
Privacy: Public
Made for Kids: No
Video Type: Regular Video
Performance Metrics
Views: 4.7K
Likes: 384
Comments: 60
Engagement Rate: 9.48%
Likes per 100 views: 8.20
Comments per 1K views: 12.81
Description
Ever wondered what makes people tip more in taxis? 🚕💵 In this hands-on machine learning project, we’ll build a complete workflow on real-world NYC data — cleaned, engineered, and trained entirely on GPU using XGBoost CUDA and cuDF Pandas! 🐼
(🚨No GPU?🚨 I’ll show you how to use one for free on Google Colab! 😉)
You’ll see how professionals approach problems, handle massive data, and fix memory errors - designing real data-science pipelines step by step! 😎 By the end, you’ll have a meaningful project that’s fun to build, technically impressive, and looks perfect on your portfolio!! 🤩 Join me on this adventure — and learn how to think like a pro-level data scientist.
💡 What You’ll Learn
- Handling Real-World Datasets: Cleanup, Missing Values, Anomalies, Aggregation. 📊
- Solving memory limitations and runtime crashes with cuDF Pandas + RMM. 💾
- Accelerating machine learning with XGBoost on NVIDIA GPUs. 🤖
- Evaluating your model’s performance and continuing to make it smarter! 💪🤓
- And most importantly — develop the mindset of a data scientist, solving problems instead of guessing. 🔎
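To make the missing-values step concrete, here is a minimal sketch of the detect / replace-with-zero / replace-with-mean pattern. It uses plain pandas on a tiny made-up table; the column names and the choice of which column gets which treatment are assumptions for illustration, not taken from the video. With cuDF Pandas loaded, the same pandas calls are expected to run on the GPU.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the taxi data (column names are assumed for illustration).
data = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 1.0],
    "fare_amount": [10.0, 20.0, np.nan, 30.0],
    "tip_amount": [2.0, 4.0, 3.0, 6.0],
})

# 1. Detect: count missing values in every column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# 2. Replace with zero (illustrative choice of column).
data["passenger_count"] = data["passenger_count"].fillna(0)

# 3. Replace with the column mean, computed from the non-missing values.
data["fare_amount"] = data["fare_amount"].fillna(data["fare_amount"].mean())

print(data)
```

Which strategy fits depends on what the column means, so it is worth checking each column before picking one.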
🧠 What Makes This Project Different
This isn’t another “beginner demo” — it’s a real workflow based on real data and real problems. You’ll experience the same challenges professionals face: huge sloppy datasets, missing labels, CPU and GPU memory limits — all explained step by step, in simple terms.
I’ll show you why we make each decision, not just how to code it — so you learn to think, debug, and reason like a pro.
🔗 Important Links
------------------------------------------------
🔹Download Tutorial Code and Smaller Dataset from GitHub:
https://github.com/MariyaSha/nyc_taxi_xgboost_lab
🔹 Download Full Dataset from NYC Open Data:
https://data.cityofnewyork.us/Transportation/2023-Yellow-Taxi-Trip-Data/4b4i-vvec/about_data
🔹RAPIDS Installation Guide:
https://docs.rapids.ai/install/
🔹Official NVIDIA Google Colab Notebook - 🧐 VERY ADVANCED 🧐:
https://colab.research.google.com/drive/1vlzvB981pej2RlKmXBUF1CNzyxl8YpJg
📽️ Important Tutorials
------------------------------------------------
⭐ WSL + Conda Setup:
https://youtu.be/luM5kwH6tjQ
⭐ Machine Learning with Scikit-Learn:
https://youtu.be/-IvNzmrcyUM
⭐ cuDF Pandas For Beginners:
https://youtu.be/9KsJRyZJ0vo
⭐ What is CUDA?
https://youtu.be/r9IqwpMR9TE
⏰ Time Stamps
------------------------------------------------
01:08 - Download Dataset
01:43 - Solving Big Data Problems with GPU Processing
02:46 - Google Colab Setup with Free T4 GPU
03:02 - Local Setup with NVIDIA GPU
03:43 - RAPIDS Installation Guide
05:07 - Solving Jupyter Kernel Crash with cuDF Pandas
05:29 - Handling Missing Values
05:53 - Detect Missing Values
06:29 - Replace with Zero
07:31 - Replace with Mean
08:57 - Investigate Columns with Ambiguous Names
11:21 - Drop Columns (If No Other Option)
12:01 - Split Data For Training & Testing
12:07 - Shuffle Data
13:39 - Features & Targets Split
14:02 - Train & Test Split
16:20 - Load XGBoost Model on GPU
17:55 - Train XGBoost Model
18:08 - Test XGBoost Model and Get Predictions
18:45 - Solve ValueError : DataFrame.dtypes must be int float bool or category
20:15 - Evaluate Trained Model
22:39 - Data Optimization & Anomalies
22:41 - Detect Data Anomalies with Aggregation
23:47 - Solve XGBoostError : No GPU Memory Left with RMM
25:04 - Handle Negative Charges and Unrealistic Distances
28:19 - Detect and Handle Unrealistic Transactions
30:28 - Second Train Run on Optimized Data
31:45 - Best Practices
31:45 - Plot Training Results & Feature Importance
32:17 - Hyperparameter Tuning
32:49 - Date Extraction : From String to Int or Category
33:05 - K-Fold Validation
33:45 - Thanks for Watching!
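The middle chapters (date extraction, shuffling, features/target split, train/test split, loading XGBoost on GPU) can be sketched roughly as below. This is an illustration under assumptions, not the video's exact code: the column names, the helper names, the 80/20 split, the `n_estimators` value, and the choice of `XGBRegressor` are all mine. The `device="cuda"` parameter assumes XGBoost 2.0 or newer.

```python
import pandas as pd


def datetime_to_features(data, column):
    """Swap a datetime string column for integer hour and weekday columns."""
    parsed = pd.to_datetime(data[column])
    out = data.drop(columns=[column])
    out["pickup_hour"] = parsed.dt.hour
    out["pickup_weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
    return out


def split_features_target(data, target, test_fraction=0.2, seed=42):
    """Shuffle rows, separate the target column, and hold out a test set."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    X = shuffled.drop(columns=[target])
    y = shuffled[target]
    split = len(shuffled) - int(len(shuffled) * test_fraction)
    return X.iloc[:split], X.iloc[split:], y.iloc[:split], y.iloc[split:]


def gpu_params(device="cuda"):
    """Hypothetical starting parameters; tune them for your own data."""
    return {"device": device, "tree_method": "hist", "n_estimators": 200}


def train_model(X_train, y_train, device="cuda"):
    """Fit a regressor; xgboost is imported here so the helpers above
    still work on machines where it is not installed."""
    import xgboost as xgb

    model = xgb.XGBRegressor(**gpu_params(device))
    model.fit(X_train, y_train)
    return model


# Made-up rows standing in for the taxi data (column names are assumed).
data = pd.DataFrame({
    "tpep_pickup_datetime": [f"2023-01-01 {hour:02d}:15:00" for hour in range(10)],
    "trip_distance": [1.0, 2.5, 0.8, 5.2, 3.1, 7.4, 1.9, 2.2, 4.0, 6.3],
    "tip_amount": [1.0, 2.0, 0.0, 4.5, 3.0, 6.0, 1.5, 2.0, 3.5, 5.0],
})
data = datetime_to_features(data, "tpep_pickup_datetime")
X_train, X_test, y_train, y_test = split_features_target(data, target="tip_amount")
print(X_train.dtypes)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

On a machine with XGBoost and a CUDA GPU, `model = train_model(X_train, y_train)` followed by `model.predict(X_test)` would complete the loop. Converting the datetime strings to integers first is one way to avoid the "DataFrame.dtypes must be int float bool or category" error mentioned in the chapter list.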
🚀 Environment Setup
------------------------------------------------
You can run this project in two ways, coding along with me:
1️⃣ Google Colab:
- Change your runtime to T4 GPU.
- Use the smaller version of the NYC Taxi dataset (5 million rows). Download above 👆
2️⃣ Local setup:
- Make sure you have a CUDA-compatible GPU.
- Use WSL and Miniforge/Conda (⚠️MUST! ⚠️).
- Use the current command from the RAPIDS Installation Guide for your setup (⚠️MUST! ⚠️).
- Use the full version of the NYC Taxi dataset (38 million rows). Download above 👆
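Before installing anything, a quick way to check that an NVIDIA GPU is visible from Python (in Colab or in WSL) is to call the `nvidia-smi` tool that ships with the driver. This is a small sketch using only the standard library; the helper name `gpu_visible` is hypothetical, and a `True` result only means the driver can see a GPU, not that RAPIDS is installed.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on PATH and lists GPUs without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # "-L" asks nvidia-smi to list the GPUs it can see.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("NVIDIA GPU visible:", gpu_visible())
```

If this prints `False` in Colab, the runtime type is probably still set to CPU.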
💻 Tutorial Code
------------------------------------------------
📌 Remove all the rows that have negative numbers:
data = data[~data.select_dtypes("number").lt(0).any(axis=1)]
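Here is how that one-liner behaves on a tiny made-up table (the column names and values are assumptions for illustration). Only numeric columns are checked, so text columns never cause a row to be dropped.

```python
import pandas as pd

# Four made-up trips; rows 1 and 2 each contain one negative number.
data = pd.DataFrame({
    "fare_amount": [12.5, -3.0, 8.0, 20.0],
    "tip_amount": [2.0, 1.0, -0.5, 4.0],
    "payment_type": ["card", "card", "cash", "card"],
})

# True for every row where at least one numeric column is below zero.
negative_rows = data.select_dtypes("number").lt(0).any(axis=1)

# Keep only the rows that are NOT flagged.
data = data[~negative_rows]
print(data)
```

Rows 0 and 3 survive; the two rows with a negative fare or tip are removed.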
📌 Solve "XGBoostError: No GPU memory is left" and kernel crashes:
import rmm
rmm.reinitialize(pool_allocator=True, initial_pool_size="8GB")
#MachineLearning #DataScience #Python #BigData #GPU #NVIDIA #RAPIDS #DataAnalysis #DataCleaning #PythonTutorial #AI #pythonprogramming