Commit 5952a25

added 2025-03-27 session
1 parent b714965 commit 5952a25

2 files changed: +93 -0 lines changed

README.md

+1
@@ -19,6 +19,7 @@ Office Hours are held every week, offering engaging talks and live Q&A sessions

 (latest first)

+- **2025-Mar-27:** [Workshop: Preparing High Quality Datasets with Data Prep Kit](events/2025-03-27__high-quality-dataset-with-dpk.md)
 - **2025-Mar-20:** [Workshop: Hands on with Data Prep Kit](events/2025-03-20__data-prep-kit-hands-on.md)
 - **2025-Mar-13:** [Workshop: Hands on with Docling](events/2025-03-13__docling-hands-on.md)
 - **2025-Mar-06:** [Introducing GneissWeb - a state-of-the-art LLM pre-training dataset](events/2025-03-06__gneissweb.md)
@@ -0,0 +1,92 @@
# Workshop: Preparing High Quality Datasets with Data Prep Kit (2025 Mar 27)

<!-- ## 🔗 [tinyurl.com/jzbvaeak](https://tinyurl.com/jzbvaeak) -->

<!-- <img src="../assets/qrcode_2025-02-27__data-prep-review.png" width="400px"> -->

## Event Details

[Event sign up](https://www.meetup.com/ibm-developer-sf-bay-area-meetup/events/306536813){:target="_blank" rel="noopener"}<br>
🗓️: **Thursday, March 27, 2025**<br>
⏰: **9 am PST / 11 am CST / 12 pm EST / 5 pm GMT**<br>
Duration: **1 hour**

**Event recording will be available soon**

**[Check resources](#resources)** - code, presentation slides, etc.

**[Q & A section](#q--a)**

---

## Agenda

- Welcome, housekeeping, etc.
- Quick intro about AI Alliance (3 mins)
- Workshop: Preparing High Quality Datasets with Data Prep Kit (40 mins)
- Q&A (10 mins)
- Wrap-up

## Workshop: Preparing High Quality Datasets with Data Prep Kit

![](../assets/data-prep-kit-1.png)

### Overview

When building machine learning and data applications, a significant portion of your time goes to data wrangling, from extracting content to filtering out problematic and low-quality data. In this hands-on session we will explore Data Prep Kit, an open source toolkit designed to streamline these essential tasks. Attendees will learn firsthand how to use Data Prep Kit to improve overall data quality by removing spam and low-quality documents, HAP (Hate, Abuse, Profanity) speech, and PII (Personally Identifiable Information), leading to a higher-quality dataset.

### Description

Join us for an interactive, hands-on session where you will learn to clean up data and prepare high-quality datasets.

In this workshop we will do the following:

- Extract content from various documents (PDFs, HTML)
- Clean up and remove markup
- Detect and remove spam content
- Score and remove low-quality documents
- Identify and remove PII data
- Detect and remove HAP (Hate, Abuse, Profanity) speech from documents

More about Data Prep Kit: https://github.com/IBM/data-prep-kit

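To give a feel for the kind of steps listed above, here is a minimal, standard-library-only sketch of a cleaning pass. It does not use Data Prep Kit or its API; every name in it (`strip_markup`, `quality_score`, `contains_hap`, `redact_pii`, `prepare`, `BLOCKLIST`) is hypothetical, and the heuristics are toy stand-ins for the real transforms covered in the workshop.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (a crude markup remover)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Toy email pattern; real PII detection covers far more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical one-word blocklist standing in for a HAP classifier.
BLOCKLIST = {"idiot"}


def _tokens(text: str):
    """Split on whitespace and trim common trailing/leading punctuation."""
    return [w.strip(".,;:!?") for w in text.split()]


def quality_score(text: str) -> float:
    """Toy quality heuristic: fraction of tokens that are purely alphabetic."""
    words = _tokens(text)
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


def contains_hap(text: str) -> bool:
    """Flag a document if any token appears in the blocklist."""
    return any(w.lower() in BLOCKLIST for w in _tokens(text))


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def prepare(docs, min_quality=0.5):
    """Strip markup, drop low-quality and HAP documents, then redact PII."""
    cleaned = []
    for raw in docs:
        text = strip_markup(raw)
        if quality_score(text) < min_quality or contains_hap(text):
            continue
        cleaned.append(redact_pii(text))
    return cleaned


if __name__ == "__main__":
    sample = [
        "<p>Contact me at jane@example.com for details</p>",
        "<div>$$$ 100% 777 !!!</div>",
        "<p>You are an idiot</p>",
    ]
    for doc in prepare(sample):
        print(doc)
```

Filtering before redaction is a deliberate choice in this sketch: documents that will be dropped anyway never need PII handling. The `min_quality` threshold of 0.5 is an arbitrary illustrative value.
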
**What do you need to participate in this workshop?**

- Comfort with the Python programming language
- We will run the workshop code using Google Colab (free) - no other setup is needed!

**Session Type:**
Hands-on workshop

**Audience:**
LLM app developers, data scientists, data engineers

**Technical Level:**
Intermediate

**Prerequisites:**
None

**Duration:**
45 mins

### Resources

Will be available soon.

### Speaker: Sujee Maniyam

**AI Engineer, Developer Advocate @ Node51 (Consulting for [IBM / The AI Alliance](https://thealliance.ai/))** <br>

Sujee Maniyam is an expert in Generative AI, Machine Learning, Deep Learning, Big Data, Distributed Systems, and Cloud technologies. He is passionate about developer education and fostering community engagement. Sujee has led numerous training sessions, hackathons, and workshops. He is also an author, open source contributor, and frequent speaker at conferences and meetups.

[email protected] &nbsp;&nbsp;
<img src="../assets/linkedin.svg" width="16 px"> [LinkedIn](https://www.linkedin.com/in/sujeemaniyam/){:target="_blank" rel="noopener"} &nbsp;&nbsp;
[portfolio](https://sujee.dev/portfolio?utm_medium=speaker_bio&utm_source=the-ai-alliance.github.io&utm_campaign=speaking_aialliance_offie_hours){:target="_blank" rel="noopener"}

---

## Q & A

Please review the session recording.
