Workshop: Preparing High Quality Datasets with Data Prep Kit (2025 Mar 27)

Event Details

Event sign up
🗓️: Thursday, March 27, 2025
⏰: 9 am PST / 11 am CST / 12 pm EST / 5 pm GMT
Duration: 1 hour

Event recording will be available soon

Check resources: code, presentation slides, etc.

Q & A section


Agenda

  • Welcome, housekeeping, etc.
  • Quick intro about AI Alliance (3 min)
  • Workshop: Preparing High Quality Datasets with Data Prep Kit (40 mins)
  • Q&A (10 mins)
  • Wrap-up

Workshop: Hands-on with Data Prep Kit

Overview

When building machine learning and data applications, a significant portion of your time goes to data wrangling, from extracting content to filtering out problematic and low-quality data. In this hands-on session we will explore Data Prep Kit, an open source toolkit designed to streamline these essential tasks. Attendees will learn firsthand how to use Data Prep Kit to improve overall data quality by removing spam and low-quality documents, HAP (Hate Abuse Profanity) speech, and PII (Personally Identifiable Information), leading to a higher-quality dataset.

Description

Join us for an interactive, hands-on session where you will learn to clean up data and prepare high quality datasets.

In this workshop we will do the following:

  • Extract content from various documents (PDFs, HTML)
  • Clean up and remove markup
  • Detect and remove spam content
  • Score and remove low-quality documents
  • Identify and remove PII data
  • Detect and remove HAP (Hate Abuse Profanity) speech from documents
  • More about Data Prep Kit: https://github.com/IBM/data-prep-kit
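
To give a feel for the kind of steps listed above, here is a minimal sketch in plain Python using only the standard library. It does not use the Data Prep Kit API; every function name here (`strip_markup`, `quality_score`, `redact_pii`, `clean_documents`) is hypothetical, and the quality heuristic and PII patterns are toy assumptions chosen for illustration, not what the workshop or the toolkit actually does.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_markup(html: str) -> str:
    """Remove HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Toy patterns (assumptions): real PII detection needs far more than two regexes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def quality_score(text: str) -> float:
    """Toy heuristic: share of alphabetic tokens times share of unique tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    alpha = sum(t.strip(".,;:!?").isalpha() for t in tokens) / len(tokens)
    unique = len({t.lower() for t in tokens}) / len(tokens)
    return alpha * unique


def clean_documents(docs, min_score=0.5):
    """Strip markup, drop low-scoring documents, then redact PII."""
    cleaned = []
    for html in docs:
        text = strip_markup(html)
        if quality_score(text) < min_score:
            continue  # repetitive or mostly non-word content is dropped
        cleaned.append(redact_pii(text))
    return cleaned


if __name__ == "__main__":
    docs = [
        "<p>Contact <b>Ana</b> at ana@example.com for the report.</p>",
        "<div>BUY NOW!!! BUY NOW!!! BUY NOW!!! BUY NOW!!!</div>",
    ]
    for doc in clean_documents(docs):
        print(doc)
```

In this sketch the repetitive second document is dropped because few of its tokens are unique, while the first survives with its email address replaced by a placeholder. A purpose-built toolkit replaces each of these toy functions with a far more robust transform.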

What do you need to participate in this workshop?

  • Comfort with the Python programming language
  • We will run the workshop code using Google Colab (free), so no other setup is needed!

Session Type:
Hands on workshop

Audience:
LLM app developers, data scientists, data engineers

Technical Level:
Intermediate

Prerequisites:
None

Duration
45 mins

Resources

Will be available soon.

Speaker: Sujee Maniyam

AI Engineer, Developer Advocate @ Node51 (Consulting for IBM / The AI Alliance)

Sujee Maniyam is an expert in Generative AI, Machine Learning, Deep Learning, Big Data, Distributed Systems, and Cloud technologies. He is passionate about developer education and fostering community engagement. Sujee has led numerous training sessions, hackathons, and workshops. He is also an author, open source contributor, and frequent speaker at conferences and meetups.

[email protected]   •   LinkedIn   •   Portfolio


Q & A

Please review the session recording.