Data Preparation using Data Prep Kit (2025-Feb-06)

Event Details

Event information{:target="_blank" rel="noopener"}
🗓️: Thursday Feb 06, 2025
⏰: 9 am PST / 11 am CST / 12 pm EST / 5 pm GMT
Duration: 1 hour

Event recording is available{:target="_blank" rel="noopener"}

See the Resources section below for code, presentation slides, and more.


Agenda

  • Welcome
  • Quick intro about AI Alliance (3 min)
  • Presentation: Data preparation with Data Prep Kit (40 mins)
  • Q&A (10 mins)
  • Wrap-up

Session: Data preparation with Data Prep Kit

About

When building LLM applications, a significant portion of your time will be dedicated to data wrangling, from content extraction and cleaning to de-duplication and filtering out problematic data. This talk introduces Data Prep Kit, an open source toolkit designed to streamline these essential tasks. Attendees will learn how Data Prep Kit can accelerate data preparation, improve overall data quality, and enhance the efficiency of building robust LLM applications.

Data Prep Kit is a comprehensive Python library that democratizes and accelerates data preparation by providing out-of-the-box solutions for common tasks. Engineered to scale from a single laptop to large cloud clusters, it has been successfully used to process terabytes of data for training IBM Granite Large Language Models (LLMs).

Data Prep Kit offers a robust feature set including duplicate elimination, advanced document and code handling, language detection (for both spoken and programming languages), removal of personally identifiable information (PII), as well as spam, hate speech, and malware detection.
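To make two of these tasks concrete, here is a minimal sketch of exact de-duplication and email redaction in plain Python. It is only an illustration of the ideas and does not use Data Prep Kit or reflect its API; the function names (`dedupe_exact`, `redact_emails`, `prepare`) and the `<EMAIL>` placeholder are hypothetical, and a real pipeline would cover many more PII types and fuzzy duplicates.

```python
import hashlib
import re

# Simplified email pattern for illustration only; real PII detection is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedupe_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def prepare(docs: list[str]) -> list[str]:
    """Hypothetical two-step pipeline: de-duplicate, then redact emails."""
    return [redact_emails(doc) for doc in dedupe_exact(docs)]


if __name__ == "__main__":
    sample = ["Hello  World", "hello world", "Contact bob@example.com now"]
    print(prepare(sample))
```

Hashing a normalized copy keeps memory use small compared with storing every document for comparison, but it only catches exact duplicates after normalization; near-duplicates need a fuzzy method.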

Join this session to explore how this tool can accelerate data preparation, enhance data quality, and improve the efficiency of building robust LLM applications. A live demonstration will also showcase some of its key features, time permitting.

More about Data Prep Kit{:target="_blank" rel="noopener"}

Session Type:
Presentation and Demo

Audience:
LLM app developers, data scientists, data engineers

Technical Level:
Beginner - Intermediate

Prerequisites:
None

Resources

📺 Presentation: Data Prep for LLM Applications with Data Prep Kit{:target="_blank" rel="noopener"}

💻 Code

https://github.com/IBM/data-prep-kit

🖥️ code: examples/notebooks/pdf-processing-1

We will run through a PDF processing pipeline.

Notebook: pdf_processing_1_python.ipynb{:target="_blank" rel="noopener"} - This notebook can be run on Google Colab.

Support and Community

🙋 Ask questions, get help, give us feedback at Data Prep Kit discussion forum{:target="_blank" rel="noopener"}

Speaker: Sujee Maniyam

AI Engineer, Developer Advocate @ Node51 (Consulting for IBM / The AI Alliance)

Sujee Maniyam is an expert in Generative AI, Machine Learning, Deep Learning, Big Data, Distributed Systems, and Cloud technologies. He is passionate about developer education and fostering community engagement. Sujee has led numerous training sessions, hackathons, and workshops. He is also an author, open source contributor, and frequent speaker at conferences and meetups.

[email protected]   •   LinkedIn{:target="_blank" rel="noopener"}   •   💼 Portfolio{:target="_blank" rel="noopener"}