Demonstrate Data Prep Kit to OTDI Steering Committee #103

Open
deanwampler opened this issue Feb 12, 2025 · 1 comment
@deanwampler
Contributor

Data Prep Kit (DPK) is an open source data engineering framework released by IBM Research. It was implemented to support the development of IBM's open source Granite family of LLMs.

DPK offers three value propositions:

  1. Workflows as the basis of data transformations: rather than relying on raw Python code to execute complex transforms and handle fault cases, DPK is built on Kubeflow Pipelines, which provides an abstraction for higher-value workflows.
  2. Because DPK is based on Kubeflow and Ray, workflows developed on a laptop can be scaled up to clusters of hundreds of nodes.
  3. Because DPK is open source, a community can collaborate on implementing workflows for difficult problems in LLM data engineering, such as detecting hate speech, personally identifiable information, and licensing issues. All of these problems will require collaboration to address.
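
To make the first point a little more concrete, here is a minimal sketch in plain Python of what moving fault handling out of individual transforms and into a workflow layer could look like. It does not use the DPK, Kubeflow Pipelines, or Ray APIs; `Workflow`, `normalize_whitespace`, and `drop_empty` are hypothetical names invented for this illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A transform returns the (possibly modified) record, or None to drop it.
Transform = Callable[[Record], Optional[Record]]


@dataclass
class Workflow:
    """Hypothetical stand-in for a pipeline; NOT the DPK or Kubeflow API."""

    steps: List[Transform]
    failures: List[tuple] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        output = []
        for record in records:
            current: Optional[Record] = record
            for step in self.steps:
                try:
                    current = step(current)
                except Exception as exc:
                    # Fault handling lives in the workflow, not in each transform.
                    self.failures.append((step.__name__, record, exc))
                    current = None
                if current is None:
                    break  # record was dropped or failed; skip remaining steps
            if current is not None:
                output.append(current)
        return output


def normalize_whitespace(record: Record) -> Optional[Record]:
    # Collapse runs of whitespace and trim both ends of the text.
    return {**record, "text": " ".join(record["text"].split())}


def drop_empty(record: Record) -> Optional[Record]:
    # Drop records whose text is empty after normalization.
    return record if record["text"] else None


workflow = Workflow(steps=[normalize_whitespace, drop_empty])
docs = [{"text": "  hello   world "}, {"text": "   "}, {"id": "no-text-field"}]
cleaned = workflow.run(docs)
```

In this sketch the transforms stay small and unaware of error handling, while the workflow records which step failed on which record. A real orchestrator would presumably add retries, checkpointing, and distributed execution on top of the same separation.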

The OTDI steering committee includes several large, experienced producers of production-quality data. The purpose of this ticket is to determine whether the three value propositions listed above address pain points these data producers are facing and, if so, to discuss how they are currently addressing them.

@deanwampler deanwampler converted this from a draft issue Feb 12, 2025
@deanwampler
Contributor Author

I'll organize the meeting and Boris will demonstrate what we have. We are setting up preliminary meetings with a few people.

Projects
Status: Todo