Evaluate how we should use Croissant #92
Comments
The Croissant spec is here: https://docs.mlcommons.org/croissant/docs/croissant-spec.html. The main advantage of Croissant is Resources, which describe individual files and their content. This is very useful, for example, for checking whether individual documents have license info. It also has some additional goodies, including record sets, fields, data sets, formats, transforms, etc. This, of course, is not free: the Croissant metadata can be quite large and requires constant maintenance from the dataset owner. We will need a larger conversation with dataset owners to figure out the pros and cons of adding this.
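To make the Resources point concrete, here is a minimal sketch of listing the file-level resources in a Croissant document and checking whether each one carries its own license. The sample document is hand-written for illustration only. The `distribution`, `cr:FileObject`, and `cr:FileSet` names follow the Croissant spec; a per-resource `license` property is an assumption (it is a schema.org property, and whether dataset owners actually populate it per file is part of what would need to be checked).

```python
import json

# Hypothetical, hand-written Croissant-style document (illustration only).
SAMPLE = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "train.csv", "name": "train.csv",
     "encodingFormat": "text/csv",
     "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"@type": "cr:FileSet", "@id": "images", "name": "images",
     "encodingFormat": "image/jpeg", "includes": "images/*.jpg"}
  ]
}
""")


def resource_licenses(doc):
    """Return (resource name, license or None) for each entry in `distribution`."""
    dist = doc.get("distribution") or []
    if isinstance(dist, dict):  # JSON-LD allows a single object instead of a list
        dist = [dist]
    return [(r.get("name") or r.get("@id"), r.get("license")) for r in dist]


def resources_missing_license(doc):
    """Names of resources that do not declare a license of their own (assumed property)."""
    return [name for name, lic in resource_licenses(doc) if lic is None]


print(resources_missing_license(SAMPLE))
```

In this sample only the `images` file set lacks its own license; the dataset-level `license` would still apply to it.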
If we are going to take advantage of the Croissant spec, the immediate action would be to make sure any dataset that is listed in the OTDI catalog has a Croissant interface. Not sure if this should be a hard requirement, but it should be discussed. Datasets from seven data producers are currently listed as trusted datasets in the OTDI catalog:
TL;DR:
Related to #91 - HuggingFace dataset analytics. There are two daily jobs currently deployed on AWS ECS to query HuggingFace Hub. The first job gets a list of all the datasets; the second retrieves the Croissant data for each. Results are written to Parquet files in S3, where serverless SQL can then be used to query them: ![]() The SQL: ![]()
Using the above, as of 2025-Feb-20, there are a total of 310,292 datasets in HuggingFace Hub: 255,539 have Croissant metadata and 54,753 do not.
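For readers who want a feel for the two jobs described in this thread, here is a rough sketch of the second one (fetch Croissant per dataset) plus the coverage count. It is not the deployed code. The endpoint URL pattern, the `dataset_id` column name, and all function names are assumptions for illustration; the `date` and `croissant` column names mirror the query shown later in this thread. Writing Parquet to S3 is left out to keep the sketch standard-library only.

```python
import urllib.request

# Assumed endpoint shape for a dataset's Croissant metadata; verify before relying on it.
CROISSANT_URL = "https://huggingface.co/api/datasets/{dataset_id}/croissant"


def fetch_croissant(dataset_id, timeout=30):
    """Return the Croissant JSON text for a dataset, or None if it cannot be retrieved."""
    url = CROISSANT_URL.format(dataset_id=dataset_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
        return None


def build_detail_rows(dataset_ids, run_date, fetch=fetch_croissant):
    """One row per dataset; `croissant` is None when no metadata was retrieved."""
    return [{"date": run_date, "dataset_id": d, "croissant": fetch(d)} for d in dataset_ids]


def coverage(rows):
    """Count datasets with and without Croissant metadata."""
    total = len(rows)
    with_meta = sum(1 for r in rows if r["croissant"] is not None)
    return {"total": total, "with_croissant": with_meta, "without_croissant": total - with_meta}
```

The `fetch` parameter is injectable so the row-building logic can be exercised without network access. Applied to the numbers above, the same count gives 255,539 of 310,292, roughly 82% coverage.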
Very nice! As we discussed, let's present this work in the next OTDI meeting for feedback and also think about ways we might use this information, e.g., expose the SQL interface in some "secure" way in the catalog page of the website.
It looks like there are about 685 datasets (out of 309,000+) that have a column called 'license', meaning the license is at the ROW level rather than at the DATASET level. I manually verified a few of them. Query below:

    select count(*)
    from hf_datasets_detail
    where
      date = cast('2025-02-21' as DATE) and
      regexp_like(croissant, '\"column\"\s*:\s*\"license\"')

Most of these datasets contain links to images or snippets of text (code, articles, etc). I'll check more into this later.
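As a cross-check on the regex approach, the same test can be done structurally by parsing the Croissant JSON and looking for fields whose source extracts a column named `license`. The sketch below assumes the `recordSet` / `field` / `source` / `extract` / `column` nesting from the Croissant spec; the sample document and the function names are made up for illustration.

```python
import json
import re

# Same pattern as the SQL query, written as a Python regex.
LICENSE_COLUMN_RE = re.compile(r'"column"\s*:\s*"license"')


def _as_list(value):
    """JSON-LD may hold a single object where a list is expected."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def fields_extracting_column(doc, column="license"):
    """Names of fields whose source extracts the given column."""
    hits = []
    for record_set in _as_list(doc.get("recordSet")):
        for field in _as_list(record_set.get("field")):
            source = field.get("source")
            extract = source.get("extract") if isinstance(source, dict) else None
            if isinstance(extract, dict) and extract.get("column") == column:
                hits.append(field.get("name") or field.get("@id"))
    return hits


# Hypothetical Croissant fragment (illustration only).
SAMPLE = {
    "recordSet": [{
        "@id": "default",
        "field": [
            {"name": "text", "source": {"extract": {"column": "text"}}},
            {"name": "license", "source": {"extract": {"column": "license"}}},
        ],
    }]
}

print(fields_extracting_column(SAMPLE))
print(bool(LICENSE_COLUMN_RE.search(json.dumps(SAMPLE))))
```

The structural version avoids false positives from a `"column": "license"` string appearing elsewhere in the metadata (for example in a description), at the cost of needing valid JSON for every row.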
Croissant (Google, MLCommons, and GitHub) is a metadata format developed by Google, HF, and others under the MLCommons umbrella. Its goal is to extend existing work, especially from schema.org, to define standardized, searchable metadata for datasets used in ML/AI scenarios. Unless we find reasons not to, we should use it for our metadata where feasible.
Croissant is used to support dataset search tools like this Google tool and this HF Croissant editor UI (github). Hence, we should do that, too.
More links: