Evaluate how we should use Croissant #92

deanwampler · 2025-01-28T18:49:28Z

Croissant (Google, MLCommons, and GitHub) is a metadata format developed by Google, HF, and others under the MLCommons umbrella. Its goal is to extend existing work, especially from schema.org, to define standardized, searchable metadata for datasets used in ML/AI scenarios. Unless we find reasons not to, we should use it for our metadata where feasible.

Croissant is used to support dataset search tools like this Google tool and this HF Croissant editor UI (github). Hence, we should support Croissant, too.

More links:

The HF URL to search HF for datasets using Croissant for metadata.
The MLCommons datasets (I'm not sure if this is intended as an example or how they really track all the ones they use.)
The MLCommons page again, where there are links to join the mailing list, meetings, etc. and links to the current standards: format and RAI (responsible AI).

blublinsky · 2025-02-10T10:58:07Z

The Croissant spec is here: https://docs.mlcommons.org/croissant/docs/croissant-spec.html.
The dataset-specific part is here: https://docs.mlcommons.org/croissant/docs/croissant-spec.html#dataset-level-information
It is remarkably close to the HF dataset card, which is no surprise, since HF is a large Croissant supporter. The two fields that are present in the Croissant spec but missing from the card are publisher and date.

The main advantage of Croissant is Resources, which describe individual files and their content. This is very useful, for example, to see whether individual documents have license info. It also has some additional goodies, including record sets, fields, data sets, formats, transforms, etc.
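To make the Resources and record-set idea concrete, here is a minimal Python sketch that reads a Croissant JSON-LD document and lists its resources and fields. The embedded document is a hypothetical, heavily trimmed example (no real `@context`, made-up names and URLs); only the key names `distribution`, `recordSet`, `field`, `cr:FileObject`, and `cr:FileSet` are taken from the Croissant spec.

```python
import json

# Hypothetical, trimmed Croissant document. A real one carries a full
# "@context" and many more properties; names and URLs here are made up.
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example_dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "repo",
     "contentUrl": "https://example.org/repo",
     "encodingFormat": "git+https"},
    {"@type": "cr:FileSet", "@id": "parquet-files",
     "containedIn": {"@id": "repo"},
     "encodingFormat": "application/x-parquet",
     "includes": "data/*.parquet"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "default", "field": [
      {"@type": "cr:Field", "@id": "default/text", "dataType": "sc:Text"},
      {"@type": "cr:Field", "@id": "default/license", "dataType": "sc:Text"}
    ]}
  ]
}
"""


def summarize(doc):
    """Collect dataset-level info, resources, and per-record-set field ids."""
    resources = [
        (res.get("@id"), res.get("@type"), res.get("encodingFormat"))
        for res in doc.get("distribution", [])
    ]
    fields = {
        rs.get("@id"): [f.get("@id") for f in rs.get("field", [])]
        for rs in doc.get("recordSet", [])
    }
    return {
        "name": doc.get("name"),
        "license": doc.get("license"),
        "resources": resources,
        "fields": fields,
    }


summary = summarize(json.loads(CROISSANT_JSON))
for res_id, res_type, fmt in summary["resources"]:
    print(f"resource {res_id}: {res_type} ({fmt})")
for rs_id, field_ids in summary["fields"].items():
    print(f"record set {rs_id}: {', '.join(field_ids)}")
```

The sketch treats the document as plain JSON rather than resolving JSON-LD contexts, which is enough for a quick inventory but would not be a substitute for a proper Croissant parser.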

This, of course, is not free: Croissant metadata can be quite large and requires constant maintenance from the dataset owner. We will need a larger conversation with dataset owners to figure out the pros and cons of adding this.

jolson-ibm · 2025-02-19T13:55:57Z

If we are going to take advantage of the Croissant spec, the immediate action would be to make sure any dataset that is listed in the OTDI catalog has a Croissant interface. Not sure if this should be a hard requirement, but it should be discussed. Datasets from seven data producers are currently listed as trusted data sets in the OTDI catalog:

BrightQuery
- This OTDI link points to the BrightQuery web site, not any data, so it will be difficult to implement a Croissant interface. Perhaps we can discuss with them what their Croissant policy is?
Common Crawl Foundation
- Same as BrightQuery above.
EPFL
- This links directly to the Guidelines dataset in HuggingFace, which has a Croissant interface here. This is only one of the four datasets mentioned in the GitHub README.md referenced on the OTDI website.
Meta
- Meta's Data For Good datasets are using The Humanitarian Data Exchange to house their data. That exchange does not seem to have a Croissant interface to its datasets. Perhaps we should open a dialogue with them, as they are hosting approximately 20k datasets.
PleIAs
- All of the listed PleIAs datasets are on HuggingFace; some of them have a Croissant interface and others do not. The most notable exceptions are PleIAs's "Open Culture" non-English datasets. I'll take a closer look at this tomorrow, but it should not be assumed that distributing your dataset through HuggingFace automatically gives you a Croissant interface.
ServiceNow
- Similar to PleIAs above, they have numerous datasets hosted on HuggingFace. A casual inspection showed all of them having a Croissant interface, but I will take a closer look tomorrow.
SemiKong
- SemiKong has a single dataset listed on the OTDI web site. It is housed with HuggingFace and does have a Croissant interface.

TL;DR:

Datasets stored on HuggingFace have a high probability of having a Croissant interface because HuggingFace Hub supports the spec. However, being housed on HuggingFace does not automatically mean the Croissant metadata is accessible, as shown by the PleIAs "Open Culture" datasets.
The only other mass distribution channel currently used by datasets on the OTDI page is The Humanitarian Data Exchange, which does not seem to support Croissant.
The producer websites linked from the OTDI page in lieu of links to actual datasets are inconclusive: it is not clear whether the datasets stored there have a Croissant interface.
Path forward: Either the distribution platforms will have to support a Croissant interface (as HuggingFace does), or a self-hosting data producer will have to provide one itself. Either way, the goal is a common, consistent interface across all OTDI datasets that OTDI data consumers can consume easily and safely.
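Since hosting on HuggingFace does not guarantee Croissant metadata, a per-dataset check is useful. The following is a minimal Python sketch, assuming the Hub's documented `/api/datasets/{repo_id}/croissant` endpoint; the helper names, the injectable `fetch` parameter, and the "has an `@context` key" heuristic are illustrative assumptions, and gated or private datasets would additionally need an auth token that is not handled here.

```python
import json
import urllib.error
import urllib.request

# Endpoint pattern documented by HuggingFace for Croissant metadata.
HF_CROISSANT_URL = "https://huggingface.co/api/datasets/{repo_id}/croissant"


def http_get(url, timeout=30):
    """Return (status, body). HTTP errors come back as status codes, not exceptions."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, ""


def get_croissant(repo_id, fetch=http_get):
    """Return the Croissant document for repo_id as a dict, or None if unavailable.

    `fetch` is injectable so the logic can be exercised without network access.
    """
    status, body = fetch(HF_CROISSANT_URL.format(repo_id=repo_id))
    if status != 200:
        return None
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    # Heuristic (an assumption): a usable JSON-LD document is an object with "@context".
    if isinstance(doc, dict) and "@context" in doc:
        return doc
    return None
```

Calling `get_croissant("owner/dataset")` with the default fetcher performs a live request; passing a stub for `fetch` keeps the check testable offline.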

jolson-ibm · 2025-02-20T19:14:16Z

Related to #91 - HuggingFace dataset analytics.

There are two daily jobs currently deployed on AWS ECS to query HuggingFace Hub. The first job gets a list of all the datasets; the second retrieves the Croissant data for each. Results are written to Parquet files in S3, where serverless SQL can then be used to query them:

The SQL:

jolson-ibm · 2025-02-20T19:18:51Z

Using the above, as of 2025-Feb-20, there are a total of 310,292 datasets in HuggingFace Hub: 255,539 have Croissant metadata and 54,753 do not.
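The shape of the two jobs and the resulting tally can be sketched in Python as follows. This is not the deployed code: the function names, the `id` column, and the stand-in data are hypothetical (only the `date` and `croissant` column names appear in this thread), and the Parquet/S3 write is left out to keep the sketch dependency-free.

```python
from datetime import date


def list_job(list_ids):
    """Job 1: return every dataset id known to the hub (sorted for stable output)."""
    return sorted(list_ids())


def detail_job(dataset_ids, fetch_croissant, run_date):
    """Job 2: fetch Croissant metadata per dataset and build one row per dataset."""
    rows = []
    for dataset_id in dataset_ids:
        rows.append({
            "date": run_date.isoformat(),
            "id": dataset_id,                          # hypothetical column name
            "croissant": fetch_croissant(dataset_id),  # None when unavailable
        })
    return rows


def tally(rows):
    """Count datasets with and without Croissant metadata."""
    with_meta = sum(1 for row in rows if row["croissant"] is not None)
    return {
        "total": len(rows),
        "with_croissant": with_meta,
        "without_croissant": len(rows) - with_meta,
    }


# Stand-in for the hub: dataset id -> Croissant JSON string, or None if missing.
FAKE_HUB = {"org/a": '{"name": "a"}', "org/b": None, "org/c": '{"name": "c"}'}

ids = list_job(lambda: FAKE_HUB.keys())
rows = detail_job(ids, FAKE_HUB.get, date(2025, 2, 20))
counts = tally(rows)
print(counts)
```

Keeping the listing and fetching functions injectable means the same row-building and tally logic can run against the real hub or against canned data.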

deanwampler · 2025-02-20T20:49:40Z

Very nice! As we discussed, let's present this work in the next OTDI meeting for feedback and also think about ways we might use this information, e.g., expose the SQL interface in some "secure" way in the catalog page of the website.

jolson-ibm · 2025-02-21T14:06:30Z

It looks like there are about 685 datasets (out of 309,000+) that have a column called 'license', meaning the license is specified at the ROW level rather than at the DATASET level. I manually verified a few of them. Query below:

select count(*)
from hf_datasets_detail 
where 
	date = cast('2025-02-21' as DATE) and 
	regexp_like(croissant, '\"column\"\s*:\s*\"license\"')

Most of these datasets contain links to images or snippets of text (code, articles, etc.).

I'll check more into this later.
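For spot-checking individual datasets outside SQL, the same test can be sketched in Python. The first function mirrors the regular expression from the query above; the second is an alternative (an assumption, not what the query does) that parses the JSON and looks for any object whose `column` key equals `license`. The sample documents are made up, loosely following the `source`/`extract`/`column` nesting that HuggingFace-generated Croissant uses.

```python
import json
import re

# Same pattern as the SQL regexp_like call, written as a Python raw string.
LICENSE_COLUMN = re.compile(r'"column"\s*:\s*"license"')


def has_license_column_text(croissant: str) -> bool:
    """Text-level check: regex over the raw Croissant JSON string."""
    return LICENSE_COLUMN.search(croissant) is not None


def has_license_column_parsed(croissant: str) -> bool:
    """Structural check: any JSON object with "column" equal to "license"."""
    def walk(node):
        if isinstance(node, dict):
            if node.get("column") == "license":
                return True
            return any(walk(value) for value in node.values())
        if isinstance(node, list):
            return any(walk(value) for value in node)
        return False

    return walk(json.loads(croissant))


# Hypothetical samples: one with a row-level license column, one without.
ROW_LEVEL = json.dumps({"recordSet": [{"field": [
    {"@id": "default/license", "source": {"extract": {"column": "license"}}},
]}]})
DATASET_LEVEL = json.dumps({"license": "mit", "recordSet": [{"field": [
    {"@id": "default/text", "source": {"extract": {"column": "text"}}},
]}]})

for label, doc in (("row-level", ROW_LEVEL), ("dataset-level", DATASET_LEVEL)):
    print(label, has_license_column_text(doc), has_license_column_parsed(doc))
```

The parsed variant is slower but does not depend on how the JSON happens to be serialized, which may matter if key/value spacing or escaping varies between sources.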

deanwampler added this to FA5: OTDI Tasks Jan 28, 2025

deanwampler moved this to Todo in FA5: OTDI Tasks Jan 28, 2025

blublinsky self-assigned this Feb 10, 2025

deanwampler assigned jolson-ibm Feb 18, 2025

deanwampler moved this from Todo to In Progress in FA5: OTDI Tasks Feb 21, 2025
