Evaluate how we should use Croissant #92

Open
deanwampler opened this issue Jan 28, 2025 · 6 comments

@deanwampler
Contributor

deanwampler commented Jan 28, 2025

Croissant (Google, MLCommons, and GitHub) is a metadata format developed by Google, Hugging Face (HF), and others under the MLCommons umbrella. Its goal is to extend existing work, especially from schema.org, to define standardized, searchable metadata for datasets used in ML/AI scenarios. Unless we find reasons not to, we should use it for our metadata where feasible.

Croissant is used to support dataset search tools like this Google tool and this HF Croissant editor UI (GitHub). Hence, we should support it, too.

More links:

  • The HF URL to search HF for datasets using Croissant for metadata.
  • The MLCommons datasets (I'm not sure if this is intended as an example or how they really track all the ones they use.)
  • The MLCommons page again, where there are links to join the mailing list, meetings, etc. and links to the current standards: format and RAI (responsible AI).
@blublinsky
Contributor

The Croissant spec is here: https://docs.mlcommons.org/croissant/docs/croissant-spec.html.
The dataset-specific part is here: https://docs.mlcommons.org/croissant/docs/croissant-spec.html#dataset-level-information
It is remarkably close to the HF dataset card, which is no surprise, since HF is a large Croissant supporter. The two fields that are present in the Croissant spec but missing from the card are publisher and date.

The main advantage of Croissant is Resources, which describe individual files and their content. This is very convenient, for example, for checking whether individual documents have license info. It also has some additional goodies, including Record sets, fields, Data sets, Format, transform, etc.
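
To make the Resources idea concrete, here is a minimal sketch in Python. The JSON fragment is hand-written for illustration and is not taken from a real dataset; the key names (`distribution`, `recordSet`, `field`, `source`, `extract`, `column`) follow my reading of the Croissant spec and should be checked against it. The helper names `list_resources` and `list_columns` are hypothetical.

```python
import json

# Hypothetical, hand-written Croissant-style fragment (illustration only).
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "docs.csv", "name": "docs.csv",
     "contentUrl": "data/docs.csv", "encodingFormat": "text/csv"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "docs",
     "field": [
       {"@type": "cr:Field", "@id": "docs/text", "name": "text",
        "source": {"fileObject": {"@id": "docs.csv"},
                   "extract": {"column": "text"}}},
       {"@type": "cr:Field", "@id": "docs/license", "name": "license",
        "source": {"fileObject": {"@id": "docs.csv"},
                   "extract": {"column": "license"}}}
     ]}
  ]
}
"""


def list_resources(meta):
    """Return (id, encoding format) for every file-level resource."""
    return [(r.get("@id"), r.get("encodingFormat"))
            for r in meta.get("distribution", [])]


def list_columns(meta):
    """Return (record set id, source column) for every field that maps to a column."""
    columns = []
    for record_set in meta.get("recordSet", []):
        for field in record_set.get("field", []):
            column = field.get("source", {}).get("extract", {}).get("column")
            if column is not None:
                columns.append((record_set.get("@id"), column))
    return columns


meta = json.loads(CROISSANT_JSON)
print(list_resources(meta))
print(list_columns(meta))
```

In this made-up example the dataset carries a dataset-level license and also a per-row `license` column, which is the kind of detail the file and field descriptions make visible.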

This, of course, is not free: Croissant metadata can be quite large and requires constant maintenance by the dataset owner. We will need a larger conversation with dataset owners to figure out the pros and cons of adding this.

@jolson-ibm

jolson-ibm commented Feb 19, 2025

If we are going to take advantage of the Croissant spec, the immediate action would be to make sure any dataset that is listed in the OTDI catalog has a Croissant interface. Not sure if this should be a hard requirement, but it should be discussed. Datasets from seven data producers are currently listed as trusted data sets in the OTDI catalog:

  1. BrightQuery
    • This OTDI link points to the BrightQuery web site, not any data, so it will be difficult to implement a Croissant interface. Perhaps we can discuss with them what their Croissant policy is?
  2. Common Crawl Foundation
    • Same as BrightQuery above.
  3. EPFL
    • This links directly to the Guidelines dataset in HuggingFace, which has a Croissant interface here. This is only one of the four datasets listed in the GitHub README.md mentioned on the OTDI website.
  4. Meta
    • Meta's Data For Good datasets are using The Humanitarian Data Exchange to house their data. That exchange does not seem to have a Croissant interface to its datasets. Perhaps we should open a dialogue with them, as they are hosting approximately 20k datasets.
  5. PleIAs
    • All of the listed PleIAs datasets are in HuggingFace. Some of them have a Croissant interface and others do not. Most notable of the nots: PleIAs's "Open Culture" non-English datasets. I'll take a closer look at this tomorrow, but it should not be assumed that using HuggingFace to distribute your dataset automatically gets you a Croissant interface.
  6. ServiceNow
    • Similar to PleIAs above, they have numerous datasets hosted on HuggingFace. A casual inspection showed all of them having a Croissant interface, but I will take a closer look tomorrow.
  7. SemiKong
    • SemiKong has a single dataset listed on the OTDI web site. It is housed with HuggingFace and does have a Croissant interface.

TL;DR:

  1. Datasets stored on HuggingFace have a high probability of having a Croissant interface due to HuggingFace Hub supporting the spec. Being housed on HuggingFace does not automatically mean you can access Croissant metadata, as shown by the PleIAs "Open Culture" datasets.

  2. The only other mass distribution channel currently used by datasets on the OTDI page is The Humanitarian Data Exchange which does not seem to support Croissant.

  3. The producer websites linked from the OTDI page in lieu of links to actual datasets are inconclusive. It is not clear whether there is a Croissant interface to the datasets stored there.

  4. Path forward: Either the distribution platforms will have to support a Croissant interface (like HuggingFace), or a data producer who is self-hosting will have to provide a self-hosted one. The goal is a common, consistent interface across all OTDI datasets that OTDI data consumers can consume easily and safely.
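
A rough sketch of how a consumer might probe a single HuggingFace dataset for Croissant metadata, using only the Python standard library. It assumes the Hub's `/api/datasets/<repo_id>/croissant` endpoint and assumes that any HTTP error or non-JSON body means "no Croissant metadata available"; both assumptions should be verified against the HuggingFace documentation. The function names and the `opener` parameter are hypothetical.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import quote


def croissant_url(repo_id):
    """Build the (assumed) Hub endpoint that serves Croissant JSON-LD."""
    return f"https://huggingface.co/api/datasets/{quote(repo_id, safe='/')}/croissant"


def fetch_croissant(repo_id, opener=urllib.request.urlopen):
    """Return the Croissant metadata as a dict, or None if it cannot be fetched.

    `opener` is injectable so the function can be exercised without network access.
    """
    try:
        with opener(croissant_url(repo_id), timeout=30) as response:
            return json.load(response)
    except (urllib.error.URLError, json.JSONDecodeError):
        # HTTPError is a subclass of URLError, so 4xx/5xx responses land here too.
        return None
```

Calling `fetch_croissant("<owner>/<dataset>") is not None` for each catalog entry would give a first pass at which OTDI-listed datasets expose the interface. Self-hosted producers would need an equivalent endpoint, which this sketch does not cover.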

@jolson-ibm

Related to #91 - HuggingFace dataset analytics.

There are two daily jobs currently deployed on AWS ECS to query HuggingFace Hub. The first job gets a list of all the datasets; the second retrieves the Croissant data for each. Results are written to Parquet files in S3, where serverless SQL can then be used to query them:

Image

The SQL:

Image
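
Since the screenshots are not reproduced here, the following is only a guess at the shape of the second job, not the deployed code. It assumes each daily run produces one row per dataset with a `date` and a `croissant` column (those two names appear in the query later in this thread; the `id` column and the function name are hypothetical). The lister and fetcher are passed in, and writing Parquet to S3 is left out.

```python
import json
from datetime import date


def build_detail_rows(dataset_ids, fetch_croissant, run_date):
    """Build one row per dataset for a daily snapshot.

    dataset_ids     -- iterable of dataset identifiers (output of the first job)
    fetch_croissant -- callable returning a dict, or None when no metadata exists
    run_date        -- datetime.date of the run, used as the partition value
    """
    rows = []
    for dataset_id in dataset_ids:
        meta = fetch_croissant(dataset_id)
        rows.append({
            "date": run_date.isoformat(),
            "id": dataset_id,
            # Store the raw JSON text so SQL regex functions can search it.
            "croissant": json.dumps(meta) if meta is not None else None,
        })
    return rows
```

Keeping the Croissant document as a JSON string, rather than flattening it, is what would let a query run regular expressions over the whole document.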

@jolson-ibm

Using the above, as of 2025-Feb-20, there are a total of 310,292 datasets in HuggingFace Hub: 255,539 have Croissant metadata and 54,753 do not.
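
As a quick sanity check on those counts, a few lines of Python confirm that they add up and give the coverage as a fraction. The `coverage` helper is just for illustration.

```python
def coverage(with_metadata, total):
    """Fraction of datasets that have Croissant metadata."""
    return with_metadata / total


total = 310_292
with_croissant = 255_539
without_croissant = 54_753

# The two groups should account for every dataset in the snapshot.
assert with_croissant + without_croissant == total

print(f"with Croissant:    {coverage(with_croissant, total):.1%}")     # 82.4%
print(f"without Croissant: {coverage(without_croissant, total):.1%}")  # 17.6%
```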

@deanwampler
Contributor Author

Very nice! As we discussed, let's present this work in the next OTDI meeting for feedback and also think about ways we might use this information, e.g., expose the SQL interface in some "secure" way in the catalog page of the website.

@jolson-ibm

jolson-ibm commented Feb 21, 2025

It looks like there are about 685 datasets (out of 309,000+) that have a column called 'license', meaning the license is at the ROW level rather than at the DATASET level. I manually verified a few of them. Query below:

select count(*)
from hf_datasets_detail 
where 
	date = cast('2025-02-21' as DATE) and 
	regexp_like(croissant, '\"column\"\s*:\s*\"license\"')

Most of these datasets contain links to images or snippets of text (code, articles, etc.).

I'll look into this more later.
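
For spot-checking individual Croissant documents outside of SQL, the same pattern can be applied with Python's `re` module. This is a sketch that mirrors the regular expression in the query above; the function name is hypothetical.

```python
import re

# Same pattern as the SQL query: a source column named exactly "license".
LICENSE_COLUMN = re.compile(r'"column"\s*:\s*"license"')


def has_license_column(croissant_json):
    """True if the raw Croissant JSON text references a column named "license"."""
    return LICENSE_COLUMN.search(croissant_json) is not None


print(has_license_column('{"extract": {"column": "license"}}'))  # True
print(has_license_column('{"license": "mit", "extract": {"column": "text"}}'))  # False
```

Note that the match is exact and case-sensitive, so columns named "License", "licence", or "license_name" would not be counted by either version.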

@deanwampler deanwampler moved this from Todo to In Progress in FA5: OTDI Tasks Feb 21, 2025