"Semi-automate" periodic gathering of HF dataset statistics #97

Open
deanwampler opened this issue Feb 5, 2025 · 2 comments

@deanwampler
Contributor

Let's start by running Joe's pipeline for gathering the stats once a month. At such a low frequency, a calendar reminder to run it is sufficient; we don't need anything more automated (unless that's easy to do).

@deanwampler deanwampler added the data pipelines Defining and implementing data processing pipelines label Feb 5, 2025
@deanwampler deanwampler added this to the 2025-02-14 milestone Feb 5, 2025
@deanwampler deanwampler moved this to Todo in FA5: OTDI Tasks Feb 5, 2025
@blublinsky
Contributor

Can we define the meaning of the word "statistics" better? Is this just license info? Something else?
The complete definition for dataset cards has a lot of fields with potential info, but in reality the majority of cards have just language, license, and pretty_name. Some additionally have configs, size_categories, and task_categories.
In general, a data card is YAML embedded in readme.md. It is loaded as a dictionary, and instead of a predefined set of fields it has whatever fields the YAML defines. So just getting a list of the fields in each card is an interesting statistic.
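
To make the "list of fields" idea concrete, here is a minimal sketch, not Joe's pipeline. It assumes the card metadata is the `---`-delimited YAML block at the top of the readme and only picks out top-level keys with a regex rather than fully parsing the YAML. The function names and the sample cards are hypothetical; a real run would more likely use a proper YAML parser or the card loader in `huggingface_hub`.

```python
import re
from collections import Counter

# A top-level YAML key starts at column 0 (no indent, no leading "- ").
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def front_matter_keys(readme_text: str) -> list[str]:
    """Return the top-level keys of the YAML block at the top of a readme."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no card metadata at all
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # reached the end of the YAML block
        match = _TOP_LEVEL_KEY.match(line)
        if match:
            keys.append(match.group(1))
    return []  # block never closed; treat as missing metadata


def field_counts(readmes) -> Counter:
    """Count, for each field name, how many cards define it."""
    counts = Counter()
    for text in readmes:
        counts.update(set(front_matter_keys(text)))
    return counts


# Hypothetical sample cards, for illustration only.
SAMPLE_A = """---
language:
- en
license: apache-2.0
pretty_name: Example A
---
# Example A
"""

SAMPLE_B = """---
license: mit
size_categories:
- 1K<n<10K
---
"""

counts = field_counts([SAMPLE_A, SAMPLE_B])
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Sorting the counts by frequency would show directly which fields are near-universal (license, language) and which are rare, which could help decide what the "statistics" should cover.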

@deanwampler
Contributor Author

deanwampler commented Feb 12, 2025

We'll start with the data Joe gathered in his experiment, then decide how to expand the set.

As for timing, this task is gated by setting up the AWS environment, #104.

@deanwampler deanwampler modified the milestones: 2025-02-14, 2024-11-30 Feb 12, 2025
Projects
Status: Todo
Development

No branches or pull requests

3 participants