Profiling individuals (and machines) behind datasets #828

JoanGi · 2025-03-10T17:40:51Z

This issue is to provide a way to describe information about the humans who contribute to a dataset and their profile characteristics. This is an early proposal of the Croissant Responsible AI subgroup, and our main goal is to gather insights from the community.

Motivation:
Several responsible AI documentation frameworks are stressing the need to profile those involved in the different processes of the data creation pipeline. The subjectivity added by the involved in the creation process can introduce biases and affect the data's suitability for specific use cases (Ghal et al, Miceli et al. and Hube et al.). For instance, a dataset of everyday household objects could be used to train an ML classifier in the object detection task; however, due to cultural shifts, household objects vary across regions, and the demographics of the annotators are relevant to ensure that the data properly fits with the user-specific use case. Another case is in language data where the profile of the speakers (those who generated the language data points) is relevant to assess the diversity of your data (Bender et al). For instance, older adults tend to use different forms of language than younger ones, and, training a chatbot intended for older people with a dataset with a lot of slang could lead to unintended behavior. Another example, are in healthcare dataset, where the profile of those represented by the data are key to match the inclusion/exclusion criteria for its reusability in new studies and to avoid ML flaws (Roberts et al.). So, knowing the profile of those represented by the data is also relevant to evaluate its suitability.

Problem statement:
Despite the relevance of the profile of those involved in the data creation (annotation, collection, represented by) to evaluate the suitability of the data, searching for a dataset by these specific profiles is a query still difficult to solve.
Current practices of sharing profile information:

Proposed mechanism:

Uses the data-level annotation proposal. So, we can describe the profiles behind a single attribute (cr:Field), or a group of attributes (cr:RecordSet).

If someone reuse a part of a dataset they should maintain the profile information. (in line with DDI-CDIF)

Adds a specific semantic to Croissant to disclose the kind of profile described: annotatedBy, collectedBy, population, machineAnnotator, machineGenerated
Can identify synthetic data (machine generators) and augmented data (machine annotators).

Examples 1: Describing the annotators

DICES dataset discloses the profile of the annotators inside the data schema with a set of fields. We created a RecordSet inside the “annotation” to indicate that these attributes are describing the labels, and particularly those who annotated the labels.

{
"@type": "cr:RecordSet", "@id": "conversations",
  “field”: [
    { "@type": "cr:Field", "@id": "degree_of_harm","dataType": ["cr:Label"], ...}.
    { "@type": "cr:Field", "@id": "harm_type", ... },
    ….
  ],
  "annotation": [{
    "@type": "cr:RecordSet", 
    "@id": "conversation/annotatedBy", 
    “field”: [
      { "@type": "cr:Field", 
        "@id": "conversation/annotatedBy/rater_gender", 
        "dataType": ["sc:Text"],
        "equivalentProperty" : "cr:annotatedBy" },
      { "@type": "cr:Field", "@id": "conversation/annotatedBy/race", ...},
      ...
    ]},
}

Example 2: Identifying machine-annotated data
Dataset augmented with machine-generated data is every time more frequent. The following example allows to indicate at a granular level which data points are generated by humans or by machines.

{
"@type": "cr:RecordSet", "@id": "conversations",
  “field”: [...],
  "annotation": [
  {
    "@type": "cr:RecordSet", 
    "@id": "conversation/annotatedBy", 
    “field”: [
      { "@type": "cr: Field", 
        "@id": "conversation/machineAnnotator",, 
        "dataType": ["sc:Text"],
        "equivalentProperty": ["cr:MachineAnnotator"]
        "value" : {"@id": "GPT-4o-annotator"}, 
      }
   }
}

Example 3: Describing the represented cohort of people
In a dataset of lung images with a diagnosis label of pneumonia, knowing if the individuals represented are kids or adults could be useful for future reuse of the dataset (or a specific subset) and to avoid ML flaws (Roberts et al.). While this information could be found usually in natural text documentation (a.k.a Data Cards, data papers, other technical documents…), representing it inside a structured from in Croissant may aid in the dataset discovery and reusability.

The following sets a constant value that indicates that the dataset is composed of adult people.

{
"@type": "cr:RecordSet", "@id": "images",
  “field”: [
    { "@type": "cr:Field", "@id": "images/image", ... },
    { "@type": "cr:Field", "@id": "images/diagnosis", ... }
  ],
  "annotation": [{
    "@type": "cr:Field", 
     "@id": "images/population", 
     "dataType": ["sc:Text"],
     "equivalentProperty" : "cr:Population"
     "value" : { "minAge": 18, "maxAge": 65 }  
    },
}

Comments:

One of the most common cases would be that the profile information refers to the whole dataset. Should then be allowed to create some dataset-level fields, such as annotatedBy, CollectedBy, Population?
Should we define the cr:annotatedBy, cr:collectedBy, cr:machineAnnotator, cr:machineGenerated, cr:Population? (I've have been working recently in a similar topic to document software teams profile, and we came up with a minimal set of attributes that be defined to profile a team (https://www.arxiv.org/pdf/2503.05470))

The text was updated successfully, but these errors were encountered:

JoanGi added the enhancement label Mar 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profiling individuals (and machines) behind datasets #828

Profiling individuals (and machines) behind datasets #828

JoanGi commented Mar 10, 2025