jonathan-hartman/csvw-workshop

License: CC0-1.0

Workshop – (F.A.)I.R. tabular data for Python's pandas library

Among researchers, CSV is a common format for publishing or archiving tabular research data. Although it is widely known, open, and readable without specific software, parsing CSV-formatted data is error-prone. The format is only weakly formalized, which has led to a proliferation of formatting practices; the choice of the separating character (,, ;, \t, etc.) is just the tip of the iceberg. Additionally, the format offers no standardized way to incorporate metadata, including vital information about the table's columns such as descriptions, units, and data formats. The W3C recommendation CSV on the Web (CSVW) overcomes these limitations by introducing a method to produce JSON documents that accompany CSV files and contain this missing information, thus rendering CSV files interoperable and reusable (according to the FAIR principles).

The workshop participants will implement functions that import and export CSV/CSVW data pairs for DataFrame objects of Python's pandas library. To solve this task, they will be made aware of the shortcomings of the CSV format and become acquainted with a subset of the CSVW standard.

Task

Implement two Python functions to import and export CSV/CSVW pairs into and from pandas.DataFrame objects. Suggested function bodies are defined in python/read_csv_metadata.py and python/write_csv_metadata.py. Example data can be found in the data folder.

As a first step, the functions should be able to handle the corresponding example data (use Dublin Core for descriptive metadata, e.g. creator, description, license).
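As a starting point for the export direction, the following is a minimal sketch of how a CSVW metadata document with Dublin Core terms could be assembled. The function name build_csvw_metadata, its parameters, and the example values are hypothetical and not taken from the repository's stubs; in the workshop, the column list would presumably be derived from the DataFrame.

```python
import json


def build_csvw_metadata(csv_name, columns, creator, description, license_url):
    """Return a CSVW metadata document (as a dict) describing one CSV file.

    `columns` is a list of (name, datatype) pairs; this signature is an
    assumption made for illustration, not the repository's interface.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_name,
        # Dublin Core terms, using the "dc" prefix of the CSVW context.
        "dc:creator": creator,
        "dc:description": description,
        "dc:license": {"@id": license_url},
        "tableSchema": {
            "columns": [
                {"name": name, "titles": name, "datatype": datatype}
                for name, datatype in columns
            ]
        },
    }


# Hypothetical example values.
metadata = build_csvw_metadata(
    csv_name="measurements.csv",
    columns=[("timestamp", "dateTime"), ("temperature", "number")],
    creator="Jane Doe",
    description="Hypothetical temperature measurements",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
)
metadata_json = json.dumps(metadata, indent=2)
```

The resulting string can be written next to the CSV file; reading works in the opposite direction by loading the JSON and using the column descriptions to configure the import.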

Materials

CSV

Problems of the CSV format

The format properties of a CSV file are referred to as its dialect.
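To illustrate why the dialect matters, here is a small sketch using only Python's standard csv module. The sample text is made up; it uses a semicolon as the separator and a comma as the decimal mark, a combination that is common in some locales.

```python
import csv
import io

# Made-up sample: semicolon-separated, decimal comma.
raw = "name;value\nA;1,5\n"


def parse(text, **dialect):
    """Parse CSV text into a list of rows using the given dialect options."""
    return list(csv.reader(io.StringIO(text, newline=""), **dialect))


# Parsed with the default dialect (comma-separated): the columns are wrong.
rows_default = parse(raw)
# → [['name;value'], ['A;1', '5']]

# Parsed with the matching delimiter: two columns, as intended.
rows_semicolon = parse(raw, delimiter=";")
# → [['name', 'value'], ['A', '1,5']]
```

Neither call raises an error, so a wrong dialect assumption can go unnoticed. Even with the correct delimiter, the value 1,5 is still a string that needs locale-aware conversion.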

General Recommendations

  • RFC 4180 comes closest to a format standard; use it

  • making use of the header row is strongly recommended, since it's the only way to incorporate meaning into the file.

  • use the XML Schema definitions to format temporal data, e.g.: https://www.w3.org/TR/xmlschema-2/#dateTime

  • delimit decimals with a dot (.)
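The recommendations above can be sketched with Python's standard library as follows. The column names and values are invented for illustration; the csv module's default dialect uses commas and CRLF line endings, which matches RFC 4180.

```python
import csv
import io
from datetime import datetime

# Invented example rows: a timestamp and a measured value.
rows = [
    (datetime(2024, 3, 15, 12, 30, 0), 1.5),
    (datetime(2024, 3, 15, 12, 45, 0), 2.25),
]

buffer = io.StringIO(newline="")  # keep the writer's line endings untouched
writer = csv.writer(buffer)       # default dialect: comma-separated, CRLF
writer.writerow(["timestamp", "temperature"])  # header row
for timestamp, temperature in rows:
    # isoformat() yields e.g. 2024-03-15T12:30:00, a valid XML Schema dateTime;
    # floats are written with a dot as the decimal separator.
    writer.writerow([timestamp.isoformat(), temperature])

csv_text = buffer.getvalue()
```

When writing to a real file instead of a buffer, opening it with newline="" avoids platform-dependent line-ending translation.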

CSVW

CSVW defines a schema to describe CSV files with an accompanying JSON-LD file (naming scheme: <csv-filename>.csv-metadata.json).
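A small helper can derive the metadata file name from the CSV path. This sketch assumes that <csv-filename> in the naming scheme refers to the file name without its extension, so that data.csv is paired with data.csv-metadata.json; the function name metadata_path_for is hypothetical.

```python
from pathlib import Path


def metadata_path_for(csv_path):
    """Return the path of the metadata file that accompanies `csv_path`.

    Assumes the scheme "<name>.csv" -> "<name>.csv-metadata.json".
    """
    csv_path = Path(csv_path)
    return csv_path.with_name(csv_path.name + "-metadata.json")


# Hypothetical file name.
example = metadata_path_for("data/measurements.csv")
# example.name is "measurements.csv-metadata.json"
```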

CSVW is designed not only to describe a CSV data file, but also to validate its contents and to allow the transformation of the data into linked data formats. Additionally, CSVW allows describing multiple CSVs with one CSVW file, including the relations between the described files by means of primary and foreign keys.

This workshop focuses on the subset of CSVW that describes the dialect, general metadata, and the table's columns in detail. Validation and transformation are out of scope, and the option to describe multiple CSVs in one CSVW file is disregarded as well.
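For the import direction, one possible approach is to translate the properties of a CSVW dialect object into keyword arguments for pandas.read_csv. The sketch below covers only a handful of properties and maps only those that are actually present, so it makes no claims about defaults. The function name dialect_to_read_csv_kwargs is hypothetical, and pandas itself is not imported here.

```python
def dialect_to_read_csv_kwargs(dialect):
    """Translate a subset of CSVW dialect properties into read_csv keywords."""
    # CSVW property name -> pandas.read_csv keyword with the same meaning.
    direct = {
        "delimiter": "sep",
        "quoteChar": "quotechar",
        "doubleQuote": "doublequote",
        "encoding": "encoding",
        "skipRows": "skiprows",
        "skipInitialSpace": "skipinitialspace",
    }
    kwargs = {
        pandas_name: dialect[csvw_name]
        for csvw_name, pandas_name in direct.items()
        if csvw_name in dialect
    }
    if "headerRowCount" in dialect:
        count = dialect["headerRowCount"]
        if count == 0:
            kwargs["header"] = None               # no header row
        elif count == 1:
            kwargs["header"] = 0                  # first row after skipped rows
        else:
            kwargs["header"] = list(range(count)) # multi-row header
    return kwargs


# Hypothetical dialect object as it might appear in a metadata file.
example_dialect = {"delimiter": ";", "encoding": "utf-8", "headerRowCount": 1, "skipRows": 2}
read_csv_kwargs = dialect_to_read_csv_kwargs(example_dialect)
# → {'sep': ';', 'encoding': 'utf-8', 'skiprows': 2, 'header': 0}
```

The result could then be passed on as pandas.read_csv(path, **read_csv_kwargs). Properties without a direct pandas counterpart would need separate handling.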

CSVW-Subset: CSVW-Full: https://www.w3.org/TR/tabular-metadata/#metadata-format

Problems with CSVW

The dialect object cannot handle footer rows.
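Because the dialect has no property for footer rows, their number has to be communicated by other means. The following sketch assumes the count is known out of band and simply drops the trailing rows after parsing; the function name read_without_footer and the sample text are hypothetical.

```python
import csv
import io


def read_without_footer(text, footer_rows, **dialect):
    """Parse CSV text and drop the last `footer_rows` rows.

    `footer_rows` is not part of the CSVW dialect; it is an assumption
    that this number is supplied separately.
    """
    rows = list(csv.reader(io.StringIO(text, newline=""), **dialect))
    return rows[: len(rows) - footer_rows]


# Made-up sample with one footer line.
sample = "a,b\n1,2\n3,4\nSource: hypothetical station\n"
data_rows = read_without_footer(sample, footer_rows=1)
# → [['a', 'b'], ['1', '2'], ['3', '4']]
```

When reading with pandas, the skipfooter parameter of pandas.read_csv serves a similar purpose.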

Related Projects/Tools (non-exhaustive)

License

Excluding the data folder, all contents of this repository are licensed under the Creative Commons Zero v1.0 Universal License. See https://creativecommons.org/publicdomain/zero/1.0/ for more information.
