support workflow to deposit corpora into BiodiversityPMC #310

Open
jhpoelen opened this issue Dec 2, 2024 · 7 comments

Comments

@jhpoelen
Member

jhpoelen commented Dec 2, 2024

@myrmoteras said

We are moving all the deposits in
Taxodros https://zenodo.org/communities/taxodros/records?q=&l=list&p=1&s=10&sort=newest ,
BHL https://zenodo.org/communities/bhl-blr/records?q=&l=list&p=1&s=10&sort=newest
Batlit https://zenodo.org/communities/batlit/records?q=&l=list&p=1&s=10&sort=newest

to BiodiversityPMC, to make them searchable and to annotate them with the intent, what can be done, and where the limitations of this shortcut are.

For this we need the files and the metadata, that is included in the deposits. Can you and Julien please find a temporary solution to get the metadata for the current records, and then in a second step how we can get new uploads to Zenodo transferred to bPMC?

The underlying (literature) datasets for the bhl-blr, taxodros and batlit Zenodo communities are available as versioned and signed data packages. So, the deposits in the Zenodo communities are a reflection of the versioned literature datasets (corpora): the originals are found elsewhere. Also, these deposits are not version controlled, because the metadata can be edited without leaving much of a trace.

This is why I propose to implement the following workflow, similar to already existing workflows (e.g., taxodros -> Zenodo, batlit -> Zenodo), where the taxodros, batlit, and bhl-blr corpora are translated into BiodiversityPMC speak. This way, we have a controlled workflow (i.e., one working from versioned sources) that we can run without having to worry about Web API rate limiting or other constraints (e.g., search results restricted to 10k).

Happy to discuss more, please feel free to schedule a meeting.

Curious to hear your thoughts!

image

@jhpoelen
Member Author

jhpoelen commented Dec 3, 2024

An example of a BiodiversityPMC query provided by Julien G. on 2024-12-03 -

https://biodiversitypmc.sibils.org/?query=Halacarid%20mites&tab=plazi#results-section

with attached screenshot taken on 2024-12-03

image

Still looking for an example that shows the (meta)data associated with these records in the search results.

@jhpoelen
Member Author

jhpoelen commented Dec 3, 2024

Please see the included JSON export of the search results associated with the example https://biodiversitypmc.sibils.org/?query=Halacarid%20mites&tab=plazi#results-section .

sibils_2024-12-03-10h30.json

with first 100 lines being:

{
  "medline": [
    {
      "_index": "sibils_med24_v4.1.5.4",
      "_id": "30314157",
      "_score": 39.582214,
      "_ignored": [
        "abstract.keyword",
        "annotations_str.keyword"
      ],
      "_source": {
        "title": "A checklist of epibiont suctorian and peritrich ciliates (Ciliophora) on halacarid and hydrachnid mites (Acari: Halacaridae Hydrachnidia).",
        "abstract": "Based on published records and original data, a list of the epibiont suctorian and peritrich ciliates (Ciliophora) on halacarid and hydrachnid mites is presented. Altogether 13 suctorian and 10 peritrich species from hydrachnid and halacarid mites were listed. From this list, six suctorian and one peritrich species have been reported from halacarid mites, while four suctorian and four peritrich species were found on hydrachnid mites determined up to species level. The remaining specimens were determined upto the generic level. The halacarid and hydrachnid species do not share any suctorian and peritrich species and some of the ciliate species are specific to certain taxonomic groups of the hosts.The host specificity of both suctorian and peritrich ciliates, localization on the host body and environment are discussed. Some ciliate species specific to hydrachnid mites prefer lotic or lentic habitats. In most cases, both suctorian and peritrich ciliates prefer only marine or only fresh water bodies. It was also mentioned that both suctorian and peritrich ciliates have not distinct preferences in localization on their host body.",
        "journal": "Zootaxa",
        "authors": "Chatterjee Tapas|Dovgal Igor|PeŠiĆ Vladimir|Zawal Andrzej",
        "entrez_date": "2018-10-14",
        "pmid": "30314157",
        "mesh_terms": "D000818:Animals|D016798:Ciliophora|D017753:Ecosystem|D005618:Fresh Water|D008925:Mites",
        "sup_mesh_terms": "",
        "chemicals": "",
        "publication_types": "Journal Article",
        "keywords": "suctorians, peritrichs, halacarids, hydrachnids, epibiosis, host, specificity, localization, environment, Acari",
        "pmcid": "",
        "doi": "10.11646/zootaxa.4457.3.4",
        "annotations_str": "mesh mesh_D016798|agriculture agrovoc_c_1618|agriculture agrovoc_c_57|mesh mesh_D057189|species ott_150772|species ott_302424|species ncbitaxon_full_6933|species ncbitaxon_full_5878|mesh mesh_D008925|species ncbitaxon_full_92068|species ott_69277|agriculture agrovoc_c_d1532eb7|species ott_5343665|species ott_804400|chemical pubchemmesh_9576412|species ncbitaxon_full_94797|mesh mesh_D000053|species ott_302424|mesh mesh_D008925|mesh mesh_D016798|agriculture agrovoc_c_1618|mesh mesh_D011996|agriculture agrovoc_c_d1532eb7|species ott_5343665|species ncbitaxon_full_5878|mesh mesh_D008925|agriculture agrovoc_c_d1532eb7|mesh mesh_D008925|mesh mesh_D008925|agriculture agrovoc_c_d1532eb7|agriculture agrovoc_c_d1532eb7|conceptual_entity covocconceptualentities_CE_37|biotic_interaction robiext_ROBI_000072|biotic_interaction robiext_ROBI_000072|conceptual_entity covocconceptualentities_CE_77|agriculture agrovoc_c_2593|agriculture agrovoc_c_3673|species ott_302424|biological_process go_bp_GO:0051179|mesh mesh_D058507|environment envo_ENVO_01000254|biotic_interaction robiext_ROBI_000043|conceptual_entity covocconceptualentities_CE_37|biotic_interaction robiext_ROBI_000072|mesh mesh_D004777|biotic_interaction robiext_ROBI_000015|mesh mesh_D017753|mesh mesh_D008925|agriculture agrovoc_c_3456|agriculture agrovoc_c_d1532eb7|species ott_302424|environment envo_ENVO_00002011|agriculture agrovoc_c_50164|agriculture agrovoc_c_3102|biotic_interaction robiext_ROBI_000072|species ott_302424|biological_process go_bp_GO:0051179|conceptual_entity covocconceptualentities_CE_37|mesh mesh_D000818|species ott_5343665|mesh mesh_D016798|agriculture agrovoc_c_1618|species ncbitaxon_full_5878|environment envo_ENVO_01001110|mesh mesh_D017753|mesh mesh_D005618|mesh mesh_D008925|agriculture agrovoc_c_d1532eb7|agriculture agrovoc_c_48277bfc|environment envo_ENVO_01000254|agriculture agrovoc_c_57|species ott_16124|species ott_150772|biological_process go_bp_GO:0051179|species 
ncbitaxon_full_6933|conceptual_entity covocconceptualentities_CE_77|species ncbitaxon_full_6021|mesh mesh_D058507|biotic_interaction robiext_ROBI_000072|chemical pubchemmesh_9576412|species ott_1003210|mesh mesh_D004777|mesh mesh_D000053|agriculture agrovoc_c_2593|species ncbitaxon_full_39462|conceptual_entity covocconceptualentities_CE_37",
        "annotations_material": "mesh|D016798|Ciliophora|ciliophora||agrovoc|c_1618|Ciliophora|ciliophora||agrovoc|c_57|Acari|acari||mesh|D057189|Checklist|checklist||ott|150772|Acari|acari||ott|302424|Ciliophora|ciliates||ncbitaxon_full|6933|Acari|acari||ncbitaxon_full|5878|Ciliophora|ciliophora||mesh|D008925|Mites|mites||ncbitaxon_full|92068|Hydracarina|hydrachnidia||ott|69277|Hydracarina|hydrachnidia||agrovoc|c_d1532eb7|mites|mites||ott|5343665|Ciliophora|ciliophora||ott|804400|Halacaridae|halacaridae||pubchemmesh|9576412|Fenpyroximate|acari||ncbitaxon_full|94797|Halacaridae|halacaridae||mesh|D000053|Acari|acari||mesh|D011996|Records|records||covocconceptualentities|CE_37|host|host||robiext|ROBI_000072|host|host|hosts||covocconceptualentities|CE_77|Specificity|specificity||agrovoc|c_2593|environment|environment||agrovoc|c_3673|hosts|hosts||go_bp|GO:0051179|localization|localization||mesh|D058507|Host Specificity|host specificity|host, specificity||envo|ENVO_01000254|environmental system|environment||robiext|ROBI_000043|parent to|share||mesh|D004777|Environment|environment||robiext|ROBI_000015|habitat|habitats||mesh|D017753|Ecosystem|habitats|ecosystem||agrovoc|c_3456|habitats|habitats||envo|ENVO_00002011|fresh water|fresh water||agrovoc|c_50164|bodies|bodies||agrovoc|c_3102|freshwater|fresh water||mesh|D000818|Animals|animals||envo|ENVO_01001110|ecosystem|ecosystem||mesh|D005618|Fresh Water|fresh water||agrovoc|c_48277bfc|epibiosis|epibiosis||ott|16124|Peritrichia|peritrichs||ncbitaxon_full|6021|Peritrichia|peritrichs||ott|1003210|Suctoria|suctorians||ncbitaxon_full|39462|Suctoria|suctorians"
      },
      "processed_facets": {
        "cellosaurus": [],
        "species": [],
        "ott": [
          "Acari",
          "Ciliophora",
          "Hydracarina",
          "Halacaridae",
          "Peritrichia",
          "Suctoria"
        ],
        "robi": [],
        "journal": [
          "Zootaxa"
        ],
        "publication_types": [
          "Journal Article"
        ],
        "article_type": [],
        "subset": [],
        "chemicals": [],
        "mesh_terms": [
          "D000818:Animals",
          "D016798:Ciliophora",
          "D017753:Ecosystem",
          "D005618:Fresh Water",
          "D008925:Mites"
        ],
        "ext": [],
        "licence": [],
        "language": []
      }
    },
    {
      "_index": "sibils_med24_v4.1.5.4",
      "_id": "36095786",
      "_score": 38.474808,
      "_ignored": [
        "abstract.keyword",
        "annotations_str.keyword"
      ],
      "_source": {
        "title": "An annotated checklist of halacarid mites (Acari, Halacaridae) from India.",
        "abstract": "A compilation of halacarid mite species (Halacaridae) from India has been carried out based on published records. Indian halacarid records were mostly found among algal habitats, some are also reported associated with pneumatophores and halophytes. Copidognathus is the most dominant genus with 19 species reported from the Indian coast. Reports of halacarid mites from Indian interstitial habitats are mostly doubtful and needs verification. Acarochelopodia delamarei, Copidognathus fabricii, Scaptognathus hallezi, Simognathus minutus should be excluded from Indian record. The real diversity and distribution of Indian halacarid fauna is far from being complete and future investigations may reveal further new taxa.",
        "journal": "Zootaxa",
        "authors": "Chatterjee Tapas",
        "entrez_date": "2022-09-12",
        "pmid": "36095786",
        "mesh_terms": "D000818:Animals|D017753:Ecosystem|D007194:India|D008925:Mites",
        "sup_mesh_terms": "",
        "chemicals": "",
        "publication_types": "Journal Article",
        "keywords": "",
        "pmcid": "",
        "doi": "10.11646/zootaxa.5141.4.1",
        "annotations_str": "mesh mesh_D008925|mesh mesh_D007194|mesh mesh_D057189|agriculture agrovoc_c_57|species ncbitaxon_full_94797|chemical pubchemmesh_9576412|agriculture agrovoc_c_d1532eb7|species ott_150772|species ncbitaxon_full_6933|mesh mesh_D000053|species ott_804400|species ott_804400|species ncbitaxon_full_94797|mesh mesh_D011996|mesh mesh_D007194|mesh mesh_D008925|agriculture agrovoc_c_25112|agriculture agrovoc_c_3456|mesh mesh_D011996|biotic_interaction robiext_ROBI_000015|mesh mesh_D017753|mesh mesh_D055051|species ncbitaxon_full_1027280|species ott_819122|environment envo_ENVO_01000687|mesh mesh_D006301|mesh mesh_D058028|agriculture agrovoc_c_3456|biotic_interaction robiext_ROBI_000015|mesh mesh_D017753|agriculture agrovoc_c_d1532eb7|mesh mesh_D008925|species ott_3534487|species ncbitaxon_full_1027280|species ott_3534875|species ott_3535348|species ncbitaxon_full_2060451|species ncbitaxon_full_2060448|species ott_3535176|agriculture agrovoc_c_2821|conceptual_entity covocconceptualentities_CE_19|agriculture agrovoc_c_15ab5afd|mesh mesh_D000818|environment envo_ENVO_01001110|mesh mesh_D017753|mesh mesh_D007194|mesh mesh_D008925|agriculture agrovoc_c_d1532eb7",
        "annotations_material": "mesh|D008925|Mites|mites|mite||mesh|D007194|India|india||mesh|D057189|Checklist|checklist||agrovoc|c_57|Acari|acari||ncbitaxon_full|94797|Halacaridae|halacaridae||pubchemmesh|9576412|Fenpyroximate|acari||agrovoc|c_d1532eb7|mites|mites||ott|150772|Acari|acari||ncbitaxon_full|6933|Acari|acari||mesh|D000053|Acari|acari||ott|804400|Halacaridae|halacaridae||mesh|D011996|Records|records||agrovoc|c_25112|halophytes|halophytes||agrovoc|c_3456|habitats|habitats||robiext|ROBI_000015|habitat|habitats||mesh|D017753|Ecosystem|habitats|ecosystem||mesh|D055051|Salt-Tolerant Plants|halophytes||ncbitaxon_full|1027280|Copidognathus|copidognathus||ott|819122|Copidognathus|copidognathus||envo|ENVO_01000687|coast|coast||mesh|D006301|Health Services Needs and Demand|needs||mesh|D058028|Research Report|reports||ott|3534487|Copidognathus fabricii|copidognathus fabricii||ott|3534875|Acarochelopodia delamarei|acarochelopodia delamarei||ott|3535348|Simognathus minutus|simognathus minutus||ncbitaxon_full|2060451|Simognathus|simognathus||ncbitaxon_full|2060448|Scaptognathus|scaptognathus||ott|3535176|Scaptognathus hallezi|scaptognathus hallezi||agrovoc|c_2821|fauna|fauna||covocconceptualentities|CE_19|distribution|distribution||agrovoc|c_15ab5afd|new taxa|new taxa||mesh|D000818|Animals|animals||envo|ENVO_01001110|ecosystem|ecosystem"
      },
      "processed_facets": {
        "cellosaurus": [],
        "species": [],
        "ott": [
          "Acari",
          "Halacaridae",
          "Copidognathus",
          "Copidognathus fabricii",
          "Acarochelopodia delamarei",
          "Simognathus minutus",
          "Scaptognathus hallezi"
        ],
        "robi": [],
        "journal": [
          "Zootaxa"
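
The `annotations_material` strings in the export above look like a packed list: entries appear to be separated by `||`, and each entry by `|` into a vocabulary, a concept id, a preferred label, and one or more matched terms. The following is a minimal sketch of unpacking them under that assumption; the layout is inferred from the sample only, and the names `Annotation` and `parse_annotations_material` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    vocabulary: str                 # e.g. "mesh", "ott", "agrovoc"
    concept_id: str                 # e.g. "D016798"
    label: str                      # preferred label, e.g. "Ciliophora"
    matched_terms: Tuple[str, ...]  # text fragments that triggered the match


def parse_annotations_material(raw: str) -> List[Annotation]:
    """Split a packed annotations_material string into records.

    Assumes '||' separates entries and '|' separates fields within an entry
    (inferred from the sample export, not from documentation).
    """
    annotations = []
    for entry in raw.split("||"):
        fields = entry.split("|")
        if len(fields) < 3:
            continue  # skip empty or incomplete entries
        vocabulary, concept_id, label, *matched = fields
        annotations.append(Annotation(vocabulary, concept_id, label, tuple(matched)))
    return annotations


sample = (
    "mesh|D016798|Ciliophora|ciliophora||"
    "agrovoc|c_1618|Ciliophora|ciliophora||"
    "robiext|ROBI_000072|host|host|hosts"
)
for annotation in parse_annotations_material(sample):
    print(annotation)
```

Grouping the result by `vocabulary` would give something close to the `processed_facets` lists, though whether BiodiversityPMC derives the facets this way is not stated in the export.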

@jhpoelen
Member Author

jhpoelen commented Dec 3, 2024

Please see the included CSV export of the search results associated with the example https://biodiversitypmc.sibils.org/?query=Halacarid%20mites&tab=plazi#results-section .

sibils_2024-12-03-10h30.csv

which appears to have a malformed line 4, as shown by running:

cat sibils_2024-12-03-10h30.csv | mlr --icsvlite --omd --ifs ';' cat

producing:

| query | collection | doc_id | title | authors | date | answer | score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Halacarid mites | medline | 30314157 | A checklist of epibiont suctorian and peritrich ciliates (Ciliophora) on halacarid and hydrachnid mites (Acari: Halacaridae Hydrachnidia). | Chatterjee Tapas\|Dovgal Igor\|PeŠiĆ Vladimir\|Zawal Andrzej | undefined |  | 39.582214 |
| Halacarid mites | medline | 36095786 | An annotated checklist of halacarid mites (Acari, Halacaridae) from India. | Chatterjee Tapas | undefined |  | 38.474808 |
mlr: mlr: CSV header/data length mismatch 8 != 9 at filename (stdin) line  4.
.
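
To locate such rows without stopping at the first error, a small check can compare each row's field count against the header. This is a sketch only: it assumes the export is semicolon-delimited (as the `--ifs ';'` flag above suggests), and the function name `find_mismatched_rows` and the shortened sample data are illustrative. Python's `csv` module also honors double quotes by default, so its counts may differ from `--icsvlite` on quoted fields.

```python
import csv
import io
from typing import List, Tuple


def find_mismatched_rows(text: str, delimiter: str = ";") -> List[Tuple[int, int, int]]:
    """Return (line_number, expected_fields, actual_fields) for each bad row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the physical line of the row just read
            problems.append((reader.line_num, len(header), len(row)))
    return problems


# Illustrative data: the last row carries one field too many.
example = (
    "query;collection;doc_id;title;authors;date;answer;score\n"
    "Halacarid mites;medline;30314157;A checklist;Chatterjee Tapas;undefined;;39.58\n"
    "Halacarid mites;medline;36095786;An annotated checklist;Chatterjee Tapas;undefined;;38.47\n"
    "Halacarid mites;medline;1;Title with; a stray delimiter;Someone;undefined;;37.00\n"
)
for line_number, expected, actual in find_mismatched_rows(example):
    print(f"line {line_number}: expected {expected} fields, found {actual}")
```

An unquoted delimiter inside a field is only one possible cause; the actual content of line 4 in the export would need to be inspected to confirm it.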

@jhpoelen
Member Author

jhpoelen commented Dec 5, 2024

@myrmoteras suggests creating a csv file like the following:

| authors | pub year | title | ... | original doi | pdf content id | corpus version | zenodo deposition id | zenodo deposition url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Koopman | 1994 | Chiroptera: Systematics | ... | 10.123/456 | hash://md5/70c3e0fba7379e09e95a38569fe29da7 | hash://md5/26f7ce5dd404e33c6570edd4ba250d20 | 13422270 | https://zenodo.org/records/13422270 |

@jhpoelen
Member Author

jhpoelen commented Dec 5, 2024

The underlying workflow could be as depicted in the attached diagram.

  1. The Zotero group is packaged into a versioned copy
  2. Zenodo metadata is derived from this versioned copy
  3. The Zenodo metadata is uploaded to Zenodo along with the linked pdf (content)
  4. For each successful deposit, Zenodo replies with the stored metadata along with the assigned record/deposit id (deposit transaction log in Zenodo json format)
  5. In order to use this information, the deposit transaction log is saved somewhere/somehow as a versioned object
  6. Now, this deposit transaction log is translated into some kind of format that helps users (to be defined) to import the metadata into their preferred literature management system (e.g., Zotero, Endnote, ...).

With this, we have created another derived data product, which includes a json and a csv file that contain all the deposits made within a specific upload activity of a version of a literature corpus.

In other words, we'd like to have a feature that helps export an entire Zenodo community into a format that is compatible with Zotero, Endnote etc.
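
Step 6 could be sketched roughly as follows, turning deposit transaction log entries into csv rows with the columns suggested earlier in this thread (the elided `...` columns are left out). Everything about the log layout here is an assumption: each entry is taken to be a dict with a `deposit` key holding Zenodo's json reply (with `id` and `metadata.title`, `metadata.creators[].name`, `metadata.publication_date`), plus hypothetical `original_doi`, `content_id` and `corpus_version` keys recorded by the workflow itself.

```python
import csv
import io
from typing import Dict, Iterable

COLUMNS = [
    "authors", "pub year", "title", "original doi", "pdf content id",
    "corpus version", "zenodo deposition id", "zenodo deposition url",
]


def log_entry_to_row(entry: Dict) -> Dict[str, str]:
    """Flatten one (assumed) deposit log entry into a csv row."""
    deposit = entry["deposit"]
    metadata = deposit.get("metadata", {})
    creators = metadata.get("creators", [])
    return {
        "authors": " | ".join(c.get("name", "") for c in creators),
        "pub year": metadata.get("publication_date", "")[:4],
        "title": metadata.get("title", ""),
        "original doi": entry.get("original_doi", ""),
        "pdf content id": entry.get("content_id", ""),
        "corpus version": entry.get("corpus_version", ""),
        "zenodo deposition id": str(deposit["id"]),
        # URL pattern taken from the example row earlier in this thread
        "zenodo deposition url": f"https://zenodo.org/records/{deposit['id']}",
    }


def deposit_log_to_csv(entries: Iterable[Dict]) -> str:
    """Render all log entries as one csv document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        writer.writerow(log_entry_to_row(entry))
    return buffer.getvalue()


example_log = [{
    "deposit": {
        "id": 13422270,
        "metadata": {
            "title": "Chiroptera: Systematics",
            "creators": [{"name": "Koopman"}],
            "publication_date": "1994",
        },
    },
    "original_doi": "10.123/456",
    "content_id": "hash://md5/70c3e0fba7379e09e95a38569fe29da7",
    "corpus_version": "hash://md5/26f7ce5dd404e33c6570edd4ba250d20",
}]
print(deposit_log_to_csv(example_log))
```

Because the input is the saved log rather than a live query, the conversion can be rerun for any corpus version without touching the Zenodo API.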

image

@jhpoelen
Member Author

jhpoelen commented Dec 5, 2024

Here's a way to embed this export feature on the Zenodo community page.

image

@jhpoelen
Member Author

jhpoelen commented Dec 5, 2024

Note that a reference list can be downloaded from https://batlit.org/refs.csv .

| id | authors | date | title | journal | doi |
| --- | --- | --- | --- | --- | --- |
| https://www.zotero.org/groups/bat_literature_project/items/YLGQ4TQY | Vriesendorp \| Schulenberg \| Alverson \| Moskovits \| Moscoso | 2006 | Rapid biological inventories: Sierra del Divisor |  |  |
| https://www.zotero.org/groups/bat_literature_project/items/YI9XP2PF | Carus | 1896 | Wissenschaftliche Mittheilungen. | Zoologischer Anzeiger |  |
| https://www.zotero.org/groups/bat_literature_project/items/8YCQ2DAS | Handley, Jr. | 1996 | New species of mammals from northern South America: bats of the genera Histiotus Gervais and Lasiurus Gray (Chiroptera: Vespertilionidae). | PROCEEDINGS OF THE BIOLOGICAL SOCIETY OF WASHINGTON |  |
| https://www.zotero.org/groups/bat_literature_project/items/PXE28CDH | Bastian Jr. \| Tanaka \| Anunciado \| Natural \| Sumalde \| Namikawa | 2002 | Evolutionary Relationships of Flying Foxes (Genus Pteropus) in the Philippines Inferred From DNA Sequences of Cytochrome b Gene | Biochemical Genetics |  |
| https://www.zotero.org/groups/bat_literature_project/items/7XCKGXZ5 | Handley Jr | 1984 | New species of mammals from northern South America: a long-tongued bat, genus Anoura Gray. | PROCEEDINGS OF THE BIOLOGICAL SOCIETY OF WASHINGTON |  |
| https://www.zotero.org/groups/bat_literature_project/items/JEIMLDIZ | Gastón \| Trucco \| Tellaeche \| Bracamonte \| Cuello \| Novillo \| Lizárraga | 2018 09 ä | Mamíferos puneños y altoandinos | Serie Conservación de la Naturaleza |  |
| https://www.zotero.org/groups/bat_literature_project/items/R4EXYWU3 | Carvalho \| Maas \| Peracchi \| Gomes | 2016 | Fruit consumption of Prosopis juliflora (Fabaceae) and Anacardium occidentale (Anacardiaceae) by Artibeus (Phyllostomidae) in the Caatinga biome. | Bol. Soc. Bras. Mastozool. |  |
| https://www.zotero.org/groups/bat_literature_project/items/N9AXRQZ8 | Deshpande | 2012 | Assessing diversity and distribution of bats in relation to land-use and anthropogenic threats in the southern Western Ghats, India. |  |  |
| https://www.zotero.org/groups/bat_literature_project/items/Y5968JVL | Douangboubpha \| Sanamxay \| Xayaphet \| Bumrungsri \| Bates | 2012 | First record of Sphaerias blanfordi (Chiroptera: Pteropodidae) from Lao PDR. | Tropical Natural History |  |
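
One candidate for a Zotero/Endnote-compatible export is RIS, which both tools can import. Below is a minimal sketch that turns one reference row into an RIS record; it assumes the row has already been read into a dict keyed by the column names above and that authors are separated by `|`. The function name `reference_to_ris` and the choice of RIS are illustrative, not something decided in this thread.

```python
from typing import Dict


def reference_to_ris(ref: Dict[str, str]) -> str:
    """Render one reference row as an RIS record (tag, two spaces, hyphen, space)."""
    journal = ref.get("journal", "").strip()
    # Use the generic type when no journal is given.
    lines = ["TY  - JOUR" if journal else "TY  - GEN"]
    for author in ref.get("authors", "").split("|"):
        author = author.strip()
        if author:
            lines.append(f"AU  - {author}")
    for tag, key in (("PY", "date"), ("TI", "title"), ("JO", "journal"),
                     ("DO", "doi"), ("UR", "id")):
        value = ref.get(key, "").strip()
        if value:
            lines.append(f"{tag}  - {value}")
    lines.append("ER  - ")  # end-of-record marker
    return "\n".join(lines)


row = {
    "id": "https://www.zotero.org/groups/bat_literature_project/items/Y5968JVL",
    "authors": "Douangboubpha | Sanamxay | Xayaphet | Bumrungsri | Bates",
    "date": "2012",
    "title": "First record of Sphaerias blanfordi (Chiroptera: Pteropodidae) from Lao PDR.",
    "journal": "Tropical Natural History",
    "doi": "",
}
print(reference_to_ris(row))
```

Rows with irregular dates would need cleaning before the `PY` tag is reliable.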

screenshot below using https://batlit.org/refs on 2024-12-05

image
