This README details the Metadata, Data, Annotations and annotation guide and the Notebooks created for the methodological research on travelogues. We evaluated different methodologies to apply aspect-based sentiment analysis to this literary-historical dataset.
The metadata for the entire corpus is split per collection, and can be downloaded via our Drive folder.
- Biodiversity Heritage Library | BHL_merged.csv
- Travelogues Project | TP_merged.csv
- Italian Travelogues | IT_merged.csv
- Gutenberg Project | GB_merged.csv
- DBNL | DBNL_merged.csv
Each .CSV-file contains the following columns:
- ID (new ID for processing the data)
- language (language the text is written in)
- title (title of the book)
- author (author of the book)
- date_published (year the book was published)
- Original_ID (original ID from the source. These are also the names of the text files in the gathered corpus.)
- no_of_character (number of characters)
- no_of_tokens (number of tokens as processed by SpaCy)
- OCR_quality (quality of the OCR according to the Garbageness Score.
Texts gathered are named according to the Original_ID column.
The BHL corpus is published on our Drive due to its size. The dataset is split according to the key words used to scrape the texts (explor, journe, excurs, travel, expeditie, reis, trip). The texts contain a multitude of languages (Dutch, English, French, German, Portuguese, Latin, ...). The code used to scrape this data from the API is published in our Notebooks folder.
The Gutenberg corpus is published on our Drive. The texts are in both English and French.
The DBNL is published on our Drive. It contains all texts requested from the DBNL website that are related to travel. The texts are all in Dutch.
The Italian Travels dataset can be gathered from the project "Today we Have Passed with the Ancients...": Visions of Italy between XIX and XX century . Files are available in .TEI and .TXT.
The German Travelogues Project dataset can be gathered from their GitHub repository. More information on the corpus can be found on their website.
We created an annotated dataset comprising texts in English, Dutch, German and French which were annotated for biodiversity-related aspects and their associated sentiment. The annotated dataset is published on our Drive. The aspects annotated are further detailed in the annotation_guide.PDF. Sentiment-bearing words are annotated on a 1 (very negative) to 5 (very positive)-point scale. Sentiment was also annotated on the level of the sentence. The .ZIP-file Annotations.zip contains the annotated files in UIMA CAS XMI (XML 1.1), and can easily be parsed using the Inceptalytics API.
The following aspects were considered:
- PERSON
- LOCATION
- ORGANISATION
- FAUNA
- FLORA
- BIOME
- HUMAN_LANDFORM
- NATURAL_LANDFORM
- NATURAL_PHENOMENON
- WEATHER
- MYTH
Four notebooks are cleaned up and made available for reuse and further adaptation to specific use-cases within DH. These notebooks were developed to apply aspect-based sentiment analysis to the English subset of our annotated data. All notebooks detail two steps: 1) aspect extraction and 2) sentiment analysis. Aspect extraction is evaluated by turning the annotations into BIO-labels and then using the dependency Nervaluate. Sentiment analysis is evaluated on the gold standard annotations.
Rule_based_ABSA.ipynb for aspect extraction and sentiment analysis.
- ML_based_ABSA_aspects.ipynb (aspect extraction).
- ML_based_ABSA_feature_extraction_classification.ipynb (extracting MacBERT and BERT embeddings as features and use the embeddings in a set of classifiers).
Prompt-based workflow based on the mistralai/Mixtral-8x7B-Instruct-v0.1 model. Perform aspect extraction and sentiment analysis respectively.
GENLLM_based_ABSA.ipynb
Code to scrape the BHL website based on keywords.
Scraping_BHL.ipnyb