Commit

Add file on Trafilatura integration (#294)
* add file on Trafilatura integration

* review markdown

* Update integrations/trafilatura.md

Co-authored-by: Stefano Fiorucci <[email protected]>

* Update integrations/trafilatura.md

Co-authored-by: Bilge Yücel <[email protected]>

* Update integrations/trafilatura.md

Co-authored-by: Bilge Yücel <[email protected]>

* Update integrations/trafilatura.md

Co-authored-by: Bilge Yücel <[email protected]>

* Update integrations/trafilatura.md

Co-authored-by: Bilge Yücel <[email protected]>

* add Trafilatura logo

---------

Co-authored-by: Stefano Fiorucci <[email protected]>
Co-authored-by: Bilge Yücel <[email protected]>
3 people authored Jan 8, 2025
1 parent 47764a4 commit 1c85bb4
Showing 2 changed files with 70 additions and 0 deletions.
70 changes: 70 additions & 0 deletions integrations/trafilatura.md
@@ -0,0 +1,70 @@
---
layout: integration
name: Trafilatura
description: Efficiently gather text and metadata on the Web for LLM and RAG
authors:
- name: Adrien Barbaresi
socials:
github: adbar
twitter: adbarbaresi
linkedin: https://www.linkedin.com/in/adrienbarbaresi
pypi: https://pypi.org/project/trafilatura/
repo: https://github.com/adbar/trafilatura
report_issue: https://github.com/adbar/trafilatura/issues
logo: /logos/trafilatura.png
type: Data Ingestion
version: Haystack 2.0
---


### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Settings](#settings)


## Overview

Trafilatura is a Python package and command-line tool that gathers text on the Web and turns raw HTML into structured, meaningful data. Its extraction component is integrated into Haystack.

Reducing a full HTML page to its essential parts alleviates many text-quality problems: the output focuses on the actual content and leaves out the noise, which benefits LLM applications.


## Installation

```bash
pip install haystack-ai trafilatura
```


## Usage

Trafilatura powers the [`HTMLToDocument`](https://docs.haystack.deepset.ai/docs/htmltodocument) component in Haystack's converters. Here is how to use it:

```python
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()
results = converter.run(sources=["path/to/sample.html"])
documents = results["documents"]
print(documents[0].content)
# 'This is a text from the HTML file.'
```


### Settings

The `__init__` and `run` methods take an optional `extraction_kwargs` parameter, which is passed on to Trafilatura. It has to be a dictionary of arguments known to the package. The following options are useful in this context:

- Choice of HTML elements
  - `include_comments=True` (comment sections at the bottom of articles)
  - `include_images=True`
  - `include_tables=True` (active by default)
  - `prune_xpath=["//p[@class='discarded']"]` (pruning the tree before extraction)
- Optimization for precision or recall
  - `favor_precision=True` (if your results contain too much noise)
  - `favor_recall=True` (if parts of your documents are missing)
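
The options above can be grouped into reusable dictionaries and handed to the converter. The following is a minimal sketch, not an official recipe: the preset names and the `convert_html` helper are hypothetical and introduced here only for illustration, while `HTMLToDocument` and `extraction_kwargs` are the names used in this document.

```python
from typing import Any

# Two illustrative presets built only from the options listed above.
# The preset names are hypothetical, not part of Haystack or Trafilatura.
PRECISION_PRESET: dict[str, Any] = {
    "favor_precision": True,    # use when results contain too much noise
    "include_comments": False,  # leave out comment sections
}
RECALL_PRESET: dict[str, Any] = {
    "favor_recall": True,       # use when parts of the documents are missing
    "include_tables": True,     # already active by default, stated explicitly
}


def convert_html(sources: list[str], extraction_kwargs: dict[str, Any]) -> list:
    """Convert HTML sources to Haystack documents with the given options."""
    # Imported inside the function so the presets can be inspected
    # even where Haystack is not installed.
    from haystack.components.converters import HTMLToDocument

    converter = HTMLToDocument(extraction_kwargs=extraction_kwargs)
    return converter.run(sources=sources)["documents"]
```

Calling `convert_html(["path/to/sample.html"], PRECISION_PRESET)` would then return the converted documents for that file. The same dictionary could instead be passed as `extraction_kwargs` to `run` when the options should apply to a single call only.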

For more information see the [Python usage](https://trafilatura.readthedocs.io/en/latest/usage-python.html) and [function description](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract) parts of the official documentation.
Binary file added logos/trafilatura.png
