@@ -1,26 +1,26 @@
---
title: FileOrURLToMarkdownConverterBatch
title: FileOrURLToMarkdownConverterAPI
createTime: 2025/10/09 16:52:48
permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterbatch/
permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterapi/
---

## 📘 Overview

`FileOrURLToMarkdownConverterBatch` is a knowledge extraction operator that supports extracting structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, converting them into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.
`FileOrURLToMarkdownConverterAPI` is a knowledge extraction operator that uses the official MinerU API. It supports extracting structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, converting them into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.

## **init** Function

```python
def __init__(self, intermediate_dir: str = "intermediate", lang: str = "en", mineru_backend: str = "vlm-vllm-engine", ):
def __init__(self, intermediate_dir: str = "intermediate", mineru_backend: str = "vlm", api_key: str = None):
```

### init Parameter Description

| Parameter | Type | Default | Description |
| :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. |
| **lang** | str | "en" | Specifies the main language of the document (e.g., 'zh' for Chinese, 'en' for English) to optimize parsing performance. |
| **mineru_backend** | str | "vlm-sglang-engine" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm-transformers", "vlm-vllm-engine", and "vlm-http-client". |
| **api_key** | str | None | Specifies the API key for accessing external MinerU services. |
| **mineru_backend** | str | "vlm" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm-transformers", "MinerU-HTML". |
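
Because `api_key` is passed as a plain string, one option is to read it from the environment instead of writing it into the pipeline script. The sketch below is only an illustration: the helper `resolve_api_key` and the variable name `MINERU_API_KEY` are hypothetical and are not defined by the operator or by MinerU.

```python
import os


def resolve_api_key(env_var: str = "MINERU_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical convention for this sketch; the
    operator itself only receives the resulting string via `api_key`.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set {env_var} before creating the operator.")
    return key
```

The returned value could then be passed as `api_key=resolve_api_key()` when constructing the operator.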

### Prompt Template Description

@@ -45,10 +45,10 @@ def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: s
## 🧠 Example Usage

```python
self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch(
self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterAPI(
intermediate_dir="../example_data/KBCleaningPipeline/raw/",
lang="en",
mineru_backend="vlm-vllm-engine",
api_key="your-api-key-here",
mineru_backend="vlm",
)
self.knowledge_cleaning_step1.run(
storage=self.storage.step(),
@@ -0,0 +1,98 @@
---
title: FileOrURLToMarkdownConverterFlash
createTime: 2025/10/09 16:52:48
permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterflash/
---

## 📘 Overview

`FileOrURLToMarkdownConverterFlash` is a knowledge extraction operator that runs a locally installed MinerU model through Flash-MinerU. It supports extracting structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, converting them into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.

## **init** Function

```python
def __init__(
    self,
    intermediate_dir: str = "intermediate",
    mineru_model_path=None,
    batch_size: int = 4,
    replicas: int = 1,
    num_gpus_per_replica: float = 1,
    engine_gpu_util_rate_to_ray_cap: float = 0.9
):
```

### init Parameter Description

| Parameter | Type | Default | Description |
| :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **intermediate_dir** | str | "intermediate" | Directory path for storing intermediate files generated during the conversion process. |
| **mineru_model_path** | str | None | Model path used by FlashMinerU (required; e.g., the directory of MinerU2.5-xxx weights). |
| **batch_size** | int | 4 | Batch size for processing. |
| **replicas** | int | 1 | Number of processes for multi-process inference. |
| **num_gpus_per_replica** | float | 1 | Number of GPUs occupied by each replica. |
| **engine_gpu_util_rate_to_ray_cap** | float | 0.9 | Upper-limit coefficient for Ray resource utilization (Flash-MinerU uses Ray for multi-process inference). For example, 0.9 means Ray reserves 10% of the resources. This leaves headroom for Ray's management processes and helps prevent OOM while keeping computation efficient. It is usually set between 0.8 and 1.0. |
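
As a rough planning aid, the GPU demand implied by these parameters is `replicas × num_gpus_per_replica`. The sketch below is a hypothetical helper (not part of the operator or Flash-MinerU) that checks this demand against the number of GPUs on the machine. How `engine_gpu_util_rate_to_ray_cap` is applied internally is not described here, so the helper only checks that it lies in the range (0, 1].

```python
def check_gpu_plan(
    replicas: int,
    num_gpus_per_replica: float,
    available_gpus: int,
    engine_gpu_util_rate_to_ray_cap: float = 0.9,
) -> float:
    """Return the total GPU demand, or raise if the plan cannot fit.

    Hypothetical pre-flight check; it does not talk to Ray or to any GPU.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if num_gpus_per_replica <= 0:
        raise ValueError("num_gpus_per_replica must be positive")
    if not 0.0 < engine_gpu_util_rate_to_ray_cap <= 1.0:
        raise ValueError("engine_gpu_util_rate_to_ray_cap must be in (0, 1]")

    # Each replica occupies num_gpus_per_replica GPUs.
    required = replicas * num_gpus_per_replica
    if required > available_gpus:
        raise ValueError(
            f"plan needs {required} GPUs but only {available_gpus} are available"
        )
    return required
```

For instance, `replicas=2` with `num_gpus_per_replica=1` needs two GPUs, so the check would fail on a single-GPU machine.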


### Prompt Template Description

| Prompt Template Name | Main Purpose | Applicable Scenario | Characteristics |
| -------------------- | ------------ | ------------------- | --------------- |
| -- | -- | -- | -- |

## run Function

```python
def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"):
```

#### Parameters

| Name | Type | Default | Description |
| :------------- | :-------------- | :---------- | :---------------------------------------------------------------------- |
| **storage** | DataFlowStorage | Required | Data flow storage instance responsible for reading and writing data. |
| **input_key** | str | "source" | Input column name containing the file path or URL to be processed. |
| **output_key** | str | "text_path" | Output column name that stores the path to the generated Markdown file. |
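
Each input row carries one file path or URL under `input_key`. A minimal sketch of preparing such rows, assuming the storage instance is backed by a JSON Lines file (the helper name and the file layout are assumptions made for illustration):

```python
import json
from pathlib import Path


def write_sources(sources, out_path, input_key="source"):
    """Write one JSON object per line, each holding a single source.

    Hypothetical helper: it only prepares the input file and does not
    depend on DataFlow.
    """
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for src in sources:
            f.write(json.dumps({input_key: src}, ensure_ascii=False) + "\n")
    return out
```

If a non-default column name is used here, the same name would need to be passed as `input_key` to `run`.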

## 🧠 Example Usage

```python
self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash(
    intermediate_dir="intermediate",
    mineru_model_path="<path_to_local>/MinerU2.5-2509-1.2B",
    batch_size=4,
    replicas=2,
    num_gpus_per_replica=1,
    engine_gpu_util_rate_to_ray_cap=0.9
)
self.knowledge_cleaning_step1.run(
    storage=self.storage.step(),
    # input_key=,
    # output_key=,
)
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :-------- | :--- | :----------------------------------- |
| source | str | Input source file path or URL. |
| text_path | str | Path to the generated Markdown file. |

Example Input:

```json
{
  "source": "/path/to/your/document.pdf"
}
```

Example Output:

```json
{
  "source": "/path/to/your/document.pdf",
  "text_path": "intermediate/document_pdf.md"
}
```
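
A downstream step may want to confirm that each row matches the output format above before using it. The following is a minimal sketch with a hypothetical helper; the `.md` suffix check is an assumption inferred from the example output, not a documented guarantee.

```python
def validate_record(record, input_key="source", output_key="text_path"):
    """Return a list of problems found in one output row (empty if none)."""
    problems = []
    for key in (input_key, output_key):
        value = record.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or non-string field: {key}")

    text_path = record.get(output_key)
    # Assumption: generated Markdown files use the .md extension.
    if isinstance(text_path, str) and text_path and not text_path.endswith(".md"):
        problems.append(f"{output_key} does not point to a .md file")
    return problems
```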
@@ -0,0 +1,92 @@
---
title: FileOrURLToMarkdownConverterLocal
createTime: 2025/10/09 16:52:48
permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterlocal/
---

## 📘 Overview

`FileOrURLToMarkdownConverterLocal` is a knowledge extraction operator that runs a locally installed MinerU model. It supports extracting structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, converting them into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.

## **init** Function

```python
def __init__(
    self,
    intermediate_dir: str = "intermediate",
    mineru_backend: str = "vlm-auto-engine",
    mineru_source: str = "local",
    mineru_model_path: str = None,
    mineru_download_model_type: str = "vlm"
):
```

### init Parameter Description

| Parameter | Type | Default | Description |
| :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. |
| **mineru_backend** | str | "vlm-auto-engine" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm-transformers", "MinerU-HTML". |
| **mineru_source** | str | "local" | Specifies the source of the MinerU model, corresponding to MINERU_MODEL_SOURCE. Options include "modelscope", "huggingface", "local". |
| **mineru_model_path** | str | None | Local model directory, required when `mineru_source='local'`. |
| **mineru_download_model_type** | str | "vlm" | Specifies the type of MinerU model to download. |

### Prompt Template Description

| Prompt Template Name | Main Purpose | Applicable Scenario | Characteristics |
| -------------------- | ------------ | ------------------- | --------------- |
| -- | -- | -- | -- |

## run Function

```python
def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"):
```

#### Parameters

| Name | Type | Default | Description |
| :------------- | :-------------- | :---------- | :---------------------------------------------------------------------- |
| **storage** | DataFlowStorage | Required | Data flow storage instance responsible for reading and writing data. |
| **input_key** | str | "source" | Input column name containing the file path or URL to be processed. |
| **output_key** | str | "text_path" | Output column name that stores the path to the generated Markdown file. |

## 🧠 Example Usage

```python
self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal(
    intermediate_dir="../example_data/KBCleaningPipeline/raw/",
    mineru_backend="vlm-auto-engine",
    mineru_source="local",
    mineru_model_path="<path_to_local>/MinerU2.5-2509-1.2B",
    mineru_download_model_type="vlm"
)
self.knowledge_cleaning_step1.run(
    storage=self.storage.step(),
    # input_key=,
    # output_key=,
)
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :-------- | :--- | :----------------------------------- |
| source | str | Input source file path or URL. |
| text_path | str | Path to the generated Markdown file. |

Example Input:

```json
{
  "source": "/path/to/your/document.pdf"
}
```

Example Output:

```json
{
  "source": "/path/to/your/document.pdf",
  "text_path": "intermediate/document_pdf.md"
}
```
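
Since `text_path` holds only a file path, a later step has to open the file to get the Markdown itself. A minimal sketch, assuming the output rows are available as a list of dictionaries (the helper name is hypothetical):

```python
from pathlib import Path


def load_markdown(records, input_key="source", output_key="text_path"):
    """Map each source to the Markdown text stored at its text_path.

    Rows whose file does not exist are skipped rather than raising.
    """
    docs = {}
    for record in records:
        path = Path(record[output_key])
        if path.is_file():
            docs[record[input_key]] = path.read_text(encoding="utf-8")
    return docs
```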
@@ -21,7 +21,9 @@ The Knowledge Base Cleaning Operator can perform extraction, organization, and c

| Name | Applicable Type | Description | Official Repository/Paper |
| --------------------- | :-------------- | ------------------------------------------------------------ | ------------------------------------------------------ |
| FileOrURLToMarkdownConverterBatch🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - |
| FileOrURLToMarkdownConverterFlash🚀🚀✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. (Based on Flash-MinerU) | [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU) |
| FileOrURLToMarkdownConverterAPI🚀✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. (Based on MinerU Official API) | [MinerU](https://github.com/opendatalab/MinerU) |
| FileOrURLToMarkdownConverterLocal✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. (Based on MinerU) | [MinerU](https://github.com/opendatalab/MinerU) |
| KBCChunkGenerator✨ | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - |
| KBCTextCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - |
| Text2MultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD) |