diff --git a/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md similarity index 68% rename from docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md rename to docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md index d470dcede..f15439108 100644 --- a/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md +++ b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md @@ -1,17 +1,17 @@ --- -title: FileOrURLToMarkdownConverterBatch +title: FileOrURLToMarkdownConverterAPI createTime: 2025/10/09 16:52:48 -permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterbatch/ +permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterapi/ --- ## 📘 Overview -`FileOrURLToMarkdownConverterBatch` is a knowledge extraction operator that supports extracting structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, converting them into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information. +`FileOrURLToMarkdownConverterAPI` is an operator that uses the official MinerU API for knowledge extraction. It extracts structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, and converts it into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.
## **init** Function ```python -def __init__(self, intermediate_dir: str = "intermediate", lang: str = "en", mineru_backend: str = "vlm-vllm-engine", ): +def __init__(self, intermediate_dir: str = "intermediate", mineru_backend: str = "vlm", api_key:str = None): ``` ### init Parameter Description @@ -19,8 +19,8 @@ def __init__(self, intermediate_dir: str = "intermediate", lang: str = "en", min | Parameter | Type | Default | Description | | :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. | -| **lang** | str | "en" | Specifies the main language of the document (e.g., 'zh' for Chinese, 'en' for English) to optimize parsing performance. | -| **mineru_backend** | str | "vlm-sglang-engine" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm-transformers", "vlm-vllm-engine", and "vlm-http-client". | +| **api_key** | str | None | Specifies the API key used to access the external MinerU service. | +| **mineru_backend** | str | "vlm" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm", and "MinerU-HTML".
| ### Prompt Template Description @@ -45,10 +45,10 @@ def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: s ## 🧠 Example Usage ```python -self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch( +self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterAPI( intermediate_dir="../example_data/KBCleaningPipeline/raw/", - lang="en", - mineru_backend="vlm-vllm-engine", + api_key="your-api-key-here", + mineru_backend="vlm", ) self.knowledge_cleaning_step1.run( storage=self.storage.step(), diff --git a/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md new file mode 100644 index 000000000..9754e68bf --- /dev/null +++ b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md @@ -0,0 +1,98 @@ +--- +title: FileOrURLToMarkdownConverterFlash +createTime: 2025/10/09 16:52:48 +permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterflash/ +--- + +## 📘 Overview + +`FileOrURLToMarkdownConverterFlash` is an operator that runs Flash-MinerU locally for knowledge extraction. It extracts structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, and converts it into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information.
+ +## **init** Function + +```python +def __init__( + self, + intermediate_dir: str = "intermediate", + mineru_model_path=None, + batch_size:int = 4, + replicas:int = 1, + num_gpus_per_replica:float = 1, + engine_gpu_util_rate_to_ray_cap:float = 0.9 +): + +``` + +### init Parameter Description + +| Parameter | Type | Default | Description | +| :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **intermediate_dir** | str | "intermediate" | Directory path for storing intermediate files generated during the conversion process. | +| **mineru_model_path** | str | None | Model path used by FlashMinerU (required; e.g., the directory of MinerU2.5-xxx weights). | +| **batch_size** | int | 4 | Batch size for processing. | +| **replicas** | int | 1 | Number of processes for multi-process inference. | +| **num_gpus_per_replica** | float | 1 | Number of GPUs occupied by each replica. | +| **engine_gpu_util_rate_to_ray_cap** | float | 0.9 | Upper-bound coefficient for Ray resource utilization (flash-mineru uses Ray for multi-process inference). For example, a value of 0.9 means Ray reserves 10% of the resources. This leaves room for Ray's management processes and helps prevent OOM while keeping computation efficient. The value is usually set between 0.8 and 1.0.
| + + +### Prompt Template Description + +| Prompt Template Name | Main Purpose | Applicable Scenario | Characteristics | +| -------------------- | ------------ | ------------------- | --------------- | +| -- | -- | -- | -- | + +## run Function + +```python +def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"): +``` + +#### Parameters + +| Name | Type | Default | Description | +| :------------- | :-------------- | :---------- | :---------------------------------------------------------------------- | +| **storage** | DataFlowStorage | Required | Data flow storage instance responsible for reading and writing data. | +| **input_key** | str | "source" | Input column name containing the file path or URL to be processed. | +| **output_key** | str | "text_path" | Output column name that stores the path to the generated Markdown file. | + +## 🧠 Example Usage + +```python +self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash( + intermediate_dir = "intermediate", + mineru_model_path="/MinerU2.5-2509-1.2B", + batch_size = 4, + replicas = 2, + num_gpus_per_replica = 1, + engine_gpu_util_rate_to_ray_cap = 0.9 +) +self.knowledge_cleaning_step1.run( + storage=self.storage.step(), + # input_key=, + # output_key=, +) +``` + +#### 🧾 Default Output Format + +| Field | Type | Description | +| :-------- | :--- | :----------------------------------- | +| source | str | Input source file path or URL. | +| text_path | str | Path to the generated Markdown file. 
| + +Example Input: + +```json +{ +"source":"/path/to/your/document.pdf" +} +``` + +Example Output: + +```json +{ +"source":"/path/to/your/document.pdf", +"text_path":"intermediate/document_pdf.md" +} +``` diff --git a/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md new file mode 100644 index 000000000..02eba19e9 --- /dev/null +++ b/docs/en/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md @@ -0,0 +1,92 @@ +--- +title: FileOrURLToMarkdownConverterLocal +createTime: 2025/10/09 16:52:48 +permalink: /en/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterlocal/ +--- + +## 📘 Overview + +`FileOrURLToMarkdownConverterLocal` is an operator that uses a locally installed MinerU model for knowledge extraction. It extracts structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and URLs, and converts it into a unified Markdown format. The operator automatically detects the file type and invokes the optimal parsing engine (such as MinerU or trafilatura) to preserve the original layout and key information. + +## **init** Function + +```python + def __init__(self, + intermediate_dir: str = "intermediate", + mineru_backend: str = "vlm-auto-engine", + mineru_source: str = "local", + mineru_model_path:str = None, + mineru_download_model_type:str = "vlm" + ): +``` + +### init Parameter Description + +| Parameter | Type | Default | Description | +| :------------------- | :--- | :------------------ | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process.
| +| **mineru_backend** | str | "vlm-auto-engine" | Specifies the backend engine for MinerU, used for handling complex documents such as PDFs. Options include "pipeline", "vlm-sglang-engine", and "vlm-auto-engine". | +| **mineru_source** | str | "local" | Specifies the source of the MinerU model, corresponding to MINERU_MODEL_SOURCE. Options include "modelscope", "huggingface", and "local". | +| **mineru_model_path** | str | None | Local model directory, required when `mineru_source='local'`. | +| **mineru_download_model_type** | str | "vlm" | Specifies the type of MinerU model to download. | + +### Prompt Template Description + +| Prompt Template Name | Main Purpose | Applicable Scenario | Characteristics | +| -------------------- | ------------ | ------------------- | --------------- | +| -- | -- | -- | -- | + +## run Function + +```python +def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"): +``` + +#### Parameters + +| Name | Type | Default | Description | +| :------------- | :-------------- | :---------- | :---------------------------------------------------------------------- | +| **storage** | DataFlowStorage | Required | Data flow storage instance responsible for reading and writing data. | +| **input_key** | str | "source" | Input column name containing the file path or URL to be processed. | +| **output_key** | str | "text_path" | Output column name that stores the path to the generated Markdown file.
| + +## 🧠 Example Usage + +```python +self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal( + intermediate_dir="../example_data/KBCleaningPipeline/raw/", + mineru_backend="vlm-auto-engine", + mineru_source="local", + mineru_model_path="/MinerU2.5-2509-1.2B", + mineru_download_model_type="vlm" +) +self.knowledge_cleaning_step1.run( + storage=self.storage.step(), + # input_key=, + # output_key=, +) +``` + +#### 🧾 Default Output Format + +| Field | Type | Description | +| :-------- | :--- | :----------------------------------- | +| source | str | Input source file path or URL. | +| text_path | str | Path to the generated Markdown file. | + +Example Input: + +```json +{ +"source":"/path/to/your/document.pdf" +} +``` + +Example Output: + +```json +{ +"source":"/path/to/your/document.pdf", +"text_path":"intermediate/document_pdf.md" +} +``` diff --git a/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md b/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md index f06496193..3051ebb94 100644 --- a/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md +++ b/docs/en/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md @@ -21,7 +21,9 @@ The Knowledge Base Cleaning Operator can perform extraction, organization, and c | Name | Applicable Type | Description | Official Repository/Paper | | --------------------- | :-------------- | ------------------------------------------------------------ | ------------------------------------------------------ | -| FileOrURLToMarkdownConverterBatch🚀✨ | Knowledge Extraction | This operator extracts various heterogeneous text knowledge into markdown format for subsequent processing. | - | +| FileOrURLToMarkdownConverterFlash🚀🚀✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. 
(Based on Flash-MinerU) | [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU) | +| FileOrURLToMarkdownConverterAPI🚀✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. (Based on MinerU Official API) | [MinerU](https://github.com/opendatalab/MinerU) | +| FileOrURLToMarkdownConverterLocal✨ | Knowledge Extraction | This operator is used to extract various heterogeneous text knowledge into Markdown format for easy subsequent processing. (Based on MinerU) | [MinerU](https://github.com/opendatalab/MinerU) | | KBCChunkGenerator✨ | Corpus Segmentation | This operator provides multiple methods to split full texts into appropriately sized segments for subsequent operations like indexing. | - | | KBCTextCleaner🚀✨ | Knowledge Cleaning | This operator uses LLM to clean organized raw text, including but not limited to normalization and privacy removal. | - | | Text2MultiHopQAGenerator🚀✨ | Knowledge Paraphrasing | This operator uses a three-sentence sliding window to paraphrase cleaned knowledge bases into a series of multi-step reasoning QAs, which better facilitates accurate RAG reasoning. | [MIRAID](https://github.com/eth-medical-ai-lab/MIRIAD) | diff --git a/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md b/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md index a838960d2..c693f2088 100644 --- a/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md +++ b/docs/en/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md @@ -24,10 +24,45 @@ The main workflow of the pipeline includes: ### 1. Information Extraction -The first step of the pipeline is to extract textual knowledge from users' original documents or URLs using FileOrURLToMarkdownConverter. This step is crucial as it converts various formats of raw documents into unified markdown text, facilitating subsequent cleaning processes. 
+The first step of the pipeline is to extract textual knowledge from the user's original documents or URLs using one of three operators: FileOrURLToMarkdownConverterFlash, FileOrURLToMarkdownConverterAPI, or FileOrURLToMarkdownConverterLocal. This step is critical because it converts raw documents in various formats into unified Markdown text, which simplifies the subsequent cleaning steps. - +#### 1.1 FileOrURLToMarkdownConverterFlash operator +If you use the FileOrURLToMarkdownConverterFlash operator, PDF extraction is based on [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU), and you need to install the additional flash-mineru library. (flash-mineru adds multi-process inference acceleration on top of MinerU, and its parsing speed is much faster than MinerU alone. This operator is recommended if you want to parse PDFs locally.) + +```shell +pip install 'flash-mineru[vllm]' +# or +pip install 'open-dataflow[flash-mineru]' +``` + +Next, download the pre-trained MinerU model for local inference. You can follow the model download method described in the FileOrURLToMarkdownConverterLocal operator tutorial later in this document, or download the model directly from Hugging Face ([mineru model huggingface](https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B)) or from ModelScope ([mineru model modelscope](https://www.modelscope.ai/models/OpenDataLab/MinerU2.5-2509-1.2B)). After downloading, pass the model path to the FileOrURLToMarkdownConverterFlash operator through `mineru_model_path`.
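Before handing the downloaded weights to the operator, it can help to confirm that the directory is really there. The following is a minimal sketch under stated assumptions, not part of DataFlow or MinerU: `resolve_model_dir` is a hypothetical helper name, and treating "exists and is non-empty" as a usable download is an assumption made for illustration.

```python
from pathlib import Path


def resolve_model_dir(model_path: str) -> str:
    """Return the absolute path of an existing, non-empty model directory.

    Hypothetical helper for illustration only; it is not provided by
    DataFlow, MinerU, or Flash-MinerU.
    """
    path = Path(model_path).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"MinerU model directory not found: {path}")
    # An empty directory usually means the download did not complete.
    if not any(path.iterdir()):
        raise FileNotFoundError(f"MinerU model directory is empty: {path}")
    return str(path)
```

The returned string could then be supplied as the `mineru_model_path` argument, so a wrong path fails early with a clear message rather than during inference.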
+ +**Input**: original document file or URL (using Flash-MinerU) **Output**: extracted markdown text + +**Example**: + +```python +self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash( + intermediate_dir = "intermediate", # Directory for intermediate artifacts generated during processing + mineru_model_path=None, # Model path used by FlashMinerU (required; e.g., MinerU2.5-xxx weights directory) + batch_size = 4, # Batch size + replicas = 2, # Number of replicas for PDF inference + num_gpus_per_replica = 1, # Number of GPUs occupied by each replica + engine_gpu_util_rate_to_ray_cap = 0.9 # Ray Resource Utilization Upper Bound Coefficient (given that flash-mineru essentially utilizes Ray for multi-process inference). For example, setting this to 0.9 means Ray will reserve 10% of the system resources. To ensure computational efficiency while leaving sufficient resources for Ray's management processes(raylet) and preventing OOM (Out of Memory) errors, this value is typically set between 0.8 and 1.0. +) +self.knowledge_cleaning_step1.run( + storage=self.storage.step(), + # input_key=, + # output_key=, +) +``` + +#### 1.2 FileOrURLToMarkdownConverterLocal operator + +If the FileOrURLToMarkdownConverterLocal operator is used in this system, PDF extraction is based on [MinerU](https://github.com/opendatalab/MinerU), and additional configuration is required. Users can configure it as follows. + + ```shell conda create -n dataflow python=3.10 @@ -38,93 +73,90 @@ pip install -e . pip install 'mineru[all]' ``` -PDF file extraction in this system is based on [MinerU](https://github.com/opendatalab/MinerU), and requires additional configuration. Users can configure it using the following steps. - - -> #### Using the Local Model -> -> To run the `MinerU` model locally, you need to first download the model files to your local storage. `MinerU` provides an interactive command-line tool to simplify this process. -> -> #### 1. 
Download Tool Guide: -> -> You can view the help information for the model download tool using the following command: -> +> #### Using local models +> +> To run `MinerU` models locally, you need to download them to local storage first. `MinerU` provides an interactive command-line tool to simplify this process. +> +> #### 1. Download tool instructions: +> +> You can use the following command to view the help information of the model download tool: +> > ```bash > mineru-models-download --help > ``` -> -> #### 2. Start the Model Download: -> -> Run the following command to begin the download process: -> +> +> #### 2. Start model download: +> +> Execute the following command to start the download process: +> > ```bash > mineru-models-download > ``` -> -> During the download process, you will encounter the following interactive prompts: -> -> * **Choose Model Download Source**: -> -> ```bash -> Please select the model download source: (huggingface, modelscope) [huggingface]: -> ``` -> -> *It is recommended to choose `modelscope` as the source for a better download experience.* -> -> * **Select `MinerU` Version**: -> -> `MinerU1` uses a `pipeline` approach — slower but with lower GPU memory requirements. `MinerU2.5` uses a `vlm` (Vision-Language Model) approach — faster but requires more GPU memory. Users can choose the MinerU version based on their needs and download it locally. -> ```bash -> Please select the model type to download: (pipeline, vlm, all) [all]: -> ``` -> -> *It is recommended to choose the `vlm` (MinerU2) version for faster parsing. If you have strict GPU memory limitations or prefer the traditional pipeline approach, choose `pipeline` (MinerU1). You can also select `all` to download all available versions.* -> -> #### 3. Model Path Configuration -> -> The `mineru.json` configuration file will be automatically generated when you run the `mineru-models-download` command for the first time. 
After the download completes, the local path to the model will be displayed in the terminal and automatically written to your `mineru.json` file in your user directory for future use. -> -> #### 4. Environment Verification -> -> You can verify your setup using the simplest command-line call: -> +> +> During the download process, you will see the following interactive prompts: +> +> * **Select model download source**: +> +> ```bash +> Please select the model download source: (huggingface, modelscope) [huggingface]: +> ``` +> +> *It is recommended to select `modelscope` as the download source for a better download experience.* +> * **Select `MinerU` version**: +> +> `MinerU1` uses `pipeline`-based parsing, which is slower but has lower VRAM requirements. +> `MinerU2.5` uses `vlm`-based parsing, which is faster but has higher VRAM requirements. Users can freely select the desired MinerU parsing version as needed and download it locally. +> +> ```bash +> Please select the model type to download: (pipeline, vlm, all) [all]: +> ``` +> +> *It is recommended to select the `vlm` (MinerU2) version, as it provides faster parsing speed. If you have strict VRAM requirements or prefer traditional pipeline processing, you can select `pipeline` (MinerU1). You can also select `all` to download all available versions.* +> +> #### 3. Model path configuration +> +> The `mineru.json` configuration file will be automatically generated when you use the `mineru-models-download` command for the first time. After the model download is complete, its local path will be displayed in the current terminal window and automatically written into the `mineru.json` file in your user directory for convenient subsequent use. +> +> #### 4. 
MinerU environment verification +> +> The simplest command-line invocation method for environment verification: +> > ```bash > mineru -p -o -b --source local > ``` -> -> * ``: Local PDF/image file or directory (`./demo.pdf` or `./image_dir`) -> * ``: Output directory -> * ``: Backend engine of the MinerU version. For `MinerU2.5`, set `MinerU_Backend` to `"vlm-vllm-engine"`or`"vlm-transformers"`or`"vlm-http-client"`; for `MinerU1`, set it to `"pipeline"`. -> -> #### 5. Tool Usage -> -> The `FileOrURLToMarkdownConverter` operator allows you to choose the desired backend engine of MinerU. -> -> * If using `MinerU1`: set the `MinerU_Backend` parameter to `"pipeline"`, which uses the traditional pipeline approach. -> * If using `MinerU2.5` **(recommended by default)**: set the `MinerU_Backend` parameter to `"vlm-vllm-engine"`or`"vlm-transformers"`or`"vlm-http-client"` to enable the vision-language model engine. -> +> +> * ``: local PDF/image file or directory (`./demo.pdf` or `./image_dir`) +> * ``: output directory +> * ``: MinerU version selection interface. To use `MinerU2.5`, set the `MinerU_Backend` parameter to `"vlm-vllm-engine"` or `"vlm-transformers"` or `"vlm-http-client"`; to use `MinerU1`, set the `MinerU_Backend` parameter to `"pipeline"`. +> +> #### 5. Tool usage +> +> The `FileOrURLToMarkdownConverterLocal` operator provides a MinerU version selection interface, allowing users to select the appropriate backend engine according to their needs. +> +> * If the user uses `MinerU1`: set the `MinerU_Backend` parameter to `"pipeline"`. This will enable the traditional pipeline processing method. +> * If the user uses `MinerU2.5` **(default recommended)**: set the `MinerU_Backend` parameter to `"vlm-vllm-engine"` or `"vlm-transformers"` or `"vlm-http-client"`. This will enable the new engine based on a multimodal language model. 
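The version-to-backend choice described in the two bullets above can be sketched as a small lookup. This is an illustrative helper only: `BACKENDS_BY_VERSION` and `pick_backend` are hypothetical names that do not exist in DataFlow or MinerU, and the table covers only the backend strings listed in those bullets.

```python
from typing import Optional

# Backend strings copied from the bullets above; the mapping itself is
# a hypothetical illustration, not an API of DataFlow or MinerU.
BACKENDS_BY_VERSION = {
    "MinerU1": ("pipeline",),
    "MinerU2.5": ("vlm-vllm-engine", "vlm-transformers", "vlm-http-client"),
}


def pick_backend(version: str, preferred: Optional[str] = None) -> str:
    """Return a backend string valid for the given MinerU version."""
    if version not in BACKENDS_BY_VERSION:
        raise ValueError(f"unknown MinerU version: {version!r}")
    allowed = BACKENDS_BY_VERSION[version]
    if preferred is None:
        # Fall back to the first backend listed for this version.
        return allowed[0]
    if preferred not in allowed:
        raise ValueError(
            f"backend {preferred!r} is not listed for {version}; "
            f"expected one of {allowed}"
        )
    return preferred
```

A check like this would reject a mismatched pair (for example `"pipeline"` with `MinerU2.5`) before the operator is constructed.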
+> > ```python -> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter( +> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal( > intermediate_dir="../example_data/KBCleaningPipeline/raw/", -> lang="en", -> mineru_backend="vlm-sglang-engine", -> raw_file = raw_file, +> mineru_backend="vlm-auto-engine", +> mineru_model_path="/MinerU2.5-2509-1.2B", > ) > ``` -> -> 🌟 **More Info**: For detailed information about MinerU, please refer to its GitHub repository: [MinerU Official Documentation](https://github.com/opendatalab/MinerU) +> +> 🌟**More details**: For detailed information about MinerU, please refer to its GitHub repository: [MinerU official documentation](https://github.com/opendatalab/MinerU). - -**Input**: Original document files or URL (Using MinerU2) - ​**​Output​**: Extracted markdown text +**Input**: original document file or URL (using MinerU2) **Output**: extracted markdown text **Example**: ```python -self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch( - intermediate_dir="../example_data/KBCleaningPipeline/raw/", - lang="en", - mineru_backend="vlm-vllm-engine", +self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal( + intermediate_dir="intermediate", + mineru_backend="vlm-auto-engine", + mineru_source="local", + mineru_model_path="/MinerU2.5-2509-1.2B", + mineru_download_model_type="vlm" ) self.knowledge_cleaning_step1.run( storage=self.storage.step(), @@ -239,7 +271,7 @@ The following provides an example pipeline configured for the `Dataflow[vllm]` e ```python from dataflow.operators.knowledge_cleaning import ( KBCChunkGenerator, - FileOrURLToMarkdownConverterBatch, + FileOrURLToMarkdownConverterFlash, KBCTextCleaner, # KBCMultiHopQAGenerator, ) @@ -257,10 +289,13 @@ class KBCleaning_PDFvllm_GPUPipeline(): cache_type="json", ) - self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch( - intermediate_dir="../../example_data/KBCleaningPipeline/raw/", - lang="en", - 
mineru_backend="vlm-vllm-engine", + self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash( + intermediate_dir = "intermediate", + mineru_model_path = "/MinerU2.5-2509-1.2B", + batch_size = 8, + replicas = 2, + num_gpus_per_replica = 1, + engine_gpu_util_rate_to_ray_cap = 0.9 ) self.knowledge_cleaning_step2 = KBCChunkGenerator( diff --git a/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md similarity index 61% rename from docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md rename to docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md index ba3839ec6..40d4bb870 100644 --- a/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterBatch.md +++ b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterAPI.md @@ -1,17 +1,17 @@ --- -title: FileOrURLToMarkdownConverterBatch +title: FileOrURLToMarkdownConverterAPI createTime: 2025/10/09 17:09:04 -permalink: /zh/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterbatch/ +permalink: /zh/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterapi/ --- ## 📘 概述 -`FileOrURLToMarkdownConverterBatch` 是一个知识提取算子,它支持从多种文件格式(如PDF、Office文档、网页、纯文本)以及URL中提取结构化内容,并统一转换为标准的Markdown格式。算子能够自动识别文件类型并调用最优的解析引擎(如MinerU、trafilatura等)进行处理,保留原文的布局与核心信息。 +`FileOrURLToMarkdownConverterAPI` 是一个使用MinerU官方API进行知识提取的算子,它支持从多种文件格式(如PDF、Office文档、网页、纯文本)以及URL中提取结构化内容,并统一转换为标准的Markdown格式。算子能够自动识别文件类型并调用最优的解析引擎(如MinerU、trafilatura等)进行处理,保留原文的布局与核心信息。 ## __init__函数 ```python -def __init__(self, intermediate_dir: str = "intermediate", lang: str = "en", mineru_backend: str = "vlm-vllm-engine", ): +def __init__(self, intermediate_dir: str = "intermediate", mineru_backend: str = "vlm", api_key:str = None): ``` ### init参数说明 @@ -19,8 +19,8 @@ def 
__init__(self, intermediate_dir: str = "intermediate", lang: str = "en", min
 | Parameter | Type | Default | Description |
 | :--- | :--- | :--- | :--- |
 | **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. |
-| **lang** | str | "en" | Specifies the main language of the document (e.g., 'zh' for Chinese, 'en' for English) to optimize parsing quality. |
-| **mineru_backend** | str | "vlm-sglang-engine" | Sets the MinerU backend engine, used for handling complex documents such as PDFs. Options are "pipeline", "vlm-transformers", "vlm-vllm-engine", and "vlm-http-client". |
+| **api_key** | str | None | Specifies the API key used to access the external MinerU service. |
+| **mineru_backend** | str | "vlm" | Sets the MinerU backend engine, used for handling complex documents such as PDFs. Options are "pipeline", "vlm", and "MinerU-HTML". |
 
 ### Prompt Template Description
 
@@ -45,10 +45,10 @@ def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: s
 ## 🧠 Example Usage
 
 ```python
-self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch(
+self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterAPI(
     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en",
-    mineru_backend="vlm-vllm-engine",
+    api_key="your-api-key-here",
+    mineru_backend="vlm",
 )
 self.knowledge_cleaning_step1.run(
     storage=self.storage.step(),
diff --git a/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md
new file mode 100644
index 000000000..ed7a2008d
--- /dev/null
+++ b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterFlash.md
@@ -0,0 +1,97 @@
+---
+title: FileOrURLToMarkdownConverterFlash
+createTime: 2025/10/09 17:09:04
+permalink: /zh/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterflash/
+---
+
+## 📘 Overview
+
+`FileOrURLToMarkdownConverterFlash` is an operator that performs knowledge extraction locally with Flash-MinerU. It extracts structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and from URLs, and converts it into a unified Markdown format. The operator automatically detects the file type and invokes the most suitable parsing engine (such as MinerU or trafilatura), preserving the original layout and key information.
+
+## __init__ Function
+
+```python
+def __init__(
+    self,
+    intermediate_dir: str = "intermediate",
+    mineru_model_path=None,
+    batch_size:int = 4,
+    replicas:int = 1,
+    num_gpus_per_replica:float = 1,
+    engine_gpu_util_rate_to_ray_cap:float = 0.9
+):
+
+```
+
+### init Parameter Description
+
+| Parameter | Type | Default | Description |
+| :--- | :--- | :--- | :--- |
+| **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. |
+| **mineru_model_path** | str | None | Model path used by FlashMinerU (required; e.g., the MinerU2.5-xxx weights directory). |
+| **batch_size** | int | 4 | Batch size. |
+| **replicas** | int | 1 | Number of processes used for multi-process inference. |
+| **num_gpus_per_replica** | float | 1 | Number of GPUs used by each replica. |
+| **engine_gpu_util_rate_to_ray_cap** | float | 0.9 | Cap coefficient on Ray resource utilization (flash-mineru essentially uses Ray for multi-process inference). For example, 0.9 means Ray reserves 10% of the resources. Some resources must be left for Ray's management processes and to prevent OOM while keeping computation efficient, so this is usually set between 0.8 and 1.0. |
+
+### Prompt Template Description
+
+| Prompt Template Name | Main Purpose | Applicable Scenarios | Features |
+| --- | --- | --- | --- |
+|-- |-- |-- |-- |
+
+## run Function
+
+```python
+def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"):
+```
+
+#### Parameters
+
+| Name | Type | Default | Description |
+| :--- | :--- | :--- | :--- |
+| **storage** | DataFlowStorage | Required | Data flow storage instance, responsible for reading and writing data. |
+| **input_key** | str | "source" | Input column name; the column should contain the local file paths or URLs to process. |
+| **output_key** | str | "text_path" | Output column name; the column stores the paths of the generated Markdown files. |
+
+## 🧠 Example Usage
+
+```python
+self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash(
+    intermediate_dir = "intermediate",
+    mineru_model_path="/MinerU2.5-2509-1.2B",
+    batch_size = 4,
+    replicas = 2,
+    num_gpus_per_replica = 1,
+    engine_gpu_util_rate_to_ray_cap = 0.9
+)
+self.knowledge_cleaning_step1.run(
+    storage=self.storage.step(),
+    # input_key=,
+    # output_key=,
+)
+```
+
+#### 🧾 Default Output Format
+
+| Field | Type | Description |
+| :--- | :--- | :--- |
+| source | str | Path or URL of the input source file. |
+| text_path | str | Storage path of the generated Markdown file. |
+
+Example input:
+
+```json
+{
+"source":"/path/to/your/document.pdf"
+}
+```
+
+Example output:
+
+```json
+{
+"source":"/path/to/your/document.pdf",
+"text_path":"intermediate/document_pdf.md"
+}
+```
diff --git a/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md
new file mode 100644
index 000000000..45145a959
--- /dev/null
+++ b/docs/zh/notes/api/operators/knowledge_cleaning/generate/FileOrURLToMarkdownConverterLocal.md
@@ -0,0 +1,92 @@
+---
+title: FileOrURLToMarkdownConverterLocal
+createTime: 2025/10/09 17:09:04
+permalink: /zh/api/operators/knowledge_cleaning/generate/fileorurltomarkdownconverterlocal/
+---
+
+## 📘 Overview
+
+`FileOrURLToMarkdownConverterLocal` is an operator that performs knowledge extraction locally with a MinerU model. It extracts structured content from multiple file formats (e.g., PDF, Office documents, web pages, plain text) and from URLs, and converts it into a unified Markdown format. The operator automatically detects the file type and invokes the most suitable parsing engine (such as MinerU or trafilatura), preserving the original layout and key information.
+
+## __init__ Function
+
+```python
+def __init__(self,
+             intermediate_dir: str = "intermediate",
+             mineru_backend: str = "vlm-auto-engine",
+             mineru_source: str = "local",
+             mineru_model_path:str = None,
+             mineru_download_model_type:str = "vlm"
+             ):
+```
+
+### init Parameter Description
+
+| Parameter | Type | Default | Description |
+| :--- | :--- | :--- | :--- |
+| **intermediate_dir** | str | "intermediate" | Directory path used to store intermediate files generated during the conversion process. |
+| **mineru_backend** | str | "vlm-auto-engine" | Sets the MinerU backend engine, used for handling complex documents such as PDFs. Options are "pipeline", "vlm-sglang-engine", and "vlm-auto-engine". |
+| **mineru_source** | str | "local" | Sets the MinerU model source, corresponding to MINERU_MODEL_SOURCE. Options are "modelscope", "huggingface", and "local". |
+| **mineru_model_path** | str | None | Local model directory; must be used together with `mineru_source='local'`. |
+| **mineru_download_model_type** | str | "vlm" | Specifies the MinerU model download type. |
+
+### Prompt Template Description
+
+| Prompt Template Name | Main Purpose | Applicable Scenarios | Features |
+| --- | --- | --- | --- |
+|-- |-- |-- |-- |
+
+## run Function
+
+```python
+def run(self, storage: DataFlowStorage, input_key: str = "source", output_key: str = "text_path"):
+```
+
+#### Parameters
+
+| Name | Type | Default | Description |
+| :--- | :--- | :--- | :--- |
+| **storage** | DataFlowStorage | Required | Data flow storage instance, responsible for reading and writing data. |
+| **input_key** | str | "source" | Input column name; the column should contain the local file paths or URLs to process. |
+| **output_key** | str | "text_path" | Output column name; the column stores the paths of the generated Markdown files. |
+
+## 🧠 Example Usage
+
+```python
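+self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal(
+    intermediate_dir="../example_data/KBCleaningPipeline/raw/",
+    mineru_backend="vlm-auto-engine",
+    mineru_source="local",
+    mineru_model_path="/MinerU2.5-2509-1.2B",
+    mineru_download_model_type="vlm"
+)
+self.knowledge_cleaning_step1.run(
+    storage=self.storage.step(),
+    # input_key=,
+    # output_key=,
+)
+```
+
+#### 🧾 Default Output Format
+
+| Field | Type | Description |
+| :--- | :--- | :--- |
+| source | str | Path or URL of the input source file. |
+| text_path | str | Storage path of the generated Markdown file. |
+
+Example input:
+
+```json
+{
+"source":"/path/to/your/document.pdf"
+}
+```
+
+Example output:
+
+```json
+{
+"source":"/path/to/your/document.pdf",
+"text_path":"intermediate/document_pdf.md"
+}
+```
diff --git a/docs/zh/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md b/docs/zh/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md
index e6735b45d..ec22ea2b1 100644
--- a/docs/zh/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md
+++ b/docs/zh/notes/guide/domain_specific_operators/knowledgebase_QA_operators.md
@@ -21,7 +21,9 @@ permalink: /zh/guide/Knowledgebase_QA_operators/
 
 | Name | Applicable Type | Description | Official Repository or Paper |
 | --------------------- | :------- | ------------------------------------------------------------ | ------------------------------------------------------ |
-| FileOrURLToMarkdownConverterBatch🚀✨ | Knowledge extraction | This operator extracts heterogeneous text knowledge into Markdown format for downstream processing. | - |
+| FileOrURLToMarkdownConverterFlash🚀🚀✨ | Knowledge extraction | This operator extracts heterogeneous text knowledge into Markdown format for downstream processing. (Based on Flash-MinerU) | [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU) |
+| FileOrURLToMarkdownConverterAPI🚀✨ | Knowledge extraction | This operator extracts heterogeneous text knowledge into Markdown format for downstream processing. (Based on the official MinerU API) | [MinerU](https://github.com/opendatalab/MinerU) |
+| FileOrURLToMarkdownConverterLocal✨ | Knowledge extraction | This operator extracts heterogeneous text knowledge into Markdown format for downstream processing. (Based on MinerU) | [MinerU](https://github.com/opendatalab/MinerU) |
 | KBCChunkGenerator✨ | Corpus chunking | This operator provides several ways to split full text into appropriately sized chunks for downstream operations such as indexing. | - |
 | KBCTextCleaner🚀✨ | Knowledge cleaning | This operator uses an LLM to clean the organized raw text, including but not limited to normalization and removal of private information. | - |
 | Text2MultiHopQAGenerator🚀✨ | Knowledge paraphrasing | This operator uses a sliding window of three sentences to rewrite the cleaned knowledge base into a series of QA pairs that require multi-step reasoning, which helps RAG reason accurately. | [MIRIAD](https://github.com/eth-medical-ai-lab/MIRIAD) |
diff --git a/docs/zh/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md b/docs/zh/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md
index eb72b9393..f00ebbc5e 100644
--- a/docs/zh/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md
+++ b/docs/zh/notes/guide/pipelines/KnowledgeBaseCleaningPipeline.md
@@ -14,7 +14,7 @@ permalink: /zh/guide/kbcpipeline/
 
 The main stages of the pipeline are:
 
-1. Information extraction: use tools such as [MinerU](https://github.com/opendatalab/MinerU) and [trafilatura](https://github.com/adbar/trafilatura) to extract text from the raw documents.
+1. Information extraction: use tools such as [MinerU](https://github.com/opendatalab/MinerU), [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU), and [trafilatura](https://github.com/adbar/trafilatura) to extract text from the raw documents.
 2. Text chunking: use [chonkie](https://github.com/chonkie-inc/chonkie) to split the text into chunks, with support for token-, character-, and sentence-based chunking, among others.
 3. Knowledge cleaning: clean the raw text by removing redundant tags, fixing formatting errors, and masking private or non-compliant information, so the text is cleaner and more usable.
 4. QA construction: use a sliding window of three sentences to rewrite the cleaned knowledge base into a series of QA pairs that require multi-step reasoning, which helps RAG reason accurately.
@@ -23,7 +23,43 @@ permalink: /zh/guide/kbcpipeline/
 
 ### 1. Information Extraction
 
-The first step of the pipeline uses FileOrURLToMarkdownConverterBatch to extract text knowledge from the user's raw documents or URLs. This step is essential: it converts raw documents in various formats into unified Markdown text, which makes the later cleaning steps easier.
+The first step of the pipeline uses one of three operators, FileOrURLToMarkdownConverterFlash, FileOrURLToMarkdownConverterAPI, or FileOrURLToMarkdownConverterLocal, to extract text knowledge from the user's raw documents or URLs. This step is essential: it converts raw documents in various formats into unified Markdown text, which makes the later cleaning steps easier.
+
+#### 1.1 The FileOrURLToMarkdownConverterFlash Operator
+
+When the FileOrURLToMarkdownConverterFlash operator is used, PDF extraction is based on [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU), which requires installing the flash-mineru library. (flash-mineru builds on mineru to add multi-process inference acceleration and parses much faster than mineru; this operator is recommended for local parsing.)
+
+```shell
+pip install 'flash-mineru[vllm]'
+# or
+pip install 'open-dataflow[flash-mineru]'
+```
+
+You also need to download the pretrained MinerU model for local inference. You can follow the model download steps in the FileOrURLToMarkdownConverterLocal tutorial below, download it directly from Hugging Face ([MinerU model on Hugging Face](https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B)), or download it from ModelScope ([MinerU model on ModelScope](https://www.modelscope.ai/models/OpenDataLab/MinerU2.5-2509-1.2B)). Once the download finishes, set the model path in the FileOrURLToMarkdownConverterFlash operator.
+
+**Input**: raw document files or URLs (using Flash-MinerU) **Output**: extracted Markdown text
+
+**Example**:
+
+```python
+self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash(
+    intermediate_dir = "intermediate", # directory for intermediate artifacts produced during processing
+    mineru_model_path=None, # model path used by FlashMinerU (required; replace None with, e.g., the MinerU2.5-xxx weights directory)
+    batch_size = 4, # batch size
+    replicas = 2, # number of replicas running inference on PDFs
+    num_gpus_per_replica = 1, # number of GPUs used by each replica
+    engine_gpu_util_rate_to_ray_cap = 0.9 # cap coefficient on Ray resource utilization (flash-mineru essentially uses Ray for multi-process inference); e.g., 0.9 means Ray reserves 10% of the resources. Some resources must be left for Ray's management processes and to prevent OOM while keeping computation efficient, so this is usually set between 0.8 and 1.0.
+)
+self.knowledge_cleaning_step1.run(
+    storage=self.storage.step(),
+    # input_key=,
+    # output_key=,
+)
+```
+
+#### 1.2 The FileOrURLToMarkdownConverterLocal Operator
+
+When the FileOrURLToMarkdownConverterLocal operator is used, PDF extraction is based on [MinerU](https://github.com/opendatalab/MinerU) and requires additional configuration, which can be done as follows.
@@ -36,8 +72,6 @@ pip install -e .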
 pip install 'mineru[all]'
 ```
 
-PDF extraction in this system is based on [MinerU](https://github.com/opendatalab/MinerU) and requires additional configuration, which can be done as follows.
-
 > #### Using Local Models
 >
 > To run `MinerU` models locally, you first need to download them to local storage. `MinerU` provides an interactive command-line tool to simplify this process.
@@ -95,17 +129,16 @@ pip install 'mineru[all]'
 >
 > #### 5. Tool Usage
 >
-> The `FileOrURLToMarkdownConverter` operator provides an interface for choosing the MinerU version, letting users pick a suitable backend engine for their needs.
+> The `FileOrURLToMarkdownConverterLocal` operator provides an interface for choosing the MinerU version, letting users pick a suitable backend engine for their needs.
 >
 > * If you use `MinerU1`: set the `mineru_backend` parameter to `"pipeline"`. This enables the traditional pipeline processing mode.
 > * If you use `MinerU2.5` **(recommended default)**: set the `mineru_backend` parameter to `"vlm-vllm-engine"`, `"vlm-transformers"`, or `"vlm-http-client"`. This enables the new engine based on a multimodal language model.
 >
 > ```python
-> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverter(
+> self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal(
 >     intermediate_dir="../example_data/KBCleaningPipeline/raw/",
->     lang="en",
->     mineru_backend="vlm-sglang-engine",
->     raw_file = raw_file,
+>     mineru_backend="vlm-auto-engine",
+>     mineru_model_path="/MinerU2.5-2509-1.2B",
 > )
 > ```
 >
@@ -116,10 +149,12 @@ pip install 'mineru[all]'
 **Example**:
 
 ```python
-self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch(
-    intermediate_dir="../example_data/KBCleaningPipeline/raw/",
-    lang="en",
-    mineru_backend="vlm-vllm-engine",
+self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterLocal(
+    intermediate_dir="intermediate",
+    mineru_backend="vlm-auto-engine",
+    mineru_source="local",
+    mineru_model_path="/MinerU2.5-2509-1.2B",
+    mineru_download_model_type="vlm"
 )
 self.knowledge_cleaning_step1.run(
     storage=self.storage.step(),
@@ -128,6 +163,7 @@ self.knowledge_cleaning_step1.run(
 )
 ```
 
+
 ### 2. Text Chunking
 
 After the documents are extracted, the text chunking step (KBCChunkGenerator) splits the extracted long text into chunks. The system supports chunking by token, character, sentence, and semantic dimensions.
@@ -238,7 +274,7 @@ pip install "numpy>=1.24,<2.0.0"
 ```python
 from dataflow.operators.knowledge_cleaning import (
     KBCChunkGenerator,
-    FileOrURLToMarkdownConverterBatch,
+    FileOrURLToMarkdownConverterFlash,
     KBCTextCleaner,
     # KBCMultiHopQAGenerator,
 )
@@ -256,10 +292,13 @@ class KBCleaning_PDFvllm_GPUPipeline():
             cache_type="json",
         )
 
-        self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch(
-            intermediate_dir="../../example_data/KBCleaningPipeline/raw/",
-            lang="en",
-            mineru_backend="vlm-vllm-engine",
+        self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterFlash(
+            intermediate_dir = "intermediate",
+            mineru_model_path = "/MinerU2.5-2509-1.2B",
+            batch_size = 8,
+            replicas = 2,
+            num_gpus_per_replica = 1,
+            engine_gpu_util_rate_to_ray_cap = 0.9
         )
 
         self.knowledge_cleaning_step2 = KBCChunkGenerator(