diff --git a/docs/en/notes/api/operators/pdf2vqa/generate/LLMOutputParser.md b/docs/en/notes/api/operators/pdf2vqa/generate/LLMOutputParser.md
index 249001a00..68662c931 100644
--- a/docs/en/notes/api/operators/pdf2vqa/generate/LLMOutputParser.md
+++ b/docs/en/notes/api/operators/pdf2vqa/generate/LLMOutputParser.md
@@ -1,7 +1,7 @@
 ---
 title: LLMOutputParser
 createTime: 2026/01/20 20:15:00
-permalink: /en/api/operators/core_text/parse/llmoutputparser/
+permalink: /en/api/operators/pdf2vqa/generate/llmoutputparser/
 ---
 
 ## 📘 Overview
@@ -16,8 +16,7 @@ The core functionalities of this operator include:
 ## `__init__` Function
 
 ```python
-def __init__(self,
-             mode: Literal['question', 'answer'],
+def __init__(self,
              output_dir: str,
              intermediate_dir: str = "intermediate"
              )
@@ -28,7 +27,6 @@ def __init__(self,
 
 | Parameter | Type | Default | Description |
 | --- | --- | --- | --- |
-| **mode** | str | Required | Parsing mode. Options are `'question'` or `'answer'`, which affects the output filename and the image subdirectory name. |
 | **output_dir** | str | Required | The final root directory for structured data and images. |
 | **intermediate_dir** | str | "intermediate" | The intermediate directory where original image resources processed by MinerU are located. |
 
@@ -76,7 +74,7 @@ Suppose the LLM returns: `1, 3`
 
 The operator looks up entries with `id` 1 and 3 in the layout JSON:
 * If `id: 1` is the text "What is AI?" and `id: 3` is the image `path/to/img.png`.
-* The restored content will be: `What is AI?\n![image](images/img.png)`.
+* The restored content will be: `What is AI?\n![image](vqa_images/img.png)`.
 
 ### 2. Output File Structure
 
@@ -86,7 +84,7 @@ After execution, the directory structure under `output_dir` (referenced as `cach
 output_dir/
 └── {name}/
     ├── extracted_questions.jsonl   # Structured data
-    └── question_images/            # Automatically synchronized images
+    └── vqa_images/                 # Automatically synchronized images
         ├── img1.png
         └── ...
@@ -96,7 +94,7 @@ output_dir/
 
 ```json
 {
-  "question": "Please analyze the image below:\n![image](question_images/fig1.png)",
+  "question": "Please analyze the image below:\n![image](vqa_images/img1.png)",
   "answer": "This is the parsed answer text.",
   "solution": "Detailed step-by-step solution...",
   "label": "1",
diff --git a/docs/en/notes/api/operators/pdf2vqa/generate/MineruToLLMInputOperator.md b/docs/en/notes/api/operators/pdf2vqa/generate/MineruToLLMInputOperator.md
index 6d056fe5a..523dbafee 100644
--- a/docs/en/notes/api/operators/pdf2vqa/generate/MineruToLLMInputOperator.md
+++ b/docs/en/notes/api/operators/pdf2vqa/generate/MineruToLLMInputOperator.md
@@ -1,7 +1,7 @@
 ---
 title: MinerU2LLMInputOperator
 createTime: 2026/01/20 20:10:00
-permalink: /en/api/operators/core_text/convert/mineru2llminputoperator/
+permalink: /en/api/operators/pdf2vqa/generate/mineru2llminputoperator/
 ---
 
 ## 📘 Overview
diff --git a/docs/en/notes/api/operators/pdf2vqa/generate/QAMerger.md b/docs/en/notes/api/operators/pdf2vqa/generate/QAMerger.md
index 3b7a70e01..6dfadb0d4 100644
--- a/docs/en/notes/api/operators/pdf2vqa/generate/QAMerger.md
+++ b/docs/en/notes/api/operators/pdf2vqa/generate/QAMerger.md
@@ -1,7 +1,7 @@
 ---
 title: QA_Merger
 createTime: 2026/01/20 20:25:00
-permalink: /en/api/operators/core_text/merge/qamerger/
+permalink: /en/api/operators/pdf2vqa/generate/qamerger/
 ---
 
 ## 📘 Overview
diff --git a/docs/en/notes/guide/quickstart/PDFVQAExtract.md b/docs/en/notes/guide/quickstart/PDFVQAExtract.md
index cedfea0f7..c80527992 100644
--- a/docs/en/notes/guide/quickstart/PDFVQAExtract.md
+++ b/docs/en/notes/guide/quickstart/PDFVQAExtract.md
@@ -22,7 +22,7 @@ Major stages:
 
 ## 2. Quick Start
 
-### Step 1: Install Dataflow (and MinerU)
+### Step 1: Install Dataflow
 Install Dataflow:
 ```shell
 pip install "open-dataflow[pdf2vqa]"
@@ -35,12 +35,6 @@ cd Dataflow
 pip install -e ".[pdf2vqa]"
 ```
 
-Then install MinerU and download models:
-```shell
-pip install "mineru[vllm]>=2.5.0,<2.7.0"
-mineru-models-download
-```
-
 ### Step 2: Create a workspace
 ```shell
 cd /your/working/directory
@@ -55,13 +49,18 @@ dataflow init
 You can then add your pipeline script under `pipelines/` or any custom path.
 
 ### Step 4: Configure API credentials
+`DF_API_KEY` is for calling LLM API, and `MINERU_API_KEY` is for calling MinerU for layout analysis.
+`MINERU_API_KEY` can be obtained from https://mineru.net/apiManage/token, and `DF_API_KEY` can be obtained from your LLM provider (e.g., OpenAI, Google Gemini, etc.). Set them as environment variables:
+
 Linux / macOS:
 ```shell
 export DF_API_KEY="sk-xxxxx"
+export MINERU_API_KEY="sk2-xxxxx"
 ```
 Windows PowerShell:
 ```powershell
 $env:DF_API_KEY = "sk-xxxxx"
+$env:MINERU_API_KEY = "sk2-xxxxx"
 ```
 In the pipeline script, set your API endpoint:
 ```python
@@ -72,12 +71,7 @@ self.llm_serving = APILLMServing_request(
     max_workers=100,
 )
 ```
-and set MinerU backend ('vlm-vllm-engine' or 'vlm-transformers') and LLM max token length (recommended not to exceed 128000 to avoid LLM forgetting details).
-**Caution: The pipeline was only tested with the `vlm` backend; compatibility with the `pipeline` backend is uncertain due to format differences. Using the `vlm` backend is recommended.**
-The `vlm-vllm-engine` backend requires GPU support.
-```python
-self.mineru_executor = FileOrURLToMarkdownConverterBatch(intermediate_dir = "intermediate", mineru_backend="vlm-vllm-engine")
-```
+and set LLM max token length (recommended not to exceed 128000 to avoid LLM forgetting details).
 
 ```python
 self.vqa_extractor = ChunkedPromptedGenerator(
@@ -97,16 +91,12 @@ You can also import the operators into other workflows; the remainder of this do
 
 ### 1. Input data
 
-Each job is defined by a JSONL row. Two modes are supported:
+Each job is defined by a JSONL row. `input_pdf_paths` can be a single PDF or a list of PDFs (questions appear before answers). `name` is an identifier for the job. Questions and answers can be interleaved or separated; they can come from the same PDF or different PDFs.
 
-- **QA-Separated PDFs**
-  ```jsonl
-  {"question_pdf_path": "/abs/path/questions.pdf", "answer_pdf_path": "/abs/path/answers.pdf", "subject": "math", "output_dir": "./output/math"}
-  ```
-- **QA-Interleaved PDFs**
-  ```jsonl
-  {"question_pdf_path": "/abs/path/qa.pdf", "answer_pdf_path": "/abs/path/qa.pdf", "name": "math2"}
-  ```
+```jsonl
+{"input_pdf_paths": "./example_data/PDF2VQAPipeline/questionextract_test.pdf", "name": "math1"}
+{"input_pdf_paths": ["./example_data/PDF2VQAPipeline/math_question.pdf", "./example_data/PDF2VQAPipeline/math_answer.pdf"], "name": "math2"}
+```
 
 `FileStorage` handles batching/cache management:
 ```python
@@ -120,15 +110,48 @@ self.storage = FileStorage(
 
 ### 2. Document layout extraction (MinerU)
 
-For each PDF (question, answer, or mixed), the pipeline calls `_parse_file_with_mineru` inside `FileOrURLToMarkdownConverterBatch`. MinerU outputs:
+For each PDF (question, answer, or mixed), the pipeline calls `_parse_file_with_mineru` inside `FileOrURLToMarkdownConverterAPI`. MinerU outputs:
 
-- `//_content_list.json`: structured layout tokens (texts, figures, tables, IDs)
-- `//images/`: cropped page images
+- `*_content_list.json`: structured layout tokens (texts, figures, tables, IDs)
+- `images/`: cropped page images
 
-The backend can be:
+---
+**Note**:
+If you want to use a locally deployed MinerU model, you can replace the operator with `FileOrURLToMarkdownConverterLocal` (original version from opendatalab) or `FileOrURLToMarkdownConverterFlash` (our accelerated version), and provide the corresponding model path and deployment parameters.
 
-- `vlm-transformers`: CPU/GPU compatible
-- `vlm-vllm-engine`: high-throughput GPU mode (requires CUDA)
+For example:
+
+```python
+self.mineru_executor = FileOrURLToMarkdownConverterAPI(intermediate_dir = "intermediate")
+```
+
+can be replaced with
+
+```python
+self.mineru_executor = FileOrURLToMarkdownConverterLocal(
+    intermediate_dir = "intermediate",
+    mineru_model_path = "path/to/mineru/model",
+)
+```
+
+or
+
+```python
+self.mineru_executor = FileOrURLToMarkdownConverterFlash(
+    intermediate_dir = "intermediate",
+    mineru_model_path = "path/to/mineru/model",
+    batch_size = 4,
+    replicas = 1,
+    num_gpus_per_replica = 1,
+    engine_gpu_util_rate_to_ray_cap = 0.9
+)
+```
+
+You can refer to https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/knowledge_cleaning/generate/mineru_operators.py for specific parameters and usage.
+
+---
+
+Afterwards, the `MinerU2LLMInputOperator` flattens list items and re-indexes them to create LLM-friendly input.
 
 ### 3. QA extraction (VQAExtractor)
 
@@ -136,7 +159,7 @@ The backend can be:
 
 - Grouping and pairing Q&A based, and inserting images to proper positions.
 - Supports QA separated or interleaved PDFs.
-- Copies rendered images into `output_dir/question_images` and/or `answer_images`.
+- Copies rendered images into `cache_path/name/vqa_images`.
 - Parses ``, ``, ``, ``, ``, `