feat: add vqa pipeline #69
Merged
25 commits
ee2e59a docs: replace vqa_demo.json (ChenZiHong-Gavin)
90f3a72 fix: support content type for input data (ChenZiHong-Gavin)
12ee557 feat: filter non-exist content (ChenZiHong-Gavin)
5cee7f2 Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in… (ChenZiHong-Gavin)
341231a docs: add test data (ChenZiHong-Gavin)
cbbd2ae refactor: turn log level to DEBUG when extracting KG (ChenZiHong-Gavin)
b854079 refactor: turn log level to DEBUG when extracting KG (ChenZiHong-Gavin)
7c66cd7 feat: add support for multi-modal chunk (ChenZiHong-Gavin)
30c43db Update graphgen/models/reader/jsonl_reader.py (ChenZiHong-Gavin)
7c71be7 fix: DEBUG log level for FileHandler & INFO log level for RichHandler (ChenZiHong-Gavin)
5a624ac Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc… (ChenZiHong-Gavin)
8042f03 fix: fix language check (ChenZiHong-Gavin)
6b0c8a3 Update graphgen/models/reader/json_reader.py (ChenZiHong-Gavin)
16c0d85 feat: add mm_kg_builder (ChenZiHong-Gavin)
8b05bb3 feat: add anchor_bfs_partitioner (ChenZiHong-Gavin)
4df2948 fix: fix language check (ChenZiHong-Gavin)
6fa1537 feat: add vqa_generator (ChenZiHong-Gavin)
c8c6979 Update graphgen/models/reader/csv_reader.py (ChenZiHong-Gavin)
3ee98a9 Update graphgen/models/partitioner/anchor_bfs_partitioner.py (ChenZiHong-Gavin)
22aae9a feat: add vqa_generator (ChenZiHong-Gavin)
122cd4c Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc… (ChenZiHong-Gavin)
aa87906 fix: fix aggregated template (ChenZiHong-Gavin)
d5bbdcb Update graphgen/operators/partition/partition_kg.py (ChenZiHong-Gavin)
ef2e109 fix: fix fetching img_path in vqa_generator (ChenZiHong-Gavin)
b2db994 Merge branch 'feature/vqa-pipeline' of https://github.com/open-scienc… (ChenZiHong-Gavin)
    @@ -1,6 +1,9 @@
    import os
    from abc import ABC, abstractmethod
    from typing import Any, Dict, List

    import requests


    class BaseReader(ABC):
        """
    @@ -18,3 +21,45 @@ def read(self, file_path: str) -> List[Dict[str, Any]]:
            :param file_path: Path to the input file.
            :return: List of dictionaries containing the data.
            """

        @staticmethod
        def filter(data: List[dict]) -> List[dict]:
            """
            Filter out entries with empty or missing text in the specified column.

            :param data: List of dictionaries containing the data.
            :return: Filtered list of dictionaries.
            """

            def _image_exists(path_or_url: str, timeout: int = 3) -> bool:
                """
                Check if an image exists at the given local path or URL.
                :param path_or_url: Local file path or remote URL of the image.
                :param timeout: Timeout for remote URL requests in seconds.
                :return: True if the image exists, False otherwise.
                """
                if not path_or_url:
                    return False
                if not path_or_url.startswith(("http://", "https://", "ftp://")):
                    path = path_or_url.replace("file://", "", 1)
                    path = os.path.abspath(path)
                    return os.path.isfile(path)
                try:
                    resp = requests.head(path_or_url, allow_redirects=True, timeout=timeout)
                    return resp.status_code == 200
                except requests.RequestException:
                    return False

            filtered_data = []
            for item in data:
                if item.get("type") == "text":
                    content = item.get("content", "").strip()
                    if content:
                        filtered_data.append(item)
                elif item.get("type") in ("image", "table", "equation"):
                    img_path = item.get("img_path")
                    if _image_exists(img_path):
                        filtered_data.append(item)
                else:
                    filtered_data.append(item)
            return filtered_data
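To illustrate which records the new filter keeps and which it drops, here is a minimal standalone sketch. It is not the PR's code: it covers only the local-path branch so that it runs offline (the `requests.head` branch is omitted), and the helper names `local_image_exists`, `filter_items`, and `demo`, as well as the sample records, are hypothetical.

```python
import os
import tempfile
from typing import Any, Dict, List, Optional


def local_image_exists(path: Optional[str]) -> bool:
    """Local-path half of the existence check from the diff (no network access)."""
    if not path:
        return False
    path = path.replace("file://", "", 1)
    return os.path.isfile(os.path.abspath(path))


def filter_items(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply the same keep/drop rules as the diff, restricted to local files."""
    kept = []
    for item in data:
        item_type = item.get("type")
        if item_type == "text":
            # Text items survive only if their content is non-blank.
            if item.get("content", "").strip():
                kept.append(item)
        elif item_type in ("image", "table", "equation"):
            # Visual items survive only if the referenced file exists.
            if local_image_exists(item.get("img_path")):
                kept.append(item)
        else:
            # Any other type passes through unchanged.
            kept.append(item)
    return kept


def demo() -> List[str]:
    """Run the filter on hypothetical records and return the ids that were kept."""
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        real_image = tmp.name
    try:
        records = [
            {"id": "a", "type": "text", "content": "hello"},
            {"id": "b", "type": "text", "content": "   "},
            {"id": "c", "type": "image", "img_path": real_image},
            {"id": "d", "type": "image", "img_path": "/no/such/file.png"},
            {"id": "e", "type": "audio"},
        ]
        return [record["id"] for record in filter_items(records)]
    finally:
        os.remove(real_image)


if __name__ == "__main__":
    print(demo())
```

In this sketch the blank text item and the image with a missing file are dropped, while the item with an unrecognized type is kept, matching the `else` branch in the diff.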
    @@ -1,22 +1,18 @@
     read:
    -  input_file: resources/input_examples/pdf_demo.pdf # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
    +  input_file: resources/input_examples/vqa_demo.json # input file path, support json, jsonl, txt, pdf. See resources/input_examples for examples
     split:
       chunk_size: 1024 # chunk size for text splitting
       chunk_overlap: 100 # chunk overlap for text splitting
     search: # web search configuration
       enabled: false # whether to enable web search
       search_types: ["google"] # search engine types, support: google, bing, uniprot, wikipedia
     quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
    -  enabled: true
    -  quiz_samples: 2 # number of quiz samples to generate
    -  re_judge: false # whether to re-judge the existing quiz samples
    +  enabled: false
     partition: # graph partition configuration
    -  method: ece # ece is a custom partition method based on comprehension loss
    +  method: anchor_bfs # partition method
       method_params:
    -    max_units_per_community: 20 # max nodes and edges per community
    -    min_units_per_community: 5 # min nodes and edges per community
    -    max_tokens_per_community: 10240 # max tokens per community
    -    unit_sampling: max_loss # unit sampling strategy, support: random, max_loss, min_loss
    +    anchor_type: image # node type to select anchor nodes
    +    max_units_per_community: 10 # atomic partition, one node or edge per community
     generate:
       mode: vqa # atomic, aggregated, multi_hop, cot, vqa
       data_format: ChatML # Alpaca, Sharegpt, ChatML
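The partitioner's implementation is not part of the diff shown here, so the following is only an illustrative guess at what an anchor-based BFS partition could look like, written to make the `anchor_type` and `max_units_per_community` parameters concrete. The function name, the graph representation (a node-type dict plus an adjacency dict), and the rule that each added neighbor costs two units (one node, one edge) are all assumptions, not GraphGen's actual code.

```python
from collections import deque
from typing import Any, Dict, List, Set, Tuple

Edge = Tuple[str, str]


def anchor_bfs_partition(
    node_types: Dict[str, str],
    adjacency: Dict[str, List[str]],
    anchor_type: str = "image",
    max_units_per_community: int = 10,
) -> List[Dict[str, Any]]:
    """Grow one community per anchor node by breadth-first search.

    Assumption: a "unit" is one node or one edge, and only the edges
    that BFS actually traverses are recorded in the community.
    """
    communities: List[Dict[str, Any]] = []
    for anchor, node_type in node_types.items():
        if node_type != anchor_type:
            continue
        nodes: List[str] = [anchor]
        edges: List[Edge] = []
        seen: Set[str] = {anchor}
        queue = deque([anchor])
        units = 1  # the anchor itself
        while queue and units < max_units_per_community:
            current = queue.popleft()
            for neighbor in adjacency.get(current, []):
                if neighbor in seen:
                    continue
                # A new neighbor costs two units: the node and the edge reaching it.
                if units + 2 > max_units_per_community:
                    break
                seen.add(neighbor)
                nodes.append(neighbor)
                edges.append((current, neighbor))
                queue.append(neighbor)
                units += 2
        communities.append({"anchor": anchor, "nodes": nodes, "edges": edges})
    return communities


# Hypothetical toy graph: one image node linked to a few entity nodes.
node_types = {"img1": "image", "cat": "entity", "mat": "entity", "room": "entity"}
adjacency = {
    "img1": ["cat", "mat"],
    "cat": ["img1", "room"],
    "mat": ["img1"],
    "room": ["cat"],
}
small = anchor_bfs_partition(node_types, adjacency, max_units_per_community=5)
large = anchor_bfs_partition(node_types, adjacency, max_units_per_community=10)
```

With a budget of 5 units the community stops at the anchor's direct neighbors; with 10 it also reaches `room` through `cat`. Anchoring on image nodes means every community contains an image, which is presumably what a VQA generator needs.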
The `_image_exists` function makes a network request for every URL with a 3-second timeout. For documents with many images, this could significantly slow down the filtering process. Consider implementing caching or batch validation to improve performance.
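One way to act on the caching suggestion is to memoize the existence check so that each distinct path or URL is probed at most once per process. The sketch below uses `functools.lru_cache` from the standard library; `make_cached_checker` and `fake_check` are hypothetical names, and `fake_check` stands in for the real check so that the example runs without network access.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_checker(
    check: Callable[[str], bool], maxsize: int = 1024
) -> Callable[[str], bool]:
    """Wrap an existence check so repeated paths or URLs reuse the first result."""
    return lru_cache(maxsize=maxsize)(check)


calls: List[str] = []


def fake_check(path_or_url: str) -> bool:
    """Stand-in for the real check: records each probe instead of using the network."""
    calls.append(path_or_url)
    return path_or_url.endswith(".png")


cached_check = make_cached_checker(fake_check)

urls = [
    "https://example.com/a.png",
    "https://example.com/a.png",  # duplicate: served from the cache
    "https://example.com/b.txt",
]
results = [cached_check(url) for url in urls]
```

Here the duplicate URL triggers no second probe. One trade-off to keep in mind: negative results are cached too, so a transient network failure would be remembered for the lifetime of the cache unless it is cleared (the wrapper returned by `lru_cache` exposes `cache_clear()` for that).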