Optimized PDF extraction function #15572
Open
+7,952
−119
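Introduction: Improves PDF document extraction in the Dify knowledge base. Adds support for extracting image-only PDFs and for recognizing images, tables, and formulas inside PDFs.
Usage:
This PR uses the OpenDataLab open-source project to improve PDF document recognition in Dify's knowledge base.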
https://github.com/opendatalab/PDF-Extract-Kit
Local deployment
Download the model files into the api folder:
git clone https://www.modelscope.cn/opendatalab/pdf-extract-kit-1.0.git
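If you do not want to download the models into the api folder, modify the pdf_extractor_config.yaml file.
For everything else, follow the official documentation to install.
Docker deployment
Enter the docker directory and build the image: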
docker compose build
Then edit the docker-compose.yaml file to switch to the Docker image you just built:
services:
  api:
    image: docker-api
  worker:
    image: docker-api
Run the command:
docker compose up -d
Wait about 5 minutes.
Then open 127.0.0.1 in a browser to start using it.
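Instead of waiting a fixed five minutes, the startup wait could be replaced by polling until the service responds. The following is only a minimal sketch: the helper name `wait_until` is hypothetical and not part of this PR, and it assumes a POSIX shell.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper (not part of this PR): re-runs CMD until it succeeds
# or MAX_ATTEMPTS is exhausted. Returns 0 on success, 1 on timeout.
wait_until() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_until 60 5 curl -fsS http://127.0.0.1/` would retry every 5 seconds for up to about 5 minutes, assuming `curl` is installed and the web front end is served on port 80 of the local machine.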