Library.run_ocr_on_images(add_to_library=True) populates 'text_search' only, leaving 'text_block' empty, whereas Query.query (semantic type) retrieves data from 'text_block' #1123

@wissamharoun

Description

environment
llmware v0.3.8
macos 15
active db: sqlite
vector db: chromadb
To illustrate the issue, I am using the example file slicing_and_dicing_office_docs.py and the Microsoft Investor Relations data. However, the issue was first discovered on our private data, which is very OCR-heavy.

issue:
run lib.add_files()

to ingest documents from which the C parser extracts images, pending downstream OCR.
next, perform the OCR on the images extracted to the image directory with llmware's "convenience" method,
lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection written to the db, one entry per image, each referencing its originating doc by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing. The text chunks extracted by tesseract OCR populate only 'text_search'.
next, create a new embedding with llmware's
lib.install_new_embedding(params)
The chunks/sentences for embedding are retrieved from 'text_search' and collated into batches.
so far so good
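
To make the state after OCR concrete, here is a minimal sketch using only Python's standard-library sqlite3. It does not touch llmware or its real schema: the table name `blocks` and both helper functions are hypothetical, and only the field names 'doc_ID', 'block_ID', 'text_block' and 'text_search' come from the description above. It builds a toy table in the described state and lists the blocks whose 'text_search' has content while 'text_block' is empty.

```python
import sqlite3


def build_toy_blocks_table():
    """Create an in-memory table mimicking the described state (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocks ("
        "doc_ID INTEGER, block_ID INTEGER, text_block TEXT, text_search TEXT)"
    )
    rows = [
        # A block from the regular parser: both text fields are populated.
        (1, 0, "Revenue grew year over year.", "Revenue grew year over year."),
        # Blocks added by OCR: block_ID starts at 100000, only text_search is filled.
        (1, 100000, "", "Text recognized from image one."),
        (1, 100001, "", "Text recognized from image two."),
    ]
    conn.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_blocks_missing_text_block(conn):
    """Return (doc_ID, block_ID) pairs with content in text_search but none in text_block."""
    cursor = conn.execute(
        "SELECT doc_ID, block_ID FROM blocks "
        "WHERE COALESCE(text_block, '') = '' "
        "AND COALESCE(text_search, '') <> '' "
        "ORDER BY doc_ID, block_ID"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_toy_blocks_table()
    print(find_blocks_missing_text_block(connection))
```

A similar check against the real active db would need the actual table and column layout, which should be confirmed against the installed llmware version before relying on it.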

at Query time:
Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)

returns results where 'text' is empty! A little digging reveals that the query text is indeed compared against bona fide embedded chunks, but the 'text' in the returned results is read from 'text_block', which remains empty after OCR.
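
One possible stopgap on the caller's side, sketched below under stated assumptions, is to patch the returned results: whenever 'text' is empty, look up the stored 'text_search' value for the same block. This is not llmware's API or an official fix. The `blocks` table, the `fill_empty_text` and `demo` helpers, and the presence of a 'block_ID' key in each result dict are assumptions made for illustration.

```python
import sqlite3


def fill_empty_text(results, conn):
    """Return copies of the results with empty 'text' replaced by the stored text_search."""
    patched = []
    for result in results:
        fixed = dict(result)  # copy so the caller's dicts are not mutated
        if not fixed.get("text"):
            row = conn.execute(
                "SELECT text_search FROM blocks WHERE doc_ID = ? AND block_ID = ?",
                (fixed["doc_ID"], fixed["block_ID"]),
            ).fetchone()
            if row and row[0]:
                fixed["text"] = row[0]
        patched.append(fixed)
    return patched


def demo():
    """Run the fallback against a toy in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocks ("
        "doc_ID INTEGER, block_ID INTEGER, text_block TEXT, text_search TEXT)"
    )
    conn.executemany(
        "INSERT INTO blocks VALUES (?, ?, ?, ?)",
        [
            (1, 0, "Parsed text.", "Parsed text."),
            (1, 100000, "", "Text recognized from image one."),
        ],
    )
    # Simulated semantic query results: the OCR block comes back with empty 'text'.
    results = [
        {"doc_ID": 1, "block_ID": 0, "text": "Parsed text."},
        {"doc_ID": 1, "block_ID": 100000, "text": ""},
    ]
    return fill_empty_text(results, conn)


if __name__ == "__main__":
    for item in demo():
        print(item)
```

This only hides the symptom for the caller. The underlying mismatch between the field written by OCR and the field read at query time is still there.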

the following images show this clearly...

Screenshot 2024-12-01 at 19 02 43

Screenshot 2024-12-02 at 17 25 10

Screenshot 2024-12-01 at 19 01 45
