Large Document Processing #1590
Replies: 1 comment 1 reply
-
I would suggest that you a) use a different default embedding model that allows larger chunk sizes, b) use a different LLM with a larger context window, and/or c) look at the ingest module to see whether you can change the default naive RAG approach and consider sentence/hierarchical contexts. You shouldn't strictly need any of this, because llamaindex (under the covers) should be able to manage a 350-page document quite well. That said, this is naive RAG, and you only get so much bang for your buck. Looking back at your original question, counting the number of $s is a very difficult task for a RAG system, which typically looks for semantic matches. If you use an LLM with a very large context (e.g. 64k or 128k tokens), you might have better luck, but this isn't the kind of question you should expect good accuracy on.
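To illustrate that last point: a purely mechanical question like "how many $ signs are there?" doesn't need retrieval or generation at all, and a few lines of Python will answer it exactly. This is just a sketch, not a PrivateGPT feature; it assumes the third-party pypdf package and the PDF attached to this thread.

```python
# Sketch: count dollar signs deterministically instead of asking a RAG pipeline.
# Assumes: pip install pypdf, and the attachment from this thread is on disk.
from pypdf import PdfReader

reader = PdfReader("emergency_national_security_supplemental_bill_text.pdf")

total = 0
pages_with_dollar = 0
for page in reader.pages:
    text = page.extract_text() or ""  # extraction can return None/empty for image-only pages
    count = text.count("$")
    total += count
    if count:
        pages_with_dollar += 1

print(f"{total} dollar signs across {pages_with_dollar} pages")
```

You could then hand the exact count (or the extracted amounts themselves) to the LLM and reserve the model for the genuinely semantic part of the task, such as describing who each appropriation benefits.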
-
I downloaded privateGPT because of its ability to process large documents/text.
My goal is to upload a 370 page congressional spending bill, and have privateGPT list the number of times a dollar sign ($) is used.
After this, I hope to find the right prompt to get it to analyze the entire document and create a list of every benefactor in the bill, paired with the dollar amount proposed.
At this point, I'm unable to get an accurate answer to the question "How many dollar signs are in this document?"
For reference, there are 152 instances across 56 pages. However, each time I ask this question, the response is always less than 34.
Initially, I uploaded the document as a .pdf.
After that failed to work, I converted the .pdf to .txt, copied the entire document, and pasted it into the prompt box.
I also included "How many instances of '$'" in the Additional Inputs System Prompt section.
This particular example (and every other congressional spending bill) has every line on every page numbered for future reference. PrivateGPT is having a hard time differentiating these reference numbers from the content of the text (see the cleanup sketch after the attachments below).
Even so, it should be able to count the number of "$" characters.
Am I expecting too much of this system?
Any thoughts?
Thank you,
Shawboy167
emergency_national_security_supplemental_bill_text.pdf
emergency_national_security_supplemental_bill_text.txt
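Regarding the per-line reference numbers mentioned above, one option is to strip them from the .txt before ingesting it, so they don't end up mixed into the chunks the model sees. This is a hypothetical preprocessing step, not something PrivateGPT does itself; the filename comes from the attachment above, and the regex is an assumption about how the bill's line numbering is laid out.

```python
# Hypothetical cleanup step: remove leading line-reference numbers (bills
# typically number lines 1-25 on each page) before ingestion.
import re

with open("emergency_national_security_supplemental_bill_text.txt", encoding="utf-8") as f:
    lines = f.readlines()

# Assumption: each numbered line starts with a 1-2 digit number and whitespace.
cleaned = [re.sub(r"^\s*\d{1,2}\s+", "", line) for line in lines]

with open("bill_text_cleaned.txt", "w", encoding="utf-8") as f:
    f.writelines(cleaned)
```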