Large Document Processing #1590
Replies: 1 comment 1 reply
-
I would suggest that you a) use a different default embedding model that allows larger chunk sizes, b) use a different LLM with a larger context window, and/or c) look at the ingest module to see whether you can change the default naive RAG approach and consider sentence/hierarchical contexts. You shouldn't strictly need any of this, because llamaindex (under the covers) should be able to manage a 350-page document quite well. That said, this is naive RAG, and you only get so much bang for your buck. Looking back at your original question, counting the number of $s is a very difficult task for a RAG system, which typically looks for semantic matches. If you use an LLM with a very large context (e.g. 64k or 128k tokens), you might have better luck, but this isn't the kind of question you should expect good accuracy on.
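To illustrate that last point: a purely mechanical question like "how many $ signs are there?" doesn't need retrieval or generation at all, and a few lines of Python will answer it exactly. This is just a sketch, not a PrivateGPT feature; it assumes the third-party pypdf package and the PDF attached to this thread.

```python
# Sketch: count dollar signs deterministically instead of asking a RAG pipeline.
# Assumes: pip install pypdf, and the attachment from this thread is on disk.
from pypdf import PdfReader

reader = PdfReader("emergency_national_security_supplemental_bill_text.pdf")

total = 0
pages_with_dollar = 0
for page in reader.pages:
    text = page.extract_text() or ""  # extraction can return None/empty for image-only pages
    count = text.count("$")
    total += count
    if count:
        pages_with_dollar += 1

print(f"{total} dollar signs across {pages_with_dollar} pages")
```

You could then hand the exact count (or the extracted amounts themselves) to the LLM and reserve the model for the genuinely semantic part of the task, such as describing who each appropriation benefits.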
-
I downloaded privateGPT because of its ability to process large documents/text.
My goal is to upload a 370 page congressional spending bill, and have privateGPT list the number of times a dollar sign ($) is used.
After this, I hope to find the right prompt to get it to analyze the entire document and create a list of every benefactor in the bill, paired with the dollar amount proposed.
At this point, I'm unable to get an accurate answer to the question "How many dollar signs are in this document?"
For reference, there are 152 instances across 56 pages. However, each time I ask this question, the response is always less than 34.
Initially, I uploaded the document as a .pdf.
After that failed to work, I converted the .pdf to .txt, copied the entire document, and pasted it into the prompt box.
I also included "How many instances of '$'" in the Additional Inputs System Prompt section.
This particular example (and every other congressional spending bill) has every line on every page numbered for future reference. PrivateGPT is having a hard time differentiating these reference numbers from the content of the text (see the cleanup sketch after the attachments below).
Even so, it should be able to count the number of "$" characters.
Am I expecting too much of this system?
Any thoughts?
Thank you,
Shawboy167
emergency_national_security_supplemental_bill_text.pdf
emergency_national_security_supplemental_bill_text.txt
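Regarding the per-line reference numbers mentioned above, one option is to strip them from the .txt before ingesting it, so they don't end up mixed into the chunks the model sees. This is a hypothetical preprocessing step, not something PrivateGPT does itself; the filename comes from the attachment above, and the regex is an assumption about how the bill's line numbering is laid out.

```python
# Hypothetical cleanup step: remove leading line-reference numbers (bills
# typically number lines 1-25 on each page) before ingestion.
import re

with open("emergency_national_security_supplemental_bill_text.txt", encoding="utf-8") as f:
    lines = f.readlines()

# Assumption: each numbered line starts with a 1-2 digit number and whitespace.
cleaned = [re.sub(r"^\s*\d{1,2}\s+", "", line) for line in lines]

with open("bill_text_cleaned.txt", "w", encoding="utf-8") as f:
    f.writelines(cleaned)
```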