phdGPT: A Specialized Article Summarizer
phdGPT is a Python tool that automatically summarizes PDF articles, with a focus on articles in analytic philosophy. It uses OpenAI's GPT-3.5 Turbo model to generate comprehensive, coherent, and technical summaries that preserve the complexity and detail of the source material. It is intended for researchers and students, especially those working on PhD projects, who need a precise understanding of many articles.
- PDF Text Extraction: Efficiently extracts text from PDF articles stored in a designated directory.
- Adaptive Chunking: Divides lengthy articles into manageable chunks to maintain contextual relevance during summarization.
- Analytic Philosophy Focused Summarization: Uses tailor-made prompts to ensure summaries are systematic, detailed, and relevant to analytic philosophy.
- Cohesive Synthesis: For extensive articles, combines summaries of different sections to deliver a cohesive and encompassing final summary.
- Adaptable Prompts: Employs customizable prompts to adapt summarization techniques for articles in different fields.
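The chunking and synthesis features can be illustrated with a minimal sketch. This README does not say how chunks are sized, so the fixed word budget here is an assumption, and `chunk_words`, `summarize_in_chunks`, and the `summarize` callable are hypothetical names, not functions from phd-gpt.py.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 2000) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_in_chunks(
    text: str,
    summarize: Callable[[str, str], str],
    max_words: int = 2000,
) -> str:
    """Summarize each chunk, then synthesize the partial summaries.

    `summarize` is a hypothetical callable (text, stage) -> str that stands
    in for the model call; stage is "regular", "chunk", or "final".
    """
    chunks = chunk_words(text, max_words)
    if len(chunks) <= 1:
        # Short article: one pass is enough.
        return summarize(text, "regular")
    # Long article: summarize every chunk, then combine the results.
    partials = [summarize(chunk, "chunk") for chunk in chunks]
    return summarize("\n\n".join(partials), "final")
```

Passing the model call in as a parameter keeps the sketch runnable without an API key and makes the chunk-then-combine flow easy to test with a stub.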
- Setup:
  - Store the articles intended for summarization in the ./articles directory.
  - Obtain the necessary API key from OpenAI and set it as an environment variable OPENAI_API_KEY.
- Run: Execute the phd-gpt.py script in your Python 3 environment. The generated summaries will be stored in the ./summaries directory.
$ python3 phd-gpt.py
- Review Summaries: Examine the synthesized summaries located in the ./summaries directory, each summary being appropriately named after its source article.
- Dependencies:
  - PyPDF2: For processing PDF files.
  - openai: For leveraging OpenAI's API.
- Functions:
  - procure_text(article_name: str) -> str: Retrieves and returns text content from the specified PDF article.
  - lentext(article: str) -> int: Calculates and returns the word count of the provided article.
  - summarize_article(article: str, prompt: str) -> str: Generates a summary for the input article using the provided prompt and returns the summary text.
  - summarize_long(article: str, article_name: str) -> str: Oversees the summarization of extensive articles, orchestrating the chunking and final synthesis processes.
  - main() -> None: The principal function initiating the summarization process for each article in the specified directory.
- Execution: The script operates by processing each PDF file in the ./articles directory and formulating summaries in the ./summaries directory.
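The execution flow described above could look roughly like the following sketch. It is not the code in phd-gpt.py: `summary_path` and `summarize_directory` are hypothetical names, the `.txt` suffix for summary files is an assumption, and the `extract` and `summarize` parameters are stand-ins for the script's text extraction and summarization functions.

```python
from pathlib import Path
from typing import Callable, List


def summary_path(pdf_path: Path, out_dir: Path) -> Path:
    """Name the summary after its source article (the .txt suffix is assumed)."""
    return out_dir / (pdf_path.stem + ".txt")


def summarize_directory(
    in_dir: Path,
    out_dir: Path,
    extract: Callable[[Path], str],
    summarize: Callable[[str], str],
) -> List[Path]:
    """Summarize every PDF in in_dir and write one file per article to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Only files ending in .pdf are processed; sorting keeps the order stable.
    for pdf_path in sorted(in_dir.glob("*.pdf")):
        text = extract(pdf_path)      # stand-in for the PDF text extraction step
        summary = summarize(text)     # stand-in for the summarization step
        target = summary_path(pdf_path, out_dir)
        target.write_text(summary, encoding="utf-8")
        written.append(target)
    return written
```

With real callables plugged in, `summarize_directory(Path("./articles"), Path("./summaries"), extract, summarize)` would mirror the behavior described for the script.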
- Python 3.x
- PyPDF2
- openai
$ pip3 install PyPDF2 openai
OPENAI_API_KEY: Essential for authenticating with the OpenAI API.
To set up OPENAI_API_KEY, add it to your environment variables.
On Linux or macOS:
- Open your terminal.
- Open your shell profile file (usually ~/.bashrc, ~/.bash_profile, or ~/.zshrc) with a text editor. For example:
$ nano ~/.bashrc
- Add the following line at the end of the file:
export OPENAI_API_KEY='your_api_key_here'
- Save and close the file.
- Reload the profile file to apply the changes:
$ source ~/.bashrc
On Windows:
- Open the Start Menu and search for “Environment Variables.”
- Click on “Edit the system environment variables.”
- In the System Properties window, click on “Environment Variables…”.
- In the Environment Variables window, click on “New…” under the User variables section.
- Enter OPENAI_API_KEY as the Variable name and your API key as the Variable value.
- Click OK to close each window.
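To confirm that the variable is visible to Python after following the steps above, a small check like the one below can help. It is a sketch only: `get_api_key` is a hypothetical helper, not part of phd-gpt.py, and the error message wording is illustrative.

```python
import os


def get_api_key(env=None) -> str:
    """Return the OpenAI API key from the environment or raise a clear error."""
    # Accept a mapping for testing; default to the real process environment.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to your shell profile or user "
            "environment variables, then open a new terminal."
        )
    return key
```

Calling `get_api_key()` in a fresh terminal raises an error if the variable was not picked up, which is easier to diagnose than an authentication failure partway through a run.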
To adapt the prompts for a field other than analytic philosophy, modify the reg_prompt, chunk_prompt, and final_summary_prompt strings in the phd-gpt.py script to suit the terminologies, intricacies, and emphasis needed for that particular field.
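As an illustration of adapting the three prompt strings to another field, the sketch below builds them from a field name and an emphasis. The variable names reg_prompt, chunk_prompt, and final_summary_prompt come from the script; the `build_prompts` helper and all prompt wording are assumptions made for this example and are not the script's actual prompts.

```python
def build_prompts(field: str, emphasis: str) -> dict:
    """Return the three prompt strings, reworded for the given field."""
    # Used when an article is short enough to summarize in one pass.
    reg_prompt = (
        f"Summarize the following {field} article in a systematic, detailed, "
        f"and technical way. Pay particular attention to {emphasis}."
    )
    # Used for each section of a long article.
    chunk_prompt = (
        f"The following text is one section of a longer {field} article. "
        f"Summarize it in detail, preserving {emphasis}."
    )
    # Used to merge the section summaries into one final summary.
    final_summary_prompt = (
        f"The following are summaries of consecutive sections of one {field} "
        "article. Combine them into a single cohesive summary."
    )
    return {
        "reg_prompt": reg_prompt,
        "chunk_prompt": chunk_prompt,
        "final_summary_prompt": final_summary_prompt,
    }
```

For example, `build_prompts("molecular biology", "experimental methods and effect sizes")` yields strings that could be pasted over the corresponding variables in the script.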
- The OpenAI API key must be configured correctly; without it, the script cannot authenticate with the OpenAI API and summarization will fail.
- The tool is optimized for analytic philosophy articles. Modifying the prompts can adapt it to other domains, but effectiveness may vary.
- Summarization quality depends on the clarity and quality of the input article.
- The tool works best with well-structured PDFs and may have difficulty with poorly formatted or scanned PDFs.
Contributions for refining and enhancing phdGPT are warmly welcomed. Feel free to open issues or create pull requests for improvements.
GPL-3.0 License