README.md (49 additions, 2 deletions)
# Text-Analysis-Project

Please read the [instructions](instructions.md).
## 1. Project Overview
To give some general context about my project: two of my favorite works of literature from high school were the Iliad and the Odyssey, both by Homer. As I was scrolling through Project Gutenberg looking for texts to analyze, they were among the most downloaded on the site, so I thought it would be fun to compare the two. What drew me to the works initially was the vast world Homer details within the texts: the gods, the heroes, the battles, the places. That "world-building" was very appealing to me, and the same draw became the foundation of my project. I wanted to see what kinds of words show up most often in each work, and then go a step further and try to automatically pull out the characters, places, and gods mentioned in different passages. To do that, I cleaned the text (lowercasing, removing punctuation, removing stopwords) and ran a word-frequency analysis. Then I used the OpenAI API to label small chunks of the text with entities. The goal was to build a basic text-analysis pipeline and to practice using APIs and an external AI service inside a Python script.


## 2. Implementation

Before I started writing any code, I outlined my system architecture and project requirements to make sure I tackled every part of it.

My requirements were:
- Download the texts
- Clean and filter the texts
- Analyze the text through word frequency and chart the findings
- Use the OpenAI API to show the most mentioned characters, places, and gods
- Use the API to compare the two texts and see which characters, places, and gods appear in both
- Return the API findings in JSON format
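The word-frequency half of those requirements can be sketched as a tiny pipeline. This is an illustrative sketch with a made-up miniature stopword list, not the actual code from `project.py`:

```python
# Minimal sketch: clean -> filter stopwords -> count word frequencies.
import string
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "thou"}  # tiny stand-in list

def word_frequencies(text):
    # lowercase, turn punctuation into spaces, split, drop stopwords
    text = text.lower()
    for p in string.punctuation:
        text = text.replace(p, " ")
    words = [w for w in text.split() if w not in STOPWORDS]
    return Counter(words)

freqs = word_frequencies("Sing, O goddess, the anger of Achilles son of Peleus")
print(freqs.most_common(3))
```

The real pipeline works the same way, just with a much larger stopword list and the full Gutenberg texts.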

I split the project into two files: `api.py` and `project.py`. `api.py` is responsible for anything "external": loading my OpenAI API key from a `.env` file and providing the function that calls the model to extract entities. `project.py` is the main script. It downloads the two texts from Gutenberg, strips out the header/footer, cleans the text, removes stopwords (I ended up making my own stopwords list, with varying success due to the nature of the texts), counts word frequencies, and then makes repeated calls to the API on fixed-size text chunks. The API results are combined into dictionaries that count how many times each character/place/god was mentioned.

One design decision was how to do the visualization. The instructions required at least one visualization, so I decided to print a simple ASCII bar chart of the top 15 words for each text, showing their relative frequencies. Another design decision was chunk size vs. number of API calls: sending the whole epic in one request isn't practical (and, as I discovered, a very easy way to get rate limited by OpenAI), so I picked a chunk size (~1200 words) and a maximum number of chunks per run to avoid hammering the API and to keep costs down. ChatGPT was extremely helpful, especially in getting the OpenAI API to do what I set out to do and in diagnosing the numerous errors I encountered along the way.
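The chunk-size trade-off is simple arithmetic. A rough sketch, assuming a ~130,000-word epic (an illustrative figure, not a measured count) and the ~1200-word chunks mentioned above:

```python
import math

total_words = 130_000   # rough size of one epic; illustrative, not measured
chunk_size = 1200       # words per API request, matching CHUNK_SIZE in project.py
calls_for_full_text = math.ceil(total_words / chunk_size)
print(calls_for_full_text)  # 109 requests to cover the whole text
```

Capping the number of chunks per run trades coverage for cost: each run only samples the opening of the text, but stays well under any rate limit.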


## 3. Results

After cleaning and removing stopwords, the top words for each text started looking more meaningful (fewer "the/and/my/thou") as I iterated on the stopwords list. The ASCII bar chart shows the most common content words in the *Iliad* and the *Odyssey* separately.

I explored using other people's stopword lists for the texts, like [this Ancient Greek list for the Odyssey](https://github.com/aurelberra/stopwords/blob/master/stopwords_greek_odyssey.txt), but I didn't really find what I was looking for: that list was built for the Ancient Greek version of the text, and obviously I was working with an English translation. From there, I considered using NLTK's default stopword list, but I thought it would be more fun to build my own.
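The difference a tailored list makes shows up on a single archaic line. A toy comparison with made-up miniature lists, not my actual stopword set:

```python
# Toy comparison: generic vs. archaic-aware stopword filtering.
generic = {"the", "and", "of"}
archaic_aware = generic | {"thou", "thy", "thee", "hath"}

line = "thou art the son of thy father and hath great glory".split()

print([w for w in line if w not in generic])
print([w for w in line if w not in archaic_aware])
```

The first filter leaves "thou", "thy", and "hath" cluttering the counts; the second keeps only content words, which is why the custom list needed an "archaic" section.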

On the “advanced” side, I used the OpenAI API to return structured JSON for passages like:

```json
{
"characters": ["Achilles"],
"places": ["Troy"],
"gods": ["Athena"]
}
```

I ran this over multiple chunks and then tallied the results, so I could print "most mentioned characters (partial)" for each work. Because API rate limits can slow things down, I limited how many chunks to send in one run, but the structure is there to process the whole text. This means the program doesn't just count plain words; it can also name the important figures in the epic, which is something basic frequency analysis can't do on its own.
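Tallying the per-chunk JSON into overall counts can be sketched like this (toy data in the JSON shape shown above, not real API output):

```python
from collections import Counter

# two per-chunk results in the JSON shape shown above (made-up values)
chunk_results = [
    {"characters": ["Achilles", "Hector"], "places": ["Troy"], "gods": ["Athena"]},
    {"characters": ["Achilles"], "places": ["Troy", "Ilium"], "gods": []},
]

# count how many chunks mention each character
char_counts = Counter()
for result in chunk_results:
    char_counts.update(result.get("characters", []))

print(char_counts.most_common())  # Achilles leads with 2 mentions
```

`tally_entities` in `api.py` does the same thing with plain dicts, for all three entity categories at once.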

## 4. Reflection

I found this project to be a real learning curve, but I had a lot of fun completing it. What went well was the basic text pipeline: download the text -> clean it -> filter stopwords -> count the words -> plot with the ASCII chart. I had a good idea of how to do that part, and ChatGPT helped answer some questions along the way. The hardest part by far was getting the OpenAI API to work smoothly; that took a lot of work and a lot of prompting ChatGPT to return what I wanted. My biggest takeaway (and how I solved the OpenAI API implementation issues) was that it's very important to be clear about what you're trying to do, how you're currently doing it, and what is going wrong when asking ChatGPT for help. A lot of the code it gave me I didn't really understand, which felt detrimental to my learning and to the project, so I only used code I at least partly understood.

## 5. A note on the OpenAI API
Before this project I had always wanted to experiment with the API, so this felt like a good time to do so. I had to consider some trade-offs in using it, like which model to use for my analysis and its price per 1M tokens. I opted for the cheapest model, gpt-5-nano, as I felt it was good enough for the scope of this project. The current rates for each model are listed on [OpenAI's pricing page](https://platform.openai.com/docs/pricing). I know you offered to cover the cost of the tokens, but I didn't mind paying for it myself, as I plan on exploring other uses for the API in my free time. For transparency, I ended up running 122 total requests, using 167,886 tokens at a cost of 26 cents.

Ultimately, I learned how to combine simple local text analysis with a cloud model that can recognize entities. The local part is fast and free; the API part is flexible and "smarter," but you have to be careful about cost and rate limits. AI tools helped me implement my vision for the project and bring it to life while making sure I understood what each line was doing. If I were to do this project again, I would look for ways to optimize it: even after limiting the input by breaking the texts into chunks, the API took a long time to return results during testing.
api.py (104 additions, 0 deletions)
import os
import json
from dotenv import load_dotenv
from openai import OpenAI

# load .env and get key
load_dotenv()
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Please set OPENAI_API_KEY in a .env file or as an environment variable.")

client = OpenAI()

# stopwords tailored to Iliad / Odyssey style text
STOPWORDS = {
    # core English
    "the", "and", "to", "of", "a", "in", "that", "it", "is", "was", "for",
    "on", "with", "as", "by", "this", "but", "from", "or", "not", "are",
    "at", "be", "an", "which", "so", "we", "were", "have", "has", "had",

    # pronouns
    "i", "you", "he", "she", "they", "them", "their", "theirs", "his",
    "her", "hers", "him", "me", "my", "mine", "our", "ours", "us", "your",
    "yours", "who", "whom", "whose",

    # connectors / often-seen words
    "what", "when", "where", "why", "how", "then", "now", "there", "here",
    "thus", "all", "one",

    # helper verbs
    "shall", "will", "would", "may", "might", "can", "could", "must", "should",

    # archaic
    "thy", "thou", "thee", "ye", "o", "unto", "doth", "hath",

    # dialogue
    "said", "say", "says",

    # iterated words from testing
    "no", "nor", "more", "these", "some", "yet", "o'er", "come", "own", "into", "if", "went", "been", "up",
    "out", "do", "about", "man", "go", "upon", "men", "did", "tell", "see", "any", "made",
    "other", "good", "much", "back",

    # stray unicode
    "“", "”", "—",
}


def extract_entities_with_openai(text):
    """
    Call OpenAI to get characters / places / gods from a passage.
    """
    prompt = (
        "You will be given a passage from Homer (Iliad or Odyssey). "
        "Extract the names of characters, places, and gods that are explicitly mentioned. "
        "Return ONLY valid JSON like this:\n"
        "{\n"
        '  "characters": ["Achilles"],\n'
        '  "places": ["Troy"],\n'
        '  "gods": ["Athena"]\n'
        "}\n"
        "If none are found, return empty lists.\n\n"
        "Passage:\n" + text
    )

    resp = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "system", "content": "You extract entities from classical literature and return strict JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=1  # default sampling temperature
    )

    content = resp.choices[0].message.content
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        # the model occasionally returns non-JSON text; fall back to empty results
        data = {"characters": [], "places": [], "gods": []}
    return data


def tally_entities(all_passage_entities):
    """
    Combine entity dicts into frequency dicts.
    """
    char_counts = {}
    place_counts = {}
    god_counts = {}

    for ent in all_passage_entities:
        for c in ent.get("characters", []):
            c = c.strip()
            if c:
                char_counts[c] = char_counts.get(c, 0) + 1
        for p in ent.get("places", []):
            p = p.strip()
            if p:
                place_counts[p] = place_counts.get(p, 0) + 1
        for g in ent.get("gods", []):
            g = g.strip()
            if g:
                god_counts[g] = god_counts.get(g, 0) + 1

    return char_counts, place_counts, god_counts
project.py (161 additions, 0 deletions)
import urllib.request
import string

from api import STOPWORDS, extract_entities_with_openai, tally_entities

# source texts
ILIAD_URL = "https://www.gutenberg.org/cache/epub/6130/pg6130.txt"
ODYSSEY_URL = "https://www.gutenberg.org/cache/epub/1727/pg1727.txt"

# chunk settings for OpenAI API
CHUNK_SIZE = 1200
MAX_API_CHUNKS = 2

# download the texts from project gutenberg
def download_text(url):
    try:
        with urllib.request.urlopen(url) as f:
            return f.read().decode("utf-8")
    except Exception as e:
        # raise instead of returning None so later steps fail with a clear message
        raise RuntimeError(f"Could not download {url}: {e}") from e

# strip texts of headers
def strip_gutenberg(text):
    start_token = "*** START OF"
    end_token = "*** END OF"
    s = text.find(start_token)
    e = text.find(end_token)
    if s != -1 and e != -1:
        # skip past the START marker line so its words aren't counted as text
        s = text.find("\n", s) + 1
        return text[s:e]
    return text

def clean_and_split(text):
    # convert to lowercase
    text = text.lower()
    # replace basic punctuation with spaces
    for p in string.punctuation:
        text = text.replace(p, " ")
    # split into words
    return text.split()

# filter text to remove stopwords
def remove_stopwords(words):
    return [w for w in words if w not in STOPWORDS]

def count_words(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def top_words(counts, n=15):
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:n]

def ascii_bar_chart(pairs, title):
    """
    pairs: list of (word, count), sorted by count descending
    prints a simple bar chart
    """
    print("\n" + title)
    if not pairs:
        print("no data")
        return
    max_val = pairs[0][1]
    for word, count in pairs:
        # scale bar to 30 chars
        bar_len = int(count / max_val * 30)
        print(f"{word:15} | {'#' * bar_len}")

def chunks_from_words(words, chunk_size):
    for i in range(0, len(words), chunk_size):
        yield words[i:i + chunk_size]

def analyze_work(name, url):
    print(f"--- downloading {name} ---")
    text = download_text(url)
    text = strip_gutenberg(text)

    # 1. basic text analysis
    words = clean_and_split(text)
    filtered_words = remove_stopwords(words)
    counts = count_words(filtered_words)
    top15 = top_words(counts, 15)

    # visualization requirement: show a bar chart of top words
    ascii_bar_chart(top15, f"Top words in {name}")

    # 2. advanced: OpenAI entity extraction
    print(f"\nExtracting entities from {name} with OpenAI (up to {MAX_API_CHUNKS} chunks)...")
    all_entities = []
    original_words = text.split()

    chunk_number = 0
    for chunk_words in chunks_from_words(original_words, CHUNK_SIZE):
        if chunk_number >= MAX_API_CHUNKS:
            break
        passage = " ".join(chunk_words)
        try:
            data = extract_entities_with_openai(passage)
            all_entities.append(data)
        except Exception as e:
            print("OpenAI error:", e)
            break
        chunk_number += 1

    if all_entities:
        char_counts, place_counts, god_counts = tally_entities(all_entities)
    else:
        char_counts, place_counts, god_counts = {}, {}, {}

    print(f"\n(Partial) most mentioned characters in {name}:")
    for word, cnt in sorted(char_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(word, cnt)

    print(f"\n(Partial) most mentioned places in {name}:")
    for word, cnt in sorted(place_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(word, cnt)

    print(f"\n(Partial) most mentioned gods in {name}:")
    for word, cnt in sorted(god_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
        print(word, cnt)

    return {
        "word_counts": counts,
        "entities": {
            "characters": char_counts,
            "places": place_counts,
            "gods": god_counts,
        },
    }


def main():
    iliad_result = analyze_work("Iliad", ILIAD_URL)
    odyssey_result = analyze_work("Odyssey", ODYSSEY_URL)

    # compare if we got entities
    print("\n=== Comparison (partial) ===")
    if iliad_result["entities"]["characters"] and odyssey_result["entities"]["characters"]:
        print("\nCharacters mentioned more in Iliad than Odyssey:")
        for name, cnt in iliad_result["entities"]["characters"].items():
            other = odyssey_result["entities"]["characters"].get(name, 0)
            if cnt > other:
                print(f"{name}: Iliad {cnt}, Odyssey {other}")
    else:
        print("Skipped comparison because no entities were extracted.")


if __name__ == "__main__":
    main()