Skip to content

Latest commit

 

History

History
597 lines (393 loc) · 32.1 KB

File metadata and controls

597 lines (393 loc) · 32.1 KB

Text Analysis Project

Introduction

In this project, you will learn how to use computational techniques to analyze text. You will access text from a variety of sources, including websites and APIs, and run computational analyses to create some sort of deliverable, such as interesting results from a text analysis, a visualization, or even a Python program that manipulates language in some interesting way. As part of the project, you are encouraged to use AI tools to explore how to talk to APIs and how to use Python libraries that have not been covered in class yet. This assignment is an individual project.

Skills Emphasized:

  • Accessing data programmatically from various sources on the Internet
  • Parsing text and storing it in appropriate data structures
  • Selecting the most suitable data structures for a specific task (e.g. dictionaries versus lists)
  • Applying computational methods to analyze, characterize, and compare text
  • Experimenting with AI tools to enhance the learning process and explore new tools and techniques.

How to Proceed

To get started on the assignment, you should first fork this base repository. Once you've forked the repository, clone the forked repository (the one under your GitHub profile) to your computer. You need to create one or multiple .py files in the forked repository.

You should read this document in a somewhat non-linear/spiral fashion:

  1. Scan through Part 1 to get a sense of what data sources are available. You can select one or two sources that interests you and try to retrieve text from them. Note that you do not need to try all the data sources.
  2. Scan through Part 2 to see a bunch of cool examples for what you can do with your text. You can also ask AI tools what else you can do with Python to process, analyze or visulize the text.
  3. Choose (at least) one data source from Part 1 and apply required techniques from Part 2, plus any additional techniques that interest you, to analyze, manipulate, transform or visualize the text.
  4. Make sure there is one clear entry .py file for the entire project. Multiple .py files are encouraged to break the project into smaller, modular components.
  5. Use the if __name__ == "__main__" idiom in the .py files. Your code should be executed when the entry Python file is run.
    if __name__ == "__main__":
        main()
  6. You are required to experiment with learning from AI tools (see more in Part 3).
  7. Write a brief document (Part 4) describing your process and your reflection.
  8. If you use any code or solutions that is not written by you (or that you learned from other places such as StackOverFlow/GitHub), please add Python comments (before the block of code) describing where you got/learned it from.
  9. Generally I DO NOT recommend using numpy, pandas, sklearn or matplotlib in this project, unless there is no other alternative way of processing and analyazing your data. For instance, if you need to perform complex matrix computations, text clustering (like MDS), or advanced visualizations, it is acceptable to use these libraries. Please justify your choice in your project documentation.

Jupyter Notebook vs .py Files: Which to Use?

Required for Submission: Your final project must include .py files as described above. This is a mandatory requirement.

Optional for Development: You are encouraged to use Jupyter Notebooks (.ipynb files) during the development and exploration phase. Here's a recommended workflow:

Use Jupyter Notebooks for:

  • Exploratory Data Analysis: Quickly test API connections, explore text data characteristics, and experiment with different processing methods
  • Interactive Visualization: Adjust chart parameters in real-time and immediately see the results
  • Learning and Experimentation: Test new libraries (NLTK, TextBlob, etc.) and debug code step-by-step
  • Documentation: Keep notes and screenshots showing how you used AI tools to learn and solve problems

Use .py Files for:

  • Final Submission: This is required by the project specifications
  • Production Code: Well-organized, modular functions with proper error handling
  • Code Reusability: Functions and classes that can be imported and reused
  • Version Control: Git-friendly format for tracking changes

Workflow Recommendation:

  1. Explore in Jupyter: Test APIs, experiment with text processing techniques, create visualizations interactively
  2. Refactor to .py: Move successful code into well-organized Python modules with proper functions and documentation
  3. Submit Both (optional): Include both .ipynb files (to show your learning process) and .py files (required for grading). You can reference your notebooks in the README to demonstrate how you used AI tools for learning.

This approach allows you to leverage the interactive benefits of Jupyter Notebooks while meeting the project's requirements for well-structured Python code.


Part 1: Harvesting text from the Internet

The goal for Part 1 is to collect some text from the Internet that you can later use for text analysis. Before diving deep into any particular method of text acquisition, it is recommended that you explore the different APIs and Python libraries available to extract text from the web. However, before spending too much time going down a particular path on the text acquisition component, you should look ahead to Part 2 to understand some of the things you can do with text you are harvesting. The key to a successful project is combining a relevant source of text with an appropriate technique for analysis (see Part 2).

Note: Some APIs (such as Twitter and Reddit) may require a paid subscription or a lengthy application process. It is recommended to apply for API credentials in advance or choose alternative free data sources to avoid delays later in the project.

Installing Python Packages

Throughout this project, you will need to install various Python libraries. Here are the recommended methods:

If you are using Anaconda (recommended for this course):

# Use conda to install packages (preferred method for Anaconda users)
conda install -c conda-forge package_name

# If the package is not available in conda, use pip with Anaconda's Python
python -m pip install package_name

If you are using standard Python installation (not Anaconda, not for this course):

# For Windows users
python -m pip install package_name

# For macOS/Linux users
python3 -m pip install package_name

Important Notes:

  • Always use python -m pip install instead of just pip install to ensure you're installing to the correct Python environment
  • If you're using Anaconda, try conda install first, as it handles dependencies better
  • You can check which Python you're using by running python --version or which python (macOS/Linux) or where python (Windows)

Data Source: Project Gutenberg

Project Gutenberg (http://www.gutenberg.org/) is a website that provides over 55,000 e-books that are freely available to the public. Unlike some sites, all of the texts on Project Gutenberg are in the public domain, which means they are no longer protected by copyright. For example, the site offers 171 works by Charles Dickens. The best thing about these texts is that they are available in plain text format, which makes them easy to analyze using Python.

To download a book from Project Gutenberg, first use the search engine on the Project Gutenberg website to find a book you are interested in downloading. For example, if you want to download Oliver Twist by Charles Dickens, search for it on the website. Once you have found the book you want to download, go to its page on the Project Gutenberg website. Find the "Plain Text UTF-8" link on the book's page. Copy the link to the plain text version of the book. In the case of Oliver Twist, the link to the plain text version is "https://www.gutenberg.org/cache/epub/730/pg730.txt".

To download the text inside Python, you can use the following code:

import urllib.request

url = 'https://www.gutenberg.org/cache/epub/730/pg730.txt'
try:
    with urllib.request.urlopen(url) as f:
        text = f.read().decode('utf-8')
        print(text)  # for testing
except Exception as e:
    print("An error occurred:", e)

Security Note: When working with APIs that require credentials (Twitter, Reddit, News API, etc.), never commit your API keys to version control. Use environment variables or separate configuration files (e.g., config.py) and add them to .gitignore.

Note that there is a preamble (boilerplate on Project Gutenberg, table of contents, etc.) that has been added to the text that you might want to strip out using Python code when you do your analysis. There is similar material at the end of the file.

One limitation of using Project Gutenberg is that they impose a limit on how many texts you can download in a 24-hour period. If you are analyzing many texts, it may be more efficient to download them once and load them off disk, rather than fetching them from Project Gutenberg's servers every time you run your program. See the Pickling Data section below on how to save data to files and load it back into your program. Additionally, there are many mirrors of the Project Gutenberg site available if you want to get around the download restriction.

Data Source: Wikipedia

Wikipedia is another valuable source of data that can be easily accessed and parsed using the mediawiki library which is a python wrapper and parser for the MediaWiki API. To install the library:

# If using Anaconda (recommended)
conda install -c conda-forge pymediawiki

# Or using pip
python -m pip install pymediawiki

Once you have installed the library, you can use it to search Wikipedia, get article summaries, and extract data like links and images from a page. To fetch a particular article and print out its sections, you can use the following Python code:

from mediawiki import MediaWiki

wikipedia = MediaWiki()
babson = wikipedia.page("Babson College")
print(babson.title)
print(babson.content)

This code will fetch the article with the given title and print its title and content. The output will look like this:

Babson College (Babson) is a private business school in Wellesley, Massachusetts. Established in 1919, Babson's central focus is on entrepreneurship education and its use in creating economic and social value. The college was founded by Roger W. Babson as an all-male business institute and became coeducational in 1970.
...

You can also access other properties of a page, such as its categories, sections, and links. See the mediawiki package documentation for more information on available properties and methods.

Data Source: Twitter

(Note: I have not tested this API since the announcment of shutting down free Twitter API. The free version of Twitter API has been deprecated and replaced with a new version that requires application approval and authentication with a paid subscription. To use the Twitter API, you need to apply to Twitter for a developer account and explain the purpose of what you are doing with the data, which Twitter will manually review.)

(Update: You'll need at least the Basic access tier to search recent tweets, which isn't free. You can subscribe to it in your Dashboard in the Developer Portal.)

If you have access to a Twitter developer account and the necessary API keys and tokens, you can use the tweepy library to search for tweets.

To install tweepy:

# If using Anaconda (recommended)
conda install -c conda-forge tweepy

# Or using pip
python -m pip install tweepy

Here is a simple example for searching tweets containing Babson College:

import tweepy

# Replace the following strings with your own keys and secrets
TOKEN = 'Your TOKEN'
TOKEN_SECRET = 'Your TOKEN_SECRET'
CONSUMER_KEY = 'Your CONSUMER_KEY'
CONSUMER_SECRET = 'Your CONSUMER_SECRET'

# Authenticate to Twitter
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(TOKEN,TOKEN_SECRET)

api = tweepy.API(auth)

for tweet in api.search_tweets(q="babson college", lang="en", count=10):
    print(f"{tweet.user.name}: {tweet.text}")

Data Source: Reddit

Note: Reddit also requires users to register and create an application in order to obtain API credentials. After creating an application, you can obtain the necessary credentials such as client_id, client_secret, username, password, and user_agent. You can learn more about this process on the Reddit API documentation.

To get reddit data, you need to install the PRAW library:

# If using Anaconda (recommended)
conda install -c conda-forge praw

# Or using pip
python -m pip install praw

Here's an example from the PRAW docs page:

import praw
import config


reddit = praw.Reddit(client_id=config.client_id,
                     client_secret=config.client_secret,
                     username=config.username,
                     password=config.password,
                     user_agent=config.user_agent)

sub = 'learnpython'
submissions = reddit.subreddit(sub).top('day', limit=5)
for submission in submissions:
    print(submission.title)
    print(submission.selftext)

Data Source: News API

You can use newsapi-python library to fetch news articles from News API. You need to install the newsapi-python library:

# If using Anaconda (recommended)
conda install -c conda-forge newsapi-python

# Or using pip
python -m pip install newsapi-python

Here's an example from the Python client library page in News API Documentation:

from newsapi import NewsApiClient

# Init
newsapi = NewsApiClient(api_key='API_KEY')

# /v2/top-headlines
top_headlines = newsapi.get_top_headlines(q='bitcoin',
                                          sources='bbc-news,the-verge',
                                          category='business',
                                          language='en',
                                          country='us')

# /v2/everything
all_articles = newsapi.get_everything(q='bitcoin',
                                      sources='bbc-news,the-verge',
                                      domains='bbc.co.uk,techcrunch.com',
                                      from_param='2017-12-01',
                                      to='2017-12-12',
                                      language='en',
                                      sort_by='relevancy',
                                      page=2)

# /v2/top-headlines/sources
sources = newsapi.get_sources()

Data Source: Newspaper Articles

You can also use Newspaper4k package to scrape and curate news articles. You need to install the Newspaper4k library:

# If using Anaconda (recommended)
conda install -c conda-forge newspaper4k

# Or using pip
python -m pip install newspaper4k

Here's an example from the Newspaper4k Docs page:

import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

Data Source: IMDB Movie Reviews

To get the IMDB data, you need to install cinemagoer library:

# If using Anaconda (recommended)
conda install -c conda-forge cinemagoer

# Or using pip
python -m pip install cinemagoer

Here's an example to print the first review of the movie "The Dark Knight":

from imdb import Cinemagoer

# create an instance of the Cinemagoer class
ia = Cinemagoer()

# search movie
movie = ia.search_movie("The Dark Knight")[0]
print(movie.movieID)
# '0468569'

# Get reviews
movie = ia.get_movie('0468569', info=['reviews']) # Make sure to add the second argument
reviews = movie.get('reviews', [])

for review in reviews:
    print(review['content'])
    print()


# Get actor
matt_damon = ia.get_person_filmography('0000354')

# Get Matt Damon's movies
data = matt_damon['data']
filmography = data['filmography']
films_as_actor = filmography['actor']
print(films_as_actor)

Data Source: More Data Sources

There are many other data sources that you can utilize in your project:

Feel free to explore and choose the data source that fits your project's needs.

Pickling Data

When you download text data from the Internet, it is often useful to save it to disk so that you do not have to re-download it every time you run your program. One way to do this in Python is to use the built-in pickle library, which allows you to serialize Python objects and save them to a file.

In addition to pickling, you can also save files using JSON format. To explore more about the built-in json library, feel free to ask AI tools or visit the official Python documentation website.


Part 2: Analyzing Your Text

This part consists of required steps that all students must complete, and optional techniques that you can choose from to extend your analysis.

Required Steps

All students must complete the following steps:

1. Text Cleaning and Preprocessing

(Note: This step is required.)

Before analyzing your text, you need to clean and preprocess it. This includes:

  • Removing unwanted content (e.g., Project Gutenberg preambles, HTML tags, special characters)
  • Converting text to lowercase for consistency
  • Handling punctuation appropriately
  • Dealing with encoding issues if any

Real-world text data is often messy, and proper cleaning is essential for accurate analysis.

2. Removing Stop Words

(Note: This step is required.)

Stop words are words that occur frequently in text but do not provide useful information for analysis. Examples of stop words include "the", "and", "a", "is", etc. Removing stop words helps to:

  • Reduce the size of the text data
  • Improve the accuracy of analysis
  • Focus on meaningful words that carry semantic value

You can use built-in stop word lists from libraries like NLTK, or create your own custom list.

3. Word Frequency Analysis

(Note: This step is required.)

One way to begin to process your text is to take each unit of text (for instance, books from Project Gutenberg, or perhaps a collection of movie reviews) and summarize it by counting the number of times a particular word appears in the text. A natural way to approach this in Python would be to use a dictionary where the keys are words that appear and the values are frequencies of words in the text. If you want to do something fancier, you can use TF-IDF features.

4. Computing Summary Statistics

(Note: This step is required.)

Beyond calculating word frequencies, there are other methods to summarize text. For instance, you may want to:

  • Identify the top 10 or top 20 most frequent words in each text
  • Determine words that appear frequently in one text but not in others
  • Calculate average word length, sentence length, or document length
  • Compare vocabulary richness across different texts

We practiced this using Jane Austen's novel in class, so I recommend starting with that example.

5. Data Visualization

(Note: This step is required.)

You must create at least one visualization to present your analysis results. This could include:

  • Bar charts showing top N most frequent words
  • Word clouds (can use the wordcloud library)
  • Line graphs comparing statistics across different texts
  • Simple ASCII-based visualizations if you want to avoid matplotlib

Visualizations help communicate your findings effectively and make your analysis more engaging.

Optional Techniques

Choose at least one of the following advanced techniques to extend your analysis:

Optional Technique 1: Natural Language Processing

(Note: Choose at least one optional technique.)

NLTK - the Natural Language Toolkit - is a powerful tool for processing human language data. It provides a wide range of capabilities, such as part-of-speech tagging, sentiment analysis, and full sentence parsing.

To use NLTK, you need to install nltk:

# If using Anaconda (recommended)
conda install -c conda-forge nltk

# Or using pip
python -m pip install nltk

Here is an example of doing sentiment analysis using the VADER library in NLTK:

import nltk
nltk.download('vader_lexicon')  # Download required data
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentence = 'Software Design is my favorite class because learning Python is so cool!'
score = SentimentIntensityAnalyzer().polarity_scores(sentence)
print(score)
# Output
# {'neg': 0.0, 'neu': 0.614, 'pos': 0.386, 'compound': 0.7417}

Notice: If you receive Resource vader_lexicon not found error when using nltk, you need to enter python in Command Prompt (or python3 in Terminal on macOS), then enter import nltk and nltk.download('vader_lexicon') in Python interactive shell.

You can also use TextBlob library, which is built on top of NLTK, for almost everything that NLTK does. Below is the brief introduction of TextBlob from its GitHub page:

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more.

If you perform natural language processing, you can draw interesting insights from text data collected from the web. For instance, if you monitor a specific subreddit related to a political topic, you can gauge the sentiment of the community by analyzing the text of each post and comment. Similarly, you can analyze discussions on subreddits dedicated to movies to identify which recent movies have received the most negative reviews. There are tons of cool options here!

Optional Technique 2: Text Similarity

(Note: Choose at least one optional technique.)

It is potentially quite useful to be able to compute the similarity of two texts. Suppose that we have characterized some texts from Project Gutenberg using word frequency analysis. One way to compute the similarity of two texts is to test to what extent when one text has a high count for a particular word the other text also a high count for a particular word. Specifically, we can compute the cosine similarity between the two texts. This strategy involves thinking of the word counts for each text as being high-dimensional vectors where the number of dimensions is equal to the total number of unique words in your text dataset and the entry in a particular element of the vector is the count of how frequently the corresponding word appears in a specific document. If you find this approach unclear and wish to try it, you can either reach out to the professor, or ask AI tools for assistance.

For a simple text similarity task, you can use external libraries, like TheFuzz library, which uses Levenshtein Distance to calculate the differences between sequences.

from thefuzz import fuzz

print(fuzz.ratio("this is a test", "this is a test!")) # 97
print(fuzz.partial_ratio("this is a test", "this is a test!")) # 100
print(fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")) # 91
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")) # 100

Optional Technique 3: Text Clustering

(Note: Choose at least one optional technique.)

If you can generate pairwise similarities (say using the technique above), you can use Metric Multi-dimensional Scaling (MDS) to visualize the texts in a 2-dimensional space. This can help identify clusters of similar texts.

In order to apply MDS to your data, you can use the machine learning toolkit scikit-learn. Here is some code that uses the similarity matrix defined in the previous section to create a 2-dimensional embedding of the four Charles Dickens and 1 Charles Darwin texts.

import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# these are the similarities computed from the previous section
S = np.asarray([[1., 0.90850572, 0.96451312, 0.97905034, 0.78340575],
    [0.90850572, 1., 0.95769915, 0.95030073, 0.87322494],
    [0.96451312, 0.95769915, 1., 0.98230284, 0.83381607],
    [0.97905034, 0.95030073, 0.98230284, 1., 0.82953109],
    [0.78340575, 0.87322494, 0.83381607, 0.82953109, 1.]])

# dissimilarity is 1 minus similarity
dissimilarities = 1 - S

# compute the embedding
coord = MDS(dissimilarity='precomputed').fit_transform(dissimilarities)

plt.scatter(coord[:, 0], coord[:, 1])

# Label the points
for i in range(coord.shape[0]):
    plt.annotate(str(i), (coord[i, :]))

plt.show()

This will generate the following plot. The coordinates don't have any special meaning, but the embedding tries to maintain the similarity relationships that we computed via comparing word frequencies. Keep in mind that the point labeled 4 is the work by Charles Darwin and the other are by Charles Dickens. text clustering

Optional Technique 4: Markov Text Synthesis

(Note: Choose at least one optional technique.)

You can use Markov analysis to learn a generative model of the text that you collect from the web and use it to generate new texts. You can even use it to create mashups of multiple texts. One of possibilities in this space would be to create literary mashups automatically. Again, let professor know if you go this route and we can provide more guidance.

Optional Technique 5: LLM (Large Language Model) Text Generation

(Note: Choose at least one optional technique.)

You can explore further possibilities by using the OpenAI API. Feel free to ask for an API token if you're interested, and I'd be happy to provide it. I highly encourage you to give this a try!


Part 3: Learning with AI

As you work through this project and experiment with different libraries in Python, you may encounter roadblocks or have questions about your code. That's when you can use AI tools, like ChatGPT to clear out any issues. You are also encouraged to learn other approaches, besides the techniques mentioned above, to process, analyze and visualize your own text dataset in Python from ChatGPT or other AI tools, who will serve as your assistant, providing helpful suggestions, aiding your learning process.

Reminder: While AI tools can be incredibly helpful in resolving issues or suggesting new approaches, it’s important not to rely too heavily on them. Always test and validate the generated code, making sure it meets the project requirements and that you fully understand how the code works. Include comments in your code that indicate which parts were generated with AI assistance, and provide links or references to the sources if applicable. This practice not only helps maintain academic integrity but also demonstrates your learning process.

Here's how to make the most out of AI tools (using ChatGPT as an example):

  • Clearly Define Your Problem: Take detailed notes on where you're stuck or what you're trying to achieve before asking ChatGPT for assistance.
  • Craft Detailed Prompts: When asking ChatGPT for help, provide a clear and thorough description of the issue. The better you frame your question, the more helpful the response will be.
  • Review and Verify: After receiving a response, carefully read the suggestions. Remember, AI-generated solutions may not always be accurate, so it's important to test the code and consult additional official documentation if needed.
  • Document Your Learning Process: To track your progress, include ChatGPT Shared Links in your code comments or maintain a separate document. You may also take screenshots during your ChatGPT session and include them in your project write-up.

Part 4: Project Writeup and Reflection

Write a summary of your project and your reflections on it in README.md, using Markdown format. There is no need to use fancy words or ChatGPT. The README.md file should consist of the following sections:

1. Project Overview (~1 paragraph)

What data source(s) did you use? What technique(s) did you use to process or analyze them? What did you hope to create or learn through this project?

2. Implementation (~1-2 paragraphs)

Describe your implementation at a system architecture level. You should NOT walk through your code line by line, or explain every function (we can get that from your docstrings). Instead, talk about the major components, algorithms, data structures and how they fit together. You should also discuss at least one design decision where you had to choose between multiple alternatives, and explain why you made the choice. Use shared links and/or screenshots to describe how you used AI tools to help you or learn new things.

3. Results (~1-3 paragraphs + figures/examples)

Present what you accomplished in your project:

  • If you did some text analysis, what interesting things did you find? Graphs or other visualizations may be very useful here for showing your results.
  • If you created a program that does something interesting (e.g. a Markov text synthesizer), be sure to provide a few interesting examples of the program's output.

4. Reflection (~1-2 paragraphs)

From a process point of view, what went well? What was the biggest challenge? How did you solve it? What could you improve? Was your project appropriately scoped? Did you have a good testing plan?

From a learning perspective, what was your biggest takeaway from this project? How did AI tools help you? How will you use what you learned going forward? What do you wish you knew beforehand that would have helped you succeed?


Submitting your Project

  1. Push all the code and updated README.md to the GitGub repository.
  2. Create a pull request to the upstream repository. Please learn how to create a pull request by following this instruction.
  3. Submit your project's GitHub repository URL to Canvas.

Updated: 2025/10/26