📊 Homework 2: Codeforces Data Analysis
For this homework, you’ll analyze data from the Codeforces API. The goal is to apply skills in data extraction, wrangling, visualization, and interpretation—all while exploring behavior on one of the largest competitive programming platforms.
✅ Objectives
Each student must:
- **Extract Data from Codeforces**
  - Use the Codeforces API to gather relevant contest, user, or submission data.
  - You must restrict your analysis to contests and user activity that occurred between July and December of 2024.
  - Also, only use contests whose titles contain one of the strings "Hello", "Round", or "Good Bye".
  - Steps (a minimal extraction sketch follows this list):
    - First, create a list of the contests using the contest.list endpoint.
    - Second, using the information from the first step, extract:
      - The contest's information and problems using the contest.standings endpoint
      - The users who participated in the contest using the user.ratedList endpoint
      - The submissions using the contest.status endpoint
      - The users' rating changes using the contest.ratingChanges endpoint
    - It can be helpful to create a folder for each contest that contains tables with the results from these endpoints.
- **Describe the Dataset**
  - In your notebook, include a markdown cell that describes:
    - The API endpoints used
    - The structure of your dataset
    - A short explanation of each variable you're analyzing
- **Data Wrangling** (a wrangling sketch follows this list)
  - Clean and transform your data:
    - Handle missing or nested fields
    - Convert timestamps
    - Create derived variables (explained further in the example table section):
      - finished_n
      - relative_time_n
      - time_to_answer_n: can be calculated by sorting each user's relative_time_n values and taking the lagged differences
      - rating_achieved
- **Descriptive Figures and Analysis** (a plotting sketch follows this list)
  - Histogram of submission times
  - Kernel density figure of users' maximum ratings
  - Boxplots of language vs. time_to_answer
  - Binscatter of rating vs. time_to_answer
  - Basic linear regression of rating vs. rating_achieved (with scatter plot and regression line)
  - For each plot, include an interpretation
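For the extraction step, the sketch below shows one possible way to filter contest.list and download the per-contest endpoints. The `call` helper, the `data/` folder layout, and the choice of query parameters are illustrative assumptions, not requirements.

```python
# Minimal sketch of step 1: list contests, filter them, and save raw endpoint output.
import json
import time
from datetime import datetime, timezone
from pathlib import Path

import requests

BASE = "https://codeforces.com/api"
START = datetime(2024, 7, 1, tzinfo=timezone.utc).timestamp()
END = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()
KEYWORDS = ("Hello", "Round", "Good Bye")


def call(method, **params):
    """Call one Codeforces API method and return its 'result' payload."""
    r = requests.get(f"{BASE}/{method}", params=params, timeout=30)
    r.raise_for_status()
    body = r.json()
    if body["status"] != "OK":
        raise RuntimeError(body.get("comment", "API error"))
    return body["result"]


# Finished contests in the July-December 2024 window whose title matches a keyword.
contests = [
    c
    for c in call("contest.list", gym="false")
    if c["phase"] == "FINISHED"
    and START <= c.get("startTimeSeconds", 0) < END
    and any(k in c["name"] for k in KEYWORDS)
]

# One folder per contest, one JSON file per endpoint.
for c in contests:
    folder = Path("data") / str(c["id"])
    folder.mkdir(parents=True, exist_ok=True)
    for method, params in [
        ("contest.standings", {"contestId": c["id"]}),
        ("contest.status", {"contestId": c["id"]}),
        ("contest.ratingChanges", {"contestId": c["id"]}),
        ("user.ratedList", {"contestId": c["id"], "activeOnly": "true"}),
    ]:
        (folder / f"{method}.json").write_text(json.dumps(call(method, **params)))
        time.sleep(2)  # be gentle with the API's rate limit
```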
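For the wrangling step, the sketch below assumes the submissions returned by contest.status have already been loaded into a list of dicts named `subs` (an assumption, as is every variable name). It shows one way to derive time_to_answer (lagged differences of sorted relative times) and rating_achieved in long format.

```python
import pandas as pd

# Flatten the nested submission objects into a long table: one row per submission.
rows = []
for s in subs:
    rows.append(
        {
            "author_handle": s["author"]["members"][0]["handle"],
            "problem": s["problem"]["index"],
            "problem_rating": s["problem"].get("rating"),
            "language": s["programmingLanguage"],
            "verdict": s.get("verdict"),
            "relative_time": s["relativeTimeSeconds"],
            "submitted_at": pd.to_datetime(s["creationTimeSeconds"], unit="s"),
        }
    )
long_df = pd.DataFrame(rows)

# Keep each user's first accepted submission per problem.
solved = (
    long_df[long_df["verdict"] == "OK"]
    .sort_values("relative_time")
    .drop_duplicates(["author_handle", "problem"])
    .copy()
)

# time_to_answer: lagged difference of relative times within each user after sorting,
# so the first solved problem keeps its own relative time.
solved = solved.sort_values(["author_handle", "relative_time"])
solved["time_to_answer"] = (
    solved.groupby("author_handle")["relative_time"]
    .diff()
    .fillna(solved["relative_time"])
)

# rating_achieved: sum of the ratings of the problems each user solved.
rating_achieved = (
    solved.groupby("author_handle")["problem_rating"].sum().rename("rating_achieved")
)
```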
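For the figures, the sketch below covers two of the requested plots (the histogram of submission times and the rating vs. rating_achieved regression); the others follow the same pattern. It reuses `solved` and `rating_achieved` from the wrangling sketch, and `users` is a hypothetical frame with columns `author_handle` and `rating` built, for example, from user.ratedList.

```python
import matplotlib.pyplot as plt
import numpy as np

# Histogram of submission times (seconds since contest start).
plt.figure()
plt.hist(solved["relative_time"], bins=50)
plt.xlabel("Seconds since contest start")
plt.ylabel("Accepted submissions")
plt.title("Submission times")
plt.show()

# Scatter plot plus fitted line: pre-contest rating vs. rating_achieved.
merged = users.merge(rating_achieved.reset_index(), on="author_handle")
slope, intercept = np.polyfit(merged["rating"], merged["rating_achieved"], deg=1)
xs = np.linspace(merged["rating"].min(), merged["rating"].max(), 100)

plt.figure()
plt.scatter(merged["rating"], merged["rating_achieved"], s=8, alpha=0.4)
plt.plot(xs, intercept + slope * xs, color="red", label=f"slope = {slope:.2f}")
plt.xlabel("User rating before the contest")
plt.ylabel("rating_achieved")
plt.legend()
plt.show()
```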
🧾 Example of Cleaned Dataset
Your cleaned dataset may look something like this:
| | author_handle | finished_1 | finished_2 | finished_3 | ... | 1_language | 2_language | ... | relative_time_1 | relative_time_2 | ... | time_to_answer_1 | ... | rating_1 | rating_2 | ... | rating_achieved | contest_id | contest_name | contest_start_time | country | city | rating | max_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78442 | ---0_0--- | True | True | False | ... | C++20 (GCC 13-64) | C++20 (GCC 13-64) | ... | 286 | 1965 | ... | 286 | ... | 800 | 1100 | ... | 1900 | 1995 | Codeforces Round 961 (Div. 2) | 1721745300 | India | nan | 1615 | 1670 |
| 78443 | --Accepted-- | True | True | False | ... | C++17 (GCC 7-32) | C++17 (GCC 7-32) | ... | 675 | 5186 | ... | 675 | ... | 800 | 1100 | ... | 1900 | 1995 | Codeforces Round 961 (Div. 2) | 1721745300 | nan | nan | 1260 | 1260 |
Explanation:
- finished_n: whether the user solved problem n in the contest.
- n_language: language used to solve problem n.
- relative_time_n: time in seconds from contest start to the user's submission for problem n.
- time_to_answer_n: difference between the time to answer question n and whichever question was answered before it.
- rating_n: difficulty rating of problem n.
- rating_achieved: sum of the ratings of the problems the user was able to solve.
- contest_name, contest_start_time: contextual info for labeling.
- country, city, rating, max_rating: profile metadata.
Use this structure as a guiding example, but feel free to adapt based on the focus of your analysis.
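If you build the long table from the wrangling sketch first, one way to reach a wide layout like the example above is a pivot, as sketched below. Note the flattened column names come out as `language_A`, `relative_time_A`, etc. rather than the numbered names in the table, so rename them however suits your analysis.

```python
# Pivot the long `solved` table (one row per user-problem) into one row per user.
wide = solved.pivot(
    index="author_handle",
    columns="problem",
    values=["language", "relative_time", "time_to_answer", "problem_rating"],
)
# Flatten the (value, problem) MultiIndex columns, e.g. ("language", "A") -> "language_A".
wide.columns = [f"{value}_{prob}" for value, prob in wide.columns]

# finished_<problem>: True where the user has an accepted submission for that problem.
for prob in sorted(solved["problem"].unique()):
    wide[f"finished_{prob}"] = wide[f"relative_time_{prob}"].notna()

wide["rating_achieved"] = rating_achieved  # aligns on the author_handle index
wide = wide.reset_index()
```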
📌 Deliverables
- Submit a Jupyter or Colab notebook with:
  - Code for API access and analysis
  - Markdown explanations for each section
  - Clear, labeled plots
- If working in a group: include names + a short section on what each person contributed
💡 Tips for Success
- Start small! Focus on one user group or contest division.
- Use `requests` and `pandas` for API access and processing.
- Think of economic or behavioral interpretations of what you're seeing (e.g., trade-offs, learning curves, productivity).
Deadline: April 17, 23:59.