An analysis of Stack Overflow's annual survey responses. The project uses the CRISP-DM process and is divided into the following parts:
Since 2011, Stack Overflow has run an annual developer survey to evaluate the demographics, technology usage, education level, employment status, salary range and much more across the developer community. In their own words:
Our Annual Developer Survey examines all aspects of the developer experience from career satisfaction and job search to education and opinions on open source software.
In 2021, it was fielded from May 25 to June 15 and received nearly 80,000 responses from over 180 countries and dependent territories. The results are already public, alongside a detailed report containing the main insights. You can check them here:
Stack Overflow Annual Developer Survey - 2021 Report
This project focuses on the most unusual and curious questions asked since the surveys began. I also take the opportunity to apply some good practices in data visualization. The analysis takes the raw data from 2021 and previous years, explores and processes it with Python in a Jupyter Notebook, and then plots the graphs using the Matplotlib library.
Import the required libraries and load the datasets from the local directory. The CSV files used are available in the data/ folder of this repository.
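As a rough illustration of the loading step, the sketch below reads every CSV file found in a folder into a dictionary of DataFrames. The helper name `load_surveys` and the keying by file name are assumptions made for this example; the notebook may load each file individually.

```python
from pathlib import Path

import pandas as pd


def load_surveys(data_dir="data"):
    """Load every CSV in data_dir into a dict keyed by file name (no extension)."""
    surveys = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        # low_memory=False reads each file in one pass, so column dtypes are
        # inferred consistently on wide survey files.
        surveys[csv_path.stem] = pd.read_csv(csv_path, low_memory=False)
    return surveys
```

Calling `load_surveys("data")` after unzipping the datasets would return one DataFrame per survey year, assuming the folder holds one CSV per year.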
Explore the questions and answers used throughout the rest of the notebook, with a quick Pandas summary of each. Eight questions were selected to be plotted:
- How many caffeinated beverages per day?
- Tabs or spaces? (2015 and 2017)
- How much do you agree or disagree with the following statement? I want to go to Mars right now, even if there's a chance I never come back.
- Star Wars or Star Trek?
- Dogs or cats?
- Do you believe in aliens?
- How do you pronounce "GIF"?
- Are you the "IT support person" for your family?
Some questions have a relatively high percentage of null values, such as Star Wars vs. Star Trek, with 38% nulls. However, the absolute number is still high enough (34,398 respondents in that example) to provide representative descriptive statistics. Since any imputation method for the missing data would introduce bias, the null values are simply removed.
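A minimal sketch of this check on toy data is shown below: measure the share of nulls in a column, then drop them. The column name `StarWarsOrStarTrek` and the values are hypothetical stand-ins, not the real survey schema.

```python
import pandas as pd

# Toy answers; the column name is a hypothetical stand-in for the survey field.
answers = pd.DataFrame(
    {"StarWarsOrStarTrek": ["Star Wars", None, "Star Trek", "Star Wars", None]}
)

column = answers["StarWarsOrStarTrek"]
null_share = column.isna().mean() * 100  # percentage of missing answers
valid = column.dropna()                  # keep only the actual answers

print(f"{null_share:.0f}% nulls, {len(valid)} valid answers")
```

Dropping the nulls keeps the reported percentages relative to the people who actually answered, rather than to everyone who took the survey.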
Transform the data into the correct format to be plotted. Since the datasets provided by Stack Overflow are already clean, this mainly means making aesthetic adjustments to strings, extracting the x and y values from the DataFrames and converting some absolute values into percentages.
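The transformation can be sketched roughly as follows, assuming a single-choice question stored as one string column. The function name `to_percentages` and the sample values are illustrative only.

```python
import pandas as pd

# Toy single-choice answers; the name and values are illustrative only.
answers = pd.Series(["dogs", "cats", "dogs", "dogs", None], name="DogsOrCats")


def to_percentages(series):
    """Return (labels, percentages) for the answers in a single-choice column."""
    shares = (
        series.dropna()
        .str.capitalize()              # cosmetic adjustment to the label strings
        .value_counts(normalize=True)  # relative frequencies, largest first
        .mul(100)
        .round(1)
    )
    return shares.index.tolist(), shares.tolist()


labels, percentages = to_percentages(answers)
```

Here `labels` holds the category names for one axis and `percentages` the matching values for the other, ordered from the most to the least frequent answer.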
Create functions to plot the data with Matplotlib. The visualizations also use a custom Matplotlib style sheet, available in style/minimal.mplstyle, as a starting point for the plots.
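One possible shape for such a plotting function is sketched below. The name `plot_bars`, its parameters and the choice of a horizontal bar chart are assumptions for illustration, not the notebook's actual functions. The style sheet is applied only if the file exists, so the sketch also runs outside the repository.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a plain script
import matplotlib.pyplot as plt


def plot_bars(labels, percentages, title, style_path="style/minimal.mplstyle"):
    """Draw a horizontal bar chart, using the style sheet when it is available."""
    # Fall back to Matplotlib's default style if the file is not found.
    style = style_path if Path(style_path).is_file() else "default"
    with plt.style.context(style):
        fig, ax = plt.subplots(figsize=(6, 3))
        bars = ax.barh(labels, percentages)
        ax.invert_yaxis()                 # first label at the top
        ax.bar_label(bars, fmt="%.0f%%")  # write the value next to each bar
        ax.set_title(title, loc="left")
        ax.xaxis.set_visible(False)       # the bar labels already carry the values
    return fig, ax
```

Labeling the bars directly and hiding the value axis is one way to reduce clutter. Inside a notebook, the `matplotlib.use("Agg")` line can be dropped so figures render inline.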
The visualizations are inspired by the book Storytelling with Data, by Cole Nussbaumer Knaflic. Some examples of the graphs produced can be seen below.
Save the plots as PNG images in the images/en-us/ folder.
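Saving a figure could look like the sketch below. The helper name `save_png`, the resolution and the tight bounding box are assumptions for this example; only the output folder comes from the project.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a plain script
import matplotlib.pyplot as plt


def save_png(fig, name, out_dir="images/en-us"):
    """Write fig to out_dir/name.png and return the resulting path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = out_path / f"{name}.png"
    fig.savefig(target, dpi=200, bbox_inches="tight")
    plt.close(fig)  # release the figure's memory once it is on disk
    return target
```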
The descriptive statistics shed light on developers' preferences and opinions on unusual topics and, more generally, give a better picture of the average developer persona.
The results of this project were published in an article on Medium:
- Python 3.11.3
- Pandas 1.5.3
- Matplotlib 3.7.1
- Jupyter Notebook 6.5.4
- Install the dependencies.
- Clone the git repository:
git clone https://github.com/gabrieltempass/stack-overflow-survey.git
- Unzip the file data/zipped_folder.zip.
- Go to the project's directory.
- Create the environment:
conda env create -f environment.yml
- Activate the environment:
conda activate stack-overflow-survey
- Open the Jupyter Notebook:
jupyter notebook "stack_overflow_survey_en-us.ipynb"
All eleven datasets used in this analysis (from 2011 to 2021) are publicly available for download from Stack Overflow's website: