Hi, my name is Harrison Jansma.
I am an avid competitive FPS gamer, sci-fi reader, and student. I love to learn how things work, whether that means studying good coding practices, engineering techniques, or machine learning methods. Much of my experience revolves around building machine learning applications, but I also strive to gain a deeper understanding of the world so that I can expand my skillset and build new and amazing things.
I built this site in HTML, CSS, and JavaScript, adapting pieces of an existing design from Colorlib. Although I am not interested in front-end development, I created and deployed the website on a private DigitalOcean server so that I could learn more about web app design and back-end development. I learned a lot from making this website, but much of the joy in this project came from making something of my own and putting it out into the world.
I write because I love to teach. Though writing has fallen onto the back burner with work and school, in the past I have been published in multiple major data science and analytics publications, including freeCodeCamp (500k subscribers), Towards Data Science, and KDnuggets.
For the past 6 months I have been working as a Data Science Intern at Sprint, where I have gained substantial experience in database manipulation and extraction, ML model development, and business communication. Much of my work has centered on building machine learning applications that predict when business systems will fail.
Tested Word2Vec embeddings with a variety of sequential-input deep learning models on the task of language modeling (predicting the next word in a sentence).
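A rough sketch of this setup is below, assuming gensim for the Word2Vec step and Keras for the LSTM; the toy corpus, context length, and hyperparameters are placeholders, not the project's actual configuration.

```python
# Minimal sketch: gensim Word2Vec embeddings feeding a small Keras LSTM
# that predicts the next word. All data and settings here are illustrative.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

# 1. Learn word vectors with Word2Vec.
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
vocab = list(w2v.wv.index_to_key)
word_to_idx = {w: i for i, w in enumerate(vocab)}
embedding_matrix = np.vstack([w2v.wv[w] for w in vocab])

# 2. Build (3-word context, next word) training pairs.
X, y = [], []
for sent in sentences:
    ids = [word_to_idx[w] for w in sent]
    for i in range(3, len(ids)):
        X.append(ids[i - 3:i])
        y.append(ids[i])
X, y = np.array(X), np.array(y)

# 3. Frozen Word2Vec embeddings -> LSTM -> softmax over the vocabulary.
model = Sequential([
    Embedding(len(vocab), 50, embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    LSTM(64),
    Dense(len(vocab), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=10, verbose=0)
```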
December 10, 2019
A fully functional, SQL-compliant database implemented from scratch in Python. DavisBase stores data in a custom-designed bit-level encoding for maximal compression. With a fixed 512Kb file size, DavisBase performs well in low-memory environments while keeping query times low.
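DavisBase's real encoding is custom and bit-level, so the snippet below is only a generic illustration of the underlying idea: packing a record into a compact binary form inside a fixed-size page using Python's struct module. The field layout and page size here are assumptions made for the example.

```python
import struct

PAGE_SIZE = 512  # bytes; an assumed page size for this toy example only

def encode_record(row_id: int, age: int, name: str) -> bytes:
    """Pack (row_id, age, name) into a compact binary record."""
    name_bytes = name.encode("utf-8")
    # 4-byte row id, 1-byte age, 1-byte name length, then the name itself
    return struct.pack(f">IBB{len(name_bytes)}s", row_id, age, len(name_bytes), name_bytes)

def decode_record(blob: bytes):
    """Inverse of encode_record."""
    row_id, age, name_len = struct.unpack_from(">IBB", blob, 0)
    (name_bytes,) = struct.unpack_from(f">{name_len}s", blob, 6)
    return row_id, age, name_bytes.decode("utf-8")

page = bytearray(PAGE_SIZE)              # records would be packed into fixed-size pages
record = encode_record(1, 27, "Harrison")
page[:len(record)] = record
print(decode_record(bytes(page)))        # (1, 27, 'Harrison')
```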
December 02, 2019
My implementation of dynamic policy gradients in Python. This reinforcement learning algorithm was then used to train an agent to traverse a dangerous environment.
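As a hedged illustration of the policy-gradient idea, not the project's environment or exact algorithm, here is a tiny REINFORCE-style update on an assumed one-dimensional corridor, written in plain NumPy.

```python
# Minimal REINFORCE-style policy gradient on a 1-D corridor environment.
# The environment, reward shaping, and hyperparameters are illustrative assumptions.
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # states 0..4; actions: 0 = left, 1 = right
GOAL, GAMMA, LR = 4, 0.95, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))   # tabular softmax-policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(rng):
    """Roll out one episode and return (state, action, reward) triples."""
    s, traj = 0, []
    for _ in range(20):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else -0.01
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj

rng = np.random.default_rng(0)
for episode in range(500):
    traj, G = run_episode(rng), 0.0
    # Walk backwards, accumulating discounted return, and push up the
    # log-probability of each action in proportion to that return.
    for s, a, r in reversed(traj):
        G = r + GAMMA * G
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0              # d/dtheta log pi(a|s) for a softmax policy
        theta[s] += LR * G * grad_log
```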
November 06, 2019
Custom implementation of Hidden Markov Models to assign part-of-speech labels to a free-text dataset. The model was coded from scratch in base Python and uses the Viterbi algorithm to decode the most likely tag sequence for a given sentence.
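A compact Viterbi sketch in base Python is shown below, in the spirit of the project; the toy tag set, vocabulary, and probabilities are made-up assumptions rather than values learned from the dataset.

```python
import math

tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1, "sleep": 0.4},
          "VERB": {"dogs": 0.1, "bark": 0.5, "sleep": 0.4}}

def viterbi(words):
    """Return the most likely tag sequence for `words` (log-space Viterbi)."""
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t][words[0]]), [t]) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            score, path = max(
                (V[-1][prev][0] + math.log(trans_p[prev][t]) + math.log(emit_p[t][w]),
                 V[-1][prev][1])
                for prev in tags)
            layer[t] = (score, path + [t])
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["dogs", "bark"]))   # ['NOUN', 'VERB']
```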
October 22, 2019
Applied neural networks to the task of finding named entities (like "Google" or "Harrison") in the CoNLL-2003 dataset. A logistic regression baseline and an LSTM were trained on the data, achieving F1 scores of 0.926 and 0.899 respectively.
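For a flavor of the baseline half of this comparison (the LSTM is omitted), here is a sketch of per-token surface features feeding scikit-learn's LogisticRegression; the tiny labeled example is invented, not drawn from CoNLL-2003.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tokens = ["Harrison", "works", "at", "Google", "in", "Dallas"]
labels = ["PER", "O", "O", "ORG", "O", "LOC"]

def features(i):
    """Simple surface features for the i-th token."""
    w = tokens[i]
    return {"word": w.lower(), "is_title": w.istitle(),
            "prev": tokens[i - 1].lower() if i else "<s>"}

X = [features(i) for i in range(len(tokens))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, labels)
print(clf.predict([{"word": "google", "is_title": True, "prev": "at"}]))
```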
September 20, 2019
Utilized Python's XGBoost package to implement gradient boosting on a textual dataset. Similar code was later used during my work at Sprint to build machine learning models for text classification.
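A rough sketch of the pipeline, assuming TF-IDF features feeding an XGBoost classifier; the four-sentence toy corpus and parameters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

texts = ["the service was great", "the outage caused failures",
         "everything works fine", "system errors and crashes"]
labels = [0, 1, 0, 1]   # e.g. 1 = incident-related text (illustrative)

# Vectorize the text, then fit a small gradient-boosted tree model.
X = TfidfVectorizer().fit_transform(texts)
model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, labels)
print(model.predict(X))
```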
June 01, 2019
Worked on an AWS EMR cluster to learn the basics of PySpark and Hive during my time at Sprint. These tools allowed me to collect massive amounts of data from Sprint's production data lake.
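The pattern looks roughly like the sketch below; the table name and columns are hypothetical placeholders, not Sprint's actual schema, and running it assumes a Spark installation with Hive support configured.

```python
from pyspark.sql import SparkSession

# Start a Spark session that can read Hive tables (requires a configured Hive metastore).
spark = (SparkSession.builder
         .appName("hive-extract-sketch")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("""
    SELECT account_id, event_ts, status_code
    FROM prod_lake.system_events        -- hypothetical Hive table
    WHERE event_ts >= '2019-01-01'
""")
df.groupBy("status_code").count().show()
```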
May 10, 2019
The only way to become a not-garbage-coder is to code a lot. This repository contains some of the coursework I've completed over the last few months. As the semester winds down, I hope to start a new passion project soon!
March 19, 2019
Medium is a blogging platform where writers and readers share their ideas. The purpose of this project was to give Medium writers a benchmark to measure their own performance, as well as a goal that might increase the ranking of their stories in Medium's recommendation engine. With more than two hundred thousand writers in my dataset, this project has the potential to ease the creative process for thousands and to improve the quality of Medium's stories for its readers.
By collecting data on one million Medium stories, I was able to analyze the performance of Medium's articles. As a result of this project, I found that the top 1% of Medium articles receive two thousand claps. Authors can use this metric as a goal when writing future stories. By reaching the top 1% of claps, a writer's story is more likely to stand out to Medium's recommendation engine and, as a result, reach new and diverse audiences.
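The threshold itself is a one-line percentile computation; a hedged pandas sketch is below, where the CSV path and column name are assumptions about the dataset's layout.

```python
import pandas as pd

stories = pd.read_csv("medium_stories.csv")           # hypothetical export of the scraped data
top_1_pct_claps = stories["claps"].quantile(0.99)      # claps needed to reach the top 1%
print(f"Top 1% of stories receive at least {top_1_pct_claps:.0f} claps")
```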
The results of my analysis, along with an extensive exploratory data analysis of Medium, can be found in this repository.
I also wrote a story detailing my findings in Medium's largest tech publication, freeCodeCamp (496k subscribers). The full article can be found here. I then published the full dataset for public use by Medium's data science community. All 1.4 million data points are freely available on Kaggle. My introductory article, describing the dataset and how I collected it, can be found here.
October 10, 2018
This experiment tests whether convolutional neural networks with dropout or with batch normalization perform better on image recognition tasks. The notebook in this repository provides the experimental evidence behind the Medium post I wrote on how to build convolutional neural networks more effectively.
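The comparison boils down to training two otherwise-identical small CNNs, one regularized with dropout and one with batch normalization. The Keras sketch below illustrates that setup; the architecture, input size, and class count are assumptions rather than the notebook's exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def small_cnn(use_dropout: bool) -> keras.Sequential:
    """Identical CNNs except for the regularization layer under test."""
    reg = layers.Dropout(0.25) if use_dropout else layers.BatchNormalization()
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        reg,
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

dropout_cnn, batchnorm_cnn = small_cnn(True), small_cnn(False)
for model in (dropout_cnn, batchnorm_cnn):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# Both models would then be trained on the same image data and compared on validation accuracy.
```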
The above blog post has been published and featured in Towards Data Science, earning 3K reads on Medium within two weeks. It has also been reposted as a guest blog on KDnuggets, a leading site on Analytics, Big Data, Data Science, and Machine Learning, reaching over 500K unique visitors per month and over 230K subscribers/followers via email and social media.
August 15, 2018

Object Localization Featuring my dog, Huckleberry.
In this project, I implemented the deep learning method for object localization (finding objects in an image) proposed in this research paper. I improved code written by Alexis Cook to handle multi-class localization of images.
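As a loose illustration of the localization task (not the referenced paper's method or Alexis Cook's code), the sketch below shows the general multi-output idea in Keras: one head predicts the object class, another regresses a bounding box. Every architectural choice here is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

class_out = layers.Dense(3, activation="softmax", name="label")(x)  # e.g. dog / cat / background
box_out = layers.Dense(4, activation="sigmoid", name="box")(x)      # normalized (x, y, w, h)

model = keras.Model(inputs, [class_out, box_out])
model.compile(optimizer="adam",
              loss={"label": "sparse_categorical_crossentropy", "box": "mse"})
```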
Computer vision has innumerable real-world applications, and this project was my introduction to the world of computer vision research. Since the conclusion of this project, I have focused heavily on researching recent advances in convolutional neural network architectures, with an emphasis on learning how to apply these concepts using TensorFlow and Keras.
July 16, 2018
This project was motivated by my drive to learn about the best practices of predictive modeling in text data. In the write-up, I cleaned and vectorized Twitter data, visualized and examined patterns, and created a linear classifier to predict document sentiment with 89% accuracy on a validation set.
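The modeling step amounts to vectorizing the text and fitting a linear classifier. A minimal sketch is below; the example tweets and the specific linear model are assumptions, not the write-up's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["I love this phone", "worst customer service ever",
          "pretty happy with the update"]
sentiment = [1, 0, 1]   # 1 = positive, 0 = negative (illustrative labels)

# TF-IDF vectorization followed by a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, sentiment)
print(clf.predict(["this is terrible"]))
```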
In the future, I would like to productionize this NLP model by creating a REST API to allow others access to my predictions.
June 20, 2018
In this project, I used unsupervised learning to cluster forum discussions. Specifically, I performed LDA on Wikipedia forum comments to see if I could isolate clusters of toxic comments (insults, slurs, and the like).
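The core step looks roughly like the scikit-learn sketch below; the example comments and the topic count are placeholders, not the Wikipedia dataset or the project's chosen settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = ["please cite your sources for this edit",
            "you are an idiot and your edits are garbage",
            "thanks for fixing the reference formatting",
            "this talk page is not a forum for insults"]

# Bag-of-words counts feeding LDA topic modeling.
counts = CountVectorizer(stop_words="english").fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)          # per-comment topic proportions
print(topic_mix.argmax(axis=1))                # dominant topic per comment
```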
I was successful in isolating toxic comments into one group. Furthermore, I gained valuable knowledge about the discussions held within the forum dataset, labeling forum posts into nine distinct categories. These nine categories could be further grouped as either relevant discussion, side conversations, or outright toxic comments.
June 13, 2018
In this write-up I sought to answer whether a survey of tech-industry employees' mental health benefits could be used to cluster employees into groups with good and bad mental health coverage.
By cleaning the survey data and performing an exploratory data analysis, I was able to analyze the demographics of the tech industry; I found the average respondent was a 35-year-old male located in the United States. I then attempted to cluster the data using KMeans and agglomerative clustering (with scikit-learn), but found that the survey's design prevented any meaningful insight from emerging.
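A minimal sketch of that clustering step is below, assuming already-encoded numeric survey features; the random matrix stands in for the cleaned responses, and the cluster counts are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(300, 8)))  # stand-in for encoded survey answers

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Judge cluster quality (silhouette) and agreement between the two methods.
print("silhouette:", round(silhouette_score(X, kmeans_labels), 3))
print("agreement :", round(adjusted_rand_score(kmeans_labels, agglo_labels), 3))
```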
In completing this project, I learned how to encode categorical data and create an insightful EDA with great visualizations. I also learned how to implement clustering methods on data, and analyze the appropriateness of the clustering method with various techniques.
May 23, 2018
In this write-up, I sought to practice the entire data science lifecycle. This includes defining project end goals, data cleaning, exploratory data analysis, model comparisons, and model tuning.
After a brief EDA, I visualized the Titanic dataset via a 2D projection. I then compared several machine learning algorithms and found the most accurate model to be a Gradient Boosted Machine. After a model-tuning phase, I increased model accuracy from 77% to 79%.
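The tuning phase can be summarized by a grid search over a gradient boosted machine, sketched below with scikit-learn; the feature matrix and grid values are placeholders, not the Titanic data or my actual search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # stand-in for the cleaned Titanic features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in for the survival labels

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [50, 100],
                "max_depth": [2, 3],
                "learning_rate": [0.05, 0.1]},
    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```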
May 3, 2018
- An Overview of Linear Models in Scikit Learn (Jupyter Notebook)
- Web Scraping Overwatch Data from MasterOverwatch.com (Jupyter Notebook)
- FICO Score Statistical Regression (Jupyter Notebook)





