Hi, my name is Harrison Jansma.
I am an avid competitive FPS gamer, sci-fi reader, and student. I love to learn how things work, whether that means studying good coding practices, engineering techniques, or machine learning methods. Much of my experience revolves around building machine learning applications, but I also strive to gain a deeper understanding of the world so that I can expand my skillset and build new and amazing things.
I built this site in HTML, CSS, and JavaScript, adapting pieces of an existing design from Colorlib. Although I am not interested in front-end development, I created and deployed the website on a private DigitalOcean server so that I could learn more about web app design and back-end development. I learned a lot from making this website, but much of the joy in this project came from making something of my own and putting it out into the world.
I write because I love to teach. Though writing has fallen onto the back burner with work and school, in the past I have been published in multiple major data science and analytics publications, including freeCodeCamp (500k subscribers), Towards Data Science, and KDnuggets.
For the past 6 months I have been working as a Data Science Intern at Sprint, where I have gained substantial experience in database manipulation and extraction, ML model development, and business communication. Much of my work has centered on building machine learning applications that predict when business systems will fail.
Tested Word2Vec embeddings with a variety of sequential-input deep learning models on the task of language modeling (predicting the next word in a sentence).
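A rough sketch of this setup is below, assuming gensim for the Word2Vec step and Keras for the LSTM; the toy corpus, context length, and hyperparameters are placeholders, not the project's actual configuration.

```python
# Minimal sketch: gensim Word2Vec embeddings feeding a small Keras LSTM
# that predicts the next word. All data and settings here are illustrative.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

# 1. Learn word vectors with Word2Vec.
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
vocab = list(w2v.wv.index_to_key)
word_to_idx = {w: i for i, w in enumerate(vocab)}
embedding_matrix = np.vstack([w2v.wv[w] for w in vocab])

# 2. Build (3-word context, next word) training pairs.
X, y = [], []
for sent in sentences:
    ids = [word_to_idx[w] for w in sent]
    for i in range(3, len(ids)):
        X.append(ids[i - 3:i])
        y.append(ids[i])
X, y = np.array(X), np.array(y)

# 3. Frozen Word2Vec embeddings -> LSTM -> softmax over the vocabulary.
model = Sequential([
    Embedding(len(vocab), 50, embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    LSTM(64),
    Dense(len(vocab), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=10, verbose=0)
```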
December 10, 2019
A fully functional, SQL-compliant database implemented from scratch in Python. DavisBase stores data in a custom-designed bit-level encoding for maximal compression. With a fixed 512Kb file size, DavisBase performs well in low-memory environments while keeping query times low.
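DavisBase's real encoding is custom and bit-level, so the snippet below is only a generic illustration of the underlying idea: packing a record into a compact binary form inside a fixed-size page using Python's struct module. The field layout and page size here are assumptions made for the example.

```python
import struct

PAGE_SIZE = 512  # bytes; an assumed page size for this toy example only

def encode_record(row_id: int, age: int, name: str) -> bytes:
    """Pack (row_id, age, name) into a compact binary record."""
    name_bytes = name.encode("utf-8")
    # 4-byte row id, 1-byte age, 1-byte name length, then the name itself
    return struct.pack(f">IBB{len(name_bytes)}s", row_id, age, len(name_bytes), name_bytes)

def decode_record(blob: bytes):
    """Inverse of encode_record."""
    row_id, age, name_len = struct.unpack_from(">IBB", blob, 0)
    (name_bytes,) = struct.unpack_from(f">{name_len}s", blob, 6)
    return row_id, age, name_bytes.decode("utf-8")

page = bytearray(PAGE_SIZE)              # records would be packed into fixed-size pages
record = encode_record(1, 27, "Harrison")
page[:len(record)] = record
print(decode_record(bytes(page)))        # (1, 27, 'Harrison')
```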
December 02, 2019
My implementation of dynamic policy gradients in Python. This reinforcement learning algorithm was then used to train an agent to traverse a dangerous environment.
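As a hedged illustration of the policy-gradient idea, not the project's environment or exact algorithm, here is a tiny REINFORCE-style update on an assumed one-dimensional corridor, written in plain NumPy.

```python
# Minimal REINFORCE-style policy gradient on a 1-D corridor environment.
# The environment, reward shaping, and hyperparameters are illustrative assumptions.
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # states 0..4; actions: 0 = left, 1 = right
GOAL, GAMMA, LR = 4, 0.95, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))   # tabular softmax-policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(rng):
    """Roll out one episode and return (state, action, reward) triples."""
    s, traj = 0, []
    for _ in range(20):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else -0.01
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj

rng = np.random.default_rng(0)
for episode in range(500):
    traj, G = run_episode(rng), 0.0
    # Walk backwards, accumulating discounted return, and push up the
    # log-probability of each action in proportion to that return.
    for s, a, r in reversed(traj):
        G = r + GAMMA * G
        grad_log = -softmax(theta[s])
        grad_log[a] += 1.0              # d/dtheta log pi(a|s) for a softmax policy
        theta[s] += LR * G * grad_log
```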
November 06, 2019
Custom implementation of Hidden Markov Models to assign part-of-speech labels to a free-text dataset. The model was coded from scratch in base Python and uses the Viterbi algorithm to decode the most likely tag sequence for a given sentence.
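A compact Viterbi sketch in base Python is shown below, in the spirit of the project; the toy tag set, vocabulary, and probabilities are made-up assumptions rather than values learned from the dataset.

```python
import math

tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.5, "bark": 0.1, "sleep": 0.4},
          "VERB": {"dogs": 0.1, "bark": 0.5, "sleep": 0.4}}

def viterbi(words):
    """Return the most likely tag sequence for `words` (log-space Viterbi)."""
    V = [{t: (math.log(start_p[t]) + math.log(emit_p[t][words[0]]), [t]) for t in tags}]
    for w in words[1:]:
        layer = {}
        for t in tags:
            score, path = max(
                (V[-1][prev][0] + math.log(trans_p[prev][t]) + math.log(emit_p[t][w]),
                 V[-1][prev][1])
                for prev in tags)
            layer[t] = (score, path + [t])
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["dogs", "bark"]))   # ['NOUN', 'VERB']
```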
October 22, 2019
Applied neural networks to the task of finding named entities (like "Google" or "Harrison") in the CoNLL-2003 dataset. A logistic regression baseline and an LSTM were trained on the data, achieving F1 scores of 0.926 and 0.899 respectively.
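For a flavor of the baseline half of this comparison (the LSTM is omitted), here is a sketch of per-token surface features feeding scikit-learn's LogisticRegression; the tiny labeled example is invented, not drawn from CoNLL-2003.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tokens = ["Harrison", "works", "at", "Google", "in", "Dallas"]
labels = ["PER", "O", "O", "ORG", "O", "LOC"]

def features(i):
    """Simple surface features for the i-th token."""
    w = tokens[i]
    return {"word": w.lower(), "is_title": w.istitle(),
            "prev": tokens[i - 1].lower() if i else "<s>"}

X = [features(i) for i in range(len(tokens))]
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, labels)
print(clf.predict([{"word": "google", "is_title": True, "prev": "at"}]))
```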
September 20, 2019
Utilized Python's XGBoost package to implement gradient boosting on a textual dataset. Similar code was later used during my work at Sprint to build machine learning models for text classification.
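A rough sketch of the pipeline, assuming TF-IDF features feeding an XGBoost classifier; the four-sentence toy corpus and parameters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

texts = ["the service was great", "the outage caused failures",
         "everything works fine", "system errors and crashes"]
labels = [0, 1, 0, 1]   # e.g. 1 = incident-related text (illustrative)

# Vectorize the text, then fit a small gradient-boosted tree model.
X = TfidfVectorizer().fit_transform(texts)
model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, labels)
print(model.predict(X))
```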
June 01, 2019
Worked on an AWS EMR cluster to learn the basics of PySpark and Hive during my time at Sprint. These tools allowed me to collect massive amounts of data from Sprint's production data lake.
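The pattern looks roughly like the sketch below; the table name and columns are hypothetical placeholders, not Sprint's actual schema, and running it assumes a Spark installation with Hive support configured.

```python
from pyspark.sql import SparkSession

# Start a Spark session that can read Hive tables (requires a configured Hive metastore).
spark = (SparkSession.builder
         .appName("hive-extract-sketch")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("""
    SELECT account_id, event_ts, status_code
    FROM prod_lake.system_events        -- hypothetical Hive table
    WHERE event_ts >= '2019-01-01'
""")
df.groupBy("status_code").count().show()
```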
May 10, 2019
The only way to become a not-garbage-coder is to code a lot. This repository contains some of the coursework I've completed over the last few months. As the semester winds down, I hope to start a new passion project soon!
March 19, 2019
Medium is a blogging platform where writers and readers share their ideas. The purpose of this project was to give Medium writers a benchmark to measure their own performance, as well as a goal that might increase the ranking of their stories in Medium's recommendation engine. With more than two hundred thousand writers in my dataset, this project has the potential to ease the creative process for thousands and to improve the quality of Medium's stories for its readers.
By collecting data on one million Medium stories, I was able to analyze the performance of Medium's articles. As a result of this project, I found that the top 1% of Medium articles receive two thousand claps. Authors can use this metric as a goal when writing future stories. By reaching the top 1% of claps, a writer's story is more likely to stand out to Medium's recommendation engine and, as a result, reach new and diverse audiences.
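The threshold itself is a one-line percentile computation; a hedged pandas sketch is below, where the CSV path and column name are assumptions about the dataset's layout.

```python
import pandas as pd

stories = pd.read_csv("medium_stories.csv")           # hypothetical export of the scraped data
top_1_pct_claps = stories["claps"].quantile(0.99)      # claps needed to reach the top 1%
print(f"Top 1% of stories receive at least {top_1_pct_claps:.0f} claps")
```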
The results of my analysis, along with an extensive exploratory data analysis of Medium, can be found in this repository.
I also wrote a story detailing my findings in Medium's largest tech publication, freeCodeCamp (496k subscribers). The full article can be found here. I then published the full dataset for public use by Medium's data science community. All 1.4 million data points are freely available on Kaggle. My introductory article, describing the dataset and how I collected it, can be found here.
October 10, 2018
This experiment tests whether convolutional neural networks with dropout or with batch normalization perform better on image recognition tasks. The notebook in this repository provides the experimental evidence behind the Medium post I wrote on how to build convolutional neural networks more effectively.
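The comparison boils down to training two otherwise-identical small CNNs, one regularized with dropout and one with batch normalization. The Keras sketch below illustrates that setup; the architecture, input size, and class count are assumptions rather than the notebook's exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def small_cnn(use_dropout: bool) -> keras.Sequential:
    """Identical CNNs except for the regularization layer under test."""
    reg = layers.Dropout(0.25) if use_dropout else layers.BatchNormalization()
    return keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        reg,
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

dropout_cnn, batchnorm_cnn = small_cnn(True), small_cnn(False)
for model in (dropout_cnn, batchnorm_cnn):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# Both models would then be trained on the same image data and compared on validation accuracy.
```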
The above blog post has been published and featured in Towards Data Science, earning 3K reads on Medium within two weeks. It has also been reposted as a guest blog on KDnuggets, a leading site on Analytics, Big Data, Data Science, and Machine Learning, reaching over 500K unique visitors per month and over 230K subscribers/followers via email and social media.
August 15, 2018

Object Localization Featuring my dog, Huckleberry.
In this project, I implemented the deep learning method for object localization (finding objects in an image) proposed in this research paper. I improved code written by Alexis Cook to handle multi-class localization of images.
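As a loose illustration of the localization task (not the referenced paper's method or Alexis Cook's code), the sketch below shows the general multi-output idea in Keras: one head predicts the object class, another regresses a bounding box. Every architectural choice here is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

class_out = layers.Dense(3, activation="softmax", name="label")(x)  # e.g. dog / cat / background
box_out = layers.Dense(4, activation="sigmoid", name="box")(x)      # normalized (x, y, w, h)

model = keras.Model(inputs, [class_out, box_out])
model.compile(optimizer="adam",
              loss={"label": "sparse_categorical_crossentropy", "box": "mse"})
```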
Computer vision has innumerable real-world applications, and this project was my introduction to the world of computer vision research. Since the conclusion of this project, I have focused heavily on researching recent advances in convolutional neural network architectures, with an emphasis on learning how to apply these concepts using TensorFlow and Keras.
July 16, 2018
This project was motivated by my drive to learn about the best practices of predictive modeling in text data. In the write-up, I cleaned and vectorized Twitter data, visualized and examined patterns, and created a linear classifier to predict document sentiment with 89% accuracy on a validation set.
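The modeling step amounts to vectorizing the text and fitting a linear classifier. A minimal sketch is below; the example tweets and the specific linear model are assumptions, not the write-up's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["I love this phone", "worst customer service ever",
          "pretty happy with the update"]
sentiment = [1, 0, 1]   # 1 = positive, 0 = negative (illustrative labels)

# TF-IDF vectorization followed by a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, sentiment)
print(clf.predict(["this is terrible"]))
```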
In the future, I would like to productionize this NLP model by creating a REST API to allow others access to my predictions.
June 20, 2018
In this project, I used unsupervised learning to cluster forum discussions. Specifically, I performed LDA on Wikipedia forum comments to see if I could isolate clusters of toxic comments (insults, slurs, and the like).
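The core step looks roughly like the scikit-learn sketch below; the example comments and the topic count are placeholders, not the Wikipedia dataset or the project's chosen settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = ["please cite your sources for this edit",
            "you are an idiot and your edits are garbage",
            "thanks for fixing the reference formatting",
            "this talk page is not a forum for insults"]

# Bag-of-words counts feeding LDA topic modeling.
counts = CountVectorizer(stop_words="english").fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)          # per-comment topic proportions
print(topic_mix.argmax(axis=1))                # dominant topic per comment
```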
I was successful in isolating toxic comments into one group. Furthermore, I gained valuable knowledge about the discussions held within the forum dataset, labeling forum posts into nine distinct categories. These nine categories could be further grouped as either relevant discussion, side conversations, or outright toxic comments.
June 13, 2018
In this write-up I sought to answer whether a survey of tech-industry employees' mental health benefits could be used to cluster employees into groups with good and bad mental health coverage.
By cleaning the survey data and performing an exploratory data analysis, I was able to analyze the demographics of the tech industry; I found the average respondent was a 35-year-old male located in the United States. I then attempted to cluster the data using KMeans and agglomerative clustering (with scikit-learn), but found that the survey's design prevented any meaningful insight from emerging.
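A minimal sketch of that clustering step is below, assuming already-encoded numeric survey features; the random matrix stands in for the cleaned responses, and the cluster counts are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(300, 8)))  # stand-in for encoded survey answers

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Judge cluster quality (silhouette) and agreement between the two methods.
print("silhouette:", round(silhouette_score(X, kmeans_labels), 3))
print("agreement :", round(adjusted_rand_score(kmeans_labels, agglo_labels), 3))
```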
In completing this project, I learned how to encode categorical data and create an insightful EDA with great visualizations. I also learned how to implement clustering methods on data, and analyze the appropriateness of the clustering method with various techniques.
May 23, 2018
In this write-up, I sought to practice the entire data science lifecycle. This includes defining project end goals, data cleaning, exploratory data analysis, model comparisons, and model tuning.
After a brief EDA, I visualized the Titanic dataset via a 2D projection. I then compared several machine learning algorithms and found the most accurate model to be a Gradient Boosted Machine. After a model-tuning phase, I increased model accuracy from 77% to 79%.
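The tuning phase can be summarized by a grid search over a gradient boosted machine, sketched below with scikit-learn; the feature matrix and grid values are placeholders, not the Titanic data or my actual search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # stand-in for the cleaned Titanic features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in for the survival labels

grid = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [50, 100],
                "max_depth": [2, 3],
                "learning_rate": [0.05, 0.1]},
    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```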
May 3, 2018
- An Overview of Linear Models in Scikit Learn (Jupyter Notebook)
- Web Scraping Overwatch Data from MasterOverwatch.com (Jupyter Notebook)
- FICO Score Statistical Regression (Jupyter Notebook)





