lordkevinmo
diff --git a/3D_distribution_based_on_Education_age_Income.png b/3D_distribution_based_on_Education_age_Income.png
97.5 KB
diff --git a/README.txt b/README.txt
+147
diff --git a/Regression.txt b/Regression.txt
+12
diff --git a/agglomerative_clusterin_with_scatter_plot.png b/agglomerative_clusterin_with_scatter_plot.png
4.36 KB
diff --git a/based_content_filtering.py b/based_content_filtering.py
+224
@@ -0,0 +1,147 @@
+Summary
+=======
+
+This dataset (ml-latest) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 22884377 ratings and 586994 tag applications across 34208 movies. These data were created by 247753 users between January 09, 1995 and January 29, 2016. This dataset was generated on January 29, 2016.
+
+Users were selected at random for inclusion. All selected users had rated at least 1 movie. No demographic information is included. Each user is represented by an id, and no other information is provided.
+
+The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.
+
+This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.
+
+This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.
+
+
+Usage License
+=============
+
+Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:
+
+* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
+* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
+* The user may not redistribute the data without separate permission.
+* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
+* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.
+
+In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).
+
+If you have any further questions or comments, please email <[email protected]>
+
+
+Citation
+========
+
+To acknowledge use of the dataset in publications, please cite the following paper:
+
+> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>
+
+
+Further Information About GroupLens
+===================================
+
+GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:
+
+* recommender systems
+* online communities
+* mobile and ubiquitous technologies
+* digital libraries
+* local geographic information systems
+
+GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <[email protected]> - we are always interested in working with external collaborators.
+
+
+Content and Use of Files
+========================
+
+Formatting and Encoding
+-----------------------
+
+The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
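
A minimal sketch of the quoting and encoding rules above, assuming pandas is available. The two sample rows are made up for illustration and only mimic the `movies.csv` layout; they are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the movies.csv layout. The second title contains a
# comma, so the whole field is wrapped in double quotes.
sample = io.BytesIO(
    (
        'movieId,title,genres\n'
        '1,Toy Story (1995),Adventure|Animation\n'
        '2,"Misérables, Les (1995)",Drama|War\n'
    ).encode('utf-8')
)

# Stating the encoding explicitly keeps accented characters intact even if a
# platform default would differ.
movies = pd.read_csv(sample, encoding='utf-8')
print(movies.loc[1, 'title'])
```

The quoted title comes back as a single field, comma and accent included, without any extra parsing options.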
+
+User Ids
+--------
+
+MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).
+
+Movie Ids
+---------
+
+Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).
+
+
+Ratings Data File Structure (ratings.csv)
+-----------------------------------------
+
+All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
+
+    userId,movieId,rating,timestamp
+
+The lines within this file are ordered first by userId, then, within user, by movieId.
+
+Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
+
+Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
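
As an illustration of the timestamp format, the sketch below converts epoch seconds to timezone-aware datetimes with pandas. The rows are invented for the example; only the column names follow `ratings.csv`.

```python
import pandas as pd

# Hypothetical rows in the ratings.csv layout (values are made up).
ratings = pd.DataFrame({
    'userId': [1, 1],
    'movieId': [1, 2],
    'rating': [4.0, 3.5],
    'timestamp': [0, 1454025600],
})

# unit='s' interprets the integers as seconds since 1970-01-01 00:00:00 UTC;
# utc=True makes the resulting datetimes timezone-aware.
ratings['rated_at'] = pd.to_datetime(ratings['timestamp'], unit='s', utc=True)
print(ratings[['timestamp', 'rated_at']])
```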
+
+Tags Data File Structure (tags.csv)
+-----------------------------------
+
+All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
+
+    userId,movieId,tag,timestamp
+
+The lines within this file are ordered first by userId, then, within user, by movieId.
+
+Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag are determined by each user.
+
+Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
+
+Movies Data File Structure (movies.csv)
+---------------------------------------
+
+Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
+
+    movieId,title,genres
+
+Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
+
+Genres are a pipe-separated list, and are selected from the following:
+
+* Action
+* Adventure
+* Animation
+* Children's
+* Comedy
+* Crime
+* Documentary
+* Drama
+* Fantasy
+* Film-Noir
+* Horror
+* Musical
+* Mystery
+* Romance
+* Sci-Fi
+* Thriller
+* War
+* Western
+* (no genres listed)
+
+Links Data File Structure (links.csv)
+---------------------------------------
+
+Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
+
+    movieId,imdbId,tmdbId
+
+movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.
+
+imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.
+
+tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.
+
+Use of the resources listed above is subject to the terms of each provider.
+
+Cross-Validation
+----------------
+
+Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.
@@ -0,0 +1,12 @@
+Regression algorithms
+
+Ordinal regression
+Poisson regression
+Fast forest quantile regression
+Linear, Polynomial, Lasso, Stepwise, Ridge regression
+Bayesian linear regression
+Neural network regression
+Decision forest regression
+Boosted decision tree regression
+K-Nearest Neighbors
+
@@ -0,0 +1,224 @@
+# -*- coding: utf-8 -*-
+"""
+Created on Thu Jan 17 21:36:10 2019
+
+@author: Koffi Moïse AGBENYA
+
+CONTENT-BASED FILTERING
+
+Recommendation systems are a collection of algorithms used to recommend items
+to users based on information taken from the user. These systems have become
+ubiquitous and can commonly be seen in online stores, movie databases and job
+finders. In this script, we will explore content-based recommendation systems
+and implement a simple version of one using Python and the pandas library.
+
+ABOUT DATASET
+
+This dataset (ml-latest) describes 5-star rating and free-text tagging activity 
+from [MovieLens](http://movielens.org), a movie recommendation service. It 
+contains 22884377 ratings and 586994 tag applications across 34208 movies. 
+These data were created by 247753 users between January 09, 1995 and January 
+29, 2016. This dataset was generated on January 29, 2016.
+
+Users were selected at random for inclusion. All selected users had rated at
+least 1 movie. No demographic information is included. Each user is
+represented by an id, and no other information is provided.
+
+The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv`
+and `tags.csv`. More details about the contents and use of all these files
+follow.
+
+This is a *development* dataset. As such, it may change over time and is not an 
+appropriate dataset for shared research results.
+
+"""
+
+#Dataframe manipulation library
+import pandas as pd
+#Math functions, we'll only need the sqrt function so let's import only that
+from math import sqrt
+import numpy as np
+import matplotlib.pyplot as plt
+
+#Storing the movie information into a pandas dataframe
+movies_df = pd.read_csv('movies.csv')
+#Storing the rating information into a pandas dataframe
+ratings_df = pd.read_csv('ratings.csv')
+#Head is a function that gets the first N rows of a dataframe. N's default is 5.
+movies_df.head()
+
+#Let's move the release year out of the title column and into a new year
+#column, using pandas' extract and replace string functions.
+
+#Using a regular expression to find a four-digit year stored between parentheses
+#We require the parentheses so we don't conflict with movies that have years in their titles
+movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
+#Removing the parentheses
+movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
+#Removing the years from the 'title' column (regex=True: newer pandas defaults to literal matching)
+movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
+#Stripping any leading or trailing whitespace left behind by the removal
+movies_df['title'] = movies_df['title'].str.strip()
+movies_df.head()
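
The year handling above can be checked on a few titles in isolation. This is a sketch with made-up titles, not the script's exact code: it uses a single capture group so the parentheses are dropped in one step, and it assumes titles without a year should simply yield a missing value.

```python
import pandas as pd

# Made-up titles: one ordinary, one with a year inside the title itself,
# and one with no release year at all.
titles = pd.Series([
    'Toy Story (1995)',
    '2001: A Space Odyssey (1968)',
    'Untitled Movie',
])

# Capture only the four digits that sit between parentheses.
year = titles.str.extract(r'\((\d{4})\)', expand=False)

# Remove the parenthesised year (and the space before it) from the title.
clean = titles.str.replace(r'\s*\(\d{4}\)', '', regex=True).str.strip()

print(pd.DataFrame({'title': clean, 'year': year}))
```

Requiring the parentheses is what keeps the leading "2001" in the second title from being mistaken for the release year.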
+
+#Every genre is separated by a | so we simply have to call the split function on |
+movies_df['genres'] = movies_df.genres.str.split('|')
+movies_df.head()
+
+"""
+
+Since keeping genres in a list format isn't optimal for the content-based 
+recommendation system technique, we will use the One Hot Encoding technique to 
+convert the list of genres to a vector where each column corresponds to one 
+possible value of the feature. This encoding is needed for feeding categorical 
+data. In this case, we store every different genre in columns that contain 
+either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't. 
+Let's also store this dataframe in another variable since genres won't be 
+important for our first recommendation system.
+
+"""
+
+#Copying the movie dataframe into a new one since we won't need to use the 
+#genre information in our first case.
+moviesWithGenres_df = movies_df.copy()
+
+#For every row in the dataframe, iterate through the list of genres and place a 
+#1 into the corresponding column
+for index, row in movies_df.iterrows():
+    for genre in row['genres']:
+        moviesWithGenres_df.at[index, genre] = 1
+#Filling in the NaN values with 0 to show that a movie doesn't have that 
+#column's genre
+moviesWithGenres_df = moviesWithGenres_df.fillna(0)
+moviesWithGenres_df.head()
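
The row-by-row loop above can be slow on the full movie list. A possible vectorized alternative, sketched here on invented rows, is `Series.str.get_dummies`. Note the assumption: it operates on the raw pipe-separated strings, so it would have to run before the genres column is split into lists.

```python
import pandas as pd

# Made-up movies with raw, pipe-separated genre strings.
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A', 'B', 'C'],
    'genres': ['Adventure|Comedy', 'Drama', 'Comedy|Drama'],
})

# One 0/1 column per distinct genre; columns come back sorted by name.
genre_flags = movies['genres'].str.get_dummies(sep='|')

# Attach the indicator columns to the original frame.
movies_with_genres = pd.concat([movies, genre_flags], axis=1)
print(movies_with_genres)
```

Because absent genres are already encoded as 0, no `fillna(0)` pass is needed with this approach.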
+
+#Let's look at the ratings dataframe
+ratings_df.head()
+
+#Every row in the ratings dataframe associates a user id with one movie, a
+#rating and a timestamp showing when they rated it. We won't be
+#needing the timestamp column, so let's drop it to save on memory.
+
+#Drop removes a specified row or column from a dataframe
+ratings_df = ratings_df.drop(columns='timestamp')
+ratings_df.head()
+
+#Content-Based recommendation system
+
+"""
+
+Now, let's take a look at how to implement Content-Based or Item-Item
+recommendation systems. This technique attempts to figure out what a user's
+favorite aspects of an item are, and then recommends items that present those
+aspects. In our case, we're going to try to figure out the input's favorite
+genres from the movies and ratings given.
+
+Let's begin by creating an input user to recommend movies to:
+
+Notice: To add more movies, simply increase the number of elements in the
+userInput. Feel free to add more in! Just be sure to capitalize each title as
+it appears in the dataset, and if a movie starts with a "The", like
+"The Matrix", then write it in like this: 'Matrix, The'.
+
+"""
+
+userInput = [
+            {'title':'Breakfast Club, The', 'rating':5},
+            {'title':'Toy Story', 'rating':3.5},
+            {'title':'Jumanji', 'rating':2},
+            {'title':"Pulp Fiction", 'rating':5},
+            {'title':'Akira', 'rating':4.5}
+         ] 
+inputMovies = pd.DataFrame(userInput)
+inputMovies
+
+"""
+
+Add movieId to input user
+With the input complete, let's extract the input movies' IDs from the movies
+dataframe and add them into it.
+
+We can achieve this by first filtering out the rows that contain the input 
+movies' title and then merging this subset with the input dataframe. We also 
+drop unnecessary columns for the input to save memory space.
+
+"""
+
+#Filtering out the movies by title
+inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
+#Then merging it so we can get the movieId, joining explicitly on the title column.
+inputMovies = pd.merge(inputId, inputMovies, on='title')
+#Dropping information we won't use from the input dataframe
+inputMovies = inputMovies.drop(columns=['genres', 'year'])
+#Final input dataframe
+#If a movie you added in above isn't here, then it might not be in the original
+#dataframe or it might be spelled differently, please check capitalisation.
+inputMovies
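
An inner merge silently drops input titles that have no exact match. One way to surface them, sketched below with made-up frames and hypothetical names (`movies`, `user_input`, `unmatched`), is a left merge with pandas' `indicator=True` option.

```python
import pandas as pd

# Made-up stand-ins for the movie table and the user's input.
movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story', 'Jumanji'],
})
user_input = pd.DataFrame({
    'title': ['Toy Story', 'jumanji', 'Akira'],
    'rating': [3.5, 2.0, 4.5],
})

# how='left' keeps every input row; indicator=True adds a '_merge' column
# saying whether a matching movie was found.
merged = user_input.merge(movies, on='title', how='left', indicator=True)

# Titles that found no match, e.g. because of capitalisation or absence.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()
print(unmatched)
```

Here the lowercase 'jumanji' and the absent 'Akira' are both reported, which makes spelling problems visible before any weights are computed.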
+
+#We're going to start by learning the input's preferences, so let's get the 
+#subset of movies that the input has watched from the Dataframe containing 
+#genres defined with binary values.
+#Filtering out the movies from the input
+userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
+userMovies
+
+#We'll only need the actual genre table, so let's clean this up a bit by 
+#resetting the index and dropping the movieId, title, genres and year columns.
+
+#Resetting the index to avoid future issues
+userMovies = userMovies.reset_index(drop=True)
+#Dropping unnecessary columns to save memory and to avoid issues
+userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
+userGenreTable
+
+"""
+
+Now we're ready to start learning the input's preferences!
+
+To do this, we're going to turn each genre into weights. We can do this by
+using the input's ratings and multiplying them into the input's genre table and
+then summing up the resulting table by column. This operation is actually a dot
+product between a matrix and a vector, so we can simply accomplish it by calling
+pandas' "dot" function.
+
+"""
+
+inputMovies['rating']
+
+#Dot product to get weights
+userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
+#The user profile
+userProfile
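
A small worked example of the dot product, using invented numbers: three rated movies, two genres. Each genre weight is the sum of the ratings of the movies that carry that genre.

```python
import pandas as pd

# Rows are the user's three rated movies, columns are genre indicators.
genre_table = pd.DataFrame({
    'Comedy': [1, 1, 0],
    'Drama':  [0, 1, 1],
})

# The user's ratings, in the same row order as genre_table.
ratings = pd.Series([5.0, 3.0, 2.0])

# (genres x movies) dot (movies) gives one weight per genre:
# Comedy = 5.0 + 3.0, Drama = 3.0 + 2.0
profile = genre_table.transpose().dot(ratings)
print(profile)
```

The result depends on both objects sharing the same row order and index, which is why the script resets the index of the user's genre table beforehand.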
+
+#Now, we have the weights for each of the user's preferences. This is known as
+#the User Profile. Using this, we can recommend movies that satisfy the user's
+#preferences.
+#Let's start by extracting the genre table from the original dataframe:
+
+#Now let's get the genres of every movie in our original dataframe
+genreTable = moviesWithGenres_df.set_index('movieId')
+#And drop the unnecessary information
+genreTable = genreTable.drop(columns=['title', 'genres', 'year'])
+genreTable.head()
+
+genreTable.shape
+
+#With the input's profile and the complete list of movies and their genres in 
+#hand, we're going to take the weighted average of every movie based on the 
+#input profile and recommend the top twenty movies that most satisfy it.
+
+#Multiply the genres by the weights and then take the weighted average
+recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
+recommendationTable_df.head()
+
+#Sort our recommendations in descending order
+recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
+#Just a peek at the values
+recommendationTable_df.head()
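
The scoring step can be traced on a toy catalogue. The profile weights and movie ids below are made up; the formula is the same weighted average used above, the sum of the profile weights of a movie's genres divided by the total of all weights.

```python
import pandas as pd

# Hypothetical user profile: one weight per genre.
profile = pd.Series({'Comedy': 8.0, 'Drama': 5.0})

# Hypothetical catalogue of three movies, indexed by movieId.
genre_table = pd.DataFrame(
    {'Comedy': [1, 0, 1], 'Drama': [0, 1, 1]},
    index=pd.Index([10, 20, 30], name='movieId'),
)

# Multiplying aligns the profile with the genre columns; the row sum is the
# weight a movie collects, normalised by the total weight available.
scores = (genre_table * profile).sum(axis=1) / profile.sum()

# Highest score first: movie 30 matches every weighted genre.
ranked = scores.sort_values(ascending=False)
print(ranked)
```

A score of 1.0 means the movie carries every genre the user has weight on; the score does not penalise movies for having additional genres.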
+
+#Now here's the recommendation table: the 20 highest-scoring movies.
+
+#Note that isin() keeps the row order of movies_df, not the ranking order.
+movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
+
+