This repository was archived by the owner on Jul 22, 2019. It is now read-only.

Commit 10eb074

committed
Update
1 parent d5983ee commit 10eb074

40 files changed

+2363
-0
lines changed

README.txt

+147
@@ -0,0 +1,147 @@
1+
Summary
2+
=======
3+
4+
This dataset (ml-latest) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 22884377 ratings and 586994 tag applications across 34208 movies. These data were created by 247753 users between January 09, 1995 and January 29, 2016. This dataset was generated on January 29, 2016.
5+
6+
Users were selected at random for inclusion. All selected users had rated at least 1 movie. No demographic information is included. Each user is represented by an id, and no other information is provided.
7+
8+
The data are contained in four files: `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follow.
9+
10+
This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.
11+
12+
This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.
13+
14+
15+
Usage License
16+
=============
17+
18+
Neither the University of Minnesota nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set. The data set may be used for any research purposes under the following conditions:
19+
20+
* The user may not state or imply any endorsement from the University of Minnesota or the GroupLens Research Group.
21+
* The user must acknowledge the use of the data set in publications resulting from the use of the data set (see below for citation information).
22+
* The user may not redistribute the data without separate permission.
23+
* The user may not use this information for any commercial or revenue-bearing purposes without first obtaining permission from a faculty member of the GroupLens Research Project at the University of Minnesota.
24+
* The executable software scripts are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. The entire risk as to the quality and performance of them is with you. Should the program prove defective, you assume the cost of all necessary servicing, repair or correction.
25+
26+
In no event shall the University of Minnesota, its affiliates or employees be liable to you for any damages arising out of the use or inability to use these programs (including but not limited to loss of data or data being rendered inaccurate).
27+
28+
If you have any further questions or comments, please email <[email protected]>
29+
30+
31+
Citation
32+
========
33+
34+
To acknowledge use of the dataset in publications, please cite the following paper:
35+
36+
> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=<http://dx.doi.org/10.1145/2827872>
37+
38+
39+
Further Information About GroupLens
40+
===================================
41+
42+
GroupLens is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Since its inception in 1992, GroupLens's research projects have explored a variety of fields including:
43+
44+
* recommender systems
45+
* online communities
46+
* mobile and ubiquitous technologies
47+
* digital libraries
48+
* local geographic information systems
49+
50+
GroupLens Research operates a movie recommender based on collaborative filtering, MovieLens, which is the source of these data. We encourage you to visit <http://movielens.org> to try it out! If you have exciting ideas for experimental work to conduct on MovieLens, send us an email at <[email protected]> - we are always interested in working with external collaborators.
51+
52+
53+
Content and Use of Files
54+
========================
55+
56+
Formatting and Encoding
57+
-----------------------
58+
59+
The dataset files are written as [comma-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a single header row. Columns that contain commas (`,`) are escaped using double-quotes (`"`). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
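
As a minimal sketch of the quoting and encoding rules described above, the snippet below parses a small in-memory sample with Python's standard `csv` module. The sample rows are made up for illustration in the shape of `movies.csv` (only the Toy Story id and the accented title come from this README). With the real files, one would presumably open them with `encoding="utf-8"` rather than use `io.StringIO`.

```python
import csv
import io

# Hypothetical sample in the shape of movies.csv. A title that contains a
# comma is wrapped in double quotes, as described above.
sample = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    '2,"Misérables, Les (1995)",Drama|War\n'
)

# For a real file, something like:
#   open("movies.csv", newline="", encoding="utf-8")
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

for row in rows:
    # The quoted comma stays inside the title instead of splitting the column.
    print(row["movieId"], row["title"])
```

`csv.DictReader` uses the single header row as the field names, so each row can be addressed by column name.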
60+
61+
User Ids
62+
--------
63+
64+
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between `ratings.csv` and `tags.csv` (i.e., the same id refers to the same user across the two files).
65+
66+
Movie Ids
67+
---------
68+
69+
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).
70+
71+
72+
Ratings Data File Structure (ratings.csv)
73+
-----------------------------------------
74+
75+
All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:
76+
77+
userId,movieId,rating,timestamp
78+
79+
The lines within this file are ordered first by userId, then, within user, by movieId.
80+
81+
Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
82+
83+
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
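
A minimal sketch of turning such a timestamp into a timezone-aware date, using only the standard library. The helper name `to_utc` and the example value are illustrative and not taken from the dataset.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert seconds since 1970-01-01 00:00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc)


# Illustrative value only; it corresponds to midnight UTC on January 29, 2016.
example = to_utc(1454025600)
print(example.isoformat())
```

Passing `tz=timezone.utc` avoids silently interpreting the value in the local time zone of the machine running the script.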
84+
85+
Tags Data File Structure (tags.csv)
86+
-----------------------------------
87+
88+
All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:
89+
90+
userId,movieId,tag,timestamp
91+
92+
The lines within this file are ordered first by userId, then, within user, by movieId.
93+
94+
Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
95+
96+
Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
97+
98+
Movies Data File Structure (movies.csv)
99+
---------------------------------------
100+
101+
Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:
102+
103+
movieId,title,genres
104+
105+
Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.
106+
107+
Genres are a pipe-separated list, and are selected from the following:
108+
109+
* Action
110+
* Adventure
111+
* Animation
112+
* Children's
113+
* Comedy
114+
* Crime
115+
* Documentary
116+
* Drama
117+
* Fantasy
118+
* Film-Noir
119+
* Horror
120+
* Musical
121+
* Mystery
122+
* Romance
123+
* Sci-Fi
124+
* Thriller
125+
* War
126+
* Western
127+
* (no genres listed)
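
A minimal sketch of parsing the `genres` column, assuming that the placeholder `(no genres listed)` should map to an empty list. The function name `parse_genres` is hypothetical.

```python
NO_GENRES = "(no genres listed)"


def parse_genres(field):
    """Split a pipe-separated genres field into a list of genre names."""
    if field == NO_GENRES:
        # Assumption: treat the placeholder as "no genres" rather than a genre.
        return []
    return field.split("|")


print(parse_genres("Adventure|Animation|Comedy"))
print(parse_genres(NO_GENRES))
```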
128+
129+
Links Data File Structure (links.csv)
130+
-------------------------------------
131+
132+
Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:
133+
134+
movieId,imdbId,tmdbId
135+
136+
movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.
137+
138+
imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.
139+
140+
tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.
141+
142+
Use of the resources listed above is subject to the terms of each provider.
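
A minimal sketch of building the three links from one row of `links.csv`, following the URL patterns shown above. The function name `movie_urls` is hypothetical. Padding `imdbId` to seven digits is an assumption inferred from the `tt0114709` example, and rows with a missing identifier are not handled here.

```python
def movie_urls(movie_id, imdb_id, tmdb_id):
    """Return the MovieLens, IMDb and TMDb links for one links.csv row."""
    return {
        "movielens": f"https://movielens.org/movies/{int(movie_id)}",
        # Assumption: IMDb title ids are zero-padded to 7 digits after "tt".
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
        "tmdb": f"https://www.themoviedb.org/movie/{int(tmdb_id)}",
    }


# Toy Story, using the identifiers quoted in the examples above.
urls = movie_urls("1", "0114709", "862")
for site, url in urls.items():
    print(site, url)
```

Converting through `int` means the result is the same whether the identifier was read as text (`"0114709"`) or as a number (`114709`).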
143+
144+
Cross-Validation
145+
----------------
146+
147+
Prior versions of the MovieLens dataset included either pre-computed cross-folds or scripts to perform this computation. We no longer bundle either of these features with the dataset, since most modern toolkits provide this as a built-in feature. If you wish to learn about standard approaches to cross-fold computation in the context of recommender systems evaluation, see [LensKit](http://lenskit.org) for tools, documentation, and open-source code examples.
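
For orientation, here is a minimal library-free sketch of a generic row-wise k-fold split. It is not the partitioning shipped with earlier MovieLens releases and not LensKit's API. The function name `k_fold_splits` and the toy ratings are hypothetical.

```python
import random


def k_fold_splits(rows, k=5, seed=0):
    """Yield (train, test) pairs; each row lands in exactly one test fold."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Toy (userId, movieId, rating) rows, made up for illustration.
ratings = [(user, movie, 3.0) for user in range(1, 5) for movie in range(1, 6)]
splits = list(k_fold_splits(ratings, k=5))

for train, test in splits:
    print(len(train), len(test))
```

Recommender evaluations often split per user or by time instead of by row; which is appropriate depends on the experiment.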

Regression.txt

+12
@@ -0,0 +1,12 @@
1+
Regression algorithms
2+
3+
Ordinal regression
4+
Poisson regression
5+
Fast forest quantile regression
6+
Linear, Polynomial, Lasso, Stepwise, Ridge regression
7+
Bayesian linear regression
8+
Neural network regression
9+
Decision forest regression
10+
Boosted decision tree regression
11+
K-Nearest Neighbors
12+

based_content_filtering.py

+224
@@ -0,0 +1,224 @@
1+
# -*- coding: utf-8 -*-
2+
"""
3+
Created on Thu Jan 17 21:36:10 2019
4+
5+
@author: Koffi Moïse AGBENYA
6+
7+
CONTENT-BASED FILTERING
8+
9+
Recommendation systems are a collection of algorithms used to recommend items
10+
to users based on information taken from the user. These systems have become
11+
ubiquitous and can be commonly seen in online stores, movie databases and job
12+
finders. In this notebook, we will explore Content-based recommendation systems
13+
and implement a simple version of one using Python and the Pandas library.
14+
15+
ABOUT DATASET
16+
17+
This dataset (ml-latest) describes 5-star rating and free-text tagging activity
18+
from [MovieLens](http://movielens.org), a movie recommendation service. It
19+
contains 22884377 ratings and 586994 tag applications across 34208 movies.
20+
These data were created by 247753 users between January 09, 1995 and January
21+
29, 2016. This dataset was generated on January 29, 2016.
22+
23+
Users were selected at random for inclusion. All selected users had rated at
24+
least 1 movie. No demographic information is included. Each user is
25+
represented by an id, and no other information is provided.
26+
27+
The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv`
28+
and `tags.csv`. More details about the contents and use of all these files
29+
follow.
30+
31+
This is a *development* dataset. As such, it may change over time and is not an
32+
appropriate dataset for shared research results.
33+
34+
"""
35+
36+
#Dataframe manipulation library
37+
import pandas as pd
38+
#Math functions, we'll only need the sqrt function so let's import only that
39+
from math import sqrt
40+
import numpy as np
41+
import matplotlib.pyplot as plt
42+
43+
#Storing the movie information into a pandas dataframe
44+
movies_df = pd.read_csv('movies.csv')
45+
#Storing the user information into a pandas dataframe
46+
ratings_df = pd.read_csv('ratings.csv')
47+
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
48+
movies_df.head()
49+
50+
#Let's remove the year from the title column by using pandas' replace
51+
#function and store in a new year column.
52+
53+
#Using regular expressions to find a year stored between parentheses
54+
#We specify the parentheses so we don't conflict with movies that have years in their titles
55+
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
56+
#Removing the parentheses
57+
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
58+
#Removing the years from the 'title' column
59+
movies_df['title'] = movies_df.title.str.replace(r'\(\d\d\d\d\)', '', regex=True)
60+
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
61+
movies_df['title'] = movies_df['title'].str.strip()
62+
movies_df.head()
63+
64+
#Every genre is separated by a | so we simply have to call the split function on |
65+
movies_df['genres'] = movies_df.genres.str.split('|')
66+
movies_df.head()
67+
68+
"""
69+
70+
Since keeping genres in a list format isn't optimal for the content-based
71+
recommendation system technique, we will use the One Hot Encoding technique to
72+
convert the list of genres to a vector where each column corresponds to one
73+
possible value of the feature. This encoding is needed for feeding categorical
74+
data. In this case, we store every different genre in columns that contain
75+
either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't.
76+
Let's also store this dataframe in another variable since genres won't be
77+
important for our first recommendation system.
78+
79+
"""
80+
81+
#Copying the movie dataframe into a new one since we won't need to use the
82+
#genre information in our first case.
83+
moviesWithGenres_df = movies_df.copy()
84+
85+
#For every row in the dataframe, iterate through the list of genres and place a
86+
#1 into the corresponding column
87+
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
90+
#Filling in the NaN values with 0 to show that a movie doesn't have that
91+
#column's genre
92+
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
93+
moviesWithGenres_df.head()
94+
95+
#Lets look at the ratings dataframe
96+
ratings_df.head()
97+
98+
#Every row in the ratings dataframe has a user id associated with at least one
99+
#movie, a rating and a timestamp showing when they reviewed it. We won't be
100+
#needing the timestamp column, so let's drop it to save on memory.
101+
102+
#Drop removes a specified row or column from a dataframe
103+
ratings_df = ratings_df.drop(columns='timestamp')
104+
ratings_df.head()
105+
106+
#Content-Based recommendation system
107+
108+
"""
109+
110+
Now, let's take a look at how to implement Content-Based or Item-Item
111+
recommendation systems. This technique attempts to figure out what a user's
112+
favourite aspects of an item are, and then recommends items that present those
113+
aspects. In our case, we're going to try to figure out the input's favorite
114+
genres from the movies and ratings given.
115+
116+
Let's begin by creating an input user to recommend movies to:
117+
118+
Note: To add more movies, simply increase the number of elements in the
119+
userInput. Feel free to add more in! Just be sure to write it in with capital
120+
letters and if a movie starts with a "The", like "The Matrix" then write it in
121+
like this: 'Matrix, The' .
122+
123+
"""
124+
125+
userInput = [
126+
{'title':'Breakfast Club, The', 'rating':5},
127+
{'title':'Toy Story', 'rating':3.5},
128+
{'title':'Jumanji', 'rating':2},
129+
{'title':"Pulp Fiction", 'rating':5},
130+
{'title':'Akira', 'rating':4.5}
131+
]
132+
inputMovies = pd.DataFrame(userInput)
133+
inputMovies
134+
135+
"""
136+
137+
Add movieId to input user
138+
With the input complete, let's extract the input movies' IDs from the movies
139+
dataframe and add them into it.
140+
141+
We can achieve this by first filtering out the rows that contain the input
142+
movies' title and then merging this subset with the input dataframe. We also
143+
drop unnecessary columns for the input to save memory space.
144+
145+
"""
146+
147+
#Filtering out the movies by title
148+
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
149+
#Then merging it so we can get the movieId. It's implicitly merging it by title.
150+
inputMovies = pd.merge(inputId, inputMovies)
151+
#Dropping information we won't use from the input dataframe
152+
inputMovies = inputMovies.drop(columns=['genres', 'year'])
153+
#Final input dataframe
154+
#If a movie you added in above isn't here, then it might not be in the original
155+
#dataframe or it might be spelled differently, please check capitalisation.
156+
inputMovies
157+
158+
#We're going to start by learning the input's preferences, so let's get the
159+
#subset of movies that the input has watched from the Dataframe containing
160+
#genres defined with binary values.
161+
#Filtering out the movies from the input
162+
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
163+
userMovies
164+
165+
#We'll only need the actual genre table, so let's clean this up a bit by
166+
#resetting the index and dropping the movieId, title, genres and year columns.
167+
168+
#Resetting the index to avoid future issues
169+
userMovies = userMovies.reset_index(drop=True)
170+
#Dropping unnecessary columns to save memory and to avoid issues
171+
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
172+
userGenreTable
173+
174+
"""
175+
176+
Now we're ready to start learning the input's preferences!
177+
178+
To do this, we're going to turn each genre into weights. We can do this by
179+
using the input's reviews and multiplying them into the input's genre table and
180+
then summing up the resulting table by column. This operation is actually a dot
181+
product between a matrix and a vector, so we can simply accomplish it by calling
182+
Pandas's "dot" function.
183+
184+
"""
185+
186+
inputMovies['rating']
187+
188+
#Dot product to get weights
189+
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
190+
#The user profile
191+
userProfile
192+
193+
#Now we have the weights for each of the user's preferences. This is known as
194+
#the User Profile. Using this, we can recommend movies that satisfy the user's
195+
#preferences.
196+
#Let's start by extracting the genre table from the original dataframe:
197+
198+
#Now let's get the genres of every movie in our original dataframe
199+
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
200+
#And drop the unnecessary information
201+
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
202+
genreTable.head()
203+
204+
genreTable.shape
205+
206+
#With the input's profile and the complete list of movies and their genres in
207+
#hand, we're going to take the weighted average of every movie based on the
208+
#input profile and recommend the top twenty movies that most satisfy it.
209+
210+
#Multiply the genres by the weights and then take the weighted average
211+
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
212+
recommendationTable_df.head()
213+
214+
#Sort our recommendations in descending order
215+
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
216+
#Just a peek at the values
217+
recommendationTable_df.head()
218+
219+
#Now here's the recommendation table
220+
221+
#The final recommendation table
222+
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
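
The profile and scoring steps in the script above can be traced by hand on a toy example. The sketch below uses plain Python lists instead of pandas, and all movie names, genre vectors and ratings in it are made up for illustration. It mirrors the two steps: a rating-weighted sum of genre vectors (the dot product), then a weighted average per candidate movie.

```python
# Hypothetical one-hot genre vectors, in the order given by `genres`.
genres = ["Comedy", "Drama", "Sci-Fi"]

# Movies the input user rated: name -> (genre vector, rating).
rated = {
    "A": ([1, 1, 0], 5.0),
    "B": ([1, 0, 0], 2.0),
    "C": ([0, 0, 1], 4.5),
}

# Movies to score for recommendation: name -> genre vector.
candidates = {"X": [0, 1, 1], "Y": [1, 0, 0]}

# User profile: each genre's weight is the sum of the ratings of the rated
# movies that carry that genre (genre table transposed, dotted with ratings).
profile = [
    sum(vector[i] * rating for vector, rating in rated.values())
    for i in range(len(genres))
]

# Score: weighted average of a candidate's genres under the profile.
total = sum(profile)
scores = {
    name: sum(g * w for g, w in zip(vector, profile)) / total
    for name, vector in candidates.items()
}

# Highest score first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(profile, ranked)
```

Here the profile is `[7.0, 5.0, 4.5]`, so candidate X (Drama and Sci-Fi, 9.5 / 16.5) ranks above candidate Y (Comedy only, 7.0 / 16.5).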
223+
224+
