Example: MovieLens dataset

We will use the R package recommenderlab and its MovieLense dataset (the MovieLens 100k data). The data were collected through the MovieLens web site (movielens.umn.edu) from September 1997 to April 1998. The data set contains about 100k ratings (on a 1-5 scale) from 943 users on 1664 movies. Each user has rated at least 19 movies. Note that the ratings matrix is stored with users as rows and movies as columns (the opposite of the orientation used in the lectures).

In [2]:
library(recommenderlab)
data(MovieLense)
MovieLense
nusers <- dim(MovieLense)[1]   # number of users (rows)
nmovies <- dim(MovieLense)[2]  # number of movies (columns)
943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.
In [3]:
# distribution of the number of movies rated per user
summary(rowCounts(MovieLense))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   19.0    32.0    64.0   105.4   147.5   735.0 
In [4]:
MovieLenseMeta[1:5,1:10] #metadata about movies (feature vectors) are also available - we don't use them here!
title              year  url                                                     unknown  Action  Adventure  Animation  Children's  Comedy  Crime
Toy Story (1995)   1995  http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)   0        0       0          1          1           1       0
GoldenEye (1995)   1995  http://us.imdb.com/M/title-exact?GoldenEye%20(1995)     0        1       1          0          0           0       0
Four Rooms (1995)  1995  http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)  0        0       0          0          0           0       0
Get Shorty (1995)  1995  http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)  0        1       0          0          0           1       0
Copycat (1995)     1995  http://us.imdb.com/M/title-exact?Copycat%20(1995)       0        0       0          0          0           0       1

We can visualise part of the ratings matrix. Most entries are missing: only 99392 of the 943 × 1664 = 1569152 possible ratings (about 6.3%) are observed.

In [5]:
image(MovieLense[sample(nusers,25),sample(nmovies,25)])

We will next run user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF) and alternating least squares (ALS) on this dataset. Let us first prepare the dataset. Users are split into a training set ($90\%$) and a test set ($10\%$). Thus, we will train our models on the ratings of 848 users. On the test set of 95 users, 12 ratings per user will be given to the recommender to make predictions and the other ratings are held out for computing prediction accuracy.

In [6]:
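## split users 90/10 into training and test sets; for each test user,
## 12 ratings are given ("known") and the rest are held out ("unknown")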
evlt <- evaluationScheme(MovieLense, method="split", train=0.9,
                         given=12)
evlt
tr <- getData(evlt, "train"); tr
tst_known <- getData(evlt, "known"); tst_known
tst_unknown <- getData(evlt, "unknown"); tst_unknown
Evaluation scheme with 12 items given
Method: ‘split’ with 1 run(s).
Training set proportion: 0.900
Good ratings: NA
Data set: 943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.
848 x 1664 rating matrix of class ‘realRatingMatrix’ with 88557 ratings.
95 x 1664 rating matrix of class ‘realRatingMatrix’ with 1140 ratings.
95 x 1664 rating matrix of class ‘realRatingMatrix’ with 9695 ratings.
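The counts above are consistent with the description of the split: the 95 test users contribute 95 × 12 = 1140 known ratings, and 88557 + 1140 + 9695 = 99392 recovers the full data set. The known/unknown mechanism itself can be sketched in base R. This is only an illustration on a made-up toy matrix, not recommenderlab's implementation; the helper `split_known_unknown` and the toy values are hypothetical.

```r
# Toy ratings matrix: 4 users x 6 items, NA = not rated (made-up values)
ratings <- matrix(c(5, 3, NA, 1, 4, 2,
                    4, NA, 2, 1, 5, 3,
                    1, 1, 5, 4, NA, 2,
                    2, 4, 3, NA, 5, 1),
                  nrow = 4, byrow = TRUE)
given <- 3  # number of ratings revealed to the recommender per user

# Hypothetical helper: for each user, keep `given` randomly chosen ratings
# as "known" and hold out the remaining ratings as "unknown"
split_known_unknown <- function(r, given) {
  known <- r
  unknown <- r
  for (u in seq_len(nrow(r))) {
    rated <- which(!is.na(r[u, ]))
    keep <- rated[sample.int(length(rated), given)]
    known[u, setdiff(rated, keep)] <- NA  # hide held-out ratings
    unknown[u, keep] <- NA                # hide the given ratings
  }
  list(known = known, unknown = unknown)
}

set.seed(1)
parts <- split_known_unknown(ratings, given)
rowSums(!is.na(parts$known))    # ratings given per user
rowSums(!is.na(parts$unknown))  # ratings held out per user
```

Every rated entry ends up in exactly one of the two matrices, which is why the known and unknown counts in the output above add up to the total number of test-user ratings.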

Create a UBCF recommender, using Pearson similarity and 50 nearest neighbours.

In [7]:
## create a user-based CF recommender using training data
rcmnd_ub <- Recommender(tr, "UBCF",
                        param=list(method="pearson",nn=50))

## create predictions for the test users using known ratings
pred_ub <- predict(rcmnd_ub, tst_known, type="ratings"); pred_ub

## evaluate recommendations on "unknown" ratings
acc_ub <- calcPredictionAccuracy(pred_ub, tst_unknown);
as(acc_ub,"matrix")
95 x 1664 rating matrix of class ‘realRatingMatrix’ with 156940 ratings.
RMSE 1.0914818
MSE  1.1913325
MAE  0.8711042
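The three accuracy measures are computed only over entries where a held-out rating exists. A minimal base-R sketch of the definitions, using made-up vectors `truth` and `pred` (these are illustrative assumptions, not values from the data set):

```r
# Held-out ratings for one user (NA = no held-out rating) and predictions
truth <- c(4, 3, NA, 5, 1)
pred  <- c(3.5, 3.0, 2.0, 4.0, 2.0)

# Errors are only defined where a true rating is available
err <- (pred - truth)[!is.na(truth)]

mse  <- mean(err^2)     # mean squared error
rmse <- sqrt(mse)       # root mean squared error
mae  <- mean(abs(err))  # mean absolute error
c(RMSE = rmse, MSE = mse, MAE = mae)
```

By construction MSE is the square of RMSE, which matches the reported values (1.0914818² ≈ 1.1913325).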
In [8]:
#compare predictions with true "unknown" ratings
as(tst_unknown, "matrix")[1:8,1:5]
as(pred_ub, "matrix")[1:8,1:5]
   Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
5                 4                3                NA                NA             NA
6                 4               NA                NA                NA             NA
7                NA               NA                NA                 5             NA
13                3                3                NA                 5              1
35               NA               NA                NA                NA             NA
44                4               NA                NA                NA              4
54                4               NA                NA                NA             NA
65                3               NA                NA                NA             NA
   Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
5          2.478321         2.357157          2.336074          2.428473       2.368132
6          3.458488         3.374429          3.510538          3.429548       3.370482
7          3.859838         3.703389          3.658794          3.728707       3.717585
13         2.707438         2.584175          2.479966          2.549201       2.564746
35         2.838027         2.750989          2.745954          2.755377       2.722942
44         3.475823         3.387616          3.239503          3.332724       3.312472
54         4.007823         3.810384          3.815898          3.863014       3.807695
65         3.811664         3.641846          3.575492          3.651777       3.683020
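To make the user-based approach concrete, here is a minimal base-R sketch of a single UBCF-style prediction: compute Pearson similarities on co-rated items, pick the nearest neighbours who rated the target item, and add their average mean-centred rating to the target user's mean. This is an illustration under simplifying assumptions (unweighted average, toy data); the function `predict_ubcf` is hypothetical and recommenderlab's implementation may differ in its details.

```r
# Toy ratings: rows = users, columns = items, NA = not rated (made-up values)
r <- matrix(c(5, 4, 1, 1, NA,
              4, 5, 2, 1, 5,
              1, 2, 5, 4, 1,
              5, 5, 1, 2, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("i", 1:5)))

# Hypothetical helper: predict the rating of `user` for `item`
predict_ubcf <- function(r, user, item, nn = 2) {
  # Pearson correlation with every user, computed on co-rated items only
  sims <- apply(r, 1, function(other)
    cor(r[user, ], other, use = "pairwise.complete.obs"))
  sims[user] <- NA              # a user is not their own neighbour
  sims[is.na(r[, item])] <- NA  # neighbours must have rated the item
  # Indices of the nn most similar eligible users (NAs are dropped)
  neighbours <- head(order(sims, decreasing = TRUE, na.last = NA), nn)
  # Target user's mean plus the neighbours' average mean-centred rating
  user_means <- rowMeans(r, na.rm = TRUE)
  unname(user_means[user] + mean(r[neighbours, item] - user_means[neighbours]))
}

pred <- predict_ubcf(r, "u1", "i5")
pred
```

Here the two nearest neighbours of `u1` are `u4` and `u2`, who both rated `i5` above their own averages, so the prediction lies above the mean rating of `u1`.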

Now let us repeat the same procedure with IBCF. On this dataset, with these settings, it performs worse than UBCF.

In [9]:
## repeat with the item-based approach
rcmnd_ib <- Recommender(tr, "IBCF",
                        param=list(method="pearson",k=50))
pred_ib <- predict(rcmnd_ib, tst_known, type="ratings")
acc_ib <- calcPredictionAccuracy(pred_ib, tst_unknown) 
acc <- rbind(UBCF = acc_ub, IBCF = acc_ib); acc
         RMSE      MSE       MAE
UBCF 1.091482 1.191332 0.8711042
IBCF 1.683765 2.835066 1.2843314

We next try alternating least squares (ALS), using latent factor vectors of dimension $k=20$.

In [10]:
rcmnd_als <- Recommender(tr, "ALS",
                         param=list(n_factors=20))
pred_als <- predict(rcmnd_als, tst_known, type="ratings")
acc_als <- calcPredictionAccuracy(pred_als, tst_unknown) 
acc <- rbind(UBCF = acc_ub, IBCF = acc_ib, ALS = acc_als); acc
         RMSE      MSE       MAE
UBCF 1.091482 1.191332 0.8711042
IBCF 1.683765 2.835066 1.2843314
ALS  1.022098 1.044684 0.8146836
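For reference, a generic form of the objective that ALS minimises can be written as follows. The notation is an assumption and may differ from the lectures: $r_{ui}$ is the rating of user $u$ for movie $i$, $\Omega$ is the set of observed (user, movie) pairs, $\mathbf{p}_u, \mathbf{q}_i \in \mathbb{R}^k$ are the latent factor vectors, and $\lambda$ is the regularization parameter. The method of Zhou et al. (2008), which recommenderlab cites for ALS, additionally weights the penalty by the number of ratings per user and per movie, so the package may not use exactly this form.

$$
\min_{\{\mathbf{p}_u\},\{\mathbf{q}_i\}} \; \sum_{(u,i)\in\Omega} \left(r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i\right)^2 + \lambda\left(\sum_u \|\mathbf{p}_u\|^2 + \sum_i \|\mathbf{q}_i\|^2\right)
$$

With the movie factors held fixed, each user vector has a closed-form ridge-regression solution, where $\Omega_u$ denotes the movies rated by user $u$ and $I_k$ is the $k \times k$ identity matrix:

$$
\mathbf{p}_u = \left(\sum_{i\in\Omega_u} \mathbf{q}_i \mathbf{q}_i^\top + \lambda I_k\right)^{-1} \sum_{i\in\Omega_u} r_{ui}\,\mathbf{q}_i
$$

The update for each $\mathbf{q}_i$ with the user factors held fixed is symmetric, and the algorithm alternates between the two steps.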

ALS compares favourably with the memory-based methods above on this split. However, each method has a number of tuning parameters (type of similarity, number of neighbours, number of latent factors, regularization parameters), so further comparisons are needed before drawing conclusions. A number of other methods are also available; below we query the registry of implemented methods.

In [12]:
recommenderRegistry$get_entries(dataType = "realRatingMatrix")
$ALS_realRatingMatrix
Recommender method: ALS for realRatingMatrix
Description: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
Reference: Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan (2008). Large-Scale Parallel Collaborative Filtering for the Netflix Prize, 4th Int'l Conf. Algorithmic Aspects in Information and Management, LNCS 5034.
Parameters:
  normalize lambda n_factors n_iterations min_item_nr seed
1      NULL    0.1        10           10           1 NULL

$ALS_implicit_realRatingMatrix
Recommender method: ALS_implicit for realRatingMatrix
Description: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
Reference: Yifan Hu, Yehuda Koren, Chris Volinsky (2008). Collaborative Filtering for Implicit Feedback Datasets, ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 263-272.
Parameters:
  lambda alpha n_factors n_iterations min_item_nr seed
1    0.1    10        10           10           1 NULL

$IBCF_realRatingMatrix
Recommender method: IBCF for realRatingMatrix
Description: Recommender based on item-based collaborative filtering.
Reference: NA
Parameters:
   k   method normalize normalize_sim_matrix alpha na_as_zero
1 30 "Cosine"  "center"                FALSE   0.5      FALSE

$POPULAR_realRatingMatrix
Recommender method: POPULAR for realRatingMatrix
Description: Recommender based on item popularity.
Reference: NA
Parameters:
  normalize    aggregationRatings aggregationPopularity
1  "center" new("standardGeneric" new("standardGeneric"

$RANDOM_realRatingMatrix
Recommender method: RANDOM for realRatingMatrix
Description: Produce random recommendations (real ratings).
Reference: NA
Parameters: None

$RERECOMMEND_realRatingMatrix
Recommender method: RERECOMMEND for realRatingMatrix
Description: Re-recommends highly rated items (real ratings).
Reference: NA
Parameters:
  randomize minRating
1         1        NA

$SVD_realRatingMatrix
Recommender method: SVD for realRatingMatrix
Description: Recommender based on SVD approximation with column-mean imputation.
Reference: NA
Parameters:
   k maxiter normalize
1 10     100  "center"

$SVDF_realRatingMatrix
Recommender method: SVDF for realRatingMatrix
Description: Recommender based on Funk SVD with gradient descend.
Reference: NA
Parameters:
   k gamma lambda min_epochs max_epochs min_improvement normalize verbose
1 10 0.015  0.001         50        200           1e-06  "center"   FALSE

$UBCF_realRatingMatrix
Recommender method: UBCF for realRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
    method nn sample normalize
1 "cosine" 25  FALSE  "center"