We will use the R package recommenderlab and the MovieLense 100k dataset. The data were collected through the MovieLens website (movielens.umn.edu) between September 1997 and April 1998. The data set contains about 100,000 ratings (1-5) from 943 users on 1664 movies. Each user has rated at least 19 movies. Note that the ratings matrix is stored with users as rows and movies as columns (unlike the convention used in the lectures).
library(recommenderlab)
data(MovieLense)
MovieLense
nusers=dim(MovieLense)[1]
nmovies=dim(MovieLense)[2]
# check how many movies each user has rated
summary(rowCounts(MovieLense))
MovieLenseMeta[1:5,1:10] #metadata about movies (feature vectors) are also available - we don't use them here!
We can visualise part of the ratings matrix for a random sample of 25 users and 25 movies. There is a lot of missing data!
image(MovieLense[sample(nusers,25),sample(nmovies,25)])
We will next run user-based collaborative filtering (UBCF), item-based collaborative filtering (IBCF) and alternating least squares (ALS) on this dataset. Let us first prepare the dataset. Users are split into a training set ($90\%$) and a test set ($10\%$). Thus, we will train our models on the ratings of 848 users. On the test set of 95 users, 12 ratings per user will be given to the recommender to make predictions and the other ratings are held out for computing prediction accuracy.
## create a 90/10 train/test split of users; for each test user, 12 ratings are "known" and the rest "unknown"
evlt <- evaluationScheme(MovieLense, method="split", train=0.9,
given=12)
evlt
tr <- getData(evlt, "train"); tr
tst_known <- getData(evlt, "known"); tst_known
tst_unknown <- getData(evlt, "unknown"); tst_unknown
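To make the "known"/"unknown" protocol concrete, here is a minimal base-R sketch for a single hypothetical test user. The toy rating vector, the value `given = 3` and the use of `floor()` for the split sizes are illustrative assumptions (the rounding is chosen to match the 848/95 counts quoted above); this is not how recommenderlab implements `evaluationScheme` internally.

```r
# Split sizes implied by a 90/10 split of 943 users (rounding is an assumption).
n_users <- 943
n_train <- floor(0.9 * n_users)
n_test  <- n_users - n_train

# Toy illustration of the "given" protocol for one test user.
set.seed(1)
ratings <- c(m1 = 4, m2 = NA, m3 = 5, m4 = 2, m5 = NA, m6 = 3, m7 = 1, m8 = 5)
given <- 3                        # the tutorial uses given = 12
rated <- which(!is.na(ratings))   # positions of the movies this user has rated
keep  <- sample(rated, given)     # ratings revealed to the recommender

known   <- ratings; known[-keep]  <- NA   # what the recommender sees
unknown <- ratings; unknown[keep] <- NA   # held out for measuring accuracy

sum(!is.na(known))     # number of revealed ratings
sum(!is.na(unknown))   # number of held-out ratings
```

Every rated movie ends up in exactly one of the two parts, so predictions are always scored on ratings the recommender has not seen.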
Create a UBCF recommender, using Pearson similarity and 50 nearest neighbours.
## create a user-based CF recommender using training data
rcmnd_ub <- Recommender(tr, "UBCF",
param=list(method="pearson",nn=50))
## create predictions for the test users using known ratings
pred_ub <- predict(rcmnd_ub, tst_known, type="ratings"); pred_ub
## evaluate recommendations on "unknown" ratings
acc_ub <- calcPredictionAccuracy(pred_ub, tst_unknown)
as(acc_ub,"matrix")
#compare predictions with true "unknown" ratings
as(tst_unknown, "matrix")[1:8,1:5]
as(pred_ub, "matrix")[1:8,1:5]
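The idea behind the UBCF prediction can be sketched in base R on a tiny made-up ratings matrix. The function `predict_ubcf` below is a hypothetical helper written for illustration: it uses Pearson similarity on co-rated movies, keeps the `nn` most similar users with positive similarity, and adds a similarity-weighted average of their mean-centred ratings to the active user's mean. recommenderlab's own implementation may differ in normalisation and weighting details.

```r
# Toy ratings: rows = users, columns = movies, NA = not rated.
R <- rbind(
  u1 = c(5, 4, NA, 1, 2),
  u2 = c(4, 5, 4, 2, 1),
  u3 = c(1, 2, 1, 5, 4),
  u4 = c(5, 5, 5, 1, NA)
)
colnames(R) <- paste0("m", 1:5)

predict_ubcf <- function(R, user, item, nn = 2) {
  # candidate neighbours: other users who have rated the item
  others <- setdiff(rownames(R)[!is.na(R[, item])], user)
  # Pearson correlation computed on co-rated movies only
  sims <- sapply(others, function(v)
    cor(R[user, ], R[v, ], use = "pairwise.complete.obs"))
  # keep the nn most similar users with positive similarity
  sims <- sort(sims[!is.na(sims) & sims > 0], decreasing = TRUE)
  sims <- head(sims, nn)
  user_mean <- mean(R[user, ], na.rm = TRUE)
  if (length(sims) == 0) return(user_mean)  # fall back to the user's mean
  # neighbours' ratings of the item, centred by each neighbour's own mean
  dev <- sapply(names(sims), function(v)
    R[v, item] - mean(R[v, ], na.rm = TRUE))
  user_mean + sum(sims * dev) / sum(sims)
}

pred <- predict_ubcf(R, "u1", "m3")
pred
```

Here `u3` has tastes opposite to `u1` and is therefore excluded, so the prediction for `m3` is driven by `u2` and `u4`.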
Now let us repeat the same procedure with IBCF. On this dataset, it does not perform as well as UBCF.
## repeat with the item-based approach
rcmnd_ib <- Recommender(tr, "IBCF",
param=list(method="pearson",k=50))
pred_ib <- predict(rcmnd_ib, tst_known, type="ratings")
acc_ib <- calcPredictionAccuracy(pred_ib, tst_unknown)
acc <- rbind(UBCF = acc_ub, IBCF = acc_ib); acc
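The three columns of the accuracy table (RMSE, MSE, MAE) are standard error measures computed over the held-out ratings. A minimal base-R sketch on made-up numbers, assuming errors are only counted where a held-out rating exists:

```r
# Toy held-out ratings and predictions for the same user-movie pairs.
actual    <- c(4, 3, 5, 1, NA)     # NA = no held-out rating for this movie
predicted <- c(3.5, 3, 4, 2, 4.2)

err  <- (predicted - actual)[!is.na(actual)]  # only rated entries count
mse  <- mean(err^2)        # mean squared error
rmse <- sqrt(mse)          # root mean squared error
mae  <- mean(abs(err))     # mean absolute error
c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE penalises large errors more heavily than MAE, so the two measures can rank methods differently.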
We next try the alternating least squares (ALS) approach. We will use latent attributes of dimension $k=20$.
rcmnd_als <- Recommender(tr, "ALS",
param=list(n_factors=20))
pred_als <- predict(rcmnd_als, tst_known, type="ratings")
acc_als <- calcPredictionAccuracy(pred_als, tst_unknown)
acc <- rbind(UBCF = acc_ub, IBCF = acc_ib, ALS = acc_als); acc
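For intuition, the alternating scheme can be sketched in base R on a tiny made-up matrix: approximate the ratings by a product of user factors `U` and movie factors `V`, and alternately solve a ridge regression for each user (with `V` fixed) and each movie (with `U` fixed), using observed entries only. The toy data, `k = 2`, `lambda = 0.1` and the helper names `solve_row` and `loss` are assumptions for illustration; recommenderlab's ALS may differ in its regularisation and other details.

```r
set.seed(42)
# Toy ratings: rows = users, columns = movies, NA = not rated.
R <- rbind(
  c(5, 4, NA, 1),
  c(4, NA, 4, 1),
  c(1, 1, NA, 5),
  c(NA, 1, 2, 4)
)
k <- 2; lambda <- 0.1; n_iter <- 20
U <- matrix(rnorm(nrow(R) * k, sd = 0.1), nrow(R), k)  # user factors
V <- matrix(rnorm(ncol(R) * k, sd = 0.1), ncol(R), k)  # movie factors

# Ridge solution for one factor vector, given the other side's factors
# and using only the observed ratings in y.
solve_row <- function(y, Fac, lambda) {
  obs <- !is.na(y)
  Fo  <- Fac[obs, , drop = FALSE]
  solve(crossprod(Fo) + lambda * diag(ncol(Fac)), crossprod(Fo, y[obs]))
}

# Regularised squared error on the observed entries.
loss <- function(U, V) {
  sum((R - U %*% t(V))^2, na.rm = TRUE) + lambda * (sum(U^2) + sum(V^2))
}

loss_start <- loss(U, V)
for (it in seq_len(n_iter)) {
  # fix V, update every user's factors
  for (i in seq_len(nrow(R))) U[i, ] <- solve_row(R[i, ], V, lambda)
  # fix U, update every movie's factors
  for (j in seq_len(ncol(R))) V[j, ] <- solve_row(R[, j], U, lambda)
}
loss_end <- loss(U, V)

R_hat <- U %*% t(V)   # fitted matrix, including predictions for the NA cells
round(R_hat, 2)
```

Each update minimises the objective exactly over one block of parameters, so the loss cannot increase from one sweep to the next.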
The results of ALS compare favourably with those of the memory-based methods. However, each method has a number of tuning parameters (type of similarity, number of neighbours, number of latent factors, regularisation parameters), so further comparisons are needed. There are a number of other methods; below we query the registry of implemented methods.
recommenderRegistry$get_entries(dataType = "realRatingMatrix")