Week 8 Practical: Aerosol Prediction Challenge

Solution below is just an illustration. There are many ways in which this problem may have been tackled. I will use a combination of regularised least squares on both handcrafted features for bags, as well as on some random Fourier features, which arise from large-scale approximations for kernel methods.

In [4]:
library(glmnet)
#load data
Xdf<-read.csv("~/Rwd/X.csv")
ytr<-read.csv("~/Rwd/ytr.csv")
ytst<-read.csv("~/Rwd/ytst.csv")

pinstance <- 7 #number of instance features
pbag <- 5 #number of bag features
p<-pinstance+pbag


ntr<-dim(ytr)[1]
ntst<-dim(ytst)[1]
n<-ntr+ntst
dlevel_indices<-1+100*(0:(n-1))

#scale first
Xdf[,1:p]<-scale(Xdf[,1:p])

Let us try a very simple summarisation of the bags first. We represent each bag by the means of instance attributes and concatenate the bag-level attributes. Then we apply $L_2$-regularised least squares regression using cross-validation for the regularisation parameter $\lambda$.

In [5]:
#############################
#linear model on just means

Phi<-data.frame(matrix(0,n,12))
for(ii in 1:n){
    Phi[ii,]<-colMeans(Xdf[Xdf$bag_Id==ii,1:12])
}

Phitr<-Phi[1:ntr,]
Phitst<-Phi[(ntr+1):n,]
In [6]:
#use 10-fold cross-validation to assess accuracy
lmmean_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0,lambda=2^((-30:6)/2))
plot(lmmean_cvfit)
cat("crossvalidation error (means only):", sqrt(min(lmmean_cvfit$cvm)))
lmmean_ypred<-predict(lmmean_cvfit,as.matrix(Phitst),s="lambda.min")
crossvalidation error (means only): 0.8011483

Now let us do the same but using also the standard deviations and the 0.2 and 0.8 quantiles of each bag.

In [7]:
#linear model on means+sdev+quantiles
qlo<-function(x) {quantile(x,0.2)}
qhi<-function(x) {quantile(x,0.8)}

Phi<-data.frame(matrix(0,n,33))
for(ii in 1:n){
  Phi[ii,1:12]<-colMeans(Xdf[Xdf$bag_Id==ii,1:12])
  Phi[ii,13:19]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,sd)
  Phi[ii,20:26]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,qlo)
  Phi[ii,27:33]<-apply(Xdf[Xdf$bag_Id==ii,1:7],2,qhi)
}

Phitr<-Phi[1:ntr,]
Phitst<-Phi[(ntr+1):n,]
In [9]:
#use 10-fold cross-validation to assess accuracy
lmmsdq_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0,lambda=2^((-30:6)/2))
plot(lmmsdq_cvfit)
cat("crossvalidation error (means+sd+q): ", sqrt(min(lmmsdq_cvfit$cvm)))
lmmsdq_ypred<-predict(lmmsdq_cvfit,as.matrix(Phitst),s="lambda.min")
crossvalidation error (means+sd+q):  0.744375

Can also try boosting decision trees on the same representation. We also cross-validate with 10 folds and try depths from 2 to 5. Looking at the RMSE in the last 10 rounds, it appears to stabilise at the level of around 0.8.

In [20]:
library(xgboost)
dtrain <- xgb.DMatrix(as.matrix(Phitr), label = ytr$y)
for(depth in 2:5){
    cvbst <- xgb.cv(data=dtrain,nrounds=200,nfold=10,max_depth=depth,verbose=FALSE)
    cat("depth ",depth,":\n")
    print(cvbst$evaluation_log[190:200,])
}
depth  2 :
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
 1:  190       0.3405308    0.005626707      0.8171730    0.07084886
 2:  191       0.3394063    0.005757955      0.8171693    0.07078443
 3:  192       0.3381343    0.005622461      0.8178292    0.07112857
 4:  193       0.3368606    0.005526561      0.8173886    0.07078293
 5:  194       0.3357133    0.005446230      0.8167742    0.07011427
 6:  195       0.3346425    0.005604910      0.8171054    0.07005467
 7:  196       0.3335177    0.005639652      0.8170846    0.06971750
 8:  197       0.3323905    0.005591196      0.8167892    0.06985145
 9:  198       0.3313020    0.005403414      0.8167484    0.06915680
10:  199       0.3302496    0.005314439      0.8167036    0.06905414
11:  200       0.3292970    0.005319198      0.8161652    0.06908361
depth  3 :
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
 1:  190       0.1323271    0.003695743      0.8122805    0.07720687
 2:  191       0.1312963    0.003726922      0.8126330    0.07731688
 3:  192       0.1300931    0.003608384      0.8127688    0.07723485
 4:  193       0.1292995    0.003435556      0.8126322    0.07698781
 5:  194       0.1284610    0.003363797      0.8121975    0.07661431
 6:  195       0.1275089    0.003595304      0.8120819    0.07660368
 7:  196       0.1264874    0.003593135      0.8120990    0.07693506
 8:  197       0.1256408    0.003563828      0.8121754    0.07681704
 9:  198       0.1248102    0.003535153      0.8119923    0.07682059
10:  199       0.1239083    0.003387762      0.8120850    0.07710302
11:  200       0.1230967    0.003397207      0.8121109    0.07677651
depth  4 :
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
 1:  190       0.0350295    0.002336821      0.7884588    0.06661102
 2:  191       0.0345539    0.002256576      0.7885534    0.06654254
 3:  192       0.0340868    0.002244931      0.7886516    0.06666072
 4:  193       0.0335425    0.002239418      0.7885448    0.06665725
 5:  194       0.0331140    0.002268100      0.7884728    0.06664551
 6:  195       0.0327095    0.002184309      0.7883999    0.06672309
 7:  196       0.0322429    0.002126396      0.7884258    0.06674836
 8:  197       0.0318014    0.002132264      0.7884654    0.06674739
 9:  198       0.0313515    0.002069476      0.7884212    0.06675904
10:  199       0.0309520    0.002127688      0.7883344    0.06685271
11:  200       0.0306224    0.002177868      0.7883446    0.06685822
depth  5 :
    iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
 1:  190       0.0066952   0.0004329565      0.8066028    0.07565139
 2:  191       0.0065598   0.0004483045      0.8066118    0.07563133
 3:  192       0.0064157   0.0004403633      0.8066043    0.07563331
 4:  193       0.0062649   0.0004342556      0.8066009    0.07557833
 5:  194       0.0061298   0.0004261572      0.8066055    0.07556719
 6:  195       0.0059766   0.0003970356      0.8066217    0.07554167
 7:  196       0.0058587   0.0004411807      0.8066171    0.07554324
 8:  197       0.0057369   0.0004389586      0.8066013    0.07556361
 9:  198       0.0056267   0.0004362777      0.8065979    0.07553457
10:  199       0.0054844   0.0004100703      0.8066031    0.07554467
11:  200       0.0053694   0.0004047864      0.8066358    0.07553597
In [21]:
bst<-xgboost(data=dtrain,nrounds=200,max_depth=4,verbose=FALSE)
bst_ypred<-predict(bst,as.matrix(Phitst))

Now, let us consider random Fourier feature (RFF) expansions. First, we generate random Fourier features for bag attributes and summarise them with the averages of RFFs (approximate kernel embeddings). We then concatenate random Fourier features for bag attributes.

In [22]:
##########################
#random feature embeddings
m1<-100
m2<-50
med.heur = function(X) {
  ##computes the median heuristic for the kernel bandwidth theta of 
  ##the Gaussian RBF kernel exp(-0.5||x-x'||^2/theta^2)
  ##heuristic is: theta=median(||x-x'||)/sqrt(2)
  n <- dim(X)[1]
  if(n>1000){
    Xs <- X[sample(1:n,1000),] 
    heur <- median(dist(Xs))/sqrt(2)
    } else { 
        heur <- median(dist(X))/sqrt(2)
      }
  return(heur)
}
rff1 <- function(X,W1){
  M<-as.matrix(X)
  return(cbind(cos(M%*%W1),sin(M%*%W1)))
}
rff2 <- function(X,W2){
  M<-as.matrix(X)
  return(cbind(cos(M%*%W2),sin(M%*%W2)))
}
#use median heuristic bandwidths
med1<-med.heur(Xdf[,1:7])
med2<-med.heur(Xdf[dlevel_indices,8:12])

#sample frequencies
W1<-matrix(rnorm(7*m1),7,m1)
W2<-matrix(rnorm(5*m2),5,m2)
In [23]:
Phi<-data.frame(matrix(0,n,m1+m2))
for(ii in 1:n){
      Phi1<-colMeans(rff1(Xdf[Xdf$bag_Id==ii,1:7],W1/med1))
      Phi2<-rff2(Xdf[dlevel_indices[ii],8:12],W2/med2)
      Phi[ii,]<-c(Phi1,Phi2)
}
    
Phitr<-Phi[1:ntr,]
Phitst<-Phi[(ntr+1):n,]
In [25]:
rffadditive_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0, lambda=2^((-30:6)/2))
plot(rffadditive_cvfit)

cat("crossvalidation error (rff additive): ", sqrt(min(rffadditive_cvfit$cvm)))
rffadditive_ypred<-predict(rffadditive_cvfit,as.matrix(Phitst),s="lambda.min")
crossvalidation error (rff additive):  0.7754882

Now, let us try the model that captures the interactions between the instance-level attributes and the bag-level attributes. We form this by computing the outer products between the approximate kernel embeddings computed on the instance attributes and the random Fourier features for the bag attributes.

In [26]:
Phi<-data.frame(matrix(0,n,m1*m2))
for(ii in 1:n){
      Phi1<-colMeans(rff1(Xdf[Xdf$bag_Id==ii,1:7],W1/med1))
      Phi2<-rff2(Xdf[dlevel_indices[ii],8:12],W2/med2)
      Phi[ii,]<-c(Phi1%*%Phi2)
}
    
Phitr<-Phi[1:ntr,]
Phitst<-Phi[(ntr+1):n,]
In [27]:
rffouter_cvfit<-cv.glmnet(as.matrix(Phitr),ytr$y,type.measure="mse",nfolds=10,alpha=0, lambda=2^((-30:6)/2))
plot(rffouter_cvfit)

cat("crossvalidation error (rff outer product): ", sqrt(min(rffouter_cvfit$cvm)))
rffouter_ypred<-predict(rffouter_cvfit,as.matrix(Phitst),s="lambda.min")
crossvalidation error (rff outer product):  0.7316888

OK, we are now happy with our models, so let's compare them on the test set.

In [28]:
rmse<-function(ypred){sqrt(mean((ypred-ytst$y)^2))}
results<-cbind(LMMEANS=rmse(lmmean_ypred),LMMSDQ=rmse(lmmsdq_ypred),
               XGBMSDQ=rmse(bst_ypred),RFFADD=rmse(rffadditive_ypred),
               RFFOUTER=rmse(rffouter_ypred))
results
LMMEANSLMMSDQXGBMSDQRFFADDRFFOUTER
0.831179 0.81151540.82505170.79601150.7867687