Summary of Results

In evaluating recommender systems, the emphasis has shifted from an ML perspective to an IR perspective. Both Netflix and YouTube have moved away from optimizing for rating prediction and instead optimize for top-N ranking and minutes watched, after concluding that accurately predicting ratings matters less than providing relevance (ease of discovery and satisfaction). Even with a good RMSE, if users have to scroll down and hunt for relevant items, the recommendations aren't useful.

Thus, for our analysis too, we have chosen Hit Rate @ Top-10 as the primary evaluation metric. The results below show that SVD++ (a variant of Singular Value Decomposition), with a Hit Rate of 3.28%, performs best.

Algorithm  RMSE       MAE        HR         cHR        ARHR       Coverage   Diversity  Novelty   
ContentKNN 1.0460     0.8198     0.0015     0.0015     0.0003     0.9791     0.5843     4228.4251 
User KNN   0.9961     0.7711     0.0000     0.0000     0.0000     1.0000     0.8586     5654.1042 
Item KNN   0.9995     0.7798     0.0000     0.0000     0.0000     0.9896     0.6494     6740.0228 
SVD        0.9039     0.6984     0.0283     0.0283     0.0119     0.9478     0.0428     494.8547  
SVD++      0.8928     0.6865     0.0328     0.0328     0.0143     0.9434     0.0947     666.1976  
Random     1.4428     1.1532     0.0194     0.0194     0.0041     1.0000     0.0676     540.1683  

Legend:

RMSE:      Root Mean Squared Error. Lower values mean better accuracy.
MAE:       Mean Absolute Error. Lower values mean better accuracy.
HR:        Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR:       Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR:      Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Coverage:  Ratio of users for whom recommendations above a certain threshold exist. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations
           for a given user. Higher means more diverse.
Novelty:   Average popularity rank of recommended items. Higher means more novel.
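To make the Top-N metrics concrete, here is a minimal sketch of how HR, cHR and ARHR can be computed, assuming each user's Top-N list is available as a dict of ordered item lists and the left-out ratings as (user, item, rating) tuples. The names (top_n, left_out, rating_cutoff) are illustrative, not the project's actual code.

```python
def hit_rate(top_n, left_out):
    """HR: fraction of left-out items that appear in that user's Top-N list."""
    hits = sum(1 for user, item, _ in left_out if item in top_n.get(user, []))
    return hits / len(left_out)

def cumulative_hit_rate(top_n, left_out, rating_cutoff=4.0):
    """cHR: HR restricted to left-out ratings at or above rating_cutoff."""
    relevant = [(u, i) for u, i, r in left_out if r >= rating_cutoff]
    hits = sum(1 for user, item in relevant if item in top_n.get(user, []))
    return hits / len(relevant)

def average_reciprocal_hit_rank(top_n, left_out):
    """ARHR: a hit at rank k contributes 1/k, so hits near the top count more."""
    total = 0.0
    for user, item, _ in left_out:
        recs = top_n.get(user, [])
        if item in recs:
            total += 1.0 / (recs.index(item) + 1)
    return total / len(left_out)
```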

Surprisingly, a user-based CF model built directly for Top-10 recommendation achieves a Hit Rate of 5.51% (HR = 0.0551), outperforming even SVD++. This suggests that a simple, robust approach is often better, and that models which predict ratings for every possible user-item pair and optimize for accuracy may not perform best at Top-N recommendation.

Note: RMSE is not applicable to this user-based CF model, as it produces Top-N recommendations only and does not predict ratings.

Evaluation Procedure

Accuracy

RMSE and MAE are evaluated on a 75% train / 25% test split of the data.
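As a rough sketch of this procedure (assuming the scikit-surprise library, its built-in MovieLens ml-100k dataset and SVD++ as the example algorithm; the project's actual data loading may differ):

```python
from surprise import SVDpp, Dataset, accuracy
from surprise.model_selection import train_test_split

# Assumed dataset for illustration; substitute the project's own ratings data.
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25, random_state=1)

algo = SVDpp()
algo.fit(trainset)
predictions = algo.test(testset)

accuracy.rmse(predictions)  # lower is better
accuracy.mae(predictions)   # lower is better
```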

Top-N

Top-N (N = 10) ranking metrics are evaluated using Leave-One-Out Cross-Validation with a single split, so that each user has exactly one left-out rating in the test set.
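A minimal sketch of the leave-one-out Hit Rate @ 10 evaluation, again assuming scikit-surprise, the ml-100k data and SVD++ purely for illustration:

```python
from collections import defaultdict

from surprise import SVDpp, Dataset
from surprise.model_selection import LeaveOneOut

data = Dataset.load_builtin('ml-100k')  # assumed dataset for illustration

# One split: each user gets exactly one left-out rating in the test set.
loo = LeaveOneOut(n_splits=1, random_state=1)
for trainset, leftout in loo.split(data):
    algo = SVDpp()
    algo.fit(trainset)

    # Score every user-item pair absent from the training set, then keep the
    # ten highest-estimated items per user as that user's Top-10 list.
    predictions = algo.test(trainset.build_anti_testset())
    top_n = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid in top_n:
        top_n[uid] = [iid for iid, _ in
                      sorted(top_n[uid], key=lambda x: x[1], reverse=True)[:10]]

    # Hit Rate @ 10: how often the left-out item appears in the Top-10 list.
    hits = sum(1 for uid, iid, _ in leftout if iid in top_n[uid])
    print("HR@10:", hits / len(leftout))
```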

Challenges with Offline Metrics

Good offline metrics do not necessarily translate into good recommendations, because of:

  • selection bias - the model is trained and tested only on items that users chose to rate
  • accuracy doesn't imply relevance - what matters are user actions such as purchases (e-commerce), minutes watched (video streaming) and tracks saved (music streaming)

Instead, the results of an online A/B test of how users respond to the presented recommendations (measured on meaningful surrogate metrics) should be the ultimate evaluation criterion. However, without access to an online testing platform, we, like most researchers, have to rely on offline metrics.
