Summary of Results
In evaluating recommender systems, the emphasis has shifted from an ML perspective to an IR perspective. Both Netflix and YouTube have moved away from optimizing for rating prediction and instead optimize for top-N ranking and minutes watched, after concluding that accurately predicting ratings isn't as important as providing relevance (ease of discovery and satisfaction). Even with a good RMSE, if users have to scroll down to search for relevant items, the recommendations aren't useful.
Thus, for our analysis too, we have chosen Hit Rate @ Top-10 as the primary evaluation metric. The results below show that SVD++ (a variant of Singular Value Decomposition), with a Hit Rate of 3.28%, performs best.
| Algorithm  | RMSE   | MAE    | HR     | cHR    | ARHR   | Coverage | Diversity | Novelty   |
|------------|--------|--------|--------|--------|--------|----------|-----------|-----------|
| ContentKNN | 1.0460 | 0.8198 | 0.0015 | 0.0015 | 0.0003 | 0.9791   | 0.5843    | 4228.4251 |
| User KNN   | 0.9961 | 0.7711 | 0.0000 | 0.0000 | 0.0000 | 1.0000   | 0.8586    | 5654.1042 |
| Item KNN   | 0.9995 | 0.7798 | 0.0000 | 0.0000 | 0.0000 | 0.9896   | 0.6494    | 6740.0228 |
| SVD        | 0.9039 | 0.6984 | 0.0283 | 0.0283 | 0.0119 | 0.9478   | 0.0428    | 494.8547  |
| SVD++      | 0.8928 | 0.6865 | 0.0328 | 0.0328 | 0.0143 | 0.9434   | 0.0947    | 666.1976  |
| Random     | 1.4428 | 1.1532 | 0.0194 | 0.0194 | 0.0041 | 1.0000   | 0.0676    | 540.1683  |
Legend:
RMSE: Root Mean Squared Error. Lower values mean better accuracy.
MAE: Mean Absolute Error. Lower values mean better accuracy.
HR: Hit Rate; how often we are able to recommend a left-out rating. Higher is better.
cHR: Cumulative Hit Rate; hit rate, confined to ratings above a certain threshold. Higher is better.
ARHR: Average Reciprocal Hit Rank - Hit rate that takes the ranking into account. Higher is better.
Coverage: Ratio of users for whom recommendations above a certain threshold exist. Higher is better.
Diversity: 1-S, where S is the average similarity score between every possible pair of recommendations for a given user. Higher means more diverse.
Novelty: Average popularity rank of recommended items. Higher means more novel.
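For reference, here is a minimal sketch of how the ranking metrics above can be computed, assuming `top_n` is a dict mapping each user id to their Top-10 list of (item id, estimated rating) pairs and `left_out` is a list of (user id, item id, actual rating) tuples from the leave-one-out split; cHR is the same Hit Rate computation restricted to left-out ratings above a chosen threshold:

```python
def hit_rate(top_n, left_out):
    """Fraction of left-out ratings whose item shows up in that user's Top-N list."""
    hits = 0
    for uid, left_out_iid, _ in left_out:
        if any(iid == left_out_iid for iid, _ in top_n.get(uid, [])):
            hits += 1
    return hits / len(left_out)


def average_reciprocal_hit_rank(top_n, left_out):
    """Like hit rate, but each hit is credited 1/rank of its position in the list."""
    total = 0.0
    for uid, left_out_iid, _ in left_out:
        for rank, (iid, _) in enumerate(top_n.get(uid, []), start=1):
            if iid == left_out_iid:
                total += 1.0 / rank
                break
    return total / len(left_out)
```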
Surprisingly, a user-based CF approach built strictly for Top-N recommendations, with a Hit Rate of 5.51%, even outperforms SVD++, showing that a simpler, more robust approach is often better. Models that predict ratings for every possible user-item pair and optimize for accuracy may not perform best for Top-N.
Note: RMSE is not applicable to this user-based CF model, as it does not predict ratings and is used strictly for Top-N recommendations.
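The exact user-based CF model isn't detailed here, but one common way to generate Top-N recommendations directly from neighbours looks roughly like the hypothetical sketch below. It assumes a Surprise `trainset` and a precomputed user-user similarity matrix `sim` (e.g. `algo.sim` after fitting `KNNBasic` with user-based cosine similarity); the function name, `k`, and the rating normalization are illustrative assumptions, not the approach actually benchmarked above.

```python
import heapq
from collections import defaultdict


def user_cf_top_n(raw_uid, trainset, sim, k=10, n=10):
    """Hypothetical sketch: score candidate items by the similarity of the k
    nearest neighbours who rated them, weighted by their rating, and return
    the n best items the target user hasn't already rated."""
    uid = trainset.to_inner_uid(raw_uid)
    neighbours = heapq.nlargest(
        k + 1, range(trainset.n_users), key=lambda other: sim[uid, other])
    already_rated = {iid for iid, _ in trainset.ur[uid]}

    scores = defaultdict(float)
    for other in neighbours:
        if other == uid:
            continue  # skip the user themselves
        for iid, rating in trainset.ur[other]:
            if iid not in already_rated:
                # rating / 5.0 assumes a 5-star scale; any normalization works
                scores[iid] += sim[uid, other] * (rating / 5.0)

    best = heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])
    return [trainset.to_raw_iid(iid) for iid, _ in best]
```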
Evaluation Procedure
Accuracy
RMSE and MAE are evaluated on a 75% train / 25% test split of the data.
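A minimal sketch of this accuracy evaluation, assuming the Surprise library and the MovieLens ml-100k dataset for illustration:

```python
from surprise import Dataset, SVDpp, accuracy
from surprise.model_selection import train_test_split

# Illustration only: MovieLens 100k via Surprise's built-in loader.
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25, random_state=1)

algo = SVDpp()
algo.fit(trainset)
predictions = algo.test(testset)

accuracy.rmse(predictions)  # lower is better
accuracy.mae(predictions)   # lower is better
```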
Top-N
Top-N (N = 10) ranking metrics are evaluated using Leave-One-Out Cross-Validation with a single split (a leave-one-out iterator where each user has exactly one rating in the test set).
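A minimal sketch of the leave-one-out setup, again assuming Surprise; the left-out ratings and the per-user Top-10 lists produced here would feed the hit-rate helpers sketched earlier:

```python
from collections import defaultdict

from surprise import Dataset, SVDpp
from surprise.model_selection import LeaveOneOut

data = Dataset.load_builtin('ml-100k')  # illustration only

# One split where exactly one rating per user is held out for testing.
loo = LeaveOneOut(n_splits=1, random_state=1)
for trainset, left_out in loo.split(data):
    algo = SVDpp()
    algo.fit(trainset)

    # Predict ratings for every user-item pair absent from the training set,
    # then keep each user's 10 highest-scoring items as their Top-10 list.
    predictions = algo.test(trainset.build_anti_testset())
    top_n = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, item_scores in top_n.items():
        item_scores.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = item_scores[:10]

    # `left_out` is a list of (uid, iid, actual rating) tuples; a hit occurs
    # when the left-out item appears in that user's Top-10 list.
```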
Challenges with Offline Metrics
Good offline metrics don't necessarily translate into good recommendations, due to:
- Selection bias: the model is trained and tested on items which users chose to rate
- Accuracy doesn't imply relevance: what matters are user actions such as purchases (e-commerce), minutes watched (video streaming) and tracks saved (music streaming)
Instead, the results of online A/B tests measuring how users respond to the recommendations presented (via meaningful surrogate metrics) should be the ultimate evaluation criterion. Without access to an online testing platform, however, we, like most researchers, need to rely on offline metrics.