Here is a link that explains the difference between cosine similarity and cosine pairwise distance:
https://stackoverflow.com/questions/35281691/scikit-cosine-similarity-vs-pairwise-distances
So the code in the first tutorial may be wrong: it mixes up distances and similarities.
Here are some simple tests:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

    # Construct a matrix: each row is a user, each column is a venue,
    # and each entry is a check-in frequency.
    mat = np.array([[2,  3,  0, 0, 0, 0, 5,  0, 1,  0],
                    [20, 30, 0, 0, 0, 0, 50, 0, 10, 0],
                    [1,  7,  0, 0, 0, 0, 2,  0, 8,  0],
                    [2,  3,  0, 0, 0, 0, 0,  0, 1,  0],
                    [4,  6,  0, 0, 7, 0, 0,  0, 2,  0]])

    user_dis = pairwise_distances(mat, metric='cosine')
    user_sim = cosine_similarity(mat)

    user_dis
    Out[3]:
    array([[ 0.        ,  0.        ,  0.39561935,  0.40085531,  0.56244658],
           [ 0.        ,  0.        ,  0.39561935,  0.40085531,  0.56244658],
           [ 0.39561935,  0.39561935,  0.        ,  0.23729486,  0.44299892],
           [ 0.40085531,  0.40085531,  0.23729486,  0.        ,  0.26970326],
           [ 0.56244658,  0.56244658,  0.44299892,  0.26970326,  0.        ]])

    user_sim
    Out[4]:
    array([[ 1.        ,  1.        ,  0.60438065,  0.59914469,  0.43755342],
           [ 1.        ,  1.        ,  0.60438065,  0.59914469,  0.43755342],
           [ 0.60438065,  0.60438065,  1.        ,  0.76270514,  0.55700108],
           [ 0.59914469,  0.59914469,  0.76270514,  1.        ,  0.73029674],
           [ 0.43755342,  0.43755342,  0.55700108,  0.73029674,  1.        ]])

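We can see that for the most similar (here, identically oriented) rows the cosine distance is 0 and the similarity is 1. In every cell the two matrices add up to 1, i.e. cosine distance = 1 - cosine similarity.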
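A quick way to confirm that relationship on the same check-in matrix as in the tests above. This is only a sanity-check sketch; the variable name relation_holds is mine and not from the tutorial.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Same matrix as above: rows are users, columns are venues.
mat = np.array([[2,  3,  0, 0, 0, 0, 5,  0, 1,  0],
                [20, 30, 0, 0, 0, 0, 50, 0, 10, 0],
                [1,  7,  0, 0, 0, 0, 2,  0, 8,  0],
                [2,  3,  0, 0, 0, 0, 0,  0, 1,  0],
                [4,  6,  0, 0, 7, 0, 0,  0, 2,  0]])

user_sim = cosine_similarity(mat)
user_dis = pairwise_distances(mat, metric="cosine")

# Check, up to floating-point tolerance, that distance == 1 - similarity.
relation_holds = np.allclose(user_dis, 1.0 - user_sim)
print(relation_holds)
```

If this holds, any code that treats the output of pairwise_distances as a similarity has the ordering of neighbours reversed.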
To avoid this confusion, we will use the cosine_similarity function from now on.
From this artificial matrix we can also see that cosine_similarity handles some situations well. Take usr[0] and usr[1]: they have exactly the same taste, except that every frequency of usr[1] is 10 times that of usr[0]. Cosine similarity gives them a similarity of one, because it depends only on the direction of the two vectors and not on their length. This is consistent with human intuition.
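To see why the factor of 10 does not matter, here is a minimal hand-written version of the formula, dot(a, b) / (|a| * |b|). The helper cosine_sim is a hypothetical name used for illustration, not a scikit-learn function, and it assumes neither vector is all zeros.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

usr0 = [2, 3, 0, 0, 0, 0, 5, 0, 1, 0]
usr1 = [10 * x for x in usr0]  # same taste, 10 times the frequency

# The factor 10 multiplies both the dot product and |usr1|, so it cancels.
sim_01 = cosine_sim(usr0, usr1)
print(sim_01)
```

The printed value equals 1 up to floating-point rounding.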
As for the similarity between usr[0] and the remaining users, the ordering is:
usr[2]≈usr[3]>usr[4] (0.604, 0.599 and 0.438 respectively)
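usr[2] has visited every place usr[0] has visited; the only difference is the frequencies. usr[3] skipped place[6], but at the places it did visit its frequencies are exactly the same as those of usr[0]. usr[4] also skipped place[6] and in addition went often to place[4], which usr[0] never visited, so it ends up least similar.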
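The ordering can be reproduced by ranking the other users by their similarity to usr[0]. This sketch uses plain NumPy (normalise each row, then take dot products) rather than scikit-learn, and the names unit, sim_to_usr0 and ranking are illustrative only.

```python
import numpy as np

mat = np.array([[2,  3,  0, 0, 0, 0, 5,  0, 1,  0],
                [20, 30, 0, 0, 0, 0, 50, 0, 10, 0],
                [1,  7,  0, 0, 0, 0, 2,  0, 8,  0],
                [2,  3,  0, 0, 0, 0, 0,  0, 1,  0],
                [4,  6,  0, 0, 7, 0, 0,  0, 2,  0]], dtype=float)

# Scale each row to unit length; assumes no user has an all-zero row.
unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities.
sim_to_usr0 = unit @ unit[0]

# Indices of the other users, most similar to usr[0] first.
ranking = [int(i) for i in np.argsort(-sim_to_usr0) if i != 0]

for i in ranking:
    print(f"usr[{i}]: {sim_to_usr0[i]:.3f}")
```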
I think this result is quite reasonable, so cosine_similarity seems to reflect the relationships between users well.