Lastfm crawler

last.fm crawler: RW vs RWRW. Mário Almeida, Zafar Gilani, Arinto Murdopo

description

Big social networks rarely disclose their data to outside parties, since their data and users are what add value to their business. So in order to estimate parameters such as the average age, or other user characteristics and interests, we need to access the nodes individually. How the nodes are accessed depends on the network. In the case of LastFM it is easy to pick a node and from there select one of its friends at random. This technique is called a random walk (RW): a path generated by repeatedly moving to a randomly selected neighbor in the network. Once the LastFM crawler has traveled through a significant number of nodes, it is possible to estimate the average of the metrics of interest.

The problem with this simple implementation is that picking uniformly at random among a node's friends does not give every node the same probability of being accessed. In the global scope, nodes with higher degree (number of friends) have a higher probability of being visited. For example, if we are estimating the number of friends, and friends with higher degree are more likely to be visited, then the estimated number of friends will be much higher than the real value.

To fix this problem we used another type of random walk called RWRW (re-weighted random walk), which, in this case, weights each node's metric based on its degree when calculating the average. In this way, while estimating the number of friends, nodes with high values are still visited with higher probability, but their impact on the estimated value is lower. This technique produced more accurate estimates; see the graphs and values in the presentation.

LastFM is also very easy to experiment with through its API: http://www.last.fm/api

Check my blog: www.marioalmeida.eu
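The bias and its correction can be reproduced on a toy graph. The sketch below is a minimal simulation, not the crawler used for the slides: the graph, the function names (`random_walk`, `rw_mean`, `rwrw_mean`) and the step count are all assumptions made for illustration. It estimates the average degree on a small graph where the true mean is known.

```python
import random

def random_walk(graph, start, steps, rng):
    """Return the nodes visited by a simple random walk over `graph`."""
    node, visited = start, []
    for _ in range(steps):
        node = rng.choice(graph[node])  # pick one friend uniformly at random
        visited.append(node)
    return visited

def rw_mean(samples, metric):
    """Plain RW estimate: unweighted mean over the visited nodes."""
    return sum(metric(v) for v in samples) / len(samples)

def rwrw_mean(samples, metric, degree):
    """Re-weighted estimate: each sample counts with weight 1/degree."""
    num = sum(metric(v) / degree(v) for v in samples)
    den = sum(1.0 / degree(v) for v in samples)
    return num / den

# Toy undirected graph (hypothetical): hub 0 is friends with 1..9,
# and nodes 1..9 additionally form a ring.
graph = {0: list(range(1, 10))}
for i in range(1, 10):
    left = 9 if i == 1 else i - 1
    right = 1 if i == 9 else i + 1
    graph[i] = [0, left, right]

def degree(v):
    return len(graph[v])

true_mean = sum(degree(v) for v in graph) / len(graph)

rng = random.Random(42)
samples = random_walk(graph, start=0, steps=20000, rng=rng)
print("true mean degree:", true_mean)
print("RW estimate:     ", rw_mean(samples, degree))
print("RWRW estimate:   ", rwrw_mean(samples, degree, degree))
```

On this graph the true mean degree is 3.6, while a long plain walk tends toward 4.5 because the hub is visited a quarter of the time; the re-weighted average should land close to the true value.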

Transcript of Lastfm crawler

Page 1: Lastfm crawler

last.fm crawler: RW vs RWRW

Mário Almeida, Zafar Gilani, Arinto Murdopo

Page 2: Lastfm crawler

Outline

● Parameters
● Methodology
● Results
● Challenges
● Conclusion

Page 3: Lastfm crawler

Parameters

1. Playcounts
2. Playlists
3. Ages
4. IDs
5. Number of friends (degrees)

Compare averages using RW and RWRW!

Page 4: Lastfm crawler

Methodology

Used lastfm APIs to obtain

● user info
● number of friends (degree)
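As a rough illustration of the two lookups, the sketch below only builds request URLs and makes no network call. It assumes the public Last.fm web-service root and the `user.getInfo` and `user.getFriends` method names; the helper `build_request_url`, the user name and the API key placeholder are hypothetical, and the current API documentation should be checked before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed web-service root; verify against the Last.fm API documentation.
API_ROOT = "http://ws.audioscrobbler.com/2.0/"

def build_request_url(method, user, api_key, **extra):
    """Build a GET URL for one API method (hypothetical helper)."""
    params = {"method": method, "user": user,
              "api_key": api_key, "format": "json"}
    params.update(extra)  # e.g. paging options for the friends list
    return API_ROOT + "?" + urlencode(params)

# User info (playcount, age and similar profile fields).
info_url = build_request_url("user.getInfo", "some_user", "YOUR_API_KEY")
# Friends list, from which the degree can be derived.
friends_url = build_request_url("user.getFriends", "some_user",
                                "YOUR_API_KEY", limit=50)
print(info_url)
print(friends_url)
```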

RW with UIS-WR

We applied the following RW formula:
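The formula on this slide was an image and is not part of the transcript. Assuming the usual plain RW estimator (an assumption, not something recovered from the slide), the average of a metric x over the n sampled nodes v_1, ..., v_n is the unweighted sample mean:

```latex
\hat{\mu}_{RW} = \frac{1}{n} \sum_{i=1}^{n} x_{v_i}
```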

Page 5: Lastfm crawler

Methodology

For RWRW, we apply:

The weight Wv is set to the number of friends (degree).
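The RWRW formula was also an image on the slide. A common form of the re-weighted estimator, assumed here rather than taken from the deck, divides each sampled value by its weight and normalizes by the sum of the inverse weights:

```latex
\hat{\mu}_{RWRW} = \frac{\sum_{i=1}^{n} x_{v_i} / W_{v_i}}{\sum_{i=1}^{n} 1 / W_{v_i}},
\qquad W_{v} = \deg(v)
```

With every weight equal, this reduces to the plain sample mean, so the two estimates differ only when the metric varies with degree.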

Page 6: Lastfm crawler

Results

Crawled for ~10 hours
Number of samples: 48000
Number of age samples: 36363 (not all users show their age)
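Only 36363 of the 48000 samples (about 76%) carry an age, so the age estimate has to leave the rest out. A minimal sketch of one way to do that, assuming each sample is a hypothetical `(value, degree)` pair with `value` set to `None` when the user hides it; how the original crawler handled this is not stated in the slides.

```python
def rwrw_mean_skip_missing(samples):
    """Degree-weighted mean over samples that actually have a value."""
    num = den = 0.0
    for value, degree in samples:
        if value is None or degree <= 0:
            continue  # hidden value or unusable weight: skip the sample
        num += value / degree
        den += 1.0 / degree
    if den == 0:
        raise ValueError("no usable samples")
    return num / den

# Made-up samples: two users show their age, two do not.
samples_with_gaps = [(20, 2), (None, 5), (30, 4), (None, 1)]
estimate = rwrw_mean_skip_missing(samples_with_gaps)
print(estimate)
```

Skipping a sample in both the numerator and the denominator keeps the weights consistent; counting it only in the denominator would pull the estimate toward zero.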

Page 7: Lastfm crawler

Results - Ages

After about 25k samples, the age estimate stabilizes. RW estimates lower average age values than RWRW. There is a strong correlation between age and degree, which suggests that users with higher degree tend to be younger.
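Stabilization of this kind is usually judged by tracking the estimate after every new sample. The sketch below assumes samples arrive as hypothetical `(value, degree)` pairs with positive degree and yields the running RW and RWRW averages, which can then be plotted against the sample count; the sample values are made up.

```python
def running_estimates(samples):
    """Yield (count, rw_average, rwrw_average) after each sample."""
    total = num = den = 0.0
    for count, (value, degree) in enumerate(samples, start=1):
        total += value          # plain sum for the RW average
        num += value / degree   # weighted sum for the RWRW average
        den += 1.0 / degree     # sum of the weights 1/degree
        yield count, total / count, num / den

history = list(running_estimates([(20, 2), (30, 4), (25, 5)]))
for row in history:
    print(row)
```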

Page 8: Lastfm crawler

Results - Playlists

Most users do not have playlists.

RW estimates higher numbers of playlists. Users with higher degrees tend to have more playlists.

Page 9: Lastfm crawler

Results - Playcounts

We found some users with playcounts on the order of millions.

RW estimates higher playcounts. Users with higher degree tend to have higher playcounts.

Page 10: Lastfm crawler

Results - IDs

RW estimates a lower average ID compared to RWRW. A user with a lower ID generally has a higher degree.

The estimate is not yet stable.

Page 11: Lastfm crawler

Results - Degrees

RWRW reduces the bias caused by high-degree nodes having a higher probability of being visited. The RWRW estimate is indeed close to the expected degree value.
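For the degree itself the re-weighting has a simple closed form. Assuming the estimator is the weighted mean with weight equal to the degree (the slide's own formula is not in the transcript), setting the metric to the degree makes every numerator term equal to 1, so the estimate becomes the harmonic mean of the sampled degrees:

```latex
\hat{d}_{RWRW}
  = \frac{\sum_{i=1}^{n} \deg(v_i) / \deg(v_i)}{\sum_{i=1}^{n} 1 / \deg(v_i)}
  = \frac{n}{\sum_{i=1}^{n} 1 / \deg(v_i)}
```

The harmonic mean is dominated by small values, which is why a few very high-degree nodes no longer inflate the result.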

Page 12: Lastfm crawler

Conclusion

● A simple random walk in a social network generally results in biased averages.
  ○ A node with higher degree has a higher probability of being discovered.
● RWRW normalizes the averages.
  ○ High variations do not abruptly impact the estimation.
  ○ RWRW reduces the biases of RW.
● Low variance means a smaller difference between RW and RWRW.
● Crawling lastfm poses many challenges
  ○ e.g.: 0 degree, banned user, huge playcounts
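One way to handle such cases inside the walk is sketched below. It is an illustration, not the code from the repository: `get_friends` is a hypothetical callable standing in for the API lookup, assumed to return a list of friend names and to raise `LookupError` for a banned or missing user. Stepping back to the previous node on a dead end is a design choice made here for simplicity.

```python
import random

def next_node(current, previous, get_friends, rng):
    """Pick the next node of the walk, stepping back on dead ends."""
    try:
        friends = get_friends(current)
    except LookupError:
        return previous  # banned or missing user: step back
    if not friends:
        # 0 degree: nowhere to go, and 1/degree would be undefined as a weight
        return previous
    return rng.choice(friends)

# Tiny in-memory stand-in for the API (made-up data).
fake_graph = {"a": ["b", "c"], "b": ["a"], "c": []}

def get_friends(user):
    if user not in fake_graph:
        raise LookupError(user)
    return fake_graph[user]

rng = random.Random(0)
print(next_node("c", "a", get_friends, rng))       # dead end, steps back
print(next_node("banned", "a", get_friends, rng))  # unknown user, steps back
print(next_node("b", "a", get_friends, rng))       # single friend
```

Nodes skipped this way should also be left out of the averages, since they contribute neither a metric nor a usable weight.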

Page 13: Lastfm crawler

Questions

Check the code in:

● http://code.google.com/p/lastfm-rwrw/