SNFS: The design and implementation
of a Social Network File System
Ch. Kaidos, A. Pasiopoulos, N. Ntarmos,
P. Triantafillou
University of Patras
Shameless plug...
If interested, please check out
eXO: Decentralized Autonomous Scalable Social Networking,
5th Conference on Innovative Data Systems Research (CIDR2011), 2011.
Social Networks
Our Take:
1. Search for
   • People (friends, experts, …)
   • Content (books, photos, videos, blogs, websites, …)
2. Form entities (collections)
   • Friends-lists, content-libs
3. Search for
   • Entities
   • Using previously-formed collections…
4. SNFS currently provides the foundation for these…
Tagging
[Figure: an entity annotated with Tag 1, Tag 2, Tag 3, Tag 4, Tag 5]
Profiles: sets of tags describing entities.
“Search for”:
• Based on profiles
• Ranked retrieval (top-k)
Current State
5,000,000,000 photos, 3,000 photos/min (as of September 2010)
2,000,000,000 videos served up each day (May 2010)
600,000,000 monthly active users (January 2011)
15,000,000 books (October 2010), 130,000,000 by the end of the decade
Current State
Need to access published content
22,750,000,000 queries in search engines
4,000,000,000 queries in YouTube
351,000,000 queries in Facebook
416,000,000 queries in MySpace
(U.S. market figures, December 2009)
?
Current State
How do I find stuff I want?
How do I provide interesting objects to my users?
Proposal
A content-aware file system for Social Network Systems
Useful to users... and to service providers too!
Previous Work on File Indexing
1991 – Semantic File Systems by Gifford
1996 – BeFS by Giampaolo and Meurillon, part of the BeOS
BeOS never had commercial success...
1998 – Indexing Service on Windows NT, not needed at the time
Remnant of the Object File System from the Cairo project, which never materialized
Typically
• No ranked retrieval
• No users’ input (tags)
• No user relationships
Desktop Searches
2004 – Windows Desktop Search, widely popular
2005... – Mac OS X's Spotlight, Google Desktop, Beagle, Strigi, Tracker...
Typically
• No ranked retrieval?
• No user relationships
• No exploitation of relationships for searching
Problems
Power tools for power users... But for average users...
Boolean operators??? SQL-like queries???
Previous Work on Ranked Retrieval
1968 – SMART system by Salton, introduced weights in retrieval, instead of classical Boolean retrieval
1975 – Vectors and cosine similarity by Salton
1988 – Other functions for similarity tested and evaluated by Salton and Buckley
2003 – Fagin proposes and compares several efficient algorithms for top-k retrieval
Design
Design – SNFS
Tags are extracted from each object and stemmed, and their frequencies are counted
A weight is calculated for each (tag, object) pair
Each object is associated with a unique id in a tree
A tf-idf weighting scheme was chosen
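The weighting steps above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which tf-idf variant SNFS uses, so the sketch assumes raw term frequency multiplied by log(N/df). The function name `tfidf_weights` is hypothetical, and lowercasing stands in for real stemming.

```python
import math
from collections import Counter

def tfidf_weights(objects):
    """objects: dict mapping object id -> list of tags.

    Returns dict mapping object id -> {tag: weight}.
    Assumed variant: weight = tf * log(N / df); SNFS may use another.
    """
    n = len(objects)
    # Normalize tags; lowercasing is a stand-in for real stemming.
    counts = {oid: Counter(t.lower() for t in tags)
              for oid, tags in objects.items()}
    # Document frequency: number of objects that carry each tag.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    return {
        oid: {tag: tf * math.log(n / df[tag]) for tag, tf in c.items()}
        for oid, c in counts.items()
    }
```

With this variant a tag that appears on every object gets weight 0, so only discriminating tags contribute to an object's score.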
Design – SNFS
Term Weight and Object ID are stored in an inverted index
Each posting list of the index is a B+Tree stored in secondary memory
The position of the root of each B+Tree in the index is stored in a Red-Black Tree
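A rough in-memory sketch of this layout, not the SNFS on-disk format: a plain dict plays the role of the tree that maps each term to its posting list, and a sorted Python list stands in for the B+Tree kept in secondary memory. The name `build_inverted_index` and the `(weight, object id)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(weights):
    """weights: dict object id -> {term: weight}.

    Returns dict term -> list of (weight, object id), sorted by
    decreasing weight, which is the order the threshold algorithms need.
    """
    index = defaultdict(list)
    for oid, term_weights in weights.items():
        for term, w in term_weights.items():
            index[term].append((w, oid))
    for postings in index.values():
        # Highest weight first; ties broken by object id for determinism.
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)
```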
Design – Search and retrieval
The query is split into terms and stemmed
The score of each document is calculated using a threshold algorithm and a tf-idf function
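The first step of query processing might look like the sketch below. It assumes an index shaped as a dict from term to posting list; the name `query_posting_lists` is hypothetical, and lowercasing again stands in for stemming. The returned posting lists are what a threshold algorithm would then scan.

```python
def query_posting_lists(index, query):
    """Split a query into terms and fetch one posting list per known term.

    index: dict term -> posting list. Lowercasing stands in for stemming.
    """
    terms = []
    for raw in query.split():
        term = raw.lower()
        if term not in terms:      # ignore repeated query terms
            terms.append(term)
    # Terms missing from the index contribute no posting list.
    return [index[t] for t in terms if t in index]
```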
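Threshold Algorithms
Input: Posting lists sorted on weight (decreasing)
NRA (No Random Access) Algorithm
[Figure: NRA walkthrough over the posting lists of terms t1, t2, t3. Entries read at depth 1: d1, d3, d2; at depth 2: d4, d5, d2; at depth 3: d4, d3, d2. Threshold at each depth: s1+s2+s3, s4+s5+s6, s7+s8+s9. A candidate table (Doc ID, Score) accumulates partial scores for d1 to d5 as deeper entries are read.]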
The algorithm halts when no object below the current top-k can still improve its score enough to displace one of them
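A compact sketch of an NRA-style scan, assuming non-negative weights and a sum as the scoring function. It is not the SNFS implementation; the name `nra_top_k` and the in-memory list representation are hypothetical. Each seen object keeps a lower bound (weights read so far) and an upper bound (missing weights replaced by the last weight read from that list).

```python
def nra_top_k(posting_lists, k):
    """Top-k by summed weight using sorted access only (NRA-style).

    posting_lists: one list per query term of (weight, doc_id) pairs,
    sorted by decreasing weight; weights are assumed non-negative.
    Returns up to k (doc_id, lower_bound) pairs, best first.
    """
    if k <= 0:
        return []
    m = len(posting_lists)
    seen = {}                      # doc_id -> {list index: weight}
    last = [0.0] * m               # last weight read from each list
    max_depth = max((len(p) for p in posting_lists), default=0)

    def lower(doc):                # score from the weights known so far
        return sum(seen[doc].values())

    def upper(doc):                # unknown weights bounded by `last`
        return sum(seen[doc].get(i, last[i]) for i in range(m))

    top = []
    for depth in range(max_depth):
        for i, postings in enumerate(posting_lists):
            if depth < len(postings):
                w, doc = postings[depth]
                seen.setdefault(doc, {})[i] = w
                last[i] = w
            else:
                last[i] = 0.0      # exhausted list contributes nothing more
        ranked = sorted(seen, key=lambda d: (-lower(d), -upper(d), d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = lower(top[-1])
            threshold = sum(last)  # best possible score of an unseen doc
            if threshold <= kth and all(upper(d) <= kth for d in rest):
                break
    return [(d, lower(d)) for d in top]
```

Note that the returned values are lower bounds: the scan can stop before every weight of a top-k object has been read, so the set is correct but the reported scores may be partial. The per-object bookkeeping is the "state keeping" cost this approach trades for avoiding random accesses.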
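Threshold Algorithms
Input: Posting lists sorted on weight (decreasing)
TA (Threshold Algorithm with random accesses)
[Figure: TA walkthrough over the posting lists of terms t1, t2, t3. Entries read at depth 1: d1, d3, d2; at depth 2: d4, d5, d2; at depth 3: d4, d3, d2. Threshold at each depth: s1+s2+s3, s4+s5+s6, s7+s8+s9. The candidate table (Doc ID, Score) for d1 to d5 is completed with additional scores (up to s10) obtained through random accesses.]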
The algorithm halts when the threshold drops to or below the score of the last (k-th) object in the current top-k
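A matching sketch of a TA-style scan, again assuming a sum as the scoring function. Random access is simulated here with one in-memory dict per posting list; in a disk-based index each such lookup would be an extra access. The name `ta_top_k` is hypothetical and this is not the SNFS code.

```python
import heapq

def ta_top_k(posting_lists, k):
    """Top-k by summed weight using sorted plus random access (TA-style).

    posting_lists: one list per query term of (weight, doc_id) pairs,
    sorted by decreasing weight. Returns up to k (doc_id, score) pairs.
    """
    if k <= 0:
        return []
    # One lookup table per list simulates random access by doc id.
    lookup = [{doc: w for w, doc in postings} for postings in posting_lists]
    scores = {}                    # doc_id -> full score
    max_depth = max((len(p) for p in posting_lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for postings in posting_lists:
            if depth >= len(postings):
                continue           # exhausted list adds nothing to the threshold
            w, doc = postings[depth]
            threshold += w
            if doc not in scores:
                # Fetch the document's weight in every list right away.
                scores[doc] = sum(t.get(doc, 0.0) for t in lookup)
        best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Unlike the sorted-access-only scan, every reported score is exact and the only state kept is one score per seen object, at the price of one random access per list for each new object.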
Qualitative Comparison
Criteria compared (NRA vs. TA):
• Disk accesses
• State keeping and computation
• System calls
We expect TA to perform many more slow disk accesses.
Can NRA's heavy state keeping and computation needs outweigh TA's disk accesses?
We implement both, on hard disk and on a RAM-disk, to find out...
Implementation with FUSE
Testing
- 4 real-world test sets
- Files containing tags from online objects
- Index is normally on secondary memory
- RAM-disk used to evaluate the effect of disk accesses
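[Figure: Results demanded vs. time, disk-based index; series: NRA, TA]
[Figure: Results demanded vs. time, RAM-based index; series: NRA, TA]
[Figure: Query terms vs. time, disk-based index; series: NRA, TA]
[Figure: Query terms vs. time, RAM-based index; series: NRA, TA]
[Figure: Beagle vs. NRA; panels: terms vs. time, results vs. time]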
Conclusions
SNFS:
- Indexing, storage, and ranked retrieval of entities in a social network.
- Study of the efficiency of the algorithms, using real-world data and various implementations.
- Competitive performance (e.g. against Beagle).
- Many ways of further expansion.
Future Work
- Expansion for distributed systems and clouds
- Distributed file systems (HDFS)
- Distributed data structures
- Tagging, indexing, and searching for entity collections: straightforward, as our ‘object’ implementation/abstraction captures this.
- Establishing entities that consist of relationships between entities, using advanced tagging, and searching for these…