SNFS: The design and implementation of a Social Network File System Ch. Kaidos, A. Pasiopoulos N....

SNFS: The design and implementation

of a Social Network File System

Ch. Kaidos, A. Pasiopoulos, N. Ntarmos,

P. Triantafillou

University of Patras

Shameless plug..

If interested, please check out

eXO: Decentralized Autonomous Scalable Social Networking,

5th Conference on Innovative Data Systems Research (CIDR2011), 2011.

Social Networks


Our Take:

1. Search for
   • People (friends, experts, …)
   • Content (books, photos, videos, blogs, websites, …)

2. Form entities (collections)
   • Friends-lists, content-libs

3. Search for
   • entities
   • Using previously-formed collections…

4. SNFS currently provides the foundation for these…

Tagging

Tag 1, Tag 2, Tag 3, Tag 4, Tag 5

Profiles: sets of tags describing entities.

“Search for”:
• based on profiles.
• Ranked retrieval (top-k)

Current State

5,000,000,000 photos; 3,000 photos/min (as of September 2010)

2,000,000,000 videos served up each day (May 2010)

600,000,000 monthly active users (January 2011)

15,000,000 books (October 2010); 130,000,000 by the end of the decade

Current State

Need to access published content

22,750,000,000 queries in search engines

4,000,000,000 queries in YouTube

351,000,000 queries in Facebook

416,000,000 queries in MySpace

(U.S. market figures, December 2009)

?

Current State

How do I find stuff I want?

How do I provide interesting objects to my users?

Proposal

A content-aware file system for Social Network Systems

Useful to users... and to service providers too!

Previous Work on File Indexing

1991 – Semantic File Systems by Gifford

1996 – BeFS by Giampaolo and Meurillon, part of the BeOS

BeOS never had commercial success...

1998 – Indexing Service on Windows NT, not needed at the time

Remnant of the Object File System from the unmaterialized Cairo project

Typically
• No ranked retrieval
• No users’ input (tags)
• No user relationships

Desktop Searches

2004 – Windows Desktop Search, widely popular

2005... – Mac OS X's Spotlight, Google Desktop, Beagle, Strigi, Tracker...

Typically
• No ranked retrieval?
• No user relationships
• No exploitation of relationships for searching

Problems

Power tools for power users... But for average users...

Boolean operators??? SQL-like queries???

Previous Work on Ranked Retrieval

1968 – SMART system by Salton, introduced weights in retrieval, instead of classical Boolean retrieval

1975 – Vectors and cosine similarity by Salton

1988 – Other functions for similarity tested and evaluated by Salton and Buckley

2003 – Fagin proposes and compares several efficient algorithms for top-k retrieval

Design

Design – SNFS

Tags are extracted from each object and stemmed, and their frequencies are counted

Weights for each tag and document are calculated

Each object is associated with a unique id in a Tree

A tf-idf weighting scheme was chosen
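
To make the weighting step concrete, here is a minimal sketch of one common tf-idf variant (raw tag frequency multiplied by log(N/df)). The slides do not say which tf-idf formula SNFS uses, so the formula, the function name `tfidf_weights`, and the sample objects are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_weights(objects):
    """objects: dict of object id -> list of already-stemmed tags.
    Returns dict of object id -> {tag: tf-idf weight}.
    Assumed variant: weight = tf * log(N / df)."""
    n = len(objects)
    # Document frequency: number of objects in which each tag occurs.
    df = Counter()
    for tags in objects.values():
        df.update(set(tags))
    weights = {}
    for oid, tags in objects.items():
        tf = Counter(tags)  # how often each tag occurs in this object
        weights[oid] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

# Hypothetical sample data: three objects described by tags.
objects = {
    1: ["beach", "sunset", "beach"],
    2: ["beach", "city"],
    3: ["city", "night"],
}
w = tfidf_weights(objects)
```

With this variant, a tag that appears in every object gets weight 0, while a tag that is frequent in one object and rare elsewhere gets a high weight.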

Design – SNFS

Term Weight and Object ID are stored in an inverted index

Each posting list of the index is a B+Tree stored in secondary memory

The position of the root of the B+Tree in the index is stored in a Red Black Tree
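
The layout above can be approximated in memory as follows. This is only a sketch: a Python dict stands in for the Red Black Tree that locates each posting list, and a sorted Python list stands in for the on-disk B+Tree. The class name `InvertedIndex` and its methods are hypothetical, not taken from SNFS.

```python
import bisect

class InvertedIndex:
    """In-memory stand-in for the SNFS index layout (assumed structure)."""

    def __init__(self):
        # term -> posting list. SNFS keeps the B+Tree root position in a
        # Red Black Tree; a dict plays that role in this sketch.
        self.postings = {}

    def add(self, term, object_id, weight):
        plist = self.postings.setdefault(term, [])
        # Store (-weight, object_id) so that ascending order of the tuple
        # corresponds to decreasing weight.
        bisect.insort(plist, (-weight, object_id))

    def posting_list(self, term):
        """Return [(object_id, weight), ...] sorted by decreasing weight."""
        return [(oid, -neg_w) for neg_w, oid in self.postings.get(term, [])]

index = InvertedIndex()
index.add("beach", 1, 0.8)
index.add("beach", 2, 0.3)
index.add("beach", 3, 0.5)
```

Keeping every posting list ordered by decreasing weight is what allows the threshold algorithms to read the most promising objects first.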

Design – Search and retrieval

The query is split in terms and stemmed

The score of each document is calculated using a threshold algorithm and a tf-idf function

Threshold Algorithms

Input: Posting lists sorted on weight (decreasing)

NRA (No Random Access) Algorithm

[Figure: the posting lists of query terms t1, t2, t3 are read in parallel at depths 1, 2, 3, yielding documents d1 to d5. A Doc ID / Score table accumulates the partial score of each document seen so far. Threshold at depth 1: s1+s2+s3; at depth 2: s4+s5+s6; at depth 3: s7+s8+s9.]

The algorithm halts when no object ranked below the current top-k can still improve its score enough to overtake the k-th object
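
The NRA idea can be sketched as below, assuming the score of an object is the sum of its weights across the query terms' posting lists. The function name `nra_top_k` and the sample lists are hypothetical; this is an in-memory illustration, not the SNFS implementation.

```python
def nra_top_k(lists, k):
    """NRA sketch (k >= 1). lists: one posting list per query term, each a
    list of (object_id, weight) sorted by decreasing weight.
    Returns up to k (object_id, lower-bound score) pairs."""
    m = len(lists)
    seen = {}   # object_id -> {list index: weight read so far}
    worst = {}  # object_id -> lower bound on its score
    max_depth = max((len(pl) for pl in lists), default=0)
    for depth in range(max_depth):
        # One round of sorted (sequential) access on every list.
        for i, pl in enumerate(lists):
            if depth < len(pl):
                oid, w = pl[depth]
                seen.setdefault(oid, {})[i] = w
        # Last weight read from each list; 0 once a list is exhausted.
        last = [pl[depth][1] if depth < len(pl) else 0.0 for pl in lists]
        # Lower bound: unseen weights count as 0.
        worst = {o: sum(ws.values()) for o, ws in seen.items()}
        # Upper bound: unseen weights can be at most the last weight read.
        best = {o: worst[o] + sum(last[i] for i in range(m) if i not in ws)
                for o, ws in seen.items()}
        top = sorted(worst, key=lambda o: -worst[o])[:k]
        if len(top) == k:
            kth = worst[top[-1]]
            rivals = [best[o] for o in seen if o not in top]
            # sum(last) bounds the score of any object not seen at all.
            if kth >= sum(last) and all(kth >= b for b in rivals):
                return [(o, worst[o]) for o in top]
    top = sorted(worst, key=lambda o: -worst[o])[:k]
    return [(o, worst[o]) for o in top]

# Hypothetical posting lists for two query terms.
lists = [
    [("d2", 0.75), ("d1", 0.5), ("d3", 0.25)],
    [("d2", 0.5), ("d3", 0.375), ("d1", 0.25)],
]
top1 = nra_top_k(lists, 1)
top2 = nra_top_k(lists, 2)
```

Note the state this sketch has to keep: every object seen so far, with both a lower and an upper bound that are recomputed at each depth. The returned scores are lower bounds, and become exact only when an object has been seen in every list.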

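Threshold Algorithms

Input: Posting lists sorted on weight (decreasing)

TA (Threshold Algorithm with random accesses)

[Figure: the posting lists of query terms t1, t2, t3 are read in parallel at depths 1, 2, 3, yielding documents d1 to d5. For each newly seen document, random accesses fetch its remaining weights, so the Doc ID / Score table holds complete scores. Threshold at depth 1: s1+s2+s3; at depth 2: s4+s5+s6; at depth 3: s7+s8+s9.]

The algorithm halts when the threshold is no higher than the score of the last (k-th) object in the top-k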
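
A minimal sketch of TA follows, again assuming that an object's score is the sum of its weights across the query terms. Random access is simulated here with one dict per posting list; in a disk-based index each such lookup would be a separate disk access. The function name `ta_top_k` and the sample lists are hypothetical.

```python
import heapq

def ta_top_k(lists, k):
    """TA sketch (k >= 1). lists: one posting list per query term, each a
    list of (object_id, weight) sorted by decreasing weight.
    Returns up to k (object_id, exact score) pairs."""
    # Random-access structure: per list, object_id -> weight.
    lookup = [dict(pl) for pl in lists]
    scores = {}  # object_id -> complete score
    max_depth = max((len(pl) for pl in lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for pl in lists:
            if depth < len(pl):
                oid, w = pl[depth]
                # Threshold: sum of the weights read at the current depth.
                threshold += w
                if oid not in scores:
                    # Random access: fetch this object's weight in every list.
                    scores[oid] = sum(lk.get(oid, 0.0) for lk in lookup)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        # No unseen object can score above the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical posting lists for two query terms.
lists = [
    [("d2", 0.75), ("d1", 0.5), ("d3", 0.25)],
    [("d2", 0.5), ("d3", 0.375), ("d1", 0.25)],
]
top1 = ta_top_k(lists, 1)
top2 = ta_top_k(lists, 2)
```

Compared with NRA, the only state kept is one complete score per object seen, at the price of one random access per list for every new object.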

Qualitative Comparison

NRA vs. TA, compared on:

- Disk accesses

- State keeping and computation

- System calls

We expect TA to perform many more slow disk accesses

Can the cost of NRA's extensive state keeping and computation outweigh the cost of TA's disk accesses?

We implement both, on hard disk and on RAM-disk to find out...

Implementation with FUSE

Testing

- 4 real world test sets

- files containing tags from online objects

- index is normally on secondary memory

- ram-disk used to evaluate the effect of disk accesses

Results demanded vs Time, disk-based index (plots: NRA, TA)

Results demanded vs Time, RAM-based index (plots: NRA, TA)

Query Terms vs Time, disk-based index (plots: NRA, TA)

Query Terms vs Time, RAM-based index (plots: NRA, TA)

Beagle vs NRA

Terms vs time

Results vs time

Conclusions

SNFS:

- Indexing, storage, and ranked retrieval of entities in a SN.

- Study of the efficiency of the algorithms, using real-world data and several implementations.

- Competitive performance (e.g., against Beagle).

- Many ways of further expansion

Future Work

- Expansion for distributed systems and clouds

- Distributed file systems (HDFS)

- Distributed data structures

- Tagging, Indexing, and searching for entity-collections – straightforward, as our ‘object’ implementation/abstraction captures this.

- Establishing entities consisting of relationships between entities, using advanced-tagging, and searching for these…
