Efficient and Scalable Archive Search
Avishek Anand

IS : Idealized Sharding
CA : Cost Aware Sharding

[Figure: Doc 1 to Doc 7 laid out along a time axis, shown in two groups: Doc 1, Doc 2, Doc 7 and Doc 3, Doc 4, Doc 5, Doc 6]

Searching Archives

Web archives span a long time.

Challenge: Support search with temporal constraints.

Scaling Archive Search

Web archives continuously grow over time.

Challenge: Scale search to growing archives.

Index Sharding

[1] Index Sharding for Space-Time Efficiency in Archive Search: Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. In SIGIR, 2011.
[2] Index Maintenance for Time-Travel Text Search: Avishek Anand, Srikanta Bedathur, Klaus Berberich, Ralf Schenkel. In SIGIR, 2012.
[3] A Time Machine for Text Search: Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum. In SIGIR, 2007.

[Figure: index-list with postings for Doc 1 to Doc 7 on a time axis from 2007 to 2013, partitioned into Shard 1 and Shard 2]

We need index structures that process time-travel queries efficiently and are easy to maintain.

obama @ [6/2009 – 6/2011]
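The semantics of a time-travel query such as the one above can be sketched in a few lines: a posting qualifies only if its validity interval overlaps the query time-interval. This is a minimal illustration, not the poster's implementation; the `Posting` class, the month encoding, and the sample validity intervals are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc: str
    begin: int  # first month in which the version was valid
    end: int    # last month in which the version was valid (inclusive)


def month(year, m):
    """Encode a (year, month) pair as a single integer (hypothetical helper)."""
    return year * 12 + m


def overlaps(p, q_begin, q_end):
    """True if the posting's validity interval intersects [q_begin, q_end]."""
    return p.begin <= q_end and p.end >= q_begin


def time_travel_query(index_list, q_begin, q_end):
    """Naive evaluation: scan the whole index-list and filter by overlap."""
    return [p.doc for p in index_list if overlaps(p, q_begin, q_end)]


# Made-up index-list for the term "obama".
index_list_obama = [
    Posting("Doc 1", month(2007, 1), month(2008, 3)),
    Posting("Doc 2", month(2008, 5), month(2010, 2)),
    Posting("Doc 3", month(2011, 1), month(2012, 12)),
    Posting("Doc 4", month(2012, 1), month(2013, 6)),
]

# obama @ [6/2009 – 6/2011]
hits = time_travel_query(index_list_obama, month(2009, 6), month(2011, 6))
print(hits)
```

The naive scan touches every posting, including those that cannot match; avoiding exactly that work is the motivation for sharding the index-list.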

Idealized Sharding: Eliminates access to postings that do not intersect the query time-interval.
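One way to obtain this property is sketched below. It assumes, beyond what the poster states, that each shard is kept in an order where begin times and end times are both non-decreasing; the greedy assignment and the function names are illustrative and not necessarily the authors' algorithm. Under that assumption, the postings overlapping a query form one contiguous run per shard, so nothing outside the run needs to be read.

```python
import bisect


def idealized_shards(postings):
    """Greedily split (begin, end, doc) postings into shards in which
    both begin and end times are non-decreasing."""
    shards = []
    for posting in sorted(postings):  # sorted by begin time
        for shard in shards:
            if shard[-1][1] <= posting[1]:  # end times stay non-decreasing
                shard.append(posting)
                break
        else:
            shards.append([posting])  # no shard fits, open a new one
    return shards


def query_shard(shard, q_begin, q_end):
    """Read only the contiguous run of postings overlapping [q_begin, q_end]."""
    ends = [end for _, end, _ in shard]
    start = bisect.bisect_left(ends, q_begin)  # first posting with end >= q_begin
    hits = []
    for begin, _, doc in shard[start:]:
        if begin > q_end:  # every later posting begins even later
            break
        hits.append(doc)
    return hits


# Made-up postings with integer timestamps.
postings = [(1, 10, "A"), (2, 3, "B"), (4, 6, "C"), (5, 12, "D")]
shards = idealized_shards(postings)
hits = [doc for shard in shards for doc in query_shard(shard, 7, 9)]
print(shards)
print(hits)
```

In this example the query [7, 9] reads A and D from the first shard and skips the second shard entirely, so every posting read is part of the result.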

Cost Aware Shard Merging: Merge idealized shards by reconciling random and sequential access costs.
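The trade-off behind merging can be illustrated with a toy cost model. The model below is an assumption for illustration only: each opened shard costs one random access (`c_random`), each posting read costs one sequential access (`c_seq`), and the constants 100 and 1 are made up. Merged shards need fewer random accesses but may read postings that do not overlap the query.

```python
def query_cost(shards, q_begin, q_end, c_random=100.0, c_seq=1.0):
    """Toy cost of answering [q_begin, q_end] over shards of (begin, end)
    postings, each shard sorted by begin time."""
    total = 0.0
    for shard in shards:
        # Position of the first posting that can still overlap the query.
        start = next(
            (i for i, (_, end) in enumerate(shard) if end >= q_begin), None
        )
        if start is None:
            continue  # shard is never opened
        read = 0
        for begin, _ in shard[start:]:
            if begin > q_end:
                break
            read += 1  # counts wasted reads too
        total += c_random + c_seq * read
    return total


# Two shards versus one merged shard holding the same postings.
separate = [[(1, 10), (5, 12)], [(2, 3), (4, 6)]]
merged = [sorted(p for shard in separate for p in shard)]

for q in [(7, 9), (2, 3)]:
    print(q, query_cost(separate, *q), query_cost(merged, *q))
```

For the query [7, 9] the merged shard reads two postings that do not match and ends up slightly more expensive, while for [2, 3] it saves a whole random access. A cost-aware merge would accept the merge only when the saved random accesses outweigh the wasted sequential reads.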

Index Sharding:
• Partitions each index-list disjointly.
• No index blow-up.

Index Maintenance

References

Experiments

[Figure: system architecture. Labels: Crawls; Active Index; Archive Index (In-memory Archive Index, External-memory Archive Index); Shard buffers; Archive Index Shards; inserted incoming version; appended popped posting. Timeline up to "now" with Doc 4: version 2, Doc 3: version 2, Doc 2: version 9, Doc 1: version 1, Doc 4: version 3, marked as either "Sent to Archive Indexing System" or "In the live index"]

System Architecture : Separate indexes for active and retired versions.
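A minimal sketch of this separation follows, assuming a crawler that delivers documents in time order. The class name, the tuple layout, and the use of the crawl time as the end time of the previous version are assumptions for illustration, not details from the poster.

```python
class ArchiveSearchIndex:
    """Keeps live versions in an active index and retired ones in an archive."""

    def __init__(self):
        self.active = {}   # doc id -> (begin time, text) of the live version
        self.archive = []  # retired versions as (begin, end, doc id, text)

    def ingest(self, doc_id, text, crawl_time):
        previous = self.active.get(doc_id)
        if previous is not None:
            begin, old_text = previous
            if old_text == text:
                return  # unchanged content, the live version stays valid
            # The old version is retired: its end time is now known.
            self.archive.append((begin, crawl_time, doc_id, old_text))
        self.active[doc_id] = (crawl_time, text)


index = ArchiveSearchIndex()
index.ingest("Doc 4", "version 1", crawl_time=1)
index.ingest("Doc 1", "version 1", crawl_time=2)
index.ingest("Doc 4", "version 2", crawl_time=5)
index.ingest("Doc 4", "version 3", crawl_time=9)

print(index.active)
print(index.archive)
```

Because a version is retired only when its successor is crawled, retired versions reach the archive in non-decreasing order of their end times, as long as crawls are processed in time order.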

Incremental Sharding:
• Online algorithm with approximation guarantee.
• Append-only operation on shards.
• Retains query performance.

End-time arrival order: Versions are finalized in order of their end times.
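The end-time arrival order is what makes append-only maintenance plausible. The sketch below is a simplified illustration, not the poster's online algorithm: it omits the shard buffers and carries no approximation guarantee. It assumes versions arrive as `(begin, end, doc)` tuples in non-decreasing end-time order and appends each to the first shard whose last posting begins no later than the new one, so begin and end times stay non-decreasing within every shard.

```python
def append_incrementally(shards, version):
    """Append a finalized (begin, end, doc) version to an existing shard,
    or open a new shard, without touching earlier postings."""
    begin, end, _ = version
    # Assumption: versions arrive in non-decreasing end-time order.
    if any(shard[-1][1] > end for shard in shards):
        raise ValueError("version arrived out of end-time order")
    for shard in shards:
        if shard[-1][0] <= begin:  # begin times stay non-decreasing
            shard.append(version)
            return
    shards.append([version])


# Made-up stream of finalized versions, ordered by end time.
stream = [(2, 3, "B"), (4, 6, "C"), (1, 10, "A"), (5, 12, "D")]

shards = []
for version in stream:
    append_incrementally(shards, version)

print(shards)
```

Every operation is an append at the tail of some shard, so previously written postings are never rewritten, and each shard can still answer a query with one contiguous read.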


SB : Vertical partitioning with a trade-off between performance and index size [3]

Approach: Avoid accessing postings that do not overlap with the query time-interval.

Approach: Avoid re-computation of the index by creating shards incrementally.

[Figures: Wallclock-times comparison with SB; Index-size comparison; Index maintenance efficiency; Performance of incremental sharding]

INC : Incremental Sharding