20140702 xu jiaming hashinglearning - lite

1

Learning to Hash for Large-Scale Search

Xu Jiaming

Chinese Academy of Sciences

2014-07-04 @CUHK

2

Motivation

Similarity-based search is widely used in many applications:

– Image/video search and retrieval: finding most similar images/videos

– Audio search: find similar songs

– Product search: find shoes with similar style but different color

– Patient search: find patients with similar diagnostic status

Two key components:

– Similarity/distance measure

– Indexing scheme

[Figure: WhittleSearch (Kovashka et al. 2013)]

Source: 2013 CIKM Tutorial by Jun Wang

3

A Conceptual Diagram for Hashing Based Image Search System

[Diagram components: Image Database; Hash Function Design; Indexing and Search; Similarity Search & Retrieval; Reranking Refinement; Visual Search Applications]

Designing compact yet accurate hash codes is critical to making the search effective.


4

Outline

Background (data-independent)

– Locality Sensitive Hashing [1999-VLDB, 2006-FOCS, 2008-Communications]

– SimHash [2002-STOC, 2007-WWW]

Learning to Hash (data-dependent)

– Unsupervised vs. Supervised: STH [2010-SIGIR] vs. SHK [2012-CVPR]

– One-Step vs. Two-Step: ITQ [2011-CVPR, 2013-TPAMI] vs. TSH [2013-ICCV]

Others (data-dependent)

– Smart Hashing Update for Fast Response [2013-IJCAI]

– Two-Stage Hashing [2014-ACL]

– Semantic Hashing with Topics and Tags [2013-SIGIR]

– Dual-View Hashing [2013-ICML]

– Multiple View Hashing [2011-SIGIR]

LSH in MapReduce


6

LSH [1999-VLDB, 2006-FOCS, 2008-Communications]

Locality Sensitive Hashing (LSH)

[Figure: random hash functions assign each database item a 0/1 bit per function; the query is hashed to the code 101.]
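The idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, assuming random hyperplanes (the sign of a random projection) as the hash family; the class name `RandomHyperplaneLSH`, its parameters and the bucket layout are hypothetical and not taken from the cited papers.

```python
import numpy as np
from collections import defaultdict


class RandomHyperplaneLSH:
    """Toy LSH index: one random hyperplane per bit (an assumed hash family)."""

    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal vector of one random hyperplane.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)

    def hash(self, x):
        # Bit is 1 if x lies on the positive side of the hyperplane, else 0.
        bits = (self.planes @ np.asarray(x, dtype=float)) > 0
        return "".join("1" if b else "0" for b in bits)

    def add(self, item_id, x):
        # Database items with the same code share a bucket.
        self.buckets[self.hash(x)].append(item_id)

    def query(self, x):
        # Candidates are the items stored under the query's code.
        return list(self.buckets.get(self.hash(x), []))
```

With 3 bits, a query is mapped to a code such as 101 and only the items in that bucket are compared against it. Real systems usually combine several such tables to raise recall.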

7


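SimHash [2002-STOC, 2007-WWW]

Step 1: Compute TF-IDF. The text is turned into observed features with weights W1, W2, ..., Wn.

Step 2: Hash Function. Each feature is hashed to a bit string:

W1: 100110

W2: 110000

......

Wn: 001001

Step 3: Signature. Each 1 bit contributes +W and each 0 bit contributes -W:

W1: W1, -W1, -W1, W1, W1, -W1

W2: W2, W2, -W2, -W2, -W2, -W2

......

Wn: -Wn, -Wn, Wn, -Wn, -Wn, Wn

Step 4: Sum. The signed weights are added position by position: 13, 108, -22, -5, -32, 55

Step 5: Generate Fingerprint. A positive sum gives 1, otherwise 0: 1, 1, 0, 0, 0, 1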
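The five steps above can be sketched as follows. This is a minimal illustration, not the implementation from the cited papers: the feature weights are assumed to be precomputed (for example by TF-IDF), MD5 is an arbitrary choice of feature hash, and all function names are hypothetical.

```python
import hashlib


def feature_code(feature, n_bits):
    """Step 2: hash a feature string to n_bits bits (MD5 is an assumption; n_bits <= 128)."""
    value = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
    return [(value >> i) & 1 for i in range(n_bits)]


def fingerprint(codes, weights):
    """Steps 3-5: sign each weight by its bit, sum per position, threshold at zero."""
    n_bits = len(codes[0])
    totals = [0.0] * n_bits
    for code, w in zip(codes, weights):
        for i, bit in enumerate(code):
            totals[i] += w if bit else -w
    return [1 if t > 0 else 0 for t in totals]


def simhash(weighted_features, n_bits=64):
    """weighted_features: non-empty dict mapping feature -> weight (Step 1 output)."""
    features = list(weighted_features)
    codes = [feature_code(f, n_bits) for f in features]
    weights = [weighted_features[f] for f in features]
    return fingerprint(codes, weights)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return sum(x != y for x, y in zip(a, b))
```

Documents that share most of their heavily weighted features end up with fingerprints at a small Hamming distance, which is what near-duplicate detection relies on.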


9

STH [2010-SIGIR]

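Self Taught Hashing (STH)

Unsupervised Learning: find binary codes for the documents in the corpus

$$\min: \sum_{i,j} S_{ij} \, \|y_i - y_j\|^2$$

$$\text{s.t.}: \; y_i \in \{-1, 1\}^k, \quad \sum_i y_i = 0, \quad \frac{1}{n} \sum_i y_i y_i^T = I$$

Laplacian Eigenmap (matrix form):

$$\min: \operatorname{trace}\left(Y^T (D - W) Y\right)$$

$$\text{s.t.}: \; Y(i,j) \in \{-1, 1\}, \quad Y^T \mathbf{1} = 0, \quad Y^T Y = I$$

Here Y is the n × k matrix whose rows are the codes y_i, W plays the role of the similarity matrix S above, and D is its diagonal degree matrix.

Supervised Learning: train classifiers that predict the code bits for unseen queries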
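The Laplacian Eigenmap step can be illustrated with a small NumPy sketch. It relaxes the binary constraint, takes the eigenvectors with the smallest non-trivial eigenvalues of the degree-normalized Laplacian, and thresholds each one at its median so that every bit is balanced. The function name `laplacian_eigenmap_codes` is hypothetical, and the sketch assumes a symmetric similarity matrix with strictly positive row sums.

```python
import numpy as np


def laplacian_eigenmap_codes(S, k):
    """Return an (n, k) matrix of +/-1 codes from a symmetric similarity matrix S."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)                      # degrees (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.diag(d) - S                     # graph Laplacian D - W
    # Symmetric normalization turns L y = lambda D y into a standard problem.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    # Skip the trivial first eigenvector, map back to the generalized problem.
    Y = d_inv_sqrt[:, None] * vecs[:, 1:k + 1]
    # Median threshold gives balanced bits (relaxed form of sum_i y_i = 0).
    return np.where(Y > np.median(Y, axis=0), 1, -1)
```

In this sketch, items that are strongly connected in S tend to receive the same bits, which is the behaviour the objective above rewards.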

10

SHK [2012-CVPR]

Supervised Hashing with Kernels

– Pairwise similarity

– Code inner product approximates pairwise similarity
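Why the code inner product is a convenient target can be seen from a small identity: for two codes in {-1, 1}^r, the inner product equals r minus twice the Hamming distance. The sketch below checks this and evaluates a least-squares fit between scaled code inner products and a pairwise similarity matrix S with entries in [-1, 1]. The function names are hypothetical and the loss is written here as an assumption about the general form, not as the paper's exact formulation.

```python
import numpy as np


def hamming_distance(a, b):
    """Number of positions where two +/-1 codes differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))


def inner_product(a, b):
    """Inner product of two +/-1 codes; equals r - 2 * hamming_distance(a, b)."""
    return int(np.dot(a, b))


def similarity_fit_loss(H, S):
    """Squared Frobenius gap between (1/r) H H^T and the similarity matrix S."""
    H = np.asarray(H, dtype=float)
    r = H.shape[1]
    return float(np.linalg.norm(H @ H.T / r - np.asarray(S, dtype=float)) ** 2)
```

A loss of zero means every similar pair shares all bits and every dissimilar pair differs in all bits.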



12

ITQ [2011-CVPR, 2013-TPAMI]

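Iterative Quantization

– Apply PCA for dimensionality reduction: find $W$ to maximize

$$\frac{1}{n} \operatorname{tr}\left(W^T X^T X W\right), \quad W^T W = I$$

– Keep the top $c$ eigenvectors of the data covariance matrix $X^T X$ to obtain $W$; the projected data is $V = XW$

– Note that if $W$ is an optimal solution, then $\tilde{W} = WR$ is also optimal for any orthogonal $c \times c$ matrix $R$

– Key idea: find $R$ to minimize the quantization loss:

$$Q(B, R) = \|B - VR\|_F^2 = \|B\|_F^2 + \|V\|_F^2 - 2 \operatorname{tr}\left(B R^T V^T\right) = nc + \|V\|_F^2 - 2 \operatorname{tr}\left(B R^T V^T\right)$$

– $nc$ and $V$ are fixed, so this is equivalent to maximizing $\operatorname{tr}(B R^T V^T)$:

$$\operatorname{tr}\left(B R^T V^T\right) = \sum_{i=1}^{n} \sum_{j=1}^{c} B_{ij} \tilde{V}_{ij}, \quad \tilde{V} = VR$$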
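The alternating minimization of the quantization loss can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' code: the function name `itq`, the random orthogonal initialization and the iteration count are assumptions. Each iteration fixes the rotation and sets the codes to the sign of the rotated data, then fixes the codes and solves an orthogonal Procrustes problem for the rotation via SVD.

```python
import numpy as np


def itq(X, c, n_iter=50, seed=0):
    """Return codes B (n x c), PCA projection W, rotation R, and the loss per iteration."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                       # zero-center the data
    # PCA: top-c eigenvectors of the covariance matrix X^T X.
    vals, vecs = np.linalg.eigh(X.T @ X)
    W = vecs[:, np.argsort(vals)[::-1][:c]]
    V = X @ W                                    # projected data
    # Random orthogonal starting rotation (QR of a Gaussian matrix).
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    losses = []
    for _ in range(n_iter):
        # Fix R, update B: the sign of the rotated data maximizes tr(B R^T V^T).
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Fix B, update R: orthogonal Procrustes solution from the SVD of V^T B.
        U, _, Vt = np.linalg.svd(V.T @ B)
        R = U @ Vt
        losses.append(float(np.linalg.norm(B - V @ R) ** 2))
    B = np.where(V @ R >= 0, 1, -1)
    return B, W, R, losses
```

Since neither step can increase the loss, the recorded values are non-increasing; in practice a few dozen iterations are typically enough for them to level off.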

13

TSH [2013-ICCV]

Two-Step Hashing


15

SHU [2013-IJCAI]

Smart Hashing Update

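1. Consistency-based Selection:

$$\mathrm{Diff}(k, j) = \min\{\mathrm{num}(k, j, -1), \; \mathrm{num}(k, j, 1)\}$$

2. Similarity-based Selection:

$$\min_{H_l \in \{-1, 1\}^{l \times r}} Q = \left\| \frac{1}{r} H_l H_l^T - S \right\|_F^2$$

$$\min_{k \in \{1, 2, \ldots, r\}} R^k = \left\| r S - H_{r-1}^k \left(H_{r-1}^k\right)^T \right\|_F^2$$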
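A possible reading of the consistency-based criterion is sketched below. It assumes that num(k, j, b) counts the samples of class j whose k-th bit equals b, so that Diff(k, j) is the size of the minority group of bit k inside class j. Summing Diff over classes and picking the bits with the largest totals is also an assumption made for this illustration; the function names are hypothetical.

```python
import numpy as np


def diff_matrix(H, labels):
    """Diff[k, j] = min(num(k, j, -1), num(k, j, 1)) for codes H (n x r) in {-1, 1}."""
    H = np.asarray(H)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    diff = np.zeros((H.shape[1], len(classes)), dtype=int)
    for j, cls in enumerate(classes):
        Hc = H[labels == cls]                 # codes of the samples in class j
        num_pos = (Hc == 1).sum(axis=0)       # num(k, j, 1) for every bit k
        num_neg = (Hc == -1).sum(axis=0)      # num(k, j, -1) for every bit k
        diff[:, j] = np.minimum(num_neg, num_pos)
    return diff


def select_bits(H, labels, t):
    """Pick the t bits that are least consistent within classes (assumed rule)."""
    scores = diff_matrix(H, labels).sum(axis=1)
    order = np.argsort(-scores, kind="stable")
    return order[:t].tolist()
```

A bit with a total of zero is perfectly consistent within every class, so under this reading it would be the last candidate for an update.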

16

TSH [2014-ACL]

Two-Stage Hashing

– LSH for neighbor candidate pruning; ITQ for effective re-ranking

– LSH captures term similarity; ITQ captures topic similarity

Advantages:

– High hash lookup success rate, attained by the LSH stage

– High search precision, due to the ITQ re-ranking stage

– Scans only a small portion of the entire dataset

– Integrates two similarity measures
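The prune-then-re-rank flow can be sketched as below. The codes for both stages are assumed to be precomputed (for instance, LSH codes over term vectors and ITQ codes over topic vectors); the function names and the exact-bucket lookup are simplifications chosen for illustration, not the method of the cited paper.

```python
from collections import defaultdict


def build_index(lsh_codes):
    """Stage 1 index: map each LSH code to the ids of the documents that share it."""
    index = defaultdict(list)
    for doc_id, code in enumerate(lsh_codes):
        index[tuple(code)].append(doc_id)
    return index


def hamming(a, b):
    """Number of differing positions between two codes."""
    return sum(x != y for x, y in zip(a, b))


def two_stage_search(query_lsh, query_itq, index, itq_codes, top_k=3):
    """Prune candidates with the LSH bucket, then re-rank them by ITQ Hamming distance."""
    candidates = index.get(tuple(query_lsh), [])
    ranked = sorted(candidates, key=lambda d: hamming(query_itq, itq_codes[d]))
    return ranked[:top_k]
```

Only the documents in the query's bucket are touched in the second stage, which is why a small portion of the dataset is scanned.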

17

SHTTM [2013-SIGIR]

Semantic Hashing Using Tags and Topic Modeling

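Hash Code Learning

$$\min_{Y, U} \; \|T - U^T Y\|_F^2 + C \|U\|_F^2 + \gamma \|Y - \theta\|_F^2$$

$$\text{s.t.} \; Y \in \{-1, 1\}^{k \times n}, \quad Y \mathbf{1} = 0$$

– Tag Consistency: the term $\|T - U^T Y\|_F^2$

– Similarity Preservation: the term $\|Y - \theta\|_F^2$ (the symbol $\gamma$ for its trade-off weight is assumed)

Hash Function Learning

$$y_j = f(x_j) = W x_j$$

$$W^* = \arg\min_W \sum_{j=1}^{n} \|y_j - W x_j\|^2 + \lambda \|W\|^2$$

$$W^* = Y X^T \left(X X^T + \lambda I\right)^{-1}$$

(the symbol $\lambda$ for the regularization weight is assumed)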
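The hash function learning step is a regularized least-squares fit of a linear map from features to codes, which has a closed-form solution. A minimal NumPy sketch follows, assuming X holds one document per column (d × n), Y holds one code per column (k × n), and `lam` is a hypothetical name for the regularization weight.

```python
import numpy as np


def learn_hash_function(X, Y, lam=1.0):
    """Closed-form ridge solution W = Y X^T (X X^T + lam I)^{-1}, shape (k, d)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    A = X @ X.T + lam * np.eye(X.shape[0])   # symmetric positive definite for lam > 0
    # Solve A W^T = X Y^T instead of forming the inverse explicitly.
    return np.linalg.solve(A, X @ Y.T).T


def hash_codes(W, X):
    """Binarize the linear predictions W x to get +/-1 codes for (new) documents."""
    return np.where(W @ np.asarray(X, dtype=float) >= 0, 1, -1)
```

Once W is learned, a query document is hashed with a single matrix-vector product followed by a sign.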

18

DVH [2013-ICML]

Predictable Dual-View Hashing

The goal is to find two sets of hyperplanes that map the visual and textual spaces into a common subspace.

CCA

Multi-SVM

19

MVH [2011-SIGIR]

Composite Hashing with Multiple Information Sources

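Overall Objective

$$J(Y, W, \alpha) = J_S(Y) + J_C(Y, W)$$

$$= C_1 \sum_{k=1}^{M} \operatorname{tr}\left(Y L^{(k)} Y^T\right) + C_2 \left\| Y - \sum_{k=1}^{M} \alpha_k \left(W^{(k)}\right)^T X^{(k)} \right\|^2 + \sum_{k=1}^{M} \left\| W^{(k)} \right\|^2$$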


21

LSH in MapReduce – Key Idea

22

LSH in MapReduce – First Round of MapReduce

23

LSH in MapReduce – Second Round of MapReduce

24

References

[1]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing[C]//VLDB. 1999, 99: 518-529.

[2]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[C]//Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006: 459-468.

[3]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions[J]. Communications of the ACM, 2008, 51(1): 117.

[4]. Charikar M S. Similarity estimation techniques from rounding algorithms[C]//Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002: 380-388.

[5]. Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web crawling[C]//Proceedings of the 16th international conference on World Wide Web. ACM, 2007: 141-150.

[6]. Zhang D, Wang J, Cai D, et al. Self-taught hashing for fast similarity search[C]//Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 2010: 18-25.

[7]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.


[8]. Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary codes[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011: 817-824.

[9]. Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013, 35(12): 2916-2929.

[10]. Lin G, Shen C, Suter D, et al. A general two-step approach to learning-based hashing[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013: 2552-2559.

[11]. Yang Q, Huang L K, Zheng W S, et al. Smart hashing update for fast response[C]//Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013: 1855-1861.

[12]. Li H, Liu W, Ji H. Two-stage hashing for fast document retrieval[C]//ACL. 2014.

[13]. Wang Q, Zhang D, Si L. Semantic hashing using tags and topic modeling[C]//Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. ACM, 2013: 213-222.

[14]. Rastegari M, Choi J, Fakhraei S, et al. Predictable Dual-View Hashing[C]//Proceedings of The 30th International Conference on Machine Learning. 2013: 1328-1336.


[15]. Zhang D, Wang F, Si L. Composite hashing with multiple information sources[C]//Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 2011: 225-234.

[16]. Szmit, Radosław. "Locality Sensitive Hashing for Similarity Search Using MapReduce on Large Scale Data." Language Processing and Intelligent Information Systems. Springer Berlin Heidelberg, 2013. 171-178.

[17]. Blog: Location Sensitive Hashing in Map Reduce: http://horicky.blogspot.hk/2012/09/location-sensitive-hashing-in-map-reduce.html

[18]. Likelike Project: https://github.com/takahi-i/likelike

[19]. Jun Wang. Learning to Hash for Large-Scale Search. 2013 CIKM Tutorial.

27

Discussions and Questions?

Thank you!
