FashionAsk: A Multimedia-based Question-Answering System

Wei Zhang, Lei Pang, Chong-Wah Ngo
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

{wzhang34-c, leipang3-c}@my.cityu.edu.hk, [email protected]

ABSTRACT
We demonstrate a multimedia-based question-answering system, named FashionAsk, that allows users to ask questions referring to pictures snapped with mobile devices. Instead of asking verbose questions to describe visual instances, users provide pictures directly as part of the question. The significance of our system is twofold: (1) it is fully automatic, with no human expert involved; and (2) it bypasses the requirement of knowing the name of the object under query, which is usually unknown to the asker. To answer these multimodal questions, FashionAsk performs a large-scale instance search with metadata pooling to name the instance under query, and then matches the question against similar questions from community-contributed QA websites to obtain answers. The demonstration is conducted on a million-scale dataset of Web images and QA pairs in the domain of fashion products. Asking a multimedia question through FashionAsk takes as little as five seconds to retrieve the candidate answer as well as suggested questions.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Performance, Experimentation

Keywords
Multimedia question answering, instance naming, question matching

1. INTRODUCTION
Community-contributed question-answering (cQA) websites, such as Yahoo! Answers1, have gained popularity in the past few years and accumulated hundreds of millions of question-answer (QA) pairs. Most of these QA pairs are text-based: a user posts a question by typing and expects text answers in return, either from search or from other users. Traditionally, a user gets answers from the community either by posting a question and waiting for human answers, or by searching past archives. The first way requires a long waiting time, and the latter requires certain knowledge of the object under query (e.g., its name).

1 http://answers.yahoo.com/

Copyright is held by the author/owner(s). ACM-HK Student Research and Career Day, 2013.

[Figure 1: a multimodal question (a picture plus the text “Does this has anything to do with ‘Avengers’?”) flows through Offline Indexing, Instance Search, Instance Naming (NE Extraction), Keyword Extraction, Similar Question Search over the Q&A Archive, Questions Matching, and Syntactic Reranking, yielding suggested answers such as “Hand painted sneakers from Sole Junkie…” and “Inspired by Avengers, Sole Junkie spent hours perfecting his drawings: Thor, Iron …”.]

Figure 1: Framework of FashionAsk.

However, with the popularity of mobile devices for snapping pictures, there has been a recent shift in how community users post questions: instead of describing a visual instance verbosely in text, users include visual examples directly as part of the question [9, 3]. For example, asking “Who is the designer of this rock-and-roll t-shirt?” alongside a picture is far more convenient than describing the t-shirt in words.

In this demo, we present FashionAsk, a system capable of answering multimedia-based questions in the fashion domain. Specifically, a question is answered by first naming the visual instance in the given picture (e.g., “Twisted Sister Rock-and-Roll t-shirt”) through a large-scale search of Web images. The original text question is then augmented with the named entity and searched against cQA archives. To this end, the system returns potential answers in text, and shortlists a few suggested QA pairs. Our current system operates on a dataset of 1.1 million Web images and 1.5 million QA pairs.

Compared to existing cQA websites, which rely on text for searching similar QA pairs, our system is novel in answering multimodal questions with visual content analysis. Compared to recent mobile search engines, such as Google Goggles2 and SnapTell3, the developed system is capable of handling instances with non-rigid deformations and non-planar surfaces, such as t-shirts, sneakers, and handbags.

2 http://www.google.com/mobile/goggles/#text
3 http://www.snaptell.com/

[Figure 2: offline indexing extracts features from the reference dataset, quantizes them, and computes Hamming embeddings, with the vocabulary and HE medians trained on MIRFLICKR; online retrieval extracts query features, quantizes with multiple assignment, applies Hamming embedding, ranks by BoW matching, and finishes with a spatial check via DT.]

Figure 2: Our retrieval model for instance search.

Furthermore, instead of returning relevant webpages to users, our system is more specific: it offers answers and suggested questions that may directly address the information need. Note that our system is fully automatic and requires no interaction with human experts. It also handles unknown products by inferring the name of the product under query, which is extremely useful in a question-answering scenario.

The rest of the paper is organized as follows. Section 2 details the method and implementation of the system, Section 3 demonstrates the system with a brief evaluation, and Section 4 concludes the paper.

2. TECHNOLOGY
Figure 1 shows the framework of FashionAsk. The system is composed of two main components: Instance Naming and QA Matching. The former infers the name of a fashion instance, while the latter searches for similar QA pairs.

2.1 Instance Naming
The basic idea is to search for similar images of the given instance and then infer its name by analyzing their metadata. The main challenge is that fashion instances can appear in different forms and against different backgrounds, as shown in Figure 5(a). Our framework therefore applies spatial verification of instances by triangulation matching to address this challenge, as briefly elaborated next.

Instance Search: Figure 2 depicts the retrieval model used for large-scale instance search. Our model is based on the Bag-of-visual-Words model [7]. First, a hierarchical vocabulary tree [4] with one million leaf nodes is trained on the MIR-Flickr dataset [1]. To enable robust search, each word is indexed with its spatial location and a 32-bit Hamming signature [2]. To support efficient large-scale search, the index files are distributed across several machines. During online search, a query is matched against images in the dataset by traversing the inverted file with Hamming signature verification. For each matched image, triangulation meshes are further constructed (as in Figure 3) to characterize spatial consistency using DT [10]. This differs from traditional techniques such as WGC [2] and RANSAC [6], which assume a strict linear transformation model. The triangulation-based matching assumes no such model; instead, it locally measures the relative positioning of visual words to derive coherency scores for image ranking. This matching is found to be more effective than other methods in handling objects with non-rigid shapes and non-planar surfaces.
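To make the online lookup concrete, below is a minimal sketch of inverted-file traversal with Hamming-signature verification. It is our own illustration rather than the system's implementation: the function names, data layout, and threshold value are assumptions, and a production system would additionally apply tf-idf weighting and multiple assignment.

```python
from collections import defaultdict

HAM_THRESHOLD = 12  # hypothetical cut-off on the 32-bit Hamming distance

def hamming_distance(sig_a, sig_b):
    """Number of differing bits between two 32-bit signatures."""
    return bin(sig_a ^ sig_b).count("1")

def instance_search(query_features, inverted_file):
    """Vote for database images whose visual-word matches pass the
    Hamming test.

    query_features: list of (word_id, signature, (x, y)) tuples.
    inverted_file:  dict mapping word_id -> list of
                    (image_id, signature, (x, y)) postings.
    Returns a ranked list of image ids plus the matched point pairs,
    which would then feed the triangulation-based spatial check.
    """
    votes = defaultdict(float)
    matches = defaultdict(list)
    for word_id, q_sig, q_xy in query_features:
        for image_id, d_sig, d_xy in inverted_file.get(word_id, []):
            # Features quantized to the same word can still differ a lot;
            # the signature test filters out such false matches.
            if hamming_distance(q_sig, d_sig) <= HAM_THRESHOLD:
                votes[image_id] += 1.0
                matches[image_id].append((q_xy, d_xy))
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked, matches
```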

Figure 3: Examples of topology-based matching. For each example, matching points are plotted on a pair of images, and the corresponding mesh graphs are shown at their sides.

[Figure 4 shows the syntactic parse trees of three questions: “What does Kony poster mean?” (left), “What is better than kony 2012?” (middle), and “What does kony 2012 mean?” (right).]

Figure 4: Illustration of syntactic tree matching. The syntactic parts shared by the left and middle trees are highlighted with blue, green, and orange boundaries; the only difference between the left and right trees is marked with a red box.

For example, although the flapping wings (Figure 3, left) violate any linear transformation, corresponding features on each wing are still consistently captured by the graph. The middle and right of Figure 3 present two additional examples, for a relevant and an irrelevant image respectively. The graph is insensitive to scale/rotation and to small variations that preserve a stable topology, while being sensitive to irrelevant images whose matching locations are random.

Named Entity Extraction: Given the list of retrieved images, named entity (NE) extraction is performed by parsing the metadata (title and description) of each image with the Berkeley Parser [5]. Noun phrases are then extracted directly as candidate names. The likelihood of a phrase being the name of the instance is measured by its phrase frequency weighted by the coherency scores from instance search.
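To illustrate the weighting step, here is a minimal sketch under our own assumptions (the paper does not give the exact formula, and the names and data layout below are hypothetical): each candidate noun phrase accumulates the coherency scores of the retrieved images whose metadata contain it.

```python
from collections import defaultdict

def score_candidate_names(image_phrases, coherency):
    """Rank candidate names for the query instance.

    image_phrases: dict mapping image_id -> list of noun phrases parsed
                   from that image's title and description.
    coherency:     dict mapping image_id -> coherency score from
                   instance search.
    Returns phrases sorted by frequency weighted by coherency.
    """
    scores = defaultdict(float)
    for image_id, phrases in image_phrases.items():
        weight = coherency.get(image_id, 0.0)
        for phrase in phrases:
            # Each occurrence counts, weighted by how well its source
            # image matched the query instance.
            scores[phrase.lower()] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```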

2.2 Questions Matching
Since questions on cQA websites are asked in natural language, similar questions vary in their lexical, syntactic, and semantic features. Retrieving similar questions is therefore non-trivial and requires natural language processing. For speed, a two-stage question matching method is developed. In the first stage, keywords extracted from the user's question, together with the top name candidates, are used to retrieve a small set of similar questions from the QA dataset with the BM25 retrieval model, which ranks documents based on a probabilistic model. In the second stage, the original question is augmented by replacing pronouns such as “this” with candidate names, and matched against the set of retrieved questions.
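For reference, the first-stage ranking can be sketched with the standard Okapi BM25 scoring function. This is textbook BM25 rather than the system's exact configuration; the parameters k1 = 1.2 and b = 0.75 are common defaults, not values reported here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query terms
    with Okapi BM25; higher means more similar."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df.get(t, 0) == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

In this sketch, the query terms would be the extracted keywords plus the top name candidates, and each document a tokenized archived question.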

Figure 5: Interface of FashionAsk: (a) examples of fashion pictures in our dataset; (b) asking a question by picking or uploading a picture, followed by the answer and suggested questions returned by our system.

The matching technique is implemented based on the syntactic tree matching algorithm [8], which further considers the grammatical structure of questions. Figure 4 shows an example of this matching paradigm. In general, sentences are parsed into syntactic elements, which are matched together with their grammar. All three sentences in Figure 4 may look similar in terms of their elementary words and phrases. However, the grammatical structure can differ considerably between the left and middle sentences: only three syntactic parts match, marked in green, orange, and blue. Between the left and right sentences, in contrast, only one part differs (marked in red) and all other syntactic branches match. As a result, questions with both similar wording and similar grammatical structure are selected as our question candidates.
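To give a flavour of the idea, the sketch below measures how many labeled subtrees two parse trees share. It is a deliberate simplification of our own; the actual algorithm in [8] performs weighted matching of tree fragments rather than this crude Jaccard overlap.

```python
def subtrees(tree):
    """Enumerate the labeled subtrees of a parse tree given as nested
    tuples, e.g. ("NP", ("NN", "kony"), ("NN", "poster"))."""
    if isinstance(tree, str):  # a leaf word has no subtrees
        return []
    result = [tree]
    for child in tree[1:]:
        result.extend(subtrees(child))
    return result

def tree_similarity(tree_a, tree_b):
    """Jaccard overlap of labeled subtrees; grammatically similar
    questions share many syntactic fragments."""
    sub_a, sub_b = set(subtrees(tree_a)), set(subtrees(tree_b))
    union = sub_a | sub_b
    return len(sub_a & sub_b) / len(union) if union else 0.0
```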

3. SYSTEM AND USER INTERFACE
Interface: The system can be accessed through a Web browser on either a PC or a mobile phone. As shown in Figure 5(a), a user can either upload a picture or pick an image randomly listed by our system, and then ask a fashion-related question. Users can also crop the region of interest from an image when issuing a question. The returned page shows the name of the visual instance, the most likely answer, and a few suggested questions retrieved from our database, as illustrated in Figure 5(b).

System Setup: The system currently contains 1.1 million Web images, crawled from Flickr by querying several fashion-related keywords (e.g., t-shirt, sneaker, and handbag) and 142 popular tags. Our dataset also contains 1.5 million QA pairs crawled from Yahoo! Answers, one quarter of which are related to fashion; the rest were randomly crawled as distractors. To enable fast response to user queries, several machines serve the million-scale instance search subsystem. For the QA matching part, the text processing is implemented with the help of Lucene.

Performance: With the current setup, tested on 180 multimedia questions involving 50 fashion instances, the mean Average Precision of instance search is 18.8%, the accuracy of finding the right instance name as the top-1 candidate is 21.7%, and the accuracy of returning a right answer or a related question in the top-10 list is 19.8%. The system currently runs on two machines with 2.67 GHz CPUs and 16 GB of main memory.

Speed Efficiency: Under the current setup, instance naming takes 1.7 seconds (including feature extraction, quantization, and search), and question matching takes about 2.9 seconds to rank similar questions. In total, a user waits approximately five seconds for the result. This time could be reduced to sub-second if more than ten machines were employed to parallelize instance search and question matching.

Interested readers can access our online video demonstration4 for more details.

4. CONCLUSIONS
This demo presents a first step towards a large-scale multimodal question-answering system, which enables asking questions about instances even when their names are unknown. We show that, with the help of instance search, questions can be asked in a much easier way, freeing hands from heavy typing. We also show that when only a narrow domain (such as fashion) is targeted, the performance of multimodal question answering can be quite reasonable.

5. REFERENCES
[1] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In ACM MIR, 2008.

[2] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.

[3] L. Nie, M. Wang, Z. Zha, G. Li, and T.-S. Chua. Multimedia answering: enriching text QA with media information. In ACM SIGIR, pages 695–704, 2011.

[4] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[5] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In International Conference on Computational Linguistics, pages 433–440, 2006.

[6] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.

[7] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, volume 2, pages 1470–1477, 2003.

[8] K. Wang, Z. Ming, and T.-S. Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR, 2009.

[9] T. Yeh, J. J. Lee, and T. Darrell. Photo-based question answering. In ACM MM, 2008.

[10] W. Zhang and C.-W. Ngo. Searching visual instances with topology checking and context modeling. In International Conference on Multimedia Retrieval, pages 57–64, 2013.

4 http://www.youtube.com/watch?v=6SF2LPU4Cds