FashionAsk: A Multimedia based Question-Answering System
Wei Zhang, Lei Pang, Chong-Wah Ngo
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
{wzhang34-c, leipang3-c}@my.cityu.edu.hk, [email protected]
ABSTRACT

We demonstrate a multimedia-based question-answering system, named FashionAsk, which allows users to ask questions referring to pictures snapped with mobile devices. Instead of asking verbose questions to describe visual instances, pictures are provided directly as part of the question. The significance of our system is that (1) it is fully automatic, with no human expert involved; and (2) it bypasses the requirement of knowing the name of the object under query, which is mostly unknown to the asker. To answer these multi-modal questions, FashionAsk performs large-scale instance search and metadata pooling to name the instance under query, and then matches against similar questions from community-contributed QA websites to retrieve answers. This demonstration is conducted on a million-scale dataset of Web images and QA pairs in the domain of fashion products. Asking a multimedia question through FashionAsk can take as little as five seconds to retrieve the candidate answer as well as suggested questions.
Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms

Algorithms, Performance, Experimentation
Keywords

Multimedia question answering, instance naming, question matching
1. INTRODUCTION

Community-contributed question-answering (cQA) websites, such as Yahoo! Answers1, have gained popularity in the past few years, and have accumulated hundreds of millions of question-answer (QA) pairs. Most of the QA pairs are text-based: a user posts a question by typing and expects text answers in return, either by search or from other users. Traditionally, a user can get answers from the community either by posting a question and waiting for human answers, or by searching past archives. The first way requires long waiting
1 http://answers.yahoo.com/
Copyright is held by the author/owner(s).
ACM-HK Student Research and Career Day, 2013
Figure 1: Framework of FashionAsk. (The figure shows the pipeline: instance search with offline indexing, instance naming via named-entity and keyword extraction, then similar question search and question matching with syntactic reranking over a Q&A archive. An example multimodal question, a picture plus the text "Does this have anything to do with 'Avengers'?", yields suggested answers such as "Hand painted sneakers from Sole Junkie..." and "Inspired by Avengers, Sole Junkie spent hours perfecting his drawings: Thor, Iron ...".)
time, and the latter requires certain knowledge of the object under query (e.g., its name). However, with the popularity of mobile devices for snapping pictures, there has recently been a shift in how community users post questions. For instance, instead of asking a question by providing a long text description of a visual instance, visual examples are directly included as part of the question [9, 3]. For example, asking the question "Who is the designer of this rock-and-roll t-shirt?" by providing a picture is far more convenient than describing the t-shirt verbosely.
In this demo, we demonstrate a system, named FashionAsk, capable of answering multimedia-based questions in the domain of fashion. Specifically, a question is answered by first naming the visual instance in the given picture, e.g., "Twisted Sister Rock-and-Roll t-shirt", through a large-scale search of Web images. The original text question is then augmented with the named entity, and searched against cQA archives. To this end, the system returns potential answers in text, as well as shortlisting a few suggested QA pairs. Our current system operates on a dataset of 1.1 million Web images and 1.5 million QA pairs.
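The augmentation step above can be sketched as follows. This is a minimal illustration, assuming a plain regular-expression substitution of the demonstrative pronoun; the function name and the substitution rule are ours for illustration, not the system's actual implementation.

```python
import re

def augment_question(question, name_candidates):
    """Replace a demonstrative pronoun with each candidate instance name.

    Simplified stand-in for the system's augmentation step: produces one
    augmented query per candidate name inferred by instance naming.
    """
    out = []
    for name in name_candidates:
        # Swap the first standalone "this"/"it" for the candidate name.
        augmented = re.sub(r"\b(this|it)\b", name, question, count=1,
                           flags=re.IGNORECASE)
        out.append(augmented)
    return out
```

Each augmented query can then be searched against the cQA archive in place of the original, pronoun-bearing question.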
Compared to existing cQA websites, which rely on text for searching similar QA pairs, our system is novel in answering multimodal questions with visual content analysis. Compared to recent mobile search engines, such as Google Goggles2 and SnapTell3, the developed system is capable of handling instances with non-rigid motions and non-planar surfaces, such as t-shirts, sneakers, and handbags. Furthermore, instead of returning relevant webpages to users, our system is more specific in allowing users to expect answers and suggested questions that might directly address the information need. Note that our system is fully automatic and does not involve interaction with human experts. Our system can also handle unknown products by offering facilities for inferring the name of the product under query, which is extremely useful in the scenario of question answering.

The rest of the paper is organized as follows. Section 2 details the method and implementation details of the system. Section 3 demonstrates the system with a brief evaluation, and finally Section 4 concludes the paper.

2 http://www.google.com/mobile/goggles/#text
3 http://www.snaptell.com/

Figure 2: Our retrieval model for instance search. (The figure shows offline indexing: feature extraction, quantization, and Hamming embedding over the reference dataset, with vocabulary training and Hamming-median training on MIRFLICKR; and online retrieval: feature extraction, quantization with multiple assignment, Hamming embedding, BoW ranking, and a spatial check via DT.)
2. TECHNOLOGY

Figure 1 shows the framework of FashionAsk. The system is composed of two main components: Instance Naming and QA Matching. The former infers the name of a fashion instance, while the latter searches for similar QA pairs.
2.1 Instance Naming

The basic idea is to search for images similar to the given instance, and then infer its name by analyzing their metadata. The main challenges are that fashion instances can appear in different forms and against different backgrounds, as shown in Figure 5(a). Our framework therefore applies spatial verification of the instance by triangulation matching to address these challenges, as briefly elaborated next.

Instance Search. Figure 2 depicts the retrieval model used for large-scale instance search. Our model is based on the Bag-of-visual-Words model [7]. First, a hierarchical vocabulary tree [4] with one million leaf nodes is trained on the MIR-Flickr dataset [1]. To enable robust search, each word is indexed with its spatial location and a 32-bit Hamming signature [2]. To support efficient large-scale search, the index files are distributed across several machines. During online search, a query is matched against images in the dataset by traversing the inverted file with Hamming signature verification. For each matched image, triangulation meshes are further constructed (as in Figure 3) to characterize spatial consistency using Delaunay triangulation (DT) [10]. This technique differs from traditional techniques such as WGC [2] and RANSAC [6], which assume a strict linear transformation model. In contrast, triangulation-based matching assumes no such model, but locally measures the relative positioning of visual words to derive coherency scores for image ranking. The matching is found to be more effective than other methods in handling objects with non-rigid shapes and non-planar surfaces. For example, although the flapping wings (Figure 3, left) violate any linear transformation model, corresponding features on each local wing are still consistently captured by the graph. The middle and right examples in Figure 3 show a relevant and an irrelevant image, respectively. The graph is insensitive to scale, rotation, and small variations that preserve a stable topology, while being sensitive to irrelevant images whose matches fall in random locations.

Figure 3: Examples of topology-based matching. For each example, matching points are plotted on a pair of images, and the corresponding mesh graphs are shown by their sides.

Figure 4: Illustration of syntactic tree matching. (The figure shows parse trees for three questions: "What does Kony poster mean?", "What is better than kony 2012?", and "What does kony 2012 mean?". The same syntactic parts between the left and middle trees are highlighted with blue, green, and orange boundaries; the only difference between the left and right trees is marked with a red box.)

Named Entity Extraction. Given the list of retrieved images, named entity (NE) extraction is performed by parsing the metadata (title and description) of each image with the Berkeley Parser [5]. Noun phrases are then extracted directly as candidate names. The likelihood of a phrase being the name of the instance is measured by its phrase frequency weighted by the coherency scores from instance search.
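The coherency-scoring idea behind the topology check can be made concrete with a simplified sketch. As an assumption of ours for brevity, it replaces the Delaunay mesh with a k-nearest-neighbour graph and scores a match by the fraction of local-topology edges preserved between the query and reference point sets; the paper's actual construction uses DT meshes.

```python
from math import hypot

def knn_edges(points, k=3):
    """Build an undirected k-NN graph over 2-D points.

    A stand-in for the Delaunay mesh used in the paper: each point is
    linked to its k nearest neighbours, giving a local topology graph.
    """
    edges = set()
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def coherency_score(query_pts, ref_pts, k=3):
    """Fraction of local-topology edges preserved across the two point sets.

    query_pts[i] and ref_pts[i] are assumed to be matched visual words.
    A high score means the relative positioning of matches is consistent,
    in the spirit of the mesh-overlap idea behind the DT check.
    """
    eq = knn_edges(query_pts, k)
    er = knn_edges(ref_pts, k)
    if not eq:
        return 0.0
    return len(eq & er) / len(eq)
```

Matched points related by uniform scaling or translation preserve their neighbourhood graph and score 1.0, while randomly permuted matches break most edges and score low, mirroring how the topology check separates relevant from irrelevant images.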
2.2 Question Matching

As questions on cQA websites are asked in natural language, similar questions vary in their lexical, syntactic, and semantic features. Hence, retrieving similar questions is non-trivial and requires natural language processing. For speed, a two-stage question matching method is developed. In the first stage, the keywords extracted from the user's question, together with the top name candidates, are used to retrieve a small set of similar questions from a QA dataset with the BM25 retrieval model, which ranks documents based on a probabilistic model. In the second stage, the original question is augmented by replacing the pronoun, such as "this", with candidate names, and matched against the set of retrieved questions. The matching technique is implemented based on the syntactic tree matching algorithm [8], which further considers the grammatical structure of questions. Figure 4 shows an example of this matching paradigm. In general, sentences are parsed into syntactic elements for matching against grammar. All three sentences in Figure 4 may look similar in terms of their elementary words and phrases. However, the grammatical structure differs greatly between the left and middle sentences: only three syntactic parts match, marked with green, orange, and blue. Between the left and right sentences, in contrast, only one part differs (marked in red) and all other syntactic branches match. As a result, questions with both similar wording and similar grammatical structure are selected as question candidates.

Figure 5: Interface for FashionAsk: (a) examples of fashion pictures in our dataset; (b) asking a question by picking or uploading a picture, followed by the answer and suggested questions returned by our system.
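The first-stage BM25 retrieval can be sketched as a minimal, self-contained implementation over tokenised questions. The parameter values k1 = 1.2 and b = 0.75 are common defaults, not values stated in the paper, and the tokenisation is ours for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25.

    docs is a list of token lists (here, tokenized cQA questions);
    returns one score per document under a probabilistic ranking model.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

In the system's setting, the query terms would be the keywords extracted from the user's question plus the top name candidates, e.g. `["twisted", "sister", "t-shirt", "designer"]`, and the highest-scoring questions form the candidate set passed to syntactic tree matching.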
3. SYSTEM AND USER INTERFACE

Interface: The system can be accessed through a Web browser on either a PC client or a mobile phone. As shown in the interface in Figure 5(a), a user can either upload a picture or pick an image randomly listed by our system, and then ask a fashion-related question. Users can also crop the region of interest from an image when issuing questions. The returned page shows the name of the visual instance, the most likely answer, and a few suggested questions retrieved from our database, as illustrated in Figure 5(b).

System Setup: The system currently consists of 1.1 million Web images, crawled from Flickr by querying several fashion-related keywords (e.g., t-shirt, sneaker, and handbag) and 142 popular tags. Our dataset also has 1.5 million QA pairs crawled from Yahoo! Answers, one quarter of which are related to fashion, while the others are randomly crawled as distractors. To enable fast response to a user's query, we use several machines to serve the million-scale instance search subsystem. For the QA matching part, we implement the text processing with the help of Lucene.

Performance: With the current setup, the performance of our system when tested on 180 multimedia questions involving 50 fashion instances is as follows: the mean Average Precision of instance search is 18.8%; the accuracy of finding the right instance name as the top-1 candidate is 21.7%; and the accuracy of returning a right answer or a related question in the top-10 list is 19.8%. The system currently runs on two
machines with 2.67GHz CPUs and 16GB main memory.

Speed Efficiency: With the current setup, instance naming takes 1.7 seconds (including feature extraction, quantization, and search), and question matching takes about 2.9 seconds to rank similar questions. In total, the user needs to wait approximately five seconds to get the result. This time could be reduced further to sub-second if more than ten machines were employed to parallelize instance searching and question matching.
Interested readers can access our online video demonstration4 for more details.
4. CONCLUSIONS

This demo presents a first step towards a large-scale multimodal question answering system, which enables asking questions about instances even when their names are unknown. We show that, with the help of instance search, questions can be asked in a much easier way, freeing hands from heavy typing. We also show that if only a narrow domain (such as fashion) is targeted, the performance of multimodal question answering can be quite reasonable.
5. REFERENCES

[1] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In ACM MIR, 2008.
[2] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
[3] L. Nie, M. Wang, Z. Zha, G. Li, and T.-S. Chua. Multimedia answering: enriching text QA with media information. In ACM SIGIR, pages 695–704, 2011.
[4] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.
[5] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In International Conference on Computational Linguistics, pages 433–440, 2006.
[6] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[7] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, volume 2, pages 1470–1477, Oct. 2003.
[8] K. Wang, Z. Ming, and T.-S. Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR, 2009.
[9] T. Yeh, J. J. Lee, and T. Darrell. Photo-based question answering. In ACM MM, 2008.
[10] W. Zhang and C.-W. Ngo. Searching visual instances with topology checking and context modeling. In International Conference on Multimedia Retrieval, pages 57–64, 2013.
4 http://www.youtube.com/watch?v=6SF2LPU4Cds