
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 483–494,

Beijing, China, July 26-31, 2015. ©2015 Association for Computational Linguistics

Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding

Yun-Nung Chen, William Yang Wang, Anatole Gershman, and Alexander I. Rudnicky
School of Computer Science, Carnegie Mellon University

5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA
{yvchen, yww, anatoleg, air}@cs.cmu.edu

Abstract

Spoken dialogue systems (SDS) typically require a predefined semantic ontology to train a spoken language understanding (SLU) module. In addition to the annotation cost, a key challenge for designing such an ontology is to define a coherent slot set while considering their complex relations. This paper introduces a novel matrix factorization (MF) approach to learn latent feature vectors for utterances and semantic elements without the need of corpus annotations. Specifically, our model learns the semantic slots for a domain-specific SDS in an unsupervised fashion, and carries out semantic parsing using latent MF techniques. To further consider the global semantic structure, such as inter-word and inter-slot relations, we augment the latent MF-based model with a knowledge graph propagation model based on a slot-based semantic graph and a word-based lexical graph. Our experiments show that the proposed MF approaches produce better SLU models that are able to predict semantic slots and word patterns taking into account their relations and domain-specificity in a joint manner.

1 Introduction

A key component of a spoken dialogue system (SDS) is the spoken language understanding (SLU) module: it parses the users' utterances into semantic representations; for example, the utterance "find a cheap restaurant" can be parsed into (price=cheap, target=restaurant) (Pieraccini et al., 1992). To design the SLU module of an SDS, most previous studies relied on predefined slots¹ for training the decoder (Seneff, 1992; Dowding et al., 1993; Gupta et al., 2006; Bohus and Rudnicky, 2009). However, these predefined semantic slots may bias the subsequent data collection process, and the cost of manually labeling utterances for updating the ontology is expensive (Wang et al., 2012).

¹ A slot is defined as a basic semantic unit in SLU, such as "price" and "target" in the example.

In recent years, this problem led to the development of unsupervised SLU techniques (Heck and Hakkani-Tur, 2012; Heck et al., 2013; Chen et al., 2013b; Chen et al., 2014b). In particular, Chen et al. (2013b) proposed a frame-semantics based framework for automatically inducing semantic slots given raw audio. However, these approaches generally do not explicitly learn the latent factor representations to model the measurement errors (Skrondal and Rabe-Hesketh, 2004), nor do they jointly consider the complex lexical, syntactic, and semantic relations among words, slots, and utterances.

Another challenge of SLU is the inference of the hidden semantics. Consider the user utterance "can i have a cheap restaurant": from its surface patterns, we can see that it includes explicit semantic information about "price (cheap)" and "target (restaurant)"; however, it also includes hidden semantic information, such as "food" and "seeking", since the SDS needs to infer that the user wants to "find" some cheap "food", even though these are not directly observed in the surface patterns. Nonetheless, these implicit semantics are important semantic concepts for domain-specific SDSs. Traditional SLU models use discriminative classifiers (Henderson et al., 2012) to predict whether the predefined slots occur in the utterances or not, ignoring the unobserved concepts and the hidden semantic information.

In this paper, we take a rather radical approach: we propose a novel matrix factorization (MF) model for learning latent features for SLU, taking account of additional information such as the word relations, the induced slots, and the slot relations. To further consider the global coherence of induced slots, we combine the MF model with a knowledge graph propagation based model, fusing both a word-based lexical knowledge graph and a slot-based semantic graph. In fact, as shown in the Netflix challenge, MF is credited as the most useful technique for recommendation systems (Koren et al., 2009). Also, the MF model considers the unobserved patterns and estimates their probabilities instead of viewing them as negative examples. However, to the best of our knowledge, the MF technique is not yet well understood in the SLU and SDS communities, and it is not straightforward to use MF methods to learn latent feature representations for semantic parsing in SLU. To evaluate the performance of our model, we compare it to standard discriminative SLU baselines, and show that our MF-based model is able to produce strong results in semantic decoding, and that the knowledge graph propagation model further improves the performance. Our contributions are three-fold:

• We are among the first to study matrix factorization techniques for unsupervised SLU, taking account of additional information;

• We augment the MF model with a knowledge graph propagation model, increasing the global coherence of semantic decoding using induced slots;

• Our experimental results show that the MF-based unsupervised SLU outperforms strong discriminative baselines, obtaining promising results.

In the next section, we outline the related work in unsupervised SLU and latent variable modeling for spoken language processing. Section 3 introduces our framework. The detailed MF approach is explained in Section 4. We then introduce the global knowledge graphs for MF in Section 5. Section 6 shows the experimental results, and Section 7 concludes.

2 Related Work

Unsupervised SLU. Tur et al. (2011; 2012) were among the first to consider unsupervised approaches for SLU, where they exploited query logs for slot-filling. In a subsequent study, Heck and Hakkani-Tur (2012) studied the Semantic Web for an unsupervised intent detection problem in SLU, showing that results obtained from the unsupervised training process align well with the performance of traditional supervised learning. Following this success of unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection (Hakkani-Tur et al., 2013; Chen et al., 2014a), entity extraction (Wang et al., 2014), and extending domain coverage (El-Kahky et al., 2014; Chen and Rudnicky, 2014). However, most of the studies above do not explicitly learn latent factor representations from the data, while we hypothesize that better robustness to noisy data can be achieved by explicitly modeling the measurement errors (usually produced by automatic speech recognizers (ASR)) using latent variable models and taking additional local and global semantic constraints into account.

Latent Variable Modeling in SLU. Early studies on latent variable modeling in speech included the classic hidden Markov model for statistical speech recognition (Jelinek, 1997). Recently, Celikyilmaz et al. (2011) were the first to study the intent detection problem using query logs and a discrete Bayesian latent variable model. In the field of dialogue modeling, the partially observable Markov decision process (POMDP) (Young et al., 2013) model is a popular technique for dialogue management, reducing the cost of hand-crafted dialogue managers while producing robustness against speech recognition errors. More recently, Tur et al. (2013) used a semi-supervised LDA model to show improvement on the slot filling task. Also, Zhai and Williams (2014) proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results. However, for unsupervised learning for SLU, it is not obvious how to incorporate additional information in the HMMs. To the best of our knowledge, this paper is the first to consider MF techniques for learning latent feature representations in unsupervised SLU, taking various local and global lexical, syntactic, and semantic information into account.

3 The Proposed Framework

This paper introduces a matrix factorization technique for unsupervised SLU. The proposed framework is shown in Figure 1(a). Given the utterances, the task of the SLU model is to decode their surface patterns into semantic forms and, at the same time, to differentiate the target semantic concepts from the generic semantic space for task-oriented SDSs. Note that our model does not require any human-defined slots or domain-specific semantic representations for utterances.

In the proposed model, we first build a feature matrix to represent the training utterances, where each row represents an utterance, and each column refers to an observed surface pattern or an induced slot candidate. Figure 1(b) illustrates an example of the matrix.

484

Page 3: Matrix Factorization with Knowledge Graph Propagation for … · 2015-07-22 · Knowledge Graph Propagation Model Word Relation Model Slot Relation Model Knowledge Graph Construction.


Figure 1: (a): The proposed framework. (b): Our matrix factorization method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, shaded circles are inferred facts. The slot induction maps (yellow arrow) observed surface patterns to semantic slot candidates. The word relation model (blue arrow) constructs correlations between surface patterns. The slot relation model (pink arrow) learns the slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization (gray arrow) incorporates these models jointly, and produces a coherent, domain-specific SLU model.

Given a testing utterance, we convert it into a vector based on the observed surface patterns, and then fill in the missing values of the slots. In the first utterance in the figure, although the semantic slot food is not observed, the utterance implies the meaning facet food. The MF approach is able to learn the latent feature vectors for utterances and semantic elements, inferring implicit semantic concepts to improve the decoding process, namely by filling the matrix with probabilities (lower part of the matrix).

The feature model is built on the observed word patterns and slot candidates, where the slot candidates are obtained from the slot induction component through frame-semantic parsing (the yellow block in Figure 1(a)) (Chen et al., 2013b). Section 4.1 explains the details of the feature model.

In order to consider the additional inter-word and inter-slot relations, we propose a knowledge graph propagation model based on two knowledge graphs, which includes a word relation model (blue block) and a slot relation model (pink block), described in Section 4.2. The method of automatic knowledge graph construction is introduced in Section 5, where we leverage distributed word embeddings associated with typed syntactic dependencies to model the relations (Mikolov et al., 2013b; Mikolov et al., 2013c; Levy and Goldberg, 2014; Chen et al., 2015).

Finally, we train the SLU model by learning latent feature vectors for utterances and slot candidates through MF techniques. Combined with a knowledge graph propagation model based on word/slot relations, the trained SLU model estimates the probability that each semantic slot occurs in the testing utterance, and simultaneously how likely each slot is to be domain-specific. In other words, the SLU model is able to transform the testing utterances into domain-specific semantic representations without human involvement.
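To make this decoding step concrete, the sketch below (Python, our own illustrative code rather than the authors' implementation) folds a test utterance into a learned latent space by averaging the latent vectors of its observed word patterns and scores every slot candidate with the sigmoid of a dot product; the fold-in heuristic, variable names, and toy dimensions are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_utterance(observed_words, word_vecs, slot_vecs, slot_bias):
    """Score every slot candidate for one test utterance.

    observed_words: list of word-pattern ids seen in the utterance.
    word_vecs:      |W| x d latent vectors for word patterns (learned by MF).
    slot_vecs:      |S| x d latent vectors for slot candidates.
    slot_bias:      length-|S| bias terms.
    Fold-in heuristic (assumption): the utterance vector is the mean of its
    observed word-pattern vectors.
    """
    u_vec = word_vecs[observed_words].mean(axis=0)
    theta = slot_vecs @ u_vec + slot_bias   # natural parameters theta_{u,x}
    return sigmoid(theta)                   # estimated p(M_{u,x} = 1) per slot x

# toy usage: 5 word patterns, 3 slot candidates, 4-dimensional latent space
rng = np.random.default_rng(0)
W, S, d = 5, 3, 4
probs = decode_utterance([0, 2], rng.normal(size=(W, d)),
                         rng.normal(size=(S, d)), np.zeros(S))
print(probs)  # one probability per slot candidate
```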

4 The Matrix Factorization Approach

Figure 2: An example of probabilistic frame-semantic parsing on ASR output. FT: frame target. FE: frame element. LU: lexical unit.

Considering the benefits brought by MF techniques, including 1) modeling the noisy data, 2) modeling hidden semantics, and 3) modeling the long-range dependencies between observations, in this work we apply an MF approach to SLU modeling for SDSs. In our model, we use U to denote the set of input utterances, W as the set of word patterns, and S as the set of semantic slots that we would like to predict. The pair of an utterance u ∈ U and a word pattern/semantic slot x ∈ {W + S}, 〈u, x〉, is a fact. The input to our model is a set of observed facts O, and the observed facts for a given utterance are denoted by {〈u, x〉 ∈ O}. The goal of our model is to estimate, for a given utterance u and a given word pattern/semantic slot x, the probability p(M_{u,x} = 1), where M_{u,x} is a binary random variable that is true if and only if x is the word pattern/domain-specific semantic slot in the utterance u. We introduce a series of exponential family models that estimate the probability using a natural parameter θ_{u,x} and the logistic sigmoid function:

p(M_{u,x} = 1 | θ_{u,x}) = σ(θ_{u,x}) = 1 / (1 + exp(−θ_{u,x}))    (1)

We construct a matrix M_{|U|×(|W|+|S|)} of observed facts for MF by integrating a feature model and a knowledge graph propagation model, described below.

4.1 Feature Model

First, we build a word pattern matrix F_w with binary values based on observations, where each row represents an utterance and each column refers to an observed unigram. In other words, F_w carries the basic word vectors for the utterances, which is illustrated as the left part of the matrix in Figure 1(b).

To induce the semantic elements, we parse all ASR-decoded utterances in our corpus using SEMAFOR², a state-of-the-art semantic parser for frame-semantic parsing (Das et al., 2010; Das et al., 2013), and extract all frames from the semantic parsing results as slot candidates (Chen et al., 2013b; Dinarelli et al., 2009). Figure 2 shows an example of an ASR-decoded output parsed by SEMAFOR. Three FrameNet-defined frames (capability, expensiveness, and locale_by_use) are generated for the utterance, which we consider as slot candidates for a domain-specific dialogue system (Baker et al., 1998). Then we build a slot matrix F_s with binary values based on the induced slots, which also denotes the slot features for the utterances (right part of the matrix in Figure 1(b)).

² http://www.ark.cs.cmu.edu/SEMAFOR/

To build the feature model M_F, we concatenate two matrices:

M_F = [ F_w  F_s ],    (2)

which is the upper part of the matrix in Figure 1(b) for training utterances. Note that we do not use any annotations, so all slot candidates are included.
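As a concrete illustration of the feature model, the sketch below builds binary F_w and F_s matrices from two toy utterances with hypothetical frame annotations (the vocabularies and parses are placeholders, not SEMAFOR output) and concatenates them as in (2).

```python
import numpy as np

# toy training utterances with hypothetical frame-semantic parses
utterances = [
    {"words": ["cheap", "restaurant"],            # "i would like a cheap restaurant"
     "frames": ["expensiveness", "locale_by_use"]},
    {"words": ["restaurant", "chinese", "food"],  # "find a restaurant with chinese food"
     "frames": ["locale_by_use", "food"]},
]

word_vocab = sorted({w for u in utterances for w in u["words"]})
slot_vocab = sorted({f for u in utterances for f in u["frames"]})
w_idx = {w: i for i, w in enumerate(word_vocab)}
s_idx = {s: i for i, s in enumerate(slot_vocab)}

F_w = np.zeros((len(utterances), len(word_vocab)))   # binary word-pattern matrix
F_s = np.zeros((len(utterances), len(slot_vocab)))   # binary induced-slot matrix
for i, u in enumerate(utterances):
    for w in u["words"]:
        F_w[i, w_idx[w]] = 1.0
    for f in u["frames"]:
        F_s[i, s_idx[f]] = 1.0

M_F = np.hstack([F_w, F_s])   # feature model, Eq. (2): M_F = [F_w  F_s]
print(M_F.shape)              # |U| x (|W| + |S|)
```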

4.2 Knowledge Graph Propagation Model

Since SEMAFOR was trained on FrameNet annotations, which have a more generic frame-semantic context, not all frames from the parsing results can be used as actual slots in a domain-specific dialogue system. For instance, in Figure 2, we see that the frames "expensiveness" and "locale_by_use" are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the "capability" frame does not convey particularly valuable information for SLU.

Assuming that domain-specific concepts are usually related to each other, considering the global relations between semantic slots induces a more coherent slot set. It has been shown that relations on knowledge graphs help make decisions on domain-specific slots (Chen et al., 2015). We consider two directed graphs, a semantic knowledge graph and a lexical knowledge graph: each node in the semantic knowledge graph is a slot candidate s_i generated by the frame-semantic parser, and each node in the lexical knowledge graph is a word w_j.

• Slot-based semantic knowledge graph is built as G_s = 〈V_s, E_ss〉, where V_s = {s_i ∈ S} and E_ss = {e_ij | s_i, s_j ∈ V_s}.

• Word-based lexical knowledge graph is built as G_w = 〈V_w, E_ww〉, where V_w = {w_i ∈ W} and E_ww = {e_ij | w_i, w_j ∈ V_w}.

The edges connect two nodes in the graphs if there is a typed dependency between them. Figure 3 is a simplified example of a slot-based semantic knowledge graph. The structured graph helps define a coherent slot set. To model the relations between words/slots based on the knowledge graphs, we define two relation models below.



Figure 3: A simplified example of the automatically derived knowledge graph.

• Semantic Relation: For modeling word semantic relations, we compute a matrix R^S_w = [Sim(w_i, w_j)]_{|W|×|W|}, where Sim(w_i, w_j) is the cosine similarity between the dependency embeddings of the word patterns w_i and w_j after normalization. For slot semantic relations, we compute R^S_s = [Sim(s_i, s_j)]_{|S|×|S|} similarly³. The matrices R^S_w and R^S_s model not only semantic but also functional similarity, since we use dependency-based embeddings (Levy and Goldberg, 2014).

• Dependency Relation: Assuming that important semantic slots are usually mutually related to each other, that is, connected by syntactic dependencies, our automatically derived knowledge graphs are able to help model the dependency relations. For word dependency relations, we compute a matrix R^D_w = [r(w_i, w_j)]_{|W|×|W|}, where r(w_i, w_j) measures the dependency between two word patterns w_i and w_j based on the word-based lexical knowledge graph; the details are described in Section 5. For slot dependency relations, we similarly compute R^D_s = [r(s_i, s_j)]_{|S|×|S|} based on the slot-based semantic knowledge graph.

With the built word relation models (R^S_w and R^D_w) and slot relation models (R^S_s and R^D_s), we combine them as a knowledge graph propagation matrix M_R⁴:

M_R = [ R^SD_w   0
        0        R^SD_s ],    (3)

³ For each column in R^S_w and R^S_s, we only keep the top 10 highest values, which correspond to the top 10 semantically similar nodes.

⁴ The values in the diagonal of M_R are 0 to model the propagation from other entries.

where R^SD_w = R^S_w + R^D_w and R^SD_s = R^S_s + R^D_s integrate the semantic and dependency relations. The goal of this matrix is to propagate scores between nodes according to different types of relations in the knowledge graphs (Chen and Metze, 2012).
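A minimal sketch of how these relation matrices could be assembled, assuming the dependency-based embeddings and the r(·,·) weights of Section 5 are already available; the top-10 column filtering and the zero diagonal follow footnotes 3 and 4, and the block-diagonal layout follows (3). The function names and toy sizes are ours.

```python
import numpy as np

def semantic_relations(emb, top_k=10):
    """Cosine-similarity matrix with only the top_k values kept per column
    and a zero diagonal (footnotes 3 and 4)."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    R = norm @ norm.T
    np.fill_diagonal(R, 0.0)
    for j in range(R.shape[1]):                 # keep top_k entries per column
        keep = np.argsort(R[:, j])[-top_k:]
        mask = np.zeros(R.shape[0], dtype=bool)
        mask[keep] = True
        R[~mask, j] = 0.0
    return R

def propagation_matrix(RS_w, RD_w, RS_s, RD_s):
    """Knowledge graph propagation model, Eq. (3): block-diagonal in word/slot."""
    RSD_w = RS_w + RD_w
    RSD_s = RS_s + RD_s
    top = np.hstack([RSD_w, np.zeros((RSD_w.shape[0], RSD_s.shape[1]))])
    bottom = np.hstack([np.zeros((RSD_s.shape[0], RSD_w.shape[1])), RSD_s])
    return np.vstack([top, bottom])

# toy usage with random embeddings and random stand-ins for r(., .)
rng = np.random.default_rng(0)
W, S, d = 12, 6, 8
RS_w = semantic_relations(rng.normal(size=(W, d)))
RS_s = semantic_relations(rng.normal(size=(S, d)))
RD_w = rng.random((W, W)); np.fill_diagonal(RD_w, 0.0)
RD_s = rng.random((S, S)); np.fill_diagonal(RD_s, 0.0)
M_R = propagation_matrix(RS_w, RD_w, RS_s, RD_s)
print(M_R.shape)   # (|W|+|S|) x (|W|+|S|)
```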

4.3 Integrated Model

With a feature model M_F and a knowledge graph propagation model M_R, we integrate them into a single matrix:

M = M_F · (αI + βM_R)    (4)
  = [ αF_w + βF_wR_w   0
      0                αF_s + βF_sR_s ],

where M is the final matrix and I is the identity matrix. α and β are the weights for balancing the original values and the propagated values, where α + β = 1. The matrix M is similar to M_F, but some weights are enhanced through the knowledge graph propagation model M_R. The word relations are built by F_wR_w, which is the matrix with internal weight propagation on the lexical knowledge graph (the blue arrow in Figure 1(b)). Similarly, F_sR_s models the slot correlations, and can be treated as the matrix with internal weight propagation on the semantic knowledge graph (the pink arrow in Figure 1(b)). The propagation models can be treated as running a random walk algorithm on the graphs.

F_s contains all slot candidates generated by SEMAFOR, which may include some generic slots (such as capability), so the original feature model cannot differentiate the domain-specific and generic concepts. By integrating with R_s, the semantic and dependency relations can be propagated via the knowledge graph, and the domain-specific concepts may have higher weights based on the assumption that the slots for dialogue systems are often mutually related (Chen et al., 2015). Hence, the structure information can be automatically involved in the matrix. Also, the word relation model brings the same function, but now on the word level. In conclusion, for each utterance, the integrated model not only predicts the probability that semantic slots occur but also considers whether the slots are domain-specific. The following sections describe the learning process.
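Under the same notation, the integration in (4) reduces to a single matrix product; the sketch below uses a tiny hand-built M_F and M_R and the α = β = 0.5 setting reported later in the paper, with helper names of our own choosing.

```python
import numpy as np

def integrate(M_F, M_R, alpha=0.5, beta=0.5):
    """Integrated model, Eq. (4): M = M_F . (alpha*I + beta*M_R).

    Because M_R is block-diagonal, word columns of M_F are propagated only
    through word relations (F_w R_w) and slot columns only through slot
    relations (F_s R_s)."""
    n = M_R.shape[0]
    return M_F @ (alpha * np.eye(n) + beta * M_R)

# toy usage: 2 utterances, 3 word patterns + 2 slot candidates
M_F = np.array([[1., 0., 1., 1., 0.],
                [0., 1., 1., 0., 1.]])
M_R = np.zeros((5, 5))
M_R[0, 2] = M_R[2, 0] = 0.8      # a word-word relation
M_R[3, 4] = M_R[4, 3] = 0.6      # a slot-slot relation
M = integrate(M_F, M_R)
print(M)   # original weights enhanced by propagated relation scores
```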

4.4 Parameter Estimation

The proposed model is parameterized through weights and latent component vectors, where the parameters are estimated by maximizing the log likelihood of the observed data (Collins et al., 2001):

θ* = arg max_θ ∏_{u∈U} p(θ | M_u)    (5)
   = arg max_θ ∏_{u∈U} p(M_u | θ) p(θ)
   = arg max_θ ∑_{u∈U} ln p(M_u | θ) − λ_θ,

where M_u is the vector corresponding to the utterance u from M_{u,x} in (1), because we assume that each utterance is independent of the others.

To avoid treating unobserved facts as designed negative facts, we consider our positive-only data as implicit feedback. Bayesian Personalized Ranking (BPR) is an optimization criterion that learns from implicit feedback for MF, which uses a variant of ranking: giving observed true facts higher scores than unobserved (true or false) facts (Rendle et al., 2009). Riedel et al. (2013) also showed that BPR learns the implicit relations for improving the relation extraction task.

4.4.1 Objective Function

To estimate the parameters in (5), we create a dataset of ranked pairs from M in (4): for each utterance u and each observed fact f+ = 〈u, x+〉, where M_{u,x} ≥ δ, we choose each word pattern/slot x− such that f− = 〈u, x−〉, where M_{u,x} < δ, which refers to the word patterns/slots we have not observed to be in utterance u. That is, we construct the observed data O from M. Then for each pair of facts f+ and f−, we want to model p(f+) > p(f−) and hence θ_{f+} > θ_{f−} according to (1). BPR maximizes the summation over all ranked pairs, where the objective is

∑_{u∈U} ln p(M_u | θ) = ∑_{f+∈O} ∑_{f−∉O} ln σ(θ_{f+} − θ_{f−}).    (6)

The BPR objective is an approximation to the per-utterance AUC (area under the ROC curve), which directly correlates with what we want to achieve: well-ranked semantic slots per utterance.
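The sketch below evaluates the BPR objective in (6) for a single utterance given its scores θ_{u,x}; which entries count as observed is decided by the δ threshold defined above (the paper later sets δ = 0.5), and the function and variable names are ours.

```python
import numpy as np

def bpr_objective(theta_u, observed_mask):
    """Sum of ln sigma(theta_{f+} - theta_{f-}) over all (observed, unobserved)
    pairs for a single utterance, Eq. (6)."""
    pos = theta_u[observed_mask]          # scores of observed facts f+
    neg = theta_u[~observed_mask]         # scores of unobserved facts f-
    diff = pos[:, None] - neg[None, :]    # theta_{f+} - theta_{f-} for every pair
    return np.sum(np.log(1.0 / (1.0 + np.exp(-diff))))

theta_u = np.array([2.1, 0.3, -1.0, 1.5])          # scores for 4 word patterns/slots
observed = np.array([True, False, False, True])    # facts with M_{u,x} >= delta
print(bpr_objective(theta_u, observed))            # closer to 0 means better ranking
```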

4.4.2 Optimization

To maximize the objective in (6), we employ a stochastic gradient descent (SGD) algorithm (Rendle et al., 2009). For each randomly sampled observed fact 〈u, x+〉, we sample an unobserved fact 〈u, x−〉, which results in |O| fact pairs 〈f−, f+〉. For each pair, we perform an SGD update using the gradient of the corresponding objective function for matrix factorization (Gantner et al., 2011).
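A bare-bones version of this BPR-style SGD loop under a standard latent-factor parameterization (θ_{u,x} as a dot product of utterance and column vectors); the learning rate, regularization, latent dimensionality, and epoch count are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_sgd(M, n_factors=8, lr=0.05, reg=0.01, epochs=50, delta=0.5, seed=0):
    """Learn latent vectors U (utterances) and V (word patterns + slots) so that
    observed facts (M >= delta) are ranked above unobserved ones."""
    rng = np.random.default_rng(seed)
    n_utt, n_col = M.shape
    U = 0.1 * rng.normal(size=(n_utt, n_factors))
    V = 0.1 * rng.normal(size=(n_col, n_factors))
    for _ in range(epochs):
        for u in range(n_utt):
            pos_cols = np.flatnonzero(M[u] >= delta)
            neg_cols = np.flatnonzero(M[u] < delta)
            if len(pos_cols) == 0 or len(neg_cols) == 0:
                continue
            i = rng.choice(pos_cols)          # sampled observed fact f+
            j = rng.choice(neg_cols)          # sampled unobserved fact f-
            d_v = V[i] - V[j]
            x_uij = U[u] @ d_v                # theta_{f+} - theta_{f-}
            g = sigmoid(-x_uij)               # gradient weight of ln sigma(x_uij)
            u_old = U[u].copy()
            U[u] += lr * (g * d_v - reg * U[u])
            V[i] += lr * (g * u_old - reg * V[i])
            V[j] += lr * (-g * u_old - reg * V[j])
    return U, V

# toy usage on a small binary matrix (2 utterances x 5 word patterns/slots)
M = np.array([[1., 0., 1., 1., 0.],
              [0., 1., 1., 0., 1.]])
U, V = bpr_sgd(M)
print(sigmoid(U @ V.T))   # estimated p(M_{u,x} = 1) for every entry
```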


Figure 4: The dependency parsing result.

5 Knowledge Graph Construction

This section introduces the procedure of constructing the knowledge graphs in order to estimate r(w_i, w_j) for building R^D_w and r(s_i, s_j) for R^D_s in Section 4.2. Considering the relations in the knowledge graphs, the edge weights for E_ww and E_ss are measured as r(w_i, w_j) and r(s_i, s_j) based on the dependency parsing results, respectively.

The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4. The arrows denote the dependency relations from headwords to their dependents, and the words on the arcs denote the types of the dependencies. All typed dependencies between two words are encoded in triples and form a word-based dependency set T_w = {〈w_i, t, w_j〉}, where t is the typed dependency between the headword w_i and the dependent w_j. For example, Figure 4 generates 〈restaurant, AMOD, cheap〉, 〈restaurant, DOBJ, have〉, etc. for T_w. Similarly, we build a slot-based dependency set T_s = {〈s_i, t, s_j〉} by transforming dependencies between slot-fillers into ones between slots. For example, 〈restaurant, AMOD, cheap〉 from T_w is transformed into 〈locale_by_use, AMOD, expensiveness〉 for building T_s, because both words are parsed as slot-fillers by SEMAFOR.
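A small sketch of how T_w and T_s might be assembled; the two word-level triples are the ones quoted above for Figure 4, the word-to-slot mapping comes from the frames in Figure 2, and the data structures are our own illustrative choice.

```python
from collections import Counter

# word-level typed-dependency triples <head, type, dependent>; only the two
# triples listed in the text for Figure 4 are shown here
T_w = [
    ("restaurant", "AMOD", "cheap"),
    ("restaurant", "DOBJ", "have"),
]

# which words SEMAFOR tagged as slot-fillers, and for which frame (Figure 2)
word_to_slot = {"can": "capability", "cheap": "expensiveness", "restaurant": "locale_by_use"}

# slot-level triples: keep a word triple only when both ends are slot-fillers
T_s = [(word_to_slot[h], t, word_to_slot[d])
       for h, t, d in T_w if h in word_to_slot and d in word_to_slot]
print(T_s)   # [('locale_by_use', 'AMOD', 'expensiveness')]

# dependency counts C(x_i -t-> x_j), used for the edge weights in Section 5.1
C_w, C_s = Counter(T_w), Counter(T_s)
```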

5.1 Relation Weight Estimation

For the edges in the knowledge graphs, we model the relations between two connected nodes x_i and x_j as r(x_i, x_j), where x is either a slot s or a word pattern w. Since the weights are measured based on the relations between nodes regardless of the directions, we combine the scores of the two directional dependencies:

r(x_i, x_j) = r(x_i → x_j) + r(x_j → x_i),    (7)

where r(x_i → x_j) is the score estimating the dependency with x_i as a head and x_j as a dependent. We propose a scoring function for r(·) using dependency-based embeddings.


Table 1: The example contexts extracted for training dependency-based word/slot embeddings.

        Typed Dependency Relation                   Target Word      Contexts
Word    〈restaurant, AMOD, cheap〉                  restaurant       cheap/AMOD
                                                    cheap            restaurant/AMOD⁻¹
Slot    〈locale_by_use, AMOD, expensiveness〉       locale_by_use    expensiveness/AMOD
                                                    expensiveness    locale_by_use/AMOD⁻¹

5.1.1 Dependency-Based Embeddings

Most neural embeddings use linear bag-of-words contexts, where a window size is defined to produce the contexts of the target words (Mikolov et al., 2013c; Mikolov et al., 2013b; Mikolov et al., 2013a). However, some important contexts may be missing due to smaller windows, while larger windows capture broad topical content. A dependency-based embedding approach was proposed to derive contexts based on the syntactic relations the word participates in for training embeddings, where the embeddings are less topical but offer more functional similarity compared to original embeddings (Levy and Goldberg, 2014).

Table 1 shows the extracted dependency-based contexts for each target word from the example in Figure 4, where headwords and their dependents can form the contexts by following the arc on a word in the dependency tree, and −1 denotes the directionality of the dependency. After replacing the original bag-of-words contexts with dependency-based contexts, we can train dependency-based embeddings for all target words (Yih et al., 2014; Bordes et al., 2011; Bordes et al., 2013).

For training dependency-based word embeddings, each target x is associated with a vector v_x ∈ R^d and each context c is represented as a context vector v_c ∈ R^d, where d is the embedding dimensionality. We learn vector representations for both targets and contexts such that the dot product v_x · v_c associated with "good" target-context pairs belonging to the training data D is maximized, leading to the objective function:

arg max_{v_x, v_c} ∑_{(w,c)∈D} log [ 1 / (1 + exp(−v_c · v_x)) ],    (8)

which can be trained using stochastic gradient updates (Levy and Goldberg, 2014). Then we can obtain the dependency-based slot and word embeddings using T_s and T_w, respectively.
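The objective in (8) can be optimized with plain stochastic-gradient updates over (target, context) pairs, as sketched below; note that in practice this objective is trained with negative sampling and the word2vecf-style tooling of Levy and Goldberg (2014), so the bare loop, dimensionality, and learning rate here are only illustrative assumptions that show the gradient of (8).

```python
import numpy as np

def train_dependency_embeddings(pairs, dim=50, lr=0.025, epochs=100, seed=0):
    """SGD on Eq. (8): maximize log sigma(v_c . v_x) over (target, context) pairs.

    `pairs` is a list of (target, context) strings such as
    ("restaurant", "cheap/AMOD") from Table 1.  Without negative samples this
    plain form only illustrates the gradient; it is not a practical trainer."""
    rng = np.random.default_rng(seed)
    targets = sorted({t for t, _ in pairs})
    contexts = sorted({c for _, c in pairs})
    V_t = 0.1 * rng.normal(size=(len(targets), dim))
    V_c = 0.1 * rng.normal(size=(len(contexts), dim))
    t_idx = {t: i for i, t in enumerate(targets)}
    c_idx = {c: i for i, c in enumerate(contexts)}
    for _ in range(epochs):
        for t, c in pairs:
            i, j = t_idx[t], c_idx[c]
            score = V_t[i] @ V_c[j]
            g = 1.0 / (1.0 + np.exp(score))     # d log sigma(score) / d score
            V_t[i], V_c[j] = V_t[i] + lr * g * V_c[j], V_c[j] + lr * g * V_t[i]
    return {t: V_t[t_idx[t]] for t in targets}, {c: V_c[c_idx[c]] for c in contexts}

# toy usage with the word-level contexts of Table 1
pairs = [("restaurant", "cheap/AMOD"), ("cheap", "restaurant/AMOD-1")]
word_vecs, context_vecs = train_dependency_embeddings(pairs)
```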

5.1.2 Embedding-Based Scoring Function

With the trained dependency-based embeddings, we estimate the probability that x_i is the headword and x_j is its dependent via the typed dependency t as

P(x_i →_t x_j) = [ Sim(x_i, x_j/t) + Sim(x_j, x_i/t⁻¹) ] / 2,    (9)

where Sim(x_i, x_j/t) is the cosine similarity between the word/slot embedding v_{x_i} and the context embedding v_{x_j/t} after normalizing to [0, 1].

Based on the dependency set T_x, we use t*_{x_i→x_j} to denote the most possible typed dependency with x_i as a head and x_j as a dependent:

t*_{x_i→x_j} = arg max_t C(x_i →_t x_j),    (10)

where C(x_i →_t x_j) counts how many times the dependency 〈x_i, t, x_j〉 occurs in the dependency set T_x. Then the scoring function r(·) in (7) that estimates the dependency x_i → x_j is measured as

r(x_i → x_j) = C(x_i →_{t*} x_j) · P(x_i →_{t*} x_j),    (11)

where t* = t*_{x_i→x_j}. This is equal to the highest observed frequency of the dependency x_i → x_j among all types from T_x, additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting due to the smaller dataset. Figure 3 is a simplified example of an automatically derived semantic knowledge graph with the most possible typed dependencies as edges based on the estimated weights. The relation weights r(x_i, x_j) can then be obtained by (7) in order to build the R^D_w and R^D_s matrices.
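Putting (9)-(11) together, a sketch of how r(x_i, x_j) could be computed from the dependency counts and the dependency-based embeddings; the rescaling of cosine similarity to [0, 1] and the dictionary-based vector lookups are simplifying assumptions of ours.

```python
import numpy as np
from collections import Counter

def cos01(a, b):
    """Cosine similarity rescaled to [0, 1], as assumed for Sim(., .) in Eq. (9)."""
    c = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return (c + 1.0) / 2.0

def r_directed(xi, xj, counts, target_vecs, context_vecs):
    """Eqs. (10)-(11): frequency of the most frequent typed dependency xi -> xj,
    weighted by the embedding-based probability of Eq. (9)."""
    typed = {t: n for (h, t, d), n in counts.items() if h == xi and d == xj}
    if not typed:
        return 0.0
    t_star = max(typed, key=typed.get)                                      # Eq. (10)
    p = 0.5 * (cos01(target_vecs[xi], context_vecs[f"{xj}/{t_star}"]) +
               cos01(target_vecs[xj], context_vecs[f"{xi}/{t_star}-1"]))    # Eq. (9)
    return typed[t_star] * p                                                # Eq. (11)

def r(xi, xj, counts, target_vecs, context_vecs):
    """Eq. (7): combine both directions."""
    return (r_directed(xi, xj, counts, target_vecs, context_vecs) +
            r_directed(xj, xi, counts, target_vecs, context_vecs))

# toy usage with one observed dependency and random embeddings
rng = np.random.default_rng(0)
counts = Counter({("locale_by_use", "AMOD", "expensiveness"): 3})
tv = {k: rng.normal(size=8) for k in ["locale_by_use", "expensiveness"]}
cv = {k: rng.normal(size=8) for k in ["expensiveness/AMOD", "locale_by_use/AMOD-1"]}
print(r("locale_by_use", "expensiveness", counts, tv, cv))
```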

6 Experiments

6.1 Experimental Setup

In this experiment, we used the Cambridge University SLU corpus, previously used in several other SLU tasks (Henderson et al., 2012; Chen et al., 2013a). The domain of the corpus is restaurant recommendation in Cambridge; subjects were asked to interact with multiple SDSs in an in-car setting. The corpus contains a total of 2,166 dialogues, including 15,453 utterances (10,571 for self-training and 4,882 for testing). The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1,868. An ASR system was used to transcribe the speech; the word error rate was reported as 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type.

Table 2: The MAP of predicted slots (%); † indicates that the result is significantly better than the Logistic Regression (row (b)) with p < 0.05 in t-test.

                                            ASR                   Manual
  Approach                            w/o     w/ Explicit    w/o     w/ Explicit
  Explicit  SVM (a)                        32.48                  36.62
            MLR (b)                        33.96                  38.78
  Implicit  Baseline  Random (c)      3.43    22.45          2.63    25.09
                      Majority (d)    15.37   32.88          16.43   38.41
            MF        Feature (e)     24.24   37.61†         22.55   45.34†
                      Feature+KGP (f) 40.46†  43.51†         52.14†  53.40†

Figure 5: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).

For the parameter setting, the weights for balancing the feature model and the propagation model, α and β, are set to 0.5 to give them the same influence, and the threshold δ for defining the unobserved facts is set to 0.5 for all experiments. We use the Stanford Parser⁵ to obtain the collapsed typed syntactic dependencies (Socher et al., 2013) and set the dimensionality of the embeddings to d = 300 in all experiments.

6.2 Evaluation Metrics

To evaluate the accuracy of the automatically decoded slots, we measure their quality as the proximity between predicted slots and reference slots. Figure 5 shows the mappings that indicate semantically related induced slots and reference slots (Chen et al., 2013b).

⁵ http://nlp.stanford.edu/software/lex-parser.shtml

To eliminate the influence of threshold selection when predicting semantic slots, in the following metrics we take the whole ranking list into account and evaluate the performance by metrics that are independent of the selected threshold. For each utterance, with the predicted probabilities of all slot candidates, we can compute an average precision (AP) to evaluate the performance of SLU by treating the slots with mappings as positive. AP scores the ranking result higher if the correct slots are ranked higher, and it also approximates the area under the precision-recall curve (Boyd et al., 2012). Mean average precision (MAP) is the metric for evaluating all utterances. For all experiments, we perform a paired t-test on the AP scores of the results to test the significance.
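For clarity, a small sketch of the per-utterance average precision and its mean over utterances (MAP); slots with reference mappings count as positives, and the function names and toy numbers are ours.

```python
import numpy as np

def average_precision(scores, relevant):
    """AP for one utterance: precision@k averaged over the ranks k of correct slots."""
    order = np.argsort(scores)[::-1]                  # rank slot candidates by score
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(all_scores, all_relevant):
    return float(np.mean([average_precision(s, r)
                          for s, r in zip(all_scores, all_relevant)]))

# toy usage: two utterances, four slot candidates each
scores = [np.array([0.9, 0.2, 0.6, 0.1]), np.array([0.3, 0.8, 0.5, 0.4])]
relevant = [np.array([True, False, True, False]), np.array([False, True, False, False])]
print(mean_average_precision(scores, relevant))   # 1.0 for this toy example
```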

6.3 Evaluation Results

Table 2 shows the MAP performance of the predicted slots for all experiments on ASR and manual transcripts. For the first baseline using explicit semantics, we use the observed data to self-train models for predicting the probability of each semantic slot by a support vector machine (SVM) with a linear kernel and multinomial logistic regression (MLR) (rows (a)-(b)) (Pedregosa et al., 2011; Henderson et al., 2012). It is shown that SVM and MLR perform similarly, and MLR is slightly better than SVM because it has a better capability of estimating probabilities. For modeling implicit semantics, two baselines are performed as references, Random (row (c)) and Majority (row (d)), where the former assigns random probabilities to all slots, and the latter assigns probabilities to the slots based on their frequency distribution. To improve the probability estimation, we further integrate the results from implicit semantics with the better result from the explicit approaches, MLR (row (b)), by averaging the probability distributions of the two results.

Two baselines, Random and Majority, cannot model the implicit semantics, producing poor results. The results of Random integrated with MLR significantly degrade the performance of MLR for both ASR and manual transcripts. Also, the results of Majority integrated with MLR do not produce any difference compared to the MLR baseline. Among the proposed MF approaches, only using the feature model for building the matrix (row (e)) achieves 24.2% and 22.6% MAP for ASR and manual results respectively, which is worse than the two baselines using explicit semantics. However, with the combination of explicit semantics, using only the feature model significantly outperforms the baselines, where the performance improves from about 34.0% to 37.6% and from 38.8% to 45.3% for ASR and manual results respectively. Additionally integrating the knowledge graph propagation (KGP) model (row (f)) outperforms the baselines for both ASR and manual transcripts, and the performance is further improved by combining with explicit semantics (achieving MAP of 43.5% and 53.4%). The experiments show that the proposed MF models successfully learn the implicit semantics and consider the relations and domain-specificity simultaneously.

Table 3: The MAP of predicted slots using different types of relation models in M_R (%); † indicates that the result is significantly better than the feature model (column (a)) with p < 0.05 in t-test.

  Relation   (a) None (Feature)   (b) Semantic           (c) Dependency         (d) Word           (e) Slot           (f) All
  M_R        -                    [R^S_w 0; 0 R^S_s]     [R^D_w 0; 0 R^D_s]     [R^SD_w 0; 0 0]    [0 0; 0 R^SD_s]    [R^SD_w 0; 0 R^SD_s]
  ASR        37.61                41.39†                 41.63†                 39.19†             42.10†             43.51†
  Manual     45.34                51.55†                 49.04†                 45.18              49.91†             53.40†

6.4 Discussion and Analysis

With promising results obtained by the proposed models, we analyze the detailed difference between different relation models in Table 3.

6.4.1 Effectiveness of Semantic and Dependency Relation Models

To evaluate the effectiveness of the semantic and dependency relations, we consider each of them individually in M_R of (3) (columns (b) and (c) in Table 3). Compared to the original model (column (a)), both modeling semantic relations and modeling dependency relations significantly improve the performance for ASR and manual results. It is shown that semantic relations help the SLU model infer the implicit meaning, so the prediction becomes more accurate. Also, dependency relations successfully differentiate the generic concepts from the domain-specific concepts, so that the SLU model is able to predict a more coherent set of semantic slots (Chen et al., 2015). Integrating the two types of relations (column (f)) further improves the performance.

6.4.2 Comparing Word/Slot Relation Models

To analyze the performance results from inter-word and inter-slot relations, columns (d) and (e) show the results considering only word relations and only slot relations, respectively. It can be seen that the inter-slot relation model significantly improves the performance for both ASR and manual results. However, the inter-word relation model only yields slightly better results for ASR output (from 37.6% to 39.2%), and there is no difference after applying the inter-word relation model on manual transcripts. The reason may be that inter-slot relations carry high-level semantics that align well with the structure of SDSs, but inter-word relations do not. Nevertheless, combining the two relations (column (f)) outperforms both results for ASR and manual transcripts, showing that different types of relations can compensate for each other and thus benefit the SLU performance.

7 Conclusions

This paper presents an MF approach to self-train the SLU model for semantic decoding in an unsupervised way. The purpose of the proposed model is not only to predict the probability of each semantic slot but also to distinguish between generic semantic concepts and domain-specific concepts that are related to an SDS. The experiments show that the MF-based model obtains promising results, outperforming strong discriminative baselines.

Acknowledgments

We thank the anonymous reviewers for their useful comments and Prof. Manfred Stede for his mentoring. We are also grateful for MetLife's support. Any opinions, findings, and conclusions expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies.


References

Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING, pages 86–90.

Dan Bohus and Alexander I Rudnicky. 2009. The RavenClaw dialog management framework: Architecture and systems. Computer Speech & Language, 23(3):332–361.

Antoine Bordes, Jason Weston, Ronan Collobert, Yoshua Bengio, et al. 2011. Learning structured embeddings of knowledge bases. In Proceedings of AAAI.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of Advances in Neural Information Processing Systems, pages 2787–2795.

Kendrick Boyd, Vitor Santos Costa, Jesse Davis, and C David Page. 2012. Unachievable region in precision-recall space and its effect on empirical evaluation. In Machine Learning: Proceedings of the International Conference on Machine Learning, volume 2012, page 349. NIH Public Access.

Asli Celikyilmaz, Dilek Hakkani-Tur, and Gokhan Tur. 2011. Leveraging web query logs to learn user intent via Bayesian discrete latent variable model. In Proceedings of ICML.

Yun-Nung Chen and Florian Metze. 2012. Two-layer mutually reinforced random walk for improved multi-party meeting summarization. In Proceedings of The 4th IEEE Workshop on Spoken Language Technology, pages 461–466.

Yun-Nung Chen and Alexander I. Rudnicky. 2014. Dynamically supporting unexplored domains in conversational interactions by enriching semantics with neural word embeddings. In Proceedings of 2014 IEEE Spoken Language Technology Workshop (SLT), pages 590–595. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013a. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Proceedings of ICASSP, pages 8317–8321.

Yun-Nung Chen, William Yang Wang, and Alexander I Rudnicky. 2013b. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 120–125. IEEE.

Yun-Nung Chen, Dilek Hakkani-Tur, and Gokhan Tur. 2014a. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In Proceedings of 2014 IEEE Spoken Language Technology Workshop (SLT), pages 242–247. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2014b. Leveraging frame semantics and distributional semantics for unsupervised semantic slot induction in spoken dialogue systems. In Proceedings of 2014 IEEE Spoken Language Technology Workshop (SLT), pages 584–589. IEEE.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2015. Jointly modeling inter-slot relations by random walk on knowledge graphs for unsupervised spoken language understanding. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. ACL.

Michael Collins, Sanjoy Dasgupta, and Robert E Schapire. 2001. A generalization of principal components analysis to the exponential family. In Proceedings of Advances in Neural Information Processing Systems, pages 617–624.

Dipanjan Das, Nathan Schneider, Desai Chen, and Noah A Smith. 2010. Probabilistic frame-semantic parsing. In Proceedings of The Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 948–956.

Dipanjan Das, Desai Chen, Andre F. T. Martins, Nathan Schneider, and Noah A. Smith. 2013. Frame-semantic parsing. Computational Linguistics.

Marco Dinarelli, Silvia Quarteroni, Sara Tonelli, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Annotating spoken dialogs: from speech segments to dialog acts and frame semantics. In Proceedings of the 2nd Workshop on Semantic Representation of Spoken Language, pages 34–41. ACL.

John Dowding, Jean Mark Gawron, Doug Appelt, John Bear, Lynn Cherny, Robert Moore, and Douglas Moran. 1993. Gemini: A natural language system for spoken-language understanding. In Proceedings of ACL, pages 54–61.

Ali El-Kahky, Derek Liu, Ruhi Sarikaya, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2014. Extending domain coverage of language understanding systems via intent transfer between domains using knowledge graphs and search query click logs. In Proceedings of ICASSP.

Zeno Gantner, Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2011. MyMediaLite: A free recommender system library. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 305–308. ACM.

Narendra Gupta, Gokhan Tur, Dilek Hakkani-Tur, Srinivas Bangalore, Giuseppe Riccardi, and Mazin Gilbert. 2006. The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):213–222.

Dilek Hakkani-Tur, Larry Heck, and Gokhan Tur. 2013. Using a knowledge graph and query click logs for unsupervised learning of relation detection. In Proceedings of ICASSP, pages 8327–8331.

Larry Heck and Dilek Hakkani-Tur. 2012. Exploiting the semantic web for unsupervised spoken language understanding. In Proceedings of SLT, pages 228–233.

Larry P Heck, Dilek Hakkani-Tur, and Gokhan Tur. 2013. Leveraging knowledge graphs for web-scale unsupervised semantic parsing. In Proceedings of INTERSPEECH, pages 1594–1598.

Matthew Henderson, Milica Gasic, Blaise Thomson, Pirros Tsiakoulis, Kai Yu, and Steve Young. 2012. Discriminative spoken language understanding using word confusion networks. In Proceedings of SLT, pages 176–181.

Frederick Jelinek. 1997. Statistical methods for speech recognition. MIT Press.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, (8):30–37.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751. Citeseer.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Roberto Pieraccini, Evelyne Tzoukermann, Zakhar Gorelov, J Gauvain, Esther Levin, Chin-Hui Lee, and Jay G Wilpon. 1992. A speech understanding system based on statistical representation of semantics. In Proceedings of 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 193–196. IEEE.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT, pages 74–84.

Stephanie Seneff. 1992. TINA: A natural language system for spoken language applications. Computational Linguistics, 18(1):61–86.

Anders Skrondal and Sophia Rabe-Hesketh. 2004. Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL Conference. Citeseer.

Gokhan Tur, Dilek Z Hakkani-Tur, Dustin Hillard, and Asli Celikyilmaz. 2011. Towards unsupervised spoken language understanding: Exploiting query click logs for slot filling. In Proceedings of INTERSPEECH, pages 1293–1296.

Gokhan Tur, Minwoo Jeong, Ye-Yi Wang, Dilek Hakkani-Tur, and Larry P Heck. 2012. Exploiting the semantic web for unsupervised natural language semantic parsing. In Proceedings of INTERSPEECH.

Gokhan Tur, Asli Celikyilmaz, and Dilek Hakkani-Tur. 2013. Latent semantic modeling for slot filling in conversational understanding. In Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8307–8311. IEEE.

William Yang Wang, Dan Bohus, Ece Kamar, and Eric Horvitz. 2012. Crowdsourcing the acquisition of natural language corpora: Methods and observations. In Proceedings of SLT, pages 73–78.

Lu Wang, Dilek Hakkani-Tur, and Larry Heck. 2014. Leveraging semantic web search and browse sessions for multi-turn spoken dialog systems. In Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4082–4086. IEEE.

Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In Proceedings of ACL.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Ke Zhai and Jason D Williams. 2014. Discovering latent structure in task-oriented dialogues. In Proceedings of the Association for Computational Linguistics.