Post on 15-Jan-2016
Topic Models for Morphologically Rich Languages
Michael Elhadad, Meni Adler, Yoav Goldberg, Rafi Cohen
23 Jan 2011, Haifa
Machine Translation and Morphologically-rich Languages
Topic Models
- Unsupervised discovery of topics in a text collection
- Useful to browse/explore large corpora by theme
- Topic evolution over time
- Author-topic models
- Difficult to evaluate / task-based evaluations help: WSD, summarization, IR, sentiment analysis
- Multilingual LDA could help as a feature for MT
Topic Models and Rich Morphology
- Topic models from text in Hebrew
- Rich morphology: high number of distinct word forms, high ambiguity
Halakhic Domain (Jewish Religious Law)
- Mixture of languages (Hebrew / Aramaic)
- Various historical / geographical subdomains
- Existing metadata / can we exploit it?
Medical Domain
- Patient letters / eHealth QA site
- High level of English/Hebrew mixture (transliterations)
- Existing metadata (UMLS) / can we exploit it?
Work in progress
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Combining semantic priors and LDA
- Multilingual topic models
- Evaluating topic models
Objectives
- Input: a domain-specific text corpus in Hebrew; metadata on documents (tags, alignment to English tags)
- Output: a topic model: discover the topics discussed in the corpus; recognize topics in unseen text; index the text collection by topic
- Task: something where topics help: WSD, IR, text categorization, clustering; some part of MT?
Term Ambiguity and What is a Topic?
The Hebrew word for "ox/bull" refers to many complex halakhic topics:
- Damages (the goring ox)
- Kosher meat (slaughter)
- Sacrifices
- Shabbat (domestic animals must rest)
- Calendar (the zodiac sign Taurus)
What are these topics?
Terms are disambiguated in context (e.g., ox + Shabbat)
- Associate a word with a topic
- Associate a document with topics
Discovering Topic Models: LDA
- Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003)
- Discovers (unsupervised) topic structure in a document collection
- Topics are modeled as distributions over words
- Probabilistic generative model of text
What can be done with an LDA Topic Model?
(D. Blei and J. Lafferty. "Topic Models." In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications, 2009.)
Structure of an LDA Model (from Blei 2008)
The LDA Model
- Observations: documents are composed of words
- Latent variable: each document expresses a few topics
- Generative probabilistic model: each document is a mixture of topics; each word is drawn from the topics active in the document
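The generative story above can be sketched in a few lines of code (a toy sketch, not the presenters' implementation; the corpus sizes, symmetric Dirichlet priors, and all hyperparameter values are made up for illustration):

```python
import numpy as np

def generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=10,
                    alpha=0.1, beta=0.01, seed=0):
    """Toy LDA generative process: each document mixes topics,
    and each word is drawn from one of the document's topics."""
    rng = np.random.default_rng(seed)
    # Topics: K distributions over the vocabulary (rows of phi)
    phi = rng.dirichlet([beta] * vocab_size, size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # Each document gets its own mixture over topics
        theta = rng.dirichlet([alpha] * n_topics)
        doc = []
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)     # pick an active topic
            w = rng.choice(vocab_size, p=phi[z])  # pick a word from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus, phi

corpus, phi = generate_corpus()
```

Learning inverts this process: given only the words, infer the per-document mixtures and per-topic word distributions.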
LDA Graphical Model (Blei 2008)
LDA Generative Process (Blei 2008)
Vocabulary (w1, w2, ..., wV); topics (t1, ..., tK)
LDA Estimation (Blei 2008)
The topic-word parameters form a K x V matrix.
LDA Approximation (Blei 2008)
Exact inference is intractable; Gibbs sampling is generally used to estimate the posterior.
Gibbs Sampling
Represent the corpus as:
- An array of words w[i] (fixed)
- Document indices d[i] (fixed)
- Topic assignments z[i] (change during sampling)
A Markov chain whose states are topic assignments to words:
- Macro-step: assign a new topic to all the words
- Micro-step: assign a new topic to a single word w[i]
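The macro/micro steps above can be sketched as one pass of collapsed Gibbs sampling (a minimal sketch, assuming symmetric hyperparameters alpha and beta; not the presenters' code):

```python
import numpy as np

def gibbs_pass(w, d, z, n_topics, vocab_size, alpha=0.1, beta=0.01, rng=None):
    """One macro-step of collapsed Gibbs sampling for LDA:
    reassign a topic to every word token in turn."""
    rng = rng or np.random.default_rng()
    n_docs = max(d) + 1
    # Count tables: doc-topic and topic-word co-occurrences
    ndk = np.zeros((n_docs, n_topics))
    nkw = np.zeros((n_topics, vocab_size))
    nk = np.zeros(n_topics)
    for wi, di, zi in zip(w, d, z):
        ndk[di, zi] += 1
        nkw[zi, wi] += 1
        nk[zi] += 1
    for i in range(len(w)):  # micro-steps
        wi, di, zi = w[i], d[i], z[i]
        # Remove the current assignment from the counts
        ndk[di, zi] -= 1; nkw[zi, wi] -= 1; nk[zi] -= 1
        # Conditional probability of each topic given all other assignments
        p = (ndk[di] + alpha) * (nkw[:, wi] + beta) / (nk + vocab_size * beta)
        z[i] = rng.choice(n_topics, p=p / p.sum())
        ndk[di, z[i]] += 1; nkw[z[i], wi] += 1; nk[z[i]] += 1
    return z
```

Repeating `gibbs_pass` many times lets the chain mix; the counts at convergence estimate the document-topic and topic-word distributions.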
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Combining semantic priors and LDA
- Multilingual topic models
- Evaluating topic models
LDA in Hebrew
- Explore various datasets in Hebrew
- How well does LDA work on Hebrew?
Domain: Halakhic Sources
Various historical / geographical backgrounds
The Mishna
- Mishna (Tannaim)
- Exhaustive code of Jewish law
- Written by R. Yehuda Hanasi (ca. 220 CE)
- 6 orders, 63 tractates, 524 chapters, 6K paragraphs, 350K words
- Hierarchical thematic organization by topic
Rambam's Mishne Torah
- Corpus of the Mishne Torah (Rishonim)
- Exhaustive code of Halakha
- Written by Maimonides, 1170-1180
- 14 books, 85 sections, 1,000 chapters, 15K articles, 600K words
Responsa Corpus
- We manually constructed a reference corpus for testing purposes
- Team of 5 Jewish law experts; metadata associated with each QA document
Documents
- 8,000 responsa from 35 distinct books of various origins (geographical, historical)
- 3.6M words (avg. 450 tokens per document)
- On average 4.5 tags per document (from the ontology)
Ontology of Halakha
- ~2,000 concepts
- ~5,000 relations among concepts, of 14 distinct types
Metadata
- Per book: author, location, publication date
- Per document: topics from the index; references to "sources" (Bavli, Yerushalmi, Mishna, Tanakh, Shulhan 'Arukh) (in progress); references to other responsa (in progress)
Halakhic Corpus Specificity
- Language mixture (Hebrew + Aramaic)
- Semitic languages: rich morphology
- Many acronyms / abbreviations
- Wide variety of domains / historical backgrounds
- Various genres: codes (hierarchical, synthetic); commentaries (segmented, linear); responsa (implicitly hypertextual, complex citations)
- Layers of corpus (derivation, authority): Mishna → Gmara → Mishne Torah → Responsa
Medical Corpus
- Infomed.co.il: popular QA health site; 2M words / 4K documents; annotated by site categories; 6,000 concepts / 3,000 mapped to UMLS
- Hospital patient release letters: neurology department; 150K words / 1K documents; manual UMLS concept annotation (in progress)
Medical Corpus Specificity
- Many unknown words (~20% of token types)
- Many transliterations (see Rafi's talk)
- Many named entities
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Combining semantic priors and LDA
- Multilingual topic models
- Evaluating topic models
Hebrew Morphological Analysis
One surface form, many readings (English glosses): "name of an association", "while taking a picture", "their onion", "under their shades", "in a photographer", "in the photographer", "in an idol", "in the idol"
LDA and Morphology
Morphological Analysis
Possible analyses of the same surface form:
- proper noun
- verb, infinitive
- noun, singular, masculine
- noun, singular, masculine, absolute
- noun, singular, masculine, construct
- noun, definite, singular, masculine
Many Morphological Variants
One word can have about 50 distinct forms in the corpus (12 forms on average).
Combining LDA and Morphology
- LDA picks up patterns of word co-occurrence in documents.
- Heavy morphological variation in Hebrew means we may miss co-occurrences if we do not first analyze morphology.
What is the best method to combine LDA with morphological analysis?
Combining LDA and Morphology
Options:
- Ignore morphology: token-based LDA
- "English-style" LDA: stemming, filter by POS (nouns)
- Pipeline: resolve morphological ambiguities first, then learn LDA (problem: the lemma itself is ambiguous)
- Joint: learn LDA on distributions of lemmas conditioned on the morphological analysis
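The pipeline option can be sketched as a simple preprocessing step (a minimal sketch; `lemma_of` is a hypothetical stand-in for a real morphological disambiguator, and the toy lexicon below is invented):

```python
# Pipeline sketch: disambiguate each token to a single lemma first,
# then feed the lemma sequences to a standard token-based LDA.
def lemmatize_corpus(docs, lemma_of):
    """Map each surface token to its (already disambiguated) lemma;
    unknown tokens are kept as-is."""
    return [[lemma_of.get(tok, tok) for tok in doc] for doc in docs]

# Toy stand-in lexicon: several inflected forms collapse to one lemma,
# so LDA sees their co-occurrences as a single word type.
lemma_of = {"oxen": "ox", "ox's": "ox", "slaughtered": "slaughter"}
docs = [["oxen", "slaughtered"], ["ox's", "horn"]]
lemmas = lemmatize_corpus(docs, lemma_of)
# lemmas == [["ox", "slaughter"], ["ox", "horn"]]
```

The joint option instead keeps the ambiguity and lets the topic model and the morphological analysis constrain each other during learning.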
Joint LDA-Morphology Learning
Standard token-based LDA (baseline)
Joint LDA-Morphology Learning
Joint morphology-LDA: topic sampling constrained by the tagger's decision
Joint LDA-Morphology Works
- Token-based LDA in Hebrew gives no useful topics: no semantic coherence (fewer than 1/3 of topics), no alignment with semantic annotations
- Joint LDA-morphology works: semantic coherence (more on evaluation below)
Semantic Coherence Evaluation
- Ask experts whether they recognize a topic as coherent, and to label it
- Rambam, 128 topics: 108 coherent topics with a short label; 20 unrecognized [2 taggers / high agreement]
- Medical data, 128 topics: 115 coherent topics
- Mishna, 128 topics: 60 coherent topics
Morphology Variants
- Variant models on the Mishna dataset: LDA on nouns only; LDA on nouns and compound nouns (smixut)
- Semantic coherence only for the compound model: 80 coherent topics out of 128; unstable: 75 coherent out of 150
- Marked compounds: 45 compounds appear as top terms in topics (out of 6,500 distinct compounds); all recognized as key concepts by domain experts
- More evaluation needed on term extraction
- Why such a difference with Rambam?
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Evaluating topic models
- Combining semantic priors and LDA
- Multilingual topic models
How Good are Discovered Topics?
- Difficult to evaluate LDA topics: many parameters; each run gives slightly different results; how to compare topic models?
- Methods: data-oriented evaluation; semantic coherence; ontology alignment evaluation; task-based evaluation
Topic Evaluation Methods 1
- Data-oriented: measure the fit between the dataset and the generative model seen as a language model (perplexity); seems to miss what is good about topics
- Semantic coherence: subjective judgment: are individual topics meaningful? can they be labeled? are topic/document assignments meaningful?
- "Find the intruder" tests: rank best word / worst word, find the intruder word
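The "find the intruder" test above can be sketched as follows (a toy construction; the word lists are invented, and real studies draw the intruder from high-probability words of another topic):

```python
import random

def make_intruder_item(topic_top_words, other_topic_top_words, rng=None):
    """Build one word-intrusion test item: a topic's top words plus one
    word from a different topic; annotators must spot the intruder."""
    rng = rng or random.Random(0)
    intruder = rng.choice([w for w in other_topic_top_words
                           if w not in topic_top_words])
    item = topic_top_words[:5] + [intruder]
    rng.shuffle(item)  # hide the intruder's position
    return item, intruder

item, intruder = make_intruder_item(
    ["ox", "damage", "horn", "owner", "pay"],
    ["sacrifice", "altar", "priest", "blood", "offering"])
```

If annotators reliably pick out the intruder, the topic's top words are coherent; chance-level detection signals an incoherent topic.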
Evaluating the Topic Model
The Hebrew word for "ox/bull" refers to many complex halakhic topics:
- Damages (the goring ox)
- Kosher meat (slaughter)
- Sacrifices
- Shabbat (domestic animals must rest)
- Calendar (the zodiac sign Taurus)
Topics for "Ox" on the Rambam Corpus
Expert labels for the discovered topics: Damages, Sacrifices, Calendar, Meat; several topics repeat (Sacrifices, Meat, Damages again); some remain unclear (Shabbat + lighting candles? wine + sacrifices? others unlabeled).
Topics for a Document
Rambam, Book of Damages, "Damages by Property", Chapter 12: the top topics assigned are Damages and Damages, plus one unclear topic (units?).
Topic Evaluation Methods 2
- Alignment of topic model with an ontology: does the topic model reproduce the existing metadata classification?
- Task-based evaluation: do topics facilitate search or navigation? For IR, relevance models with semantic smoothing. Do multilingual topics capture word alignments?
Semantic Coherence
- Subjective evaluation: is the topic meaningful / can it be labeled?
- Highly positive on Rambam and Medical
- Low on Mishna until restricted to compound+noun / morphologically marked forms
Can topic semantic coherence be predicted? (Newman et al. 2010) use a PMI measure.
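The PMI idea can be sketched as average pairwise PMI over a topic's top words (a simplification: Newman et al. 2010 count co-occurrence in sliding windows over an external corpus such as Wikipedia; here co-occurrence is counted at the document level, and the smoothing constant is invented):

```python
import math
from itertools import combinations

def pmi_coherence(top_words, doc_sets, eps=1e-12):
    """Average pairwise PMI of a topic's top words, with co-occurrence
    counted over documents (each document is a set of word types)."""
    n_docs = len(doc_sets)
    def p(*words):
        # Fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs
    scores = []
    for w1, w2 in combinations(top_words, 2):
        scores.append(math.log((p(w1, w2) + eps) / (p(w1) * p(w2) + eps)))
    return sum(scores) / len(scores)
```

Topics whose top words co-occur more than chance score high; topics mixing unrelated words score low, matching human coherence judgments reasonably well.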
Ontology Alignment
- Rambam's Mishne Torah has an existing structure: a hierarchy of book/section/chapter
- We find good topic/book alignment
- Some topics are cross-cutting concerns (e.g., witnesses)
Topic → Documents: fits the Rambam's classification
Alignment Topic / Books
Document → book on the Rambam's topic model; document = (book[1-14] / section[1-85] / document)
Books × Topics
- 5 general topics / 20 focus on 2 books / 30 "skinny" / 65 focus on 1 book
- 1 book covers many topics / 2 books very few
- Books: ZRAIM, MADA, ZMANIM, NZIKIN, AVODA, KINYAN, TAHARA, KORBANOT, AHAVA, MISHPATIM, SHOFTIM, NASHIM, HAFLAA, KDUSHA
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Evaluating topic models
- Combining semantic priors and LDA
- Multilingual topic models
Semantics and LDA
- LDA is fully unsupervised
- Can we learn better models with underlying semantic knowledge?
- Active field of research
- Excellent survey: "Incorporating domain knowledge in latent topic models" (Andrzejewski 2010)
Semantics and LDA: 3 Types of Approaches
- LDA+X: model additional observed data (document + tag): Supervised LDA, Author-Topic, Topic-Link LDA
- Word-topic constraints: prior constraints on word-topic associations: syntax (Syntactic Topic Model, HMM-LDA); Concept-Topic Model (semantic fields), LDAWN, Dirichlet Forest, Topic-in-Set
- Document-topic constraints: prior constraints on document-topic associations and among topics: topic relations (hLDA, Correlated Topic Models, PAM); document-topic (Dirichlet Multinomial Regression, Labeled LDA, Logic LDA); topics over time (DTM, TOT)
The three types of approaches, diagrammatically: word-topic conditions, topic-topic conditions, observed document tags.
Which Method for our Domain?
- Document tags are available: Labeled LDA and DMR
- Hierarchical topic models (PAM)
- Hyperlinks exist but are difficult to extract: Link-LDA
We are currently experimenting with Labeled LDA on our datasets.
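The key restriction in Labeled LDA, that a document may only draw topics from its observed tag set, can be sketched as a masked sampling step (a minimal illustration of the constraint, not the model we actually run; `sample_topic_labeled` is a hypothetical helper):

```python
import numpy as np

def sample_topic_labeled(p_topic, doc_labels, rng=None):
    """Labeled-LDA style constraint: zero out topics not in the
    document's observed tag set before sampling a topic."""
    rng = rng or np.random.default_rng(0)
    masked = np.zeros_like(p_topic)
    masked[list(doc_labels)] = p_topic[list(doc_labels)]
    masked /= masked.sum()  # renormalize over the allowed topics
    return rng.choice(len(p_topic), p=masked)

# Document tagged only with topics {0, 2}: topic 1 can never be sampled.
z = sample_topic_labeled(np.array([0.2, 0.5, 0.3]), {0, 2})
```

Because the mask is applied inside every sampling step, the learned topics line up one-to-one with the metadata tags, which is exactly what our tagged responsa and medical documents provide.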
Outline
- Topic analysis with LDA; domain: Halakhic sources / medical dataset
- Combining LDA and morphological analysis
- Evaluating topic models
- Combining semantic priors and LDA
- Multilingual topic models
Multilingual Topic Models
- Assume a bilingual document set (di, li)
- Can we capture patterns of word co-occurrence across languages?
MUTO (Boyd-Graber & Blei 2009)
Combines two aspects in one generative model:
- Align words across languages
- Group words into topics
MUTO Generative Process
- Choose a matching m (m_st = weight of the pair (w_s, w_t))
- Choose multinomial term distributions: background distributions for the words not in m, for each language l in {S, T}
- Choose topics T_i ~ Dir(lambda) for i in 1..K, over the pairs in m
- For each document d in 1..D with language l_d:
  - Choose topic proportions theta_d ~ Dir(alpha)
  - For each position n in 1..M_d:
    - Choose a topic assignment z_n ~ Mult(theta_d)
    - Choose c_n from {matched, unmatched} uniformly
    - If c_n = matched: choose a pair ~ Mult(T_{z_n}(m)) and project it onto l_d
    - If c_n = unmatched: choose w_n ~ Mult(background_{l_d})
Learned bilingual topic (English/German)
Example learned word pairs: time:schatten, world:kontakt, history:roemisch, number:nummer, math:with, term:zero, axiom:axiom, system:system, theory:theorie
- An edit-distance prior and a bilingual dictionary help
- Does much better on aligned corpora
Could topic models over documents help MT with document-level features?
Conclusions
- Morphological analysis is critical to start exploring topic models in MRLs
- Topic models are hard to evaluate
- Semi-supervised topic models improve topic quality
- Multilingual topics can be learned
- Could help provide document-level direction in MT