Text Mining, Word Embeddings, & Wikipedia

    Muhammad Atif Qureshi

  12/01/17

    Contents

    • Introduction
    • Text Mining
    • Similar words
    • Word ambiguity
    • Word Embedding
    • Related Research
    • Toy Example
    • Wikipedia Structure
    • Phrase Chunking
    • Case studies


    Problem Motivation

    Human beings find great comfort in expressing their viewpoints in writing because, unlike oral communication, writing preserves thoughts over a long period of time.

    Text is a very popular means of communication on the World Wide Web, appearing on online news websites, social networks, emails, governmental websites, etc.

    Text may contain the following complexities:

    • Lack of contextual and background information
    • Ambiguity due to more than one possible interpretation of the meaning of the text
    • Focus and assertions on multiple topics


    Text Mining

    Motivation: With so much textual data around us, especially on the World Wide Web, there is a motivation to understand the meaning of the data.

    Definition: It is the process by which textual data is analyzed in order to derive high-quality information on the basis of patterns.


    Similar Words

    Can similar words be grouped together as one?

    Simple techniques

    • Lemmatization (maps inflected forms, e.g. plurals, to their dictionary form; accurate but low coverage)
    • Stemming (strips a word down to a root form; less accurate but high coverage)

    Complex technique

    • "A word is known by the company it keeps": word embeddings
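
    The trade-off between the two simple techniques can be illustrated with a minimal sketch. Both the tiny lemma dictionary and the suffix list below are made up for illustration; real lemmatizers and stemmers are far more complete.

```python
# Toy lemma dictionary (hypothetical, tiny on purpose): exact but incomplete.
LEMMAS = {"mice": "mouse", "geese": "goose", "studies": "study", "cars": "car"}


def lemmatize(word):
    """Dictionary lookup: correct when the word is listed, unchanged otherwise."""
    return LEMMAS.get(word, word)


def stem(word):
    """Crude suffix stripping: applies to any word, but may cut too much or too little."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["studies", "mice", "houses", "running"]:
    print(w, "->", "lemma:", lemmatize(w), "| stem:", stem(w))
```

    The dictionary gets "studies" and "mice" right but leaves the unlisted "houses" untouched (low coverage), while the stemmer changes almost every word but produces non-words such as "stud" and "hous" (low accuracy).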



    Word Ambiguity

    Is Apple a company or a fruit?

    • Apple tastes better than blackberry
    • Apple phones are better than blackberry

    Context is important

    • Tastes -> Fruit
    • Phones -> Apple Inc.
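
    A minimal sketch of how context words could select a sense. The cue-word sets are hypothetical; a real system would learn such associations from data rather than hard-code them.

```python
# Hypothetical cue words per sense, for illustration only.
SENSE_CUES = {
    "fruit": {"tastes", "sweet", "eat", "juice"},
    "Apple Inc.": {"phones", "laptop", "software", "company"},
}


def disambiguate(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    tokens = set(sentence.lower().split())
    scores = {sense: len(tokens & words) for sense, words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("Apple tastes better than blackberry"))
print(disambiguate("Apple phones are better than blackberry"))
```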


    Word Embedding

    Definition: It is a technique in NLP that quantifies a concept (word or phrase) as a vector of real numbers.

    Simple application scenario: How similar are two words?

    Similarity(vector(good), vector(best))
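
    One common choice for the similarity function is cosine similarity. The sketch below assumes that choice, and the three-dimensional vectors are made up for illustration; trained embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
vectors = {
    "good": [0.8, 0.6, 0.1],
    "best": [0.9, 0.5, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["good"], vectors["best"]))  # close to 1
print(cosine_similarity(vectors["good"], vectors["car"]))   # much lower
```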


    Related Research

    Word embeddings

    • Word2Vec: a predictive model that uses a two-layer neural network
    • FastText: an extension of word2vec by Facebook
    • GloVe: a count-based model that performs dimensionality reduction on the co-occurrence matrix

    Wikipedia-based relatedness

    • Semantic Relatedness Framework: uses the Wikipedia sub-category hierarchy to measure relatedness


    Toy Example

    Word embeddings

    • Train co-occurrence matrix
    • Apply cosine similarity
    • Find vectors

    Further concepts

    • Dimensionality reduction
    • Window size
    • Filter words
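
    The toy example can be sketched in a few lines, assuming a four-sentence made-up corpus, a window of two words on each side, and a small stopword list as the word filter. Dimensionality reduction is left out to keep the sketch short.

```python
import math
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "on", "and"}  # toy filter-word list


def cooccurrence_vectors(sentences, window=2, stopwords=STOPWORDS):
    """For each word, count the words that appear within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in stopwords]
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))   # share the contexts "drinks" and "chases"
print(cosine(vecs["cat"], vecs["milk"]))  # share only "drinks"
```

    "cat" and "dog" never co-occur in this corpus, yet they come out as the more similar pair because they keep the same company.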


    Word Analogies

    Man is to Woman, King is to ____ ?

    London is to England, Islamabad is to ____ ?

    Using vectors, we can say

    King - Man + Woman ≈ Queen

    Islamabad - London + England ≈ Pakistan
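
    A minimal sketch of the vector arithmetic behind such analogies. The vectors are hand-made with three interpretable dimensions (royalty, male, female), which is an assumption for illustration; trained embeddings do not have labelled dimensions.

```python
import math

# Hand-made toy vectors with dimensions [royalty, male, female].
VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.7, 1.0, 0.0],
    "princess": [0.7, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def analogy(a, b, c, vectors=VECTORS):
    """Solve 'a is to b as c is to ?' using vector(c) - vector(a) + vector(b)."""
    target = [vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, then pick the nearest remaining word.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("man", "woman", "king"))
```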


    Why Wikipedia for Text Mining?

    • One of the largest encyclopedias
    • Free to use
    • Collaboratively and actively updated


    Wikipedia

    Each article has a title that identifies a concept.

    Each article contains content that defines a particular concept textually.

    Each article belongs to one or more categories.

    E.g., the article Espresso belongs to the categories Coffee drinks, Italian cuisine, etc.

    Each Wikipedia category generally has parent and children categories.

    E.g., Italian cuisine has parent categories Italian culture, Cuisine by nationality, etc.

    E.g., Italian cuisine has children categories Italian desserts, Pizza, etc.
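
    One simple way to hold this structure in memory is a pair of dictionaries. The sketch below is an assumed representation that contains only the examples named on this slide, not real Wikipedia data.

```python
# Article -> categories it belongs to (only the slide's examples).
ARTICLE_CATEGORIES = {
    "Espresso": ["Coffee drinks", "Italian cuisine"],
}

# Category -> parent categories (only the slide's examples).
CATEGORY_PARENTS = {
    "Italian cuisine": ["Italian culture", "Cuisine by nationality"],
    "Italian desserts": ["Italian cuisine"],
    "Pizza": ["Italian cuisine"],
}


def children_of(category, parents=CATEGORY_PARENTS):
    """Invert the parent map to list the direct sub-categories of a category."""
    return sorted(child for child, ps in parents.items() if category in ps)


def categories_with_parents(article):
    """Categories of an article together with their direct parent categories."""
    result = set()
    for cat in ARTICLE_CATEGORIES.get(article, []):
        result.add(cat)
        result.update(CATEGORY_PARENTS.get(cat, []))
    return result


print(children_of("Italian cuisine"))
print(sorted(categories_with_parents("Espresso")))
```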







    Wikipedia Graph Structure

    [Figure: Wikipedia category graph structure along with Wikipedia articles. Legend: category, article, category edge, article belonging to category, article link.]


    Example of Wikipedia Category Structure










    [Figure: truncated Wikipedia category graph]


    Phrase Chunking using Wikipedia

    Input: I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.

    Conversion into lowercase: i prefer samsung s5 over htc, apple, nokia because it is economical and good.

    Phrase chunking using phrase boundaries: i prefer samsung s5 over htc apple nokia because it is economical and good

    Longest phrase that matches a Wikipedia article title or redirect (and is not a stopword)

    Extracted phrases: prefer, samsung s5, htc, apple, nokia, economical, good

    Removed stopwords: i, over, because, it, is, and
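
    The chunking steps on this slide can be sketched as a greedy longest-match search. The title set and stopword list are hypothetical stand-ins for a lookup against Wikipedia article titles and redirects, and treating punctuation as the phrase boundary is an assumption.

```python
import re

# Hypothetical stand-ins for Wikipedia titles/redirects and a stopword list.
WIKI_TITLES = {"prefer", "samsung", "samsung s5", "htc", "apple", "nokia",
               "economical", "good"}
STOPWORDS = {"i", "over", "because", "it", "is", "and"}


def chunk_phrases(text, titles=WIKI_TITLES, stopwords=STOPWORDS, max_len=4):
    """Lowercase, split at punctuation, then greedily take the longest title match."""
    phrases = []
    for segment in re.split(r"[,.;:!?]+", text.lower()):
        tokens = segment.split()
        i = 0
        while i < len(tokens):
            # Try the longest window first, shrinking down to a single token.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in titles and candidate not in stopwords:
                    phrases.append(candidate)
                    i += n
                    break
            else:
                i += 1  # no title starts here (e.g. a stopword): skip the token
    return phrases


print(chunk_phrases(
    "I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."
))
```

    Because longer windows are tried first, "samsung s5" is extracted as one phrase even though "samsung" alone is also a title.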


    Word Embedding using Wikipedia

    We can find more complex relationships due to:

    • Article-Category graph structure
    • Multilingual relations
    • Infobox (birth, age, etc.)


    Wikipedia Based Semantic Relatedness Framework

    [Figure: framework diagram with the labels: Wikipedia documents; Wikipedia article title or redirect; stream of candidate phrases; Wikipedia category-article structure; online reputation management tasks; perspective-aware search engine.]


    Perspective Aware Approach to Search

    Problem: The result set from a search engine (Google, Bing, Yahoo) for any user's query may have an inherent perspective given issues with the search engine or issues with the underlying collection.

    PAS (Perspective-Aware Search) is a system that allows users to specify a perspective together with their query at query time.

    The system allows the users to quickly surmise the presence of the perspective in the returned set.


    Perspective Aware Approach to Search

    A perspective is modelled using the Wikipedia article-category graph structure.

    Example perspective: activism. The articles that define activism are fetched from Wikipedia by looking into the category graph structure.
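
    One plausible way to collect the articles for a perspective is a depth-limited breadth-first walk down the category graph. The sketch below assumes that approach; the graph contents are illustrative and were not taken from a Wikipedia dump, and the names `SUBCATEGORIES`, `CATEGORY_ARTICLES`, and `perspective_articles` are hypothetical.

```python
from collections import deque

# Illustrative toy graph (hypothetical contents, not real Wikipedia data).
SUBCATEGORIES = {
    "Activism": ["Environmentalism", "Civil rights"],
    "Environmentalism": ["Climate activism"],
}
CATEGORY_ARTICLES = {
    "Activism": ["Activism"],
    "Environmentalism": ["Greenpeace"],
    "Civil rights": ["Civil disobedience"],
    "Climate activism": ["School strike for climate"],
}


def perspective_articles(root, max_depth):
    """Collect articles in `root` and its sub-categories up to `max_depth` levels down."""
    seen, articles = {root}, []
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        articles.extend(CATEGORY_ARTICLES.get(category, []))
        if depth < max_depth:
            for child in SUBCATEGORIES.get(category, []):
                if child not in seen:  # guard against cycles in the category graph
                    seen.add(child)
                    queue.append((child, depth + 1))
    return articles


print(perspective_articles("Activism", max_depth=1))
```

    Limiting the depth matters because Wikipedia categories drift away from the root topic as the walk goes deeper.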



    Keyword Extraction via Identification of Domain-Specific Keywords

    Problem: Given a collection of document titles from different school websites, we extract domain-specific keywords for the entire website that represent its domain.

    Example: Information Retrieval, Science

    [Figure: pipeline diagram with the labels, in order: title of web pages; Wikipedia articles & redirects; community detection algorithm; domain-specific phrases; identifies readable phrases; domain-specific single terms; merging both; domain-specific keywords; by exploiting Wikipedia article-category structure.]


    Innovation in Automotive

    [Figure legend]

    • Red: probability 1.0
    • Green: probability 0.5
    • White: probability 0.0

    Size represents how much a category is mentioned inside the dataset.


    Python Snippet for the Usage of the WikiMadeEasy API

    # Wiki_client_service comes from the WikiMadeEasy API; its import is not shown on the slide.
    wiki_client = Wiki_client_service()

    # Each request is a list: operation name, argument, and a trailing integer
    # whose meaning is not explained on the slide.
    print(wiki_client.process(['isTitle', 'business', 0]))
    print(wiki_client.process(['isPerson', 'albert einstein', 0]))
    print(wiki_client.process(['mentionInCategories', 'data mining', 0]))
    print(wiki_client.process(['containsArticles', 'business', 0]))
    print(wiki_client.process(['matchesCategories', 'pakistan', 0]))
    print(wiki_client.process(['matchesArticles', 'computer science', 0]))
    print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0]))
    print(wiki_client.process(['getWikiInlinks', 'google', 0]))
    print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0]))
    print(wiki_client.process(['getSubCategory', 'science', 0]))
    print(wiki_client.process(['getSuperCategory', 'science', 0]))
    graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])



