Text mining and Visualizations

Transcript of Text mining and Visualizations

  1. Patent Data Mining and Visualization Functionalities: A foray into the worlds of &
  2. Overview: Data Mining; What is Text Mining?; Text Mining Process; Text Transformation; Feature Selection: tf-idf; Feature Selection: Term-Document Matrix; Feature Selection: Term-Term Matrix; Word Clouds and Clustering Examples; R and KNIME; Live Example: R Shiny; Visualizations: SVG and D3; The Big Data, R and KNIME; KNIME Versus R; Conclusions; Document Vectorization
  3. Data Mining. Data mining = building models. A model (regression, decision trees, neural networks) is a set of rules connecting a collection of inputs to a particular target outcome. A model can explain outcomes of particular interest as predicted by the available facts. Data mining tasks: classification, estimation, prediction, affinity grouping, clustering. Directed: finding a particular target variable. Undirected: discovering structure in the data without any target variable in mind.
  4. Why this Study? Apply data mining techniques to understand the fine structure of published patent documents. Features of patent documents: structured components (patent number, filing dates, assignees, regional coverage) and unstructured components (title, claims, abstract, descriptions). Outcome of data mining visualizations: augment manual interpretation of the results, and address visualization limitations by providing collapsible layouts, interactive graphs, etc.
  5. What Is Text Mining? The objective of text mining is to exploit information contained in textual documents in various ways, including discovery of patterns and trends in data, associations among entities, predictive rules, etc. (Grobelnik et al., 2001). Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known (Hearst, 1999). References: M. Hearst, "Untangling Text Data Mining," in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999. M. Grobelnik, D. Mladenic, and N. Milic-Frayling, "Text Mining as Integration of Several Related Research Areas: Report on KDD2000 Workshop on Text Mining," 2000.
  6. Text Mining Process. Preprocessing: data import, text preprocessing. Text transformation: stop word removal, stemming, parts-of-speech tagging, n-gram generation, synonym generalization. Feature selection and data mining: term-document matrix, term-term matrix, clustering or classification.
  7. Text Transformation. Example input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in". Stop word removal (stop words: "and", "for", "in", "is", "it", "not", "the", "to", "its"): "Gulf Applied Technologies Inc said sold its subsidiaries engaged". Stemming: "Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin". Parts-of-speech tagging: "Gulf/NNP Applied/NNP Technologies/NNPS Inc/NNP said/VBD its/PRP sold/VBD", where, e.g., NNP stands for proper noun, singular, and VBD stands for verb, past tense. N-grams: "Gulf Appli". Synonyms (WordNet): synonyms("company") returns "caller", "companionship", "company", "fellowship".
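Two of the transformations on this slide, stop word removal and n-gram generation, can be sketched in a few lines. The deck itself works in R and KNIME; the Python below is only an illustration, and the function names `remove_stop_words` and `ngrams` are hypothetical. The stop word list is the short one quoted on the slide, so real lists will be much longer.

```python
# Stop words as listed on the slide (illustrative; real lists are longer).
STOP_WORDS = {"and", "for", "in", "is", "it", "not", "the", "to", "its"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
tokens = sentence.split()            # naive whitespace tokenization
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, 2)

print(filtered)
print(bigrams[:2])
```

Stemming and parts-of-speech tagging are left out because they need a trained model or a full rule set (for example a Porter stemmer) rather than a few lines of code.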
  8. Text Transformation: Regular Expressions (regex). A regular expression (abbreviated regex or regexp) is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, i.e. "find and replace"-like operations. Regular expressions are a standard feature of Unix text-processing utilities such as grep and are now supported by most programming languages and text tools. A simple regexp, ^[ \t]+|[ \t]+$, matches excess whitespace at the beginning and end of a line. A more advanced regexp that matches any numeral is ^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$. One more example (patterns used on patent text): [c|C]ollimat*, DAP.*, [g|G]uid.*[f|F]ield, [f|F]ield.*[g|G]uid, [L|l]ight.*[b|B]eam, [L|l]aser.*[b|B]eam, [b|B]eam.*[L|l]ight, [b|B]eam.*[L|l]aser
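The patterns on this slide can be tried out with Python's standard `re` module. This is a minimal sketch rather than the deck's own R code: the helper names `trim` and `is_numeral` are hypothetical, and the sample strings are made up for illustration.

```python
import re

# Excess whitespace at the beginning or end of a line.
TRIM = re.compile(r"^[ \t]+|[ \t]+$")

# Any numeral: optional sign, digits with optional fraction, optional exponent.
NUMERAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")

# One of the keyword patterns from the slide: "light" followed later by "beam".
LIGHT_BEAM = re.compile(r"[Ll]ight.*[Bb]eam")


def trim(line):
    """Remove leading and trailing spaces and tabs."""
    return TRIM.sub("", line)


def is_numeral(text):
    """True if the whole string is a number such as -1.5e10 or .5."""
    return NUMERAL.match(text) is not None


print(trim("  \tlight beam  "))
print(is_numeral("-1.5e10"), is_numeral("abc"))
print(bool(LIGHT_BEAM.search("a light source emitting a collimated beam")))
```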
  9. Feature Selection: Term Frequency-Inverse Document Frequency (tf-idf). tf-idf is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
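The slide gives no formula, so the sketch below assumes one common variant: term frequency is the term's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. Other weightings exist. The function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "ultrasound probe with transducer array".split(),
    "ultrasound imaging system".split(),
    "laser beam collimator".split(),
]
w = tf_idf(docs)
# "probe" occurs in one document, "ultrasound" in two, so "probe" weighs more.
print(w[0]["probe"], w[0]["ultrasound"])
```

Under this variant a term that appears in every document gets weight zero, which is the "offset by the frequency of the word in the corpus" effect the slide describes.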
  10. Feature Selection: Term-Document Matrix
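The slide shows only a figure, so here is a small sketch of what a term-document matrix holds: one row per term, one column per document, each cell counting occurrences. In the deck this is produced with R's tm package. The Python function `term_document_matrix` and the toy documents are hypothetical stand-ins.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) with matrix[i][j] = count of terms[i] in docs[j]."""
    terms = sorted({t for doc in docs for t in doc})
    counts = [Counter(doc) for doc in docs]
    # Counter returns 0 for terms that are absent from a document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = [["beam", "laser", "beam"], ["laser", "probe"]]
terms, tdm = term_document_matrix(docs)
for term, row in zip(terms, tdm):
    print(term, row)
```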
  11. Feature Extraction: Term-Term Matrix
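A term-term matrix can be read as a co-occurrence table. The sketch below assumes the simplest definition: each cell counts the documents in which both terms appear, and the diagonal holds each term's document frequency. The slide does not say which definition was used, and the name `term_term_matrix` is hypothetical.

```python
from itertools import combinations


def term_term_matrix(docs):
    """Return (terms, matrix) with matrix[i][j] = number of documents
    containing both terms[i] and terms[j] (assumed definition)."""
    terms = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(terms)}
    n = len(terms)
    matrix = [[0] * n for _ in range(n)]
    for doc in docs:
        present = sorted(set(doc))
        for t in present:
            matrix[index[t]][index[t]] += 1      # diagonal: document frequency
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1      # keep the matrix symmetric
            matrix[index[b]][index[a]] += 1
    return terms, matrix


docs = [["beam", "laser"], ["laser", "probe"], ["beam", "laser", "light"]]
tt_terms, ttm = term_term_matrix(docs)
for term, row in zip(tt_terms, ttm):
    print(term, row)
```

A symmetric matrix of this kind is the usual input for the hierarchical clustering and word-cloud views shown on the next slide.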
  12. Word Clouds and Hierarchical Clustering Using Term-Term Matrix
  13. Clustering (k-means) and Contour Plots
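For readers unfamiliar with k-means, the following is a bare-bones sketch of Lloyd's algorithm on 2-D points. It is not the deck's R code. Seeding the centroids with the first k points is an assumption made to keep the run deterministic, and the names `kmeans` and `sq_dist` are hypothetical. In the deck's setting the points would be document vectors rather than the toy coordinates used here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Assumption: centroids are seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster in place
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```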
  14. R and KNIME. R: a software package especially suitable for data analysis and data (text) mining, with rich visualization functionality; scripting interface; graphical user interface development via the shiny package. KNIME: supports modular, node-based workflows; the core functionality required for data and text mining is implemented via these nodes; node functionality can be extended with R and Java code snippets in the nodes. (R example)
  15. Workflow in KNIME
  16. Live Example: R Shiny Package. Web applications using (only) R; no need for HTML or JavaScript; great for communication and visualization. http://www.rstudio.com/shiny/showcase/ http://rstudio.github.io/shiny/tutorial/ ui.R: put all UI-related code here. server.R: put all server-side (event-handling) code here. The two communicate over a socket. (R Shiny example)
  17. SVG and D3
  18. The Big Data, R and KNIME. pbdR is an academic initiative; it requires special permission to access a cluster of computers called Tara. All Revolution R Enterprise 7 editions are distributed with open source R (version 3.0.2), are 100% compatible with R scripts, functions and CRAN packages, and include phone and online technical support. ParAccel Hadoop Analytics
  19. KNIME Versus R
      - KNIME: visual programming interface; intuitive, but some familiarity is required. R: scripting interface; steep learning curve.
      - KNIME: workflows can be tailor-made. R: workflows can be tailor-made; R Shiny user interface.
      - KNIME: all text mining and data analytics tools are available from a single user interface; classification problems (supervised learning) can be handled better here, as all the required libraries are present in one place and intermediate results can be viewed at the node output ports. R: most libraries for text mining and data analytics are available, but they must be loaded before use.
      - KNIME: the desktop version is available for free, but the server version has special requirements. R: server as well as desktop versions are available.
      - KNIME: requires a reasonably modern PC running Linux, Windows (XP and later), or Mac OS X; a multi-core system is a plus. R: memory limitations can be overcome using packages such as ff and ffbase.
      - KNIME: graphics output can be sent to SVG etc. R: graphics can be sent to SVG etc.; graphics can also be sent to DHTML using R Shiny.
      - KNIME: R and Java code can be used at nodes to create proprietary analyses and visualizations.
      - KNIME: robust big data extensions are available for distributed frameworks such as Hadoop. R: Programming with Big Data in R (pbdR) and distributed frameworks such as Hadoop.
  20. Conclusions. Starting with the reasons for doing this project, tools like R and KNIME were examined for their suitability for text data mining and automatic classification. Owing to the availability of several built-in libraries, R and KNIME are well suited to text data mining. R and KNIME can be used in a big data setting, though this may require additional hardware and the use of proprietary software. KNIME scores over R in ease of use because of its node-based visual programming interface. This study is exploratory in nature, and no serious attempt is made to solve problems related to automatic document classification. Some of the text mining libraries that were explored: the tm library in R, for generating the term-document matrix and for removing stop words and punctuation marks from text; the tm library again, for n-gram tokenization (taking two words at a time); the OpenNLP library, for parts-of-speech tagging; the Snowball and Porter stemmers, for stemming text. The graphing capabilities of R and KNIME were explored for the visual depiction of text in the form of word clouds.
  21. Thank You
  22. Backup Slides
  23. Text Mining with R: Regular Expressions; Parts of Speech (POS) Tagging

      Tag   Meaning              Examples
      ADJ   adjective            new, good, high, special, big, local
      ADV   adverb               really, already, still, early, now
      CNJ   conjunction          and, or, but, if, while, although
      DET   determiner           the, a, some, most, every, no
      EX    existential          there, there's
      FW    foreign word         dolce, ersatz, esprit, quo, maitre
      MOD   modal verb           will, can, would, may, must, should
      N     noun                 year, home, costs, time, education
      NP    proper noun          Alison, Africa, April, Washington
      NUM   number               twenty-four, fourth, 1991, 14:24
      PRO   pronoun              he, their, her, its, my, I, us
      P     preposition          on, of, at, with, by, into, under
      TO    the word to          to
      UH    interjection         ah, bang, ha, whee, hmpf, oops
      V     verb                 is, has, get, do, make, see, run
      VD    past tense           said, took, told, made, asked
      VG    present participle   making, going, playing, working
      VN    past participle      given, taken, begun, sung
      WH    wh determiner        who, which, when, what, where, how
  24. Invocation of Shiny. runApp takes the name of the test directory; in this example it is Test_Shiny01. This directory contains Test.csv as the data source and two R files, ui.R and server.R. ui.R defines the user interface, in this case an HTML page with tabs and a sidebar panel (with user controls). server.R does all the event handling after the user selects the Test.csv file. The present implementation works only with the Test.csv file.
  25. Choosing the data source: click the Browse button and choose the file Test.csv, then click "Update now".
  26. Different Tab Views: Histogram of Value Scores (axis: Value Score)
  27. IPC Word Cloud
  28. Box Plots Based on Value Score for the Top Five Players (Companies)
  29. R Patent Informatics: Word Clouds and Cluster Dendrograms. Figures: word cloud based on IPC codes; bigram cloud (a bigram contains two words); cluster dendrogram of the different technical aspects of ultrasound that are associated with the ultrasound probe.
  30. Workflows in KNIME. Each individual patent is treated as a file; these files are generated using R code. For this text mining example, the title, abstract and claims data are used. Figure labels: Java Code Snippet, R Code Snippet.
  31. Appendix III: Word Cloud
  32. Appendix II: Partition Clustering in R (k-means); Principal Component Analysis