Text Mining Meets Neural Nets: Mining the Biomedical Literature

Transcript of Text mining meets neural nets

Page 1: Text mining meets neural nets

Dan Sullivan, October 21, 2015

Portland, OR

Text Mining Meets Neural Nets: Mining the Biomedical Literature

Page 2: Text mining meets neural nets

*Overview

* Introduction to Natural Language Processing and Text Mining

* Linguistic and Statistical Approaches

*Critiquing Classifier Results

* A New Dawn: Deep Learning

* What’s Next

Page 3: Text mining meets neural nets

*My Background

* Enterprise Architect, Big Data and Analytics

* Former Research Scientist, bioinformatics institute

* Completing PhD in Computational Biology with focus on text mining

*Author

*Contact: [email protected] | @dsapptech | linkedin.com/in/dansullivanpdx

Page 4: Text mining meets neural nets

*Introduction to Natural Language Processing and Text Mining

Page 5: Text mining meets neural nets

*“Text is unstructured”

Page 6: Text mining meets neural nets

*Unstructured?

Page 7: Text mining meets neural nets

*Challenges in Text Analysis

* Manual procedures are time consuming and costly

* The volume of literature continues to grow

* Commonly used search techniques (keyword search, similarity search, metadata filtering, etc.) can still return volumes of literature that are difficult to analyze manually

* Some success with popular tools, but limitations remain

Page 8: Text mining meets neural nets

*Dominant Eras in NLP

* Linguistic (from 1960s): focus on syntax; transformational grammar; sentence parsing

* Statistical (from 1990s): focus on words, n-grams, etc.; statistics and probability; related work in Information Retrieval; topic modeling and classification

* Deep Learning (from ~2006): focus on multi-layered neural nets computing non-linear functions; light on theory, heavy on engineering; multiple NLP tasks

Page 9: Text mining meets neural nets

*Symbolic vs Sub-Symbolic


Page 10: Text mining meets neural nets

*Linguistic and Statistical Approaches

http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets

Page 11: Text mining meets neural nets

*Linguistic Approaches

Page 12: Text mining meets neural nets

*Linguistic Approaches - Syntax

Image: http://www.nltk.org/book_1ed/ch08.html

Page 13: Text mining meets neural nets

*Linguistic Approaches - Semantics

Stephen H. Chen et al. Physiol. Genomics 2005;22:257-267

Page 14: Text mining meets neural nets

*Statistical Approaches

Page 15: Text mining meets neural nets

*Statistical Approach: Topic Models

* Technique for identifying the dominant themes in a document

* Does not require labeled training data

* Multiple algorithms: Probabilistic Latent Semantic Indexing (PLSI), Latent Dirichlet Allocation (LDA)

* Assumptions: a document is about a mixture of topics; the words used in the document are attributable to those topics

Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/

Page 16: Text mining meets neural nets

Example topics: Debt, Law, Graduation | Debt, EU, Greece, Euro | EU, Greece, Negotiations, Varoufakis

Source: http://www.nytimes.com/pages/business/index.html April 27, 2015

Page 17: Text mining meets neural nets

*Topic Modeling Techniques

* Topics are represented by words; documents are about a set of topics
  * Doc 1: 50% politics, 50% presidential
  * Doc 2: 25% CPU, 30% memory, 45% I/O
  * Doc 3: 30% cholesterol, 40% arteries, 30% heart

* Learning topics
  * Assign each word to a topic
  * For each word and topic, compute the probability of the topic given the document, P(topic|doc), and the probability of the word given the topic, P(word|topic)
  * Reassign the word to a new topic with probability P(topic|doc) * P(word|topic)
  * Reassignment is based on the probability that topic T generated the use of word W (see the Gensim sketch below)
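The reassignment loop above is essentially what off-the-shelf topic-modeling libraries do. Below is a minimal sketch using Gensim's LDA implementation (Gensim appears in the tools list later in the deck); the tiny corpus and parameter values are placeholders, not from the talk.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens
docs = [
    ["debt", "law", "graduation", "loans"],
    ["debt", "eu", "greece", "euro", "negotiations"],
    ["cholesterol", "arteries", "heart", "disease"],
]

dictionary = corpora.Dictionary(docs)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words counts

# LDA assigns words to topics and iteratively reassigns them in proportion
# to P(topic|doc) * P(word|topic), as described above.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.print_topics(num_topics=2, num_words=4):
    print(topic_id, words)

# Topic mixture for a new document
new_doc = dictionary.doc2bow(["greece", "debt", "euro"])
print(lda[new_doc])   # e.g. [(0, 0.87), (1, 0.13)]
```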

Page 18: Text mining meets neural nets

Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/

Page 19: Text mining meets neural nets

*Training a Text Classifier

* 3 key components: data, representation scheme, algorithms

* Data
  * Positive examples – examples from a representative corpus
  * Negative examples – randomly selected from the same publications

* Representation
  * TF-IDF
  * Vector space representation
  * Cosine of vectors as a measure of similarity

* Algorithms – supervised learning: SVMs, Ridge Classifier, Perceptrons, kNN, SGD Classifier, Naïve Bayes, Random Forest, AdaBoost (a pipeline sketch follows below)
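A minimal sketch of that setup with scikit-learn: TF-IDF vectors feeding a linear SVM. The example sentences and labels are placeholders standing in for the positive/negative examples described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: 1 = virulence-factor sentence, 0 = other
sentences = [
    "EspP influences the intestinal colonization of calves.",
    "Data were log-transformed to correct for heterogeneity of the variances.",
]
labels = [1, 0]

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),  # vector-space representation with TF-IDF weights
    ("svm", LinearSVC()),          # large-margin classifier; try RidgeClassifier, kNN, etc.
])
classifier.fit(sentences, labels)

print(classifier.predict(["The DsbLI system also comprises a functional redox pair"]))
```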

Page 20: Text mining meets neural nets

*Text Classification Process

Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/

Page 21: Text mining meets neural nets

*Representation: TF-IDF

* Term Frequency (TF): tf(t, d) = number of occurrences of term t in document d

* Inverse Document Frequency (IDF): idf(t, D) = log(N / |{d in D : t in d}|), where D is the set of documents and N is the number of documents

* TF-IDF = tf(t, d) * idf(t, D)

* TF-IDF is:
  * large when the term is frequent in the document but appears in few documents overall
  * small when the term appears in many documents
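A direct transcription of those formulas in Python, as a sanity check; the toy documents are illustrative only.

```python
import math

def tf(term, doc):
    """tf(t, d): number of occurrences of term t in document d (a token list)."""
    return doc.count(term)

def idf(term, docs):
    """idf(t, D) = log(N / |{d in D : t in d}|)."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    "the espb gene is a known virulence factor".split(),
    "the strain translocates reduced levels of espb".split(),
    "the data were log transformed".split(),
]

print(tf_idf("espb", docs[0], docs))  # larger: frequent here, appears in few documents
print(tf_idf("the", docs[0], docs))   # 0.0: appears in every document
```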

Page 22: Text mining meets neural nets

*Sparse Representations

One-Hot Representation ("The EspB gene is a known virulence ...")

            The  EspB  gene  is  a  known  virulence
The          1    0     0    0  0    0       0
EspB         0    1     0    0  0    0       0
gene         0    0     1    0  0    0       0
is           0    0     0    1  0    0       0
a            0    0     0    0  1    0       0
known        0    0     0    0  0    1       0
virulence    0    0     0    0  0    0       1

TF-IDF Representation

            translocates  reduced  levels  of      EspB    host    cell
Sentence 1  0.193         0.2828   0.078   0.0001  0.389   0.0144  0.011
Sentence 2  0             0.0091   0.0621  0       0       0       0
Sentence 3  0             0        0       0       0.028   0.0113  0
Sentence 4  0.021         0        0       0       0       0       0
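For comparison, the same two kinds of sparse vectors can be produced with scikit-learn; the sentences below are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "the espb gene is a known virulence factor",
    "espb translocates into the host cell",
]

counts = CountVectorizer().fit_transform(sentences)  # sparse term-count vectors
tfidf = TfidfVectorizer().fit_transform(sentences)   # sparse TF-IDF weighted vectors

print(counts.toarray())  # mostly zeros: one column per unique word in the corpus
print(tfidf.toarray())
```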

Page 23: Text mining meets neural nets

*Representation: Vector Space

* Bag-of-words model

* Ignores the structure (syntax) and meaning (semantics) of sentences

* The representation vector's length is the number of unique words in the corpus

* Stemming is used to remove morphological differences

* Each word is assigned an index i in the representation vector V

* The value V[i] is non-zero if the word appears in the sentence represented by the vector

* The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus

Page 24: Text mining meets neural nets

*Classification Algorithms

* A Support Vector Machine (SVM) is a large-margin classifier

* Commonly used in text classification

* Initial results are based on a life-sciences sentence classifier

Image source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png

Page 25: Text mining meets neural nets

*Critiquing Classifier Results

Page 26: Text mining meets neural nets

Virulence Factor (VF) Misclassification Examples

Non-VF, predicted VF:

* "Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell."

* "Data were log-transformed to correct for heterogeneity of the variances where necessary."

* "Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170."

VF, predicted Non-VF:

* "Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves."

* "Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS."

* "The DsbLI system also comprises a functional redox pair."

Page 27: Text mining meets neural nets

Preliminary Results - Training Error

Adding more examples is not likely to substantially improve results, as shown by the error curve.

[Figure: learning curve of training error and validation error (roughly 0.05-0.5) versus number of training examples (0-10,000).]

Page 28: Text mining meets neural nets

*Alternative Supervised Learning Algorithms

* 8 alternative algorithms evaluated

* Select the 10,000 most important features using chi-square (see the sketch below)
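A minimal sketch of that feature-selection step in scikit-learn; the classifier choice and the value of k are illustrative (the slide uses the 10,000 most important features).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("chi2", SelectKBest(chi2, k=10000)),  # keep the 10,000 highest-scoring features
    ("clf", MultinomialNB()),              # swap in any of the 8 alternative algorithms
])
# pipeline.fit(train_sentences, train_labels)
# pipeline.predict(test_sentences)
```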

Page 29: Text mining meets neural nets

*Improving Quality

* Increase the quantity of data (not always helpful; see the error curves)

* Improve the quality of data

* Utilize multiple supervised algorithms, ensemble and non-ensemble

* Use unlabeled data and semi-supervised techniques

* Feature selection

* Parameter tuning

* Feature engineering

* Given: high-quality data in sufficient quantity and state-of-the-art machine learning algorithms

* How to improve results: change the representation?

Page 30: Text mining meets neural nets

*Representation Schemes

*TF-IDF

* Loss of syntactic and semantic information

* No relation between a term's index and its meaning

* No support for disambiguation

* Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties

*Ideal Representation

* Captures the semantic similarity of words

* Does not require feature engineering

* Minimal pre-processing, e.g. no mapping to ontologies

* Improves precision and recall

Page 31: Text mining meets neural nets

*A New Dawn: Deep Learning

Page 32: Text mining meets neural nets

*Word Embeddings

*Dense vector representation (n = 50 … 300 or more)

*Capture semantics – similar words close by cosine measure

*Captures language features: syntactic relations and semantic relations

Page 33: Text mining meets neural nets

*Dense Word Representation

[0.160610 -0.547976 -0.444522 -0.037896 0.044305 0.245423 -0.261498 0.000294 -0.275621 -0.021201 -0.432955 0.388905 0.106494 0.405797 -0.159357 -0.073897 0.177182 0.043535 0.600987 0.064762 -0.348964 0.189289 0.650318 0.112554 0.374456 -0.227780 0.208623 0.065362 0.235401 -0.118003 0.032858 -0.309767 0.024085 -0.055148 0.158807 0.171749 -0.153825 0.090301 0.033275 0.089936 0.187864 -0.044472 0.421533 0.209217 -0.142092 0.153070 -0.168291 -0.052823 -0.090984 0.018695 -0.265503 -0.055572 -0.212252 -0.326411 -0.083590 -0.009575 -0.125065 0.376738 0.059734 -0.005585 -0.085654 0.111499 -0.099688 0.147020 -0.419087 -0.042069 -0.241274 0.154339 -0.008625 -0.298928 0.060612 0.216670 -0.080013 -0.218985 -0.805539 0.298797 0.089364 0.071044 0.390878 0.167600 -0.101478 -0.017312 -0.260500 0.392749 0.184021 -0.258466 -0.222133 0.357018 -0.244508 0.221385 -0.012634 -0.073752 -0.409362 0.113296 0.048397 0.000424 0.146018 -0.060891 -0.139045 -0.180432 0.014984 0.023384 -0.032300 -0.161608 -0.188434 0.018036 0.023236 0.060335 -0.173066 0.053327 0.523037 -0.330135 -0.014888 -0.124564 0.046332 -0.124301 0.029865 0.144504 0.163142 -0.018653 -0.140519 0.060562 0.098858 -0.128970 0.762193 -0.230067 -0.226374 0.100086 0.367147 0.160035 0.148644 -0.087583 0.248333 -0.033163 -0.312134 0.162414 0.047267 0.383573 -0.271765 -0.019852 -0.033213 0.340789 0.151498 -0.195642 -0.105429 -0.172337 0.115681 0.033890 -0.026444 -0.048083 -0.039565 -0.159685 -0.211830 0.191293 0.049531 -0.008248 0.119094 0.091608 -0.077601 -0.050206 0.147080 -0.217278 -0.039298 -0.303386 0.543094 -0.198962 -0.122825 -0.135449 0.190148 0.262060 0.146498 -0.236863 0.140620 0.128250 -0.157921 -0.119241 0.059280 -0.003679 0.091986 0.105117 0.117597 -0.187521 -0.388895 0.166485 0.149918 0.066284 0.210502 0.484910 0.396106 -0.118060 -0.076609 -0.326138 -0.305618 -0.297695 -0.078404 -0.210814 0.423335 -0.377239 -0.323599 0.282586]

immune_system

Page 34: Text mining meets neural nets

*Learning Word Representation

* Linguistic terms with similar distributions have similar meanings

* Requires a large volume of data: billions of words in context, multiple passes over the data

* Algorithms: Word2Vec (CBOW and skip-gram), GloVe

T. Mikolov et al. "Efficient Estimation of Word Representations in Vector Space." 2013. http://arxiv.org/pdf/1301.3781.pdf
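A minimal sketch of training such embeddings with Gensim's word2vec module (parameter names follow the pre-4.0 Gensim API; the corpus and hyperparameters are placeholders).

```python
from gensim.models import Word2Vec

# sentences: an iterable of token lists, e.g. tokenized biomedical abstracts;
# useful vectors require very large corpora (billions of words in context).
sentences = [
    ["the", "espb", "gene", "is", "a", "known", "virulence", "factor"],
    ["shigella", "causes", "dysentery", "in", "humans"],
]

model = Word2Vec(
    sentences,
    size=200,     # dimensionality of the dense vectors (50-300 or more)
    window=5,     # context window size
    sg=1,         # 1 = skip-gram, 0 = CBOW
    min_count=1,
    iter=5,       # multiple passes over the data
)

print(model.most_similar("shigella", topn=5))
```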

Page 35: Text mining meets neural nets

*Skip-gram predicts the surrounding words

Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc

Page 36: Text mining meets neural nets

*CBOW predicts the current word

Image: https://drive.google.com/file/d/0B7XkCwpI5KDYRWRnd1RzWXQ2TWc

Page 37: Text mining meets neural nets

*Word Similarity - Malaria

Page 38: Text mining meets neural nets

*Word Similarity: Alanine (Amino Acid)

Page 39: Text mining meets neural nets

*Word Similarity: Leukocyte

Page 40: Text mining meets neural nets

*Word Similarity: Shigella

Page 41: Text mining meets neural nets

*Analogy I (correct)

Heart : Cardiovascular as Kidney : ?

Page 42: Text mining meets neural nets

*Analogy II (near miss)

Salmonella : Proteobacteria as Staphylococcus : ?

Page 43: Text mining meets neural nets

*Analogy III (miss)

Salmonella : Enterobacteriaceae as Staphylococcus : ? (expected: Staphylococcaceae)
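These analogy queries correspond to vector arithmetic over the embeddings. A minimal sketch with Gensim, assuming a trained `model` like the one sketched earlier and lower-cased tokens:

```python
# heart : cardiovascular :: kidney : ?
print(model.most_similar(positive=["cardiovascular", "kidney"],
                         negative=["heart"], topn=3))

# salmonella : enterobacteriaceae :: staphylococcus : ?
print(model.most_similar(positive=["enterobacteriaceae", "staphylococcus"],
                         negative=["salmonella"], topn=3))
# A "near miss" means the expected term ranks near, but not at, the top.
```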

Page 44: Text mining meets neural nets

*Quick Intro to Neural Networks

Page 45: Text mining meets neural nets

*Feed-forward neural network

Image: http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Page 46: Text mining meets neural nets

*Calculating with Neural Nets

Image: https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions

Page 47: Text mining meets neural nets

*Key Characteristics

* Non-linear activation function: sigmoid, hyperbolic tangent (tanh), rectifier (ReLU)

* Word embeddings

* Window size

* Loss function: binary, multiclass, cross-entropy
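A minimal NumPy sketch of a one-hidden-layer forward pass using these pieces; the weights are random placeholders rather than trained values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.RandomState(0)
x = rng.randn(300)                       # e.g. concatenated word embeddings in a window
W1, b1 = 0.1 * rng.randn(100, 300), np.zeros(100)
W2, b2 = 0.1 * rng.randn(2, 100), np.zeros(2)

h = np.tanh(W1 @ x + b1)                 # hidden layer (sigmoid or relu also usable)
scores = W2 @ h + b2                     # output layer
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over 2 classes
print(probs)
```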

Page 48: Text mining meets neural nets

*Training a Neural Network – Stochastic Gradient Descent

Images: http://u.cs.biu.ac.il/~yogo/nnlp.pdf; http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/
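The training loop itself can be sketched in a few lines; `grad_loss` below is a hypothetical function returning a gradient for each parameter, not part of any particular library.

```python
import random

def sgd(params, data, grad_loss, learning_rate=0.01, epochs=10):
    """params: dict of name -> numpy array; data: list of training examples."""
    for _ in range(epochs):
        random.shuffle(data)
        for example in data:                    # one example (or mini-batch) at a time
            grads = grad_loss(params, example)  # gradients of the loss w.r.t. params
            for name in params:
                params[name] -= learning_rate * grads[name]
    return params
```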

Page 49: Text mining meets neural nets

*Convolutional Neural Network for Text

Image: https://aclweb.org/anthology/P/P14/P14-2105.xhtml

Page 50: Text mining meets neural nets

*Sentence Classification with Convolutional Networks
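A minimal sketch of a Kim (2014)-style convolutional sentence classifier using Keras (one of the tools listed later); vocabulary size, embedding dimension and filter settings are illustrative.

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=200),              # word embedding lookup
    Conv1D(filters=100, kernel_size=3, activation="relu"),    # filters over word windows
    GlobalMaxPooling1D(),                                     # max-over-time pooling
    Dropout(0.5),
    Dense(1, activation="sigmoid"),                           # e.g. VF vs. non-VF sentence
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=50, epochs=5)  # x_train: padded word-id sequences
```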

Page 51: Text mining meets neural nets

*What’s Next?

Page 52: Text mining meets neural nets

*Survey n-dimensional Word Embedding Space

Image: http://greg.org/archive/2010/07/05/the_planck_all-sky_survey.html

Page 53: Text mining meets neural nets

*Formalize a Mathematical Model of Semantics

Image: http://riotwire.com/column/immigrants-socialists-and-semantics-oh-my/

Page 54: Text mining meets neural nets

*Tools and References

Page 55: Text mining meets neural nets

*Word Embedding Tools

* Word2Vec – command-line tool

* Gensim – Python topic-modeling library with a word2vec module

* GloVe (Global Vectors for Word Representation) – command-line tool

Page 56: Text mining meets neural nets

*Deep Learning Tools

* Theano: Python CPU/GPU symbolic expression compiler

* Torch: Scientific framework for LuaJIT

* PyLearn2: Python deep learning platform

* Lasagne: lightweight framework built on Theano

* Keras: Python library for working with Theano

* DeepDist: Deep Learning on Spark

* Deeplearning4J: Java and Scala, integrated with Hadoop and Spark

Page 57: Text mining meets neural nets

*References

*Deep Learning Bibliography - http://memkite.com/deep-learning-bibliography/

* Deep Learning Reading List – http://deeplearning.net/reading-list/

*Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).

* Goldberg, Yoav. "A Primer on Neural Network Models for Natural Language Processing." http://u.cs.biu.ac.il/~yogo/nnlp.pdf

Page 58: Text mining meets neural nets

*Q & A