Text classification - LiU
Transcript of Text classification - LiU
This work is licensed under a Creative Commons Attribution 4.0 International License.
Text classification
Marco Kuhlmann Department of Computer and Information Science
732A92/TDDE16 Text Mining (2020)
Text classification
• Text classification is the task of categorising text documents into predefined classes.
• The term ‘document’ is applied to everything from tweets and press releases to complete books.
Topic classification

UK: congestion, London, Parliament, Big Ben, Windsor, The Queen
China: Olympics, Beijing, tourism, Great Wall, Mao, Communist
Elections: recount, votes, seat, run-off, TV-ads, campaign
Sports: diamond, baseball, forward, soccer, team, captain

Adapted from Manning et al. (2008)
Forensic linguistics
‘I realized the faxed copy I just received was an outline of the manifesto, using much of the same wording, definitely the same topics and themes. … I invented [the language analysis] for this case and really, forensics linguistics took off after that.’
James Fitzgerald, profiler
Sources: Wikipedia, Newsweek
Sentiment analysis
The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.
… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
positive negative
The standard text classification pipeline
document collection → vectorizer (example: CountVectorizer) → matrix of document vectors → predictor (example: MultinomialNB) → vector of class labels
Reminder: Documents as tf–idf vectors
           Scandal in Bohemia   Final problem   Empty house   Norwood builder   Dancing men   Retired colourman
Adair      0.0000               0.0000          0.0692        0.0000            0.0000        0.0000
Adler      0.0531               0.0000          0.0000        0.0000            0.0000        0.0000
Lestrade   0.0000               0.0000          0.0291        0.1424            0.0000        0.0000
Moriarty   0.0000               0.0845          0.0528        0.0034            0.0000        0.0000
Documents as count vectors – the bag of words
the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j.r.r. tolkien’s middle-earth.
… is a sour little movie at its core an exploration of the emptiness that underlay the relentless gaiety of the 1920’s as if to stop would hasten the economic and global political turmoil that was to come
positive negative
Documents as count vectors – the bag of words
a adequately cannot co-writer/director column continuation describe elaborate expanded gorgeously huge is j.r.r. jackson lord middle-earth of of of of peter rings so that the the the tolkien trilogy vision words
… 1920’s a an and as at come core economic emptiness exploration gaiety global hasten if is its little movie of of political relentless sour stop that that the the the the to to turmoil underlay was would
positive negative
Documents as count vectors – the bag of words
positive:
Word      Count
of        4
the       3
words     1
vision    1
trilogy   1
…

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…
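The counts above can be reproduced in a few lines of Python. A minimal sketch using plain whitespace tokenisation (a real vectorizer such as scikit-learn's CountVectorizer tokenises more carefully, but the idea is the same):

```python
from collections import Counter

# The positive review from the slide, lowercased and with punctuation
# stripped, tokenised on whitespace.
positive = (
    "the gorgeously elaborate continuation of the lord of the rings "
    "trilogy is so huge that a column of words cannot adequately describe "
    "co-writer/director peter jackson's expanded vision of j.r.r. "
    "tolkien's middle-earth"
)

counts = Counter(positive.split())
print(counts["of"], counts["the"])  # 4 3
```

The full bag of words is simply this counter extended over the whole vocabulary, with zeros for absent words.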
The standard text classification pipeline
document collection → vectorizer (CountVectorizer) → matrix of document vectors → predictor (MultinomialNB) → vector of class labels
Text classification as supervised machine learning

[Diagram: nine documents about different topics (the UK, China, elections), each labeled with one of the classes A, B, C]
Text classification as supervised machine learning

[Diagram: the labeled documents are split into a training set, which is used for learning, and a test set, which is used for evaluation]
Training and testing
• Training
When we train a classifier, we present it with a document 𝑥 and its gold-standard class 𝑦 and apply some learning algorithm (maximise the likelihood / minimise the cross-entropy of the training data).
• Testing
When we evaluate a classifier, we present it with 𝑥 and compare the predicted class for this input with the gold-standard class 𝑦.
Two challenges in text classification
• Standard document representations such as the bag of words easily yield tens of thousands of features. computational challenge, data sparsity
• Many document collections are highly imbalanced with respect to the represented classes. frequency bias, problems for evaluation
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
Evaluation of text classifiers
Evaluation of text classifiers
• We require a test set consisting of a number of documents, each of which has been tagged with its correct class. typically part of a larger gold-standard data set
• To evaluate a classifier, we apply it to the test set and compare the predicted classes with the gold-standard classes.
• We do so in order to estimate how well the classifier will perform on new, previously unseen documents. assume that all samples are drawn from the same distribution
Evaluation of text classifiers

[Diagram: three test documents with gold-standard classes A, B, C, for which the classifier predicts the classes A, B, A]
Accuracy
The accuracy of a classifier is the proportion of documents for which the classifier predicts the gold-standard class:
accuracy = # correctly classified documents / # all documents
Accuracy
Document Gold-standard class Predicted class
Chinese Beijing Chinese China China
Chinese Chinese Shanghai China China
Chinese Macao China China
Tokyo Japan Chinese Japan China
accuracy for this example: 3/4 = 75%
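The calculation can be sketched in a couple of lines of Python; the labels below mirror the table:

```python
# Gold-standard classes and predicted classes for the four test documents.
gold = ["China", "China", "China", "Japan"]
pred = ["China", "China", "China", "China"]

# Accuracy: proportion of documents where the prediction matches the gold class.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.75
```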
Accuracy and imbalanced data sets
[Diagram: a balanced data set with two classes of 50% each, next to an imbalanced data set with class proportions 80% and 20%]

Is 80% accuracy good or bad?
The role of baselines
• Evaluation measures are no absolute measures of performance. Whether ‘80% accuracy’ is good or not depends on the task at hand.
• Instead, we should ask for a classifier’s performance relative to other classifiers, or other points of comparison. ‘Logistic Regression has a higher accuracy than Naive Bayes.’
• When other classifiers are not available, a simple baseline is to always predict the most frequent class in the training data.
Grading criteria

Assessment criteria

When grading your report, I will assess it with respect to the criteria defined below. For each criterion, I will assign either the score 0 (does not meet expectations) or a score between 1 (meets expectations) and 9 (exceeds expectations). To get a passing grade, your score for each criterion must be at least 1. Your grade is calculated from two component scores as specified in the following table. Together with your grade, you will receive a short written motivation based on the descriptors for each criterion.

Component �   Component �   Grade 732A92   Grade TDDE16
�   �   E   �
�   �   D   �
�   �   C   �
�   �   D   �
�   �   C   �
�   �   B   �
�   �   C   �
�   �   B   �
�   �   A   �

Component �

This component is the score for a single criterion.

Soundness and correctness: Is the technical approach sound and well-chosen? Are the claims made in the report supported by proper experiments, and are the results of these experiments correctly interpreted?

� Troublesome. There are some ideas worth salvaging here, but the work should really have been done or evaluated differently.
� Fairly reasonable work. The approach is not bad, the evaluation is appropriate, and at least the main claims are probably correct.
� Generally solid work, although there are some aspects of the approach or the evaluation that I am not sure about.
� The approach is fully apt, and all claims are convincingly supported. The report discusses the limitations of the work.
Confusion matrix
                              classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’      true positives           false negatives
gold standard ‘negative’      false positives          true negatives
Accuracy
                              classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’      true positives           false negatives
gold standard ‘negative’      false positives          true negatives
Precision and recall
• Precision and recall ‘zoom in’ on how good a system is at identifying documents of a specific class 𝑐.
• Precision is the proportion of correctly classified documents among all documents for which the system predicts class 𝑐. When the system predicts class 𝑐, how often is it correct?
• Recall is the proportion of correctly classified documents among all documents with gold-standard class 𝑐. When the document has class 𝑐, how often does the system predict it?
Precision with respect to the positive class
                              classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’      true positives           false negatives
gold standard ‘negative’      false positives          true negatives
Recall with respect to the positive class
                              classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’      true positives           false negatives
gold standard ‘negative’      false positives          true negatives
Precision and recall with respect to the positive class
precision = # true positives / (# true positives + # false positives)

recall = # true positives / (# true positives + # false negatives)
F1-measure
A good classifier should balance between precision and recall. The F1-measure is the harmonic mean of the two values:
F1 = (2 · precision · recall) / (precision + recall)
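A minimal sketch of the three measures as Python functions; the counts in the example call (tp=6, fp=2, fn=4) are made up for illustration:

```python
def precision(tp, fp):
    # Proportion of correct predictions among all documents predicted as class c.
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of correct predictions among all documents with gold class c.
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

p = precision(tp=6, fp=2)  # 0.75
r = recall(tp=6, fn=4)     # 0.6
print(f1(p, r))            # 0.666...
```

Note that when precision and recall are equal, the harmonic mean coincides with them; otherwise it is pulled towards the smaller value.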
F1-measure

[Plot: the F1-measure as a function of precision and recall]
Accuracy with three classes
     A    B    C
A   58    6    1
B    5   11    2
C    0    7   43
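Accuracy can be read off the matrix as the diagonal divided by the total; a sketch using the numbers above:

```python
# Confusion matrix from the slide; the diagonal holds the correctly
# classified documents.
M = [[58, 6, 1],
     [5, 11, 2],
     [0, 7, 43]]

correct = sum(M[i][i] for i in range(len(M)))
total = sum(sum(row) for row in M)
print(correct, total, correct / total)  # 112 133 0.842...
```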
Precision with respect to class B
     A    B    C
A   58    6    1
B    5   11    2
C    0    7   43
Recall with respect to class B
     A    B    C
A   58    6    1
B    5   11    2
C    0    7   43
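A sketch of the two computations, assuming the scikit-learn convention that rows hold gold-standard classes and columns hold predicted classes:

```python
M = [[58, 6, 1],
     [5, 11, 2],
     [0, 7, 43]]
B = 1  # index of class B

# Precision wrt B: true positives over everything predicted as B (column B).
precision_B = M[B][B] / sum(row[B] for row in M)
# Recall wrt B: true positives over everything with gold class B (row B).
recall_B = M[B][B] / sum(M[B])
print(precision_B, recall_B)  # 11/24 and 11/18
```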
Classification report in scikit-learn
              precision    recall  f1-score   support

           C       0.63      0.04      0.07       671
          KD       0.70      0.02      0.03       821
           L       0.92      0.02      0.04       560
           M       0.36      0.68      0.47      1644
          MP       0.36      0.25      0.29       809
           S       0.46      0.84      0.59      2773
          SD       0.57      0.12      0.20      1060
           V       0.59      0.15      0.24       950

    accuracy                           0.43      9288
   macro avg       0.57      0.26      0.24      9288
weighted avg       0.52      0.43      0.34      9288
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
The Naive Bayes classifier
Naive Bayes
• The Naive Bayes classifier is a simple but surprisingly effective probabilistic text classifier that builds on Bayes’ rule.
• It is called ‘naive’ because it makes strong (unrealistic) independence assumptions about probabilities.
Naive Bayes decision rule, informally

[Diagram: a positive class containing 70% of the documents, a negative class containing 30%, and a new document with the words ‘elaborate’ and ‘gorgeously’ that is to be classified]

score(pos) = 𝑃(pos) · 𝑃(elaborate | pos) · 𝑃(gorgeously | pos)

score(neg) = 𝑃(neg) · 𝑃(elaborate | neg) · 𝑃(gorgeously | neg)
The role of Bayes’ rule
• For classification, we would like to know 𝑃(class |document). 𝑃(disease |symptom)
• But a Naive Bayes classifier contains 𝑃(document |class). 𝑃(symptom |disease)
• The classifier uses Bayes’ rule to convert between the two. 𝑃(class |document) ∝ 𝑃(class) 𝑃(document |class)
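The proportionality can be turned into a proper posterior by normalising over all classes. A toy sketch in which the prior and the document likelihoods are made-up illustration values:

```python
# Illustrative numbers only: a prior P(class) and a made-up
# likelihood P(document | class) for each class.
prior = {"pos": 0.7, "neg": 0.3}
likelihood = {"pos": 0.010, "neg": 0.001}

# Bayes' rule: P(class | document) = P(class) P(document | class) / P(document)
unnormalised = {c: prior[c] * likelihood[c] for c in prior}
Z = sum(unnormalised.values())  # P(document), the normalising constant
posterior = {c: v / Z for c, v in unnormalised.items()}
print(posterior)
```

For classification only the arg max matters, so the normalising constant can be dropped, which is exactly what the proportionality above expresses.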
Formal definition of a Naive Bayes classifier
𝐶 a set of possible classes
𝑉 a set of possible words; the model’s vocabulary
𝑃(𝑐) probabilities that specify how likely it is for a document to belong to class 𝑐 (one probability for each class)
𝑃(𝑤 |𝑐) probabilities that specify how likely it is for a document to contain the word 𝑤, given that the document belongs to class 𝑐 (one probability for each class–word pair)
Naive Bayes decision rule
ĉ = argmax_{𝑐 ∈ 𝐶} 𝑃(𝑐) · ∏_{𝑤 ∈ 𝑉} 𝑃(𝑤 | 𝑐)^#(𝑤)

Here ĉ is the predicted class for the document, #(𝑤) is the count of the word 𝑤 in the document, and ‘arg max’ chooses that class 𝑐 which maximises the term to its right.
What does the Naive Bayes classifier do if all words in the document to be classified are out-of-vocabulary?
Log probabilities
• In order to avoid underflow, implementations use the logarithms of probabilities instead of the probabilities themselves. 𝑃(𝑤 |𝑐) becomes log 𝑃(𝑤 |𝑐)
• Note that in this case, instead of multiplying probabilities, we have to add their logarithms.
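The decision rule in log space can be sketched as follows; all model probabilities here are made-up illustration values:

```python
import math

# Made-up probabilities for a two-class sentiment model.
log_prior = {"pos": math.log(0.7), "neg": math.log(0.3)}
log_word = {
    "pos": {"elaborate": math.log(0.002), "gorgeously": math.log(0.001)},
    "neg": {"elaborate": math.log(0.0005), "gorgeously": math.log(0.0001)},
}

def score(c, words):
    # A sum of log probabilities instead of a product of probabilities.
    return log_prior[c] + sum(log_word[c][w] for w in words)

doc = ["elaborate", "gorgeously"]
predicted = max(log_prior, key=lambda c: score(c, doc))
print(predicted)  # pos
```

Because the logarithm is monotone, the class with the highest log score is also the class with the highest probability score.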
Log probabilities
[Plot: log 𝑝 as a function of 𝑝 for probabilities 0 < 𝑝 ≤ 1]
Learning a Naive Bayes classifier
[Diagram: nine documents about different topics (the UK, China, elections), each labeled with one of the classes A, B, C]
Learning a Naive Bayes classifier
[Diagram: the labeled documents are split into a training set, which is used for learning, and a test set, which is used for evaluation]
Learning a Naive Bayes classifier
[Diagram: from the training documents, labeled with the classes A, B, C, we estimate the class probabilities 𝑃(𝑐) and the word probabilities 𝑃(𝑤 | 𝑐)]
Maximum Likelihood Estimation (MLE)
• One standard technique for estimating the probabilities of a model is Maximum Likelihood Estimation (MLE). Find probabilities that maximise the probability of the training data.
• Depending on the type of model under consideration, MLE can be computationally challenging.
• For the special case of the Naive Bayes classifier, MLE amounts to equating probabilities with relative frequencies.
MLE for the Naive Bayes classifier
• To estimate the class probabilities 𝑃(𝑐):
Compute the percentage of documents with class 𝑐 among all documents in the training set.
• To estimate the word probabilities 𝑃(𝑤 |𝑐):
Compute the percentage of occurrences of the word 𝑤 among all word occurrences in documents with class 𝑐.
MLE for the Naive Bayes classifier
#(𝑐) number of documents with gold-standard class 𝑐
#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐
𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

𝑃(𝑤 | 𝑐) = #(𝑤, 𝑐) / Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐)
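A minimal sketch of these two estimates on a made-up three-document training set:

```python
from collections import Counter, defaultdict

# A tiny made-up training set of (class, text) pairs.
training = [
    ("sports", "goal match goal"),
    ("sports", "match win"),
    ("politics", "vote match"),
]

class_counts = Counter(c for c, _ in training)
word_counts = defaultdict(Counter)
for c, text in training:
    word_counts[c].update(text.split())

# MLE: probabilities are relative frequencies.
P_class = {c: n / len(training) for c, n in class_counts.items()}
P_word = {c: {w: n / sum(counts.values()) for w, n in counts.items()}
          for c, counts in word_counts.items()}

print(P_class["sports"], P_word["sports"]["goal"])  # 2/3 and 2/5
```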
MLE of word probabilities
31 tokens 37 tokens
The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.
… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
positive negative
MLE of word probabilities
positive:
Word      Count
of        4
The       2
words     1
vision    1
trilogy   1
…         …

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…         …
MLE of word probabilities
positive:
Probability         Estimated value
𝑃(of | pos)         4/31
𝑃(The | pos)        2/31
𝑃(words | pos)      1/31
𝑃(vision | pos)     1/31
𝑃(trilogy | pos)    1/31
…                   …

negative:
Probability         Estimated value
𝑃(the | neg)        4/37
𝑃(to | neg)         2/37
𝑃(that | neg)       2/37
𝑃(of | neg)         2/37
𝑃(would | neg)      1/37
…                   …
Smoothing
• If we equate word probabilities with relative frequencies, some probabilities may be zero. For example, ‘gorgeously’ may only occur in positive documents.
• This is a problem because we multiply probabilities in the decision rule for Naive Bayes. Slogan: Zero probabilities destroy information.
• To avoid zero probabilities, we can use techniques for smoothing the probability distribution.
Additive smoothing
• In additive smoothing, we add a constant value 𝛼 ≥ 0 to all counts before computing relative frequencies. For 𝛼 = 1, this technique is also known as Laplace smoothing.
• Effectively, we hallucinate 𝛼 extra occurrences of each word, including words that never occurred.
• The smoothing constant 𝛼 is a hyperparameter of the model that needs to be tuned on validation data.
Hyperparameters and cross-validation
• A hyperparameter is a parameter of a machine learning model that is not set by the learning algorithm itself.
• To tune hyperparameters, we need a separate validation set, or need to use cross-validation:
• Split the data into 𝑘 parts.
• In each fold, use 𝑘 − 1 parts for training, 1 part for validation.
• Take the mean performance over the 𝑘 folds.
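The splitting step can be sketched as an index-splitting helper (a simplified, unshuffled version of what a library routine such as scikit-learn's KFold does):

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k folds of (train, validation) lists."""
    folds = []
    for i in range(k):
        # In fold i, a different contiguous slice is held out for validation.
        validation = list(range(i * n // k, (i + 1) * n // k))
        held_out = set(validation)
        train = [j for j in range(n) if j not in held_out]
        folds.append((train, validation))
    return folds

for train, validation in k_fold_indices(10, 5):
    print(len(train), validation)  # 8 training indices, 2 validation indices
```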
Five-fold cross validation
[Diagram: in each of the five folds, a different 20% of the data is held out for validation, and the remaining 80% is used for training]
MLE with additive smoothing

#(𝑐) number of documents with gold-standard class 𝑐
#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)   (no smoothing here!)

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 𝛼) / Σ_{𝑤′ ∈ 𝑉} (#(𝑤′, 𝑐) + 𝛼)
MLE with additive smoothing

#(𝑐) number of documents with gold-standard class 𝑐
#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 𝛼) / (Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐) + 𝛼 |𝑉|)
Estimating word probabilities
31 tokens 37 tokens
The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.
… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
positive negative
Vocabulary
1920’s J.R.R. Jackson’s Lord Middle-earth Peter Rings The Tolkien’s a adequately an and as at cannot co-writer/director column come continuation core describe economic elaborate emptiness expanded exploration gaiety global gorgeously hasten huge if is its little movie of political relentless so sour stop that the to trilogy turmoil underlay vision was words would
53 unique words
MLE with add-one smoothing
positive:
Word      Modified count
of        4 + 1
The       2 + 1
words     1 + 1
vision    1 + 1
trilogy   1 + 1
…         …

negative:
Word      Modified count
of        2 + 1
The       0 + 1
words     0 + 1
vision    0 + 1
trilogy   0 + 1
…         …
MLE with add-one smoothing
positive:
Probability         Estimated value
𝑃(of | pos)         (4 + 1) / (31 + 53)
𝑃(The | pos)        (2 + 1) / (31 + 53)
𝑃(words | pos)      (1 + 1) / (31 + 53)
𝑃(vision | pos)     (1 + 1) / (31 + 53)
𝑃(trilogy | pos)    (1 + 1) / (31 + 53)
…                   …

negative:
Probability         Estimated value
𝑃(of | neg)         (2 + 1) / (37 + 53)
𝑃(The | neg)        (0 + 1) / (37 + 53)
𝑃(words | neg)      (0 + 1) / (37 + 53)
𝑃(vision | neg)     (0 + 1) / (37 + 53)
𝑃(trilogy | neg)    (0 + 1) / (37 + 53)
…                   …
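The estimates above can be reproduced with a small helper; the token and vocabulary counts are taken from the slides:

```python
# Counts from the slides: 31 tokens in the positive review, 37 in the
# negative, and a shared vocabulary of 53 unique words.
N_pos, N_neg, V = 31, 37, 53
counts_pos = {"of": 4, "The": 2, "words": 1, "vision": 1, "trilogy": 1}
counts_neg = {"of": 2}

def p_add_one(word, counts, n_tokens):
    # Add-one (Laplace) smoothing: hallucinate one extra occurrence
    # of every vocabulary word, including unseen ones.
    return (counts.get(word, 0) + 1) / (n_tokens + V)

print(p_add_one("of", counts_pos, N_pos))   # (4 + 1) / (31 + 53)
print(p_add_one("The", counts_neg, N_neg))  # (0 + 1) / (37 + 53)
```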
Additive smoothing is assuming a uniform prior
Let 𝑁𝑐 be the total number of word occurrences in class 𝑐, and let 𝑉 be the number of word types in the vocabulary. Then the add-one estimate

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 1) / (𝑁𝑐 + 𝑉)

can be written as a linear interpolation of the maximum-likelihood estimate and the uniform distribution over word types:

𝑃(𝑤 | 𝑐) = 𝜆 · #(𝑤, 𝑐) / 𝑁𝑐 + (1 − 𝜆) · 1/𝑉, where 𝜆 = 𝑁𝑐 / (𝑁𝑐 + 𝑉)
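The interpolation identity is easy to check numerically; a quick sketch (function names are ours, the count/size triples are arbitrary):

```python
# Check that add-one smoothing equals the interpolation
# lambda * MLE + (1 - lambda) * uniform, with lambda = N_c / (N_c + V).
def add_one(count, n_c, v):
    return (count + 1) / (n_c + v)

def interpolation(count, n_c, v):
    lam = n_c / (n_c + v)
    return lam * (count / n_c) + (1 - lam) * (1 / v)

for count, n_c, v in [(4, 31, 53), (0, 37, 53), (7, 100, 1000)]:
    assert abs(add_one(count, n_c, v) - interpolation(count, n_c, v)) < 1e-12
```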
![Page 71: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/71.jpg)
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
![Page 72: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/72.jpg)
The Logistic Regression classifier
![Page 73: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/73.jpg)
The multi-class linear model
input vector (1-by-F)
weight matrix (F-by-K)
output vector (1-by-K)
bias vector (1-by-K)
F = number of features (input variables), K = number of classes (output variables)
𝒚 = 𝒙𝑾 + 𝒃
![Page 74: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/74.jpg)
The logistic model
• The logistic model extends the linear model by adding a pointwise logistic function 𝑓:
• The term logistic regression refers to the procedure of learning the parameters of the logistic model. That model is then used for classification. Confusing terminology!
𝑷 = 𝑓(𝒛), where 𝒛 = 𝒙𝑾 + 𝒃
![Page 75: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/75.jpg)
The standard logistic function for binary classification
𝑓(𝑧) = 1 / (1 + exp(−𝑧))
![Page 76: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/76.jpg)
The softmax function for k-ary classification
• The softmax function converts a vector of activations into a vector of probabilities (a finite probability distribution).
• The softmax function is the natural generalisation of the standard logistic function to 𝑘-ary classification problems.
softmax(𝒛)[𝑘] = exp(𝒛[𝑘]) / Σ𝑗 exp(𝒛[𝑗])
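A direct implementation of both functions (ours; subtracting the maximum activation before exponentiating is a common numerical-stability trick that does not change the result):

```python
import math

def softmax(z):
    m = max(z)                               # for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# The outputs form a finite probability distribution; for k = 2 classes,
# softmax reduces to the logistic function of the activation difference.
probs = softmax([2.0, 1.0, 0.1])
```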
![Page 77: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/77.jpg)
Training a logistic regression classifier
• We present the model with training samples of the form (𝒙, 𝑦) where 𝒙 is a feature vector and 𝑦 is the gold-standard class.
• The output of the model is a vector of conditional probabilities 𝑃(𝑘 | 𝒙) where 𝑘 ranges over the possible classes.
• We want to train the model so as to maximise the likelihood of the training data under this probability distribution. Same training principle as for Naive Bayes – but no closed-form solution.
![Page 78: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/78.jpg)
Gradient descent
• Step 0: Start with an arbitrary value for the parameters 𝜽.
• Step 1: Compute the gradient of the loss function, ∇𝐿(𝜽).
• Step 2: Update the parameters 𝜽 as follows: 𝜽 ≔ 𝜽 − 𝜂 ∇𝐿(𝜽) The hyperparameter 𝜂 is the learning rate.
• Repeat steps 1–2 until the error is sufficiently low.
‘Follow the gradient into the valley of error.’
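The three steps can be sketched for binary logistic regression on a toy data set (all names, numbers, and the toy data below are illustrative):

```python
import math

# Toy training set: feature vectors with gold-standard classes 0/1.
data = [([0.0, 1.0], 0), ([0.2, 2.0], 0), ([1.0, 0.0], 1), ([2.0, 0.5], 1)]
w, b = [0.0, 0.0], 0.0            # step 0: arbitrary initial parameters
eta = 0.5                         # learning rate

for _ in range(200):              # repeat steps 1-2
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        y_hat = 1.0 / (1.0 + math.exp(-z))
        g = y_hat - y             # gradient of the loss w.r.t. z
        w = [wi - eta * g * xi for wi, xi in zip(w, x)]
        b -= eta * g
```

This variant updates after every sample (stochastic gradient descent); batch gradient descent would average the gradients over the whole training set before each update.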
![Page 79: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/79.jpg)
Cross-entropy loss function

𝐿 = −[𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ)]

(plots: the loss as a function of the predicted probability ŷ, shown for 𝑦 = 1 and 𝑦 = 0)
![Page 80: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/80.jpg)
Cross-entropy loss for logistic units

∂𝐿/∂𝑧 = ŷ − 𝑦

(plots: the loss composed with the logistic function, as a function of 𝑧, shown for 𝑦 = 1 and 𝑦 = 0)
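The identity ∂𝐿/∂𝑧 = ŷ − 𝑦 can be verified with finite differences; a small sketch (function names are ours):

```python
import math

# Cross-entropy loss composed with the logistic function.
def loss(z, y):
    y_hat = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Central-difference check of dL/dz = y_hat - y at a few points.
for z in (-2.0, 0.5, 3.0):
    for y in (0, 1):
        y_hat = 1.0 / (1.0 + math.exp(-z))
        eps = 1e-6
        numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
        assert abs(numeric - (y_hat - y)) < 1e-6
```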
![Page 81: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/81.jpg)
Cross-entropy loss function for the softmax function
For a logistic model with parameters 𝜽 = (𝑾, 𝒃), the cross-entropy loss for a sample (𝒙, 𝑦) is defined as
where we write [𝑘 = 𝑦] for the function that returns 1 if 𝑘 = 𝑦, and 0 otherwise (Iverson bracket).
𝐿 = −Σ𝑘 [𝑘 = 𝑦] log softmax(𝒙𝑾 + 𝒃)[𝑘]
![Page 82: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/82.jpg)
Regularisation
• The term regularisation refers to all changes to a model that are intended to decrease generalisation error but not training error.
• A standard choice for logistic regression is L2-regularisation, where we penalise parameter vectors based on their lengths:

𝜽 ≔ 𝜽 − 𝜂 ∇[𝐿(𝜽) + 𝜆‖𝜽‖²], where 𝜆 is the regularisation strength
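In a gradient-descent update, the penalty 𝜆‖𝜽‖² contributes 2𝜆𝜽 to the gradient, which shrinks the weights at every step ("weight decay"). A sketch with made-up numbers:

```python
# One gradient step on the regularised objective L(theta) + lambda*||theta||^2;
# the gradient of the penalty term is 2 * lambda * theta.
def regularised_step(theta, grad, eta, lam):
    return [t - eta * (g + 2 * lam * t) for t, g in zip(theta, grad)]

theta = [1.0, -2.0]
theta = regularised_step(theta, grad=[0.0, 0.0], eta=0.1, lam=0.5)
# With a zero loss gradient, the penalty alone scales theta by (1 - 2*eta*lam).
```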
![Page 83: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/83.jpg)
Logistic Regression and Naive Bayes
• Learning a Naive Bayes classifier involves fitting a probability model that optimises the joint likelihood 𝑃(class, document). 𝑃(class | document) is obtained via Bayes’ rule.

• Logistic regression fits the same probability model so as to directly optimise the conditional likelihood 𝑃(class | document), which gives a lower asymptotic error.
• Naive Bayes and logistic regression form a generative–discriminative pair.
![Page 84: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/84.jpg)
Support Vector Machines
![Page 85: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/85.jpg)
Linear model
input vector (1-by-F)
weight vector (F-by-1)
predicted output
F = number of features (independent variables, covariates)
ŷ = 𝒙𝒘 + 𝑏, where 𝑏 is the bias
![Page 86: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/86.jpg)
Linearly separable problems
linearly separable not linearly separable
![Page 87: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/87.jpg)
Linearly separable problems
• A classification problem is called linearly separable if there exists a linear model that can learn the problem with zero error.
• Any line in feature space that separates the classes from each other is called a decision boundary.
For simplicity, we restrict ourselves to binary classification problems here.
![Page 88: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/88.jpg)
Different decision boundaries
margin
![Page 89: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/89.jpg)
Different decision boundaries
We would like to choose a decision boundary that maximises the margin to the closest training samples.
![Page 90: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/90.jpg)
Computing the margin
• Mathematically, the decision boundary is a plane that is defined by the parameters (weights and bias) of the linear classifier. More specifically, this plane is the set of all points 𝒙 such that 𝒙𝒘 + 𝑏 = 0.
• The margin between a data point 𝒙𝑖 and the decision boundary is the distance between 𝒙𝑖 and the plane along the direction of 𝒘.
• Subject to some convenient normalisations, for the closest data points this yields a margin of 1/∥𝒘∥.
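The distance computation can be sketched directly (function name and numbers are ours; the chosen point satisfies 𝒙𝒘 + 𝑏 = 1, so its distance comes out as 1/∥𝒘∥):

```python
import math

# Distance from a point x to the decision boundary {x : xw + b = 0},
# measured along the direction of w: |x.w + b| / ||w||.
def distance(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(z) / norm

w, b = [3.0, 4.0], -1.0        # ||w|| = 5
x_support = [0.4, 0.2]         # x.w + b = 2.0 - 1.0 = 1.0
d = distance(x_support, w, b)  # equals 1/||w|| = 0.2
```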
![Page 91: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/91.jpg)
Support vectors
support vector
support vector
support vector
![Page 92: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/92.jpg)
Maximising the margin
• If the training data is linearly separable, then the optimal margin can be found using quadratic programming (more recent approaches: sub-gradient descent, coordinate descent).
• To generalise this to training sets which are not linearly separable, one can use gradient-based optimisation.
![Page 94: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/94.jpg)
Practical concerns
• More than two classes can be handled according to a one-versus-one or one-versus-all scheme.
• For general SVMs, training time scales quadratically with the number of samples, which quickly becomes impractical.
• An alternative is to use a special-purpose implementation of a linear SVM, which can have significantly shorter training times.
![Page 95: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/95.jpg)
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
![Page 96: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/96.jpg)
Project idea
• There are many text classification methods besides Naive Bayes, Logistic Regression, and Support Vector Machines: decision trees, many different types of neural networks, …
• Pick a method that you find interesting and compare its performance to that of other methods on a standard data set.
• In the report, make it clear why you chose this method, and that you have understood how it works.
![Page 97: Text classi cation - LiU](https://reader031.fdocuments.in/reader031/viewer/2022022504/62170934e99a2c69d5194594/html5/thumbnails/97.jpg)
Beyond the bag-of-words assumption
• The linear sequence of the words in a text carries meaning. Compare: ‘The brown dog on the mat saw the striped cat through the window.’ and ‘The brown cat saw the striped dog through the window on the mat.’

• We can try to capture at least parts of this sequence by representing texts as 𝑛-grams, contiguous sequences of 𝑛 words, e.g. the unigram ‘not’ or the bigram ‘not bad’.
• Recurrent neural networks have been designed to explicitly capture long-range dependencies between words.
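Extracting contiguous 𝑛-grams from a tokenised text takes only a line of Python (a sketch; the tokeniser here is a plain whitespace split):

```python
# All contiguous n-grams of a token sequence, as tuples.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the movie was not bad".split()
unigrams = ngrams(tokens, 1)   # e.g. ('not',)
bigrams = ngrams(tokens, 2)    # e.g. ('not', 'bad')
```

In practice, vectorisers such as scikit-learn’s CountVectorizer expose the same idea through an `ngram_range` parameter.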