Language Identification System For Code-Mixed Social Media Text Analysis
Uploaded by: parth-desai
Category: Social Media
Transcript of Language Identification System For Code-Mixed Social Media Text Analysis
![Page 1: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/1.jpg)
Hello!
![Page 2: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/2.jpg)
Language Identification System for Code-Mixed Social Media Text Analysis
![Page 3: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/3.jpg)
Problems
Multilingual speakers often switch between languages. Mixing multiple languages together (code-mixing) is a popular trend among social media users. This complicates automatic language identification, because the task shifts from the document level to the word level.
![Page 4: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/4.jpg)
Motivation
Enhance social media analysis in language-dense areas.
![Page 5: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/5.jpg)
Challenges
- Creation of new lexical and syntactic structures (e.g. code-mixing at the morpheme level).
- Interflow of dissimilar grammars when languages are combined.
- Classification of words and phrases that have been assimilated into one language from another (e.g. "Bottle").
- Classification of words that exist in more than one language (e.g. "Hum").
- Defining canonical forms for word normalization.
Hi
![Page 6: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/6.jpg)
Conditional Random Fields (CRF)
- A class of statistical modeling methods often applied in pattern recognition and machine learning, and used for structured prediction.
- A feature function takes as input the sentence, the position of a word, and the labels as defined, and outputs a real-valued number (usually either 0 or 1).
- Features are converted to probabilities by assigning them weights, summing the weighted features, exponentiating the sum, and normalizing.
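The weighting, exponentiating, and normalizing steps can be sketched on a toy example. This is a minimal brute-force illustration, not the CRF++ implementation: the two-label set, the tiny word lists, the feature functions, and the weights are all hypothetical, and real toolkits compute the normalizer with dynamic programming rather than by enumerating every label sequence.

```python
import itertools
import math

LABELS = ["En", "Hi"]  # reduced label set, for illustration only

# Hypothetical word lists (assumptions, not taken from the slides' data).
HI_WORDS = {"accha", "karo", "hai", "nhi", "me"}
EN_WORDS = {"topic", "change", "cash"}

# Each feature function takes (sentence, position, previous label, current label)
# and returns 0.0 or 1.0.
def f_hi_word(sent, i, prev, cur):
    return 1.0 if sent[i].lower() in HI_WORDS and cur == "Hi" else 0.0

def f_en_word(sent, i, prev, cur):
    return 1.0 if sent[i].lower() in EN_WORDS and cur == "En" else 0.0

def f_same_as_prev(sent, i, prev, cur):
    return 1.0 if prev == cur else 0.0

FEATURES = [f_hi_word, f_en_word, f_same_as_prev]
WEIGHTS = [2.0, 2.0, 0.5]  # illustrative values; normally learned from data

def score(sent, labels):
    """Sum of weighted feature values over all positions."""
    total = 0.0
    for i in range(len(sent)):
        prev = labels[i - 1] if i > 0 else None
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(sent, i, prev, labels[i])
    return total

def sequence_probabilities(sent):
    """Exponentiate each sequence score and normalize over all sequences."""
    sequences = list(itertools.product(LABELS, repeat=len(sent)))
    exp_scores = [math.exp(score(sent, seq)) for seq in sequences]
    z = sum(exp_scores)  # normalizer (partition function)
    return {seq: e / z for seq, e in zip(sequences, exp_scores)}

probs = sequence_probabilities(["Accha", "Topic", "Change", "karo"])
best = max(probs, key=probs.get)
```

With these toy weights, `best` is the label sequence that receives the highest normalized probability for the sentence.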
![Page 7: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/7.jpg)
Dataset format
@buttmona098 @HamzaIdrees Accha Topic Change karo
Univ Univ Hi En En Hi

@imp13196 ATM me Cash nhi hai
Univ Acro Hi En Hi Hi
![Page 8: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/8.jpg)
CRF++ tool
CRF++ is a simple, customizable, open-source implementation of CRF for segmenting and labeling sequential data. It is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction, and Text Chunking.
% crf_learn template_file train_file model_file
% crf_test -m model_file test_files
https://taku910.github.io/crfpp/#format
![Page 9: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/9.jpg)
Example of template file
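The template used in the slides is only visible in the slide image, so the following is a hedged sketch of how a CRF++ template could be generated rather than a copy of it. It emits unigram macros of the form `%x[row,col]` over a small token window plus a bare `B` line for label bigrams; the helper name `make_template`, the window, and the column count are assumptions.

```python
def make_template(window=(-1, 0, 1), n_columns=1):
    """Build CRF++ template text: one unigram macro per (offset, column) pair."""
    lines = ["# Unigram"]
    index = 0
    for col in range(n_columns):
        for offset in window:
            # %x[row,col]: row is relative to the current token, col is the input column
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    # A single "B" line asks CRF++ to also use previous-label/current-label bigrams
    lines += ["", "# Bigram", "B"]
    return "\n".join(lines) + "\n"

template_text = make_template()
```

Writing `template_text` to a file gives something that could be passed as `template_file` to `crf_learn`.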
![Page 10: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/10.jpg)
Example of input file
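The input file itself is only shown as a slide image. As a rough sketch, CRF++ training data puts one token per line with whitespace-separated columns, the label in the last column, and an empty line between sentences. The helper below (the name `to_crfpp` is hypothetical) converts a tagged sentence from the dataset-format slide into that layout, using only the token and its label as columns.

```python
def to_crfpp(sentences):
    """Convert (tokens, tags) pairs into CRF++ column format.

    One "token<TAB>tag" line per token; an empty line ends each sentence.
    """
    chunks = []
    for tokens, tags in sentences:
        if len(tokens) != len(tags):
            raise ValueError("each token needs exactly one tag")
        rows = [f"{token}\t{tag}" for token, tag in zip(tokens, tags)]
        chunks.append("\n".join(rows) + "\n\n")
    return "".join(chunks)

tokens = "@imp13196 ATM me Cash nhi hai".split()
tags = "Univ Acro Hi En Hi Hi".split()
crfpp_text = to_crfpp([(tokens, tags)])
```

Additional feature columns, if any, would go between the token and the label.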
![Page 11: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/11.jpg)
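Initial approach
Precision-Recall Table:

| Label | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| En | 0.77 | 0.81 | 0.79 | 2015 |
| Ne | 0.39 | 0.46 | 0.42 | 304 |
| Hi | 0.87 | 0.83 | 0.85 | 2817 |
| Univ | 0.98 | 0.98 | 0.98 | 1011 |
| Mixed | 0.00 | 0.00 | 0.00 | 3 |
| Acro | 0.81 | 0.55 | 0.66 | 78 |
| Avg/Total | 0.83 | 0.83 | 0.83 | 6228 |

Confusion Matrix: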
![Page 12: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/12.jpg)
Features
- Affixes (prefix, suffix)
- Length of word
- First character in uppercase (binary feature)
- All characters in uppercase (binary feature)
- Contains any symbol (binary feature)
- Stop-words for English
- Previous word / next word
- Etc.

Total features: 10
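The listed features could be extracted per token roughly as follows. This is a sketch under assumptions: the affix length of 3, the tiny stop-word list, the boundary markers, and the function name `token_features` are illustrative choices, and the slides do not spell out all 10 features.

```python
import string

# Small illustrative stop-word list; an assumption, not the list used in the slides.
ENGLISH_STOP_WORDS = {"the", "is", "a", "an", "of", "to", "in", "and"}

def token_features(tokens, i, affix_len=3):
    """Return a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "length": len(word),
        "first_upper": word[:1].isupper(),
        "all_upper": word.isupper(),
        "has_symbol": any(ch in string.punctuation for ch in word),
        "is_en_stop_word": word.lower() in ENGLISH_STOP_WORDS,
        # Sentence boundaries get placeholder markers (hypothetical names)
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }

sentence = "@imp13196 ATM me Cash nhi hai".split()
features_atm = token_features(sentence, 1)
```

Each value would become one extra column in the CRF++ input file, placed before the label column.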
![Page 13: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/13.jpg)
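Final approach
Precision-Recall Table:

| Label | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| En | 0.94 | 0.96 | 0.95 | 3381 |
| Ne | 0.80 | 0.59 | 0.68 | 238 |
| Hi | 0.89 | 0.89 | 0.89 | 1476 |
| Univ | 0.99 | 1.00 | 0.99 | 974 |
| Mixed | 0.00 | 0.00 | 0.00 | 0 |
| Acro | 0.84 | 0.64 | 0.73 | 58 |
| Avg/Total | 0.93 | 0.93 | 0.93 | 6127 |

Confusion Matrix (rows are true labels and columns are predicted labels; the row sums match the Support column above):

| | En | Ne | Hi | Univ | Mixed | Acro |
| --- | --- | --- | --- | --- | --- | --- |
| En | 3232 | 17 | 127 | 4 | 0 | 1 |
| Ne | 68 | 141 | 28 | 0 | 0 | 1 |
| Hi | 138 | 18 | 1315 | 0 | 0 | 5 |
| Univ | 4 | 0 | 0 | 970 | 0 | 0 |
| Mixed | 0 | 0 | 0 | 0 | 0 | 0 |
| Acro | 12 | 0 | 4 | 5 | 0 | 37 |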
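The precision, recall, and F1 values in the table can be recomputed from the confusion matrix. The sketch below assumes that rows are true labels and columns are predicted labels, in the order En, Ne, Hi, Univ, Mixed, Acro; that reading is an assumption, though it is consistent with the Support column. The function names are hypothetical.

```python
LABELS = ["En", "Ne", "Hi", "Univ", "Mixed", "Acro"]

# Final-approach confusion matrix; rows = true label, columns = predicted label (assumed).
CONFUSION = [
    [3232, 17, 127, 4, 0, 1],
    [68, 141, 28, 0, 0, 1],
    [138, 18, 1315, 0, 0, 5],
    [4, 0, 0, 970, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [12, 0, 4, 5, 0, 37],
]

def per_label_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)}."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        support = sum(matrix[k])                  # tokens whose true label is k
        predicted = sum(row[k] for row in matrix)  # tokens predicted as k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics

def accuracy(matrix):
    """Fraction of tokens on the diagonal (correctly labeled)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

metrics = per_label_metrics(CONFUSION, LABELS)
overall_accuracy = accuracy(CONFUSION)
```

Rounded to two decimals, these values agree with the table, and the overall accuracy rounds to 0.93.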
![Page 14: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/14.jpg)
Features (final)
- Affixes (prefix, suffix)
- Length of word
- First character in uppercase (binary feature)
- All characters in uppercase (binary feature)
- Contains any symbol (binary feature)
- Stop-words for English
- Previous word / next word
- List of trained Hindi words (binary feature)
- Numerical data feature
- Etc.

Total features: 17
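The two features added in the final approach could look roughly like this. The word list is a hypothetical stand-in for the list of Hindi words gathered from training data, and the exact definition of the numerical feature is not given in the slides, so both a "contains a digit" and an "is entirely digits" variant are shown as assumptions.

```python
# Hypothetical stand-in for the Hindi word list collected from training data.
HINDI_WORDS = {"accha", "karo", "me", "nhi", "hai"}

def extra_features(word):
    """Lexicon-membership and numeric features for a single token."""
    return {
        "in_hindi_list": word.lower() in HINDI_WORDS,
        "has_digit": any(ch.isdigit() for ch in word),
        "is_number": word.isdigit(),
    }

example = extra_features("nhi")
```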
![Page 15: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/15.jpg)
Hidden Markov Model (HMM)
A probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences), it computes a probability distribution over possible sequences of labels and chooses the best label sequence.
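Choosing the best label sequence is typically done with the Viterbi algorithm. The sketch below is a first-order (bigram) HMM decoder with made-up probabilities over two labels; it is not the TnT implementation, which uses trigrams and its own smoothing. The `unk` floor for unseen words is an assumption made here for simplicity.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s].get(observations[0], unk))
        for s in states
    }]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prev_score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prev_score + math.log(emit_p[s].get(observations[t], unk))
            back[t][s] = prev
    # Follow back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy parameters (illustrative values, not estimated from the slides' data)
STATES = ["En", "Hi"]
START_P = {"En": 0.5, "Hi": 0.5}
TRANS_P = {"En": {"En": 0.7, "Hi": 0.3}, "Hi": {"En": 0.3, "Hi": 0.7}}
EMIT_P = {
    "En": {"topic": 0.4, "change": 0.4, "cash": 0.2},
    "Hi": {"accha": 0.3, "karo": 0.3, "hai": 0.2, "nhi": 0.2},
}

tags = viterbi(["accha", "topic", "change", "karo"], STATES, START_P, TRANS_P, EMIT_P)
```

Working in log space avoids numeric underflow on longer sentences.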
![Page 16: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/16.jpg)
TnT tool
TnT (Trigrams'n'Tags) is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The component for parameter generation trains on tagged corpora. The system incorporates several methods of smoothing and of handling unknown words.
http://www.coli.uni-saarland.de/~thorsten/tnt/
% ./tnt-para train_file
% ./tnt train_file test_file
![Page 17: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/17.jpg)
Example of TnT
![Page 18: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/18.jpg)
Example of TnT output file
![Page 19: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/19.jpg)
CRF vs. HMM
![Page 20: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/20.jpg)
CRF: accuracy 93%
HMM: accuracy 67%
![Page 21: Language Identification System For Code-Mixed Social Media Text Analysis](https://reader031.fdocuments.in/reader031/viewer/2022022415/5a650d2c7f8b9ab5218b4a35/html5/thumbnails/21.jpg)
Thank You!
Parth Desai
Shreshta Bhat
Harish B. Manish Shrivastava
Created By :
Under the supervision of :