Profiling Personality of Social Media...
Transcript of Profiling Personality of Social Media...
ProfilingPersonalityofSocialMediaAuthors
WalterDaelemansCLiPS ComputationalLinguisticsGroup
[email protected],PilsenOctober11,2016
www.analyzewords.com (Pennebaker)Seealso:https://personality-insights-livedemo.mybluemix.net/ (Watson)
Figure6.Words,phrases,andtopicsmostdistinguishingextraversionfromintroversionandneuroticismfromemotionalstability.
SchwartzHA,EichstaedtJC,KernML,DziurzynskiL,RamonesSM,etal.(2013)Personality,Gender,andAgeintheLanguageof SocialMedia:TheOpen-VocabularyApproach.PLoSONE8(9):e73791.doi:10.1371/journal.pone.0073791http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0073791
Whatwedon'twanttodo
• Horoscopes– E.g.theprofilingofPennebaker /IBMpersonality
• Weneedgoldstandarddata
• Content-basedassignment– E.g.emotionallystableAmericanpeopletalkmoreabout“sports”,“vacation”,“beach”,“church”,or“team”
• It’snotstyleandcanbefakedeasily
Contents
• Profiling andComputationalStylometry– MethodsandIssues– SocialMediaLanguage
• PersonalityProfilinginSocialMedia– Applications– StateoftheArt
• PAN• CGIandTwistycorpora
LanguageVariationIndividual Variation
Bayes rule (1763)
WhatisProfiling?
Spelling,punctuation,lexical choice,sentence structure,themes,tone,discoursestructure,…
Sociolinguistics
Profiling
P 𝑖 𝑙 = % 𝑙 𝑖 %(')%())
SocialMedialanguage
• Everybody “writes”allthetimenow,notonlyprofessionalwriters
• Hugeincreaseinsubjectivelanguage(opinions)asopposedtoobjective,factuallanguage
• Opensharingofprivateinformation(metadata)– Magnificentsourceofdataforcomputationalstylometry!
ComputationalSociolinguistics!
Non-standardlanguageuseinchat
ComputationalStylometry
• Writingstyle:Acombination ofinvariantandunconsciousdecisionsinlanguageproductionatalllinguisticlevels(discourse,syntacticstructures,lexicalchoice,…)associated withspecificauthorsorauthortraits:– Age,gender,educationlevel,nativelanguage,personality,emotionalstate,mentalhealth,region,deception,ideology,politicalconviction,religiousbeliefs,sexualpreferences,…
Basicresearchquestions• Doesanidiolect/stylome existand(how)canit
bemeasured?• Canwehandlegenre,register&topic
interference?• Canwehandlewithin-profileinterference?• Isdetectionrobust?(adversarialstylometry)• Canwehandledynamicaspects?(language
changeovertimewithage,illness,languageinput,context,…)
• Canweexplainwhy/howtrainedmodelswork?
Method:TextCategorization
Document
SyntacticFunctionwordpatternsPartofspeechn-gramsSyntacticstructures
Lexicallemman-gramsreadabilityfeatureslexicalrichness
Semanticwordsensepatternssemanticdictionariesdistributedvectorswordembeddingssemanticrolepatterns
Discourserhetoricalstructuresdiscoursemarkerpatterns
FeatureconstructionDocumentRepresentation
Superficialtokenn-gramscharactern-gramspunctuationpatterns
Inputfromlinguistictheory
Method:TextCategorization
DocumentRepresentation
SyntacticFunctionwordpatternsPartofspeechn-gramsSyntacticstructures
Lexicallemman-gramsreadabilityfeatureslexicalrichness
Semanticwordsensepatternssemanticdictionariesdistributedvectorswordembeddingssemanticrolepatterns
Discourserhetoricalstructuresdiscoursemarkerpatterns
FeatureSelection
FeatureVector
Superficialtokenn-gramscharactern-gramspunctuationpatterns
Feedbacktolinguistictheory
Method:TextCategorization
• Objectiveevaluationonthebasisof“goldstandarddata”andevaluationmetrics
FeatureVectors classes
ModelNewDocuments Predictedclasses
Feedbacktolinguistictheory
SupervisedMachineLearning
ExplanationinStylometry
• Quantitativeevaluationsandlistsofimportantfeaturesdon’ttellusalotabouthowitworks
• Whycanage,gender,personality,authorshipetc.bedeterminedwithaparticularsetoflinguisticfeatures?
• Landmark:genderinfictionandnon-fiction– Koppel&Argamon etal.2002
ExplanationinStylometry
Relational Informative
Verbs Nouns
Subjects Modifiers
DeterminersPronouns Prepositions
Gender
Female Male
Personality• Setofmentaltraits(types)that
– Explainandpredictpatternsofthought,feelingandbehavior
– Remainrelativelystableovertimeandcontext• Sourceofvariationbetweenpeoplethat
– Influencessuccessinrelations,jobs,studies,…– Explains35%ofvarianceinlifesatisfaction
• Compare:income(4%),employment(4%),maritalstatus(1%to4%)
• Detectioninteractswiththatofdeception,opinion,emotion,…
ApplicationsofPersonalityDetection
• Targetedadvertising• Adaptiveinterfacesandrobotbehavior
– music,style,tone,color,…• Psychologicaldiagnosis,forensics
– psychopaths,massmurderers,bullies,victims– depression,suicidaltendencies
• Humanresourcesmanagement• ResearchinLiteraryscience,socialpsychology,sociolinguistics,…
Example:HumanResources• Situation:thousandsofapplicantsfor1job• Method
– Collectwrittenanswerstoquestionsandtranscriptsofaudiomessages
– Deeplearningofwordembeddings onbackgroundtext
– Deeplearningof“applicantembeddings”oncollectedtext
– Similarlyfor“anchors”(modelanswers,goodcandidates)
– Nearestneighborranking
BackgroundCorpus
Applicantanswersandtranscripts
ApplicantEmbeddingMap
Datacollection• Source
– Socialmedia• Twitter,Facebookstatuses
– Essays• Method
– Self-reporting• questionnaire
– Expertannotation(consensus)– Distantsupervision
• Patternsextractedbyexpertsfromquestionnairequestions(Neuman &Cohen)orfromcorrelationstudies(Celli)
• Classsystem(ClassificationofRegression)– MBTI(Myers-BriggsTypeIndicator)– FFM(FiveFactorModel,BigFive)
BigFive(FFM)• Extraversion(vs.Introversion)
– Sociable,active,energetic,positive• Neuroticism
– Sad,anxious,tense• Agreeableness
– Altruism,trust,modesty,sympathy• Conscientiousness
– Goal-oriented,organized,incontrol• Openness
– Originality,breadthanddepthofmentallifeandexperience
Meyers-Briggspreferences
• Introversion&Extraversion• iNtuition &Sensing• Feeling&Thinking• Judging&Perceiving
• Leadsto16types:ENTJ(1.8%)…ESFJ(12.3%)• Validityandreliabilityhavebeenquestioned
6 ESFP4 ISFP4 INFP4 ESTJ3 INTP (architect)1 ESTP (promoter)1 ENTP (inventor)0 ISTP (crafter)
Atypicalsampleofourstudents
28 ESFJ (provider)23 ENFJ (teacher)16 ISFJ (protector)15 INTJ (mastermind)15 INFJ9 ENFP8 ISTJ8 ENTJ
FeatureCorrelations• Extraverts (as opposed to introverts)
– Produce more language (verbosity)– Smaller vocabulary (TTR), fewer hapaxes– Fewer negative emotion words – More positive emotion words– Fewer hedges (confidence)– More agreement and compliments– Fewer negation and causation– Less concrete– Less formal, more contextualised / relational
• More pronouns, verbs, adverbs, interjections• More present tense• Fewer numbers, less quantification• Fewer negation and causation words
FeatureCorrelations• Neurotics
– More“I”– Morenegativeemotionwords,fewerpositiveemotionwords– Moreconcreteandfrequentwords
• Agreeable– Morepositiveemotionwords,fewernegativeemotionwords– Fewerdeterminers
• Conscientious– Fewernegations– Fewernegativeemotionwords– Fewerhedges(?)
• Openness– Longerwords– Morehedges– Fewer“I”andpresenttense
PAN2015
• Personalityaspartofprofiling(alsoageandgender)
• Twitter• 5numericFFMvalues
– Regression(oneclassifierforeachfactor)– Evaluation:RMSE
• English(194users),Spanish(140),Italian(50),Dutch(44)
• 22teamsOverviewofthe3rdAuthorProfilingTaskatPAN2015FRangel,PRosso,MPotthast,BStein,WDaelemans - CLEF,2015http://pan.webis.de/clef15/pan15-web/author-profiling.html
DocumentRepresentation• Preprocessing
– RemoveHTMLcodes,hashtags,URLs,mentions,…• Features
– charactern-grams– wordn-grams– tf-idf n-grams– syntacticn-grams(lemma,pos,relations,…)– stylisticfeatures
• punctuation,case• emoticons• word,sentencelength,verbosity• characterflooding
– topicmodeling(LSA)– familyvocabulary,LIWC,MRC– frequentterms,discriminativewords,Nes,othervocabularies...– secondorderfeatures(relationshipsamongterms,documents,profiles)
DocumentRepresentation• Preprocessing
– RemoveHTMLcodes,hashtags,URLs,mentions,…• Features
– charactern-grams– wordn-grams– tf-idf n-grams– syntacticn-grams(lemma,pos,relations,…)– stylisticfeatures
• punctuation,case• emoticons• word,sentencelength,verbosity• characterflooding
– topicmodeling(LSA)– familyvocabulary,LIWC,MRC– frequentterms,discriminativewords,Nes,othervocabularies...– secondorderfeatures(relationshipsamongterms,documents,profiles)
Álvarez-Carmonaetal.(2015)
MachineLearningMethod
• SVMs• DecisionTrees(RandomForests)• Bagging• LinearDiscriminantAnalysis• StochasticGradientDescent• Linear,Logistic,RidgeforRegression• Distance-basedapproaches
BestResults
O C E A NEnglish .12 .11 .14 .13 .20Spanish .11 .10 .13 .10 .16Italian .10 .11 .07 .05 .16Dutch .04 .06 .08 .00 .06
Problem
• Samefeaturesareinformativefordifferentauthoraspects(e.g.personality &gender)– Women~ extraverts
• positiveandnegativeemotionwords,pronounuse,…– Jointlearning,ensemblemethods,…
• Maincauseofoverfitting• Largereferencecorpusneededofauthorswithvarioustraitsofinterest– Samplestratification
• Example:CLiPS Stylometry Investigation(CSI)corpus
CLiPSCSICorpus(withBenVerhoeven)
• http://www.clips.uantwerpen.be/datasets/csi-corpus• Corpuswithtwogenres:essaysandreviews(truthfulanddeceptive)– Dutchnativespeakers,studentsoflanguageandliterature
• Meta-data– Age,gender,sexualorientation,regionoforigin,personalityprofile(BigFive&MBTI)
• Yearlyexpansion– Currently:1800documents;660authors;770,000words
• SuccessorofPersonaeCorpus(2008,withKimLuyckx)
MiningpersonalitydatafromTwitter(withVerhoeven andPlank)
• Peopletweetabouttheirownpersonality
TwiSty Corpusapproach• SearchuserswithatweetmentioningtheirMBTIpersonalitytype(keyword-based)+gender– manualchecking
• Fetchrecenttweetsfromthoseusers(~2000average)– applylanguageidentificationforselectingcleanlanguagesample
• ES,PT,FR,NL,IT,DE• Sensationalismbias?• TwiSty isavailablefromhttp://www.clips.uantwerpen.be/datasets
Somecorpusproperties
0 20 40 60 80 100
ESPT
FRNL
ITDE
IversusE
0 20 40 60 80 100
ESPT
FRNL
ITDE
T versusF
PredictionExperiment
• LinearSVC,wordandcharactern-grams
Cognitive-biologicalApproach(Yair Neuman)
• “Personality”categorizationoriginatesfrom– Threat/Trustmanagement– Riskassessmentprocess
• Fight,flight,avoid
– (+complexityofabstractionandinferenceuniquetohumans)
• inthecontextofInterpersonalrelations
Conclusions
• Computationalpersonality– Promisingfortrend-basedapplicationsbutnotgoodenoughyetforindividualprofiling
– Textcategorizationmodelisnotsufficient• Hardtoobtainsufficientreliableannotateddata• Unsupervisedmodels?Clevermining?• threat- trustmodel?
– Computationaltheoryofwritingstyleneedsalarge,preferablymultilingual,butinanycasebalancedcorpus
Stylometry researchers@CLiPS:
Questions?