Profiling Personality of Social Media...

Post on 21-Apr-2020

4 views 0 download

Transcript of Profiling Personality of Social Media...

ProfilingPersonalityofSocialMediaAuthors

WalterDaelemansCLiPS ComputationalLinguisticsGroup

walter.daelemans@uantwerpen.beSLSP2016,PilsenOctober11,2016

www.analyzewords.com (Pennebaker)Seealso:https://personality-insights-livedemo.mybluemix.net/ (Watson)

Figure6.Words,phrases,andtopicsmostdistinguishingextraversionfromintroversionandneuroticismfromemotionalstability.

SchwartzHA,EichstaedtJC,KernML,DziurzynskiL,RamonesSM,etal.(2013)Personality,Gender,andAgeintheLanguageof SocialMedia:TheOpen-VocabularyApproach.PLoSONE8(9):e73791.doi:10.1371/journal.pone.0073791http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0073791

Whatwedon'twanttodo

• Horoscopes– E.g.theprofilingofPennebaker /IBMpersonality

• Weneedgoldstandarddata

• Content-basedassignment– E.g.emotionallystableAmericanpeopletalkmoreabout“sports”,“vacation”,“beach”,“church”,or“team”

• It’snotstyleandcanbefakedeasily

Contents

• Profiling andComputationalStylometry– MethodsandIssues– SocialMediaLanguage

• PersonalityProfilinginSocialMedia– Applications– StateoftheArt

• PAN• CGIandTwistycorpora

LanguageVariationIndividual Variation

Bayes rule (1763)

WhatisProfiling?

Spelling,punctuation,lexical choice,sentence structure,themes,tone,discoursestructure,…

Sociolinguistics

Profiling

P 𝑖 𝑙 = % 𝑙 𝑖 %(')%())

SocialMedialanguage

• Everybody “writes”allthetimenow,notonlyprofessionalwriters

• Hugeincreaseinsubjectivelanguage(opinions)asopposedtoobjective,factuallanguage

• Opensharingofprivateinformation(metadata)– Magnificentsourceofdataforcomputationalstylometry!

ComputationalSociolinguistics!

Non-standardlanguageuseinchat

ComputationalStylometry

• Writingstyle:Acombination ofinvariantandunconsciousdecisionsinlanguageproductionatalllinguisticlevels(discourse,syntacticstructures,lexicalchoice,…)associated withspecificauthorsorauthortraits:– Age,gender,educationlevel,nativelanguage,personality,emotionalstate,mentalhealth,region,deception,ideology,politicalconviction,religiousbeliefs,sexualpreferences,…

Basicresearchquestions• Doesanidiolect/stylome existand(how)canit

bemeasured?• Canwehandlegenre,register&topic

interference?• Canwehandlewithin-profileinterference?• Isdetectionrobust?(adversarialstylometry)• Canwehandledynamicaspects?(language

changeovertimewithage,illness,languageinput,context,…)

• Canweexplainwhy/howtrainedmodelswork?

Method:TextCategorization

Document

SyntacticFunctionwordpatternsPartofspeechn-gramsSyntacticstructures

Lexicallemman-gramsreadabilityfeatureslexicalrichness

Semanticwordsensepatternssemanticdictionariesdistributedvectorswordembeddingssemanticrolepatterns

Discourserhetoricalstructuresdiscoursemarkerpatterns

FeatureconstructionDocumentRepresentation

Superficialtokenn-gramscharactern-gramspunctuationpatterns

Inputfromlinguistictheory

Method:TextCategorization

DocumentRepresentation

SyntacticFunctionwordpatternsPartofspeechn-gramsSyntacticstructures

Lexicallemman-gramsreadabilityfeatureslexicalrichness

Semanticwordsensepatternssemanticdictionariesdistributedvectorswordembeddingssemanticrolepatterns

Discourserhetoricalstructuresdiscoursemarkerpatterns

FeatureSelection

FeatureVector

Superficialtokenn-gramscharactern-gramspunctuationpatterns

Feedbacktolinguistictheory

Method:TextCategorization

• Objectiveevaluationonthebasisof“goldstandarddata”andevaluationmetrics

FeatureVectors classes

ModelNewDocuments Predictedclasses

Feedbacktolinguistictheory

SupervisedMachineLearning

ExplanationinStylometry

• Quantitativeevaluationsandlistsofimportantfeaturesdon’ttellusalotabouthowitworks

• Whycanage,gender,personality,authorshipetc.bedeterminedwithaparticularsetoflinguisticfeatures?

• Landmark:genderinfictionandnon-fiction– Koppel&Argamon etal.2002

ExplanationinStylometry

Relational Informative

Verbs Nouns

Subjects Modifiers

DeterminersPronouns Prepositions

Gender

Female Male

Personality• Setofmentaltraits(types)that

– Explainandpredictpatternsofthought,feelingandbehavior

– Remainrelativelystableovertimeandcontext• Sourceofvariationbetweenpeoplethat

– Influencessuccessinrelations,jobs,studies,…– Explains35%ofvarianceinlifesatisfaction

• Compare:income(4%),employment(4%),maritalstatus(1%to4%)

• Detectioninteractswiththatofdeception,opinion,emotion,…

ApplicationsofPersonalityDetection

• Targetedadvertising• Adaptiveinterfacesandrobotbehavior

– music,style,tone,color,…• Psychologicaldiagnosis,forensics

– psychopaths,massmurderers,bullies,victims– depression,suicidaltendencies

• Humanresourcesmanagement• ResearchinLiteraryscience,socialpsychology,sociolinguistics,…

Example:HumanResources• Situation:thousandsofapplicantsfor1job• Method

– Collectwrittenanswerstoquestionsandtranscriptsofaudiomessages

– Deeplearningofwordembeddings onbackgroundtext

– Deeplearningof“applicantembeddings”oncollectedtext

– Similarlyfor“anchors”(modelanswers,goodcandidates)

– Nearestneighborranking

BackgroundCorpus

Applicantanswersandtranscripts

ApplicantEmbeddingMap

Datacollection• Source

– Socialmedia• Twitter,Facebookstatuses

– Essays• Method

– Self-reporting• questionnaire

– Expertannotation(consensus)– Distantsupervision

• Patternsextractedbyexpertsfromquestionnairequestions(Neuman &Cohen)orfromcorrelationstudies(Celli)

• Classsystem(ClassificationofRegression)– MBTI(Myers-BriggsTypeIndicator)– FFM(FiveFactorModel,BigFive)

BigFive(FFM)• Extraversion(vs.Introversion)

– Sociable,active,energetic,positive• Neuroticism

– Sad,anxious,tense• Agreeableness

– Altruism,trust,modesty,sympathy• Conscientiousness

– Goal-oriented,organized,incontrol• Openness

– Originality,breadthanddepthofmentallifeandexperience

Meyers-Briggspreferences

• Introversion&Extraversion• iNtuition &Sensing• Feeling&Thinking• Judging&Perceiving

• Leadsto16types:ENTJ(1.8%)…ESFJ(12.3%)• Validityandreliabilityhavebeenquestioned

6 ESFP4 ISFP4 INFP4 ESTJ3 INTP (architect)1 ESTP (promoter)1 ENTP (inventor)0 ISTP (crafter)

Atypicalsampleofourstudents

28 ESFJ (provider)23 ENFJ (teacher)16 ISFJ (protector)15 INTJ (mastermind)15 INFJ9 ENFP8 ISTJ8 ENTJ

FeatureCorrelations• Extraverts (as opposed to introverts)

– Produce more language (verbosity)– Smaller vocabulary (TTR), fewer hapaxes– Fewer negative emotion words – More positive emotion words– Fewer hedges (confidence)– More agreement and compliments– Fewer negation and causation– Less concrete– Less formal, more contextualised / relational

• More pronouns, verbs, adverbs, interjections• More present tense• Fewer numbers, less quantification• Fewer negation and causation words

FeatureCorrelations• Neurotics

– More“I”– Morenegativeemotionwords,fewerpositiveemotionwords– Moreconcreteandfrequentwords

• Agreeable– Morepositiveemotionwords,fewernegativeemotionwords– Fewerdeterminers

• Conscientious– Fewernegations– Fewernegativeemotionwords– Fewerhedges(?)

• Openness– Longerwords– Morehedges– Fewer“I”andpresenttense

PAN2015

• Personalityaspartofprofiling(alsoageandgender)

• Twitter• 5numericFFMvalues

– Regression(oneclassifierforeachfactor)– Evaluation:RMSE

• English(194users),Spanish(140),Italian(50),Dutch(44)

• 22teamsOverviewofthe3rdAuthorProfilingTaskatPAN2015FRangel,PRosso,MPotthast,BStein,WDaelemans - CLEF,2015http://pan.webis.de/clef15/pan15-web/author-profiling.html

DocumentRepresentation• Preprocessing

– RemoveHTMLcodes,hashtags,URLs,mentions,…• Features

– charactern-grams– wordn-grams– tf-idf n-grams– syntacticn-grams(lemma,pos,relations,…)– stylisticfeatures

• punctuation,case• emoticons• word,sentencelength,verbosity• characterflooding

– topicmodeling(LSA)– familyvocabulary,LIWC,MRC– frequentterms,discriminativewords,Nes,othervocabularies...– secondorderfeatures(relationshipsamongterms,documents,profiles)

DocumentRepresentation• Preprocessing

– RemoveHTMLcodes,hashtags,URLs,mentions,…• Features

– charactern-grams– wordn-grams– tf-idf n-grams– syntacticn-grams(lemma,pos,relations,…)– stylisticfeatures

• punctuation,case• emoticons• word,sentencelength,verbosity• characterflooding

– topicmodeling(LSA)– familyvocabulary,LIWC,MRC– frequentterms,discriminativewords,Nes,othervocabularies...– secondorderfeatures(relationshipsamongterms,documents,profiles)

Álvarez-Carmonaetal.(2015)

MachineLearningMethod

• SVMs• DecisionTrees(RandomForests)• Bagging• LinearDiscriminantAnalysis• StochasticGradientDescent• Linear,Logistic,RidgeforRegression• Distance-basedapproaches

BestResults

O C E A NEnglish .12 .11 .14 .13 .20Spanish .11 .10 .13 .10 .16Italian .10 .11 .07 .05 .16Dutch .04 .06 .08 .00 .06

Problem

• Samefeaturesareinformativefordifferentauthoraspects(e.g.personality &gender)– Women~ extraverts

• positiveandnegativeemotionwords,pronounuse,…– Jointlearning,ensemblemethods,…

• Maincauseofoverfitting• Largereferencecorpusneededofauthorswithvarioustraitsofinterest– Samplestratification

• Example:CLiPS Stylometry Investigation(CSI)corpus

CLiPSCSICorpus(withBenVerhoeven)

• http://www.clips.uantwerpen.be/datasets/csi-corpus• Corpuswithtwogenres:essaysandreviews(truthfulanddeceptive)– Dutchnativespeakers,studentsoflanguageandliterature

• Meta-data– Age,gender,sexualorientation,regionoforigin,personalityprofile(BigFive&MBTI)

• Yearlyexpansion– Currently:1800documents;660authors;770,000words

• SuccessorofPersonaeCorpus(2008,withKimLuyckx)

MiningpersonalitydatafromTwitter(withVerhoeven andPlank)

• Peopletweetabouttheirownpersonality

TwiSty Corpusapproach• SearchuserswithatweetmentioningtheirMBTIpersonalitytype(keyword-based)+gender– manualchecking

• Fetchrecenttweetsfromthoseusers(~2000average)– applylanguageidentificationforselectingcleanlanguagesample

• ES,PT,FR,NL,IT,DE• Sensationalismbias?• TwiSty isavailablefromhttp://www.clips.uantwerpen.be/datasets

Somecorpusproperties

0 20 40 60 80 100

ESPT

FRNL

ITDE

IversusE

0 20 40 60 80 100

ESPT

FRNL

ITDE

T versusF

PredictionExperiment

• LinearSVC,wordandcharactern-grams

Cognitive-biologicalApproach(Yair Neuman)

• “Personality”categorizationoriginatesfrom– Threat/Trustmanagement– Riskassessmentprocess

• Fight,flight,avoid

– (+complexityofabstractionandinferenceuniquetohumans)

• inthecontextofInterpersonalrelations

Conclusions

• Computationalpersonality– Promisingfortrend-basedapplicationsbutnotgoodenoughyetforindividualprofiling

– Textcategorizationmodelisnotsufficient• Hardtoobtainsufficientreliableannotateddata• Unsupervisedmodels?Clevermining?• threat- trustmodel?

– Computationaltheoryofwritingstyleneedsalarge,preferablymultilingual,butinanycasebalancedcorpus

Stylometry researchers@CLiPS:

Questions?