Hoyle paper 019-31 SUGI 31 Text Mining SAS-L Topics Larry Hoyle, Policy Research Institute, University of Kansas
Page 1: Text Mining SAS-L Topics

Hoyle paper 019-31, SUGI 31

Larry Hoyle, Policy Research Institute, University of Kansas

Page 2: SAS-L topics

• Read each weekly topic list from http://www.listserv.uga.edu/archives/sas-l.html

• Parse topic, HTMLdecode

• Strip “Re: ”

    /* strip variations of re: (the /i flag makes the match case-insensitive) */
    topicRE = prxparse('/^ *re *: *(.*)/i');
    if prxmatch(topicRE, topic) then do;
      topic = prxposn(topicRE, 1, topic);
    end;

• Proc SQL to aggregate topic counts across weeks
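
The steps on this slide could be strung together roughly as follows. This is only a sketch under stated assumptions, not the code from the paper: the data set names (work.weekly_topics, work.clean_topics, work.topic_counts), the column n_msgs, and the sample rows are all hypothetical, and it assumes each weekly topic line carries its own message count.

```sas
/* Hypothetical input: one row per topic line from a weekly index page. */
/* DATALINES4 is used because the sample data contains a semicolon.     */
data work.weekly_topics;
  infile datalines dsd truncover;
  length week $5 topic $200;
  input week $ n_msgs topic $;
  datalines4;
0501a,3,Re: PROC SQL question
0501b,2,RE : PROC SQL question
0501b,1,Merge &amp; update
;;;;
run;

/* Decode HTML entities, then strip a leading "Re:" in any case or spacing. */
data work.clean_topics;
  set work.weekly_topics;
  topic = htmldecode(topic);
  topicRE = prxparse('/^ *re *: *(.*)/i');
  if prxmatch(topicRE, topic) then topic = prxposn(topicRE, 1, topic);
  drop topicRE;
run;

/* Merge threads across weeks: one row per cleaned topic. */
proc sql;
  create table work.topic_counts as
  select topic,
         sum(n_msgs)          as total_msgs,
         count(distinct week) as n_weeks
  from work.clean_topics
  group by topic
  order by total_msgs desc;
quit;
```

Grouping on the cleaned topic is what lets a thread that spans two weekly lists collapse into a single row; any further cleaning (case folding, trimming repeated "Re:" prefixes) would go in the DATA step before the aggregation.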

Page 3: SAS-L 2005

• 35324 thread/topic lines in the HTML files

• 7081 threads after merging across weeks and a little cleaning

Page 4: SAS-L Top Threads in Number of Messages

Page 5: Text Miner on the SAS-L topics

Page 10: Largest clusters

Page 11: Smaller Clusters

Page 12: Message Content

Page 13: Web scraping with tmfilter

/* return to SAS automatically after each X command */
options noxwait;

%macro aweek(week=0501a);
  /* create the download and filtered-output folders for this week */
  x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week";
  x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredposts\&week";

  libname sugi31 'C:\ddrive\projects\sugs\sugi31\SASLBOF\datasets';

  /* fetch the week's SAS-L archive index and the posts it links to */
  %tmfilter(dataset=sugi31.SL&week.,
            dir=C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week,
            destdir=C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredPosts\&week,
            URL=http://listserv.uga.edu/cgi-bin/wa?A1=ind&week.%NRSTR(&L=sas-l),
            depth=1,
            links=sugi31.SL&week.L,
            norestrict=' ',
            numchars=2000)
%mend aweek;

%aweek(week=0501a);
%aweek(week=0501b);
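
With a year of weekly archives to fetch, the repeated %aweek calls could be driven from a list. The wrapper below is a hypothetical helper (the name %allweeks and its weeks= parameter are not in the slides) and assumes %aweek has already been compiled as shown above; only the two week codes that appear on the slide are used.

```sas
/* Hypothetical wrapper: call %aweek once per blank-separated week code. */
%macro allweeks(weeks=);
  %local i week;
  %let i = 1;
  %let week = %scan(&weeks, &i, %str( ));
  %do %while(%length(&week) > 0);
    %aweek(week=&week)
    %let i = %eval(&i + 1);
    %let week = %scan(&weeks, &i, %str( ));
  %end;
%mend allweeks;

%allweeks(weeks=0501a 0501b)
```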

Page 14: Parse date and sender

Page 15: Using a 10% sample of message text
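
The slides do not say how the 10% sample was drawn. One common way to take such a sample in SAS is a simple random sample with PROC SURVEYSELECT (part of SAS/STAT). In this sketch the data set names, the variables msg_id and text, and the seed are all assumptions made for illustration.

```sas
/* Hypothetical stand-in for the parsed messages: 1000 rows of id plus text. */
data work.messages;
  length text $200;
  do msg_id = 1 to 1000;
    text = catx(' ', 'message body', msg_id);
    output;
  end;
run;

/* Simple random sample without replacement at a 10% rate. */
/* A fixed seed makes the draw repeatable.                 */
proc surveyselect data=work.messages out=work.messages10
                  method=srs samprate=0.10 seed=31;
run;
```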

Page 16: Using a 10% sample of message text

Page 17: Filter out too common terms, listserv

Page 18: Filter out too common terms, listserv