Team 03 - Data Technophiles

Enron Scandal: A Topic Modeling Approach to Analyze Enron E-mail Data Using SAS®

ABSTRACT

Enron Corporation was one of the leading companies in the U.S. energy industry. In order to grow quickly and become a major energy supplier, Enron Corporation borrowed huge amounts of money, chose to hide its debt, and inflated its profit margins [DiLallo 2015]. These unethical practices were revealed in October 2001 and led to the company filing for bankruptcy in December 2001 [Enron-Wikipedia].

The objective of this paper is to analyze the public e-mail dataset from Enron Corporation using topic modeling to identify whether this tragedy was avoidable. In addition, we construct a distribution of the identified topics over time and estimate their similarity using the Kullback directed divergence measure [Lin 1991]. To facilitate our analysis, we used SAS® Studio and built-in SAS® procedures such as HADOOP, SQL, FORMAT, and SGPLOT.

INTRODUCTION

Kenneth Lay created Enron Corporation by merging two energy companies, InterNorth and Houston Natural Gas [DiLallo 2015 and Enron-Wikipedia]. Enron was initially involved in supplying natural gas and producing coal and petroleum. From the beginning of 1990 the corporation engaged in energy trading, which was responsible for considerable growth. However, some of this growth was attributed to aggressive practices such as knowingly creating electricity shortages in California [DiLallo 2015]. Enron also diversified its business into communication (broadband) services, plastics, steel, and pulp and paper trading, and engaged in futures trading of commodities like grains, coffee, and meat [Enron-Wikipedia]. A large proportion of this growth was financed by borrowed money; a substantial portion of this debt was hidden from the public, and the profitability of the company was overstated. In 2000, Enron employed around 20,000 people and claimed $111 billion in revenue [Enron-Wikipedia]. A combination of factors, such as using the "merchant model" to report revenue [Enron Scandal-Wikipedia], altering balance sheets to show profits, and creating entities like CHEWCO and Whitewing under complex agreements to hide debts, is considered to be the reason for the downfall of Enron Corporation. These unethical practices were revealed in October 2001, and as a result the company's stock price fell more than 30% toward the end of 2001. Enron Corporation finally filed for bankruptcy in December 2001. The accounting scandals at companies like Enron and WorldCom led to the enactment of the Sarbanes-Oxley Act in 2002, which holds top-level management responsible for verifying and certifying the accuracy of accounting and financial information [Sarbanes-Oxley Act].

In this paper we apply data science techniques to analyze publicly available e-mail correspondence between Enron employees. The major objective of the study is to answer the questions: "Did anyone anticipate, or seem concerned with, the fate of the company? Could analysis


of the e-mail data have given clues so that the final outcome could have been avoided?" For our analysis we use SAS® Studio and the built-in procedures listed above.

DATA

The Enron e-mail dataset provided is available in multiple formats, such as XML and PST, and was stored in HDFS (Hadoop Distributed File System). For this paper we chose to work with the XML format, as it was easier to process than the PST format. The e-mail data in XML format complies with the EDRM XML schema [EDRM Format]. The data for a single e-mail is contained in the DOCUMENT tag; a sample e-mail is shown in Appendix 1. The attributes available for a single e-mail, and their data types, are listed in Table 1.

Root Tag       Attribute Name    Data Type
-------------  ----------------  ---------
Tag            From              Text
               To                Text
               CC                Text
               Subject           Text
               DateSent          DateTime
               AttachmentCount   Numeric
               HasAttachments    Boolean
               X-SDOC            Text
               X-ZID             Text
External File  FilePath          Text
               FileName          Text
               FileSize          Numeric
               HASH              Text
Location       Custodian         Text
               LocationURI       Text

TABLE 1 Attributes of an Enron e-mail dataset row

From this set of attributes, we chose From, To, CC, Subject, DateSent, HasAttachments, and AttachmentCount for our analysis. Because the e-mail body was not available, we could not use it in our analysis.

OUR METHODOLOGY

In this paper, we identify significant terms from e-mail subjects. These significant terms, individually or as a group, constitute a topic. To achieve this we use the LDA topic modeling approach proposed in [Blei et al. 2003]. Once topics are identified, we construct a distribution of each topic over time and estimate how similar the distributions are. To estimate the similarity between distributions we use the Kullback directed divergence measure described in [Lin 1991]: the lower the divergence, the higher the similarity, and vice versa. The degree of similarity between topic distributions provides insight into how the topics relate to one another. In addition, one can identify the people involved in the e-mail exchanges on these significant topics; such people tend to be influential. In this paper we also examine the claim that the e-mail data could have been used to prevent the bankruptcy.


DATA PRE-PROCESSING

The Enron dataset stored in HDFS comprises several XML files, each containing a number of e-mails. In data pre-processing our goal is to parse these individual XML files, extract the information of interest (From, To, CC, Subject, DateSent, HasAttachments, and AttachmentCount), and store it in a single table.

The data pre-processing consists of two passes. The first pass parses all the Enron e-mail data stored in HDFS and converts it to a tabular format. It comprises the following steps:

• Pass 1: Step 1 – Save the list of all the XML files in HDFS into a local file. Create directories in HDFS to store the aggregated data and metadata after pre-processing. Load the list of XML files into memory and extract the number of such files in HDFS.

• Pass 1: Step 2 – Initialize the HDP library by providing the appropriate parameters, and create the table in HDFS where all the parsed data will be stored.

• Pass 1: Step 3 – Create an SXLE XML mapping schema (SXLEMap), which contains the XPath from which each column should be extracted from the XML file, along with its data type and size [SXLEMap].

• Pass 1: Step 4 – For every XML file in HDFS, download the file to a local directory. After downloading the file, parse it using the XML LIBNAME engine [XML LIBNAME Engine] and store the parsed data into the final table. We used a SAS macro to perform this entire step; the tutorial by C. Yindra guided us in creating the macro [Yindra].

Once the first pass of pre-processing is complete, the parsed data from all XML files is stored as a Hive table in HDFS. The second pass of data pre-processing comprises the following steps:

• Pass 2: Step 1 – Convert the date and time fields from strings to the appropriate date and time formats. In addition, remove records without a subject, as they are not useful for identifying topics.

• Pass 2: Step 2 – There are erroneous year values in the date field, with years ranging from 1584 to 9719. Enron filed for bankruptcy in 2001, so for this analysis we keep only e-mails exchanged between January 1997 and December 2001.

SAS® code for all the above steps is given in Appendices 2 through 7.

ANALYSIS

Before performing topic modeling, we performed some exploratory analysis on the parsed dataset. The top 25 users who sent the most e-mails between 1997 and 2001 are shown in Figure 1 in Appendix 8; the Outlook Migration Team, Steven J. Kean, and Jeff Dasovich were the top three senders. The SAS® code used to generate this plot is also in Appendix 8. We also analyzed the distribution of the number of e-mails sent over time; the plot (Figure 2) and the SAS® code for this analysis are in Appendix 9. From Figure 2 it is evident that the number of e-mail messages sent starts to increase in 2000 and peaks in 2001.

TOPIC MODELING REVIEW


One of the emerging approaches to analyzing unstructured data is the topic model. Probabilistic topic models model text collections. According to Vorontsov and Potapenko [Vorontsov 2015], latent Dirichlet allocation (LDA), presented in [Blei et al. 2003], is the most popular probabilistic topic model. LDA is a two-level hierarchical Bayesian generative model: topics are modeled as distributions over a fixed set of words (the vocabulary), and documents are modeled as distributions over topics. A dataset is assumed to be a collection of D documents, and each document is a collection of words. The order of words is not considered, i.e., each document is treated as a bag of words. The following is a general description of the generative process of the LDA model [Blei 2012]:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   a. Randomly choose a topic from the distribution over topics in step 1.
   b. Randomly choose a word from the corresponding distribution over the vocabulary.
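To make the generative story concrete, here is a minimal SAS® sketch (our illustration, not code from the study) that simulates this process for one document under assumed toy values: k = 2 topics, a 4-word vocabulary, a symmetric Dirichlet parameter of 0.5, and a fixed β matrix. The θ draw uses the standard construction of a Dirichlet sample from normalized independent Gamma draws.

/* Toy simulation of the LDA generative process (illustrative values only) */
data generated_doc;
   /* beta: one row per topic, one column per vocabulary word; rows sum to 1 */
   array beta[2,4] _temporary_ (0.50 0.30 0.15 0.05
                                0.05 0.15 0.30 0.50);
   array theta[2] _temporary_;
   call streaminit(2001);
   /* Step 1: draw theta ~ Dirichlet(0.5, 0.5) via normalized Gamma draws */
   g1 = rand('GAMMA', 0.5);
   g2 = rand('GAMMA', 0.5);
   theta[1] = g1 / (g1 + g2);
   theta[2] = g2 / (g1 + g2);
   do n = 1 to 10;                 /* a 10-word document */
      /* Step 2a: draw a topic index from theta */
      z = rand('TABLE', theta[1]);
      /* Step 2b: draw a word index from row z of beta
         (RAND('TABLE') assigns the residual probability to the last value) */
      w = rand('TABLE', beta[z,1], beta[z,2], beta[z,3]);
      output;
   end;
   keep n z w;
run;

Each output row records a topic assignment z and a word index w; inference runs this story in reverse, recovering θ and β from the observed words alone.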

The goal of topic modeling is to discover the topics in a collection of documents. The topics are latent variables and must be inferred from the observed documents; this is the central inferential problem for LDA. A formal description, adopted from [Reed 2012], is given below:

For each document:

(a) Draw a topic distribution, θd ∼ Dir(α), where Dir(·) is a draw from a uniform Dirichlet distribution with scaling parameter α.

(b) For each word in the document:
    (i) Draw a specific topic zd,n ∼ multi(θd), where multi(·) is a multinomial.
    (ii) Draw a word wd,n ∼ βzd,n.

A draw from a k-dimensional Dirichlet distribution returns a k-dimensional multinomial parameter, θ in this case, where the k values must sum to one. This normalization requirement ensures that θ lies on a (k − 1)-dimensional simplex and has the probability density:

$$p(\theta \mid \alpha) \;=\; \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\; \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
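As a quick illustration (a worked example we add here, not in the original text), take k = 3 with symmetric α = (1, 1, 1):

$$p(\theta \mid \alpha) = \frac{\Gamma(3)}{\Gamma(1)^3}\,\theta_1^{0}\,\theta_2^{0}\,\theta_3^{0} = 2,$$

the uniform density on the 2-dimensional probability simplex. Larger symmetric α concentrates mass near the uniform point (1/3, 1/3, 1/3), while α < 1 pushes mass toward the corners of the simplex, i.e., sparse topic mixtures.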

• w represents a word, and wᵛ represents the vth word in the vocabulary, encoded as a unit-basis vector with wᵛ = 1 and wᵘ = 0 for u ≠ v

• w (boldface) represents a document (a vector of words), where w = (w1, w2, . . . , wN )

• α is the parameter of the Dirichlet distribution; technically α = (α1, α2, . . . , αk), but unless otherwise noted, all elements of α are the same.


• z represents a vector of topics, where if the ith element of z is 1 then w is drawn from the ith topic

• β is a k × V word-probability matrix with one row per topic and one column per term, where βij = p(wʲ = 1 | zⁱ = 1)
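For concreteness (an illustrative example we add here), with k = 2 topics and a vocabulary of V = 3 terms, β could be

$$\beta = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.1 & 0.3 & 0.6 \end{pmatrix},$$

where each row is a topic's distribution over the vocabulary and sums to one: topic 1 concentrates on the first term, while topic 2 favors the third.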

The central inferential problem for LDA is determining the posterior distribution of the latent variables given the document:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \;=\; \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$
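This posterior cannot be computed directly: the normalizing constant in the denominator requires marginalizing over all topic assignments, which [Blei et al. 2003] show is intractable, motivating approximate inference methods such as variational inference. Written out in the notation above, the denominator is

$$p(\mathbf{w} \mid \alpha, \beta) \;=\; \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta.$$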

Before performing topic modeling on the e-mail subjects, we note that not all the words in an e-mail subject are useful for our analysis. Words of no relevance are classified as stop words; a comprehensive list of English stop words is available at [Stop-words]. We use this list to remove stop words from the e-mail subjects, and we created a separate table in HDFS to store the words remaining after stop-word elimination. The SAS® code for this process is available in Appendix 10. After stop-word elimination, the most popular words in the e-mail subject lines are:

• Gas
• Trade
• Option
• Report
• Price
• Power
• Enron

In our analysis we treat each of these frequently occurring terms as a topic, as they are of greater significance. After identifying the topics, the next step is to obtain the probability of occurrence of each topic over the given time period. The SAS® code for computing these topic distributions and storing the results is in Appendix 11. All the topic distributions are compared in Figure 3; both the plot and the SAS® code that generates it are in Appendix 12.

Once the topic distributions are calculated, the divergence between them is computed using the following equation [Lin 1991]:

$$I(P_1, P_2) \;=\; \sum_{x \in X} P_1(x)\,\log\frac{P_1(x)}{P_2(x)}$$

where X is a discrete random variable, P1 and P2 are probability distributions of X, and I is the directed divergence [Lin 1991]. In our analysis we estimate the divergence between the Enron topic and each of the other topics. The SAS® code to compute the Kullback divergence measure is in Appendix 13; the code to visualize these distances, along with the resulting plot (Figure 4), is in Appendix 14.
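As a small worked example (ours, not from the paper): take a two-outcome X with P1 = (0.5, 0.5) and P2 = (0.25, 0.75). Then

$$I(P_1, P_2) = 0.5\,\log\frac{0.5}{0.25} + 0.5\,\log\frac{0.5}{0.75} \approx 0.347 - 0.203 = 0.144$$

(using natural logarithms, as in the SAS® code of Appendix 13), while I(P1, P1) = 0: the more closely P2 tracks P1, the smaller the directed divergence. Note also that I is not symmetric, i.e., I(P1, P2) ≠ I(P2, P1) in general.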

From Figure 4, it is evident that the divergences between the topics Enron and Option, and between Enron and Gas, are minimal compared to those of the other topics.



We then identify the people involved in exchanging e-mails on these topics. There were about 590 people who sent e-mails with subjects containing "Enron" as well as e-mails with subjects containing "Gas", and about 242 people who sent e-mails with subjects containing "Enron" as well as e-mails with subjects containing "Option". The SAS® code for these counts is in Appendices 15 and 16.

CONCLUSION

In this paper we analyzed the Enron e-mail dataset using LDA, a topic-modeling algorithm. We first parsed the dataset from XML into a tabular format and then identified topics of greater significance from the e-mail subject lines; for our analysis we considered each significant word to be a topic. After identifying the topics, we constructed their probability distributions over the given time period and computed the similarity between those distributions. This similarity could be used to classify words for examining the bodies of e-mails. We also identified the users involved in exchanging e-mails on these topics; these users are considered important, as they took part in e-mail exchanges involving significant topics.

The insights that one can obtain from e-mail subjects alone are limited. This issue is compounded by the lack of access to the bodies of these e-mail messages and to their attachments. Our analysis shows that there were not enough clues in the subjects of these e-mails to have averted the scandal.

FUTURE WORK

As future work, we propose to apply the methodology developed for this research to other publicly available datasets. LDA assumes topic distributions are independent and does not account for potential correlations between topics, so we also propose to compare LDA against the correlated topic model on e-mail datasets.

REFERENCES

[Blei et al. 2003] Blei, D.M., Ng, A.Y., Jordan, M.I., (2003), Latent Dirichlet Allocation. Journal of Machine Learning Research 3, pp. 993–1022.

[Blei 2012] Blei, D.M., (2012), Probabilistic topic models, Communications of the ACM, vol. 55, no. 4, pp 77-84.

[Daud et al. 2010] Daud, A., Li, J., Zhou, L., Muhammad, F., (2010), Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China, Volume 4, Issue 2, pp. 280-301.

[DiLallo 2015] DiLallo, M., (2015), Enron Scandal: A Devastating Reminder of the Dangers of Debt. Available at http://www.fool.com/investing/general/2015/06/21/enron-scandal-a-devastating-reminder-of-the-danger.aspx

[EDRM Format] EDRM Format, Available at http://www.edrm.net/projects/xml

[Enron-Wikipedia] Wikipedia, Enron. Available at https://en.wikipedia.org/wiki/Enron


[Enron Scandal-Wikipedia] Wikipedia, Enron Scandal. Available at https://en.wikipedia.org/wiki/Enron_scandal

[Lin 1991] Lin, J., (1991), Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), pp.145-151.

[Reed 2012] Reed, C., (2012), Latent Dirichlet Allocation: Towards a Deeper Understanding, January 2012, http://obphio.us/pdfs/lda_tutorial.pdf

[Sarbanes-Oxley Act] Sarbanes-Oxley Act, (2002). Available at http://www.soxlaw.com/

[SXLEMap] SXLEMap documentation. Available at http://support.sas.com/documentation/cdl/en/omamodref/61849/HTML/default/viewer.htm#sxlemap.htm

[Stop-words] Stop-words – Long Stop-word List, Available at http://www.ranks.nl/stopwords

[Vorontsov 2015] Vorontsov, K., Potapenko, A., (2015), Additive regularization of topic models. Machine Learning, Volume 101, Issue 1, pp. 303-323.

[XML LIBNAME Engine] XML LIBNAME Engine Documentation. Available at https://support.sas.com/documentation/cdl/en/engxml/65362/HTML/default/viewer.htm#p0c5z18meh4ilzn1cw71t8waayjq.htm

[Yindra] Yindra, C., %SYSFUNC - The Brave New Macro World, Available at http://www2.sas.com/proceedings/sugi23/Advtutor/p44.pdf

APPENDIX

Appendix 1: Sample E-mail from Enron Dataset

<Document DocID="" DocType="Message" MimeType="message/rfc822">
  <Tags>
    <Tag TagName="#From" TagDataType="Text" TagValue=""/>
    <Tag TagName="#To" TagDataType="Text" TagValue=""/>
    <Tag TagName="#Subject" TagDataType="Text" TagValue=""/>
    <Tag TagName="#DateSent" TagDataType="DateTime" TagValue=""/>
    <Tag TagName="#HasAttachments" TagDataType="Boolean" TagValue=""/>
    <Tag TagName="X-SDOC" TagDataType="Text" TagValue=""/>
    <Tag TagName="X-ZLID" TagDataType="Text" TagValue=""/>
  </Tags>
  <Files>
    <File FileType="">
      <ExternalFile FilePath="" FileName="" FileSize="" Hash=""/>
    </File>
  </Files>
  <Locations>
    <Location>
      <Custodian></Custodian>
      <LocationURI></LocationURI>
    </Location>


  </Locations>
</Document>

Appendix 2: Data Pre-Processing – Pass 1: Step 1

proc hadoop username='user' password='password' verbose;
   hdfs LS = '/contest/data/enron/edrm-enron-v2' out = 'localFile';
   hdfs mkdir = '/contest/team03/metadata';
   hdfs mkdir = '/contest/team03/dataload';
run;

/* Convert the locally stored file into a SAS dataset. We are interested in
   processing XML files only, so in addition to converting the local file
   into a SAS dataset, this step adds a flag (isXML) that states whether the
   file is in XML format: its value is 1 if the file is XML, 0 otherwise. */
data filesList;
   infile 'localFile' dlm=' ' truncover;
   length permissions $10. owner $5. folder $11. volume 8
          data_created $10. time_created $8. full_path $300.;
   input permissions$ owner$ folder$ volume data_created$ time_created$ full_path$;
   isXML = ifn(compare(upcase(substr(full_path, index(full_path, '.')+1,
            index(full_path, '.')+3)), upcase('xml'))=0, 1, 0);
run;

/* Select all the XML file paths into the macro variable &paths, and the
   number of such files into &recordCount */
proc sql noprint;
   select full_path into :paths separated by ' '
      from WORK.filesList where isXML EQ 1;
   select count(*) into :recordCount
      from WORK.filesList where isXML EQ 1;
quit;

Appendix 3: Data Pre-Processing – Pass 1: Step 2

/* Initialize all the relevant values for Hive */
libname HDP hadoop server="10.0.1.142" user='user' password='password'
   hdfs_tempdir = '/tmp'
   hdfs_metadir = '/contest/team03/metadata'
   hdfs_permdir = '/contest/team03/dataload';

/* Delete the table if it already exists */
proc delete data=hdp.one_pass; run;

/* Create a dataset to store the parsed XML data */
proc sql noprint;
   create table hdp.one_pass(
      Owner char(100),
      From char(10000),


      To char(25000),
      CC char(25000),
      Subject char(10000),
      DateSent char(500),
      HasAttachments char(5),
      AttachmentCount char(10));
quit;

Appendix 4: Data Pre-Processing – Pass 1: Step 3

<SXLEMAP name="SXLEMAP" version="1.2">
    <TABLE name="parseEmail">
        <TABLE-PATH syntax="XPATH">//Document/Tags</TABLE-PATH>
        <COLUMN name="From">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#From"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>10000</LENGTH>
        </COLUMN>
        <COLUMN name="To">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#To"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>25000</LENGTH>
        </COLUMN>
        <COLUMN name="CC">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#CC"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>10000</LENGTH>
        </COLUMN>
        <COLUMN name="Subject">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#Subject"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>10000</LENGTH>
        </COLUMN>
        <COLUMN name="DateSent">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#DateSent"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
        </COLUMN>
        <COLUMN name="HasAttachments">
            <PATH>//Document/Tags/Tag@TagValue[@TagName="#HasAttachments"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>5</LENGTH>
        </COLUMN>
        <COLUMN name="AttachmentCount">


            <PATH>//Document/Tags/Tag@TagValue[@TagName="#AttachmentCount"]</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>5</LENGTH>
        </COLUMN>
    </TABLE>
</SXLEMAP>

Appendix 5: Data Pre-Processing – Pass 1: Step 4

/* Macro to iterate over the XML files */
%MACRO CreateDataset;
   %do i = 1 %to &recordCount;
      %let currentFileName = %scan(&paths, &i, %STR(' '));
      %put "Processing file: &currentFileName";
      %let localFilePath = "temp.xml";
      /* Get the owner name from the file name */
      %let ownerName = %scan(&currentFileName, 2, %STR('_'));
      %put "Owner name obtained from the file is &ownerName";
      /* Download the file to a local directory */
      proc hadoop username='user' password='password' verbose;
         hdfs copytolocal="&currentFileName" out=&localFilePath;
      run;
      /* Parse the downloaded XML file */
      filename map 'mapping.xml';
      filename myxml &localFilePath;
      libname myxml xml xmlmap=map;
      /* Add the parsed data into the table */
      proc sql noprint;
         insert into hdp.one_pass
            (Owner, From, To, CC, Subject, DateSent, HasAttachments, AttachmentCount)
         select "&ownerName" As Owner, From, To, CC, Subject, DateSent,
                HasAttachments, AttachmentCount
         from myxml.parseEmail;
      quit;
      /* Delete the local file after use */
      %put %sysfunc(fdelete(myxml));
   %end;
%MEND CreateDataset;

/* Call the macro */
%CreateDataset;

Appendix 6: Data Pre-Processing – Pass 2: Step 1

/* Pass two - format the date and remove records with no subject */
data hdp.two_pass;
   set hdp.one_pass;
   where Subject NE '';
   dateInEmail = ifn(length(DateSent)=0, ., input(scan(DateSent,1,'T'), E8601DA.));
   timeInEmail = ifn(length(DateSent)=0, ., input(DateSent, e8601dt19.));
   format dateInEmail DDMMYY10.;
   format timeInEmail time.;


run;

Appendix 7: Data Pre-Processing – Pass 2: Step 2

proc delete data=hdp.filtered_data; run;

/* Filter the data based on the dates. The dates we need are from
   1-Jan-1997 to 31-Dec-2001 */
proc sql number;
   create table hdp.filtered_data as
      select *, monotonic() as rownum from hdp.two_pass
      where dateInEmail between input('01/01/1997', DDMMYY10.)
                            and input('31/12/2001', DDMMYY10.);
quit;

Appendix 8: Analysis – People Who Sent the Most E-mails

/* Create a table of sender statistics */
proc sql outobs=25;
   create table enron.from_stat as
      select From, count(From) as from_count
      from hdp.filtered_data
      where From ne ''
      group by From
      order by from_count desc;
quit;

/* Sort the dataset */
proc sort data=enron.from_stat out=sorted_from_stats;
   key from_count / descending;
run;

/* Plot the data */
ods graphics on / width=30cm height=30cm noborder;
proc sgplot data=sorted_from_stats noautolegend;
   title "Top 25 people who sent the most e-mails";
   vbar From / response=from_count categoryorder=respdesc nostatlabel;
   vline From / response=from_count;
   xaxis label='Sender';
   yaxis label='Count';
run;


Figure 1 People who sent the most e-mails

Appendix 9: Analysis – Distribution of E-mail Exchanges Between 1997 and 2001

proc sql;
   create table enron.date_stat as
      select put(dateInEmail, yymmd7.) as Chronological_Order, count(*) as date_count
      from hdp.filtered_data
      group by 1
      order by 1;
quit;

ods graphics on / width=30cm height=30cm noborder;
proc sgplot data=enron.date_stat noautolegend;
   title "Distribution of e-mail exchanges between 1997 and 2001";
   vbar Chronological_Order / response=date_count;
   vline Chronological_Order / response=date_count;
   xaxis label='Time';
   xaxis FITPOLICY=THIN;
   yaxis label='Count';
run;


Figure 2 Distribution of E-mail exchanges

Appendix 10: Analysis – Eliminating Stop Words

/* Code to generate an informat that identifies stop words */
proc format;
   invalue stopwords
   "a", "able", "about", "above", "abst", "accordance", "according", "accordingly", "across", "act", "actually", "added", "adj", "affected", "affecting", "affects", "after", "afterwards", "again", "against", "ah", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "apparently", "approximately", "are", "aren", "arent", "arise", "around", "as", "aside", "ask", "asking", "at", "auth", "available", "away", "awfully", "b", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides",


"between", "beyond", "biol", "both", "brief", "briefly", "but", "by", "c", "ca", "came", "can", "cannot","can't","cause","causes","certain","certainly","co","com","come","comes","contain","containing","contains","could","couldnt","d","date","did","didn't","different","do","does","doesn't","doing","done","don't","down","downwards","due","during","e","each","ed","edu","effect","eg","eight","eighty", "either", "else", "elsewhere", "end", "ending", "enough", "especially", "et", "et-al", "etc","even","ever","every","everybody","everyone","everything","everywhere","ex","except","f","far","few", "ff", "fifth", "first", "five", "fix", "followed", "following", "follows", "for", "former", "formerly","forth", "found", "four", "from", "further", "furthermore", "g", "gave", "get", "gets", "getting", "give","given", "gives", "giving", "go", "goes", "gone","got", "gotten", "h", "had", "happens", "hardly", "has","hasn't", "have", "haven't", "having", "he","hed", "hence", "her", "here", "hereafter", "hereby","herein", "heres", "hereupon", "hers", "herself", "hes", "hi", "hid", "him", "himself", "his", "hither","home", "how", "howbeit", "however", "hundred", "i", "id", "ie", "if", "i'll", "im", "immediate","immediately", "importance", "important", "in", "inc", "indeed", "index", "information", "instead","into", "invention", "inward", "is", "isn't", "it", "itd", "it'll", "its", "itself", "i've", "j", "just", "k", "keep","keeps", "kept", "kg", "km", "know", "known", "knows", "l", "largely", "last", "lately", "later", "latter","latterly","least","less","lest","let","lets","like","liked","likely","line","little","'ll","look","looking","looks", "ltd", "m", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean","means", "meantime", "meanwhile", "merely", "mg", "might", "million", "miss", "ml", "more","moreover","most","mostly","mr","mrs","much","mug","must","my","myself","n","na","name","namely","nay","nd","near","nearly","necessarily","necessary","need","needs","neither","never","nevertheless","new","next","nine","ninety","no","nobody","non","none","nonetheless","noone","nor", "normally", "nos", "not", "noted", "nothing", "now", "nowhere", "o", "obtain", "obtained","obviously","of","off","often","oh","ok","okay","old","omitted","on","once","one","ones","only","onto","or","ord","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","owing","own","p","page","pages","part","particular","particularly","past","per","perhaps", "placed", "please", "plus", "poorly", "possible", "possibly", "potentially", "pp","predominantly", "present", "previously", "primarily", "probably", "promptly", "proud", "provides","put", "q", "que", "quickly", "quite", "qv", "r", "ran", "rather", "rd", "re", "readily", "really", "recent","recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research","respectively", "resulted", "resulting", "results", "right", "run", "s", "said", "same", "saw", "say","saying","says","sec","section","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sent","seven","several","shall","she","shed","she'll","shes","should","shouldn't","show","showed","shown","showns","shows","significant","significantly","similar","similarly","since","six","slightly", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime","sometimes", "somewhat", "somewhere", "soon", "sorry", "specifically", "specified", "specify","specifying", "still", "stop", "strongly", "sub", "substantially", 
"successfully", "such", "sufficiently","suggest", "sup", "sure", "t", "take", "taken", "taking", "tell", "tends", "th", "than", "thank", "thanks","thanx", "that", "that'll", "thats", "that've", "the", "their", "theirs", "them", "themselves", "then","thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof","therere", "theres", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'll", "theyre","they've", "think", "this", "those", "thou", "though", "thoughh", "thousand", "throug", "through","throughout", "thru", "thus", "til", "tip", "to", "together", "too", "took", "toward", "towards", "tried","tries", "truly", "try", "trying", "ts", "twice", "two", "u", "un", "under", "unfortunately", "unless","unlike", "unlikely", "until", "unto", "up", "upon", "ups", "us", "use", "used", "useful", "usefully","usefulness", "uses", "using", "usually", "v", "value", "various", "'ve", "very", "via", "viz", "vol", "vols","vs", "w", "want", "wants", "was", "wasnt", "way", "we", "wed", "welcome", "we'll", "went", "were","werent", "we've", "what", "whatever", "what'll", "whats", "when", "whence", "whenever", "where","whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", "wherever", "whether",


"which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom","whomever", "whos", "whose", "why", "widely", "willing", "wish", "with", "within", "without", "wont","words", "world", "would", "wouldnt", "www", "x", "y", "yes", "yet", "you", "youd", "you'll", "your","youre", "yours", "yourself", "yourselves", "you've", "z", "zero", "Re", "Re.", "Re .", "Re:", "Re :", "RE","RE.", "RE .", "RE:", "RE :","Fwd", "Fwd.", "Fwd .", "Fwd:", "Fwd :", "FWD", "FWD.", "FWD .", "FWD:","FWD:","Fw","Fw.","Fw.","Fw:","Fw:","FW","FW.","FW.","FW:","FW:"=1other=0;run;optionsnosourcenonotes;/*Ahivetabletoholdsubjects*/procdeletedata=hdp.subjects;run;procsqlnoprint; createtablehdp.subjects( wordchar(1000));run;/*RecordcountoftablefiltereddataisstoredinvariablerecordCount*/procsqlnoprint; selectcount(*)into:recordCountfromhdp.filtered_data;/*Macrotochangetheformatoftheword.Ifawordisastopworditsvalueisoneelsezero*/%macrochangeFormat(invar,infmt); %LET&INVAR=%SYSFUNC(INPUTN(&&&INVAR,&INFMT));%mendchangeFormat;/*Macrotoiteratethrougheachrecordandtostripthesubjectofanystopwords.Remainingwordsinthesubjectareinsertedtoatablefortopicmodeling*/%macroeliminateStopWords;%doi=1%to&recordCount; %putprocessingrecord&i;procsqlnoprint;/*Selectappropriaterow*/selectSubjectinto:subjectVarfromhdp.filtered_datawhererownum=&i;/*Removeleading&/trailingwhitespaces*/%letformattedSub=%bquote(&subjectVar);/*Countthenumberofwordsinthesubject*/%letword_cnt=%sysfunc(countw(&formattedSub));/*%putWordcount&word_cnt;*//*Iterateallthewords*/%doj=1%to&word_cnt;/*Getappropriateword*/%letword=%qscan(&formattedSub,&j,%str());/*%putcurrentwordbeforemodifications&word;*/%letaps_occurrence=%index(&word,%nrstr(%'));%if&aps_occurrencene0%then%do; %LETword=%SUBSTR(&word,1,&aps_occurrence-1);


      %end;
      /* Strip quotes, parentheses, commas, hyphens, and ampersands */
      %let word = %STR(&word);
      %let word = %sysfunc(tranwrd(&word, %NRSTR(%"), %Str()));
      %let word = "&word";
      %let word = %sysfunc(tranwrd(&word, %NRSTR(%(), %Str()));
      %let word = %sysfunc(tranwrd(&word, %NRSTR(%)), %Str()));
      %let word = %sysfunc(tranwrd(&word, %NRSTR(,), %Str()));
      %let word = %sysfunc(tranwrd(&word, %NRSTR(-), %Str()));
      %let word = %sysfunc(tranwrd(&word, %NRSTR(&), %Str()));
      %let word = %cmpres(&word);
      %let word = %sysfunc(dequote(&word));
      %let word1 = &word;
      /* Get the word length - to eliminate empty words */
      %let wordLength = %length(%TRIM(%QUOTE(&word)));
      %if &wordLength > 1 %then %do;
         /* Apply the stop-word informat */
         %changeFormat(word, stopwords.);
         /* If the word is not a stop word then insert it into the table */
         %if &word ne 1 %then %do;
            %let word1 = %qlowcase(&word1);
            proc sql noprint;
               insert into hdp.subjects(word) values("&word1");
            quit;
         %end;
      %end;
   %end;
%end;
%mend eliminateStopWords;

options nosource nonotes;
%eliminateStopWords

Appendix 11: Analysis – Compute Topic Distribution

/* Macro to do a full (outer) join on two datasets */
%macro mergeTables;
   data work.final_counts;
      merge work.final_counts work.temp_counts;
      by Chronological_order;
   run;
%mend mergeTables;

/* This macro accepts a frequently used word and computes its probability of
   occurrence for each month from 1997 to 2001 */


%macro processDataset;
   %put &current_word;
   %let column_name = %sysfunc(dequote(&column_name));
   %if &iter_num eq 1 %then %do;
      proc sql;
         drop table WORK.temp;
         drop table WORK.final_counts;
         reset;
         create table WORK.temp as
            select dateInEmail from hdp.filtered_data
            where lower(Subject) like &current_word;
         create table WORK.final_counts as
            select put(dateInEmail, yymmd7.) as Chronological_Order,
                   count(*) as &column_name
            from work.temp group by 1 order by 1;
         update WORK.final_counts
            set &column_name = &column_name / (select count(*) from work.temp);
      quit;
   %end;
   %else %do;
      proc sql;
         drop table WORK.temp;
         drop table WORK.temp_counts;
         reset;
         create table WORK.temp as
            select dateInEmail from hdp.filtered_data
            where lower(Subject) like &current_word;
         create table WORK.temp_counts as
            select put(dateInEmail, yymmd7.) as Chronological_Order,
                   count(*) as &column_name
            from work.temp group by 1 order by 1;
         update WORK.temp_counts
            set &column_name = &column_name / (select count(*) from work.temp);
      quit;
      %mergeTables;
   %end;
%mend processDataset;

/* User-defined function to call the macro */
proc fcmp outlib=enron.funcs.sql;
   function process_data_set(current_word $, iter_num, column_name $);
      rc = run_macro('processDataset', current_word, iter_num, column_name);
      return(rc);
   endsub;
run;

options cmplib=enron.funcs;
options nosource nonotes;

DATA _NULL_;
   /* Topics of importance */
   array topics[7] $10 ("enron" "gas" "trade" "option" "report" "price" "power");
   do i = 1 to 7;
      tem = process_data_set(cats('%', trim(left(topics[i])), '%'), i,
                             trim(left(topics[i])));


   end;
RUN;

/* Replace all missing values with zero. Source code from
   [http://stackoverflow.com/questions/16877705/replace-missing-values-in-sas] */
data enron.final_counts;
   set work.final_counts;
   array a(*) _numeric_;
   do i = 1 to dim(a);
      if a(i) = . then a(i) = 0;
   end;
   drop i;
run;

Appendix 12: Analysis – Visualizing Topic Distribution

/* Display the probability distribution of all the identified topics */
ods graphics on / width=30cm height=30cm noborder;
proc sgplot data=enron.final_counts;
   series x=Chronological_Order y=enron / markers;
   series x=Chronological_Order y=gas / markers;
   series x=Chronological_Order y=trade / markers;
   series x=Chronological_Order y=option / markers;
   series x=Chronological_Order y=report / markers;
   series x=Chronological_Order y=price / markers;
   series x=Chronological_Order y=power / markers;
   xaxis label='Time';
   xaxis FITPOLICY=THIN;
   yaxis label='Probability';
   title 'Probability of occurrences of selected words';
run;


Figure 3 Comparison of Topic Distribution

Appendix 13: Analysis – Computation of the Divergence Measure Between Enron and Other Topics

/* Computing the distance between topics */
data work.topic_dist;
   set enron.final_counts;
   drop Chronological_Order;
   if (gas > 0) then do;
      gas = enron * log(enron / gas);
   end; else gas = 0;
   if (trade > 0) then do;
      trade = enron * log(enron / trade);


   end; else trade = 0;
   if (option > 0) then do;
      option = enron * log(enron / option);
   end; else option = 0;
   if (report > 0) then do;
      report = enron * log(enron / report);
   end; else report = 0;
   if (price > 0) then do;
      price = enron * log(enron / price);
   end; else price = 0;
   if (power > 0) then do;
      power = enron * log(enron / power);
   end; else power = 0;
   array a(*) _numeric_;
   do i = 1 to dim(a);
      if a(i) = . then a(i) = 0;
   end;
   drop i;
run;

/* Get the distance from Enron */
proc sql;
   create table enron.topic_dist as
      select 0 as sum_enron, sum(gas) as sum_gas,
             sum(trade) as sum_trade, sum(option) as sum_option,
             sum(report) as sum_report, sum(price) as sum_price,
             sum(power) as sum_power
      from work.topic_dist;
   select * from enron.topic_dist;
quit;

Appendix 14: Analysis – Visualizing the Divergence Measure Between Enron and Other Topics

/* Code to visualize distances between distributions */
%ANNOMAC;
data enron.anno_data;
   set enron.topic_dist;
   drop sum_enron sum_gas sum_report sum_option sum_power sum_trade sum_price;
   /* Set the lengths of the annotation fields appropriately */
   LENGTH FUNCTION COLOR STYLE $8;
   LENGTH TEXT $25;
   /* Draw circles that represent the words picked up */
   %slice(10, 20, 360, 360, 2, CREAM, PS, both);
   %slice(10, 20+(sum_gas*250), 360, 360, 2, LIME, PS, both);
   %slice(10, 20+(sum_trade*250), 360, 360, 2, CYAN, PS, both);
   %slice(10, 20+(sum_option*250), 360, 360, 2, STEEL, PS, both);


   %slice(10, 20+(sum_report*250), 360, 360, 2, RED, PS, both);
   %slice(10, 20+(sum_price*250), 360, 360, 2, BROWN, PS, both);
   %slice(10, 20+(sum_power*250), 360, 360, 2, GOLD, PS, both);
   /* Create labels naming the filled circles and their distances from Enron */
   %label(23, 20, 'ENRON [dist=0]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_gas*250), 'GAS [dist=0.047]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_trade*250), 'TRADE [dist=0.090]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_option*250), 'OPTION [dist=0.040]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_report*250), 'REPORT [dist=0.107]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_price*250), 'PRICE [dist=0.122]', black, 0, 0, 1.5, triplex);
   %label(23, 20+(sum_power*250), 'POWER [dist=0.062]', black, 0, 0, 1.5, triplex);
   /* Draw a central line */
   %line(10, 20, 10, 20+(sum_price*250), black, 1, 1);
   /* Small filled centers */
   %slice(10, 20, 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_gas*250), 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_trade*250), 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_option*250), 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_report*250), 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_price*250), 360, 360, 0.3, BLACK, PS, both);
   %slice(10, 20+(sum_power*250), 360, 360, 0.3, BLACK, PS, both);
run;

proc ganno annotate=enron.anno_data;
run;

Figure 4 Measure of divergence between Enron and other topics


Appendix 15: Analysis – Identifying Common People Between Topics "Enron" and "Gas"

proc sql;
   create table enron_gas_users as
      (select distinct From from hdp.filtered_data where lower(Subject) like '%enron%'
       union all
       select distinct From from hdp.filtered_data where lower(Subject) like '%gas%');
   delete from enron_gas_users where From = '';
   create table enron.enron_gas_users_count as
      (select From, count(From) as NumExchanges from enron_gas_users group by From);
   select count(*) from enron.enron_gas_users_count where NumExchanges > 1;
quit;

Appendix 16: Analysis – Identifying Common People Between Topics "Enron" and "Option"

proc sql;
   create table enron_option_users as
      (select distinct From from hdp.filtered_data where lower(Subject) like '%enron%'
       union all
       select distinct From from hdp.filtered_data where lower(Subject) like '%option%');
   delete from enron_option_users where From = '';
   create table enron.enron_option_users_count as
      (select From, count(From) as NumExchanges from enron_option_users group by From);
   select count(*) from enron.enron_option_users_count where NumExchanges > 1;
quit;