Data Mining at Duke

Data Mining at Duke (What to do with all of those hard drives)

Molly Tamarkin, Associate University Librarian for Information Technology Services
Joel Herndon, Head, Data & GIS Services

Today's Talk
- Rise of text analysis questions
- Challenges in providing text analysis services
- Duke University Libraries' response

Bullet one: when we say text analysis, we are discussing a range of research questions including 1.) content analysis, 2.) topic modeling, and 3.) many other qualitative approaches using computer-assisted text analysis. We are not asserting that text analysis is completely new; rather, we are now in a period where text analysis is moving into the mainstream research community.
Bullet two: what are the implications for research libraries supporting this type of research?
Bullet three: what we are doing.

Brandaleone Center for Data and GIS Services

Background/narrative story line: we've been supporting numeric and geospatial data for some time (background story for DGS). More recently we have increased resources for data visualization as well.

The Rise of Text as Data

Transition slide.

New Questions for Research Libraries
- How has the North American press covered environmental issues over the last 20 years?
- Can we analyze all (17,000) journal articles on German studies in the 20th century?
- What might tweets reveal about the Arab Spring in social media?

In recent years, researchers across the disciplines, from the sciences to the humanities, have been seeking support for applying these research tools to a range of topics (patent analysis, press analysis, literature studies). We are also seeing a range of questions on how to manage, clean, organize, and analyze text collections.

http://sites.duke.edu/digital/

While the library sees more text analysis questions, the library's digital scholarship program has created its own forum on text mining work. The forum showcases research on campus while connecting researchers across campus who are eager to share their research or learn from the research of others.

Likewise, outside the library we also see a rise of groups focusing on questions about how to deal with text-based analysis strategies, ranging from focus groups to topic modeling to web scraping. These groups have helped the library refine its focus on collections/resources for text analysis.

Challenges in Providing Text Analysis Services

Transition slide.

Challenges
- Collections
- Licensing
- Infrastructure
- Service model

Given the new requests at Perkins, what are the challenges in meeting these needs?
Licensing: research libraries hold a wealth of digital text but, with a few notable exceptions, rarely have explicit rights for opening these collections to text analysis.
Infrastructure: text analysis tools and larger text collections (or corpora) require specialized infrastructure, and we consider these needs.
Finally, services: how do you provide services that support and assist researchers with varying levels of text analysis concerns?

Open (or mostly open) Access

Fortunately, the research community already has a wide range of projects that have generously provided text corpora (or at least frequency counts!) and tools for analyzing these texts (and some user-supplied texts) online. Sites: MONK Workbench (tools and text), Text Creation Partnership (TCP) (text), Internet Archive (text), TAPOR2 (tools and a few texts), JSTOR Data for Research (text counts). (Note: I've left Hathi out due to comments from Allen. What do you think?)

Licensing

http://chronicle.com/article/Hot-Type-Elsevier-Experiments/131789/

We have also recently had a few isolated success stories where researchers have pressed for text analysis rights with major library vendors and secured access for their researchers (according to Heather, Elsevier has indicated a willingness to add text mining to contracts).

http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx

"We found some text mining in fields such as biomedical sciences and chemistry and some early adoption within the social sciences and humanities however most text mining in UKFHE is based on Open Access documents or bespoke arrangements." (Key findings, p. 2)

Heather's success story with Elsevier illustrates one of the larger challenges for research libraries: while many groups are trying to add text mining to standard license agreements, at present most access to large text corpora is through Open Access portals or isolated (bespoke) one-time arrangements, as the JISC report quoted above mentions.

Licensing

At Duke, we have discussed the types of access required for our researchers with both Gale and Lexis Nexis. For many of the questions we are now seeing, the library has licensed access for traditional research methods, but we have not yet secured rights for text analysis.

Photo from editorsweblog.org

With Lexis Nexis, we have a large community of researchers who wish to mine its news sources as well as its legal and congressional records. We are currently discussing possibilities for this type of access (Lexis Web Services Toolkit).

ECCO Project

In the Gale hard drives collection, we have received permission for one project with the ECCO drives. A Duke researcher who was interested in evaluating the scanned images of German-language materials (which are available on the drives but not readily available online) and the related XML used the ECCO backups to evaluate the potential for a text mining project using Latent Dirichlet Allocation (LDA). This project has informed much of our infrastructure planning for supporting text mining.

Big Data

- ~63 drives
- ~63 terabytes
- More than 40 topics

Big Data (could joke about how a book truck of 1 TB hard drives turns a few heads). The drives hold images and XML related to the collections. While the marked-up text is often of prime interest to researchers, the images can also be of value, especially for collections that introduce challenges for OCR, such as materials containing accented characters (foreign-language collections).

Gale's text databases present a particularly appealing set of text corpora for our nascent text analysis program. Perkins' text databases from Gale cover a broad range of topics of interest to both social science and humanities scholars. Political science and American studies are two areas where we expect growing interest in text analysis of the historical record. (Figure: network analysis of subjects covered in the Gale databases licensed by Duke. Larger circles indicate the size of the collection at Duke; darker colors indicate more connections across subjects. Source: Angela Zoss.)

Even more appealing, we receive a backup hard drive for most Gale products that we license. This translates into big data.

Gale Backup Drive Collection

But things have begun to change as growing text corpora have become the new data collection.

Infrastructure

Infrastructure represents a second challenge in supporting text analysis.

Six Methods of Text Analysis
1.) Reading
2.) Counting words
3.) Human coding (researchers coding events/texts)
4.) Dictionary methods (sentiment analysis)
5.) Supervised machine learning (using corpora)
6.) Unsupervised machine learning (topic modeling)

http://aeshin.org/textmining/
http://dx.doi.org/10.1111/j.1540-5907.2009.00427.x

Infrastructure needs tend to vary by the form of text analysis. This slide is from a presentation on text analysis by Ryan Shaw (UNC) in Perkins Library's Text to Data Lecture Series. I feel it's safe to assume that most libraries have method one covered (and two may be a safe bet as well for libraries with internet connections), but methods three through six can entail specialized hardware and software to meet researchers' needs.
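To make the lighter-weight end of the list concrete, here is a minimal sketch of methods two and four (counting words and a dictionary method) using only the Python standard library. The word lists `POSITIVE` and `NEGATIVE`, the helper names, and the sample sentence are all hypothetical and chosen for illustration; a real project would rely on a published, validated dictionary and more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; not a validated dictionary.
POSITIVE = {"good", "great", "success", "appealing"}
NEGATIVE = {"bad", "poor", "risk", "challenge"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_words(text):
    """Method two: return a Counter mapping each word to its frequency."""
    return Counter(tokenize(text))


def dictionary_score(text):
    """Method four: (positive hits - negative hits) / number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)


sample = "Text mining is a great success, but licensing is a risk."
counts = count_words(sample)
print(counts.most_common(3))     # most frequent words in the sample
print(dictionary_score(sample))  # positive values lean positive
```

Both functions run comfortably on an ordinary desktop, which is consistent with the note that the first two methods rarely need special infrastructure.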

Infrastructure Issues
- Storage/scratch space
- Processing power
- Tools for analytics

More specifically, you need: 1.) Storage. As mentioned previously, text collections (and scans) can be quite large; our lab provides a TB of storage for our users on the machine. 2.) Processing power. Matrices in techniques such as topic modeling can grow quite substantial; our machines have 16 GB of RAM with 3.5 GHz quad-core processors that are able to deal with larger text challenges. 3.) Tools for analytics.

Our Workstations
- 16 gigs of memory
- 1 TB of storage
- 64-bit computing
- Intel Xeon 3.5 GHz, 4 cores
- Scanner available
- Fast networking
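As a rough illustration of why memory matters for the matrices that topic modeling builds, the sketch below estimates the size of a documents-by-terms matrix. The 17,000-document figure echoes the German studies question earlier in the talk; the vocabulary size, the average number of distinct terms per document, and the byte counts are assumptions made only for this back-of-the-envelope estimate.

```python
GIB = 2**30  # bytes in one gibibyte


def dense_matrix_gib(n_docs, vocab_size, bytes_per_cell=8):
    """Memory for a dense matrix with one cell per (document, term) pair."""
    return n_docs * vocab_size * bytes_per_cell / GIB


def sparse_matrix_gib(n_docs, avg_unique_terms, bytes_per_entry=12):
    """Rough memory for a sparse matrix that stores only nonzero counts.

    12 bytes per entry assumes an 8-byte value plus a 4-byte column index;
    row pointers are ignored because they are comparatively small.
    """
    return n_docs * avg_unique_terms * bytes_per_entry / GIB


n_docs = 17_000           # from the German studies example in the talk
vocab_size = 100_000      # assumed vocabulary size
avg_unique_terms = 1_500  # assumed distinct terms per article

dense = dense_matrix_gib(n_docs, vocab_size)
sparse = sparse_matrix_gib(n_docs, avg_unique_terms)
print(f"dense:  {dense:.2f} GiB")
print(f"sparse: {sparse:.2f} GiB")
```

Under these assumptions the dense matrix alone comes to roughly 12.7 GiB, most of a 16 GB workstation, while the sparse form stays well under 1 GiB. That difference is one reason text analysis tools commonly use sparse representations.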

The lab is designed to provide a responsive environment for tackling the most difficult big text or big data challenges.

Swappable Drives?

Now that the lab faces the issue of providing fast access to large (1 TB and up) collections, we are exploring adding swappable drives in the cluster for fast access (SATA connections) to text collections.

General Software

We offer a range of packages designed for general qualitative text research. Duke has site-licensed NVivo, which is helpful for manually coding documents. R and Python were already in place for our numeric and geospatial support, and the topicmodels package in R (CRAN) and the gensim library in Python (http://radimrehurek.com/gensim/) are now available as well.

Specialized Software

We also offer software tools that are specifically designed for a certain type of text analysis, such as topic modeling (in this case, Latent Dirichlet Allocation).

Service Model

Presently, Duke is still in the early stages of developing a service model for text analysis, but even at this point several elements are fairly clear.

Services - Staffing

Staff familiar with research design issues, data management, data munging, and ideally text analysis strategies. (Comment about how Angela Zoss, on the right, embodies all of these elements.) As Perkins Library has begun to move to a Research Commons model of public service, we realize that this may often take a team of library staff who hold different skill sets.

Expert on Visualization

Services - Staffing

http://aeshin.org/textmining/

As Ryan Shaw (UNC) notes, one area of focus for research support should be cultivating staff who can advise on preparing texts for analysis. If there is anywhere to start with staffing, it is here: staff who can assist with text processing/data processing provide a firm base for text analysis support that can later be expanded to include higher-level research support.

Services - Guides

http://library.duke.edu/data/guides/index.html

Services - Workshops

Speakers and contests don't hurt either for engaging your qualitative research community.

In Summary
- Lots of research potential
- Licensing may be an issue for some
- Easy way to get started text mining with little investment but maybe some risk?

Questions?
Joel Herndon
Molly Tamarkin
