SEASR Analytics
Loretta Auvil
lauvil@illinois.edu
Automated Learning Group
Data-Intensive Technologies and Applications,
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation
SEASR Overview
SEASR Focus
• Project’s focus:
– Developing
– Integrating
– Deploying
– Sustaining a set of
• Reusable and
• Expandable software components and
• Supporting framework
• SEASR can benefit a broad set of data mining applications for scholars in the humanities
SEASR Goals
• The key goals are:
– Support the development of a state-of-the-art software environment for unstructured data management and analysis of digital libraries, repositories and archives
– Develop user interfaces, a data-flow engine and the data-flows that support data management, analysis and visualization
– Support education and training through workshops to promote its usage among scholars
Workshop Objective
The objective of the workshop is to:
• Introduce SEASR
• Learn what analytics SEASR can do
The SEASR Picture
SEASR Enables Scholarly Research
Discovery
– What are the words used in the corpus?
– What named entities (people, locations, dates) can be extracted?
– What hypotheses or rules can be generated from the “features” of the corpus?
– What “features” or language of the corpus best describe the corpus?
– What are the “similarities” among elements, documents, or corpora?
– What patterns can be identified?
Enables Scholar to Ask…
Pattern identification using automated learning
– Which patterns are characteristic of the English language?
– Which patterns are characteristic of a particular author, work, topic, or time?
– Which patterns based on words, phrases, sentences, etc. can be extracted from literary bodies?
– Which patterns are identified based on grammar or plot constructs?
– When are correlated patterns meaningful?
– Can they be categorized based on specific criteria?
– Can an author’s intent be identified given an extracted pattern?
Tag Cloud
• Counts tokens
• Several different filtering options supported
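The token counting behind a tag cloud can be sketched in a few lines of Python. This is only an illustration, not the SEASR component itself: the stop-word list, the `min_length` filter, and the function name `token_counts` are assumptions standing in for whatever filtering options the flow actually offers.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real flow would likely use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "it", "was"}

def token_counts(text, stop_words=STOP_WORDS, min_length=2):
    """Count lowercase word tokens, skipping stop words and very short tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_length]
    return Counter(kept)

counts = token_counts("It was the best of times, it was the worst of times.")
# The most frequent tokens would be drawn largest in the cloud.
top = counts.most_common(3)
```

In a tag cloud, each count would then be mapped to a font size.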
Flesch-Kincaid Readability Test
• Results show scores for each item selected
– Designed to indicate comprehension difficulty when reading a passage of contemporary academic English
– Flesch Reading Ease: higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read
– Flesch–Kincaid Grade Level: result is a number that corresponds with a grade level
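Both scores are computed from sentence, word, and syllable counts using the standard published formulas. The sketch below applies those formulas in Python; the sentence splitter and the vowel-group syllable counter are crude assumptions for illustration and are not necessarily how the SEASR flow counts them.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels (minimum 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_stats(text):
    """Return (sentence count, word count, syllable count) for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(sentences, words, syllables):
    # Higher scores mean easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(sentences, words, syllables):
    # Result approximates a U.S. school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

stats = text_stats("The cat sat on the mat. The dog ran.")
ease = flesch_reading_ease(*stats)
grade = flesch_kincaid_grade(*stats)
```

Very simple text like this example can score above 100 on Reading Ease and below zero on Grade Level; the scales are not clamped.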
Dunning Loglikelihood
• Feature comparison of tokens
• Specify an analysis document/collection
• Specify a reference document/collection
• Perform statistical comparison using Dunning Loglikelihood
Example showing over-represented tokens
Analysis Set: The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens
Reference Set: The Project Gutenberg EBook of Great Expectations, by Charles Dickens
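For a single token, the comparison can be sketched as below. This uses the two-term form of the log-likelihood statistic that is common in corpus linguistics (observed versus expected counts in the two sets); whether the SEASR component uses this form or the full four-cell contingency table is not stated in the slides, and the counts in the example are made up for illustration.

```python
import math

def dunning_g2(count_analysis, total_analysis, count_reference, total_reference):
    """Two-term log-likelihood (G2) for one token across two collections."""
    combined = count_analysis + count_reference
    total = total_analysis + total_reference
    # Expected counts if the token were equally frequent in both collections.
    expected_analysis = total_analysis * combined / total
    expected_reference = total_reference * combined / total
    g2 = 0.0
    if count_analysis > 0:
        g2 += count_analysis * math.log(count_analysis / expected_analysis)
    if count_reference > 0:
        g2 += count_reference * math.log(count_reference / expected_reference)
    return 2.0 * g2

def is_over_represented(count_analysis, total_analysis, count_reference, total_reference):
    """G2 is unsigned, so direction comes from comparing relative frequencies."""
    return count_analysis / total_analysis > count_reference / total_reference

# Illustrative counts only: a token seen 30 times in 1000 analysis tokens
# and 10 times in 1000 reference tokens.
score = dunning_g2(30, 1000, 10, 1000)
same = dunning_g2(10, 1000, 10, 1000)
```

A larger score means stronger evidence that the token's frequency differs between the two sets; tokens are typically ranked by score and split into over- and under-represented lists.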
Date Entities to Simile Timeline
• Entity Extraction with OpenNLP
• Dates viewed on Simile Timeline
Frequent Patterns
• Given: Set of documents
• Find Frequent Patterns such that
– Common word patterns used in the collection
• Evaluation: What Are Good Patterns?
• Results: 1060 patterns discovered
322: Lincoln, 147: Abe, 117: man, 100: Mr., 100: time, 98: Lincoln Abe, 91: father, 85: Lincoln Mr., 85: Lincoln man, 75: day, 70: Abraham,
70: President, 68: boy, 67: Lincoln time, 65: Lincoln Abraham, 65: life, 63: Lincoln father, 57: men, 57: work, 52: Lincoln day, …
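Results of this shape (single words and word pairs with their support counts) can be reproduced in miniature with a brute-force count. This sketch is not the algorithm SEASR uses; it simply enumerates word sets up to a small size. The unit of counting (here, one string per "document"), the `min_support` threshold, and the toy input are all assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(documents, min_support=2, max_size=2):
    """Count word sets (up to max_size words) by the number of documents containing them."""
    counts = Counter()
    for doc in documents:
        # Use the set of distinct words so each document counts a pattern once.
        words = sorted(set(doc.lower().split()))
        for size in range(1, max_size + 1):
            for pattern in combinations(words, size):
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Toy input for illustration; not the corpus behind the results above.
docs = ["lincoln abe father", "lincoln abe", "lincoln man", "father day"]
patterns = frequent_patterns(docs)
```

Brute-force enumeration grows quickly with vocabulary size, which is why dedicated frequent-pattern mining algorithms are used on real collections.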
HITS Summarizer
• Find the top sentences and tokens from all items submitted
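One way to rank sentences and tokens together is to run HITS on a bipartite graph. The slides do not say how the summarizer builds its graph, so the sketch below assumes sentences act as hubs and tokens as authorities: a sentence scores highly if it contains high-scoring tokens, and a token scores highly if it appears in high-scoring sentences. The function name, whitespace tokenization, and iteration count are illustrative choices.

```python
import math

def hits_rank(sentences, iterations=20):
    """Return (sentence hub scores, token authority scores) after HITS iterations."""
    token_sets = [set(s.lower().split()) for s in sentences]
    vocab = set().union(*token_sets)
    hub = [1.0] * len(sentences)
    auth = {}
    for _ in range(iterations):
        # Authority update: sum the hub scores of sentences containing the token.
        auth = {t: sum(h for h, ts in zip(hub, token_sets) if t in ts) for t in vocab}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {t: v / norm for t, v in auth.items()}
        # Hub update: sum the authority scores of the tokens in the sentence.
        hub = [sum(auth[t] for t in ts) for ts in token_sets]
        norm = math.sqrt(sum(v * v for v in hub)) or 1.0
        hub = [v / norm for v in hub]
    return hub, auth

sentences = [
    "lincoln was president",
    "lincoln was born in kentucky",
    "the weather",
]
hub, auth = hits_rank(sentences)
best_sentence = sentences[hub.index(max(hub))]
```

The top sentences and tokens are then the highest-scoring hubs and authorities; a sentence that shares no tokens with the rest of the text drifts toward a score of zero.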
Text Clustering
• Clustering of Text by token counts
• Filtering options for stop words, Part of Speech
• Dendrogram Visualization
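Clustering by token counts can be sketched as bottom-up (agglomerative) clustering, where each recorded merge corresponds to one join in the dendrogram. The choices here (cosine distance, single linkage, whitespace tokenization, and the `stop_words` parameter) are assumptions for illustration and may differ from the SEASR flow.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 minus the cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def agglomerate(texts, stop_words=frozenset()):
    """Merge the closest clusters repeatedly; return the merge history."""
    vectors = [
        Counter(t for t in text.lower().split() if t not in stop_words)
        for text in texts
    ]
    clusters = [[i] for i in range(len(texts))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

texts = ["whale ship sea", "whale sea captain", "garden tea party"]
merges = agglomerate(texts)
```

Each entry in `merges` holds the two clusters joined (as lists of text indices) and the distance at which they joined, which is the height of that join in a dendrogram.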
Audio Analysis
• NEMA: Executes a SEASR flow for each run
– Loads audio data
– Extracts features for every 10 sec moving window of audio
– Loads and applies the models
– Sends results back to the WebUI
• NESTER: Annotation of Audio via Spectral Analysis
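The moving-window step of the NEMA flow can be illustrated with a small generator. Only the 10-second window length comes from the slide; the 5-second hop, the RMS energy feature, and the function names are assumptions chosen to keep the sketch short, and the actual flow extracts its own features.

```python
def moving_windows(samples, sample_rate, window_seconds=10.0, hop_seconds=5.0):
    """Yield (start time in seconds, window of samples) for each full window."""
    window = int(window_seconds * sample_rate)
    hop = int(hop_seconds * sample_rate)  # hop length is an assumed parameter
    for start in range(0, len(samples) - window + 1, hop):
        yield start / sample_rate, samples[start:start + window]

def rms_energy(frame):
    """Placeholder feature: root-mean-square energy of one window."""
    return (sum(x * x for x in frame) / len(frame)) ** 0.5

# Toy signal: 25 seconds of constant audio at 4 samples per second.
sample_rate = 4
samples = [1.0] * 100
features = [(t, rms_energy(frame)) for t, frame in moving_windows(samples, sample_rate)]
```

Each window's feature vector would then be passed to the loaded models, giving one prediction per window.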
Emotion Tracking
The goal is to provide this type of visualization for tracking emotions across a text document (leveraging flare.prefuse.org)
Future: Application for Meme
“MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs”
Where can I Run SEASR Analysis
• Services that can be executed from
– SEASR website
– Zotero
– MONK
– VUE
SEASR Community Hub
• Explore existing flows to find others of interest
– Keyword Cloud
– Connections
• Find related flows
• Execute flow
• Comments
What is Zotero? (from Zotero Quick Start Guide)
• A citation manager. It is designed to store, manage, and cite bibliographic references, such as books and articles. In Zotero, each of these references constitutes an item.
• An extension for the Firefox web-browser by the Center for History and New Media at George Mason University.
• Installed by visiting zotero.org and clicking the download button on the page.
SEASR Analytics for Zotero
• An extension for the Firefox web-browser by the SEASR Team
• Uses your Zotero Collections
• Performs analysis using SEASR Services
The Value Add for SEASR & Zotero
• Analytical Results are saved as Zotero items (View Snapshot)
– Includes metadata
– Item naming strategy identifies the item or collection processed
– Creator indicates the Menu Label of the SEASR Analysis
• Related Tab links to the items processed in the Analysis
• No need to install the analysis; it runs as a web service
MONK
Executes flows for each analysis requested
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
– Feature comparisons
SEASR Support in VUE
• Goal: Provide functionality in VUE to use SEASR flows
• Implementations:
– Add content to map
– Get metadata for content
– Get information about content
Meandre Workbench
• Web-based UI
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag and drop interface
• Change property values
• Execute the flow
Extensible to Analysis that You Create
• You can run the flows we provide on your own server or ask your university to host this analysis
• You can modify these flows and redeploy
• You can create new flows
– Perhaps you want to see only nouns or verbs
– Perhaps you want to see a list of extracted entities
• You can share these flows back to the community
Zotero to SEASR: Fedora
[Diagram labels: Repository Search & Browse; Web Service; Interactive Web Application; Zotero Upload to Repository]
JSTOR Data for Research: SEASR Accesses APIs
• Access JSTOR API in SEASR components
• Use the output of these components with existing SEASR components
SEASR Central
• Sharing and finding flows and components
[Screenshot: SEASR Central page with navigation (Categories, Recently Added, Top 50, Submit, About, RSS), a featured component (Word Counter) and a featured flow (FPGrowth), and a browse table listing components and flows by type, category (Image, JSTOR, Zotero), and name (Author Centrality, Readability, Upload Fedora)]
Discussion Questions
• What kinds of data assets are you interested in?
• What analysis would you like to use against this data?