TextMOLE: Text Mining Operations Library and Environment Daniel B. Waegel and April Kontostathis,...


Page 1: TextMOLE: Text Mining Operations Library and Environment Daniel B. Waegel and April Kontostathis, Ph.D. Ursinus College Collegeville PA.

TextMOLE: Text Mining Operations Library and Environment

Daniel B. Waegel and April Kontostathis, Ph.D.

Ursinus College
Collegeville PA

Page 2:

What?

Advanced application for indexing and searching a text database.

Allows users to quickly analyze a corpus of documents and determine which parameters will provide maximal retrieval performance.

Page 3:

Who?

Instructors - demonstrate information retrieval concepts in the classroom

Students - hands-on exploration of concepts often covered in an introductory course in information retrieval or artificial intelligence

Researchers - ‘quick and dirty’ analysis of an unfamiliar collection

Juniors and Seniors - capstone experiences in computer science

Page 4:

Why?

Students unfamiliar with applications which require manipulation of unstructured text

IR students develop basic IR systems, but do not have time to implement and test a variety of parameters

Existing systems do not tightly integrate indexing and retrieval functions
– R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley/ACM Press, New York, 1999.
– R. K. Belew. Finding Out About. Cambridge University Press, 2000.
– G. Salton. The SMART Retrieval System–Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, New Jersey, 1971.

Time! Students in AI do not even have time to implement a basic IR system.

Page 5:

How?

Overview of the Application
– Indexing
– Single Query Retrieval
– Multiple Query Retrieval

Sample Assignments
– Artificial Intelligence
– Information Retrieval
– Capstone Projects

Page 6:

Indexing

Page 7:

Single Query Specification

Page 8:

Single Query Results

Page 9:

Multiple Query Specification

Page 10:

Multiple Query Results

Page 11:

How?

Overview of the Application
– Indexing
– Single Query Retrieval
– Multiple Query Retrieval

Sample Assignments
– Artificial Intelligence
– Information Retrieval
– Capstone projects

Page 12:

Information Retrieval Course

Assignment 2
– Assumes Assignment 1 had students develop their own rudimentary IR systems
– Using a corpus provided by the instructor or developed by the student (min. 100 documents)
  – Convert to XML format
  – Parse with TextMOLE
  – Identify a set of standard queries for the collection (truth set not necessary)
  – Vary parameters (stemming vs. no stemming, various weighting schemes, various stop lists)
  – Decide which set of parameters works best for your collection. Write a paper describing your experiments and the results; be sure to defend your conclusions!
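One way to keep the parameter variations in this assignment organized is to enumerate every combination before running anything. The sketch below is only an illustration: the option names (`raw_tf`, `tf_idf`, `log_entropy`, and the stop list labels) are assumptions made for the example, not TextMOLE settings, and each run would still be carried out in the application itself.

```python
from itertools import product

# Hypothetical option labels, chosen for illustration only.
STEMMING_OPTIONS = [False, True]
WEIGHTING_SCHEMES = ["raw_tf", "tf_idf", "log_entropy"]
STOP_LISTS = ["none", "short", "long"]


def experiment_grid(stemming, weightings, stop_lists):
    """Return one dict per combination of the three parameter lists."""
    return [
        {"stemming": s, "weighting": w, "stop_list": sl}
        for s, w, sl in product(stemming, weightings, stop_lists)
    ]


runs = experiment_grid(STEMMING_OPTIONS, WEIGHTING_SCHEMES, STOP_LISTS)

# Print a numbered checklist of the runs to perform and write up.
for number, run in enumerate(runs, start=1):
    print(number, run)
```

With two stemming settings, three weighting schemes, and three stop lists, the grid has 18 runs, which gives a sense of how quickly the experiment count grows as options are added.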

Page 13:

Information Retrieval Course

Assignment 3 or 4
– Using the corpus from the previous assignment (minimum of 100 documents)
– Develop a set of standard queries
– Determine which documents are truly relevant to these queries (involves lots of reading and frustration)
– Use the Multiple Query function of TextMOLE to determine precision and recall

Alternate
– Use one or more of the Gold Standard Collections that have a set of standard queries with truth sets (TextMOLE can convert them to XML format)
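For reference, the precision and recall figures in this assignment follow the usual definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, assuming each query's results and truth set are available as lists of document IDs (the function names and IDs are hypothetical and not part of TextMOLE):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def mean_precision_recall(results, truth):
    """Average precision and recall over every query in the truth set.

    results and truth both map a query ID to a list of document IDs.
    A query with no entry in results counts as retrieving nothing.
    """
    pairs = [precision_recall(results.get(q, []), truth[q]) for q in truth]
    if not pairs:
        return 0.0, 0.0
    mean_p = sum(p for p, _ in pairs) / len(pairs)
    mean_r = sum(r for _, r in pairs) / len(pairs)
    return mean_p, mean_r


# Made-up example data for two queries.
example_results = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5"]}
example_truth = {"q1": ["d2", "d4", "d7"], "q2": ["d5", "d6"]}
print(mean_precision_recall(example_results, example_truth))
```

Averaging per query, as done here, weights every query equally regardless of how many relevant documents it has; pooling all queries before dividing is another common choice.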

Page 14:

Artificial Intelligence Course

IR Assignment
– Instructor provides set of documents in XML format and set of standard queries (with or without result set)
– Instructor provides students with parameters to use (e.g., stemming, log entropy weighting for both indexing and retrieval)
– Students try to find the ‘best’ stop word list for this collection
– Write brief paper describing experiments and results
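Log entropy weighting, named above as an example parameter, is commonly defined as a local weight log(1 + tf) multiplied by a global weight 1 + (Σ_j p_ij log p_ij) / log n, where p_ij is the share of term i's occurrences that fall in document j and n is the number of documents. The sketch below follows that common textbook definition with natural logarithms. Whether TextMOLE's own implementation matches it in every detail is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def log_entropy_weights(docs):
    """Weight terms in tokenized documents with log-entropy.

    docs is a list of token lists; the result is one {term: weight}
    dict per document, in the same order.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")

    term_freqs = [Counter(doc) for doc in docs]

    # Total occurrences of each term across the whole collection.
    global_freq = Counter()
    for tf in term_freqs:
        global_freq.update(tf)

    # Global weight: 1 + sum_j(p_ij * log p_ij) / log n.
    # Terms spread evenly over all documents get a weight near 0;
    # terms concentrated in a single document get a weight of 1.
    global_weight = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for tf in term_freqs:
            if term in tf:
                p = tf[term] / gf
                entropy_sum += p * math.log(p)
        global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)

    # Final weight: global weight times the local weight log(1 + tf).
    return [
        {term: global_weight[term] * math.log(1 + count)
         for term, count in tf.items()}
        for tf in term_freqs
    ]


weights = log_entropy_weights([["cat", "dog"], ["cat", "fish"]])
print(weights)
```

In this two-document example, "cat" appears equally in both documents and so carries no discriminating weight, while "dog" and "fish" each occur in only one document and keep their full local weight. That behaviour is why the choice of stop word list interacts with the weighting scheme.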

Page 15:

Capstone Experiences in Computer Science

Migrate TextMOLE to another platform
– OpenGL
– Java
– Web based
– Relational Database
– Library Functions

Add additional parameters to basic Search and Retrieval
– N-grams instead of words
– Noun phrases (using a tool like flex)
– Clustering
– Latent Semantic Indexing

Add additional IR applications
– Emerging trend detection
– Classification
– First Story Detection
– Filtering
– Summarization

Research in Computer Science
– Develop your own weighting scheme
– Identify additional features for indexing
– Develop a new Gold Standard collection

Page 16:

Where?

Version 1.0 now available online!
http://webpages.ursinus.edu/akontostathis/TextMOLE

Contact [email protected] with questions and comments