Bar camp2011 concept extraction

Automatic Concept Map Extraction A Presentation by Peter Mancini http://nectarineimp.com

description

This is the presentation I gave at BarCamp Nashville on October 15th, 2011. The notes and narration are missing. Go to http://nectarineimp.com to get the full presentation. Search for bcn11.

Transcript of Bar camp2011 concept extraction

  • 1. Automatic Concept Map Extraction. A Presentation by Peter Mancini. http://nectarineimp.com

  • 2. In The Beginning: It was 1939. In the course of the last four months it has been made probable - through the work of Joliot in France as well as Fermi and Szilard in America - that it may become possible to set up a nuclear chain reaction in a large mass of uranium, by which vast amounts of power and large quantities of new radium-like elements would be generated.

  • 3. The Manhattan Project: Ended an age based upon chemistry. Thrust the world into a more dangerous balance. Put to an end the ability to wage global war. Forced new thought on how to manage information.

  • 4. The Information Age: Vannevar Bush, 1945. Essay: As We May Think. First Science Advisor to the President. Predicted devices similar to smart phones. Wanted the Knowledge of the Ages in the hands of everyone.

  • 5. Information Age 2.0: The 9/11 attacks showed that even with massive information retrieval capabilities we were still vulnerable. We needed to connect the dots (known unknowns). We also needed to find dots we didn't know about (unknown unknowns). In Decision Management Theory these are called Unanticipated Decision Variables.

  • 6. The Age of Discovery: The difference between 1999 and 2011 is the emergence of discovery tools. Discovery is different from Search. Discovery can help along every leg of a problem that needs to be solved. Search can help you find who was the King of Spain in 1231 (try Googling it from your phone or netbook). Discovery can help you determine if Iran had pre-knowledge of 9/11 (try Googling that). Data Mining is the New Manhattan Project.

  • 7. If you can read this you are doing what a computer has to do to read text.

  • 8. Extracting Meaning: Meaning is represented by the way concepts relate to each other. We know what the concepts are because we can detect the nouns in a document. The meaning of the document comes from the concepts it speaks to. Here is how we do this using the Natural Language Toolkit.

  • 9. Extracting Meaning: I believe that banking institutions are more dangerous to our liberties than standing armies.
If the American people ever allow private banks to control the issue of their currency, first by inflation, then by deflation, the banks and corporations that will grow up around the banks will deprive the people of all property until their children wake up homeless on the continent their fathers conquered. The issuing power should be taken from the banks and restored to the people, to whom it properly belongs. Thomas Jefferson

  • 10. Extracting Meaning:

      import nltk, pprint
      from nltk.tokenize import *

      # plaintext holds the quotation from the previous slide
      paragraph = nltk.sent_tokenize(plaintext)
      # PunktWordTokenizer shipped with NLTK 2.x and was removed in NLTK 3
      tokenizer = PunktWordTokenizer()
      tokenizedSentences = [tokenizer.tokenize(sentence) for sentence in paragraph]
      #
      # Noise Reduction Goes Here
      #
      POSTaggedSentences = [nltk.pos_tag(sentence) for sentence in tokenizedSentences]

  • 11. Plaintext Tokenized: ['I', 'believe', 'that', 'banking', 'institutions', 'are', 'more', 'dangerous', 'to', 'our', 'liberties', 'than', 'standing', 'armies.'],

  • 12. Plaintext Tokenized: [('I', 'PRP'), ('believe', 'VBP'), ('that', 'IN'), ('banking', 'NN'), ('institutions', 'NNS'), ('are', 'VBP'), ('more', 'RBR'), ('dangerous', 'JJ'), ('to', 'TO'), ('our', 'PRP$'), ('liberties', 'NNS'), ('than', 'IN'), ('standing', 'NN'), ('armies.', 'NNP')],

  • 13. The Coefficient of Variation: Not every file is useful; some don't have enough information in them to determine what is most important. The coefficient is derived by looking at the frequency of each noun, taking the standard deviation of the set and dividing it by the mean. The higher the number, the more variation there is. We can determine which files to tag based upon this analysis: the ones with the higher variation are better for determining what they are about.

  • 14. Three zones represent an arbitrary decision on which files produced good Concept Maps, OK ones and finally poor ones. The lower the RSD (relative standard deviation, another name for the coefficient of variation), the worse the map. One third of the files were tossed, but the other two thirds contained 97% of the accumulated data.

  • 15. What is this Concept Map About? control panel Machine Setup scanner glass document feeder printer driver Mac OS X SyncThru Web Service Macintosh icon media Prints Report

  • 16. Questions and Answers: All slides available at NectarineImp.com. Additional inquiries can be sent to: [email protected]
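
The coefficient of variation from slide 13 can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's own code. The function names `noun_frequencies` and `coefficient_of_variation` are hypothetical, and so is the sample input. It also assumes that a noun is any token whose Penn Treebank tag starts with `NN`, and that the population standard deviation is meant, since the slides do not say which form was used. The input has the same shape as the tagged output on slide 12.

```python
from collections import Counter
from statistics import mean, pstdev


def noun_frequencies(tagged_sentences):
    """Count how often each noun occurs in a list of POS-tagged sentences."""
    counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            # Assumption: NN, NNS, NNP and NNPS all count as nouns.
            if tag.startswith("NN"):
                counts[word.lower()] += 1
    return counts


def coefficient_of_variation(counts):
    """Standard deviation of the noun counts divided by their mean."""
    values = list(counts.values())
    if not values:
        return 0.0  # a file with no nouns carries no usable signal
    # Assumption: population standard deviation; the slides do not specify.
    return pstdev(values) / mean(values)


# Hypothetical tagged input, in the same (word, tag) form as slide 12.
sample = [
    [("printer", "NN"), ("driver", "NN"), ("printer", "NN")],
    [("the", "DT"), ("printer", "NN"), ("prints", "VBZ")],
]

counts = noun_frequencies(sample)
cv = coefficient_of_variation(counts)
print(counts, cv)
```

A file whose nouns all occur equally often scores 0, which matches the slide's point that low-variation files say little about what they are about. The cut-offs between the three zones on slide 14 are not given in the slides, so none are assumed here.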