I Can Do Text Analytics! Designing Development Tools for Novice Developers
Huahai Yang*, Daina Pupons-Wickham**, Laura Chiticariu*, Yunyao Li*, Benjamin Nguyen**, Arnaldo Carreno-fuentes*
*IBM Research - Almaden **IBM Software - Silicon Valley
OUTLINE
• Problem motivation
  – Text analytics
  – User population and needs
• Formative design iterations
  – Expert interviews
  – User studies in lab and field
• Current design and evaluation
  – Workflow Guide and Extraction Plan
  – Evaluation by competition
TEXT ANALYTICS
[Diagram: text sources feed text analytics, which feeds applications]
• Text sources (public text, web text, private text): social media, news, SEC, USPTO, internal data, subscription data
• Applications: marketing, financial investment, drug discovery, law enforcement, …
HIDDEN VALUES IN TEXT
DREAM
REALITY
TEXT ANALYTICS IS HARD
ML DOES NOT SAVE THE DAY
Wagstaff, K. Machine Learning that Matters. In ICML (2012)
ANNOTATION QUERY LANGUAGE (AQL)
• A declarative language for developing text analytics extractors [Chiticariu et al., 2010]
• Very expressive
• Runs very fast
SIMPLE EXAMPLE: OPINION ON A MOVIE
Input: Mission Impossible has an entertaining plot, but terrible acting.
Annotations: Movie = "Mission Impossible"; Opinion = "entertaining", "terrible"; Aspect = "plot", "acting"
Desired output, as (Movie Name, Aspect, Opinion):
(Mission Impossible, plot, positive)
(Mission Impossible, acting, negative)
SAMPLE AQL FOR OPINION ON A MOVIE
Pattern: <Movie>, then 0-15 tokens, then <Opinion>, then 0 tokens, then <Aspect>

-- Join the three views on token distance to form review snippets
create view MovieReviewSnippet as
select M.name as name, O.value as value, A.aspect as aspect,
       CombineSpans(M.name, A.aspect) as review
from Movie M, Opinion O, Aspect A
where FollowsTok(M.name, O.value, 0, 15)
  and FollowsTok(O.value, A.aspect, 0, 0);

-- Opinion words matched from a dictionary file
create view Opinion as
extract dictionary 'opinion.dict' on D.text as value
from Document D;

-- Aspect words matched from a dictionary file
create view Aspect as
extract dictionary 'aspect.dict' on D.text as aspect
from Document D;
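To make the pattern concrete, here is a rough Python emulation of the same logic: dictionary matches, then a token-distance join. It is only a sketch, not how the AQL runtime works. The dictionary contents, the tokenizer, and the helper names (`find_spans`, `follows_tok`, `movie_review_snippets`) are assumptions for illustration; the slides do not show the contents of 'opinion.dict' or 'aspect.dict'.

```python
import re

# Hypothetical stand-ins for the Movie view, 'opinion.dict' and 'aspect.dict'.
MOVIES = ["mission impossible"]
OPINIONS = {"entertaining": "positive", "terrible": "negative"}
ASPECTS = ["plot", "acting"]


def tokenize(text):
    """Lowercased word tokens, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def find_spans(tokens, phrases):
    """Return (begin, end, phrase) token spans for every dictionary hit."""
    spans = []
    for phrase in phrases:
        words = phrase.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                spans.append((i, i + len(words), phrase))
    return spans


def follows_tok(left, right, lo, hi):
    """True when `right` begins between lo and hi tokens after `left` ends."""
    gap = right[0] - left[1]
    return lo <= gap <= hi


def movie_review_snippets(text):
    """Emit (movie, aspect, polarity) for <Movie> 0-15 <Opinion> 0 <Aspect>."""
    tokens = tokenize(text)
    rows = []
    for m in find_spans(tokens, MOVIES):
        for o in find_spans(tokens, OPINIONS):
            for a in find_spans(tokens, ASPECTS):
                if follows_tok(m, o, 0, 15) and follows_tok(o, a, 0, 0):
                    rows.append((m[2], a[2], OPINIONS[o[2]]))
    return rows


if __name__ == "__main__":
    sentence = "Mission Impossible has an entertaining plot, but terrible acting."
    for row in movie_review_snippets(sentence):
        print(row)
```

The nested loops mirror the three-way join in the `where` clause. A declarative language lets the developer state only the pattern and leaves the join strategy to the engine.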
SKILLED PROGRAMMER, BUT NOVICE DEVELOPER IN TEXT ANALYTICS
• Named entities
• Sentiment
• Purchase intent
• Consumer profile
• Root cause
• Risk analysis
• Protein interaction
CAN NOVICE DEVELOPERS BE PRODUCTIVE?
WHAT IS MISSING HERE?
BRING TEXT BACK TO TEXT ANALYTICS
WHAT DO EXPERT DEVELOPERS KNOW?
We designed tools to embody these best practices
FORMATIVE LAB STUDY
• 14 novice developers
• First given a tutorial on AQL
• Task: extract revenue by division from a company annual report
• Without the tool, none completed the task
• With the tool, all completed it within 90 minutes
FORMATIVE FIELD STUDY
• 12 weeks, 10 project members, 4 doing text analytics (4 or 5 hours per week)
• Built profiles for pharmaceutical companies
• Interviews
  – Participants reported that the tool was easy to use
  – Participants made many suggestions for UI enhancement
MAIN FEATURE: WORKFLOW GUIDE
MAIN FEATURE: EXTRACTION PLAN
CODE TEMPLATE FROM EXTRACTION PLAN
EVALUATION BY COMPETITION
• Task: buzz identification - identifying tweets mentioning the top 10 Billboard songs in the week of May 5, 2012
• Participants: summer interns, 6 registered, 4 submitted answers
• Prize: $500 for the winner
• Setup:
  – Participants were given labeled training data (159 tweets)
  – Participants wrote extractors independently with our tool
  – Extractor quality was measured on unseen test data (100 tweets)
Pre-competition Briefing
TASK HARDER THAN IT LOOKS
• RT @ardanradio #NowPlaying FUN feat Janelle Monae - We Are Young | #RIAUW
• RT @arieladriane: @1DirectionIndo what makes you beautiful - one direction cover by glee. http://t.co/t4BmvZbM
• @Cimorelliband @LisaCim @LaurenCimorelli payphone was amazing can you guys please do we are young by fun!! Thanks.
• RT @Jadore1Dx: Dear Mothers & fathers of 1D - as The Wanted would say, im glad you came.
• RT @Melisaaa11: My boyfriend knows hes jealous of my relationship with Justin Bieber
• Now u just somebody that I used to know!
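A minimal sketch of why the task is harder than it looks: a naive, case-insensitive title lookup fires on every sample tweet, including ones where the words may be used in their everyday sense. The `TITLES` list is an assumption inferred from phrases visible in the tweets; the slides give neither the actual top-10 list nor the gold label of each tweet.

```python
# Assumed song titles, inferred from the sample tweets (not listed in the slides).
TITLES = [
    "we are young",
    "what makes you beautiful",
    "payphone",
    "glad you came",
    "boyfriend",
    "somebody that i used to know",
]


def naive_buzz(tweet, titles=TITLES):
    """Return every title that occurs as a substring of the lowercased tweet."""
    text = tweet.lower()
    return [title for title in titles if title in text]


if __name__ == "__main__":
    samples = [
        "RT @ardanradio #NowPlaying FUN feat Janelle Monae - We Are Young | #RIAUW",
        "RT @Melisaaa11: My boyfriend knows hes jealous of my relationship with Justin Bieber",
        "Now u just somebody that I used to know!",
    ]
    for tweet in samples:
        print(naive_buzz(tweet), "<-", tweet)
```

The matcher cannot tell a title mention from ordinary use of the same words. A useful extractor needs extra context clues such as artist names, hashtags like #NowPlaying, or surrounding phrasing.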
PERFORMANCE MEASURE
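• Precision
  – Proportion of identified buzz that is real: P = |identified ∩ real| / |identified|
• Recall
  – Proportion of real buzz that is identified: R = |identified ∩ real| / |real|
• F1
  – Combining precision and recall: F1 = 2PR / (P + R)
[Diagram: within all test tweets, the set of tweets identified as buzz overlaps the set of real buzz]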
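The three measures can be computed directly from the two sets in the diagram. The sketch below uses the standard set-based definitions; the function name and the toy tweet IDs are illustrative assumptions, not the competition's scoring script or data.

```python
def precision_recall_f1(identified, real):
    """Compute precision, recall and F1 from two collections of tweet IDs.

    identified: tweets the extractor flagged as buzz
    real:       tweets labeled as real buzz
    """
    identified, real = set(identified), set(real)
    correct = len(identified & real)

    # Guard against empty sets so the function never divides by zero.
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(real) if real else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy IDs for illustration only.
    p, r, f1 = precision_recall_f1(identified={1, 2, 3, 4}, real={2, 3, 4, 5, 6})
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, an extractor scores well only if precision and recall are both reasonably high.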
EVALUATION RESULTS
• State of the art F1 is around 80% for similar tasks [Ritter et al. EMNLP’11; Liu et al. ACL’12]
INTERVIEW
• Interviewed before announcing winners
• All worked only the day before the deadline
• The winner worked only 5 hours
“Because the process is very clear, the wizard is very easy to follow”
“is quite helpful to analyze the sample data and define basic concepts. I used it extensively to create my dictionaries”
“I did not face any problems using the tool”
LOWER BARRIER TO COMPLEX DOMAIN
CONTRIBUTIONS
• Summarized the best practices of text analytics via expert interviews
• Built UI features to support these best practices
• Lowered the barrier and raised the ceiling for text analytics
FUTURE WORK
• Enable non-programmers to build text extractors with power similar to AQL's
• Collaborative text analytics
Q & A
More Info
Huahai Yang IBM Research - Almaden
• IBM InfoSphere BigInsights Text Analytics YouTube videos: http://bit.ly/10pfDgY
• Online classes: http://BigDataUniversity.com