Human Language Technology in a Big Data World
@chris_biow #HLTCon (2016)
Big Data Universe
Sooo Big!
Tooo Big!
Taming Big
Taming Too Big
Exponential Hyperbole!!
Yer gonna die. Standard mountaineering warning
● Data is exploding without limit
● I can draw a curve on a semi-log-scale graph
● Even if that almost never happens in reality
● Buy my vision or drown in data
Wgsimon / Wikimedia Commons / Creative Commons Attribution-Share Alike 3.0 Unported
Exponential Reality
Qef / Wikimedia Commons / Public Domain
Human Language World
Exponential Sobriety
Most growth is exponential. Chris Lindblad
MarkLogic Founder
Measure     10^   2^   Example
Kilobyte    3     10   12 lines of 80 characters
Megabyte    6     20   500 pages, 48 hours typing
Gigabyte    9     30   30 minutes Twitter text feed
Terabyte    12    40   2 weeks Twitter text feed
Petabyte    15    50   Humanity typing for 8 hours
Exabyte     18    60   Humanity typing for 1 year
Zettabyte   21    70   Global IP traffic 2016 [Cisco 2013]
Yottabyte   24    80   (break glass in case of need)
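The two exponent columns above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the names `PREFIXES` and `scale_rows` are made up for illustration, and the ratio column is an addition that shows how far the binary and decimal readings drift apart as the prefixes grow.

```python
# Illustrative only: PREFIXES and scale_rows are hypothetical names.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def scale_rows():
    """Return (name, decimal exponent, binary exponent, binary/decimal ratio)."""
    rows = []
    for k, prefix in enumerate(PREFIXES, start=1):
        dec_exp = 3 * k    # 10^3, 10^6, ...
        bin_exp = 10 * k   # 2^10, 2^20, ...
        ratio = 2 ** bin_exp / 10 ** dec_exp
        rows.append((prefix + "byte", dec_exp, bin_exp, ratio))
    return rows


if __name__ == "__main__":
    for name, dec_exp, bin_exp, ratio in scale_rows():
        print(f"{name:<10} 10^{dec_exp:<3} 2^{bin_exp:<3} ratio {ratio:.3f}")
    # The kilobyte example from the table: 12 lines of 80 characters.
    print(12 * 80, "characters")
```

The kilobyte example checks out as 12 x 80 = 960 characters, just under either definition of a kilobyte.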
Distinguishing Big Data
Follow the money.
Volume     Bounded
Variety    Text and voice
Velocity   Latency
Value      Fixed % of all
Veracity   Not necessarily required
Big Data Tech
I shall not today attempt further to define [it], and perhaps I could never succeed in intelligibly doing so. But I know it when I see it…
Justice Potter Stewart, 1964 (emphasis added)
Defining Big Data
Data whose volume, velocity, and variety determine your choice of software and infrastructure.
Achieving Big Data
Year   Company     Customer      Project                    Quantity (M)   Size (GB)   Project Cost ($M)
2003   Verity      TRW, DIA      WISE                       40             200         10
2006   Veronomy    Bloomberg     News                       200            1,000       30
2009   MarkLogic   Gov & Comm.   OSINT                      2,000          200,000     100
2014   MongoDB     AWS           ReInvent goo.gl/xZVgdl     7,000          1,000,000   0.003
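One way to read the table is as cost per gigabyte. The sketch below does that division, assuming the last two figures in each row are Size (GB) and Project Cost ($M) as the headers suggest. The names `PROJECTS` and `dollars_per_gb` are illustrative, not from the talk.

```python
# Figures copied from the table; the column reading is an assumption.
# (year, company, size in GB, project cost in millions of dollars)
PROJECTS = [
    (2003, "Verity", 200, 10),
    (2006, "Veronomy", 1_000, 30),
    (2009, "MarkLogic", 200_000, 100),
    (2014, "MongoDB", 1_000_000, 0.003),
]


def dollars_per_gb(size_gb, cost_musd):
    """Convert a project cost in $M and a size in GB to dollars per GB."""
    return cost_musd * 1_000_000 / size_gb


if __name__ == "__main__":
    for year, company, size_gb, cost_musd in PROJECTS:
        rate = dollars_per_gb(size_gb, cost_musd)
        print(f"{year} {company:<10} ${rate:,.3f} per GB")
```

On these numbers the unit cost falls from about $50,000 per GB in 2003 to about $0.003 per GB in 2014.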
Features & Functions
Text-Ready Tech
State of the Mission in Text Analytics
Entity Extraction
Text Translation
Relationship Extraction
Name Translation
Search
Database
Language ID
Sentiment Analysis
Rare, new Languages
Name Translation
Alerting
Voice of the X
Partial Parse
Gap Solved
What language? bú
ana raye7 el gam3a el sa3a 3 el 3asr. el gaw 3amel eh elnaharda f eskendereya?
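The sample above is Arabic written in the Latin-script "chat alphabet", where digits stand in for Arabic letters that have no Latin equivalent. As a rough illustration of why this defeats naive language ID, the sketch below flags tokens that mix Latin letters with those digits. It is a toy heuristic, not a language identifier: the digit set in `ARABIZI_DIGITS` is an assumed, commonly cited subset, and `arabizi_tokens` is a hypothetical helper name.

```python
import re

# Assumption: digits often used as letter stand-ins in Arabic chat alphabet.
ARABIZI_DIGITS = set("2356789")

# Runs of ASCII letters and digits are treated as tokens.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def arabizi_tokens(text):
    """Return tokens containing both a Latin letter and a stand-in digit."""
    hits = []
    for tok in TOKEN.findall(text):
        has_letter = any(c.isalpha() for c in tok)
        has_digit = any(c in ARABIZI_DIGITS for c in tok)
        if has_letter and has_digit:
            hits.append(tok)
    return hits


if __name__ == "__main__":
    sample = ("ana raye7 el gam3a el sa3a 3 el 3asr. "
              "el gaw 3amel eh elnaharda f eskendereya?")
    print(arabizi_tokens(sample))
```

A bare number such as the standalone "3" is not flagged, since it contains no letters. Ordinary English text with digits inside words (product codes, for example) would be flagged incorrectly, which is one reason a real system needs more than this.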
Lessons Learned
• Requirements are wrong
• Every power of 4 will invalidate some requirements and solutions
• Agile processes fit Big HLT
• Measure against cost and mission at each increment
• Express requirements exponentially
• Expect competence and confidence with Big Data
• Progress exponentially (powers of 4)
• Adjust requirements as you learn how they meet the mission
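The "powers of 4" advice can be turned into a simple increment plan: start small, multiply the scale by 4 at each step, and revisit requirements at every step. A minimal sketch, with `milestones` as a hypothetical helper and the units left to the reader:

```python
def milestones(start, target, factor=4):
    """List scale checkpoints from start up to the first one >= target."""
    if start <= 0:
        raise ValueError("start must be positive")
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    steps = [start]
    while steps[-1] < target:
        steps.append(steps[-1] * factor)
    return steps


if __name__ == "__main__":
    # For example, growing from 1 GB toward roughly 1 TB.
    for size in milestones(1, 1000):
        print(f"{size} GB: re-check requirements, cost, and mission fit")
```

Going from 1 to about 1,000 takes only five multiplications by 4, so each checkpoint is a chance to find out which requirements and solutions the new scale has invalidated.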