Hadoop and NoSQL Basics: Big Data Demystified
NYS Innovation Summit, 12/17/2013
Matt LeMay, @mattlemay
“When I want people to think I’m smart, I just say ‘HADOOP’ really loud.”
“Big Data!”
“Data Science!”
“Hadoop! There it is.”
“Algorithms!”
... why are we thinking about this at all?
ALL the data created until the year 2003
= ALL the data created every two days
Writes > 12 terabytes of data per day.
*the 451 group
... how did we get here?
HIERARCHICAL DATABASE MODEL
RELATIONAL DATABASE MODEL
DOCUMENT DATABASE MODEL
HIERARCHICAL DATABASE MODEL
• Used in early mainframe computing
• Stores data in one-to-many “trees”
• Not very flexible

Fruit
  Apple
    Granny Smith
    Honeycrisp
    Red Delicious
  Orange
  Grape
RELATIONAL DATABASE MODEL
• Invented in 1970 by Edgar F. Codd at IBM
• Stores data in “tuples” which resemble rows of a table
• Still the most widely used database model

Fruit_Variety   Fruit
--------------  ------
Granny Smith    Apple
Honeycrisp      Apple
Red Delicious   Apple
Navel           Orange
RELATIONAL DATABASE MODEL
• ... can also store hierarchical data!
Fruit_ID  Fruit_Name
--------  ----------
1         Orange
2         Apple
3         Grape

Variety_ID  Variety_Name   Fruit_ID
----------  -------------  --------
1           Granny Smith   2
2           Honeycrisp     2
3           Red Delicious  2
4           Navel          1
• Has rigid structure or “schema.”
• Uses unique “keys” for consistency across “tables”
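A minimal sketch of what “keys for consistency” means in practice, using Python's built-in `sqlite3` module. The slides never name the two tables, so `fruit` and `variety` are assumed names, and the “Concord” and “Mystery” rows are made up for illustration. The database accepts a variety that points at a real fruit and refuses one that points at a Fruit_ID that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Table names are assumptions; the column names come from the slides.
conn.execute(
    "CREATE TABLE fruit (Fruit_ID INTEGER PRIMARY KEY, Fruit_Name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE variety (
           Variety_ID INTEGER PRIMARY KEY,
           Variety_Name TEXT NOT NULL,
           Fruit_ID INTEGER NOT NULL REFERENCES fruit(Fruit_ID))"""
)
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?)",
    [(1, "Orange"), (2, "Apple"), (3, "Grape")],
)
conn.executemany(
    "INSERT INTO variety VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2),
     (3, "Red Delicious", 2), (4, "Navel", 1)],
)


def try_insert_variety(variety_id, name, fruit_id):
    """Return True if the row is accepted, False if a key rule rejects it."""
    try:
        conn.execute(
            "INSERT INTO variety VALUES (?, ?, ?)", (variety_id, name, fruit_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert_variety(5, "Concord", 3)   # Fruit_ID 3 (Grape) exists
rejected = try_insert_variety(6, "Mystery", 99)  # no fruit has Fruit_ID 99
```
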
DOCUMENT DATABASE MODEL
Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

• Doesn’t have a single structure or “schema” that each entry must follow
• Developed in 1995 for use with Lotus Notes
• SO TRENDY
DOCUMENT DATABASE MODEL
• CAN have structured elements, but structure doesn’t need to be consistent across entries
{
  "Fruits": [
    {
      "Type": "Apple",
      "Variety": "Red Delicious"
    },
    {
      "Name": "Granny Smith Apple"
    },
    "Navel Orange"
  ]
}
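Here is a rough sketch, in Python, of what it costs to read entries that do not share a structure. The JSON is the document from the slide; the `describe` helper is a hypothetical name, and the rule it applies (prefer “Name”, otherwise join “Variety” and “Type”) is an assumption made only for this example.

```python
import json

# The document from the slide, written as valid JSON.
raw = """
{
  "Fruits": [
    {"Type": "Apple", "Variety": "Red Delicious"},
    {"Name": "Granny Smith Apple"},
    "Navel Orange"
  ]
}
"""

doc = json.loads(raw)  # JSON objects become dicts, arrays become lists


def describe(entry):
    """Turn any of the three entry shapes into one display string."""
    if isinstance(entry, str):   # a bare string entry
        return entry
    if "Name" in entry:          # an object with a single "Name" field
        return entry["Name"]
    # Otherwise assume the "Type" + "Variety" shape.
    return f'{entry["Variety"]} {entry["Type"]}'


labels = [describe(entry) for entry in doc["Fruits"]]
```

The flexibility is real, but so is the price: the code that reads the data has to know about every shape an entry might take.
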
HIERARCHICAL DATABASE MODEL
RELATIONAL DATABASE MODEL
DOCUMENT DATABASE MODEL
RIGID
FLEXIBLE
Relational Database is to Document Database
as Excel Spreadsheet is to Word Document
... as SQL is to NoSQL*
* ... mostly / sorta. Stay tuned!
SQL, or “Structured Query Language,” is a language for getting data into and out of a relational database.
“SELECT Variety_Name FROM fruits WHERE fruit_id = 2”
Variety_Name
----------------------
Granny Smith
Honeycrisp
Red Delicious
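To try the query, here is a small sketch using Python's built-in `sqlite3` module. The slides do not say which table is called `fruits`, so this assumes it is the one that holds Variety_Name and Fruit_ID. The query string itself is copied from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: "fruits" is the table with the variety columns.
conn.execute(
    "CREATE TABLE fruits ("
    "Variety_ID INTEGER PRIMARY KEY, Variety_Name TEXT, Fruit_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?, ?)",
    [(1, "Granny Smith", 2), (2, "Honeycrisp", 2),
     (3, "Red Delicious", 2), (4, "Navel", 1)],
)

# The query from the slide, unchanged (SQLite column names are case-insensitive).
cursor = conn.execute("SELECT Variety_Name FROM fruits WHERE fruit_id = 2")
result = [row[0] for row in cursor]
```
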
Depending on who you ask, “NoSQL” means “NOT SQL” or “NOT ONLY SQL.”
(in fact, some characterize NoSQL as a “movement,” not a particular
technology or set of technologies.)
“SQL Databases” are highly standardized.
“NoSQL Databases” are highly fragmented. Some are document model databases, some use a variation of a key-value store.
Document Databases
So, what are the characteristics of NoSQL databases* that make them so
trendy and exciting?
* Generally
Relational databases have strict “schemas” dictating the structure of data.
NoSQL databases are generally “schemaless,” even when they use key-value stores.
• Can start entering data before deciding on how that data will be formatted
• Less structured and less consistent
• More flexible
Relational databases can scale up (on one computer) but not easily out (across many computers).
NoSQL databases are designed to scale out across many computers.
• Lots of machines == BIG data
• More complicated to set up
• Can scale quickly if needed
• No single point of failure
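One way to picture “scaling out” is deciding which machine holds which record. The sketch below uses simple hash partitioning, which is an assumption for illustration and not something the slides specify; real systems often use more elaborate schemes. The machine names and record keys are made up.

```python
import hashlib


def node_for(key, nodes):
    """Pick a machine for a key by hashing the key (simple hash partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Hypothetical machine names and record keys.
nodes = ["comp1", "comp2", "comp3"]
keys = ["Granny Smith", "Honeycrisp", "Red Delicious", "Navel"]

# Every key lands on exactly one machine, and the same key always
# lands on the same machine, so any machine can work out where to look.
placement = {key: node_for(key, nodes) for key in keys}
```

Note that with plain modulo hashing, adding a machine changes where most keys live, which is part of why this is “more complicated to set up.”
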
Relational databases read and write information directly to a disk drive.
NoSQL databases store information in memory, and/or include robust built-in caching in memory.
• Faster
• Memory more expensive than disk
• Potential reliability issues
Relational databases follow the “ACID” model: Atomicity, Consistency, Isolation, Durability.
NoSQL databases do not follow the “ACID” model.
• More freedom to handle requests in a way that honors the uniqueness of “things.”
• Much greater room for (potentially serious) errors.
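A small sketch of the “A” in ACID (atomicity), using Python's built-in `sqlite3` module. The `stock` table, the crate counts, and the `move_crates` helper are all hypothetical. The point is only that a two-step change either happens completely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK rule forbids negative crate counts (hypothetical table).
conn.execute(
    "CREATE TABLE stock ("
    "fruit TEXT PRIMARY KEY, crates INTEGER NOT NULL CHECK (crates >= 0))"
)
conn.executemany("INSERT INTO stock VALUES (?, ?)", [("Apple", 5), ("Orange", 0)])
conn.commit()


def move_crates(src, dst, n):
    """Move n crates from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE stock SET crates = crates + ? WHERE fruit = ?", (n, dst)
            )
            conn.execute(
                "UPDATE stock SET crates = crates - ? WHERE fruit = ?", (n, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = move_crates("Apple", "Orange", 3)       # fine: Apple 5 -> 2, Orange 0 -> 3
failed = move_crates("Apple", "Orange", 10)  # Apple would go negative: undone
balances = dict(conn.execute("SELECT fruit, crates FROM stock"))
```

In the failed call the first UPDATE had already added crates to Orange, and the rollback removes them again. Without that guarantee, the application has to clean up half-finished changes itself.
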
Relational databases represent data as “rows” and “columns.”
NoSQL databases often represent data in formats such as JSON, which map directly onto the native data structures of many programming languages.
• Easier, faster for programmers
• Harder for non-programmers
SO WAIT, THOUGH, how the f*** do you find anything in a NoSQL database????
HADOOP is an open source framework for doing MapReduce.
MapReduce is one way to make sense of a document database.
(That’s how GOOGLE does it.)
MapReduce has two core steps: Map and Reduce.
... both are pretty much what they sound like.
This is what it actually looks like:

function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)

MAP: “For a given document, map each word, phrase, or item to the number of times that word, phrase, or item appears.”

REDUCE: “NOW, take all of those maps from every document, and reduce them to a single list of items and counts.”
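The pseudocode above can be sketched as runnable Python. This is a single-machine toy, not how Hadoop itself is written: `run_word_count` is a hypothetical stand-in for the framework, which normally does the grouping between the two steps for you. The document names (`doc1` and so on) are made up.

```python
from collections import defaultdict


def map_document(name, document):
    """MAP: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield (word, 1)


def reduce_word(word, partial_counts):
    """REDUCE: add up all the counts that were emitted for one word."""
    return (word, sum(partial_counts))


def run_word_count(documents):
    """Stand-in for the framework: map, group by word, then reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_document(name, text):
            grouped[word].append(count)
    return dict(reduce_word(word, counts) for word, counts in grouped.items())


documents = {
    "doc1": "Red Delicious Apple",
    "doc2": "Honeycrisp Apple",
    "doc3": "Granny Smith Apple",
    "doc4": "Navel Orange",
}
counts = run_word_count(documents)
```
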
Red Delicious Apple
Honeycrisp Apple
Granny Smith Apple
Navel Orange

MAP
(Red, 1) (Delicious, 1) (Apple, 1)
(Honeycrisp, 1) (Apple, 1)
(Granny, 1) (Smith, 1) (Apple, 1)
(Navel, 1) (Orange, 1)

REDUCE
(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)
The hard work is distributed
The easy work is centralized
... but what if we’ve got our documents stored on multiple machines?

COMP 1
Red Delicious Apple
Honeycrisp Apple

COMP 2
Granny Smith Apple
Navel Orange
COMP 1
Red Delicious Apple
Honeycrisp Apple
MAP: (Red, 1) (Delicious, 1) (Apple, 1) (Honeycrisp, 1) (Apple, 1)
REDUCE: (Red, 1) (Delicious, 1) (Apple, 2) (Honeycrisp, 1)

COMP 2
Granny Smith Apple
Navel Orange
MAP: (Granny, 1) (Smith, 1) (Apple, 1) (Navel, 1) (Orange, 1)
REDUCE: (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1) (Apple, 1)

REDUCE (combining COMP 1 and COMP 2)
(Red, 1) (Delicious, 1) (Apple, 3) (Honeycrisp, 1) (Navel, 1) (Orange, 1) (Granny, 1) (Smith, 1)
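The two-machine picture can be sketched in Python as follows. The “machines” here are just two lists in one process, and the function names are hypothetical. Each machine maps and reduces its own documents, and only the small per-machine tallies travel to the final reduce.

```python
from collections import Counter


def map_reduce_locally(documents):
    """Runs on one machine: count the words in that machine's documents."""
    local = Counter()
    for text in documents:
        for word in text.split():
            local[word] += 1
    return local


def final_reduce(partials):
    """Runs once, centrally: merge the per-machine tallies."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


comp1_docs = ["Red Delicious Apple", "Honeycrisp Apple"]
comp2_docs = ["Granny Smith Apple", "Navel Orange"]

partial1 = map_reduce_locally(comp1_docs)  # Apple counted twice here
partial2 = map_reduce_locally(comp2_docs)  # Apple counted once here
totals = final_reduce([partial1, partial2])
```

The hard work (reading every document) is spread across machines, and the central step only has to add a few numbers.
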
Is this the easiest way to count apples?
NOT*
* relational database
Tweet Text: “I am so happy!”
Tweet Location: “Albuquerque, NM”
User Home: “New York, NY”

Tweet Text: “#FML #FML #FML”
Tweet Location: “Palo Alto, CA”
User Home: “San Francisco, CA”

MAP (WITH MATH + SENTIMENT): each tweet becomes (Distance in Miles, Sentiment Score)
(1808, +.9)
(33, -.6)

REDUCE
(1808, +.9) (33, -.6)

RINSE AND REPEAT LIKE A MILLION TIMES
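A rough sketch of what the “math + sentiment” map step could look like in Python. Everything specific here is an assumption: the city coordinates are approximate, the distance is straight-line (haversine) and so will not exactly match the figures on the slide, and `TOY_SENTIMENT` is a two-word placeholder that simply reuses the slide's scores, not a real sentiment model.

```python
import math

# Approximate city-centre coordinates (latitude, longitude); assumed values.
COORDS = {
    "Albuquerque, NM": (35.0844, -106.6504),
    "New York, NY": (40.7128, -74.0060),
    "Palo Alto, CA": (37.4419, -122.1430),
    "San Francisco, CA": (37.7749, -122.4194),
}

# Placeholder scores for illustration only.
TOY_SENTIMENT = {"happy": 0.9, "#fml": -0.6}

EARTH_RADIUS_MILES = 3958.8


def miles_between(place_a, place_b):
    """Straight-line (great-circle) distance using the haversine formula."""
    lat1, lon1 = map(math.radians, COORDS[place_a])
    lat2, lon2 = map(math.radians, COORDS[place_b])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def toy_sentiment(text):
    """Average the placeholder scores of any known words in the text."""
    words = [w.strip("!.,?").lower() for w in text.split()]
    scores = [TOY_SENTIMENT[w] for w in words if w in TOY_SENTIMENT]
    return sum(scores) / len(scores) if scores else 0.0


def map_tweet(tweet):
    """MAP: one tweet in, one (distance in miles, sentiment score) pair out."""
    distance = round(miles_between(tweet["location"], tweet["home"]))
    return (distance, toy_sentiment(tweet["text"]))


tweets = [
    {"text": "I am so happy!", "location": "Albuquerque, NM", "home": "New York, NY"},
    {"text": "#FML #FML #FML", "location": "Palo Alto, CA", "home": "San Francisco, CA"},
]
pairs = [map_tweet(tweet) for tweet in tweets]
```
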
... none of this is magic.
... in fact, the “magic” part is just a precursor to doing the actual hard work.
danah boyd and Kate Crawford’s Six Provocations for Big Data:
1. Automating Research Changes the Definition of Knowledge
2. Claims to Objectivity and Accuracy are Misleading
3. Bigger Data are Not Always Better Data
4. Not All Data Are Equivalent
5. Just Because it is Accessible Doesn’t Make it Ethical
6. Limited Access to Big Data Creates New Digital Divides
What about THE FUTURE?
HIERARCHICAL DATABASE MODEL
RELATIONAL DATABASE MODEL
DOCUMENT DATABASE MODEL
RIGID
FLEXIBLE
?
Further Reading:
Martin Fowler on NoSQL: http://martinfowler.com/nosql.html
Helpful Stack Overflow thread: http://stackoverflow.com/questions/11844603/technology-decision-sql-vs-nosql-vs-newsql
Finding Friends with MapReduce: http://stevekrenzel.com/finding-friends-with-mapreduce
Choosing a Database That’s Right for Your Business: http://slashdot.org/topic/bi/choosing-a-database-right-for-business-2/
Demystifying the Role of Big Data in Marketing: http://www.guardian.co.uk/media-network/media-network-blog/2013/mar/12/big-data-marketing-demystified
The NoSQL Movement: http://strata.oreilly.com/2012/02/nosql-non-relational-database.html
Big Data Tools Cost Too Much, Do Too Little: http://www.theregister.co.uk/2013/02/28/hadoop_no_sql_dont_believe_the_hype/
Is Big Data an Economic Big Dud?: http://www.nytimes.com/2013/08/18/sunday-review/is-big-data-an-economic-big-dud.html?hp&_r=1&
Six Provocations for Big Data: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431