September 24, 2018 Sam Siewert
CS317 - File and Database Systems
Lecture 5, Part-2
http://www.ibmbigdatahub.com/video/ibm-big-data-minute-drowning-petabytes
http://bigdata-madesimple.com/dilberts-20-funniest-cartoons-on-big-data/
“Drowning in Data - Every 2 days the world generates as much data as it had through all of history up to the year 2003. ....” - IBM Big Data & Analytics Hub
Survey Says …
Flip a few classes (SQL practice, teams) to engage
Only 4 Quizzes total (1 make-up at the end for take-away)
SQL Theory and Standards
DBMS Design
Big Data
For Discussion…
Big Data – Velocity, volume, variety, veracity [2014]
1. Daily – 2.5 quintillion bytes (2,500,000,000,000,000,000), about 2.5 Exabytes (roughly 2.2 EiB), or 46,566,128 50GB Blu-Ray Discs (counting 50GB as 50 x 2^30 bytes), IBM Estimate
2. Annually – With a global population of 7.5 billion, each person produces/consumes about 2.25 unique Blu-Rays per Year, or roughly 24 DVDs (assuming even distribution – unlikely)
3. Annually – If produced/consumed by US population alone – 53 Blu-Rays per Year or 564 DVDs per person
4. Data in Total is 40 trillion gigabytes or 800 billion Blu-Rays, just over 100 (unique) Blu-Rays per person globally
5. Data by Powers of 10 and 2 – 2^64 bytes is 16 Exabytes of Addressable Data [PC limit]
6. Max Data Velocity is 100 Gbps, the Fastest Ethernet [8b/10b – 10 billion bytes per second]
7. How much is Truly Unique Data vs. Duplicated?
8. What is the Quality (Veracity) of this Data?
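The estimates in the list above can be re-derived with a few lines of arithmetic. This is a back-of-envelope sketch, not the author's calculation: the US population figure (320 million), the 4.7GB DVD capacity, and the choice to count a 50GB disc as 50 x 2^30 bytes are assumptions chosen because they reproduce the slide's numbers.

```python
# Back-of-envelope check of the Big Data volume estimates.
DAILY_BYTES = 2.5e18              # IBM estimate: 2.5 quintillion bytes per day
BLU_RAY_BYTES = 50 * 2**30        # assumption: "50GB" counted as 50 GiB
DVDS_PER_BLU_RAY = 50 / 4.7       # assumption: single-layer 4.7GB DVD
WORLD_POPULATION = 7.5e9          # from the slide
US_POPULATION = 320e6             # assumption: approximate US population

discs_per_day = DAILY_BYTES / BLU_RAY_BYTES
discs_per_year = discs_per_day * 365

world_discs_per_person = discs_per_year / WORLD_POPULATION
us_discs_per_person = discs_per_year / US_POPULATION
world_dvds_per_person = world_discs_per_person * DVDS_PER_BLU_RAY
us_dvds_per_person = us_discs_per_person * DVDS_PER_BLU_RAY

# Total data: 40 trillion gigabytes expressed as 50GB discs.
total_discs = 40e12 / 50
total_discs_per_person = total_discs / WORLD_POPULATION

# 64-bit addressing: 2^64 bytes expressed in units of 2^60 bytes (EiB).
addressable_eib = 2**64 // 2**60

# 100 Gbps with 10 line bits per data byte (8b/10b, as on the slide).
ethernet_bytes_per_second = 100e9 / 10

if __name__ == "__main__":
    print(f"Blu-Rays per day:          {discs_per_day:,.0f}")
    print(f"Per person per year:       {world_discs_per_person:.2f}")
    print(f"Per US person per year:    {us_discs_per_person:.1f}")
    print(f"Total discs per person:    {total_discs_per_person:.1f}")
    print(f"Addressable EiB (64-bit):  {addressable_eib}")
```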
Data Archives - Digital Tape vs. VTL
Tape still competitive - Roadmap 2015-2025
Disk areal density > tape, but total capacity less - e.g. MIT Tech Review on HAMR
IEEE Spectrum on Mag Tape
LTO-8 (12TB), < $200 on Amazon, $0.016/GB; E.g. Spectra Logic (640PB in 5 42U Racks), Nathan T.
Seagate Exos (12TB), < $400 on Amazon, $0.032/GB, HAMR HDD; E.g. DDN Exascaler (35PB in 4 42U Racks)
Tape is ½ cost, and 14x uncompressed storage density (1U = 1.75 inches, 42U is just over 6 foot tall)
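The "½ cost" and "14x density" comparison follows directly from the quoted prices and rack counts. A small sketch, using the upper-bound prices from the slide ($200 and $400) and decimal units (1TB = 1000GB); the helper names are made up for this illustration.

```python
import math

def cost_per_gb(price_usd, capacity_tb):
    """Media cost in dollars per gigabyte (1 TB = 1000 GB)."""
    return price_usd / (capacity_tb * 1000)

def racks_needed(total_pb, pb_per_rack):
    """Whole racks required to hold total_pb petabytes."""
    return math.ceil(total_pb / pb_per_rack)

tape_cost = cost_per_gb(200, 12)    # LTO-8 cartridge
disk_cost = cost_per_gb(400, 12)    # 12TB hard drive

tape_pb_per_rack = 640 / 5          # Spectra Logic example: 640PB in 5 racks
disk_pb_per_rack = 35 / 4           # DDN example: 35PB in 4 racks
density_ratio = tape_pb_per_rack / disk_pb_per_rack

if __name__ == "__main__":
    print(f"tape ${tape_cost:.4f}/GB, disk ${disk_cost:.4f}/GB")
    print(f"tape holds {density_ratio:.1f}x more per rack")
    # Racks needed for one exabyte (1000 PB) at each density.
    print("disk racks per EB:", racks_needed(1000, disk_pb_per_rack))
    print("tape racks per EB:", racks_needed(1000, tape_pb_per_rack))
```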
Data Centers
Large - to host an Exabyte (large room, e.g. 30+ person classroom)
Thermal and power challenge
120 6+ foot, 19” wide, 28.5” deep Racks!
https://news.microsoft.com/features/under-the-sea-microsoft-tests-a-datacenter-thats-quick-to-deploy-could-provide-internet-connectivity-for-years/
“The Project Natick data center has 12 racks containing a total of 864 servers and associated cooling system infrastructure.” - Microsoft AI & Research
Big Data
Volume and Velocity Can Be Estimated as Shown
– Disk drives shipped and in use
– Online data only, or removable and archive media as well?
– Bit-rot (media eventually fails, limited storage lifetime)
Variety, Depends on Level of Data Duplication
– Enterprise Storage System Deduplication – E.g. EMC Deduplication
– Internet Archive [petabytes] and Wayback machine
– http://www.loc.gov/about/general-information/ [traditional volumes]
– Stanford Digital Repository, National Archives, National A/V Conservation
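To make the deduplication idea concrete, here is a minimal sketch that stores each distinct chunk of data once, keyed by its content hash. It assumes fixed-size chunks and SHA-256 digests; it is an illustration of the general technique, not a description of how EMC or any other product implements it, and the function names are hypothetical.

```python
import hashlib

def dedup_store(blobs, chunk_size=4096):
    """Split each blob into fixed-size chunks and keep one copy per unique chunk.

    Returns (store, recipes): store maps digest -> chunk bytes, and each
    recipe is the ordered list of digests needed to rebuild one blob.
    """
    store = {}
    recipes = []
    for blob in blobs:
        recipe = []
        for offset in range(0, len(blob), chunk_size):
            chunk = blob[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)   # duplicate chunks are not stored again
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes

def restore(store, recipe):
    """Rebuild a blob from its recipe of chunk digests."""
    return b"".join(store[digest] for digest in recipe)

if __name__ == "__main__":
    files = [b"A" * 8192 + b"B" * 4096, b"A" * 4096 + b"C" * 4096]
    store, recipes = dedup_store(files)
    raw = sum(len(f) for f in files)
    kept = sum(len(c) for c in store.values())
    print(f"raw bytes: {raw}, stored bytes: {kept}")
```

The ratio of raw to stored bytes is one rough way to ask how much of a data set is truly unique.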
Veracity, perhaps Most Challenging Part
– Is the Data Correct – Not Corrupted
– Is it Valid – From a Known, Trusted Source, Corresponding to Metadata Description
– Has the Data Been Processed and if so, How?
– Is it Raw Data (from a sensor, user, other)?
– Veracity is difficult – E.g. http://berkeleyearth.org/about-data-set
Why NoSQL instead of MySQL?
MongoDB - Linux install (SE Workstation for projects)
C++ with Persistence (STLplus C++ library)
Redis - “in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes…”
Cassandra - “a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure” - Wikipedia
Semi-structured
Unstructured Data
BLOBs - Binary Large Objects
– Images
– Digital Video and Audio – Digital Media
– Binary Data (Documents and Code), Perhaps Proprietary
– Moose-to-Skeleton.png
– Sled-Dogs.jpg
– korean-air-profile.jpg
CLOBs – Character Large Objects
– Log files and Traces (IT)
– Transaction Logs
Semi-Structured (Self-describing)
– XML, HTML, XDS, etc. [Web documents typically via HTTP, HTTPS]
– JSON
– NoSQL
Semi-Structured Data
HTML - Web pages
XML - Extensible Markup Language
JSON - “JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML.” -Wikipedia
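As a small illustration of "self-describing" attribute–value pairs, the sketch below serializes a record to JSON text and parses it back using Python's standard json module. The record contents are made up for the example.

```python
import json

# A hypothetical record: the attribute names travel with the values,
# so the text describes its own structure without a separate schema.
record = {
    "course": "CS317",
    "title": "File and Database Systems",
    "topics": ["SQL", "NoSQL", "Big Data"],
    "lecture": {"number": 5, "part": 2},
}

text = json.dumps(record, indent=2)   # object -> human-readable text
parsed = json.loads(text)             # text -> object again

if __name__ == "__main__":
    print(text)
    print(parsed["topics"][1])
```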
OO Schemas
OO Concepts – “Real World”
OOA – Object Oriented Analysis
– Define Class Hierarchies (Abstract Classes with Attributes) and Interfaces (Public, Private) and Methods (Operations)
– Inheritance and Multiple Inheritance
OOD – OO Design
– Encapsulation of Methods with Data (Attributes) for Abstract and Derived Classes
– Instantiation and Use of Objects [Use Cases]
OOP – Object Oriented Programming (Java, C++, …)
– Programming Language – Direct Implementation of OOD
– Implementation of Re-useable OO Code Libraries
  Boost - http://www.boost.org/
  OpenCV [C++ version]
  Many More … in other OOPLs
Classes Useful in Real World
E.g. Biology – Kingdom, Phylum, Class, Order, Family, Genus, Species [Multiple Inheritance Examples], Proven Use
Parts – Components compose Sub-system(s) compose System(s) compose System of Systems
Supports Re-Use of Objects Instantiated from Class Hierarchy
Multiple Inheritance – Odd?
Can be Abstract, Derived and Concrete
– E.g. Mathematical, Data Structures, Image Processing
– Organization of Information (Classes in Web Ontology Language, OWL)
– Simulation of Physical Systems
– Most Often Software Libraries
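The platypus, a mammal that lays eggs, is a natural case for multiple inheritance. A minimal sketch follows; the class and method names are hypothetical, and a real taxonomy would be far richer.

```python
class Animal:
    def describe(self):
        return "animal"

class Mammal(Animal):
    def feeds_young(self):
        return "milk"

class EggLayer(Animal):
    def reproduces_by(self):
        return "eggs"

class Platypus(Mammal, EggLayer):
    """Inherits behavior from both parent classes."""
    pass

def resolution_order(cls):
    """Names of the classes searched, in order, when a method is looked up."""
    return [c.__name__ for c in cls.__mro__]

if __name__ == "__main__":
    p = Platypus()
    print(p.feeds_young(), p.reproduces_by())
    print(resolution_order(Platypus))
```

Python linearizes the hierarchy so the shared base class Animal appears only once in the lookup order; other OOPLs resolve the same "diamond" differently.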
http://en.wikipedia.org/wiki/Platypus#mediaviewer/File:Wild_Platypus_4.jpg
https://www.youtube.com/watch?v=kDay5OWDPn4#t=26
Quick Review of OO [not just C++]
Encapsulation of Data and Methods in an Instantiated Object
Objects are Instances from a Class Hierarchy
– Classes Define Encapsulated Data and Methods
– Virtual Functions can Be Refined
– Pure Virtual Functions Defined in Abstract Classes must be Refined
– Can Inherit Data and Methods from Parent Classes
– Can In Fact Have Multiple Inheritance
– Instantiated Objects Call Dynamically Bound Methods [Determined at Runtime]
Enables Semantic Overload [Can be Done without OO too]
– Overloaded Functions (Methods), Resolved by Type Signatures or Subtype/Sub-class
– Overloaded Operators (E.g. math operators work not only on integers and real numbers, but also vectors, matrices, and complex numbers)
– Derived Data Types from Base types
Polymorphism
– Parametric – Re-useable Templates (E.g. Ada and Java Generic, C++ Template)
– Functional Semantic Overloading
– Dynamic or Subtype or Subclass Polymorphism using Late Binding
OOPLs – Smalltalk to more current Java, C++, Ada95, … CLOS
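A short sketch of several of these ideas together: an abstract class with a method that must be refined (the analogue of a C++ pure virtual function), an ordinary method that a subclass chooses to refine, and late binding when calling through the base type. Python's abc module is used here; the Shape, Circle, and Square classes are hypothetical examples.

```python
import math
from abc import ABC, abstractmethod

class Shape(ABC):
    """Abstract class: data (name) encapsulated with methods."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def area(self):
        """Must be refined by every concrete subclass."""

    def describe(self):
        """May be refined; calls area(), which is bound at runtime."""
        return f"{self.name} with area {self.area():.2f}"

class Circle(Shape):
    def __init__(self, radius):
        super().__init__("circle")
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2

    def describe(self):
        # Refines the inherited method and reuses the parent's version.
        return "a " + super().describe()

def total_area(shapes):
    """Subtype polymorphism: the right area() is chosen per object at runtime."""
    return sum(shape.area() for shape in shapes)

if __name__ == "__main__":
    for shape in (Circle(1), Square(2)):
        print(shape.describe())
```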
Operator and Function Overloading
What is Required to Be OO?
Common Consensus is – Encapsulation, Class Hierarchy, Polymorphism (Parametric & Subtype or Subclass with Late Binding), Inheritance
Operator Overloading Not Required (E.g. Java Frowns Upon, No Support)
Some PLs have OO Features, but not All
http://en.wikipedia.org/wiki/Operator_overloading
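For a language that does support it, operator overloading can be sketched as follows: the same + and * symbols that work on numbers are given meaning for a 2-D vector type. Vec2 is a hypothetical class written for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vec2:
    x: float
    y: float

    def __add__(self, other):
        """Vector addition: invoked for vec + vec."""
        return Vec2(self.x + other.x, self.y + other.y)

    def __mul__(self, scalar):
        """Scaling: invoked for vec * number."""
        return Vec2(self.x * scalar, self.y * scalar)

if __name__ == "__main__":
    print(1 + 2)                      # built-in integer addition
    print(Vec2(1, 2) + Vec2(3, 4))    # same operator, vector meaning
    print(Vec2(1, 2) * 3)
```

In Java the equivalent would have to be written as named methods, such as a hypothetical a.add(b), since user-defined operators are not supported there.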
Storing Objects in Relational Databases
One approach to achieving persistence with an OOPL is to use an RDBMS as the underlying storage engine.
– O2 – merged with Informix and acquired by IBM
– ObjectStore - http://www.objectstore.com/
– Objectivity - http://www.objectivity.com/products/objectivitydb
– Versant - http://www.actian.com/products/operational-databases/
Requires mapping class instances (i.e. objects) to one or more tuples distributed over one or more relations.
To handle class hierarchy, have two basic tasks to perform:
(1) design relations to represent class hierarchy;
(2) design how objects will be accessed.
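The two tasks can be sketched with Python's built-in sqlite3 module. This example picks just one of several possible designs, a single relation for the whole hierarchy with a discriminator column, and is an assumption made for illustration; the Staff and Manager classes, the table layout, and the save/load helpers are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Staff:
    staff_no: str
    name: str

@dataclass
class Manager(Staff):
    bonus: float = 0.0

def open_db():
    """Task 1: one relation represents the hierarchy; 'kind' records the class."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staff ("
        " staff_no TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " bonus REAL)"          # NULL for plain Staff rows
    )
    return conn

def save(conn, obj):
    """Map one object to one tuple."""
    conn.execute(
        "INSERT INTO staff (staff_no, name, kind, bonus) VALUES (?, ?, ?, ?)",
        (obj.staff_no, obj.name, type(obj).__name__, getattr(obj, "bonus", None)),
    )

def load(conn, staff_no):
    """Task 2: access path that rebuilds an object of the right class."""
    row = conn.execute(
        "SELECT staff_no, name, kind, bonus FROM staff WHERE staff_no = ?",
        (staff_no,),
    ).fetchone()
    if row is None:
        return None
    number, name, kind, bonus = row
    if kind == "Manager":
        return Manager(number, name, bonus)
    return Staff(number, name)

if __name__ == "__main__":
    conn = open_db()
    save(conn, Staff("S1", "Ann"))
    save(conn, Manager("S2", "Bob", 500.0))
    print(load(conn, "S1"), load(conn, "S2"))
```

The trade-off of this design is nullable columns for subclass-only attributes; mapping each class to its own relation avoids that at the cost of joins.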
Pearson Education © 2009
Stonebraker’s View
Pearson Education © 2014
SQL:2011 - New OO Features
Type constructors for row types and reference types.
User-defined types (distinct types and structured types) that can participate in supertype/subtype relationships.
User-defined procedures, functions, methods, and operators.
Type constructors for collection types (arrays, sets, lists, and multisets).
Support for large objects – BLOBs and CLOBs.
Recursion.
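Of the features listed, recursion is easy to try out: SQLite, reachable through Python's sqlite3 module, supports recursive queries via WITH RECURSIVE. The sketch below walks a parts hierarchy (system, sub-system, component); the component table and its rows are made up for the example, and SQLite does not implement the other features in the list, such as structured user-defined types.

```python
import sqlite3

def open_parts_db():
    """Build a small hypothetical 'is composed of' table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE component (parent TEXT, child TEXT)")
    conn.executemany(
        "INSERT INTO component (parent, child) VALUES (?, ?)",
        [("system", "subsystem"), ("subsystem", "board"), ("board", "chip")],
    )
    return conn

def all_subparts(conn, root):
    """Every part reachable below root, at any depth, in alphabetical order."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(part) AS (
            SELECT child FROM component WHERE parent = ?
            UNION
            SELECT c.child FROM component AS c JOIN sub ON c.parent = sub.part
        )
        SELECT part FROM sub ORDER BY part
        """,
        (root,),
    ).fetchall()
    return [row[0] for row in rows]

if __name__ == "__main__":
    conn = open_parts_db()
    print(all_subparts(conn, "system"))
```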