XML. Structure of XML Data XML Document Schema Querying and Transformation.
brent-osborne -
INTRODUCTION
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML
Documents have tags giving extra information about sections of the document
  E.g. <title> XML </title> <slide> Introduction … </slide>
Extensible, unlike HTML
  Users can add new tags, and separately specify how the tag should be handled for display
XML INTRODUCTION (CONT.)
The ability to specify new tags, and to create nested tag structures, makes XML a great way to exchange data, not just documents.
  Much of the use of XML has been in data exchange applications, not as a replacement for HTML
Tags make data (relatively) self-documenting
  E.g.
<university>
  <department>
    <dept_name> Comp. Sci. </dept_name>
    <building> Taylor </building>
    <budget> 100000 </budget>
  </department>
  <course>
    <course_id> CS-101 </course_id>
    <title> Intro. to Computer Science </title>
    <dept_name> Comp. Sci. </dept_name>
    <credits> 4 </credits>
  </course>
</university>
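To make "self-documenting" concrete, here is a minimal sketch that reads the university fragment with Python's standard xml.etree.ElementTree module and turns each top-level element into a dictionary keyed by tag name. The helper name element_to_dict is an assumption made for illustration; it is not part of the slides.

```python
import xml.etree.ElementTree as ET

UNIVERSITY_XML = """
<university>
  <department>
    <dept_name>Comp. Sci.</dept_name>
    <building>Taylor</building>
    <budget>100000</budget>
  </department>
  <course>
    <course_id>CS-101</course_id>
    <title>Intro. to Computer Science</title>
    <dept_name>Comp. Sci.</dept_name>
    <credits>4</credits>
  </course>
</university>
"""


def element_to_dict(element):
    """Map each child tag name to its stripped text (hypothetical helper)."""
    return {child.tag: (child.text or "").strip() for child in element}


root = ET.fromstring(UNIVERSITY_XML)

# The tags themselves tell the receiver what each value means.
records = [(child.tag, element_to_dict(child)) for child in root]
for kind, fields in records:
    print(kind, fields)
```

No separate schema was needed to recover the field names, which is the sense in which the tags document the data.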
XML: MOTIVATION
Data interchange is critical in today’s networked world
  Examples:
    Banking: funds transfer
    Order processing (especially inter-company orders)
    Scientific data
      Chemistry: ChemML, …
      Genetics: BSML (Bio-Sequence Markup Language), …
Paper flow of information between organizations is being replaced by electronic flow of information
Each application area has its own set of standards for representing information
XML has become the basis for all new generation data interchange formats
XML MOTIVATION (CONT.)
Earlier generation formats were based on plain text with line headers indicating the meaning of fields
  Similar in concept to email headers
  Does not allow for nested structures, no standard “type” language
  Tied too closely to low level document structure (lines, spaces, etc)
Each XML based standard defines what are valid elements, using
  XML type specification languages to specify the syntax
    DTD (Document Type Definition)
    XML Schema
  Plus textual descriptions of the semantics
XML allows new tags to be defined as required
  However, this may be constrained by DTDs
A wide variety of tools is available for parsing, browsing and querying XML documents/data
COMPARISON WITH RELATIONAL DATA
Inefficient: tags, which in effect represent schema information, are repeated
Better than relational tuples as a data-exchange format
  Unlike relational tuples, XML data is self-documenting due to presence of tags
  Non-rigid format: tags can be added
  Allows nested structures
  Wide acceptance, not only in database systems, but also in browsers, tools, and applications
STRUCTURE OF XML DATA
Tag: label for a section of data
Element: section of data beginning with <tagname> and ending with matching </tagname>
Elements must be properly nested
  Proper nesting
    <course> … <title> …. </title> </course>
  Improper nesting
    <course> … <title> …. </course> </title>
  Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
Every document must have a single top-level element
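The nesting and single-root rules can be checked mechanically. The sketch below relies on the fact that Python's xml.etree.ElementTree raises ParseError for input that is not well-formed; the function name is_well_formed is an assumption chosen for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


proper = "<course><title>Intro</title></course>"
improper = "<course><title>Intro</course></title>"   # end tags cross
two_roots = "<course/><course/>"                      # no single top-level element

results = {
    "proper": is_well_formed(proper),
    "improper": is_well_formed(improper),
    "two_roots": is_well_formed(two_roots),
}
print(results)
```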
EXAMPLE OF NESTED ELEMENTS
<purchase_order>
  <identifier> P-101 </identifier>
  <purchaser> …. </purchaser>
  <itemlist>
    <item>
      <identifier> RS1 </identifier>
      <description> Atom powered rocket sled </description>
      <quantity> 2 </quantity>
      <price> 199.95 </price>
    </item>
    <item>
      <identifier> SG2 </identifier>
      <description> Superb glue </description>
      <quantity> 1 </quantity>
      <unit-of-measure> liter </unit-of-measure>
      <price> 29.95 </price>
    </item>
  </itemlist>
</purchase_order>
MOTIVATION FOR NESTING
Nesting of data is useful in data transfer
  Example: elements representing item nested within an itemlist element
Nesting is not supported, or discouraged, in relational databases
  With multiple orders, customer name and address are stored redundantly
  Normalization replaces nested structures in each order by foreign key into table storing customer name and address information
  Nesting is supported in object-relational databases
But nesting is appropriate when transferring data
  External application does not have direct access to data referenced by a foreign key
STRUCTURE OF XML DATA (CONT.)
Mixture of text with sub-elements is legal in XML.
  Example:
<course>
  This course is being offered for the first time in 2009.
  <course_id> BIO-399 </course_id>
  <title> Computational Biology </title>
  <dept_name> Biology </dept_name>
  <credits> 3 </credits>
</course>
Useful for document markup, but discouraged for data representation
ATTRIBUTES
Elements can have attributes
<course course_id="CS-101">
  <title> Intro. to Computer Science </title>
  <dept_name> Comp. Sci. </dept_name>
  <credits> 4 </credits>
</course>
Attributes are specified by name=value pairs inside the starting tag of an element
An element may have several attributes, but each attribute name can only occur once
  <course course_id="CS-101" credits="4">
ATTRIBUTES VS. SUBELEMENTS
Distinction between subelement and attribute
  In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents
  In the context of data representation, the difference is unclear and may be confusing
    Same information can be represented in two ways
      <course course_id="CS-101"> … </course>
      <course> <course_id>CS-101</course_id> … </course>
Suggestion: use attributes for identifiers of elements, and use subelements for contents
NAMESPACES
XML data has to be exchanged between organizations
Same tag name may have different meaning in different organizations, causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over document by using XML Namespaces
<university xmlns:yale="http://www.yale.edu">
  …
  <yale:course>
    <yale:course_id> CS-101 </yale:course_id>
    <yale:title> Intro. to Computer Science </yale:title>
    <yale:dept_name> Comp. Sci. </yale:dept_name>
    <yale:credits> 4 </yale:credits>
  </yale:course>
  …
</university>
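As a rough illustration of what the prefix buys, the sketch below parses a trimmed copy of the namespace example with Python's xml.etree.ElementTree. The prefix-to-URI dictionary NS is an assumption of this example; the parser itself stores each tag with the full URI in braces, so the short prefix is only a lookup convenience.

```python
import xml.etree.ElementTree as ET

DOC = """
<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id>CS-101</yale:course_id>
    <yale:title>Intro. to Computer Science</yale:title>
  </yale:course>
</university>
"""

# Prefix map used only for lookups; the prefix name is arbitrary,
# the URI is what identifies the namespace.
NS = {"yale": "http://www.yale.edu"}

root = ET.fromstring(DOC)
course = root.find("yale:course", NS)
course_id = course.find("yale:course_id", NS).text
expanded_tag = course.tag   # the "unique-name:element-name" idea, fully spelled out

print(course_id)
print(expanded_tag)
```

An unprefixed lookup such as root.find("course") finds nothing here, which is exactly the separation that namespaces are meant to provide.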
MORE ON XML SYNTAX
Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting the end tag
  <course course_id="CS-101" Title="Intro. To Computer Science" dept_name="Comp. Sci." credits="4" />
To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below
  <![CDATA[<course> … </course>]]>
  Here, <course> and </course> are treated as just strings
  CDATA stands for “character data”
XML DOCUMENT SCHEMA
Database schemas constrain what information can be stored, and the data types of stored values
XML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
  Otherwise, a site cannot automatically interpret data received from another site
Two mechanisms for specifying XML schema
  Document Type Definition (DTD)
    Widely used
  XML Schema
    Newer, increasing use
DOCUMENT TYPE DEFINITION (DTD)
The type of an XML document can be specified using a DTD
DTD constrains structure of XML data
  What elements can occur
  What attributes can/must an element have
  What subelements can/must occur inside each element, and how many times.
DTD does not constrain data types
  All values represented as strings in XML
DTD syntax
  <!ELEMENT element (subelements-specification) >
  <!ATTLIST element (attributes) >
ELEMENT SPECIFICATION IN DTD
Subelements can be specified as
  names of elements, or
  #PCDATA (parsed character data), i.e., character strings
  EMPTY (no subelements) or ANY (anything can be a subelement)
Example
  <!ELEMENT department (dept_name, building, budget)>
  <!ELEMENT dept_name (#PCDATA)>
  <!ELEMENT budget (#PCDATA)>
Subelement specification may have regular expressions
  <!ELEMENT university ( ( department | course | instructor | teaches )+)>
  Notation:
    “|” - alternatives
    “+” - 1 or more occurrences
    “*” - 0 or more occurrences
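Because DTD content models are regular expressions over child element names, the idea can be imitated with an ordinary regex. The following is only a sketch of that correspondence, not a DTD validator: the table CONTENT_MODELS and the function children_match are assumptions of this example, and the two patterns are hand-translated from the declarations above.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; each child tag is followed by one space
# so that the regex works on a flat string of tag names.
CONTENT_MODELS = {
    # (department | course | instructor | teaches)+
    "university": re.compile(r"(?:(?:department|course|instructor|teaches) )+"),
    # (dept_name, building, budget)
    "department": re.compile(r"dept_name building budget "),
}


def children_match(element):
    """Check the sequence of child tags against the element's content model."""
    model = CONTENT_MODELS[element.tag]
    child_tags = "".join(child.tag + " " for child in element)
    return model.fullmatch(child_tags) is not None


good = ET.fromstring(
    "<university><department><dept_name>Biology</dept_name>"
    "<building>Watson</building><budget>90000</budget></department></university>"
)
empty = ET.fromstring("<university/>")          # violates "+" (1 or more)
wrong_order = ET.fromstring(
    "<department><building>Watson</building>"
    "<dept_name>Biology</dept_name><budget>90000</budget></department>"
)

print(children_match(good), children_match(good[0]))
print(children_match(empty), children_match(wrong_order))
```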
UNIVERSITY DTD
<!DOCTYPE university [
  <!ELEMENT university ( (department|course|instructor|teaches)+)>
  <!ELEMENT department ( dept_name, building, budget)>
  <!ELEMENT course ( course_id, title, dept_name, credits)>
  <!ELEMENT instructor (IID, name, dept_name, salary)>
  <!ELEMENT teaches (IID, course_id)>
  <!ELEMENT dept_name ( #PCDATA )>
  <!ELEMENT building ( #PCDATA )>
  <!ELEMENT budget ( #PCDATA )>
  <!ELEMENT course_id ( #PCDATA )>
  <!ELEMENT title ( #PCDATA )>
  <!ELEMENT credits ( #PCDATA )>
  <!ELEMENT IID ( #PCDATA )>
  <!ELEMENT name ( #PCDATA )>
  <!ELEMENT salary ( #PCDATA )>
]>
ATTRIBUTE SPECIFICATION IN DTD
Attribute specification: for each attribute
  Name
  Type of attribute
    CDATA
    ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
      more on this later
  Whether
    mandatory (#REQUIRED)
    has a default value (value),
    or neither (#IMPLIED)
Examples
  <!ATTLIST course course_id CDATA #REQUIRED>, or
  <!ATTLIST course
      course_id   ID     #REQUIRED
      dept_name   IDREF  #REQUIRED
      instructors IDREFS #IMPLIED >
IDS AND IDREFS
An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must be distinct
  Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same document
An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must contain the ID value of an element in the same document
UNIVERSITY DTD WITH ATTRIBUTES
University DTD with ID and IDREF attribute types.
<!DOCTYPE university-3 [
  <!ELEMENT university ( (department|course|instructor)+)>
  <!ELEMENT department ( building, budget )>
  <!ATTLIST department
      dept_name ID #REQUIRED >
  <!ELEMENT course (title, credits )>
  <!ATTLIST course
      course_id   ID     #REQUIRED
      dept_name   IDREF  #REQUIRED
      instructors IDREFS #IMPLIED >
  <!ELEMENT instructor ( name, salary )>
  <!ATTLIST instructor
      IID       ID    #REQUIRED
      dept_name IDREF #REQUIRED >
  · · · declarations for title, credits, building, budget, name and salary · · ·
]>
XML DATA WITH ID AND IDREF ATTRIBUTES
<university-3>
  <department dept_name="Comp. Sci.">
    <building> Taylor </building>
    <budget> 100000 </budget>
  </department>
  <department dept_name="Biology">
    <building> Watson </building>
    <budget> 90000 </budget>
  </department>
  <course course_id="CS-101" dept_name="Comp. Sci." instructors="10101 83821">
    <title> Intro. to Computer Science </title>
    <credits> 4 </credits>
  </course>
  ….
  <instructor IID="10101" dept_name="Comp. Sci.">
    <name> Srinivasan </name>
    <salary> 65000 </salary>
  </instructor>
  ….
</university-3>
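The ID and IDREF rules (IDs distinct, every reference resolving to some ID in the same document) can be sketched as a two-pass check. xml.etree.ElementTree does not read attribute types from a DTD, so the tables ID_ATTRS and IDREF_ATTRS below are copied by hand from the ATTLIST declarations; they, the function check_references, and the instructor with IID 83821 (added so that the references resolve) are assumptions of this example. The sketch compares raw strings and does not enforce the XML naming rules that a validating parser applies to ID values.

```python
import xml.etree.ElementTree as ET

# Attribute typing copied by hand from the DTD (assumption of this sketch).
ID_ATTRS = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
# (attribute name, is_IDREFS) pairs per element.
IDREF_ATTRS = {
    "course": [("dept_name", False), ("instructors", True)],
    "instructor": [("dept_name", False)],
}


def check_references(root):
    """Return a list of error strings; an empty list means all checks passed."""
    errors = []
    ids = set()
    # Pass 1: collect ID values and report duplicates.
    for el in root.iter():
        attr = ID_ATTRS.get(el.tag)
        if attr and attr in el.attrib:
            value = el.attrib[attr]
            if value in ids:
                errors.append(f"duplicate ID {value!r}")
            ids.add(value)
    # Pass 2: every IDREF / IDREFS value must name an existing ID.
    for el in root.iter():
        for attr, plural in IDREF_ATTRS.get(el.tag, []):
            if attr not in el.attrib:
                continue
            raw = el.attrib[attr]
            refs = raw.split() if plural else [raw]
            for ref in refs:
                if ref not in ids:
                    errors.append(f"{el.tag}/@{attr} refers to unknown ID {ref!r}")
    return errors


VALID = """
<university-3>
  <department dept_name="Comp. Sci."/>
  <course course_id="CS-101" dept_name="Comp. Sci." instructors="10101 83821"/>
  <instructor IID="10101" dept_name="Comp. Sci."/>
  <instructor IID="83821" dept_name="Comp. Sci."/>
</university-3>
"""
DANGLING = VALID.replace('course_id="CS-101" dept_name="Comp. Sci."',
                         'course_id="CS-101" dept_name="Physics"')

valid_errors = check_references(ET.fromstring(VALID))
dangling_errors = check_references(ET.fromstring(DANGLING))
print(valid_errors)
print(dangling_errors)
```

Note that the check cannot tell whether instructors refers to instructor elements or to courses, since the references are untyped.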
LIMITATIONS OF DTDS
No typing of text elements and attributes
  All values are strings, no integers, reals, etc.
Difficult to specify unordered sets of subelements
  Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
  (A | B)* allows specification of an unordered set, but
    Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
  The instructors attribute of a course may contain a reference to another course, which is meaningless
    instructors attribute should ideally be constrained to refer to instructor elements
XML SCHEMA
XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports
  Typing of values
    E.g. integer, string, etc
    Also, constraints on min/max values
  User-defined, complex types
  Many more features, including uniqueness and foreign key constraints, inheritance
XML Schema is itself specified in XML syntax, unlike DTDs
  More-standard representation, but verbose
XML Schema is integrated with namespaces
BUT: XML Schema is significantly more complicated than DTDs.
XML SCHEMA VERSION OF UNIV. DTD
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="university" type="UniversityType" />
<xs:element name="department">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="dept_name" type="xs:string"/>
      <xs:element name="building" type="xs:string"/>
      <xs:element name="budget" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
….
<xs:element name="instructor">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="IID" type="xs:string"/>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="dept_name" type="xs:string"/>
      <xs:element name="salary" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
… Contd.
XML SCHEMA VERSION OF UNIV. DTD (CONT.)
….
<xs:complexType name="UniversityType">
  <xs:sequence>
    <xs:element ref="department" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="course" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="instructor" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element ref="teaches" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
</xs:schema>
Choice of “xs:” was ours -- any other namespace prefix could be chosen
Element “university” has type “UniversityType”, which is defined separately
xs:complexType is used later to create the named complex type “UniversityType”
MORE FEATURES OF XML SCHEMA
Attributes specified by xs:attribute tag:
  <xs:attribute name="dept_name"/>
  adding the attribute use="required" means value must be specified
Key constraint: “department names form a key for department elements under the root university element”:
  <xs:key name="deptKey">
    <xs:selector xpath="/university/department"/>
    <xs:field xpath="dept_name"/>
  </xs:key>
Foreign key constraint from course to department:
  <xs:keyref name="courseDeptFKey" refer="deptKey">
    <xs:selector xpath="/university/course"/>
    <xs:field xpath="dept_name"/>
  </xs:keyref>
QUERYING AND TRANSFORMING XML DATA
Translation of information from one XML schema to another
Querying on XML data
Above two are closely related, and handled by the same tools
Standard XML querying/translation languages
  XPath
    Simple language consisting of path expressions
  XSLT
    Simple language designed for translation from XML to XML and XML to HTML
  XQuery
    An XML query language with a rich set of features
TREE MODEL OF XML DATA
Query and transformation languages are based on a tree model of XML data
An XML document is modeled as a tree, with nodes corresponding to elements and attributes
  Element nodes have child nodes, which can be attributes or subelements
  Text in an element is modeled as a text node child of the element
  Children of a node are ordered according to their order in the XML document
  Element and attribute nodes (except for the root node) have a single parent, which is an element node
  The root node has a single child, which is the root element of the document
XPATH
XPath is used to address (select) parts of documents using path expressions
A path expression is a sequence of steps separated by “/”
  Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing elements/attributes match the specified path
E.g. /university-3/instructor/name evaluated on the university-3 data we saw earlier returns
  <name>Srinivasan</name>
  <name>Brandt</name>
E.g. /university-3/instructor/name/text( )
  returns the same names, but without the enclosing tags
XPATH (CONT.)
The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
  Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
  E.g. /university-3/course[credits >= 4]
    returns course elements with a credits value greater than or equal to 4
  /university-3/course[credits] returns course elements containing a credits subelement
Attributes are accessed using “@”
  E.g. /university-3/course[credits >= 4]/@course_id
    returns the course identifiers of courses with credits >= 4
  IDREF attributes are not dereferenced automatically (more on this later)
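The path expressions on the last two slides can be tried out with Python's xml.etree.ElementTree, with one caveat: that module implements only a subset of XPath. Child steps, the existence predicate [credits], and attribute access work, but comparison predicates such as [credits >= 4] do not, so the sketch performs that filter in Python instead. The courses BIO-301 and CS-000 are made-up test data, not part of the slides.

```python
import xml.etree.ElementTree as ET

DOC = """
<university-3>
  <course course_id="CS-101" dept_name="Comp. Sci.">
    <title>Intro. to Computer Science</title>
    <credits>4</credits>
  </course>
  <course course_id="BIO-301" dept_name="Biology">
    <title>Genetics</title>
    <credits>3</credits>
  </course>
  <course course_id="CS-000" dept_name="Comp. Sci.">
    <title>Seminar</title>
  </course>
  <instructor IID="10101" dept_name="Comp. Sci.">
    <name>Srinivasan</name>
  </instructor>
</university-3>
"""

# ET.fromstring returns the top-level element, so paths are written
# relative to it ("./instructor/name" rather than "/university-3/instructor/name").
root = ET.fromstring(DOC)

# /university-3/instructor/name/text()
names = [n.text for n in root.findall("./instructor/name")]

# /university-3/course[credits]  -- courses that have a credits subelement
with_credits = root.findall("./course[credits]")

# /university-3/course[credits >= 4]/@course_id
# The numeric comparison is done in Python because ElementTree's XPath
# subset has no comparison operators.
ids_with_4_plus = [
    c.get("course_id") for c in with_credits if int(c.findtext("credits")) >= 4
]

print(names)
print([c.get("course_id") for c in with_credits])
print(ids_with_4_plus)
```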
FUNCTIONS IN XPATH
XPath provides several functions
  The function count() at the end of a path counts the number of elements in the set generated by the path
    E.g. /university-2/instructor[count(./teaches/course) > 2]
      Returns instructors teaching more than 2 courses (on university-2 schema)
  Also function for testing position (1, 2, ..) of node w.r.t. siblings
Boolean connectives and and or and function not() can be used in predicates
IDREFs can be referenced using function id()
  id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
  E.g. /university-3/course/id(@dept_name)
    returns all department elements referred to from the dept_name attribute of course elements.
MORE XPATH FEATURES
Operator “|” used to implement union
  E.g. /university-3/course[@dept_name="Comp. Sci."] | /university-3/course[@dept_name="Biology"]
    Gives union of Comp. Sci. and Biology courses
    However, “|” cannot be nested inside other operators.
“//” can be used to skip multiple levels of nodes
  E.g. /university-3//name
    finds any name element anywhere under the /university-3 element, regardless of the element in which it is contained.
A step in the path can go to parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
  “//”, described above, is a short form for specifying “all descendants”
  “..” specifies the parent.
doc(name) returns the root of a named document
XQUERY
XQuery is a general purpose query language for XML data
Standardized by the World Wide Web Consortium (W3C); XQuery 1.0 became a W3C Recommendation in January 2007
  The textbook description is based on a January 2005 draft of the standard, so details may differ from the final version, but the major features are unchanged.
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL
XQuery uses a for … let … where … order by … return … syntax
  for ⇔ SQL from
  where ⇔ SQL where
  order by ⇔ SQL order by
  return ⇔ SQL select
  let allows temporary variables, and has no equivalent in SQL
FLWOR SYNTAX IN XQUERY
For clause uses XPath expressions, and variable in for clause ranges over values in the set returned by XPath
Simple FLWOR expression in XQuery
  find all courses with credits > 3, with each result enclosed in a <course_id> .. </course_id> tag
    for $x in /university-3/course
    let $courseId := $x/@course_id
    where $x/credits > 3
    return <course_id> { $courseId } </course_id>
  Items in the return clause are XML text unless enclosed in {}, in which case they are evaluated
Let clause not really needed in this query, and selection can be done in XPath. Query can be written as:
    for $x in /university-3/course[credits > 3]
    return <course_id> { $x/@course_id } </course_id>
Alternative notation for constructing elements: return element course_id { element
$x/@course_id }
JOINS
Joins are specified in a manner very similar to SQL
    for $c in /university/course,
        $i in /university/instructor,
        $t in /university/teaches
    where $c/course_id = $t/course_id and $t/IID = $i/IID
    return <course_instructor> { $c, $i } </course_instructor>
The same query can be expressed with the selections specified as XPath selections:
    for $c in /university/course,
        $i in /university/instructor,
        $t in /university/teaches[ course_id = $c/course_id and IID = $i/IID ]
    return <course_instructor> { $c, $i } </course_instructor>
NESTED QUERIES
The following query converts data from the flat structure for university information into the nested structure used in university-1
<university-1>
{ for $d in /university/department
  return <department>
           { $d/* }
           { for $c in /university/course[dept_name = $d/dept_name]
             return $c }
         </department>
}
{ for $i in /university/instructor
  return <instructor>
           { $i/* }
           { for $c in /university/teaches[IID = $i/IID]
             return $c/course_id }
         </instructor>
}
</university-1>
$c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag
GROUPING AND AGGREGATION
Nested queries are used for grouping
    for $d in /university/department
    return
      <department-total-salary>
        <dept_name> { $d/dept_name } </dept_name>
        <total_salary> { fn:sum(
            for $i in /university/instructor[dept_name = $d/dept_name]
            return $i/salary
        ) } </total_salary>
      </department-total-salary>
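For readers more used to a general-purpose language, the grouping query has a direct analogue: an outer loop over departments with an inner aggregate over matching instructors. The sketch below is that analogue in Python, not an XQuery engine. The function name total_salary_by_department and the salary shown for Brandt are assumptions of this example.

```python
import xml.etree.ElementTree as ET

DOC = """
<university>
  <department><dept_name>Comp. Sci.</dept_name></department>
  <department><dept_name>Biology</dept_name></department>
  <instructor>
    <IID>10101</IID><name>Srinivasan</name>
    <dept_name>Comp. Sci.</dept_name><salary>65000</salary>
  </instructor>
  <instructor>
    <IID>83821</IID><name>Brandt</name>
    <dept_name>Comp. Sci.</dept_name><salary>92000</salary>
  </instructor>
</university>
"""


def total_salary_by_department(root):
    """Outer loop = 'for $d in department'; inner sum = fn:sum(nested for)."""
    totals = {}
    for dept in root.findall("department"):
        name = dept.findtext("dept_name")
        totals[name] = sum(
            float(inst.findtext("salary"))
            for inst in root.findall("instructor")
            if inst.findtext("dept_name") == name
        )
    return totals


totals = total_salary_by_department(ET.fromstring(DOC))
print(totals)
```

As in the XQuery version, a department with no instructors still appears in the result, with a total of zero.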
SORTING IN XQUERY
The order by clause can be used at the end of any expression. E.g. to return instructors sorted by name
    for $i in /university/instructor
    order by $i/name
    return <instructor> { $i/* } </instructor>
Use order by $i/name descending to sort in descending order
Can sort at multiple levels of nesting (sort departments by dept_name, and sort courses by course_id within each department)
    <university-1> {
      for $d in /university/department
      order by $d/dept_name
      return
        <department>
          { $d/* }
          { for $c in /university/course[dept_name = $d/dept_name]
            order by $c/course_id
            return <course> { $c/* } </course> }
        </department>
    } </university-1>
FUNCTIONS AND OTHER XQUERY FEATURES
User defined functions with the type system of XML Schema
    declare function local:dept_courses($iid as xs:string) as element(course)* {
        for $i in /university/instructor[IID = $iid],
            $c in /university/course[dept_name = $i/dept_name]
        return $c
    }
Types are optional for function parameters and return values
The * (as in element(course)* or decimal*) indicates a sequence of values of that type
Universal and existential quantification in where clause predicates
  some $e in path satisfies P
  every $e in path satisfies P
  Add and fn:exists($e) to prevent empty $e from satisfying every clause
XQuery also supports if-then-else clauses
For example, to find departments where every instructor has a salary greater than $50,000, we can use the following query:
    for $d in /university/department
    where every $i in /university/instructor[dept_name = $d/dept_name] satisfies $i/salary > 50000
    return $d
Note, however, that if a department has no instructor, it will trivially satisfy the above condition. An extra clause:
    and fn:exists(/university/instructor[dept_name = $d/dept_name])
can be added to the where clause to ensure that the department has at least one instructor.
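The "trivially satisfied" case is the usual vacuous truth of universal quantification, and Python's all() behaves the same way on an empty sequence. The following sketch mirrors the query with and without the existence check. All department and instructor records other than Srinivasan's are invented for the example.

```python
departments = ["Comp. Sci.", "Biology", "History"]   # History has no instructors
instructors = [
    {"name": "Srinivasan", "dept_name": "Comp. Sci.", "salary": 65000},
    {"name": "Crick", "dept_name": "Biology", "salary": 72000},
    {"name": "Low", "dept_name": "Biology", "salary": 40000},
]


def salaries(dept):
    """Salaries of the instructors in one department (may be empty)."""
    return [i["salary"] for i in instructors if i["dept_name"] == dept]


# every $i in ... satisfies $i/salary > 50000
every_only = [d for d in departments if all(s > 50000 for s in salaries(d))]

# ... and fn:exists(...): a non-empty list is truthy, playing the role of exists
every_and_exists = [
    d for d in departments
    if salaries(d) and all(s > 50000 for s in salaries(d))
]

print(every_only)        # includes the department with no instructors
print(every_and_exists)  # excludes it
```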
XSLT
A stylesheet stores formatting options for a document, usually separately from document
  E.g. an HTML style sheet may specify font colors and sizes for headings, etc.
The XML Stylesheet Language (XSL) was originally designed for generating HTML from XML
XSLT is a general-purpose transformation language
  Can translate XML to XML, and XML to HTML
XSLT transformations are expressed using rules called templates
  Templates combine selection using XPath with construction of results
END OF XML
JSON
JSON: JavaScript Object Notation.
JSON is a syntax for storing and exchanging data.
JSON is an easier to use alternative to XML.
JSON
JSON is a lightweight data interchange format
JSON is language independent.
JSON is "self-describing" and easy to understand
JSON uses JavaScript syntax, but the JSON format is text only, just like XML. Text can be read and used as a data format by any programming language.
JSON Example
{"employees":[
  {"firstName":"John", "lastName":"Doe"},
  {"firstName":"Anna", "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Jones"}
]}
XML Example
<employees>
  <employee>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </employee>
  <employee>
    <firstName>Anna</firstName> <lastName>Smith</lastName>
  </employee>
  <employee>
    <firstName>Peter</firstName> <lastName>Jones</lastName>
  </employee>
</employees>
The JSON format is syntactically identical to the code for creating JavaScript objects.
<!DOCTYPE html>
<html>
<body>
<h2>JSON Object Creation in JavaScript</h2>
<p id="demo"></p>
<script>
var text = '{"name":"John Johnson","street":"Oslo West 16","phone":"555 1234567"}';
var obj = JSON.parse(text);
document.getElementById("demo").innerHTML =
  obj.name + "<br>" +
  obj.street + "<br>" +
  obj.phone;
</script>
</body>
</html>
LIKE XML
Both JSON and XML are plain text
Both JSON and XML are "self-describing" (human readable)
Both JSON and XML are hierarchical (values within values)
Both JSON and XML can be fetched with an XMLHttpRequest
UNLIKE XML
JSON doesn't use end tags
JSON is shorter
JSON is quicker to read and write
JSON can use arrays
The biggest difference is: XML has to be parsed with an XML parser, JSON can be parsed by a standard JavaScript function.
JSON SYNTAX
JSON syntax is a subset of the JavaScript object notation syntax:
  Data is in name/value pairs
  Data is separated by commas
  Curly braces hold objects
  Square brackets hold arrays
JSON DATA - A NAME AND A VALUE
JSON data is written as name/value pairs.
A name/value pair consists of a field name (in double quotes), followed by a colon, followed by a value:
  "firstName":"John"
JSON values can be:
  A number (integer or floating point)
  A string (in double quotes)
  A Boolean (true or false)
  An array (in square brackets)
  An object (in curly braces)
  null
JSON objects are written inside curly braces.
Just like in JavaScript, objects can contain multiple name/value pairs:
  {"firstName":"John", "lastName":"Doe"}
JSON Arrays
JSON arrays are written inside square brackets.
Just like in JavaScript, an array can contain multiple objects:
  "employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
  ]
In the example above, the name "employees" is paired with an array containing three objects. Each object is a record of a person (with a first name and a last name).
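Since JSON is language independent, the same text can be read outside JavaScript. As a small sketch, Python's standard json module maps the value kinds listed above onto native types: objects become dict, arrays become list, true and false become True and False, and null becomes None. The variable names are choices made for this example.

```python
import json

text = (
    '{"employees":['
    '{"firstName":"John", "lastName":"Doe"},'
    '{"firstName":"Anna", "lastName":"Smith"},'
    '{"firstName":"Peter", "lastName":"Jones"}]}'
)

data = json.loads(text)            # JSON text -> Python dict
employees = data["employees"]      # JSON array -> Python list
full_names = [e["firstName"] + " " + e["lastName"] for e in employees]

# One value of each JSON kind, to show the type mapping.
sample = json.loads('[1, 2.5, "s", true, null, {"k": []}]')

# Serializing and parsing again gives back an equal structure.
round_trip_ok = json.loads(json.dumps(data)) == data

print(full_names)
print(sample)
```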
JSON Uses JavaScript Syntax
Because JSON uses JavaScript syntax, no extra software is needed to work with JSON within JavaScript.
With JavaScript you can create an array of objects and assign data to it like this:
Example
  var employees = [
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
  ];
The first entry in the JavaScript object array can be accessed like this:
  employees[0].firstName + " " + employees[0].lastName;
The returned content will be:
  John Doe
The data can be modified like this:
  employees[0].firstName = "Gilbert";
A common use of JSON is to read data from a web server, and display the data in a web page.
For simplicity, this can be demonstrated by using a string as input.
Create a JavaScript string containing JSON syntax:
  var text = '{ "employees" : [' +
    '{ "firstName":"John" , "lastName":"Doe" },' +
    '{ "firstName":"Anna" , "lastName":"Smith" },' +
    '{ "firstName":"Peter" , "lastName":"Jones" } ]}';
The JavaScript function JSON.parse(text) can be used to convert a JSON text into a JavaScript object:
  var obj = JSON.parse(text);
Use the new JavaScript object in your page:
Example
  <p id="demo"></p>
  <script>
  document.getElementById("demo").innerHTML =
    obj.employees[1].firstName + " " + obj.employees[1].lastName;
  </script>
Result
  Create Object from JSON String
  Anna Smith
USING EVAL()
Older browsers without the support for the JavaScript function JSON.parse() can use the eval() function to convert a JSON text into a JavaScript object:
var obj = eval ("(" + text + ")");
EVAL
The JavaScript eval(string) method compiles and executes the given string
  The string can be an expression, a statement, or a sequence of statements
  Expressions can include variables and object properties
  eval returns the value of the last expression evaluated
When applied to JSON, eval returns the described object
Because eval executes whatever code the string contains, it should only be applied to text from a trusted source
BIG DATA
3 Vs
Velocity
Volume
Variety.
WHAT IS HADOOP
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
  Accessible
  Robust
  Scalable
  Simple
WHAT IS HBASE?
HBase is . . .
– Open-Source
– Sparse
– Multidimensional
– Persistent
– Distributed
– Sorted Map
– Runs on top of HDFS
– Modeled after Google’s BigTable
HBase is a distributed column-oriented data store built on top of HDFS
HBase is an Apache open source project whose goal is to provide storage for the Hadoop distributed computing environment
Data is logically organized into tables, rows and columns
HBASE IS NOT A TRADITIONAL DATABASE
WHEN TO USE HBASE?
Use HBase if…
– You need random write, random read, or both (but not neither)
– You need to do many thousands of operations per second on multiple TB of data
– Your access patterns are well-known and simple
Don’t use HBase if…
– You only append to your dataset, and tend to read the whole thing
– Your data easily fits on one beefy node
HBASE DATA MODEL
Overview
Tables are made of rows and columns
Every row has a row key (analogous to a primary key)
– Rows are stored sorted by row key for fast lookups
All columns in HBase belong to a particular column family
A table may have one or more column families
– Common to have a small number of column families
– Column families should rarely change
– A column family can have any number of columns
Table cells are versioned, un-interpreted arrays of bytes
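One way to picture this data model is as a nested map: row key, then column (family plus qualifier), then timestamp, ending in an uninterpreted byte string. The class below is a toy in-memory sketch of that picture under stated assumptions. TinyTable and its methods are names invented here, they are not HBase client APIs, and nothing about storage, regions or distribution is modeled.

```python
class TinyTable:
    """Toy model: row key -> (family, qualifier) -> timestamp -> bytes."""

    def __init__(self, column_families):
        # Column families are fixed up front; columns inside them are not.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, column, value, timestamp):
        family, _, qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family {family!r}")
        cell = self.rows.setdefault(row_key, {}).setdefault((family, qualifier), {})
        cell[timestamp] = value          # each write adds a version

    def get(self, row_key, column, versions=1):
        family, _, qualifier = column.partition(":")
        cell = self.rows.get(row_key, {}).get((family, qualifier), {})
        newest_first = sorted(cell.items(), reverse=True)
        return newest_first[:versions]   # list of (timestamp, value)

    def scan_row_keys(self):
        # Rows are visited in sorted row-key order.
        return sorted(self.rows)


table = TinyTable(["fam1"])
table.put(b"r2", "fam1:c1", b"x", timestamp=1)
table.put(b"r1", "fam1:c1", b"old", timestamp=1)
table.put(b"r1", "fam1:c1", b"new", timestamp=2)
table.put(b"r1", "fam1:anything", b"y", timestamp=1)   # new column, same family

print(table.get(b"r1", "fam1:c1"))
print(table.get(b"r1", "fam1:c1", versions=2))
print(table.scan_row_keys())
```

The design choice to validate only the family name, and accept any qualifier, reflects the point that a column family can have any number of columns while the set of families rarely changes.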
HBASE DATA MODEL (CONT …..)
HBase is based on Google’s Bigtable model
  Key-Value pairs
HBASE DATA MODEL (CONT …..)
HBase schema consists of several Tables
Each table consists of a set of Column Families
  Columns are not part of the schema
HBase has Dynamic Columns
  Because column names are encoded inside the cells
  Different cells can have different columns
“Roles” column family has different columns in different cells
HBASE LOGICAL VIEW
HBASE PHYSICAL MODEL
Each column family is stored in separate files (HFiles)
Key & version numbers are replicated with each column family
Empty cells are not stored
HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
THREE MAJOR COMPONENTS OF HBASE
The HBaseMaster
- One master
The HRegionServer
- Many region servers
The HBase client
HBASE COMPONENTS
Region
- A subset of a table’s rows, like horizontal range partitioning
- Automatically done
RegionServer (many slaves)
- Manages data regions
- Serves data for reads and writes (using a log)
Master
- Responsible for coordinating the slaves
- Assigns regions, detects failures
- Admin functions
Zookeeper
- A centralized service used to maintain configuration information and service for HBase
Catalog Tables
- Keep track of the locations of RegionServers and Regions
HBASE COMPONENTS -- HMASTER
Responsible for coordinating the slaves (HRegionServers)
Assigns regions, detects failures of HRegionServers
Handles schema changes
Master runs several background threads
– LoadBalancer periodically reassigns Regions in the cluster
– CatalogJanitor periodically checks and cleans up the .META. table
Can have multiple Masters
– Upon startup all compete to run the cluster
– If the active Master loses its lease in Zookeeper then the remaining Masters compete for the Master role
HBASE COMPONENTS -- HREGION SERVERS
Serve data for reads and writes of rows contained in Regions
Also will split a Region that has become too large
Interface methods exposed by HRegionInterface
– Data methods, Region methods
– get, put, delete, next, etc.
– splitRegion, compactRegion, etc.
Runs several background threads
– CompactSplitThread checks for splits, handles minor compactions
– MajorCompactionChecker checks for major compactions
– MemStoreFlusher periodically flushes in-memory writes in the MemStore to StoreFiles.
– LogRoller periodically checks the RegionServer's HLog
HBASE COMPONENTS -- REGIONS
HBASE COMPONENTS -- ZOOKEEPER AND CATALOG
Zookeeper Service
– Stores global information about the cluster
– Provides synchronization and detects Master node failure
– Holds the location of the -ROOT- table and the Master Catalog
-ROOT- Catalog Table
– A table that lists the location of the .META. table(s)
– The following is an example of: scan '-ROOT-'
.META. Catalog Table
– A table that lists all the regions and their locations
– The following is an example of: scan '.META.'
HBASE BIG PICTURE
COMPACTION
COMPACTION CONTINUE…..
RUNNING HBASE SHELL
HBASE – USEFUL COMMANDS
help – Lists all the shell commands
status – Shows basic status about the cluster
list – Lists all user tables in HBase
describe '<tablename>' – Returns the structure of the table
CREATE TABLE
create '<tablename>', {NAME => '<colfam>' [, <options>]} [, {…}]
– create 't1', {NAME => 'fam1'}
– create 't1', {NAME => 'fam1', VERSIONS => 1}
– create 't1', {NAME => 'fam1'}, {NAME => 'fam2'}
– Shorthand: create 't1', 'fam1', 'fam2'
ACCESSING DATA IN TABLES
Get
get '<tablename>', '<rowkey>' [, options]
– get 't1', 'r1'
– get 't1', 'r1', {COLUMN => 'fam1:c1'}
– get 't1', 'r1', {COLUMN => 'fam1:c1', VERSIONS => 2}
Put
put '<tablename>', '<rowkey>', '<colfam>:<col>', '<value>' [, timestamp]
– put 't1', 'r1', 'fam1:c1', 'value'
– put 't1', 'r1', 'fam1:c1', 'value', 1274302629663
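To make the optional timestamp on put and the VERSIONS option on get more concrete, here is a minimal sketch of versioned cells. The class `VersionedTable` and its `max_versions` parameter are hypothetical names for illustration; this models the idea only and is not how HBase stores data.

```python
import time
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> 'family:qualifier' -> list of (timestamp, value)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        # As in the shell's put, the timestamp is optional (default: now, in ms).
        ts = int(time.time() * 1000) if timestamp is None else timestamp
        cells = self.rows[row][column]
        cells[:] = [cell for cell in cells if cell[0] != ts]  # same timestamp overwrites
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)    # newest first
        del cells[self.max_versions:]                         # drop the oldest versions

    def get(self, row, column, versions=1):
        # As with VERSIONS => n, return up to n of the newest cells.
        return list(self.rows[row][column][:versions])


t1 = VersionedTable()
t1.put("r1", "fam1:c1", "old", 1274302629663)
t1.put("r1", "fam1:c1", "new", 1274302629664)
latest = t1.get("r1", "fam1:c1")
two_versions = t1.get("r1", "fam1:c1", versions=2)
```

A plain get sees only the newest cell, while asking for two versions also returns the older value written with the explicit timestamp.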
SCAN AND COUNT
Scan
scan '<tablename>' [, {<options>}]
– May include options such as COLUMNS, STARTROW/STOPROW, TIMESTAMP or LIMIT
– scan 't1'
– scan 't1', {COLUMNS => 'fam1:c1'}
– scan 't1', {COLUMNS => 'fam1:'}
– scan 't1', {STARTROW => 'r1', LIMIT => 10}
Count
count '<tablename>' [, interval]
– count 't1', 5000
REMOVING DATA AND TABLES
Delete columns in a row
– delete '<tablename>', '<row key>', '<col>'
– delete 't1', 'r1', 'fam1:c1'
Delete an entire row
– deleteall '<tablename>', '<row key>'
– deleteall 't1', 'r1'
Delete all the rows
– truncate '<tablename>'
Remove a table (disable it first, then drop it)
– disable '<tablename>'
– drop '<tablename>'
– major_compact '.META.'
CHANGING COLUMN FAMILIES
INTRODUCTION TO HIVE
OVERVIEW
Intuitive
– Make unstructured data look like tables regardless of how it is really laid out
– SQL-based queries can be run directly against these tables
– Generates a specific execution plan for each query
What's Hive
– A data warehousing system to store structured data on the Hadoop file system
– Provides an easy way to query this data by executing Hadoop MapReduce plans
99 Introduction to Hive 04/21/23
DATA MODEL
Tables
– Basic type columns (int, float, boolean)
– Complex types: List / Map (associative array)
Partitions
Buckets
CREATE TABLE sales (id INT, items ARRAY<STRUCT<id:INT, name:STRING>>)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;
SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32);
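The idea behind bucketing and bucket sampling can be sketched as hashing the clustering column into a fixed number of buckets and then reading only one of them. The helper names below are hypothetical, and the sketch assumes the hash of an integer id is the id itself, which is an illustrative simplification rather than a statement about Hive's internals.

```python
NUM_BUCKETS = 32


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    # Assumption for illustration: an integer id hashes to itself.
    return row_id % num_buckets


def table_sample(rows, bucket, num_buckets=NUM_BUCKETS):
    # TABLESAMPLE numbers buckets from 1, so bucket 1 corresponds to index 0.
    return [row for row in rows if bucket_of(row["id"], num_buckets) == bucket - 1]


sales = [{"id": i} for i in range(100)]
sample_ids = [row["id"] for row in table_sample(sales, bucket=1)]
```

Sampling one bucket out of 32 touches roughly 1/32 of the rows, which is why a sampled query can be much cheaper than a full scan.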
METADATA
Database namespace
Table definitions
– Schema info, physical location in HDFS
Partition data
ORM framework
– All the metadata can be stored in Derby by default
– Any database with JDBC can be configured
PERFORMANCE
GROUP BY operation
Efficient execution plans based on:
Data skew
– How evenly the data is distributed across a number of physical nodes
– Bottleneck vs. load balance
Partial aggregation
– Group the data with the same group-by value as soon as possible
– In-memory hash table in the mapper
– Happens earlier than the combiner
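Partial aggregation can be sketched as follows: each mapper keeps an in-memory hash table and emits one partial count per distinct key, so far fewer records reach the reducer. The function names and the sample splits are hypothetical; this is a plain-Python illustration, not Hive's execution engine.

```python
from collections import Counter


def map_with_partial_aggregation(rows, key):
    # The mapper aggregates in an in-memory hash table before emitting anything.
    partial = Counter()
    for row in rows:
        partial[row[key]] += 1
    return list(partial.items())  # one (key, partial_count) pair per distinct key


def reduce_counts(mapper_outputs):
    # The reducer only has to add up the partial counts.
    total = Counter()
    for output in mapper_outputs:
        for group_key, count in output:
            total[group_key] += count
    return dict(total)


split_1 = [{"dept": "a"}, {"dept": "a"}, {"dept": "b"}]
split_2 = [{"dept": "a"}, {"dept": "b"}, {"dept": "b"}]
emitted = [map_with_partial_aggregation(split, "dept") for split in (split_1, split_2)]
result = reduce_counts(emitted)
records_shuffled = sum(len(output) for output in emitted)
```

Six input rows shrink to four shuffled records here; with heavily repeated keys the reduction is much larger.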
PERFORMANCE
JOIN operation
Traditional Map-Reduce join
Early map-side join
– Very efficient for joining a small table with a large table
– Keep the smaller table's data in memory first
– Join with a chunk of the larger table's data each time
– Trades space complexity for time complexity
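A map-side join can be sketched as a hash join in which the small table is loaded into memory once and the large table is streamed past it chunk by chunk. The function name and the sample tables are hypothetical; the sketch shows the technique only, not Hive's implementation.

```python
def map_side_join(small_table, large_chunks, key):
    # Build an in-memory lookup from the small table (the space cost).
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large table one chunk at a time; no shuffle or reduce step is needed.
    for chunk in large_chunks:
        for big_row in chunk:
            for small_row in lookup.get(big_row[key], []):
                yield {**small_row, **big_row}


departments = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "Ops"}]
employee_chunks = [
    [{"dept_id": 1, "emp": "ann"}],
    [{"dept_id": 2, "emp": "bob"}, {"dept_id": 3, "emp": "cy"}],
]
joined = list(map_side_join(departments, employee_chunks, "dept_id"))
```

Rows of the large table with no match in the small table (here `dept_id` 3) are simply dropped, which corresponds to an inner join.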
103Introduction to Hive7/20/2010
PERFORMANCE
SerDe (serializer/deserializer)
– Describes how to load the data from the file into a representation that makes it look like a table
Lazy load
– Create the field object only when necessary
– Reduces the overhead of creating unnecessary objects in Hive
– Creating objects in Java is expensive
– Increases performance
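Lazy loading can be sketched as a row wrapper that keeps the raw bytes of a record and converts a field only when a query actually reads it. The class `LazyRow`, its tab delimiter, and the `parsed_count` counter are assumptions made for illustration and do not mirror Hive's SerDe classes.

```python
class LazyRow:
    """Holds a raw line and converts individual fields only on first access."""

    def __init__(self, raw_line, delimiter="\t"):
        self._raw = raw_line
        self._delimiter = delimiter
        self._fields = None      # the raw line is not split until a field is needed
        self._cache = {}         # converted values, keyed by field index
        self.parsed_count = 0    # how many fields were actually converted

    def get(self, index, convert=str):
        if self._fields is None:
            self._fields = self._raw.split(self._delimiter)
        if index not in self._cache:
            self._cache[index] = convert(self._fields[index])
            self.parsed_count += 1
        return self._cache[index]


row = LazyRow("7\tapple\t3.5")
first_field = row.get(0, int)
first_field_again = row.get(0, int)  # served from the cache, not converted again
```

A query such as SELECT count(1) never calls `get` at all, so no field objects are created for it; that matches the intuition for why counting benefits most from lazy deserialization.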
HIVE – PERFORMANCE
QueryA: SELECT count(1) FROM t;
QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
QueryC: SELECT * FROM t;
Map-side time only (incl. GzipCodec for compression/decompression)
Date       SVN Revision  Major Changes                Query A  Query B  Query C
2/22/2009  746906        Before Lazy Deserialization  83 sec   98 sec   183 sec
2/23/2009  747293        Lazy Deserialization         40 sec   66 sec   185 sec
3/6/2009   751166        Map-side Aggregation         22 sec   67 sec   182 sec
4/29/2009  770074        Object Reuse                 21 sec   49 sec   130 sec
6/3/2009   781633        Map-side Join *              21 sec   48 sec   132 sec
8/5/2009   801497        Lazy Binary Format *         21 sec   48 sec   132 sec
* These two features need to be tested with other queries.
http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs
PROS
Pros
– An easy way to process large-scale data
– Supports SQL-based queries
– Provides more user-defined interfaces to extend programmability
– Efficient execution plans for performance
– Interoperability with other database tools
CONS
Cons
– No easy way to append data
– Files in HDFS are immutable
Future work
– Views / Variables
– More operators: In/Exists semantics
– More future work is listed on the mailing list
APPLICATION
Log processing
– Daily reports
– User activity measurement
Data/Text mining
– Machine learning (training data)
Business intelligence
– Advertising delivery
– Spam detection
RELATED WORK
Parallel databases: Gamma, Bubba, Volcano
Google: Sawzall
Yahoo: Pig
IBM: JAQL
Microsoft: DryadLINQ, SCOPE
INTRODUCTION TO CLOUDERA
Software organization started in 2009
Open source Hadoop distribution
Focuses on distribution of various technologies
PRODUCTS AND SERVICES
Annual subscription license
Cloudera Express: CDH + Cloudera Manager
– No rollbacks and no backup/disaster recovery
– Downloaded for free but no technical support
Contains core elements of Hadoop
– Reliable, scalable
– Distributed data processing of large data sets
– Security
– Availability
– Integration with hardware and software
AN INTRODUCTION TO BERKELEY DB
OVERVIEW OF BERKELEY DB
Means the Berkeley Database
An open-source, embedded transactional data management system
A key/value store
Embedded?
– A library that is linked with an application
– Hides data management from the end user
Scales from bytes to petabytes
Runs on everything from cell phones to large servers
BERKELEY DB: EXAMPLES OF APPLICATIONS
Google Accounts
– Stores all user and service account information and preferences
Amazon's user customization
Berkeley DB has high reliability and high performance.
BERKELEY DB: A BRIEF HISTORY (1)
Began life in 1991 as a dynamic linear hashing implementation
– Historic UNIX database libraries: dbm, ndbm and hsearch
Released as a library in 4.4BSD in 1992
– db-1.85 == Hash + B-Tree
The package LIBTP
– Transactional implementation of db-1.85
– A research prototype that was never released
BERKELEY DB: A BRIEF HISTORY (2)
In 1996, Seltzer and Bostic started Sleepycat Software
– For use in the Netscape browser
Berkeley DB 2.0, released in 1997
– Transactional implementation
– The first commercial release
Berkeley DB 3.0, released in 1999
– Transformed into an object-oriented handle-and-method style API
BERKELEY DB: A BRIEF HISTORY (3)
Berkeley DB 4.0, released in 2001
– Single-master, multiple-reader replication
– High availability: replicas can take over for a failed master
– High scalability: read-only replicas can reduce master load
– Similar ideas are adopted in C-Store
In Feb. 2006, Oracle acquired Sleepycat.
SLEEPYCAT PUBLIC LICENSE: A DUAL LICENSE
The code
– Is open source
– May be downloaded and used freely
However, redistribution requires
– Either that the package using Berkeley DB be released as open source
– Or that the distributors obtain a commercial license from Sleepycat (now Oracle, which acquired Sleepycat in Feb. 2006)
BERKELEY DB: PRODUCT FAMILY TODAY
The original Berkeley DB library
Berkeley DB XML
– Built atop the library
Berkeley DB Java Edition
– 100% pure Java implementation
BERKELEY DB : PRODUCT FAMILY ARCHITECTURE
BERKELEY DB: THE DESIGN PHILOSOPHY
Provide mechanisms without specifying policies
For example, Berkeley DB is abstracted as a store of <key, value> pairs.
– Both keys and values are opaque byte-strings.
– I.e., Berkeley DB has no schema,
– and the application that embeds Berkeley DB is responsible for imposing its own schema on the data.
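What "the application imposes its own schema" means can be sketched with a plain dictionary standing in for the store: the store sees only bytes, while the application packs and unpacks a record layout of its own choosing. The account layout, the key format, and the helper names are hypothetical, and no Berkeley DB API is used here.

```python
import struct

# Hypothetical application schema: 4-byte unsigned id, then an 8-byte float balance.
ACCOUNT = struct.Struct("<Id")


def encode_account(account_id, balance):
    return ACCOUNT.pack(account_id, balance)   # application data -> opaque bytes


def decode_account(value):
    return ACCOUNT.unpack(value)               # opaque bytes -> application data


store = {}  # stand-in for the key/value store; it never interprets the bytes
store[b"acct:42"] = encode_account(42, 99.5)
decoded = decode_account(store[b"acct:42"])
record_size = len(store[b"acct:42"])
```

Changing the record layout is purely an application decision; the store keeps working because it never looked inside the values.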
ADVANTAGES OF <KEY, VALUE> PAIRS
An application is free to store data in whatever form is most natural to it.
– Objects (like structures in the C language)
– Rows in Oracle, SQL Server
– Columns in C-Store
Different data formats can be stored in the same database.
– As long as the application understands how to interpret the data items.
INDEXING KEY VALUES
Indexing methods
– B-Tree
– Hash
– Queue
– A record-number-based index implemented atop B-Tree
Data manipulation
– Put: store key/value pairs
– Get: retrieve key/value pairs
– Delete: remove key/value pairs
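The three data manipulation operations can be tried out with Python's standard `dbm.dumb` module, used here purely as a stand-in for an embedded key/value library. It is not Berkeley DB, and the file name `example` is an arbitrary choice for this sketch.

```python
import dbm.dumb
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example")
    # The database is a library object inside this process, not a separate server.
    with dbm.dumb.open(path, "c") as db:
        db[b"fruit"] = b"apple"           # put: store a key/value pair
        fetched = db[b"fruit"]            # get: retrieve the value for a key
        del db[b"fruit"]                  # delete: remove the pair
        still_present = b"fruit" in db
```

Keys and values are bytes on both sides, which fits the opaque byte-string model described for Berkeley DB.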
HOW DO APPLICATIONS ACCESS KEY/VALUE PAIRS?
Through handles on databases
– Similar to relational tables
Or through cursor handles
– Representing a specific place within a database
– Used for iteration, i.e., fetching one key/value pair at a time
Databases are implemented atop the OS file system.
– A file may contain one or more databases.
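A cursor can be pictured as a position in the key order that advances one pair at a time. The sketch below keeps keys in a sorted list, which loosely imitates an ordered index; `SortedStore` and its methods are hypothetical names and not Berkeley DB's cursor API.

```python
import bisect


class SortedStore:
    """Toy ordered key/value store with a cursor-style iterator."""

    def __init__(self):
        self._keys = []      # keys kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def cursor(self, start=b""):
        # Position at the first key >= start, then yield one pair per step.
        position = bisect.bisect_left(self._keys, start)
        while position < len(self._keys):
            key = self._keys[position]
            yield key, self._values[key]
            position += 1


db = SortedStore()
db.put(b"b", b"2")
db.put(b"a", b"1")
db.put(b"c", b"3")
from_b = list(db.cursor(start=b"b"))
```

Starting the cursor at `b"b"` skips the earlier key and then walks the rest in key order, regardless of insertion order.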
BERKELEY DB REPLICATION: A LOG-SHIPPING SYSTEM
A replication group
– A single Master
– One or more read-only Replicas
All write operations must be processed transactionally by the Master.
The Master sends log records to each of the Replicas.
The Replicas apply log records only when they receive a transaction commit record.
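The log-shipping rule above can be sketched as follows: the master ships every log record as it is produced, and a replica buffers the records of a transaction until the matching commit record arrives. The record format (tuples tagged "put" and "commit") and the class names are assumptions made for this illustration.

```python
class Replica:
    def __init__(self):
        self.data = {}        # state visible to readers
        self._pending = {}    # txn id -> buffered (key, value) updates

    def receive(self, record):
        kind, txn_id = record[0], record[1]
        if kind == "put":
            _, _, key, value = record
            self._pending.setdefault(txn_id, []).append((key, value))
        elif kind == "commit":
            # Only now do the buffered updates become visible.
            for key, value in self._pending.pop(txn_id, []):
                self.data[key] = value


class Master:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self._next_txn = 1

    def _ship(self, record):
        for replica in self.replicas:
            replica.receive(record)

    def write(self, updates):
        # All writes go through the master as one transaction.
        txn_id = self._next_txn
        self._next_txn += 1
        for key, value in updates.items():
            self.data[key] = value
            self._ship(("put", txn_id, key, value))
        self._ship(("commit", txn_id))


replica = Replica()
replica.receive(("put", 7, b"k", b"v"))
visible_before_commit = dict(replica.data)
replica.receive(("commit", 7))
visible_after_commit = dict(replica.data)

follower = Replica()
master = Master([follower])
master.write({b"x": b"1", b"y": b"2"})
```

Buffering until commit means readers on a replica never observe a half-applied transaction.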
BERKELEY DB: CONFIGURATION FLEXIBILITY
Configuration flexibility is critical
– Due to a wide range of applications
Three ways
– Compile-time configuration
– Feature set selection
– Runtime configuration
COMPILE TIME CONFIGURATION
Option 1: small footprint build
– Flag: -enable-smallbuild
– For use in a cell phone
– The compiled library contains only the B-Tree index
– Omits replication, cryptography, statistics collection, etc.
– The library is about 0.5 MB
Option 2: higher concurrency locking
– Flag: -enable-fine-grained-lock-manager
– For use in a data center
– Lock-based concurrency control
FEATURE SET SELECTION
1. The Data Store (DS) feature set
– Most similar to the original db-1.85 library
– Good for temporary data storage
2. The Concurrent Data Store (CDS) feature set
– Acquires a single lock per API invocation
– Good for read-mostly applications
3. The Transactional Data Store (TDS) feature set
– Currently the most widely used feature set
– Acquires a single lock per page
4. The High Availability (HA) feature set
– Can continue running even after a site fails
RUNTIME CONFIGURATION
Index selection and tuning
– Applications can select the page size in an index
Trading off durability and performance
– No-force log write
– Extreme case: applications can run completely in memory
Trading off two-phase locking and multiversion concurrency control
Note: C-Store adopts similar ideas for high performance.
CHALLENGES OF BERKELEY DB’S FLEXIBILITY
Requires flexibility from Berkeley DB's designers
Requires flexibility from application developers
INTRODUCTION TO R
WHAT IS R?
The R statistical programming language is a free open source package based on the S language developed by Bell Labs.
The language is very powerful for writing programs.
Many statistical functions are already built in.
Contributed packages expand the functionality to cutting edge research.
Because R is a programming language, you have to write code to complete tasks.
GETTING STARTED
Where to get R?
– Go to www.r-project.org
– Downloads: CRAN
– Set your mirror: any one in the USA is fine.
– Select Windows 95 or later.
– Select base.
– Select R-2.4.1-win32.exe
– The other downloads are for developers who wish to change the source code.
UNT course website for R: http://www.unt.edu/rss/SPLUSclasslinks.html
GETTING STARTED
The R GUI?
GETTING STARTED
Opening a script. This gives you a script window.
GETTING STARTED
Basic assignment and operations.
Arithmetic operations:
– +, -, *, /, ^ are the standard arithmetic operators.
Matrix arithmetic:
– * is element-wise multiplication
– %*% is matrix multiplication
Assignment:
– To assign a value to a variable use "<-"
GETTING STARTED
How to use help in R?
R has a very good help system built in.
If you know which function you want help with, simply use ?_______ with the function in the blank.
– Ex: ?hist
If you don't know which function to use, then use help.search("_______").
– Ex: help.search("histogram")
IMPORTING DATA
How do we get data into R?
Remember we have no point and click…
First make sure your data is in an easy-to-read format such as CSV (Comma Separated Values).
Use code:
D <- read.table("path", sep=",", header=TRUE)
WORKING WITH DATA.
Accessing columns.
D has our data in it… but you can't see it directly.
To select a column use D$column.
WORKING WITH DATA.
Subsetting data.
Use a logical operator to do this.
– ==, >, <, <=, >=, != are all logical operators.
– Note that the "equals" logical operator is two = signs.
Example:
D[D$Gender == "M",]
– This will return the rows of D where Gender is "M".
– Remember R is case sensitive!
– This code does nothing to the original dataset.
D.M <- D[D$Gender == "M",]
– This gives a dataset with the appropriate rows.
BASIC GRAPHICS
Histogram
hist(D$wg)
BASIC GRAPHICS
Add a title…
The "main" argument will give the plot an overall heading.
hist(D$wg, main='Weight Gain')
BASIC GRAPHICS
Adding axis labels…
Use "xlab" and "ylab" to label the X and Y axes, respectively.
hist(D$wg, main='Weight Gain', xlab='Weight Gain', ylab='Frequency')
BASIC GRAPHICS
Changing colors…
Use the col argument.
– ?colors will give you help on the colors.
– Common colors may simply be put in using the name.
hist(D$wg, main="Weight Gain", xlab="Weight Gain", ylab="Frequency", col="blue")
BASIC GRAPHICS – COLORS
BASIC PLOTS
Box Plots
boxplot(D$wg)
BOXPLOTS
Change it!
boxplot(D$wg, main='Weight Gain', ylab='Weight Gain (lbs)')
BOX-PLOTS - GROUPINGS
What if we want several box plots side by side to be able to compare them?
First subset the data into separate variables.
wg.m <- D[D$Gender=="M",]
wg.f <- D[D$Gender=="F",]
Then create the box plot.
boxplot(wg.m$wg, wg.f$wg)
BOXPLOTS – GROUPINGS
BOXPLOTS - GROUPINGS
boxplot(wg.m$wg, wg.f$wg, main='Weight Gain (lbs)', ylab='Weight Gain', names = c('Male','Female'))
BOXPLOT GROUPINGS
Do it by shift
wg.7a <- D[D$Shift=="7am",]
wg.8a <- D[D$Shift=="8am",]
wg.9a <- D[D$Shift=="9am",]
wg.10a <- D[D$Shift=="10am",]
wg.11a <- D[D$Shift=="11am",]
wg.12p <- D[D$Shift=="12pm",]
boxplot(wg.7a$wg, wg.8a$wg, wg.9a$wg, wg.10a$wg, wg.11a$wg, wg.12p$wg,
  main='Weight Gain', ylab='Weight Gain (lbs)', xlab='Shift',
  names = c('7am','8am','9am','10am','11am','12pm'))
BOXPLOTS GROUPINGS
SCATTER PLOTS
Suppose we have two variables and we wish to see the relationship between them.
A scatter plot works very well.
R code:
plot(x,y)
Example:
plot(D$metmin,D$wg)
SCATTERPLOTS
SCATTERPLOTS
plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)')
SCATTERPLOTS
plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain',
xlab='Mets (min)',ylab='Weight Gain (lbs)',pch=2)
LINE PLOTS
Often data comes through time.
Consider Dell stock:
D2 <- read.csv("H:\\Dell.csv",header=TRUE)
t1 <- 1:nrow(D2)
plot(t1,D2$DELL)
LINE PLOTS
LINE PLOTS
plot(t1,D2$DELL,type="l")
LINE PLOTS
plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')
OVERLAYING PLOTS
Often we have more than one variable measured against the same predictor (X).
plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')
lines(t1,D2$Intel)
OVERLAYING GRAPHS
SUMMARY
All of the R code and files can be found at: http://www.cran.r-project.org/