Introduction to Databases on Linux

Introduction to Databases on Linux David Thomas DATABASE TRAINING ARCHITECT


Page 1: Introduction to Databases on Linux

Introduction to Databases on Linux

David Thomas, DATABASE TRAINING ARCHITECT

Page 2: Introduction to Databases on Linux

COURSE/CHAPTER BREAKDOWN

Introduction to Databases on Linux

• Introduction
• Embedded Databases/Flat Files
• Relational Databases
• NoSQL Databases
• Distributed Databases
• Conclusion


Page 3: Introduction to Databases on Linux

Related Courses


Page 4: Introduction to Databases on Linux

Related Courses

Databases Essentials

This is also an introduction-level course, but it also includes installation demos.

CentOS Enterprise Linux 8 Essentials

Most of the labs and demos in this course are done using CentOS 7, but many of the concepts discussed will still apply.

LPI Linux Essentials Certification

This introduction course is based on the LPI exam, but still provides a good general introduction to the Linux OS.

Amazon Aurora - Cloud SQL DB Essentials

If you're looking for more information on using databases in the AWS cloud, this course is a great start.

INTRODUCTION TO DATABASES ON LINUX

Page 5: Introduction to Databases on Linux

What Is a Database?


Page 6: Introduction to Databases on Linux

What Is a Database?

A database is an organized collection of data, typically paired with methods to manipulate that data. The data can be anything from the 1890 census results to the Netflix streaming movie catalogue.

Distributed Databases

Data is replicated across multiple servers within a database cluster.
• Multi-Master: Everyone can read and write.
• Primary-Replica: Everyone can read, but only the primary writes.

Relational Databases

Data is organized, using the relational model, into tables consisting of columns and rows, with a unique key identifying each row.

Embedded Databases and Flat Files

These include early simple databases such as CSV and Berkeley DB, as well as the newer JSON and XML data formats.

NoSQL

Better described as non-relational databases. Data in a NoSQL database is typically stored as key/value pairs.

Page 7: Introduction to Databases on Linux

A Brief History of Embedded Databases


Page 8: Introduction to Databases on Linux

A Brief History of Embedded Databases

1890

In the 1890 US Census, Herman Hollerith created the first computerized flat-file database by tabulating data via hole punches in paper cards.

Page 9: Introduction to Databases on Linux

1972

The IBM Fortran (level H extended) compiler under OS/360 includes support for CSV data.

1991

The effort to remove or replace all code originating in the original AT&T Unix results in the first release of Berkeley DB.

1996

Netscape requests that the authors of Berkeley DB improve and extend the library, which leads to the creation of Sleepycat Software.

1998

The XML 1.0 standard is published.

2001

The JSON format is specified by Douglas Crockford at State Software.

2004

The XML 1.1 standard is published.

Page 10: Introduction to Databases on Linux

2005

CSV is defined as a MIME content type by RFC 4180.

Yahoo! begins offering some web services in JSON.

2006

In February 2006, Sleepycat Software is acquired by Oracle Corporation, which continues to develop and sell Berkeley DB.

The XML 1.1, second edition standard is published.

2008

The XML 1.0, fifth edition standard is published.

2013

Ecma International publishes the first edition of its JSON standard, ECMA-404.

2015

W3C, in an attempt to enhance CSV with formal semantics, publishes the first drafts of recommendations for CSV metadata standards. These become W3C Recommendations in December 2015.

Page 11: Introduction to Databases on Linux

Parsing Files on the Command Line

Binary vs. Text

Berkeley DB is a binary format, whereas CSV, XML, and JSON are all text-based formats.

This means that standard UNIX command-line tools can be used on CSV, XML, and JSON data.

# print department of ID 1
grep '^1,' demo.csv | awk -F, '{ print $3 }'
# print ExpenseCode of ID 2
grep '^2,' demo.csv | awk -F, '{ print $4 }'

# print all names
awk -F, '{ print $2 }' demo.json | awk -F: '{ print $2 }'
grep Name demo.xml | awk -F\> '{ print $2 }' | sed -e 's/<.*//'
# output can be sorted
cat demo.csv | awk -F, '{ print $2 }' | sort
# duplicates can be removed (uniq only drops adjacent duplicates, so sort first)
cat demo.csv | awk -F, '{ print $2 }' | sort | uniq

Speciality Tools

While CLI tools can be used on all text-based formats, speciality tools will provide the most features.

jq is a lightweight and flexible command-line JSON processor. https://stedolan.github.io/jq/
BaseX provides a CLI and GUI client for processing XML data. http://basex.org/

Page 12: Introduction to Databases on Linux

Working with Berkeley DB

Architecture

Berkeley DB (BDB) is far simpler than other DBMSs. It is not based on a client/server model and provides no network access. Data is accessed via in-process API calls.

File Formats

A variety of record formats are supported (shown here as the functions of the Python 2 bsddb module):
hashopen() - open hash format file
btopen() - open btree format file
rnopen() - open DB record format file

[cloud_user@davidthomas1c ~]$ python
Python 2.7.5 (default, Aug 7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> db = bsddb.btopen('btree.db', 'c')
>>> db['0'] = 'David'
>>> db['1'] = 'Clay'
>>> db['2'] = 'Sue'
>>> db['3'] = 'Betty'
>>> print ('The name of student 1 is ' + db['1'])
The name of student 1 is Clay
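The bsddb module used in the session above ships only with Python 2. As a rough sketch of the same in-process key/value idea on Python 3, the standard library's dbm module can stand in. This is an assumption made for illustration: dbm is not Berkeley DB (it picks whichever backend is available on the system), and the file name and the demo_store helper are hypothetical.

```python
import dbm
import os
import tempfile


def demo_store(path):
    """Write four student records to an embedded store and read one back."""
    # 'c' opens the file for reading and writing, creating it if needed.
    with dbm.open(path, "c") as db:
        # Keys and values are stored as bytes.
        db[b"0"] = b"David"
        db[b"1"] = b"Clay"
        db[b"2"] = b"Sue"
        db[b"3"] = b"Betty"
        return db[b"1"].decode("utf-8")


# Use a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    student_1 = demo_store(os.path.join(tmp, "students"))

print("The name of student 1 is " + student_1)
```

As with Berkeley DB, there is no server and no network access here: the calling process opens the file and reads or writes records directly.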

Page 13: Introduction to Databases on Linux

Working with CSV (Comma-Separated Values)

Basic Rules

• Data is stored in fields/columns separated by the comma character.
• Records/rows are terminated by a newline character.
• A specific character encoding, byte order, or line terminator format is not required.
• All records should have the same number of fields, in the same order.
• Data within fields is interpreted as a sequence of characters, not as a sequence of bits or bytes.

Applications

• Most spreadsheet applications, such as Excel or Google Sheets, support CSV.
• Many text editors and programming languages provide support for CSV.
• CSV files can also be manipulated by CLI tools such as grep, sed, awk, sort, and uniq.

% cat demo.csv
ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
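Since many programming languages provide CSV support, here is a minimal sketch of reading the demo.csv rows with Python's standard csv module. The data is held in a string so the example is self-contained; the load_rows helper and the variable names are illustrative, not part of the course material.

```python
import csv
import io

# Same rows as demo.csv, kept in a string so no file is required.
DEMO_CSV = """ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(DEMO_CSV)

# Department of the record with ID 1.
department_of_1 = next(r["Department"] for r in rows if r["ID"] == "1")

# All names, de-duplicated and sorted.
names = sorted({r["Name"] for r in rows})

print(department_of_1)
print(names)
```

Every field comes back as a string, which matches the rule that CSV data is a sequence of characters; converting ExpenseCode to a number is left to the caller.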

Page 14: Introduction to Databases on Linux

Working with JSON (JavaScript Object Notation)

What Is JSON?

JSON stands for JavaScript Object Notation, and is a syntax for storing and exchanging data.

It is a text-based format that is easily usable in the JavaScript language.

jq Client

jq is a lightweight and flexible command-line JSON processor. Written in C, it is intended to be like sed (Stream Editor) for JSON data. It has no dependencies, so the binary can simply be downloaded and run.

• Print all users' Names
  ./jq '.users[].Name' demo.json

• Print name and ID of 2nd user in list
  ./jq '.users[1].Name,.users[1].ID' demo.json

{"users":[
{"ID":"0", "Name":"David", "Department":"Engineering", "ExpenseCode":"200" },
{"ID":"1", "Name":"Clay", "Department":"HR", "ExpenseCode":"100" },
{"ID":"2", "Name":"Sue", "Department":"Sales", "ExpenseCode":"300" },
{"ID":"3", "Name":"Betty", "Department":"Marketing", "ExpenseCode":"400" }
]}
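For comparison with the jq queries, the same two lookups can be sketched with Python's standard json module. The document is embedded as a string here, and the variable names are only illustrative.

```python
import json

# Same structure as demo.json, embedded so the sketch is self-contained.
DEMO_JSON = """
{"users":[
 {"ID":"0", "Name":"David", "Department":"Engineering", "ExpenseCode":"200"},
 {"ID":"1", "Name":"Clay", "Department":"HR", "ExpenseCode":"100"},
 {"ID":"2", "Name":"Sue", "Department":"Sales", "ExpenseCode":"300"},
 {"ID":"3", "Name":"Betty", "Department":"Marketing", "ExpenseCode":"400"}
]}
"""

data = json.loads(DEMO_JSON)

# Equivalent of: jq '.users[].Name'
all_names = [user["Name"] for user in data["users"]]

# Equivalent of: jq '.users[1].Name,.users[1].ID' (indexes start at 0)
second_user = data["users"][1]
second_name_and_id = (second_user["Name"], second_user["ID"])

print(all_names)
print(second_name_and_id)
```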

Page 15: Introduction to Databases on Linux

Working with XML (eXtensible Markup Language)

What Is XML?

XML stands for eXtensible Markup Language and is a markup language much like HTML. It was designed to store and transport data in a way that is both human and machine readable.

Data stored in the XML format can be queried using XQuery, much like SQL is used to query other databases.

BaseX Client

Java-based command line and GUI clients are available:
• Start BaseX client
  java -cp BaseX932.jar org.basex.BaseX
• Print Name of row with ID "1"
  xquery doc('demo.xml')/users/user[ID=1]/Name
• Print ID of user named "Clay"
  xquery doc('demo.xml')/users/user[Name="Clay"]/ID

<?xml version="1.0" encoding="UTF-8"?>
<users>
  <user>
    <ID>0</ID>
    <Name>David</Name>
    <Department>Engineering</Department>
    <ExpenseCode>200</ExpenseCode>
  </user>
  <user>
    <ID>1</ID>
    <Name>Clay</Name>
    <Department>HR</Department>
    <ExpenseCode>100</ExpenseCode>
  </user>
  <user>
    <ID>2</ID>
    <Name>Sue</Name>
    <Department>Sales</Department>
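The two XQuery lookups can be approximated in Python with the standard xml.etree.ElementTree module, which supports a limited XPath subset rather than full XQuery. This is a sketch under assumptions: the sample document is retyped from the demo.csv rows, and the element names (ID, Name, and so on) are assumed to match the ones used in the XQuery examples.

```python
import xml.etree.ElementTree as ET

# Assumed layout of demo.xml, built from the same four demo records.
DEMO_XML = """<users>
  <user><ID>0</ID><Name>David</Name><Department>Engineering</Department><ExpenseCode>200</ExpenseCode></user>
  <user><ID>1</ID><Name>Clay</Name><Department>HR</Department><ExpenseCode>100</ExpenseCode></user>
  <user><ID>2</ID><Name>Sue</Name><Department>Sales</Department><ExpenseCode>300</ExpenseCode></user>
  <user><ID>3</ID><Name>Betty</Name><Department>Marketing</Department><ExpenseCode>400</ExpenseCode></user>
</users>"""

root = ET.fromstring(DEMO_XML)  # root is the <users> element

# Roughly: /users/user[ID=1]/Name
name_of_1 = root.find("user[ID='1']/Name").text

# Roughly: /users/user[Name="Clay"]/ID
id_of_clay = root.find("user[Name='Clay']/ID").text

print(name_of_1)
print(id_of_clay)
```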

Page 16: Introduction to Databases on Linux

A Brief History of Relational Databases


Page 17: Introduction to Databases on Linux

A Brief History of Relational Databases

June 1970

The term "relational database" was coined by E. F. Codd at IBM in a paper published in the Communications of the ACM journal.

Page 18: Introduction to Databases on Linux

1977

Larry Ellison, Bob Miner, and Ed Oates start a consultancy called Software Development Laboratories (SDL).

1979

SDL releases the initial version of Oracle, v2, at 2.3. There was no v1 of Oracle Database, as Larry Ellison "knew no one would want to buy version 1".

1985

The post-Ingres project (POSTGRES) starts, to address problems with the Ingres database.

1989

Microsoft SQL Server for OS/2 begins as a project to port Sybase SQL Server onto OS/2.

1993

Microsoft SQL Server 4.2 for Windows NT is released.

1995

Postgres95 is released with support for SQL. Earlier releases supported only the Ingres-influenced POSTQUEL query language.

Page 19: Introduction to Databases on Linux

1995

MySQL has its initial release on May 23, 1995. MySQL was created by MySQL AB, a Swedish company founded by David Axmark, Allan Larsson, and Michael "Monty" Widenius.

1996

Postgres95 is renamed to PostgreSQL to reflect SQL support. The first PostgreSQL release formed version 6.0 on January 29, 1997.

2008

Sun Microsystems acquires MySQL AB. On January 16, 2008, MySQL AB announced that it had agreed to be acquired by Sun Microsystems for approximately $1 billion.

2009

MariaDB is forked from MySQL due to the Oracle acquisition. Version 5.1 is released on October 29, 2009.

2010

Oracle acquires Sun Microsystems on January 27, 2010.

2017

The Microsoft SQL Server 2017 release adds Linux support.

Page 20: Introduction to Databases on Linux

A Quick Introduction to SQL


Page 21: Introduction to Databases on Linux

A Quick Introduction to SQL

SQL

STRUCTURED QUERY LANGUAGE

SQL, or Structured Query Language, is used to manipulate databases. SQL is an ANSI/ISO standard; however, many implementations add proprietary extensions.

Common Commands

SELECT - extracts data from a database
UPDATE - updates data in a database
INSERT INTO - inserts new data into a database
DELETE - deletes data from a database
CREATE DATABASE - creates a new database
DROP DATABASE - deletes a database
CREATE TABLE - creates a new table
ALTER TABLE - modifies a table
DROP TABLE - deletes a table

SELECT column_name FROM table_name;
UPDATE table_name SET column_name = value;
INSERT INTO table_name (column1, column2) VALUES (value1, value2);
DELETE FROM table_name WHERE condition;

CREATE DATABASE databasename;
DROP DATABASE databasename;

CREATE TABLE table_name (column1 datatype, column2 datatype);
ALTER TABLE table_name ADD column_name datatype;
DROP TABLE table_name;
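To try these statements without installing a server, the following sketch runs most of them against an in-memory SQLite database using Python's standard sqlite3 module. SQLite is swapped in here purely for convenience and is not one of the servers covered in this course; it has no CREATE DATABASE or DROP DATABASE, and the staff table and Email column are hypothetical names.

```python
import sqlite3

# An in-memory database disappears when the connection closes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE
cur.execute(
    "CREATE TABLE staff (ID INTEGER, Name TEXT, Department TEXT, ExpenseCode INTEGER)"
)

# INSERT INTO, using placeholders rather than string formatting
cur.executemany(
    "INSERT INTO staff (ID, Name, Department, ExpenseCode) VALUES (?, ?, ?, ?)",
    [
        (0, "David", "Engineering", 200),
        (1, "Clay", "HR", 100),
        (2, "Sue", "Sales", 300),
        (3, "Betty", "Marketing", 400),
    ],
)

# UPDATE and DELETE, each limited by a WHERE clause
cur.execute("UPDATE staff SET Department = ? WHERE Name = ?", ("HR", "Betty"))
cur.execute("DELETE FROM staff WHERE Name = ?", ("Clay",))

# ALTER TABLE
cur.execute("ALTER TABLE staff ADD COLUMN Email TEXT")

# SELECT
rows = cur.execute("SELECT ID, Name, Department FROM staff ORDER BY ID").fetchall()
print(rows)

# DROP TABLE
cur.execute("DROP TABLE staff")
conn.close()
```

Note that an UPDATE or DELETE without a WHERE clause applies to every row in the table, which is why both statements above are restricted by name.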

Page 22: Introduction to Databases on Linux

Working with MariaDB

MariaDB

As we learned in a previous lesson, MariaDB was forked from MySQL. We will be using MariaDB for the demo during this lesson, but most commands should work on MySQL as well.

A highly flexible RDBMS with support for multiple storage engines.

mysql --host=host_name --user=user_name db_name

Once connected to the server, standard SQL commands are supported.

CREATE TABLE test_table (ID int, Name varchar(50), Department varchar(50), ExpenseCode int);

INSERT INTO test_table (ID, Name, Department, ExpenseCode) VALUES (0, 'David', 'Engineering', 200);
SELECT * FROM test_table;

More information is available via the SHOW command.

-- SHOW commands display information about the server.

SHOW STATUS; - Display server status.
SHOW USER_STATISTICS; - Display information about user activity.
SHOW DATABASES; - Display databases on a given host.
SHOW TABLES; - Display tables in a given database.

Page 23: Introduction to Databases on Linux

Working with Non-Free Databases on Linux


Page 24: Introduction to Databases on Linux

Working with Non-Free Databases on Linux

Licensing

Non-free software licensing agreements limit how this software can be used, in ways that Free and Open Source licenses such as the GPL and BSD licenses do not. These FOSS (Free/Open Source Software) licenses have their own sets of restrictions.

Installation

Installation tends to be similar, as most non-free software vendors supply RPM packages. Many (such as SQL Server) provide these via YUM repositories.

Support

Getting help using non-free databases on Linux will also differ. While users of FOSS software can find support via forum web sites and mailing lists that are accessible to the public, support for non-free software typically comes directly from the software vendor.

Page 25: Introduction to Databases on Linux

Working with PostgreSQL

PostgreSQL is a highly extensible RDBMS, with support for many extensions and stored procedure languages. It supports connections using TCP/IP and local UNIX sockets.

psql

STANDARD COMMAND-LINE CLIENT

psql is the standard CLI client, and is typically included as part of the server installation.

A variety of GUI clients exist, but most admin functions are performed via the CLI client.

psql -h hostname -U username -p 5432 -d dbname

CREATE TABLE test_table (ID int, Name text, Department text, ExpenseCode int);

INSERT INTO test_table (ID, Name, Department, ExpenseCode) VALUES (0, 'David', 'Engineering', 200);
SELECT * FROM test_table;

-- Meta-commands are processed by the psql client and begin with a backslash.
\conninfo - show current connection info
\d[S+] [ pattern ] - list tables
\dn[S+] [ pattern ] - list schemas
\du[S+] [ pattern ] - list roles (users are roles that can log in)

Page 26: Introduction to Databases on Linux

A Brief History of NoSQL Databases


Page 27: Introduction to Databases on Linux

A Brief History of NoSQL Databases

1998

Carlo Strozzi used the term NoSQL to name his lightweight Strozzi NoSQL open-source relational database, which did not expose a SQL interface.

This use differs from the circa-2009 general concept of NoSQL databases, as it was still relational.

Page 28: Introduction to Databases on Linux

2003

Memcached was first developed by Brad Fitzpatrick for his website LiveJournal.

2007

10gen software company began developing MongoDB in 2007 as a component of a planned platform as a service product.

2008

Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik developed Cassandra at Facebook to power the Facebook inbox search feature. It was released as an open-source project on Google Code in July 2008.

2009

Johan Oskarsson, then a developer at Last.fm, reintroduced the term NoSQL to describe open-source distributed, non-relational databases.

Page 29: Introduction to Databases on Linux

2009

In February, 10gen releases MongoDB as an open-source project.

2010

On February 17, Cassandra becomes a top-level Apache project at https://cassandra.apache.org/

2011

Couchbase, Inc. is created as the result of the merger of Membase and CouchOne (a company with many of the principal players behind CouchDB).

2012

In January 2012, Couchbase, Inc. releases Couchbase Server 1.8.

2017

On October 20, 2017, MongoDB became a publicly traded company, listed on NASDAQ as MDB.

Page 30: Introduction to Databases on Linux

Working with Apache Cassandra


Page 31: Introduction to Databases on Linux

Working with Apache Cassandra

Apache Cassandra

Apache Cassandra is an open source, distributed, NoSQL database. It presents a partitioned wide column storage model with eventually consistent semantics.

For performance reasons, Cassandra does not support:
• Cross partition transactions
• Distributed joins
• Foreign keys or referential integrity

$ /home/cloud_user/apache-cassandra-3.11.6/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.05 KiB  256     100.0%            122048be-be7b-4794-8356-3efd6f19a3c3  rack1

Cassandra Query Language

Cassandra provides the Cassandra Query Language (CQL), an SQL-like language, to create and update database schema and access data.

A keyspace in Cassandra is a namespace that defines data replication on nodes. Typically, a cluster has one keyspace per application.

$ /home/cloud_user/apache-cassandra-3.11.6/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE demo WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};

Page 32: Introduction to Databases on Linux

Working with Couchbase


Page 33: Introduction to Databases on Linux

Working with Couchbase

Couchbase Server

An open-source, distributed (shared-nothing architecture) multi-model NoSQL document-oriented database. It was the result of a merger between Membase (a database project based on memcached) and CouchDB.

N1QL

JSON formatted data is stored in documents that consist of a series of key-value (or name-value) pairs. These documents are grouped into buckets. The data can be manipulated via a query language called the non-first normal form query language, or N1QL (pronounced nickel).

# Create new bucket named 'demo-bucket'
couchbase-cli bucket-create -c 127.0.0.1:8091 --username Administrator --password Omgpassword! --bucket demo-bucket --bucket-type couchbase --bucket-ramsize 512

# Connect to Couchbase Query console (CBQ)
/opt/couchbase/bin/cbq -e http://localhost:8091 -u=Administrator

INSERT INTO `demo-bucket` ( KEY, VALUE )
VALUES ( "doc0", {"name": "David", "department": "Engineering"} )
RETURNING META().id AS docid, *;

CREATE PRIMARY INDEX `demo-index` ON `demo-bucket`;
SELECT * FROM `demo-bucket` WHERE name = "Betty";
UPDATE `demo-bucket` SET department = "Sales" WHERE name = "Betty";
DELETE FROM `demo-bucket` WHERE name = "Clay";

Page 34: Introduction to Databases on Linux

Working with Memcached (Not Really a Database)


Page 35: Introduction to Databases on Linux

Memcached uses a client-server architecture where the servers maintain a key-value associative array and the clients populate this array and query it by key. Keys can be up to 250 bytes long, and values can be up to 1 MB in size.

Memcached is not a true database, as the server storage is not persistent. The servers only store the values in RAM, and when the server runs out of RAM, it discards the oldest values. Clients must treat Memcached as a transitory cache, like short-term memory.

Other databases, such as Couchbase Server, do provide persistent storage while still maintaining Memcached protocol compatibility.
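The transitory-cache behavior can be illustrated with a toy in-process cache. This is only a sketch of the idea, not the memcached protocol or any client library: TinyCache, load_name, and the dictionary standing in for a persistent database are all hypothetical, and the eviction rule assumed here is least recently used.

```python
from collections import OrderedDict


class TinyCache:
    """Toy key-value cache that evicts the least recently used entry when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = OrderedDict()

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        # Out of room: discard the entry that has gone unused the longest.
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)
        return self._items[key]


def load_name(cache, database, key):
    """Try the cache first and fall back to the persistent store on a miss."""
    value = cache.get(key)
    if value is None:
        value = database[key]
        cache.set(key, value)
    return value


database = {"0": "David", "1": "Clay", "2": "Sue"}  # stands in for a real database
cache = TinyCache(max_items=2)

for key in ("0", "1", "2"):
    load_name(cache, database, key)

# Key "0" was evicted when "2" arrived, so the client must cope with the miss.
print(cache.get("0"))
print(load_name(cache, database, "0"))
```

The point of the pattern is that a miss is never an error: the client always knows how to rebuild the value from the system of record.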

In February 2018 GitHub was the target of a DDoS attack using memcached servers. The memcached protocol over UDP has an amplification factor of more than 51000. In other words, for every byte the attacker sends out, the victim is receiving up to 51KB. Memcached version 1.5.6 disables the UDP protocol by default.

Working with Memcached (Not Really a Database)

Memcached

Page 36: Introduction to Databases on Linux

Working with MongoDB


Page 37: Introduction to Databases on Linux

Working with MongoDB

MongoDB is a document database designed for ease of development and scaling.

MongoDB began as a component of a planned platform-as-a-service product, but in 2009 10gen (the software company behind MongoDB) shifted to an open-source development model and began offering commercial support and other services.

MongoDB is a publicly traded company, listed on NASDAQ as MDB.

Documents and Collections

In MongoDB, records are stored as documents composed of field and value pairs. These can be thought of as the rows of a traditional relational database.

These documents are then grouped together in collections, similar to tables.

> db.inventory.insertMany([
... { id: 0, name: "David", department: "Engineering", expensecode: 200 },
... { id: 1, name: "Clay", department: "HR", expensecode: 100 },
... { id: 2, name: "Sue", department: "Sales", expensecode: 300 },
... { id: 3, name: "Betty", department: "Marketing", expensecode: 400 },
... ])
{
    "acknowledged" : true,
    "insertedIds" : [
        ObjectId("5eda77cd9742638e05a080c5"),
        ObjectId("5eda77cd9742638e05a080c6"),
        ObjectId("5eda77cd9742638e05a080c7"),
        ObjectId("5eda77cd9742638e05a080c8")
    ]
}
> db.inventory.updateOne(
... { name: "Betty" },
... {
... $set: { department: "HR" }
... }
... )
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
> db.inventory.find( { name: "Betty" } )
{ "_id" : ObjectId("5eda77cd9742638e05a080c8"), "id" : 3, "name" : "Betty", "department" : "HR", "expensecode" : 400 }
>

Page 38: Introduction to Databases on Linux

Advantages of Distributed Databases


Page 39: Introduction to Databases on Linux

Distributed Databases

Multiple servers contain the same data. Updates are replicated between the servers.

Lower Latency

Data is closer to the user, which results in faster response times.

Fault Tolerance

Redundant copies of the data on each server provide backups. If one node dies, other nodes can continue.

Load Distribution

User requests can be spread across multiple servers, allowing more requests to be served.
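As a rough illustration of updates being replicated between servers, here is a toy primary-replica setup in which everyone can read but only the primary accepts writes. The Replica and Primary classes are hypothetical, and the replication is synchronous and in-process; a real cluster does this over the network and has to handle failures and lag.

```python
class Replica:
    """A node that holds a copy of the data and serves reads."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def apply(self, key, value):
        # Apply a change that was accepted by the primary.
        self.data[key] = value


class Primary(Replica):
    """The only node that accepts writes; it forwards each one to its replicas."""

    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = list(replicas)

    def write(self, key, value):
        self.apply(key, value)
        for replica in self.replicas:
            replica.apply(key, value)


replica_a = Replica("replica-a")
replica_b = Replica("replica-b")
primary = Primary("primary", [replica_a, replica_b])

primary.write("3", "Betty")

# Any node can now serve the read, which spreads the load and survives the
# loss of a single node.
reads = [node.read("3") for node in (primary, replica_a, replica_b)]
print(reads)
```

Because the write loop touches every replica before returning, this sketch also hints at the cost of replication: each additional copy adds work to every write.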

Page 40: Introduction to Databases on Linux

Disadvantages of Distributed Databases


Page 41: Introduction to Databases on Linux

Distributed Databases

With all the advantages, why would you ever not use a distributed database?

Keeping it in sync

Data on separate nodes must be kept in sync. This process can increase latency.

More hardware = More problems

More servers means more money and time spent keeping it running.

Security of data

With your data stored on more servers, it is more vulnerable to attack.

Page 42: Introduction to Databases on Linux

What Is High Availability?


Page 43: Introduction to Databases on Linux

What Is High Availability?

A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.

Eliminate single points of failure

Single points of failure are reduced by replicating data between multiple servers.

Detection of failures

Failed nodes should be removed promptly.

Reliable crossover

A load balancer or some type of proxy acts as a gateway to the cluster.
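The three principles can be sketched together as a toy gateway that drops failed nodes and routes requests across the remaining ones. Everything here is hypothetical: the Node and Gateway classes, the healthy flag standing in for a real health check, and the round-robin routing policy are assumptions made for illustration.

```python
class Node:
    """A database server; 'healthy' stands in for a real health check."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Gateway:
    """Toy proxy in front of the cluster that routes requests round robin."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def check_health(self):
        # Detection of failures: remove nodes that no longer respond.
        self.nodes = [node for node in self.nodes if node.healthy]

    def route(self):
        # Reliable crossover: clients only ever talk to the gateway.
        self.check_health()
        if not self.nodes:
            raise RuntimeError("no healthy nodes available")
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node.name


db1, db2, db3 = Node("db1"), Node("db2"), Node("db3")
gateway = Gateway([db1, db2, db3])

before = [gateway.route() for _ in range(3)]

db2.healthy = False  # simulate a node failure
after = [gateway.route() for _ in range(4)]

print(before)
print(after)
```

Having three replicas removes the database as a single point of failure, although in this sketch the gateway itself would still need a standby to avoid becoming one.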

Page 44: Introduction to Databases on Linux

Further Questions?


Page 45: Introduction to Databases on Linux

EMAIL ME

[email protected]

Reach out to me directly with any questions you may have about this course.

Page 46: Introduction to Databases on Linux

TROUBLE WITH LABS?

https://support.linuxacademy.com/hc/en-us/requests/new

For more immediate help with hands-on labs or other site features, you can open a support request at the above URL.

Page 47: Introduction to Databases on Linux

Summary


Page 48: Introduction to Databases on Linux

Summary

Embedded Databases/Flat Files

• Embedded databases and flat files are some of the first examples of electronic databases.

• These include Comma-Separated Values (CSV) and the newer JSON and XML formats.

Page 49: Introduction to Databases on Linux

Summary

Relational Databases

• Data is organized using the relational model into tables of columns and rows.

• They typically use Structured Query Language (SQL) to manipulate data.

Page 50: Introduction to Databases on Linux

Summary

NoSQL Databases

• Non-relational databases are built for scalability and speed.

• They typically store data as key-value pairs.

Page 51: Introduction to Databases on Linux

Summary

Distributed Databases

• Data is replicated across many servers, providing redundancy and performance gains.

• Replication can add latency.