Introduction to Databases on Linux


David Thomas, Database Training Architect

COURSE/CHAPTER BREAKDOWN

Introduction

Embedded Databases/Flat Files

Relational Databases

NoSQL Databases

Distributed Databases

Conclusion




Related Courses

Databases Essentials

This is also an introductory-level course, but it includes installation demos.

CentOS Enterprise Linux 8 Essentials

Most of the labs and demos in this course are done using CentOS 7, but many of the concepts discussed will still apply.

LPI Linux Essentials Certification

This introductory course is based on the LPI exam, but it still provides a good general introduction to the Linux OS.

Amazon Aurora - Cloud SQL DB Essentials

If you're looking for more information on using databases in the AWS cloud, this course is a great start.

What Is a Database?

A database is an organized collection of data. Typically, it also includes methods to manipulate that data. The data can be anything from the 1890 census results to the Netflix streaming movie catalogue.

Distributed Databases

Data is replicated across multiple servers within a database cluster.

• Multi-Master: Everyone can read and write.

• Primary-Replica: Everyone can read, but only the primary writes.
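The primary-replica rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database implements replication: the `Node` and `PrimaryReplicaCluster` names are hypothetical, and replication here is synchronous and in-memory for simplicity.

```python
class Node:
    """One database server holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class PrimaryReplicaCluster:
    """Only the primary accepts writes; every node can serve reads."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def write(self, node, key, value):
        # Reject writes that are sent to a replica.
        if node is not self.primary:
            raise PermissionError(f"{node.name} is read-only")
        node.data[key] = value
        # Copy the change to every replica (synchronously, for simplicity).
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, node, key):
        # Any node, primary or replica, can answer a read.
        return node.data.get(key)


primary = Node("db1")
replica = Node("db2")
cluster = PrimaryReplicaCluster(primary, [replica])

cluster.write(primary, "1", "Clay")
print(cluster.read(replica, "1"))  # Clay
```

A multi-master cluster would drop the check in `write` and would then need some way to resolve conflicting writes, which this sketch does not attempt.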


Relational Databases

Data is organized, using the relational model, into tables consisting of columns and rows, with a unique key identifying each row.

Embedded Databases and Flat Files

These include early, simple databases such as CSV and Berkeley DB, as well as the newer JSON and XML data formats.

NoSQL

Better described as a non-relational database, a NoSQL database typically stores data as key/value pairs.

A Brief History of Embedded Databases




1890: In the 1890 US Census, Herman Hollerith created the first machine-readable flat-file database by tabulating data via holes punched in paper cards.

1972: The IBM Fortran (level H extended) compiler under OS/360 includes support for CSV data.

1991: The effort to remove or replace all code originating in the original AT&T Unix results in the first release of Berkeley DB.

1996: Netscape requests that the authors of Berkeley DB improve and extend the library, which leads to the creation of Sleepycat Software.

1998: The XML 1.0 standard is published.

2001: The JSON format is specified by Douglas Crockford at State Software.

2004: The XML 1.1 standard is published.

2005: CSV is defined as a MIME content type by RFC 4180. Yahoo! begins offering some web services in JSON.

2006: In February, Sleepycat Software is acquired by Oracle Corporation, which continues to develop and sell Berkeley DB. The second edition of the XML 1.1 standard is published.

2008: The fifth edition of the XML 1.0 standard is published.

2013: Ecma International publishes the first edition of its JSON standard, ECMA-404.

2015: W3C, in an attempt to enhance CSV with formal semantics, publishes the first drafts of recommendations for CSV metadata standards. These become W3C Recommendations in December 2015.

Parsing Files on the Command Line

Binary vs. Text

Berkeley DB is a binary format, whereas CSV, XML, and JSON are all text-based formats.

This means that standard UNIX command-line tools can be used on CSV, XML, and JSON data.

Specialty Tools

While CLI tools can be used on all text-based formats, specialty tools provide the most features.

jq is a lightweight and flexible command-line JSON processor. https://stedolan.github.io/jq/

BaseX provides a CLI and GUI client for processing XML data. http://basex.org/

# print department of ID 1 (the trailing comma keeps IDs such as 10 from matching)
grep '^1,' demo.csv | awk -F, '{ print $3 }'
# print ExpenseCode of ID 2
grep '^2,' demo.csv | awk -F, '{ print $4 }'

# print all names (the JSON and XML commands assume one record or element per line)
awk -F, '{ print $2 }' demo.json | awk -F: '{ print $2 }'
grep '<name>' demo.xml | awk -F\> '{ print $2 }' | sed -e 's/<.*//'
# output can be sorted
awk -F, '{ print $2 }' demo.csv | sort
# duplicates can be removed (uniq only drops adjacent duplicates, so sort first)
awk -F, '{ print $2 }' demo.csv | sort | uniq

Working with Berkeley DB

Architecture

Berkeley DB (BDB) is far simpler than other DBMSs. It is not based on a client/server model and provides no network access. Data is accessed via in-process API calls.

File Formats

A variety of record formats are supported:

hashopen() - open hash format file
btopen() - open btree format file
rnopen() - open DB record format file

[cloud_user@davidthomas1c ~]$ python
Python 2.7.5 (default, Aug 7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> db = bsddb.btopen('btree.db', 'c')
>>> db['0'] = 'David'
>>> db['1'] = 'Clay'
>>> db['2'] = 'Sue'
>>> db['3'] = 'Betty'
>>> print ('The name of student 1 is ' + db['1'])
The name of student 1 is Clay
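The bsddb module used in the session above ships only with Python 2. As a rough Python 3 counterpart, the standard library's dbm package offers a similar dictionary-style, in-process key/value store. The sketch below uses dbm.dumb, which is not Berkeley DB and uses its own file format; it is shown only because it is always available. The temporary file path is an assumption made for the example.

```python
import dbm.dumb
import os
import tempfile

# Hypothetical file location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "students")

# 'c' opens the database for reading and writing, creating it if needed.
with dbm.dumb.open(path, "c") as db:
    db["0"] = "David"
    db["1"] = "Clay"
    db["2"] = "Sue"
    db["3"] = "Betty"
    # Keys and values are stored as bytes, so decode before using the value.
    student = db["1"].decode("utf-8")

print("The name of student 1 is " + student)
```

As with Berkeley DB, there is no server and no network access: the data is read and written through ordinary function calls inside the running process.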

Working with CSV (Comma-Separated Values)

Basic Rules

• Data is stored in fields/columns separated by the comma character.

• Records/rows are terminated by a newline character.

• A specific character encoding, byte order, or line terminator format is not required.

• All records should have the same number of fields, in the same order.

• Data within fields is interpreted as a sequence of characters, not as a sequence of bits or bytes.

Applications

• Almost all spreadsheet applications, such as Excel or Google Sheets, support CSV.

• Many text editors and programming languages provide support for CSV.

• CSV files can also be manipulated by CLI tools such as grep, sed, awk, sort, and uniq.

% cat demo.csv
ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
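As one example of the programming-language support mentioned above, here is a minimal sketch using Python's standard csv module. The file contents are inlined as a string (the `DEMO_CSV` name is an assumption for this example) so that the snippet runs without demo.csv on disk.

```python
import csv
import io

# Same content as demo.csv, inlined so the sketch is self-contained.
DEMO_CSV = """ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
"""

# DictReader uses the header row as the field names for every record.
rows = list(csv.DictReader(io.StringIO(DEMO_CSV)))

# Department of the record with ID 1. Every field is read as a string.
department = next(r["Department"] for r in rows if r["ID"] == "1")
print(department)

# All names, sorted alphabetically.
names = sorted(r["Name"] for r in rows)
print(names)
```

Unlike a plain split on commas, the csv module also copes with quoted fields that contain commas, which is where simple command-line parsing tends to break down.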

Working with JSON (JavaScript Object Notation)

What Is JSON?

JSON stands for JavaScript Object Notation, and is a syntax for storing and exchanging data.

It is a text-based format that is easily usable in the JavaScript language.

jq Client

jq is a lightweight and flexible command-line JSON processor. Written in C, it is intended to be like sed (stream editor) for JSON data. It has no dependencies, so the binary can simply be downloaded and run.

• Print all users' names
./jq '.users[].Name' demo.json

• Print the name and ID of the 2nd user in the list
./jq '.users[1].Name,.users[1].ID' demo.json

{"users":[
{"ID":"0", "Name":"David", "Department":"Engineering", "ExpenseCode":"200" },
{"ID":"1", "Name":"Clay", "Department":"HR", "ExpenseCode":"100" },
{"ID":"2", "Name":"Sue", "Department":"Sales", "ExpenseCode":"300" },
{"ID":"3", "Name":"Betty", "Department":"Marketing", "ExpenseCode":"400" }
]}
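For comparison with the jq queries above, the same two lookups can be sketched with Python's standard json module. The document is inlined as a string (the `DEMO_JSON` name is an assumption for this example) rather than read from demo.json.

```python
import json

# Same structure as demo.json, inlined so the sketch is self-contained.
DEMO_JSON = """
{"users": [
  {"ID": "0", "Name": "David", "Department": "Engineering", "ExpenseCode": "200"},
  {"ID": "1", "Name": "Clay", "Department": "HR", "ExpenseCode": "100"},
  {"ID": "2", "Name": "Sue", "Department": "Sales", "ExpenseCode": "300"},
  {"ID": "3", "Name": "Betty", "Department": "Marketing", "ExpenseCode": "400"}
]}
"""

# json.loads turns the text into nested Python dicts and lists.
data = json.loads(DEMO_JSON)

# Equivalent of: jq '.users[].Name'
all_names = [user["Name"] for user in data["users"]]
print(all_names)

# Equivalent of: jq '.users[1].Name,.users[1].ID' (index 1 is the 2nd user)
second = data["users"][1]
print(second["Name"], second["ID"])
```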

Working with XML (eXtensible Markup Language)

What Is XML?

XML stands for eXtensible Markup Language and is a markup language much like HTML. It was designed to store and transport data in a way that is both human and machine readable.

Data stored in the XML format can be queried using XQuery, much like SQL is used to query other databases.

BaseX Client

Java-based command-line and GUI clients are available. Element names are case-sensitive, so the queries use the lowercase names found in demo.xml.

• Start BaseX client
java -cp BaseX932.jar org.basex.BaseX

• Print name of the user with id 1
xquery doc('demo.xml')/users/user[id=1]/name

• Print id of the user named "Clay"
xquery doc('demo.xml')/users/user[name="Clay"]/id

<users>
<user>
<id>0</id>
<name>David</name>
<department>Engineering</department>
<expensecode>200</expensecode>
</user>
<user>
<id>1</id>
<name>Clay</name>
<department>HR</department>
<expensecode>100</expensecode>
</user>
<user>
<id>2</id>
<name>Sue</name>
<department>Sales</department>
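The two lookups shown for BaseX can also be sketched with Python's standard xml.etree.ElementTree module, which supports a small subset of XPath rather than full XQuery. Because the listing above is cut off, the sketch uses a shortened, complete two-user document (the `DEMO_XML` string is an assumption for this example).

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample in the same shape as demo.xml.
DEMO_XML = """
<users>
  <user>
    <id>0</id>
    <name>David</name>
    <department>Engineering</department>
    <expensecode>200</expensecode>
  </user>
  <user>
    <id>1</id>
    <name>Clay</name>
    <department>HR</department>
    <expensecode>100</expensecode>
  </user>
</users>
"""

# The root element returned here is <users>.
root = ET.fromstring(DEMO_XML)

# Name of the user whose <id> child has the text "1".
name = root.find("./user[id='1']/name").text
print(name)

# Id of the user whose <name> child has the text "Clay".
user_id = root.find("./user[name='Clay']/id").text
print(user_id)
```

The predicate syntax `[id='1']` compares the text of a child element, which is why the value is quoted as a string here.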

A Brief History of Relational Databases




June 1970: The term "relational database" is coined by E. F. Codd at IBM in the Communications of the ACM journal.

1977: Larry Ellison, Bob Miner, and Ed Oates start a consultancy called Software Development Laboratories (SDL).

1979: SDL releases the initial version of Oracle, v2, at 2.3. There was no v1 of Oracle Database, as Larry Ellison "knew no one would want to buy version 1."

1985: The post-Ingres project is started to address problems with the Ingres database.

1989: Microsoft SQL Server for OS/2 begins as a project to port Sybase SQL Server onto OS/2.

1993: Microsoft SQL Server 4.2 for Windows NT is released.

1995: Postgres95 is released with support for SQL. Previously, it only supported the Ingres-influenced POSTQUEL query language.

1995: MySQL is created by MySQL AB, a Swedish company founded by David Axmark, Allan Larsson, and Michael "Monty" Widenius. Its initial release is on May 23, 1995.

1996: Postgres95 is renamed PostgreSQL to reflect its SQL support. The first PostgreSQL release, version 6.0, follows on January 29, 1997.

2008: Sun Microsystems acquires MySQL AB. On January 16, 2008, MySQL AB announces that it has agreed to be acquired by Sun Microsystems for approximately $1 billion.

2009: MariaDB is forked from MySQL due to the Oracle acquisition. Version 5.1 is released on October 29, 2009.

2010: Oracle acquires Sun Microsystems on January 27, 2010.

2017: The Microsoft SQL Server 2017 release adds Linux support.

A Quick Introduction to SQL



SQL

SQL, or Structured Query Language, is used to manipulate databases. SQL is an ANSI/ISO standard; however, many implementations add proprietary extensions.

Common Commands

SELECT - extracts data from a database
UPDATE - updates data in a database
INSERT INTO - inserts new data into a database
DELETE - deletes data from a database
CREATE DATABASE - creates a new database
DROP DATABASE - deletes a database
CREATE TABLE - creates a new table
ALTER TABLE - modifies a table
DROP TABLE - deletes a table

STRUCTURED QUERY LANGUAGE

SELECT column_name FROM table_name;
UPDATE table_name SET column_name = value;
INSERT INTO table_name (column1, column2) VALUES (value1, value2);
DELETE FROM table_name WHERE condition;

CREATE DATABASE databasename;
DROP DATABASE databasename;

CREATE TABLE table_name (column1 datatype, column2 datatype);
ALTER TABLE table_name ADD column_name datatype;
DROP TABLE table_name;
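To try the statements above without installing a database server, one option is Python's built-in sqlite3 module. Note that this swaps in SQLite, an embedded database, in place of the server-based systems covered later, so type names and some syntax differ slightly. The `staff` table and its column names are hypothetical, chosen to mirror the demo data used throughout this course.

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define the columns and their types.
cur.execute(
    "CREATE TABLE staff (id INTEGER, name TEXT, department TEXT, expense_code INTEGER)"
)

# INSERT INTO: add rows; "?" placeholders keep values separate from the SQL text.
cur.executemany(
    "INSERT INTO staff (id, name, department, expense_code) VALUES (?, ?, ?, ?)",
    [
        (0, "David", "Engineering", 200),
        (1, "Clay", "HR", 100),
        (2, "Sue", "Sales", 300),
        (3, "Betty", "Marketing", 400),
    ],
)

# UPDATE and DELETE: the WHERE clause limits which rows are affected.
cur.execute("UPDATE staff SET department = ? WHERE name = ?", ("HR", "Betty"))
cur.execute("DELETE FROM staff WHERE name = ?", ("Clay",))

# SELECT: read the remaining rows back.
rows = cur.execute("SELECT name, department FROM staff ORDER BY id").fetchall()
print(rows)

# DROP TABLE: remove the table and its data.
cur.execute("DROP TABLE staff")
conn.close()
```

An UPDATE or DELETE without a WHERE clause applies to every row in the table, so it is worth adding one even in quick experiments.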

Working with MariaDB

MariaDB

A highly flexible RDBMS with support for multiple storage engines.

As we learned in a previous lesson, MariaDB was forked from MySQL. We will be using MariaDB for the demo during this lesson, but most commands should work on MySQL as well.

mysql --host=host_name --user=user_name db_name

Once connected to the server, standard SQL commands are supported.

CREATE TABLE test_table (ID int, Name text, Department text, ExpenseCode int);
INSERT INTO test_table (ID, Name, Department, ExpenseCode) VALUES (0, 'David', 'Engineering', 200);
SELECT * FROM test_table;

More information is available via the SHOW command.

-- SHOW commands display information about the server.
SHOW STATUS; - Displays server status.
SHOW USER_STATISTICS; - Displays information about user activity.
SHOW DATABASES; - Displays databases on a given host.
SHOW TABLES; - Displays tables in a given database.

Working with Non-Free Databases on Linux

Licensing

Non-free software licensing agreements limit how the software can be used in ways that Free and Open Source licenses, such as the GPL and BSD licenses, do not. These FOSS (Free/Open Source Software) licenses have their own sets of restrictions.

Installation

Installation tends to be similar, as most non-free software vendors supply RPM packages. Many (such as SQL Server) provide these via YUM repositories.

Support

Getting help with non-free databases on Linux also differs. While users of FOSS can find support via forum websites and mailing lists that are accessible to the public, support for non-free software typically comes directly from the software vendor.

Working with PostgreSQL

PostgreSQL is a highly extensible RDBMS, with support for many extensions and stored procedure languages. It supports connections using TCP/IP and local UNIX sockets.

psql

STANDARD COMMAND-LINE CLIENT

psql is the standard CLI client, and is typically included as part of the server installation.

A variety of GUI clients exist, but most admin functions are performed via the CLI client.

psql -h hostname -U username -p 5432 -d dbname

CREATE TABLE test_table (ID int, Name text, Department text, ExpenseCode int);
INSERT INTO test_table (ID, Name, Department, ExpenseCode) VALUES (0, 'David', 'Engineering', 200);
SELECT * FROM test_table;

-- Meta-commands are processed by the psql client and begin with a backslash.
\conninfo - show current connection info
\d[S+] [ pattern ] - list tables
\dn[S+] [ pattern ] - list schemas
\du[S+] [ pattern ] - list roles (users are roles that can log in)

A Brief History of NoSQL Databases




1998: Carlo Strozzi uses the term NoSQL to name his lightweight Strozzi NoSQL open-source relational database, which did not expose a SQL interface. This use differs from the circa-2009 general concept of NoSQL databases, as it was still relational.

2003: Memcached is first developed by Brad Fitzpatrick for his website LiveJournal.

2007: The software company 10gen begins developing MongoDB as a component of a planned platform-as-a-service product.

2008: Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik develop Cassandra at Facebook to power the Facebook inbox search feature. It is released as an open-source project on Google Code in July 2008.

2009: Johan Oskarsson, then a developer at Last.fm, reintroduces the term NoSQL to describe open-source, distributed, non-relational databases.

2009: In February, 10gen releases MongoDB as an open-source project.

2010: On February 17, Cassandra becomes a top-level Apache project at https://cassandra.apache.org/

2011: Couchbase, Inc. is created as the result of the merger of Membase and CouchOne (a company with many of the principal players behind CouchDB).

2012: In January, Couchbase, Inc. releases Couchbase Server 1.8.

2017: On October 20, MongoDB becomes a publicly traded company, listed on NASDAQ as MDB.

Working with Apache Cassandra


Apache Cassandra

Apache Cassandra is an open source, distributed, NoSQL database. It presents a partitioned wide column storage model with eventually consistent semantics.

For performance reasons, Cassandra does not support:

• Cross partition transactions

• Distributed joins

• Foreign keys or referential integrity

$ /home/cloud_user/apache-cassandra-3.11.6/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.05 KiB  256     100.0%            122048be-be7b-4794-8356-3efd6f19a3c3  rack1

Cassandra Query Language

Cassandra provides the Cassandra Query Language (CQL), an SQL-like language, to create and update database schema and access data.

A keyspace in Cassandra is a namespace that defines data replication on nodes. A cluster typically contains one keyspace per application.

$ /home/cloud_user/apache-cassandra-3.11.6/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE demo WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};

Working with Couchbase


Couchbase Server

Couchbase Server is an open-source, distributed (shared-nothing architecture), multi-model, document-oriented NoSQL database. It was the result of a merger between Membase (a database project based on memcached) and CouchOne (a company with many of the principal players behind CouchDB).

N1QL

JSON-formatted data is stored in documents that consist of a series of key-value (or name-value) pairs. These documents are grouped into buckets. The data can be manipulated via a query language called the non-first normal form query language, or N1QL (pronounced "nickel").

# Create a new bucket named 'demo-bucket' (the password is quoted so the shell does not interpret the "!")
couchbase-cli bucket-create -c 127.0.0.1:8091 --username Administrator --password 'Omgpassword!' --bucket demo-bucket --bucket-type couchbase --bucket-ramsize 512

# Connect to the Couchbase Query console (CBQ)
/opt/couchbase/bin/cbq -e http://localhost:8091 -u=Administrator

INSERT INTO `demo-bucket` (KEY, VALUE) VALUES ("doc0", {"name": "David", "department": "Engineering"}) RETURNING META().id AS docid, *;

CREATE PRIMARY INDEX `demo-index` ON `demo-bucket`;
SELECT * FROM `demo-bucket` WHERE name = "Betty";
UPDATE `demo-bucket` SET department = "Sales" WHERE name = "Betty";
DELETE FROM `demo-bucket` WHERE name = "Clay";

Working with Memcached (Not Really a Database)


Memcached uses a client-server architecture where the servers maintain a key-value associative array, and the clients populate this array and query it by key. Keys can be up to 250 bytes long, and values can be up to 1 MB in size by default.

Memcached is not a true database, as the server storage is not persistent. The servers only store values in RAM, and when a server runs out of RAM, it discards the least recently used values. Clients must treat Memcached as a transitory cache, like short-term memory.
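The eviction behavior described above can be sketched as a tiny least-recently-used cache in Python. This is an illustration only: the `TinyCache` class is hypothetical, and it limits the number of items, whereas a real Memcached server limits the amount of memory used.

```python
from collections import OrderedDict


class TinyCache:
    """In-memory key/value cache that evicts the least recently used item."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.items = OrderedDict()  # oldest entries first, newest last

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)  # mark as most recently used
        # Over capacity: drop entries from the least recently used end.
        while len(self.items) > self.max_items:
            self.items.popitem(last=False)

    def get(self, key):
        if key not in self.items:
            return None  # cache miss: the caller must fetch from the real database
        self.items.move_to_end(key)  # a read also counts as a use
        return self.items[key]


cache = TinyCache(max_items=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # "a" is now the most recently used entry
cache.set("c", 3)      # cache is full, so "b" is evicted
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

Because any value can disappear at any time, code that uses a cache like this always needs a fallback path to the authoritative data store.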

Other databases, such as Couchbase Server, do provide persistent storage while still maintaining Memcached protocol compatibility.

In February 2018, GitHub was the target of a DDoS attack that used memcached servers. The memcached protocol over UDP has an amplification factor of roughly 51,000. In other words, for every byte the attacker sends out, the victim can receive up to about 51 KB. Memcached version 1.5.6 disables the UDP protocol by default.


Memcached

Working with MongoDB


MongoDB began as a component of a planned platform-as-a-service product, but in 2009 10gen (the software company behind MongoDB) shifted to an open-source development model and began offering commercial support and other services.

MongoDB, Inc. is a publicly traded company, listed on NASDAQ as MDB.

Documents and Collections

In MongoDB, records are stored as documents composed of field and value pairs. These can be thought of like rows in a traditional relational database.

These documents are then grouped together in collections, which are similar to tables.

> db.inventory.insertMany([
... { id: 0, name: "David", department: "Engineering", expensecode: 200 },
... { id: 1, name: "Clay", department: "HR", expensecode: 100 },
... { id: 2, name: "Sue", department: "Sales", expensecode: 300 },
... { id: 3, name: "Betty", department: "Marketing", expensecode: 400 },
... ])
{
  "acknowledged" : true,
  "insertedIds" : [
    ObjectId("5eda77cd9742638e05a080c5"),
    ObjectId("5eda77cd9742638e05a080c6"),
    ObjectId("5eda77cd9742638e05a080c7"),
    ObjectId("5eda77cd9742638e05a080c8")
  ]
}
> db.inventory.updateOne(
... { name: "Betty" },
... {
... $set: { department: "HR" }
... }
... )
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
> db.inventory.find( { name: "Betty" } )
{ "_id" : ObjectId("5eda77cd9742638e05a080c8"), "id" : 3, "name" : "Betty", "department" : "HR", "expensecode" : 400 }
>
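To make the document and collection idea concrete without a running server, the session above can be imitated with plain Python data structures: a document is a dict of field/value pairs, and a collection is a list of documents. The `find` and `update_one` helpers below are hypothetical stand-ins written for this sketch. They are not the MongoDB API and only support exact-match queries.

```python
# A collection is modeled as a list; each document is a dict of field/value pairs.
inventory = [
    {"id": 0, "name": "David", "department": "Engineering", "expensecode": 200},
    {"id": 1, "name": "Clay", "department": "HR", "expensecode": 100},
    {"id": 2, "name": "Sue", "department": "Sales", "expensecode": 300},
    {"id": 3, "name": "Betty", "department": "Marketing", "expensecode": 400},
]


def matches(doc, query):
    """True if the document has every field/value pair given in the query."""
    return all(doc.get(field) == value for field, value in query.items())


def find(collection, query):
    """Return all documents that match the query."""
    return [doc for doc in collection if matches(doc, query)]


def update_one(collection, query, changes):
    """Apply changes to the first matching document; return the number modified."""
    for doc in collection:
        if matches(doc, query):
            doc.update(changes)
            return 1
    return 0


modified = update_one(inventory, {"name": "Betty"}, {"department": "HR"})
print(modified)
print(find(inventory, {"name": "Betty"}))
```

Unlike rows in a relational table, the documents in a collection are not required to share the same fields, which a list of dicts models naturally.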

MongoDB is a document database designed for ease of development and scaling.

Advantages of Distributed Databases


Multiple servers contain the same data. Updates are replicated between the servers.

Lower Latency

Data is closer to the user, which results in faster response times.

Fault Tolerance

Redundant copies of the data on each server provide backups. If one node dies, other nodes can continue.

Load Distribution

User requests can be spread across multiple servers, allowing more requests to be served.


Disadvantages of Distributed Databases


With all the advantages, why would you ever not use a distributed database?

Keeping It in Sync

Data on separate nodes must be kept in sync. This process can increase latency.

More Hardware = More Problems

More servers means more money and time spent keeping them running.

Security of Data

With your data stored on more servers, it is more vulnerable to attack.

What Is High Availability?


High availability is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.

Eliminate single points of failure

Single points of failure are reduced by replicating data between multiple servers.

Detection of failures

Failed nodes should be removed promptly.

Reliable crossover

A load balancer or some type of proxy acts as a gateway to the cluster.
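The last two points can be sketched together: a gateway that stops routing to nodes once they are marked as failed. This is a minimal illustration with hypothetical names (`Gateway`, `mark_failed`, `route`). A real load balancer or proxy would detect failures itself, typically with periodic health checks, instead of being told about them.

```python
class Gateway:
    """Routes requests to healthy nodes only (a stand-in for a load balancer or proxy)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)    # all known node names, in rotation order
        self.healthy = set(nodes)   # nodes currently allowed to receive traffic
        self._next = 0              # position in the round-robin rotation

    def mark_failed(self, node):
        # Detection of failures: take the node out of rotation promptly.
        self.healthy.discard(node)

    def mark_recovered(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the next healthy node in round-robin order."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


gateway = Gateway(["db1", "db2", "db3"])
gateway.mark_failed("db2")

# Clients keep talking to the gateway and never see the failed node.
picked = [gateway.route() for _ in range(4)]
print(picked)  # ['db1', 'db3', 'db1', 'db3']
```

Note that the gateway itself is now a potential single point of failure, so in practice it is usually deployed redundantly as well.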

Further Questions?


EMAIL ME

Reach out to me directly with any questions you may have about this course.

TROUBLE WITH LABS?

For more immediate help with hands-on labs or other site features, you can open a support request at the URL below.

https://support.linuxacademy.com/hc/en-us/requests/new

Summary


Embedded Databases/Flat Files

• Embedded databases and flat files are some of the first examples of electronic databases.

• These include Comma-Separated Values (CSV) and the newer JSON and XML formats.

Relational Databases

• Data is organized using the relational model into tables of columns and rows.

• They typically use Structured Query Language (SQL) to manipulate data.

NoSQL Databases

• Non-relational databases are built for scalability and speed.

• They typically store data as key-value pairs.

Distributed Databases

• Data is replicated across many servers, providing redundancy and performance gains.

• Replication can add latency.
