Introduction to Databases on Linux
David Thomas, DATABASE TRAINING ARCHITECT
COURSE/CHAPTER BREAKDOWN
Introduction to Databases on Linux
• Introduction
• Embedded Databases/Flat Files
• Relational Databases
• NoSQL Databases
• Distributed Databases
• Conclusion
Related Courses
• Databases Essentials: This is also an introduction-level course, but it also includes installation demos.
• CentOS Enterprise Linux 8 Essentials: Most of the labs and demos in this course are done using CentOS 7, but many of the concepts discussed will still apply.
• LPI Linux Essentials Certification: This introduction course is based on the LPI exam, but still provides a good general introduction to the Linux OS.
• Amazon Aurora - Cloud SQL DB Essentials: If you're looking for more information on using databases in the AWS cloud, this course is a great start.
What Is a Database?
A database is an organized collection of data, typically including methods to manipulate that data. The data can be anything from the 1890 census results to the Netflix streaming movie catalogue.
Distributed Databases
Data is replicated across multiple servers within a database cluster.
• Multi-Master: Everyone can read and write.
• Primary-Replica: Everyone can read, but only the primary writes.
Relational Databases
Data is organized, using the relational model, into tables consisting of columns and rows, with a unique key identifying each row.
Embedded Databases and Flat Files
These include early simple databases such as CSV and Berkeley DB, as well as the newer JSON and XML data formats.
NoSQL
Better described as non-relational databases. Data in a NoSQL database is typically stored as key/value pairs.
A Brief History of Embedded Databases
1890: For the 1890 US Census, Herman Hollerith created the first computerized flat-file database by tabulating data via holes punched in paper cards.
1972: The IBM Fortran (level H extended) compiler under OS/360 includes support for CSV data.
1991: The effort to remove or replace all code originating in the original AT&T Unix results in the first release of Berkeley DB.
1996: Netscape requests that the authors of Berkeley DB improve and extend the library. This leads to the creation of Sleepycat Software.
1998: The XML 1.0 standard is published.
2001: The JSON format is specified by Douglas Crockford at State Software.
2004: The XML 1.1 standard is published.
2005: CSV is defined as a MIME Content Type by RFC 4180. Yahoo! begins offering some web services in JSON.
2006: In February, Sleepycat Software is acquired by Oracle Corporation, which continues to develop and sell Berkeley DB. The XML 1.1, second edition standard is published.
2008: The XML 1.0, fifth edition standard is published.
2013: Ecma International publishes the first edition of its JSON standard, ECMA-404.
2015: W3C, in an attempt to enhance CSV with formal semantics, publishes recommendations for CSV metadata standards. The first drafts became recommendations in December 2015.
Parsing Files on the Command Line
Binary vs. Text
Berkeley DB is a binary format, whereas CSV, XML, and JSON are all text-based formats.
This means that standard UNIX command-line tools can be used on CSV, XML, and JSON format data.
Speciality Tools
While CLI tools can be used on all text-based formats, speciality tools will provide the most features.
• jq is a lightweight and flexible command-line JSON processor. https://stedolan.github.io/jq/
• BaseX provides a CLI and GUI client for processing XML data. http://basex.org/
# print department of ID 1
grep '^1,' demo.csv | awk -F, '{ print $3 }'
# print ExpenseCode of ID 2
grep '^2,' demo.csv | awk -F, '{ print $4 }'
# print all names
awk -F, '{ print $2 }' demo.json | awk -F: '{ print $2 }'
grep Name demo.xml | awk -F\> '{ print $2 }' | sed -e 's/<.*//'
# output can be sorted
awk -F, '{ print $2 }' demo.csv | sort
# duplicates can be removed (uniq only collapses adjacent lines, so sort first)
awk -F, '{ print $2 }' demo.csv | sort | uniq
Working with Berkeley DB
Architecture
Berkeley DB (BDB) is far simpler than other DBMSs. It is not based on a client/server model and provides no network access. Data is accessed via in-process API calls.
File Formats
A variety of record formats are supported. In the Python bsddb module, each has its own open function:
• hashopen() - open hash format file
• btopen() - open btree format file
• rnopen() - open DB record format file
[cloud_user@davidthomas1c ~]$ python
Python 2.7.5 (default, Aug 7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bsddb
>>> db = bsddb.btopen('btree.db', 'c')
>>> db['0'] = 'David'
>>> db['1'] = 'Clay'
>>> db['2'] = 'Sue'
>>> db['3'] = 'Betty'
>>> print ('The name of student 1 is ' + db['1'])
The name of student 1 is Clay
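The bsddb module used in the session above shipped with Python 2 and is not part of the Python 3 standard library. As a rough sketch of the same embedded key/value pattern, the example below uses the standard dbm module as a stand-in. This is an assumption for illustration: dbm picks whichever backend is available on the system and is not guaranteed to be Berkeley DB or a btree file. The file name "students" is hypothetical.

```python
import dbm
import os
import tempfile

# Keep the database file in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "students")  # hypothetical file name

    # 'c' opens for reading and writing, creating the file if needed,
    # mirroring the flag passed to bsddb.btopen() in the session above.
    with dbm.open(path, "c") as db:
        db["0"] = "David"
        db["1"] = "Clay"
        db["2"] = "Sue"
        db["3"] = "Betty"

    # Reopen read-only. dbm returns values as bytes, so decode before use.
    with dbm.open(path, "r") as db:
        student = db["1"].decode("utf-8")

print("The name of student 1 is " + student)
```

As with Berkeley DB, there is no server process here: the data is read and written through in-process library calls.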
Working with CSV (Comma Separated Values)
Basic Rules
• Data is stored in fields/columns separated by the comma character.
• Records/rows are terminated by a newline character.
• A specific character encoding, byte order, or line terminator format is not required.
• All records should have the same number of fields, in the same order.
• Data within fields is interpreted as a sequence of characters, not as a sequence of bits or bytes.
Applications
• Most spreadsheet applications, like Excel or Google Sheets, support CSV.
• Many text editors and programming languages provide support for CSV.
• CSV files can also be manipulated by CLI tools such as grep, sed, awk, sort, and uniq.
% cat demo.csv
ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
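Since many programming languages provide CSV support, the lookups done earlier with grep and awk can also be written in a few lines of code. The following is a minimal sketch using Python's standard csv module. The demo.csv contents are inlined as a string so the example is self-contained, and the variable names are illustrative only.

```python
import csv
import io

# The demo.csv contents from this lesson, inlined for a self-contained sketch.
DEMO_CSV = """ID,Name,Department,ExpenseCode
0,David,Engineering,200
1,Clay,HR,100
2,Sue,Sales,300
3,Betty,Marketing,400
"""

# DictReader treats the first record as the field names.
rows = list(csv.DictReader(io.StringIO(DEMO_CSV)))

# Department of the record with ID 1. Fields are read as character strings.
department = next(r["Department"] for r in rows if r["ID"] == "1")

# All names, sorted, with duplicates removed (similar to sort | uniq).
names = sorted({r["Name"] for r in rows})

print(department)
print(names)
```

Unlike the grep and awk pipeline, the parser also copes with quoted fields that contain commas.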
Working with JSON (JavaScript Object Notation)
What Is JSON?
JSON stands for JavaScript Object Notation, and is a syntax for storing and exchanging data.
It is a text-based format that is easily usable in the JavaScript language.
jq Client
jq is a lightweight and flexible command-line JSON processor. Written in C, it is intended to be like sed (Stream Editor) for JSON data. It has no dependencies, so the binary can be downloaded and run.
• Print all users' Names
  ./jq '.users[].Name' demo.json
• Print name and ID of 2nd user in list
  ./jq '.users[1].Name,.users[1].ID' demo.json
{"users":[
  {"ID":"0", "Name":"David", "Department":"Engineering", "ExpenseCode":"200" },
  {"ID":"1", "Name":"Clay", "Department":"HR", "ExpenseCode":"100" },
  {"ID":"2", "Name":"Sue", "Department":"Sales", "ExpenseCode":"300" },
  {"ID":"3", "Name":"Betty", "Department":"Marketing", "ExpenseCode":"400" }
]}
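For comparison with the jq queries above, here is a minimal sketch of the same two lookups using Python's standard json module. The demo.json contents are inlined so the example runs on its own, and the variable names are illustrative.

```python
import json

# The demo.json contents from this lesson, inlined for a self-contained sketch.
DEMO_JSON = """
{"users":[
  {"ID":"0", "Name":"David", "Department":"Engineering", "ExpenseCode":"200"},
  {"ID":"1", "Name":"Clay", "Department":"HR", "ExpenseCode":"100"},
  {"ID":"2", "Name":"Sue", "Department":"Sales", "ExpenseCode":"300"},
  {"ID":"3", "Name":"Betty", "Department":"Marketing", "ExpenseCode":"400"}
]}
"""

data = json.loads(DEMO_JSON)

# Equivalent of: jq '.users[].Name'
all_names = [user["Name"] for user in data["users"]]

# Equivalent of: jq '.users[1].Name,.users[1].ID'
second_user = data["users"][1]
second_name_and_id = (second_user["Name"], second_user["ID"])

print(all_names)
print(second_name_and_id)
```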
Working with XML (eXtensible Markup Language)
What Is XML?
XML stands for eXtensible Markup Language and is a markup language much like HTML. It was designed to store and transport data in a way that is both human and machine readable.
Data stored in the XML format can be queried using XQuery, much like SQL is used to query other databases.
BaseX Client
Java-based command line and GUI clients are available:
• Start BaseX client
  java -cp BaseX932.jar org.basex.BaseX
• Print Name of row with ID "1"
  xquery doc('demo.xml')/users/user[ID=1]/Name
• Print ID of user named "Clay"
  xquery doc('demo.xml')/users/user[Name="Clay"]/ID
<?xml version="1.0" encoding="UTF-8"?>
<users>
  <user>
    <ID>0</ID>
    <Name>David</Name>
    <Department>Engineering</Department>
    <ExpenseCode>200</ExpenseCode>
  </user>
  <user>
    <ID>1</ID>
    <Name>Clay</Name>
    <Department>HR</Department>
    <ExpenseCode>100</ExpenseCode>
  </user>
  <user>
    <ID>2</ID>
    <Name>Sue</Name>
    <Department>Sales</Department>
    <ExpenseCode>300</ExpenseCode>
  </user>
  <user>
    <ID>3</ID>
    <Name>Betty</Name>
    <Department>Marketing</Department>
    <ExpenseCode>400</ExpenseCode>
  </user>
</users>
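The two XQuery lookups above can be approximated without BaseX. The sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited XPath subset rather than full XQuery. It assumes demo.xml uses the capitalized element names (ID, Name) that the XQuery examples expect, and it inlines a shortened copy of the data so it can run on its own.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the demo data. Element names are assumed to match
# the ones used in the XQuery examples (ID, Name, Department, ExpenseCode).
DEMO_XML = """<users>
  <user><ID>0</ID><Name>David</Name><Department>Engineering</Department><ExpenseCode>200</ExpenseCode></user>
  <user><ID>1</ID><Name>Clay</Name><Department>HR</Department><ExpenseCode>100</ExpenseCode></user>
  <user><ID>2</ID><Name>Sue</Name><Department>Sales</Department><ExpenseCode>300</ExpenseCode></user>
</users>"""

root = ET.fromstring(DEMO_XML)  # root is the <users> element

# Roughly: doc('demo.xml')/users/user[ID=1]/Name
# The [tag='text'] predicate selects <user> elements whose <ID> child has text "1".
name_of_id_1 = root.find("./user[ID='1']/Name").text

# Roughly: doc('demo.xml')/users/user[Name="Clay"]/ID
id_of_clay = root.find("./user[Name='Clay']/ID").text

print(name_of_id_1)
print(id_of_clay)
```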
A Brief History of Relational Databases
June 1970: The term "relational database" is coined by E. F. Codd at IBM in a paper published in Communications of the ACM.
1977: Larry Ellison, Bob Miner, and Ed Oates start a consultancy called Software Development Laboratories (SDL).
1979: SDL releases the initial version of Oracle, v2, at 2.3. There was no v1 of Oracle Database, as Larry Ellison "knew no one would want to buy version 1".
1985: The post-Ingres project (Postgres) is started to address problems with the Ingres database.
1989: Microsoft SQL Server for OS/2 begins as a project to port Sybase SQL Server onto OS/2.
1993: Microsoft SQL Server 4.2 for Windows NT is released.
1995: Postgres95 is released with support for SQL. Earlier releases supported only the Ingres-influenced POSTQUEL query language.
1995: MySQL is created by MySQL AB, a Swedish company founded by David Axmark, Allan Larsson, and Michael "Monty" Widenius. Initial release on May 23, 1995.
1996: Postgres95 is renamed to PostgreSQL to reflect its SQL support. The first PostgreSQL release was version 6.0, on January 29, 1997.
2008: Sun Microsystems acquires MySQL AB. On January 16, 2008, MySQL AB announced that it had agreed to be acquired by Sun Microsystems for approximately $1 billion.
2009: MariaDB is forked from MySQL due to the Oracle acquisition. Version 5.1 is released on October 29, 2009.
2010: Oracle acquires Sun Microsystems on January 27, 2010.
2017: The Microsoft SQL Server 2017 release adds Linux support.
A Quick Introduction to SQL
SQL
SQL, or Structured Query Language, is used to manipulate databases. SQL is an ANSI/ISO standard; however, many implementations add proprietary extensions.
Common Commands
• SELECT - extracts data from a database
• UPDATE - updates data in a database
• INSERT INTO - inserts new data into a database
• DELETE - deletes data from a database
• CREATE DATABASE - creates a new database
• DROP DATABASE - deletes a database
• CREATE TABLE - creates a new table
• ALTER TABLE - modifies a table
• DROP TABLE - deletes a table
STRUCTURED QUERY LANGUAGE
SELECT column_name FROM table_name;
UPDATE table_name SET column_name = value;
INSERT INTO table_name ( column1, column2) VALUES ( value1, value2);
DELETE FROM table_name WHERE condition;
CREATE DATABASE databasename;
DROP DATABASE databasename;
CREATE TABLE table_name ( column1 datatype, column2 datatype);
ALTER TABLE table_name ADD column_name datatype;
DROP TABLE table_name;
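To try these statements without installing a database server, the following sketch runs most of them against an in-memory SQLite database through Python's standard sqlite3 module. SQLite is used here only as a convenient stand-in and is not one of the servers covered in this course. The table name, column names, and the Email column are illustrative assumptions. CREATE DATABASE and DROP DATABASE are left out because an SQLite database is simply a file (or, here, memory).

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE - creates a new table
cur.execute(
    "CREATE TABLE test_table "
    "(ID INTEGER, Name TEXT, Department TEXT, ExpenseCode INTEGER)"
)

# INSERT INTO - inserts new data (placeholders keep values out of the SQL text)
cur.executemany(
    "INSERT INTO test_table (ID, Name, Department, ExpenseCode) VALUES (?, ?, ?, ?)",
    [
        (0, "David", "Engineering", 200),
        (1, "Clay", "HR", 100),
        (2, "Sue", "Sales", 300),
        (3, "Betty", "Marketing", 400),
    ],
)

# UPDATE - updates existing data
cur.execute("UPDATE test_table SET Department = ? WHERE Name = ?", ("HR", "Betty"))

# DELETE - deletes data
cur.execute("DELETE FROM test_table WHERE Name = ?", ("Clay",))

# ALTER TABLE - modifies a table (Email is a hypothetical extra column)
cur.execute("ALTER TABLE test_table ADD COLUMN Email TEXT")

# SELECT - extracts data
rows = cur.execute(
    "SELECT ID, Name, Department FROM test_table ORDER BY ID"
).fetchall()
print(rows)

# DROP TABLE - deletes a table
cur.execute("DROP TABLE test_table")
conn.close()
```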
Working with MariaDB
MariaDB
A highly flexible RDBMS with support for multiple storage engines.
As we learned in a previous lesson, MariaDB was forked from MySQL. We will be using MariaDB for the demo during this lesson, but most commands should work on MySQL as well.
mysql --host=host_name --user=user_name db_name
Once connected to the server, standard SQL commands are supported.
CREATE TABLE test_table ( ID int, Name varchar(255), Department varchar(255), ExpenseCode int);
INSERT INTO test_table ( ID, Name, Department, ExpenseCode) VALUES ( 0, 'David', 'Engineering', 200);
SELECT * FROM test_table;
More information is available via the SHOW command. SHOW commands display information about the server.
• SHOW STATUS; - Display server status.
• SHOW USER_STATISTICS; - Display information about user activity.
• SHOW DATABASES; - Display databases on a given host.
• SHOW TABLES; - Display tables in a given database.
Working with Non-Free Databases on Linux
Licensing
Non-free software licensing agreements limit how this software can be used, in ways that Free and Open Source licenses such as the GPL and BSD licenses do not. These FOSS (Free/Open Source Software) licenses have their own sets of restrictions.
Installation
Installation tends to be similar, as most non-free software vendors supply RPM packages. Many (such as SQL Server) provide these via YUM repositories.
Support
Getting help using non-free databases on Linux will also differ. While users of FOSS software can find support via forum web sites and mailing lists that are accessible to the public, support for non-free software typically comes directly from the software vendor.
Working with PostgreSQL
PostgreSQL is a highly extensible RDBMS, with support for many extensions and stored procedure languages. It supports connections using TCP/IP and local UNIX sockets.
psql
STANDARD COMMAND-LINE CLIENT
psql is the standard CLI client, and is typically included as part of the server installation.
A variety of GUI clients exist, but most admin functions are performed via the CLI client.
psql -h hostname -U username -p 5432 -d dbname
CREATE TABLE test_table ( ID int, Name text, Department text, ExpenseCode int);
INSERT INTO test_table ( ID, Name, Department, ExpenseCode) VALUES ( 0, 'David', 'Engineering', 200);
SELECT * FROM test_table;
Meta-commands are processed by the psql client and begin with a backslash:
• \conninfo - show current connection info
• \d[S+] [ pattern ] - list tables
• \dn[S+] [ pattern ] - list schemas
• \du[S+] [ pattern ] - list roles (users are roles that can log in)
A Brief History of NoSQL Databases
1998: Carlo Strozzi uses the term NoSQL to name his lightweight, open-source Strozzi NoSQL relational database, which did not expose a SQL interface. This use differs from the circa-2009 general concept of NoSQL databases, as it was still relational.
2003: Memcached is first developed by Brad Fitzpatrick for his website LiveJournal.
2007: The 10gen software company begins developing MongoDB as a component of a planned platform-as-a-service product.
2008: Avinash Lakshman, one of the authors of Amazon's Dynamo, and Prashant Malik develop Cassandra at Facebook to power the Facebook inbox search feature. It is released as an open-source project on Google Code in July 2008.
2009: Johan Oskarsson, then a developer at Last.fm, reintroduces the term NoSQL to describe open-source, distributed, non-relational databases. In February, 10gen releases MongoDB as an open-source project.
2010: On February 17, Cassandra becomes a top-level Apache project at https://cassandra.apache.org/
2011: Couchbase, Inc. is created as the result of the merger of Membase and CouchOne (a company with many of the principal players behind CouchDB).
2012: In January 2012, Couchbase Inc. releases Couchbase Server 1.8.
2017: On October 20, 2017, MongoDB becomes a publicly traded company, listed on NASDAQ as MDB.
Working with Apache Cassandra
Apache Cassandra
Apache Cassandra is an open source, distributed, NoSQL database. It presents a partitioned wide column storage model with eventually consistent semantics.
For performance reasons, Cassandra does not support:
• Cross partition transactions
• Distributed joins
• Foreign keys or referential integrity
Cassandra Query Language
Cassandra provides the Cassandra Query Language (CQL), an SQL-like language, to create and update database schema and access data.
A keyspace in Cassandra is a namespace that defines data replication on nodes. A cluster typically contains one keyspace per application.
$ /home/cloud_user/apache-cassandra-3.11.6/bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.05 KiB  256     100.0%            122048be-be7b-4794-8356-3efd6f19a3c3  rack1
$ /home/cloud_user/apache-cassandra-3.11.6/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE demo WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};
Working with Couchbase
Couchbase Server
An open-source, distributed (shared-nothing architecture), multi-model NoSQL document-oriented database. It was the result of a merger between Membase (a database project based on memcached) and CouchDB.
N1QL
JSON formatted data is stored in documents that consist of a series of key-value (or name-value) pairs. These documents are grouped into buckets. The data can be manipulated via a query language called the non-first normal form query language, or N1QL (pronounced nickel).
# Create new bucket named 'demo-bucket'
couchbase-cli bucket-create -c 127.0.0.1:8091 --username Administrator --password Omgpassword! --bucket demo-bucket --bucket-type couchbase --bucket-ramsize 512
# Connect to Couchbase Query console (CBQ)
/opt/couchbase/bin/cbq -e http://localhost:8091 -u=Administrator
INSERT INTO `demo-bucket` ( KEY, VALUE ) VALUES ( "doc0", {"name": "David", "department": "Engineering"} ) RETURNING META().id AS docid, *;
CREATE PRIMARY INDEX `demo-index` ON `demo-bucket`;
SELECT * FROM `demo-bucket` WHERE name = "Betty";
UPDATE `demo-bucket` SET department = "Sales" WHERE name = "Betty";
DELETE FROM `demo-bucket` WHERE name = "Clay";
Working with Memcached (Not Really a Database)
Memcached uses a client–server architecture where the servers maintain a key–value associative array and the clients populate this array and query it by key. The keys can be up to 250 bytes long, but the values can be up to 1MB in size.
Memcached is not a true database, as the server storage is not persistent. The servers only store the values in RAM, and when the server runs out of RAM, it discards the oldest values. Clients must treat Memcached as a transitory cache, like short-term memory.
Other databases, such as Couchbase Server, do provide persistent storage while still maintaining Memcached protocol compatibility.
In February 2018 GitHub was the target of a DDoS attack using memcached servers. The memcached protocol over UDP has an amplification factor of more than 51000. In other words, for every byte the attacker sends out, the victim is receiving up to 51KB. Memcached version 1.5.6 disables the UDP protocol by default.
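The cache behaviour described above, values held only in memory and old entries discarded when space runs out, can be illustrated with a toy in-process cache. This is only a sketch of the eviction idea, not Memcached's implementation: the TinyCache class is hypothetical, it limits the number of items rather than bytes of RAM, and it evicts the least recently used entry.

```python
from collections import OrderedDict


class TinyCache:
    """Toy in-memory cache that evicts the least recently used entry when full."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._data = OrderedDict()  # ordered from oldest to most recently used

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_items:
            # Out of room: discard the oldest entry.
            self._data.popitem(last=False)

    def get(self, key):
        if key not in self._data:
            # Cache miss: the caller must fall back to the real database.
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]


cache = TinyCache(max_items=2)
cache.set("user:0", "David")
cache.set("user:1", "Clay")
cache.get("user:0")          # touching user:0 makes user:1 the oldest entry
cache.set("user:2", "Sue")   # over capacity, so user:1 is evicted

print(cache.get("user:1"))
print(cache.get("user:0"))
```

Because an entry can vanish at any time, a client should always be prepared for a miss and reload the value from persistent storage.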
Working with MongoDB
MongoDB
MongoDB is a document database designed for ease of development and scaling.
MongoDB began as a component of a planned platform-as-a-service product, but in 2009 10gen (the software company behind MongoDB) shifted to an open-source development model and began offering commercial support and other services.
MongoDB is a publicly traded company, listed on NASDAQ as MDB.
Documents and Collections
In MongoDB, records are stored as documents composed of field and value pairs. These can be thought of like rows in a traditional relational database.
These documents are then grouped together in collections, similar to tables.
> db.inventory.insertMany([
... { id: 0, name: "David", department: "Engineering", expensecode: 200 },
... { id: 1, name: "Clay", department: "HR", expensecode: 100 },
... { id: 2, name: "Sue", department: "Sales", expensecode: 300 },
... { id: 3, name: "Betty", department: "Marketing", expensecode: 400 },
... ])
{
  "acknowledged" : true,
  "insertedIds" : [
    ObjectId("5eda77cd9742638e05a080c5"),
    ObjectId("5eda77cd9742638e05a080c6"),
    ObjectId("5eda77cd9742638e05a080c7"),
    ObjectId("5eda77cd9742638e05a080c8")
  ]
}
> db.inventory.updateOne(
... { name: "Betty" },
... {
... $set: { department: "HR" }
... }
... )
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
> db.inventory.find( { name: "Betty" } )
{ "_id" : ObjectId("5eda77cd9742638e05a080c8"), "id" : 3, "name" : "Betty", "department" : "HR", "expensecode" : 400 }
>
Advantages of Distributed Databases
Distributed Databases
Multiple servers contain the same data. Updates are replicated between the servers.
• Lower Latency: Data is closer to the user, which results in faster response times.
• Fault Tolerance: Redundant copies of the data on each server provide backups. If one node dies, other nodes can continue.
• Load Distribution: User requests can be spread across multiple servers, allowing more requests to be served.
Disadvantages of Distributed Databases
Distributed Databases
With all the advantages, why would you ever not use a distributed database?
• Keeping it in sync: Data on separate nodes must be kept in sync. This process can increase latency.
• More hardware = More problems: More servers mean more money and time spent keeping them running.
• Security of data: With your data stored on more servers, it is more vulnerable to attack.
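The cost of keeping nodes in sync can be seen in a toy primary-replica model. The sketch below is purely illustrative: the Node and PrimaryReplicaCluster classes are hypothetical, replication is triggered by hand, and no real database replicates this simply. It shows that a replica can return stale data until the pending writes have been copied over.

```python
class Node:
    """A single server holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class PrimaryReplicaCluster:
    """Toy primary-replica cluster: only the primary accepts writes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.pending = []  # writes accepted by the primary but not yet replicated

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read(self, node, key):
        # Any node can serve reads, but a replica may lag behind the primary.
        return node.data.get(key)

    def replicate(self):
        # Copy every pending write to every replica, then clear the queue.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


primary = Node("primary")
replica = Node("replica-1")
cluster = PrimaryReplicaCluster(primary, [replica])

cluster.write("3", "Betty")
stale = cluster.read(replica, "3")   # replica has not caught up yet
cluster.replicate()
fresh = cluster.read(replica, "3")   # replica now matches the primary

print(stale, fresh)
```

A system that waits for replicas before acknowledging a write avoids the stale read, at the price of slower writes.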
What Is High Availability?
A characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.
• Eliminate single points of failure: Single points of failure are reduced by replicating data between multiple servers.
• Detection of failures: Failed nodes should be removed promptly.
• Reliable crossover: A load balancer or some type of proxy acts as a gateway to the cluster.
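These ideas of redundant nodes, failure detection, and a gateway that handles crossover can be sketched in a few lines. The Backend and Gateway classes below are hypothetical and heavily simplified. A real load balancer would probe nodes over the network and would usually spread load rather than always pick the first healthy node.

```python
class Backend:
    """A database node that may be up or down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(self.name + " is down")
        return self.name + " served " + request


class Gateway:
    """Toy proxy that routes each request to the first healthy backend."""

    def __init__(self, backends):
        self.backends = list(backends)

    def check_health(self):
        # Detection of failures: drop failed nodes from the pool promptly.
        self.backends = [b for b in self.backends if b.healthy]

    def route(self, request):
        self.check_health()
        if not self.backends:
            raise RuntimeError("no healthy backends available")
        return self.backends[0].handle(request)


node_a = Backend("node-a")
node_b = Backend("node-b")
gateway = Gateway([node_a, node_b])

first = gateway.route("SELECT 1")   # served by node-a
node_a.healthy = False              # simulate a node failure
second = gateway.route("SELECT 1")  # gateway crosses over to node-b

print(first)
print(second)
```

Clients only ever talk to the gateway, so the failure of node-a is invisible to them as long as another healthy node remains.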
Further Questions?
EMAIL ME
Reach out to me directly with any questions you may have about this course.
TROUBLE WITH LABS?
For more immediate help with hands-on labs or other site features, you can open a support request at https://support.linuxacademy.com/hc/en-us/requests/new
Summary
Embedded Databases/Flat Files
• Embedded databases and flat files are some of the first examples of electronic databases.
• They include Comma Separated Values (CSV), and the newer JSON and XML formats.
Relational Databases
• Data is organized using the relational model into tables of columns and rows.
• They typically use Structured Query Language (SQL) to manipulate data.
NoSQL Databases
• Non-relational databases are built for scalability and speed.
• They typically store data as key-value pairs.
Distributed Databases
• Data is replicated across many servers, providing redundancy and performance gains.
• Replication can add latency.