LABORATORY OF DATA SCIENCE Data Access: RelationalData …

35
LABORATORY OF DATA SCIENCE Data Access: Relational Data Bases Data Science and Business Informatics Degree

Transcript of LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Page 1: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

LABORATORY OF DATA SCIENCE

Data Access: Relational Data Bases

Data Science and Business Informatics Degree

Page 2: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

RDBMS data access

¨ Protocols and API¤ODBC, OLE DB, ADO, ADO.NET, JDBC

¨ Python DBAPI with ODBC protocol

Laboratory of Data Science

2

Page 3: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Connecting to a RDBMS

Laboratory of Data Science

3

¨ Connection protocol¤ locate the RDBMS server¤ open a connection¤ user autentication

¨ Querying¤ query SQL

n SELECTn UPDATE/INSERT/CREATE

¤ stored procedures¤ prepared query SQL

¨ Scan Result set¤ scan row by row¤ access result meta-data

Client ServerConnectionString

OK

SQL query

Result set

Page 4: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Connection Standards

¨ ODBC - Open DataBase Connectivity¤ Windows: odbc Linux: unixodbc, iodbc¤ Tabular Data

¨ JDBC – Java DataBase Connectivity

¨ OLE DB (Microsoft) – Object Linking and Embedding¤ Tabular data, XML, multi-dimensional data

¨ ADO (Microsoft) – ActiveX Data Objects¤ Object-oriented API on top of OLE DB

¨ ADO.NETn evolution of ADO in the .NET framework

4

Laboratory of Data Science

Page 5: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

ODBC Open DataBase Connectivity

Laboratory of Data Science

5

Page 6: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

ODBC Demo

¨ Registering an ODBC data source¤ lbi on access¤ lbi on SQL Server (driver SQL Server)

¨ Data access¤ copy Access table to Excel

¨ Linked tables¤ Linking SQL Server Table from Access

Laboratory of Data Science

6

Page 7: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

OLE DB Demo

¨ Creating .udl data links¨ Data access

¤ accessing Access data from Excel

¨ Linked tables¤ accessing Excel data from Access

¨ OLE DB Drivers¤ By Microsoft

Laboratory of Data Science

7

Page 8: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

RDBMS data access

¨ Python DBAPI is a standard specification for modulesthat interefaces with databases

¨ Most of Python database interfaces adhere to thisstandard

¨ Functions:¤Connecting to a database¤Submitting SQL queries¤Scanning the results of queries¤Accessing meta-data on tables

Laboratory of Data Science

8

Page 9: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Support of Different RDBMS

¨ Portable across several relational and non-relational databases:¤ Microsoft SQL Server ¤ Oracle ¤ MySQL¤ IBM DB2 ¤ PostgreSQL¤ Firebird (and Interbase) ¤ Cassandra¤ MongoDB¤ ….. Laboratory of Data Science

9

Page 10: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Different Modules for a DB

Given a database we have variuos module options. For example, MySQL has the following interfacemodules:¨ MySQL for Python (import MySQLdb)¨ PyMySQL (import pymysql)¨ pyODBC (import pyodbc)¨ MySQL Connector/Python (import mysql.connector) ¨ mypysql (import mypysql)¨ etc ...

Laboratory of Data Science

10

Page 11: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DBAPI Specification

¨ Most of database modules conform to the specification

¨ no matter which kind of database and/or moduleyou choose, the code will likely look very similar

¨ See details here: https://www.python.org/dev/peps/pep-0249/

Laboratory of Data Science

11

Page 12: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DBAPI Specification

¨ Each module interface is required to have the following functions

¨ connect(args): a constructor for Connection objects, that makes the access available. Argumentsare database-dependent

¨ conn.close() – close connection¨ conn.commit() – commit pending transaction¨ ….

Laboratory of Data Science

12

Page 13: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DBAPI Specification

¨ conn.cursor() – return a Cursor object for the connection. Cursors are used fetch operations

¨ c.execute(op,[params])–prepare and execute an operation with parameters where the second argument may be a list of parametersequences

¨ c.fetch[one|many|all]([s])– fetch nextrow, next s rows, or all remaining rows of result set

¨ c.close() – close cursor.

¨ and others. Laboratory of Data Science

13

Page 14: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Programming pattern

1. Import the DB module

2. Connect to the RDBMS

3. Submit a SQL query

4. Process query results

5. Close the connection

Laboratory of Data Science

14

Page 15: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DB Module: Pyodbc

¨ Pyodbc is an open source Python module ODBC and implementing the DBAPI 2.0 specification.

¨ Enables an easily connection of Python applications to data sources with an ODBC driver

¨ Python program along with the pyodbc module will use an ODBC driver manager and ODBC driver

¨ The ODBC driver manager is platform-specific

¨ The ODBC driver is database-specific

¨ The ODBC driver manager and driver will connect, typicallyover a network, to the database server.

Laboratory of Data Science

15

Page 16: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Connect to the RDBMS

¨ Access the database via the connection object

¨ Use connect constructor to create a connection with database conn = pyodbc.connect(parameters...)

¨ Create cursor via the connectioncur = conn.cursor()

¨ Connect function requires the “connection string”¨ The connection string depends on the driver

Laboratory of Data Science

16

Page 17: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Connection String

¨ The connection strings:DRIVER=Driver name; SERVER=hostname;

DATABASE=DBname; UID=user;

PWD=password

¨ In Python:conn = pyodbc.connect(

'DRIVER={ODBC Driver 17 for SQL Server}; SERVER=tcp:apa.di.unipi.it;

DATABASE=Foodmart;

UID=lbi;

PWD=pisa')

Laboratory of Data Science

17

Page 18: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

ODBC DRIVER

¨ Microsoft have written and distributed multiple ODBC drivers for SQL Server:¤ {SQL Server} - released with SQL Server 2000¤ {SQL Native Client} - released with SQL Server 2005

(also known as version 9.0)¤ {SQL Server Native Client 10.0} - released with SQL

Server 2008¤ …..

18

Laboratory of Data Science

Page 19: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

ODBC DRIVER

¤ {SQL Server Native Client 11.0} - released with SQL Server 2012

¤ {ODBC Driver 11 for SQL Server} - supports SQL Server 2005 through 2014

¤ {ODBC Driver 13 for SQL Server} - supports SQL Server 2005 through 2016

¤ {ODBC Driver 13.1 for SQL Server} - supports SQL Server 2008 through 2016

¤ {ODBC Driver 17 for SQL Server} - supports SQL Server 2008 through 2019

Laboratory of Data Science

19

Page 20: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Submit a SQL query

Select String:query = "SELECT name, age FROM students”

Submit the SQL query and get the resultcursor.execute(query)

UPDATE String:update = “UPDATE students SET age = age + 1“;

cursor.execute(update )

Conn.commit()

Laboratory of Data Science

20

Page 21: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Scan query results

FETCHALL:cursor.execute("SELECT TOP 10 education, gender FROM

customer")

rows = cursor.fetchall() // all rows in memory!!!for row in rows:

print (row[0], row[1]) //access by index print(row.gender, row.education) //access by name

CURSOR AS ITERATOR:cursor.execute("SELECT TOP 10 education, gender FROM

customer;"):for row in cursor:

print(row.gender, row.education)

Laboratory of Data Science

21

Page 22: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Update and Delete

¨ Updating and deleting work passing the SQL to execute

deleted = cursor.execute("delete from products where id <> 0001").rowcount

conn.commit()

deleted represents the number of affected rows

Laboratory of Data Science

22

Page 23: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Close the connection

// close the cursorcursor.close();

// close connection to the databaseconn.close();…

Laboratory of Data Science

23

Page 24: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Prepared commands with parameters

¨ Problem: read N rows from a CSV file, and insert each one into a database table¤ N SQL queries?INSERT INTO names (id, name) VALUES (1, ‘Luigi Rossi’)

INSERT INTO names (id, name) VALUES (2, ‘Mario Bianchi’)

n Inefficiency: an execution plan has to be computed for every query, yet all of them share a common structure

¤ Use ? as a placeholder for parameters

INSERT INTO names (id, name) VALUES (?, ?)

Laboratory of Data Science

24

Page 25: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Prepared commands with parameters

.....

conn = …… //connectioncursor = conn.cursor()

lines = fileIn.readlines()sql ="INSERT INTO name_table(id,name) VALUES(?,?)“

i=0for var in lines:

rows = cursor.execute(sql,(i,var))i+=1

conn.commit()

Laboratory of Data Science

25

Page 26: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Prepared commands with parameters

conn = …… //connection

cursor = conn.cursor()list = ['USA', 'Canada']

query = ‘SELECT education, country FROM customer WHERE country=?’

for el in list:

rows = cursor.execute(query,el).fetchall()print ('Start ', el)

for row in rows:print(row)

print('\n')Laboratory of Data Science

26

Page 27: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DATA TYPE MAPPING

Laboratory of Data Science

27

How Python objects passed to cursor.execute() as parameters are formatted and passed to the driver/database.

Page 28: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

DATA TYPE MAPPING

Laboratory of Data Science

28

How database results are converted to Python objects

Page 29: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Meta-data on ResultSet

Meta-data: column names and types of a resultset

for attributes in cursor.description:print("Name: %s, Type: %s " %

(attributes[0], attributes[1]))

Laboratory of Data Science

29

Page 30: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Meta-data on DB Tables

¨ tables(table=None,catalog=None,

schema=None,tableType=None)

¨ Returns an iterator for generating information about the tables in the database.

¨ Each row has the columns:¤ Table_cat: catalog name

¤ Table-schem: schema name

¤ Table_name: table name

¤ table_type: TABLE, VIEW, SYSTEM TABLE, GLOBAL TEMPORARY, LOCAL TEMPORARY, ALIAS, SYNONYM

¤ A description of the table

Laboratory of Data Science

30

Page 31: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Meta-data on DB Tables

cnxn = …

cursor = cnxn.cursor()

for table in cursor.tables():

print(table)

Or

for table in cursor.tables(table='sys%'):

print(table)

Laboratory of Data Science

31

Tables starting with «sys»

Page 32: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Columns meta-data

¤ table_cat

¤ table_schem

¤ table_name

¤ column_name

¤ data_type

¤ type_name

¤ column_size

¤ buffer_length

¤ decimal_digits

¤ num_prec_radix

¤ nullable

¤ remarks

¤ column_def

¤ sql_data_type

¤ sql_datetime_sub

¤ char_octet_length

¤ ordinal_position

¤ is_nullable: One of SQL_NULLABLE, SQL_NO_NULLS, SQL_NULLS_UNKNOWN.

32

columns(table=None,catalog=None,schema=None,column=None)

¨ Creates a result set of column information on a table

¨ Each row has the following columns:

Page 33: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Exercise: Stratified subsampling

¨ DATABASE NAME = lbi¨ Let T be a database table (e.g., census), and A a

column in T (e.g., sex)¨ Develop a Python program that exports on a CSV

file a subset of 30% of rows of T:¤ the subset is randomly chosen;¤ but it must preserve the proportion of distinct values of

column An e.g., if there are 65% of male students, the subset must

contain 65% of males and 35% of females.

Laboratory of Data Science

33

Page 34: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

Intuition on the solution!

Laboratory of Data Science

34

MalesNrows=100

SelRow=30

1° Rec=30/100

2° Rec=29/99 2° Rec=30/99

Selected Not selected

3° Rec=28/98 3° Rec=29/98

Selected Not selected

…...…...

…...

M° Rec=30/30

Not selected

Not selected

All records selected!!!

Page 35: LABORATORY OF DATA SCIENCE Data Access: RelationalData …

How to generate an element with probability x/y?

¨ Generate a number n in the range [0 …. Y]¨ The element is selected if n < x the record is

selected¨ For random selection of a number in the above

range(int)(Math.random()*Y)

Laboratory of Data Science

35