PostgreSQL Workshop: Working with Network Data PGCon 2014 Workshop James Hanson 21-May-14.
PostgreSQL Workshop: Working with Network Data
PGCon 2014 Workshop
James Hanson
21-May-14
About the presenter …
• James Hanson, [email protected]
http://www.linkedin.com/in/jamesphanson
http://jamesphanson.wikispaces.com/
• Reformed Oracle RDBMS developer / administrator
• I am interested in network data and I see PostgreSQL as a powerful, underutilized tool for storing and analyzing network data
21-May-14 PGCon14 Working with Network Data Types
About the presenter …
• James Hanson Parsons, Inc. [email protected]
What are we doing in this workshop?
• Using PostgreSQL to explore the question “What can I find out about the requester_IP addresses in my Apache log?” … plus some side trips …
– Parsing and loading an Apache access log and associating the requesting IPs with geolocated, known-malicious and other IP lists.
NOTE: All workshop data is free and publicly available.
What will we get from the workshop?
• Hands-on experience using PG to load, relate and analyze moderate-sized network data sets
• Simple and useful techniques for loading network data
• Simple and useful techniques for exploring network data
• Experience exploring some public data sets
• Get jazzed to use the techniques and code samples on your network data
• Meet other folks with complementary interests and expertise
What’s not in this workshop?
• GUI displays or application integration. We focus on the data
• PG optimization, tuning or extensions other than PostGIS
• RDBMS normalizing, surrogate keys or orthodoxy*
• Automated workflow
• Handling of exceptions, errors and boundary conditions
* We’ll revisit this later
Workshop setup and logistics
• Using an EnterpriseDB PostgreSQL 9.3 + PostGIS install with no additional extensions
• Hosted on a CentOS 6.5 x64 virtual machine with 3 GB RAM and 2 CPU cores
• Using three workshop downloads …
– The data and DDL queries are zipped and available on Google Docs
– The PowerPoint is on SlideShare
– The SQL and Python files are on GitHub
– Everything is linked off the PGCon14 site
Workshop setup errata
• PGAdmin Query Tool: show NULL values as <NULL>
• CentOS yum packages: policycoreutils-gui setroubleshoot lzop samba dos2unix system-config-lvm.noarch
• PostgreSQL non-default parameters:

Parameter             Workshop value   EnterpriseDB default
max_connections       10               100
effective_cache_size  262,144          16,384
maintenance_work_mem  524,288          16,384
max_stack_depth       2,048            100
shared_buffers        98,304           1,024
temp_buffers          131,072          1,024
work_mem              1,048,576        1,024
fsync                 off              on
synchronous_commit    off              on
encoding              UTF-8
PostgreSQL network data types

Name      Storage size    Description
cidr 7 or 19 bytes IPv4 and IPv6 networks
inet 7 or 19 bytes IPv4 and IPv6 hosts and networks
macaddr 6 bytes MAC addresses
An inetrange type is a natural extension that complements subnets
CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet);
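For intuition outside the database: Python's standard ipaddress module draws the same host-versus-network distinction as inet versus cidr. A small sketch (not part of the workshop files):

```python
import ipaddress

# inet-like: a host address that also carries its network (host bits allowed)
iface = ipaddress.ip_interface('192.168.1.5/24')
print(iface.ip)           # 192.168.1.5
print(iface.network)      # 192.168.1.0/24

# cidr-like: a pure network value
net = ipaddress.ip_network('192.168.1.0/24')
print(net.num_addresses)  # 256

# cidr-style strictness: host bits set below the mask are rejected by default
try:
    ipaddress.ip_network('192.168.1.5/24')
except ValueError as err:
    print('rejected:', err)
```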
PostgreSQL network operators
Operator Description Example
<= is less than or equal inet '192.168.1.5' <= inet '192.168.1.5'
= equals inet '192.168.1.5' = inet '192.168.1.5'
<> is not equal inet '192.168.1.5' <> inet '192.168.1.4'
<< is contained within inet '192.168.1.5' << inet '192.168.1/24'
<<= is contained within or equals inet '192.168.1/24' <<= inet '192.168.1/24'
>> contains inet '192.168.1/24' >> inet '192.168.1.5'
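The containment operators map directly onto Python's ipaddress module; a sketch (not workshop code, requires Python 3.7+) using the same example addresses:

```python
import ipaddress

ip  = ipaddress.ip_address('192.168.1.5')
net = ipaddress.ip_network('192.168.1.0/24')

# inet '192.168.1.5' << inet '192.168.1/24'   ("is contained within")
print(ip in net)                                    # True

# inet '192.168.1/24' <<= inet '192.168.1/24'  ("contained within or equals")
print(net.subnet_of(net))                           # True

# inet '192.168.1/24' >> inet '192.168.1.5'    ("contains"), same test reversed
print(ipaddress.ip_address('192.168.2.5') in net)   # False
```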
PostgreSQL ranges and arrays
• Ranges
– 6 built-in types. Easy to create custom types
– Can be inclusive, [, or exclusive, (. We’ll use inclusive lower and upper bounds, […]
– Use the “contains element”, @>, and “element is contained by”, <@, operators
• Arrays … just the tip of the iceberg
– Use for parsing raw data and alternative storage models
– Search arrays with the ANY operator
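The range and array operators above have plain-Python analogues, handy for eyeballing data before it is loaded; a sketch with invented addresses (not workshop code):

```python
import ipaddress

ips = [ipaddress.ip_address(s) for s in ('10.0.0.1', '10.0.0.7', '10.0.1.3')]

# inclusive range [10.0.0.0, 10.0.0.255] "contains element" (@>) an ip
lo = ipaddress.ip_address('10.0.0.0')
hi = ipaddress.ip_address('10.0.0.255')
print([str(ip) for ip in ips if lo <= ip <= hi])    # ['10.0.0.1', '10.0.0.7']

# ip = ANY(array) is plain membership
print(ipaddress.ip_address('10.0.0.7') in ips)      # True
```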
Q: Why are we doing this with PostgreSQL?
I was told that “the cloud” was the only place to work with network data!
A1: Because RDBMSs have some advantages
• Cloud technologies have a difficult time with semantically similar but syntactically different data, such as:
– Subnets: 192.168.100.24/25 = 192.168.100.24 255.255.255.128
– Date / time: 21-May-14 = 05/21/14 = 1400638460
• RDBMSs are (still) much faster with GUI responses on small / moderate data sets
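The normalization that makes this workable is visible even outside the database; a Python stdlib sketch of both examples from this slide (not workshop code):

```python
import ipaddress
from datetime import datetime, timezone

# The same /25 subnet written three ways parses to one canonical network
a = ipaddress.ip_network('192.168.100.24/25', strict=False)               # CIDR, host bits set
b = ipaddress.ip_network('192.168.100.24/255.255.255.128', strict=False)  # dotted netmask
c = ipaddress.ip_network('192.168.100.0/25')                              # canonical form
print(a == b == c)   # True

# 21-May-14 = 05/21/14 = 1400638460 (a Unix epoch timestamp)
print(datetime.fromtimestamp(1400638460, tz=timezone.utc).date())  # 2014-05-21
```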
A2: Because PostgreSQL has some useful and unique capabilities
• PostgreSQL network data types and operators enable:
– Much faster Layer 3-type operations … because you can think like a router
– Data validation (NOTE: custom solutions are nearly impossible for IPv6)
• PG arrays can simplify SQL syntax and ERDs
• PG regexp_split_to_array simplifies ETL … but COPY still lacks important functionality
Our workshop approach
• Our approach: What useful and tough problems can I solve with PostgreSQL’s unique capabilities and low cost?
• Not our approach: Given that you are using a relational database management system, here are the reasons why PostgreSQL is the best brand to choose.
Workshop logistics
• Preamble / instruction
• Brief overview of our datasets
• Transform and load the data together
-- Intermezzo --
• Query the data together
• Questions, follow-up & next steps
NOTE: Code snippets are at the end of the PPT. All files are linked off of the PGCon14 site.
Our buffet of data …
All free and publicly available …
• OurAirports Airport GeoLoc: http://ourairports.com
• CSC InfoChimps US IPGeo: http://www.infochimps.com/datasets
• Dragon Research Group Malicious IPs: http://www.dragonresearchgroup.org/
• National Vulnerabilities Database IAVAs: http://nvd.nist.gov/home.cfm
• GeoLite IPv6 Geo: http://geolite.maxmind.com/
… Our buffet of data
• MyIP Malicious IPs: http://myip.ms
• Cowgar Apache log: http://notebook.cowgar.com
• The Internet Traffic Archive tracert: ftp://ita.ee.lbl.gov/
• Cooperative Association for Internet Data Analysis (CAIDA) DNS: http://www.caida.org
• NIL ARP table: http://wiki.nil.com/
Dive into Airport data …
../020_Airports/
• 4 csv files: airports, runways, countries and regions
AirportDDL.sql
INTERMEZZO FOR 10 MINUTES
Let’s query our data …
../Queries/
• Where else have the Apache_log requesting_ip addresses appeared?
USIPGeo_Queries.sql
REVISIT NORMALIZATION, QUESTIONS AND WRAP UP
*Normalizing and surrogate keys
NOTE: This is a revisited topic from the workshop preamble.
• Normalization is very useful for handling ACID transactions and transaction-entered data (vs. bulk-loaded).
• Surrogate keys are loved by Hibernate, may help in some situations and are part of many ORM methodologies.
But that does not generally apply to immutable network data intended for analysis … like we have.
*Normalizing and surrogate keys (2)
• And normalization has a steep cost in …
– Increased complexity to load and query data
– Hard-coded assumptions in a complex ERD
– Reduced performance**
– Intellectual distance from the data
• The reasons to normalize and use surrogate keys don’t apply here but the costs do, so I don’t do it. (Feel free to explore this topic off-line with me.)
** This is not true in all cases. I refer to index-based joins and searches.
Q: Why did you use arrays and custom data types?
Doesn’t that violate RDBMS principles plus lock you in to custom PostgreSQL syntax?
A: Yes. Here’s why I suggest you consider that anyway:
Drs. Boyce and Codd did not include arrays in their RDBMS theoretical work and arrays are not in the ANSI SQL standard … but that work is focused on ACID transactions, which do not apply here.
PostgreSQL network types are unique … so you are already committed to custom code. Why not take full advantage of PG’s capabilities?
Wrap up and follow-on …
• IP and IP/subnet data types enable you to think of network data as a Layer 3 device does
– This will always be faster and simpler than treating IPs as strings or numbers
• Ex. “Which of these IPs is contained in that large set of IP subnets and ranges?”
– 1-line WHERE clause in PostgreSQL
– … still working on it in other environments
Wrap up and follow-on …
• An inetrange type is a natural extension that complements subnets
CREATE TYPE inetrange AS RANGE (SUBTYPE = inet);
• If you are using PG network data types with read-only data for analysis … consider using custom data types and arrays
– You have already committed to custom code, so you may as well reap the (substantial) benefits.
XML and XPath notes
• The idealized ETL process for XML (and JSON) is ETL … don’t Extract or Transform, just Load
• PostgreSQL has good, but not great, XML support
• Whatever you do, don’t write custom code to shred XML into an RDBMS and then rebuild XML for delivery to applications. Please.
• Very good XPath resources:
http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths
http://stackoverflow.com/questions/17799790/using-xpath-to-extract-data-from-an-xml-column-in-postgres
PostgreSQL: Working with Network Data
• James Hanson, [email protected]
http://www.linkedin.com/in/jamesphanson
http://jamesphanson.wikispaces.com/
• Please provide feedback. This presentation is new and I’d like to make it better.
(… and please let me know if you see / have any cool data sets too!)
QUESTIONS AND COMMENTS …
James Hanson / [email protected]
Code snippets and techniques from AirportsDDL.sql
-- Countries
CREATE TABLE countries (
    id             BIGINT PRIMARY KEY,
    code           CHAR(10) NOT NULL UNIQUE,
    name           VARCHAR(100),
    continent      CHAR(4),
    wikipedia_link VARCHAR(500),
    keywords       VARCHAR(300)
);

COMMENT ON TABLE countries IS 'Data from http://ourairports.com/data/';

COPY countries
FROM '/Pgcon14/Scripts/020_Airports/countries.csv'
WITH DELIMITER ',' CSV HEADER;
Code snippets and techniques from AirportsDDL.sql
-- Update the GEOMETRY column from the long / lat
UPDATE airports
SET location = ST_GeomFromText('POINT(' || longitude_deg || ' ' || latitude_deg || ')', 4326);
Code snippets and techniques from USIPGeoDDL.sql
-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE us_ipgeo_temp (
    id   SERIAL,
    col1 TEXT
);

-- Load temporary table
COPY us_ipgeo_temp (col1)
FROM '/Pgcon14/Scripts/030_USIPGeo/infochimps_dataset_13358_download_16680/ip_blocks_us_geo.tsv'
WITH DELIMITER '|' -- use a character that is not in the dataset
CSV;
Code snippets and techniques from USIPGeoDDL.sql
-- Use a CTE (Common Table Expression) and regexp_split_to_array
-- to look at the data in our temporary table.
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    -- LIMIT 10
)
SELECT col1[1]  AS col1_1,
       col1[2]  AS col1_2,
       col1[3]  AS col1_3,
       col1[4]  AS col1_4,
       col1[5]  AS col1_5,
       col1[6]  AS col1_6,
       col1[7]  AS col1_7,
       col1[8]  AS col1_8,
       col1[9]  AS col1_9,
       col1[10] AS col1_10,
       col1[11] AS col1_11,
       col1[12] AS col1_12,
       col1[13] AS col1_13,
       col1[14] AS col1_14
FROM us_ipgeo_array;
Code snippets and techniques from USIPGeoDDL.sql

-- Manipulate and CAST the fields as needed
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    LIMIT 10000 -- check more records too
)
SELECT -- col1[1] AS col1_1,
       -- col1[2] AS col1_2,
       -- col1[3] AS col1_3,
       -- col1[4] AS col1_4,
       col1[5]::inet AS ip,
       REPLACE(col1[6], '"', '')::CHAR(2)     AS country,
       REPLACE(col1[7], '"', '')::CHAR(2)     AS state,
       REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
       REPLACE(col1[9], '"', '')::CHAR(5)     AS zip_code,
       col1[10]::real AS latitude,
       col1[11]::real AS longitude,
       ST_GeomFromText('POINT(' || col1[11]::real || ' ' || col1[10]::real || ')', 4326) AS location,
       NULLIF(col1[12], '')::int AS metro_code, -- a few values were the empty string: IS NOT NULL, but does not CAST to INT
       col1[13]::text AS area_code
       -- col1[14] AS col1_14
FROM us_ipgeo_array
ORDER BY metro_code NULLS LAST;
Code snippets and techniques from USIPGeoDDL.sql

-- Create and populate us_ipgeo from us_ipgeo_temp
-- Try it once with a LIMIT statement
SELECT *
INTO us_ipgeo
FROM (
    WITH us_ipgeo_array AS (
        SELECT regexp_split_to_array(col1, '\t') AS col1
        FROM us_ipgeo_temp
        -- LIMIT 10000 -- check more records too
    )
    SELECT -- col1[1] AS col1_1,
           -- col1[2] AS col1_2,
           -- col1[3] AS col1_3,
           -- col1[4] AS col1_4,
           col1[5]::inet AS ip,
           REPLACE(col1[6], '"', '')::CHAR(2)     AS country,
           REPLACE(col1[7], '"', '')::CHAR(2)     AS state,
           REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
           REPLACE(col1[9], '"', '')::CHAR(5)     AS zip_code,
           col1[10]::real AS latitude,
           col1[11]::real AS longitude,
           ST_GeomFromText('POINT(' || col1[11]::real || ' ' || col1[10]::real || ')', 4326) AS location,
           NULLIF(col1[12], '')::int AS metro_code,
           col1[13]::text AS area_code
           -- col1[14] AS col1_14
    FROM us_ipgeo_array
) AS tmp_table;
Code snippets and techniques from ApacheAccessLogs_DDL.sql
-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE apache_log_temp (
    id   SERIAL,
    col1 TEXT
);

-- Set the client_encoding for the same reason as before
SET client_encoding TO 'latin1';

-- Load temporary table
COPY apache_log_temp (col1)
FROM '/Pgcon14/Scripts/080_ApacheLogs/access.log'
WITH DELIMITER '^' -- use a character that is not in the dataset
-- CSV
;
Code snippets and techniques from ApacheAccessLogs_DDL.sql
-- Try it once with a LIMIT statement
SELECT *
INTO apache_log
FROM (
    WITH apache_log_array AS (
        SELECT regexp_split_to_array(col1, ' ') AS col1
        FROM apache_log_temp
        -- LIMIT 1000
    )
    SELECT col1[1]::inet AS requesting_ip,
           to_timestamp(RIGHT(col1[4], -1) || LEFT(col1[5], -6), 'DD/Mon/YYYY:HH24:MI:SS')
               AT TIME ZONE INTERVAL '-04:00' AS time,
           RIGHT(col1[6], -1)::VARCHAR(10) AS action,
           col1[7]::VARCHAR(100) AS resource,
           LEFT(col1[8], -1)::VARCHAR(10) AS protocol,
           NULLIF(col1[9], '-')::int AS status_to_client,
           NULLIF(col1[10], '-')::int AS object_size_bytes,
           REPLACE(NULLIF(col1[11], '"-"'), '"', '')::VARCHAR(100) AS referrer,
           REPLACE(COALESCE(col1[12], '') || COALESCE(col1[13], '') || COALESCE(col1[14], '') ||
                   COALESCE(col1[15], '') || COALESCE(col1[16], '') || COALESCE(col1[17], '') ||
                   COALESCE(col1[18], '') || COALESCE(col1[19], '') || COALESCE(col1[20], ''),
                   '"', '')::VARCHAR(200) AS user_agent
    FROM apache_log_array
) AS tmp_table;
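The same split-on-spaces parse can be sketched in Python (the workshop's other scripting language). The sample line below is illustrative, not taken from the workshop dataset; note Python arrays are 0-based where SQL's are 1-based:

```python
from datetime import datetime

# One line in Apache combined log format (made-up sample)
line = ('192.168.1.5 - - [21/May/2014:10:15:30 -0400] '
        '"GET /index.html HTTP/1.1" 200 1043 "-" "Mozilla/5.0"')

fields = line.split(' ')     # mirrors regexp_split_to_array(col1, ' ')
requesting_ip = fields[0]    # SQL col1[1]
ts = datetime.strptime(fields[3][1:], '%d/%b/%Y:%H:%M:%S')  # strip leading '['
action = fields[5][1:]       # strip leading '"'
status = None if fields[8] == '-' else int(fields[8])       # mirrors NULLIF

print(requesting_ip, ts.isoformat(), action, status)
```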
Code snippets and techniques from tracert_DDL.sql
-- Let's play a bit with an array structure for the tracert table
-- DROP TYPE tracert_trip;
CREATE TYPE tracert_trip AS (
    hop     INTEGER,
    from_ip INET,
    to_ip   INET
);

SELECT source_file,
       dest_name,
       dest_ip,
       array_agg((hop, from_ip, to_ip)::tracert_trip ORDER BY hop) AS tracert_trip
FROM tracert
GROUP BY source_file, dest_name, dest_ip
LIMIT 10000;
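The array_agg grouping can be mimicked outside the database; a Python sketch where the hop rows are invented for illustration (only the column roles mirror the slide):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical hop rows: (dest_ip, hop, from_ip, to_ip)
rows = [
    ('10.0.0.9', 2, '10.0.0.1', '10.0.0.5'),
    ('10.0.0.9', 1, '10.0.0.2', '10.0.0.1'),
    ('10.0.0.9', 3, '10.0.0.5', '10.0.0.9'),
]

rows.sort(key=itemgetter(0, 1))          # GROUP BY dest_ip ... ORDER BY hop
trips = {dest: [(hop, f, t) for _, hop, f, t in grp]
         for dest, grp in groupby(rows, key=itemgetter(0))}

print(trips['10.0.0.9'][0])   # (1, '10.0.0.2', '10.0.0.1')
```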
Code snippets and techniques from 03_MyIP_500TopHosts_Array_DDL.sql
-- Want to use a range for the IPs
-- Create custom type
CREATE TYPE inetrange AS RANGE (SUBTYPE = inet);

-- Check out our data
-- SELECT the inclusive, [], range for our inetrange
SELECT website,
       ww_site_rank,
       hosting_country,
       hosting_company,
       owner_country,
       parent_hosting_company,
       parent_country,
       inetrange(owner_ip_from, owner_ip_to, '[]') AS owner_range, -- [] means lower and upper inclusive
       parent_ip_range
FROM tophosts
ORDER BY owner_range;
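Outside the database, the inclusive check that inetrange(..., '[]') expresses is just an ordered comparison; a sketch using documentation-range addresses (203.0.113.0/24), not workshop data:

```python
import ipaddress

# Inclusive [owner_ip_from, owner_ip_to] membership, like owner_range @> ip
lo = ipaddress.ip_address('203.0.113.0')
hi = ipaddress.ip_address('203.0.113.255')

def in_range(ip):
    """True when ip falls inside the inclusive [lo, hi] range."""
    return lo <= ipaddress.ip_address(ip) <= hi

print(in_range('203.0.113.17'))   # True
print(in_range('203.0.114.1'))    # False
```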