Innovations in Database Technology
IRMAC BI/DW SIG May 28, 2009
2
Agenda
About Infobright
Data Warehousing Challenge
Use Cases
Infobright Approach
Infobright Architecture
Infobright Versions & System Requirements
3
About Infobright
4
Founded 2006
Headquarters Toronto, Canada; offices in Boston, MA and Warsaw, Poland
The Infobright Data Warehouse
Simplicity: No new schemas, no indices, no data partitioning, easy to maintain
Scalability: Designed for rapidly growing volumes; ideal for up to 30 TB
Low TCO: Industry-leading compression, less storage, industry-standard servers, low software costs, minimal ongoing operational expenses
The Open Source Solution: Community (open source) and Enterprise Editions are available
MySQL Integration
Leverages MySQL connectivity to ETL and BI
Provides MySQL customers with a scalable, enterprise-ready data warehouse
MySQL/Sun Microsystems invested in Infobright, Sept 15, 2008
About Infobright
5
Data Warehousing Challenge
6
Data Warehousing Challenges
Traditional Data Warehousing
Labor intensive, heavy indexing and partitioning
Hardware intensive: massive storage; big servers
Expensive and complex
More Data, More Data Sources
More Kinds of Output Needed by More Users, More Quickly
Limited Resources and Budget
[Figure: streams of binary data flowing in from real-time data, multiple databases, and external sources]
7
Data Warehousing – Raising The Bar
Early Data Warehouse Characteristics: Integration of internal systems; monthly and weekly loads; heavy use of aggregates
Data Warehousing Matures: Near real-time updates; integration with master data management; data mining using discrete business transactions; provision of data for business-critical applications
New Demands: Larger transaction volumes driven by the internet; impact of cloud computing; more -> faster -> cheaper
8
Use Cases
9
Use Cases
Infobright is a good fit for:
• Loading millions of transactions within a limited batch window
• Summarizing transactional data for trend analysis
• Extracting transactional detail based on specific constraints
• Ad hoc query support across many dimensional attributes

Avoid using Infobright for:
• Real-time transactional updates (operational data entry)
• Full data extracts (select * from …)
• Row-based operations that need to access all columns of a table; these are typically better suited to row-based databases
10
Customer Experience – Load Speed

Business Requirement
• Mavenir – OEM customer deploying a worldwide telco application
• Application provides operators with access to detailed SMS traffic
• Needed a low-cost solution with the ability to load 20K records per second
• Peak of 70M messages per hour during Chinese New Year

Solution
• Custom front end developed using the MySQL JDBC driver
• Completed design, test, and deployment in < 3 months with no assistance from Infobright
• Allowed for expansion from 7 to 90 days of online SMS history
• Supports plan for 70% annual growth
• Rollout to allow for 120 concurrent users
11
Customer Experience – Query Performance

Business Requirement
• Sulake – online social networking service with 126M users across 31 countries
• 990M page impressions per month
• Need to quickly analyze online spend on a daily basis to enhance the online experience and drive additional revenue
• Existing InnoDB solution was unable to process business queries in a reasonable time frame (queries taking hours to complete)
• Business opportunities were being lost due to the inability to analyze subscriber behavior using transactions

Solution
• Customer used the existing data model and deployed the application using Business Objects – Data Integrator for ETL, Web Intelligence for BI
• Existing ETL workflows were converted to Infobright in less than 4 weeks without assistance
• Historically long-running queries (hours) now run in minutes and seconds
• Additional benefits due to compression were a reduced need for disk storage and an overall reduction in I/O and network traffic
12
Customer Experience – TCO

Business Requirement
• A global provider of electronic trading solutions across 22 time zones and 700 financial exchanges
• Wanted to expand analytical access to financial transactions to include both current (30 days) and archived transactions (4 years)
• Expansion of the existing Sybase solution was too costly

Solution
• Infobright was able to achieve performance benchmarks within the first 3 days of a proof of concept using production data
• 28,000 records per second load speed
• Join of a 100M-row table with a 30M-row table -> 400K rows, returned in 185 seconds
• Additional queries that did not complete using Sybase finished in minutes using Infobright
• Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI
• Success with modest data size (150 GB) has opened opportunities for additional, more detailed transactional analysis
13
Customer Experience – Query Performance and TCO

Business Requirement
• TradeDoubler – based in Sweden, a global digital marketing company serving 1600+ online advertisers across Europe and Asia
• TradeDoubler optimizes Web marketing campaigns by analyzing Web clicks, impressions, and purchases
• Analyzing terabytes of data about the results of its programs is central to the company's success
• Selected Infobright for rapid analytical results, seamless interoperability with their MySQL database, and low TCO

Solution
• Deployed solution using a single $12,500 Dell server with 8 CPU cores and 16 GB RAM
• Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI
• Needed to process and analyze 20 billion online transactions per month
• In POC, loaded > 3.2 billion rows at > 300,000 rows/second
• In production, achieved 30x data compression
• Extremely fast query speed: 3 queries that previously did not return now returned within a minute
14
Infobright Approach
15
Introducing Infobright
Smarter architecture: load data and go; no indices or partitions to build and maintain
Data Packs: data stored in manageably sized, highly compressed data packs, using compression algorithms tailored to data type
Knowledge Grid: statistics and metadata "describing" the super-compressed data, automatically updated as data packs are created or updated
Column orientation
Super-compact data footprint: can leverage off-the-shelf hardware
16
Column vs. Row-Oriented
EMP_ID  FNAME  LNAME   SALARY
1       Moe    Howard   10000
2       Curly  Joe      12000
3       Larry  Fine      9000

Row Oriented: (1,Moe,Howard,10000; 2,Curly,Joe,12000; 3,Larry,Fine,9000)
Works well if all the columns are needed for every query
Efficient for transactional processing if all the data for the row is available

Column Oriented: (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000)
Works well with aggregate results (sum, count, avg)
Only columns that are relevant need to be touched
Consistent performance with any database design
Allows for very efficient compression
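The row/column trade-off above can be sketched with plain Python lists. This is only an illustration of the layouts, not Infobright's storage format:

```python
# The employee table from the slide, first as rows, then pivoted into columns.
rows = [(1, "Moe", "Howard", 10000),
        (2, "Curly", "Joe", 12000),
        (3, "Larry", "Fine", 9000)]

# Row-oriented: all values of one record are stored together.
row_store = rows

# Column-oriented: all values of one column are stored together.
col_store = {"emp_id": [r[0] for r in rows],
             "fname":  [r[1] for r in rows],
             "lname":  [r[2] for r in rows],
             "salary": [r[3] for r in rows]}

# An aggregate touches a single column in the column store...
total_salary = sum(col_store["salary"])
# ...but must walk every full record in the row store.
total_salary_rows = sum(r[3] for r in row_store)
print(total_salary)  # 31000
```

Because each column holds values of a single type (all salaries, all first names), a column store can also compress each column with an algorithm suited to that type.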
17
Data Packs and Compression
Data Packs
Each data pack contains 65,536 data values
Compression is applied to each individual data pack
The compression algorithm varies depending on data type and data distribution

Compression (patent-pending compression algorithms)
Results vary depending on the distribution of data among data packs
A typical overall compression ratio seen in the field is 10:1
Some customers have seen results as high as 40:1
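A minimal sketch of the per-pack idea: split a column into 65,536-value packs and pick a codec per pack. The 16-distinct-values heuristic and the RLE/zlib choice are assumptions for illustration, not Infobright's actual algorithms:

```python
import zlib
from itertools import groupby

PACK_SIZE = 65536  # each data pack holds 65,536 values


def run_length_encode(values):
    # Collapse runs of equal values into (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]


def compress_pack(pack):
    # Hypothetical heuristic: RLE for low-cardinality packs, zlib otherwise.
    if len(set(pack)) <= 16:
        return ("rle", run_length_encode(pack))
    return ("zlib", zlib.compress(",".join(map(str, pack)).encode()))


# A province column with long runs: exactly one full 65,536-value pack.
column = ["ON"] * 50000 + ["QC"] * 15536
packs = [column[i:i + PACK_SIZE] for i in range(0, len(column), PACK_SIZE)]
encoded = [compress_pack(p) for p in packs]
# The low-cardinality pack collapses to just two (value, count) runs.
```

Because each pack is compressed independently, a skewed column can mix codecs pack by pack, which is how distribution-dependent ratios like 10:1 to 40:1 arise.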
18
Knowledge Grid
This metadata layer = 1% of the compressed volume

Data Pack Nodes (DPNs): A separate DPN is created for every data pack in the database to store basic statistical information
Character Maps (CMAPs): Every data pack that contains text gets a matrix recording the occurrence of every possible ASCII character
Histograms: A histogram is created for every data pack that contains numeric data, dividing the pack's range into 1024 MIN-MAX intervals
Pack-to-Pack Nodes (PPNs): PPNs track relationships between data packs when tables are joined; query performance gets better as the database is used
19
A Simple Query using the Knowledge Grid
SELECT count(*) FROM employees
WHERE salary > 50000 AND age < 65 AND job = 'Shipping' AND city = 'Toronto';

Data packs per column (salary, age, job, city) cover rows 1 to 65,536; 65,537 to 131,072; 131,073 to …

1. Find the data packs with salary > 50000
2. Find the data packs that contain age < 65
3. Find the data packs that have job = 'Shipping'
4. Find the data packs that have city = 'Toronto'
5. Eliminate all rows that have been flagged as irrelevant
6. Only the one remaining data pack needs to be decompressed

Each pack is classified as: completely irrelevant, suspect, or all values match
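The three-way classification driving the steps above can be derived from DPN min/max statistics alone. The predicate and the sample statistics below are illustrative, not from the slide:

```python
def classify(dpn, threshold):
    # Classify one pack for the predicate "value > threshold" using only its DPN.
    if dpn["max"] <= threshold:
        return "irrelevant"   # no value can match: skip, never decompress
    if dpn["min"] > threshold:
        return "relevant"     # all values match: answered from metadata alone
    return "suspect"          # some values may match: must decompress this pack


# Hypothetical salary-pack statistics for salary > 50000.
dpns = [{"min": 10000, "max": 45000},
        {"min": 60000, "max": 90000},
        {"min": 30000, "max": 70000}]
labels = [classify(d, 50000) for d in dpns]
# ['irrelevant', 'relevant', 'suspect'] -- only the suspect pack is decompressed
```

For count(*), the "relevant" pack even contributes its full row count straight from the DPN; decompression is needed only for "suspect" packs, which is why the slide's query touches a single pack.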
20
A Join Query using the Knowledge Grid
SELECT MIN(sale), MAX(discount), name
FROM carsales, salesperson
WHERE carsales.id = salesperson.id
  AND carsales.prov = 'ON'
  AND carsales.date = '2008-02-29'
GROUP BY name;

Car Sales (id, sale, discount, prov, date); Sales Person (id, name)

1. Eliminate the Car Sales data packs that are irrelevant based on the constraints in the SQL
2. Determine the related Sales Person data packs based on the values of carsales.id found in the relevant Car Sales data packs
3. Create a Pack-to-Pack Node that stores the results of the join condition between Car Sales and Sales Person
4. Any subsequent queries will be able to use the PPN to resolve joins between Car Sales and Sales Person

Pack-to-Pack Node (carsales.id vs salesperson.id): a matrix of 0s and 1s, where 1 indicates that the data packs are related
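The PPN matrix can be sketched from per-pack min/max ranges of the join keys: two packs can only contain matching ids if their key ranges overlap. The pack ranges below are made-up examples:

```python
def ranges_overlap(a, b):
    # Two packs can hold matching join keys only if their id ranges intersect.
    return a["min"] <= b["max"] and b["min"] <= a["max"]


# Hypothetical id ranges per pack for each table.
carsales_id_packs = [{"min": 1, "max": 100}, {"min": 101, "max": 200}]
salesperson_id_packs = [{"min": 1, "max": 150}, {"min": 400, "max": 500}]

# PPN: 1 where a pair of packs may contain matching join keys, else 0.
ppn = [[1 if ranges_overlap(a, b) else 0 for b in salesperson_id_packs]
       for a in carsales_id_packs]
# [[1, 0], [1, 0]] -- the second salesperson pack never joins and is skipped
```

Once built, the matrix is cached as part of the Knowledge Grid, which is why the slide notes that later joins between the same tables get faster.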
21
Infobright Architecture
22
MySQL/Infobright Architecture

[Architecture diagram]
• Connectors: Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB
• Connection Pool: authentication, thread reuse, connection limits, memory checks, caches
• SQL Interface, Parser, MySQL Loader, Caches & Buffers
• Management Services & Utilities
• MyISAM: views, users, permissions, table definitions
• Infobright Optimizer and Executor
• Infobright Loader / Unloader
• Knowledge Grid: Knowledge Nodes and Data Pack Nodes
• Data Packs with Compressor / Decompressor
Infobright – Embedded With MySQL

Infobright Components:
• IB Storage Engine, consisting of 64K-value Data Packs, the Compressor, and the Knowledge Grid (Knowledge Nodes and Data Pack Nodes)
• IB Optimizer, which uses rough set algorithms and the Knowledge Grid to navigate the database
• IB Loader, which supports text-based and binary data formats

Infobright ships with the full MySQL binaries. The MySQL architecture is used to support database components such as connectors, security, and memory management.
23
Optimized SQL for Infobright
The Infobright Optimizer supports much of the MySQL syntax and function set. When the optimizer encounters SQL syntax that is not supported, the query is executed using the MySQL optimizer instead.
Infobright-Optimized SQL:
• SELECT statements
• Comparison operators
• Logical operators
• String comparison functions (LIKE, …)
• Aggregate functions
• Arithmetic operators
• Data Manipulation Language (INSERT/UPDATE/DELETE)

Handled by MySQL:
• Data Definition Language (CREATE & DROP)
• String functions
• Date/time functions
• Numeric functions
• Trigonometric functions
• CASE statements
24
Infobright Data Types (Numeric, Date, String)

Most of the data types expected for a MySQL database engine are fully supported. The data types that are currently not implemented within Infobright include BLOB, ENUM, SET, and auto-increment.
25
Increased efficiency with popular platforms
ETL Integration
• Deeper ETL integration: Jaspersoft, Talend, Pentaho
• Leverages end-to-end data management provided by ETL tools
• Improved support for Data Manipulation Language (DML)
• Leverage existing IT tools and resources for fast, simple deployments and low TCO
26
Data Loading with & without custom ETL connectors
Loading Infobright tables with custom connectors:
• Kettle from Pentaho
• Talend ETL from Talend
• Jaspersoft ETL (Talend) from Jaspersoft

Two ways to invoke the Infobright loader without connectors:
1. Generate a CSV or binary file and invoke the Infobright loader to load the file
2. Named-pipe technique:
   • Create a named pipe (e.g. mkfifo /home/mysql/s_mysession1.pipe)
   • Launch the Infobright loader in the background to read from the pipe
   • Launch the ETL process that writes data to the named pipe
   • As the ETL process writes records to the named pipe, the loader reads them and writes them to an Infobright database table
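The named-pipe flow can be demonstrated in a few lines of Python on a POSIX system. Here a thread stands in for the ETL process and the main program stands in for the loader; the pipe path and CSV records are throwaway examples:

```python
import os
import tempfile
import threading

# A throwaway named pipe standing in for /home/mysql/s_mysession1.pipe.
pipe = os.path.join(tempfile.mkdtemp(), "session1.pipe")
os.mkfifo(pipe)  # step 1: create the named pipe (POSIX only)


def etl_writer():
    # Step 3: the "ETL process" writes CSV records into the pipe.
    with open(pipe, "w") as f:
        for i in range(3):
            f.write(f"{i},value{i}\n")


t = threading.Thread(target=etl_writer)
t.start()

# Step 2: the "loader" reads records from the pipe as they arrive.
with open(pipe) as f:
    rows = [line.strip().split(",") for line in f]
t.join()
os.remove(pipe)
```

The advantage over a flat file is that no intermediate CSV ever lands on disk: the loader consumes records while the ETL job is still producing them.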
27
Infobright Versions & System Requirements
28
Comparison of ICE and IEE

Feature | ICE | IEE
Technical Support | Forums and/or one-time 4-hr support pack | Available
Warranty and Indemnification | No | Included
INSERT/UPDATE/DELETE | No | Supported
Infobright Loader | Up to 50 GB/hr | Multi-threaded, up to 300 GB/hr
Data Load Types | Text only | Text & binary (100% faster)
MySQL Loader | No | Supported
Platform Support | 64-bit Intel and AMD (RHEL 5, CentOS 5, Debian); 32-bit Intel and AMD (Windows XP, Ubuntu 8.04, Fedora 9) | 64-bit Intel and AMD (Windows Server 2003, Windows Server 2008, RHEL 5, CentOS 5, Debian, Solaris 10)
29
System Requirements
30
For More Information
Thank you
Bob Newell, Data Warehouse Evangelist
Or join our open source community at www.infobright.org
31
Query performance: Infobright vs. traditional DB (times in minutes, seconds, milliseconds)

# | Query name | Interval | Infobright no cache | Infobright cache | Traditional DB no cache | Traditional DB cache
1 | Affiliate/minor/sum(order)/year | 20060101-20061231 | 7,72 | 0,99 | 13,00,91 | 4,03,21
2 | Affiliate/major/sum(order)/year | 20060101-20061231 | 31,52 | 7,81 | N/A | N/A
1 | Affiliate/minor/sum(order)/month | 20060101-20060131 | 1,32 | 0,43 | 1,00,43 | 10,69
2 | Affiliate/major/sum(order)/month | 20060101-20060131 | 3,23 | 0,65 | 2,12,34 | 18,55
3 | Events/Cat=2/Country/sum(no of)/year | 20060101-20061231 | 37,16 | 24,42 | N/A | N/A
4 | Events/Cat=*/Country/sum(no of)/year | 20060101-20061231 | 41,67 | 29,62 | N/A | N/A
3 | Events/Cat=2/Country/sum(no of)/month | 20060101-20060131 | 15,16 | 7,15 | 8,08,13 | 2,10,15
4 | Events/Cat=*/Country/sum(no of)/month | 20060101-20060131 | 22,12 | 8,01 | 15,08,32 | 3,12,82