3 Objects (Views Synonyms Sequences) 4 PL/SQL blocks 5 Procedures Triggers 6 Enhanced SQL...

Chapter Structure

1 Course Introduction
2 Oracle Introduction
3 Objects (Views, Synonyms, Sequences)
4 PL/SQL blocks
5 Procedures, Triggers
6 Enhanced SQL programming
7 SQL & .NET applications
8 OEM, DB structure
9 DB security
10 Backup, Recovery
11 Large object
12 Transaction Management
14 Data Mining
15 Data Warehousing

Chapter groups: Advanced SQL, DB Admin., Advanced DB Concepts, New Trends

Data Mining – Business Intelligence

- Data explosion problem
  - We are drowning in data, but starving for knowledge!
- Finding interesting structure in data (data-driven decision making practices; BBC Horizon - Age of Big Data)
  - Structure refers to statistical patterns, predictive models, and hidden relationships

Purpose of Data Mining

- To provide knowledge that will give a company a competitive advantage, enabling it to earn a greater profit
- Goals of data mining
  - Predict the future behavior of attributes
  - Classify items, placing them in the proper categories
  - Identify the existence of an activity or an event
  - Optimize the use of the organization’s resources

Applications of Data Mining

- Retailing
  - Customer relations management (CRM)
  - Advertising campaign management
- Banking and Finance
  - Credit scoring
  - Fraud detection and prevention
- Manufacturing
  - Optimizing use of resources
  - Manufacturing process optimization
  - Product design
- Medicine
  - Determining effectiveness of treatments
  - Analyzing effects of drugs
  - Finding relationships between patient care and outcomes
- Higher Education (Academic analytics)
  - Which students will enroll in particular course programs
  - Which students will need assistance in order to graduate

Commercial Support and Job Market

- Many data mining tools: http://www.kdnuggets.com/
- Database systems with data mining support
  - Oracle 10g, 11g
  - SQL Server 2005, 2008
- Hot topic
  - http://groups.yahoo.com/group/datamining2/ (2677 members as of April 14, 2009)
- BI market
  - Worldwide BI software revenue is forecast to reach almost US$12.5 billion in 2012, up 7.2 percent over last year.
  - The global BI software and services market is forecast to expand rapidly, from $79 billion in 2012 to $143 billion in 2016.

Company         2009 Sales (US$ millions)   Market Share (%)
SAP             2,084.1                     22.4
Oracle          1,351.1                     14.5
SAS Institute   1,324.6                     14.2
IBM             1,135.6                     12.2
Microsoft         739.1                      7.9

Data Mining and Business Intelligence

Layers, in order of increasing potential to support business decisions:

- Data Sources: Paper, Files, Database systems, OLTP, WWW
- Data Warehouses/Data Marts: OLAP, MDA
- Data Exploration: Statistical Analysis, Reporting
- Data Mining: Information Discovery
- Data Presentation: Visualization
- Making Decisions

Roles shown alongside the layers: End User, DBA, Business Analyst, Data Analyst

Data Mining Methods (6 basic classes)

- Associations
  - Finding rules like “if the customer buys frozen pizza, sausage, and beer, then the probability that he/she buys potato chips is 50%”
- Classifications
  - Classify data based on the values of the decision attribute, e.g. classify patients based on their “state”
- Clustering
  - Group data to form new classes, e.g. cluster customers based on their behavior to find common patterns

Data Mining Methods

- Sequential patterns
  - Finding rules like “if the customer buys a TV and then, a few days later, a camera, then the probability that he/she will buy a video within 1 month is 50%”
- Time-series similarities
  - Finding similar sequences (or subsequences) in time series (e.g. stock analysis)
- Deviation detection
  - Finding anomalies/exceptions/deviations in data
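
Deviation detection can be illustrated with a simple z-score rule. This is only a sketch of one common approach, not a method named in the slides: the function name `find_deviations`, the threshold of 2 standard deviations, and the sample values are all assumptions made for illustration.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean (hypothetical rule, chosen for illustration)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up daily order counts with one obvious exception
daily_orders = [10, 11, 9, 10, 12, 10, 50]
outliers = find_deviations(daily_orders)
print(outliers)
```

A rule like this is sensitive to the threshold and to the outliers themselves, which inflate the standard deviation, so real tools often use more robust measures.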

Association and Classification Rules

- Association rules have the form {x} → {y}, where x and y are events that occur at the same time. They have measures of support and confidence.
  - Support is the percentage of transactions that contain all items included in both the left and right sides
  - Confidence is how often the rule proves to be true: of the transactions in which the left-hand side of the implication is present, the percentage in which the right-hand side is present as well
- Classification rules place instances into the correct one of several possible categories
  - Developed using a training set: past instances for which the correct classification is known
  - The system develops a method for correctly classifying a new item whose class is currently unknown
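
The support and confidence definitions above can be computed directly from a list of transactions. The following is a minimal sketch; the transaction data and the function names `support` and `confidence` are made up for illustration, and each transaction is assumed to be a set of item names.

```python
# Hypothetical market-basket data: each transaction is a set of items
transactions = [
    {"pizza", "sausage", "beer", "chips"},
    {"pizza", "sausage", "beer"},
    {"pizza", "beer", "chips"},
    {"bread", "milk"},
]

def support(lhs, rhs, transactions):
    """Fraction of all transactions containing every item of lhs and rhs."""
    both = lhs | rhs
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction also containing rhs."""
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0  # the rule never applies
    return sum(rhs <= t for t in with_lhs) / len(with_lhs)

lhs, rhs = {"pizza", "sausage", "beer"}, {"chips"}
rule_support = support(lhs, rhs, transactions)        # 1 of 4 transactions
rule_confidence = confidence(lhs, rhs, transactions)  # 1 of the 2 with lhs
print(rule_support, rule_confidence)
```

Here the rule {pizza, sausage, beer} → {chips} has support 0.25 and confidence 0.5, which mirrors the "probability is 50%" style of rule quoted earlier.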

Sequential Patterns

- Sequential patterns, e.g. the prediction that a customer who buys a particular product in one transaction will purchase a related product in a later transaction
  - Can involve a set of products
  - Patterns are represented as sequences {S1}, {S2}
  - The first subsequence {S1} is a predictor of the second subsequence {S2}
  - Support is the percentage of times such a sequence occurs in the set of transactions
  - Confidence is the probability that when {S1} occurs, {S2} will occur on a subsequent transaction; it can be calculated from observed data
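
As a rough sketch of how these two measures could be calculated from observed data, the code below counts per-customer purchase histories. Counting per customer rather than per transaction is an assumption made here, and the data and function names are hypothetical.

```python
# Hypothetical purchase histories: an ordered list of transactions per customer
histories = {
    "c1": [{"tv"}, {"camera"}, {"video"}],
    "c2": [{"tv"}, {"video"}],
    "c3": [{"tv"}, {"camera"}],
    "c4": [{"camera"}, {"tv"}],
    "c5": [{"radio"}],
}

def first_index(history, items, start=0):
    """Index of the first transaction at or after `start` containing `items`."""
    for i in range(start, len(history)):
        if items <= history[i]:
            return i
    return None

def follows(history, s1, s2):
    """True if s2 occurs in a transaction strictly after one containing s1."""
    i = first_index(history, s1)
    return i is not None and first_index(history, s2, i + 1) is not None

def sequence_support(histories, s1, s2):
    """Fraction of all histories in which s1 is later followed by s2."""
    return sum(follows(h, s1, s2) for h in histories.values()) / len(histories)

def sequence_confidence(histories, s1, s2):
    """Among histories containing s1, the fraction where s2 follows it."""
    with_s1 = [h for h in histories.values() if first_index(h, s1) is not None]
    if not with_s1:
        return 0.0
    return sum(follows(h, s1, s2) for h in with_s1) / len(with_s1)

seq_support = sequence_support(histories, {"tv"}, {"camera"})
seq_confidence = sequence_confidence(histories, {"tv"}, {"camera"})
print(seq_support, seq_confidence)
```

Customer c4 bought a camera before the TV, so that history contains {S1} but does not count as {S1} followed by {S2}.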

Time Series Patterns

- A time series is a sequence of events that are all of the same type
  - Sales figures, stock prices, interest rates, inflation rates, and many other quantities can be analyzed using time series
- Time series data can be studied to discover patterns and sequences
  - For example, we can look at the data to find the longest period when the figures continued to rise each month, or find the steepest decline from one month to the next
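
The two example queries above, the longest rising period and the steepest month-to-month decline, can be sketched in a few lines. The monthly figures and function names are made up, and the series is assumed to hold at least two values.

```python
def longest_rising_run(values):
    """Return (start, end) indexes of the longest run of strictly rising values."""
    best = (0, 0)
    start = 0
    for i in range(1, len(values)):
        if values[i] <= values[i - 1]:
            start = i  # the rise was broken; a new run starts here
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best

def steepest_decline(values):
    """Return (index, drop): the month with the largest fall from the previous one."""
    drops = [(values[i - 1] - values[i], i) for i in range(1, len(values))]
    drop, index = max(drops)
    return index, drop

# Hypothetical monthly sales figures
monthly_sales = [100, 120, 90, 95, 110, 130, 125, 80]
rising = longest_rising_run(monthly_sales)
decline = steepest_decline(monthly_sales)
print(rising, decline)
```

For this series the longest rise runs from index 2 to index 5 (90, 95, 110, 130), and the steepest decline is the drop of 45 into the last month.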

Data Mining Methods: Regression

- A statistical method for predicting the value of an attribute Y (the dependent variable), given the values of attributes X1, X2, …, Xn (the independent variables)
- Statistical packages allow users to identify potential factors for predicting the value of the dependent variable
- Using linear regression, the package finds the contribution or weight of each independent variable, as coefficients a0, a1, …, an for a linear function Y = a0 + a1 X1 + a2 X2 + … + an Xn
- Can also use non-linear regression, using curve-fitting: finding the equation of the curve that fits the observed values
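
To make the linear case concrete, here is a minimal sketch of fitting the coefficients a0, a1, …, an by ordinary least squares. It solves the normal equations with Gauss-Jordan elimination in plain Python. This is an illustration only, not how any particular statistical package works; the function name `fit_linear` and the data are assumptions, and the independent variables are assumed not to be collinear.

```python
def fit_linear(X, y):
    """Least-squares coefficients [a0, a1, ..., an] for
    Y = a0 + a1*X1 + ... + an*Xn, via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # leading 1 gives a0
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]  # zero here would mean collinear variables
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]

# Made-up observations generated from Y = 1 + 2*X1 + 3*X2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coefficients = fit_linear(X, y)
print([round(a, 6) for a in coefficients])
```

Because the sample data lie exactly on the plane Y = 1 + 2 X1 + 3 X2, the fitted coefficients recover those weights up to floating-point rounding.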

Neural Networks

- Methods from AI using a set of samples to find the strongest relationships between variables and observations
- Use a learning method, adapting as they learn new information
- Hidden layers are developed by the system as it examines cases, using a generalized regression technique
- The system refines its hidden layers until it has learned to predict correctly a certain percentage of the time; then test cases are provided to evaluate it
- Problems
  - Overfitting the curve: the prediction function fits the training set values too perfectly, even ones that are incorrect (data noise)
  - Knowledge of how the system makes its predictions is in the hidden layers
  - Output may be difficult to understand and interpret
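
The train-until-accurate, then-test workflow can be sketched with a single-neuron perceptron. This is a deliberately simplified stand-in: it has no hidden layers, so it is not the kind of network described above, and the data, names, and learning rate are all assumptions chosen for illustration.

```python
def predict(w, x):
    """Output 1 if the weighted sum (w[0] is the bias) is positive, else 0."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else 0

def accuracy(w, cases):
    """Fraction of (input, target) cases predicted correctly."""
    return sum(predict(w, x) == t for x, t in cases) / len(cases)

def train(cases, target_acc=1.0, rate=1.0, max_epochs=1000):
    """Adjust weights until the training accuracy reaches target_acc."""
    w = [0.0] * (len(cases[0][0]) + 1)
    for _ in range(max_epochs):
        if accuracy(w, cases) >= target_acc:
            break
        for x, t in cases:
            err = t - predict(w, x)  # -1, 0, or +1
            w[0] += rate * err
            for i, xi in enumerate(x):
                w[i + 1] += rate * err * xi
    return w

# Hypothetical training set (logical AND) and separate test cases
training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
testing = [((2, 2), 1), ((0, 0.5), 0)]

weights = train(training)
train_acc = accuracy(weights, training)
test_acc = accuracy(weights, testing)
print(weights, train_acc, test_acc)
```

Evaluating on cases that were not used for training is what exposes overfitting: a model can score perfectly on the training set and still do poorly on the test cases.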

Clustering

- Methods used to place cases into clusters or groups that can be disjoint or overlapping
- Using a training set, the system identifies a set of clusters into which the tuples of the database can be grouped
- Tuples in each cluster are similar, and they are dissimilar to tuples in other clusters
- Similarity is measured by using a distance function defined for the data
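
One way to see how a distance function drives the grouping is a small k-means sketch. The slides do not name a specific algorithm, so k-means is an assumption here, as are Euclidean distance, the naive choice of the first k points as starting centroids, and the sample tuples. It produces disjoint clusters only.

```python
import math

def distance(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, iterations=10):
    """Group points into k disjoint clusters around moving centroids."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Hypothetical tuples, e.g. (visits per month, average basket size)
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
clusters = k_means(points, k=2)
print(clusters)
```

The result depends on the distance function and on the starting centroids, which is why practical tools usually try several initializations.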

Data Mining Process

- Data preprocessing
  - Data selection: identify target datasets and relevant fields
  - Data cleaning
    - Remove noise and outliers
  - Data transformation
    - Create common units
    - Generate new fields
- Data mining model construction
- Model evaluation
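
The preprocessing steps can be sketched as three small functions applied in order. Everything specific here is hypothetical: the field names, the rule that a plausible visit count lies between 1 and 365, and the cents-to-dollars conversion are assumptions chosen only to show selection, cleaning, and transformation.

```python
# Hypothetical raw records with mixed units and one implausible value
RAW = [
    {"customer": "A", "amount": 120.0, "unit": "USD", "visits": 4, "comment": "ok"},
    {"customer": "B", "amount": 4500.0, "unit": "cents", "visits": 3, "comment": ""},
    {"customer": "C", "amount": 80.0, "unit": "USD", "visits": 9999, "comment": "?"},
]
FIELDS = ("customer", "amount", "unit", "visits")  # relevant fields only

def select(rows):
    """Data selection: keep only the relevant fields."""
    return [{f: r[f] for f in FIELDS} for r in rows]

def clean(rows, max_visits=365):
    """Data cleaning: drop rows whose visit count is an obvious outlier."""
    return [r for r in rows if 1 <= r["visits"] <= max_visits]

def transform(rows):
    """Data transformation: common units (USD) plus a new derived field."""
    out = []
    for r in rows:
        usd = r["amount"] / 100 if r["unit"] == "cents" else r["amount"]
        out.append({
            "customer": r["customer"],
            "amount_usd": usd,
            "visits": r["visits"],
            "spend_per_visit": usd / r["visits"],  # generated field
        })
    return out

prepared = transform(clean(select(RAW)))
print(prepared)
```

The prepared rows would then feed model construction and evaluation, which are outside the scope of this sketch.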
