3 Objects (Views Synonyms Sequences) 4 PL/SQL blocks 5 Procedures Triggers 6 Enhanced SQL...
Chapter Structure
  Section labels in the diagram: Advanced SQL, New Trends, Advanced DB, Concepts, DB Admin.
  1 Course Introduction
  2 Oracle Introduction
  3 Objects (Views, Synonyms, Sequences)
  4 PL/SQL blocks
  5 Procedures, Triggers
  6 Enhanced SQL programming
  7 SQL & .NET applications
  8 OEM, DB structure
  9 DB security
  10 Backup, Recovery
  11 Large object
  12 Transaction Management
  14 Data Mining
  15 Data Warehousing
Data Mining – Business Intelligence
  Data explosion problem
    We are drowning in data, but starving for knowledge!
  Finding interesting structure in data (data-driven decision making practices, BBC Horizon - Age of Big Data)
    Structure: refers to statistical patterns, predictive models, hidden relationships
Purpose of Data Mining
  To provide knowledge that will give a company a competitive advantage, enabling it to earn a greater profit
  Goals of data mining
    Predict the future behavior of attributes
    Classify items, placing them in the proper categories
    Identify the existence of an activity or an event
    Optimize the use of the organization's resources
Applications of Data Mining
  Retailing
    Customer relations management (CRM)
    Advertising campaign management
  Banking and Finance
    Credit scoring
    Fraud detection and prevention
  Manufacturing
    Optimizing use of resources
    Manufacturing process optimization
    Product design
  Medicine
    Determining effectiveness of treatments
    Analyzing effects of drugs
    Finding relationships between patient care and outcomes
  Higher Education (Academic analytics)
    Which students will enroll in particular course programs
    Which students will need assistance in order to graduate
Commercial Support and Job Market
  Many Data Mining Tools
    http://www.kdnuggets.com/
  Database systems with data mining support
    Oracle 10g, 11g
    SQL Server 2005, 2008
  Hot topic
    http://groups.yahoo.com/group/datamining2/
    2677 members as of April 14, 2009
BI Market
  Worldwide BI software revenue is forecast to reach almost US$12.5 billion in 2012, up 7.2 percent over last year.
  The global BI software and services market is projected to expand rapidly from $79 billion in 2012 to $143 billion in 2016.
  Company          2009 Sales    Market Share (%)
  SAP              2,084.1       22.4
  Oracle           1,351.1       14.5
  SAS Institute    1,324.6       14.2
  IBM              1,135.6       12.2
  Microsoft          739.1        7.9
Data Mining and Business Intelligence
  [Figure: layered diagram; the potential to support business decisions increases from the bottom layer to the top]
  Layers, bottom to top:
    Data Sources: Paper, Files, Database systems, OLTP, WWW
    Data Warehouses/Data Marts: OLAP, MDA
    Data Exploration: Statistical Analysis, Reporting
    Data Mining: Information Discovery
    Data Presentation: Visualization
    Making Decisions
  Roles shown beside the layers: End User, DBA, Business Analyst, Data Analyst
Data Mining Methods (6 basic classes)
  Associations
    Finding rules like "if the customer buys frozen pizza, sausage, and beer, then the probability that he/she buys potato chips is 50%"
  Classifications
    Classify data based on the values of the decision attribute, e.g. classify patients based on their "state"
  Clustering
    Group data to form new classes, e.g. cluster customers based on their behavior to find common patterns
Data Mining Methods
  Sequential patterns
    Finding rules like "if the customer buys a TV and then, a few days later, buys a camera, then the probability that he/she will buy a video within 1 month is 50%"
  Time-series similarities
    Finding similar sequences (or subsequences) in time series (e.g. stock analysis)
  Deviation detection
    Finding anomalies/exceptions/deviations in data
Association and Classification Rules
  Association rules have the form {x} ⇒ {y}, where x and y are events that occur at the same time. They have measures of support and confidence.
    Support is the percentage of transactions that contain all items included in both the left and right sides
    Confidence is how often the rule proves to be true: of the transactions in which the left-hand side of the implication is present, the percentage in which the right-hand side is present as well
  Classification rules place instances into the correct one of several possible categories
    Developed using a training set: past instances for which the correct classification is known
    System develops a method for correctly classifying a new item whose class is currently unknown
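
The support and confidence measures defined above for association rules can be computed directly from a list of transactions. The following is a minimal sketch only; the function names and the basket data are hypothetical and invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= set(t)]
    if not with_lhs:
        return 0.0  # the rule never applies, so report zero confidence
    return sum(1 for t in with_lhs if rhs <= set(t)) / len(with_lhs)


# Hypothetical market baskets (illustrative data, not from the slides).
baskets = [
    {"pizza", "beer", "chips"},
    {"pizza", "beer"},
    {"pizza", "chips"},
    {"milk", "bread"},
]

# Rule {pizza, beer} => {chips}: support counts baskets holding both sides.
rule_support = support(baskets, {"pizza", "beer", "chips"})
rule_confidence = confidence(baskets, {"pizza", "beer"}, {"chips"})
print(rule_support, rule_confidence)
```

Here one basket in four contains all three items, and one of the two baskets containing pizza and beer also contains chips.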
Sequential Patterns
  Sequential patterns, e.g. prediction that a customer who buys a particular product in one transaction will purchase a related product in a later transaction
    Can involve a set of products
    Patterns are represented as sequences {S1}, {S2}
    First subsequence {S1} is a predictor of the second subsequence {S2}
    Support is the percentage of times such a sequence occurs in the set of transactions
    Confidence is the probability that when {S1} occurs, {S2} will occur on a subsequent transaction; it can be calculated from observed data
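
Support and confidence for a sequential pattern {S1}, {S2} can be estimated from observed purchase histories. The sketch below assumes each customer's history is a time-ordered list of transactions and counts per customer; that unit of counting, the function names, and the data are all assumptions made for illustration.

```python
def first_index(history, itemset):
    """Index of the first transaction containing `itemset`, or None."""
    for i, basket in enumerate(history):
        if itemset <= basket:
            return i
    return None


def follows(history, s1, s2):
    """True if s1 occurs and s2 occurs in a strictly later transaction."""
    i = first_index(history, s1)
    if i is None:
        return False
    return any(s2 <= basket for basket in history[i + 1:])


def sequence_support(histories, s1, s2):
    """Fraction of all histories in which s1 is later followed by s2."""
    return sum(follows(h, s1, s2) for h in histories) / len(histories)


def sequence_confidence(histories, s1, s2):
    """Among histories containing s1, the fraction where s2 follows."""
    with_s1 = [h for h in histories if first_index(h, s1) is not None]
    if not with_s1:
        return 0.0
    return sum(follows(h, s1, s2) for h in with_s1) / len(with_s1)


# Hypothetical purchase histories, one time-ordered list per customer.
histories = [
    [{"tv"}, {"camera"}, {"video"}],
    [{"tv"}, {"video"}],
    [{"tv"}, {"camera"}],
    [{"camera"}, {"tv"}],
    [{"radio"}],
]

seq_support = sequence_support(histories, {"tv"}, {"camera"})
seq_confidence = sequence_confidence(histories, {"tv"}, {"camera"})
print(seq_support, seq_confidence)
```

Two of the five customers buy a camera after a TV, and four customers buy a TV at all, so the confidence is higher than the support.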
Time Series Patterns
  A time series is a sequence of events that are all of the same type
    Sales figures, stock prices, interest rates, inflation rates, and many other quantities can be analyzed using time series
  Time series data can be studied to discover patterns and sequences
    For example, we can look at the data to find the longest period when the figures continued to rise each month, or find the steepest decline from one month to the next
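
The two example queries above (longest period of continuous rise, steepest month-to-month decline) can each be answered with one pass over the series. This is a minimal sketch; the function names and the monthly figures are hypothetical, and the series is assumed to hold at least two values.

```python
def longest_rising_run(values):
    """Return (start, end) indices of the longest strictly rising stretch."""
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(values)):
        if values[i] <= values[i - 1]:
            start = i  # the rise is broken, so a new run starts here
        if i - start > best_end - best_start:
            best_start, best_end = start, i
    return best_start, best_end


def steepest_decline(values):
    """Return (index, drop): the largest fall from values[index] to the next value."""
    drops = [(values[i] - values[i + 1], i) for i in range(len(values) - 1)]
    drop, i = max(drops)
    return i, drop


# Hypothetical monthly sales figures (illustrative data only).
monthly_sales = [100, 120, 90, 95, 110, 130, 80, 85]

rise = longest_rising_run(monthly_sales)
fall = steepest_decline(monthly_sales)
print(rise, fall)
```

With these figures the longest rise runs from index 2 to index 5, and the steepest decline is the drop of 50 after index 5.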
Data Mining Methods: Regression
  A statistical method for predicting the value of an attribute, Y (the dependent variable), given the values of attributes X1, X2, …, Xn (the independent variables)
  Statistical packages allow users to identify potential factors for predicting the value of the dependent variable
  Using linear regression, the package finds the contribution or weight of each independent variable, as coefficients a0, a1, …, an for a linear function
    Y = a0 + a1 X1 + a2 X2 + … + an Xn
  Can also use non-linear regression, using curve-fitting, finding the equation of the curve that fits the observed values
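
For the special case of a single independent variable (n = 1), the coefficients a0 and a1 of the linear function above have a closed-form least-squares solution. The sketch below shows that case only; the function name and the sample points are hypothetical, and a statistical package would normally be used for several variables.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a0 + a1 * X for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx              # slope: weight of the independent variable
    a0 = mean_y - a1 * mean_x   # intercept
    return a0, a1


# Hypothetical observations that lie exactly on Y = 1 + 2 X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a0, a1 = fit_line(xs, ys)
predicted = a0 + a1 * 5.0  # predict Y for a new value X = 5
print(a0, a1, predicted)
```

Because the sample points lie on a straight line, the fit recovers the intercept and slope used to generate them; noisy data would give the best-fitting line instead.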
Neural Networks
  Methods from AI using a set of samples to find the strongest relationships between variables and observations
  Use a learning method, adapting as they learn new information
  Hidden layers developed by the system as it examines cases, using a generalized regression technique
  System refines its hidden layers until it has learned to predict correctly a certain percentage of the time; then test cases are provided to evaluate it
  Problems:
    Overfitting the curve: the prediction function fits the training set values too perfectly, even ones that are incorrect (data noise)
    Knowledge of how the system makes its predictions is in the hidden layers
    Output may be difficult to understand and interpret
Clustering
  Methods used to place cases into clusters or groups that can be disjoint or overlapping
  Using a training set, system identifies a set of clusters into which the tuples of the database can be grouped
  Tuples in each cluster are similar, and they are dissimilar to tuples in other clusters
  Similarity is measured by using a distance function defined for the data
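
One common way to form disjoint clusters with a distance function is k-means, which the slides do not name; it is used here only as an illustration. The sketch assumes numeric tuples, Euclidean distance, and caller-supplied starting centroids, and all names and data are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Place each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda k: distance(p, centroids[k]))
        clusters[nearest].append(p)
    return clusters


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return assign(points, centroids), centroids


# Hypothetical two-attribute tuples forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

The result depends on the starting centroids and on the chosen distance function; a different distance definition can group the same tuples differently.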
Data Mining Process
  Data preprocessing
    Data selection: identify target datasets and relevant fields
    Data cleaning
      Remove noise and outliers
    Data transformation
      Create common units
      Generate new fields
  Data mining model construction
  Model evaluation
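
The preprocessing steps listed above (selection, cleaning, transformation) can be sketched as three small functions applied in order. Everything here is an assumption for illustration: the field names, the outlier threshold `max_quantity`, and the sample rows are hypothetical, and real cleaning rules depend on the dataset.

```python
CM_PER_INCH = 2.54  # exact conversion factor


def select(rows, fields):
    """Data selection: keep only the relevant fields of each row."""
    return [{f: r.get(f) for f in fields} for r in rows]


def clean(rows, max_quantity):
    """Data cleaning: drop rows with missing values or implausible quantities."""
    return [r for r in rows
            if all(v is not None for v in r.values())
            and 0 < r["quantity"] <= max_quantity]


def transform(rows):
    """Data transformation: common units (cm) and a new derived field."""
    out = []
    for r in rows:
        length_cm = r["length"] * CM_PER_INCH if r["unit"] == "in" else r["length"]
        out.append({
            "length_cm": length_cm,
            "quantity": r["quantity"],
            "revenue": r["quantity"] * r["price"],  # generated field
        })
    return out


# Hypothetical raw records with mixed units, a missing value, and an outlier.
raw = [
    {"id": 1, "length": 10.0, "unit": "in", "quantity": 2, "price": 5.0},
    {"id": 2, "length": 30.0, "unit": "cm", "quantity": 3, "price": 4.0},
    {"id": 3, "length": 12.0, "unit": "cm", "quantity": None, "price": 4.0},
    {"id": 4, "length": 20.0, "unit": "cm", "quantity": 9000, "price": 1.0},
]

fields = ["length", "unit", "quantity", "price"]
prepared = transform(clean(select(raw, fields), max_quantity=100))
print(prepared)
```

Only the first two records survive cleaning; the prepared rows would then feed model construction and evaluation.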