Data Mining (and Machine Learning) With Microsoft Tools Michael Lisin, Plaster Group May 8, 2014.
Data Mining (and Machine Learning) With Microsoft Tools
Michael Lisin, Plaster Group
May 8, 2014
Why Reinvent a Toilet?
Definitions
Concept: Definition / Solution For
• Data Mining: Algorithms to discover unknown data patterns
• Machine Learning: Algorithms to predict based on data patterns
• Statistics: Branch of mathematics, methods of data collection and interpretation
• Data Science: All of the above + Data Visualization
What Do You Think?
Is Linear Regression:
• Data Mining
• Machine Learning
• Statistics
• All of the above
Linear Regression is a straight line describing how variable Y responds to changes in variable X
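The straight-line definition above can be sketched as follows. This is a minimal illustration assuming ordinary least squares on a single input; the function name `fit_line` and the sample numbers are made up for the example and do not come from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The slope says how much Y changes per unit change in X, which is the "responds to changes" part of the definition.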
MS DM Environment
• SQL Server 2000 - 2014
• Excel Data Mining Add-Ins (optional, recommended)
• Interact with: Excel (add-ins), SQL Server Management Studio, SQL Server Data Tools (SSDT), Custom Code
SQL Edition
Editions: Enterprise, BI, Standard
Component: Capability
• SSIS: Text Mining
• SSAS: DM basic
• SSAS: DM advanced (CV, prediction queries, …)
SSDT, Custom Code
Start With a Question
Many Potential Questions
MS DM Capabilities
Generic question: What are the data patterns?
Best if more specific and directed at a problem, for example:
• How do we combine our products to increase profits?
• How do we predict the demand for a product / service?
• Why are customers buying from us?
• Where can we best cut costs?
• What are the opportunities to reduce risks?
• Who are our best customers?
• …
Approach
• Define problem / questions
• Prepare data
• Build model
• Validate model
• Implement predictions
• Automate model refresh
• Extend / custom applications
More Technical
SQL DM Algorithms Summary
Discrete | Continuous | Sequence | Common Group | Similar Group | TXT Semantic
• Decision Trees [Classify, Estimate]
• Linear Regression [Advanced]
• Time Series [Forecast (T), Forecast]
• Clustering [Detect Categories (T), Except, Cluster]
• Sequence Clustering [Advanced]
• Neural Network [Advanced]
• Logistic Regression [Fill From Sample (T), Scenario Analysis (T), Prediction Calculator (T)]
• Association Rules [Shopping Basket (T), Associate]
• Naïve Bayes [Analyze Key Influencers (T)]
• Text Mining (matching, grouping, extracting)
Predict Using Models
DMX = Data Mining Extensions to query models for predictions
DMX Query:
    SELECT
        Model.[Bike Buyer],
        PredictProbability(Model.[Bike Buyer]),
        NewData.Email
    FROM [Model]
    NATURAL PREDICTION JOIN
        (SELECT Age, [Commute Distance], Email
         FROM …) AS NewData
Output:
    Yes 0.1423 rob4@...
    No 0.5698 elizabeth5@...
    Yes 0.9045 eugene10@...
    …
Demo
Appendix
SQL Server Data Mining Algorithms
• Decision Tree
• Linear Regression
• Clustering
• Sequence Clustering
• Association
• Naive Bayes
• Neural Network
• Time Series
• Text Mining
  – Fuzzy Grouping
  – Term Extraction
  – Term Lookup
Key SQL Server Algorithms - 1
Decision Tree - makes predictions based on the relationships between input columns in a dataset. It identifies the input columns that are correlated with the predictable column and predicts from the resulting tendency toward a particular outcome. Example: predict which customers are likely to be satisfied with a company, based on some input variables (# purchases, avg. transaction size).
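The core step of a decision tree, choosing the split on an input column that best separates the outcomes, can be sketched as below. Gini impurity is an assumption chosen for illustration (the Microsoft algorithm's scoring method may differ), and the function names and customer data are hypothetical. A full tree would apply this step recursively to each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    n = len(values)
    # Skip the largest value: splitting there would leave the right side empty
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: number of purchases vs. whether the customer was satisfied
purchases = [1, 2, 2, 5, 6, 8]
satisfied = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(purchases, satisfied)
print(threshold, impurity)
```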
Linear Regression - a variation of the Decision Trees algorithm that calculates a linear relationship between a dependent and an independent variable, and then uses that relationship for prediction. The algorithm is best suited to predicting a continuous attribute. Example: product demand, price, site visitors.
Clustering - a segmentation algorithm that uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and customer segmentation.
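The iterative grouping described above can be illustrated with k-means (Lloyd's algorithm), one of the techniques named in the links at the end of this deck. This is a sketch on 2-D points with hand-picked starting centers; the data and the function name `kmeans` are assumptions, and it is not the Microsoft implementation.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance)
            idx = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[idx].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical data: two visibly separate groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centers)
```

A point that ends up far from every center is a candidate anomaly, which is how clustering supports the exception-finding use mentioned above.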
Key SQL Server Algorithms - 2
Sequence Clustering - is similar to the Clustering algorithm; however, instead of finding clusters of cases that contain similar attributes, this algorithm finds clusters of cases that contain similar paths in a sequence. It is used to explore data that contains events that can be linked by following paths, or sequences. For example: the click paths that are created when users navigate a Web site; the order in which a user follows a process.
Association - is useful for recommending products to customers (recommendation engine) based on items they have already bought, or in which they have indicated an interest. Example: market basket analysis.
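Market basket analysis can be sketched by counting how often pairs of items are bought together. The example below computes support (share of baskets containing both items) and confidence (share of baskets with the first item that also contain the second) for item pairs only; the basket contents, the threshold, and the function name `pair_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules 'a -> b' over item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules

# Hypothetical shopping baskets
baskets = [
    ["bike", "helmet"],
    ["bike", "helmet", "gloves"],
    ["bike"],
    ["helmet", "gloves"],
]
rules = pair_rules(baskets)
print(rules[("gloves", "helmet")])
```

A rule with high confidence, such as gloves -> helmet here, is the kind of pattern a recommendation engine would act on.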
Naive Bayes - is a classification algorithm. It uses Bayes' theorem but does not take into account dependencies that may exist between inputs, so its assumptions are said to be naive. It can be used for initial exploration of data; the results can later guide additional mining models built with more computationally intensive and more accurate algorithms. Example: send mailers only to those customers who are likely to respond.
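The "naive" independence assumption means the probability of each class is just the class prior multiplied by one factor per input column. A minimal sketch for categorical inputs follows, assuming add-one smoothing and two possible values per column; the mailer data and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict(model, row):
    """Return (most likely class, its normalized probability)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(row):
            # Add-one smoothing; the +2 assumes two possible values per feature
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())

# Hypothetical customers: (housing, responded to a previous mailer)
rows = [("own", "yes"), ("own", "yes"), ("own", "no"),
        ("rent", "no"), ("rent", "no"), ("rent", "yes")]
labels = ["respond", "respond", "respond", "ignore", "ignore", "ignore"]
label, prob = predict(train(rows, labels), ("own", "yes"))
print(label, round(prob, 3))
```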
Key SQL Server Algorithms - 3
Neural Network algorithm combines each possible state of the input attribute with each possible state of the predictable attribute, and uses the data to calculate probabilities. It is useful for analyzing complex input data, such as from a manufacturing or commercial process, or business problems for which a significant quantity of data is available but for which rules cannot be easily derived by using other algorithms.
Time Series algorithm provides regression algorithms that are optimized for the forecasting of continuous values, such as product sales, over time. Whereas other Microsoft algorithms, such as decision trees, require additional columns of new information as input to predict a trend, a time series model does not.
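The point that a time series model needs no extra input columns can be illustrated with a first-order autoregression, where each value is predicted from the previous value of the same series. This model is an assumption picked for brevity and is far simpler than the Microsoft Time Series algorithm; the sales figures and function names are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = a * x[t-1] + b using the series' own lagged values."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p, mean_c = sum(prev) / n, sum(curr) / n
    a = (sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
         / sum((p - mean_p) ** 2 for p in prev))
    return a, mean_c - a * mean_p

def forecast(series, steps):
    """Roll the fitted model forward, feeding each prediction back in as input."""
    a, b = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = a * last + b
        out.append(last)
    return out

# Hypothetical sales history: each value is 1.1 times the previous one
sales = [100, 110, 121, 133.1]
predicted = forecast(sales, 2)
print(predicted)
```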
Text Mining algorithm analyzes unstructured text data. This allows companies to analyze unstructured data such as a "comments" section on a customer satisfaction survey. This algorithm is available in SQL Server Integration Services.
SQL Text Mining
Term Extraction Transformation
Creates (extracts) a list of terms discovered in the source
Writes the terms (+score) to a transformation output column
Limitations:
• English only
• Nouns or noun phrases only
Term Lookup Transformation
Matches terms extracted from text in an input with terms in a reference table. Counts the number of times a term in the lookup table occurs in the input data set, and writes the count together with the term from the reference table to columns in the transformation output.
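The counting behavior of a term lookup can be sketched in a few lines. This only mimics the idea (whole-word, case-insensitive matching against a reference list) and is not the SSIS transformation itself; the comments, the reference terms, and the function name `term_lookup` are hypothetical.

```python
import re
from collections import Counter

def term_lookup(texts, reference_terms):
    """Count how often each reference term occurs across the input texts."""
    counts = Counter()
    for text in texts:
        for term in reference_terms:
            # Whole-word, case-insensitive match
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[term] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    # Report only terms that actually occur, paired with their counts
    return {term: n for term, n in counts.items() if n > 0}

# Hypothetical survey comments and reference table
comments = [
    "The bike seat is great",
    "Bike arrived late; seat was damaged",
    "Great service",
]
result = term_lookup(comments, ["bike", "seat", "helmet"])
print(result)
```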
Fuzzy Grouping Transformation
Selects a canonical row for each group and identifies rows whose text matches it, from fuzzy to exact. Output: UID, Group ID, Similarity Score 0..1
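A rough sketch of the grouping idea, using Python's `difflib.SequenceMatcher` as a stand-in similarity measure (SSIS uses its own matching logic). Each row gets a row ID, the ID of its group's canonical row, and a similarity score between 0 and 1; the threshold, the names, and the function name `fuzzy_group` are assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_group(rows, threshold=0.8):
    """Return (row id, group id, similarity) per row; group id is the canonical row's id."""
    canonical = []  # (row id, text) of the first row seen in each group
    result = []
    for row_id, text in enumerate(rows):
        # A row with no sufficiently similar canonical row starts its own group
        best_id, best_score = row_id, 1.0
        for canon_id, canon_text in canonical:
            score = SequenceMatcher(None, text.lower(), canon_text.lower()).ratio()
            if score >= threshold:
                # Assign to the first canonical row that meets the threshold
                best_id, best_score = canon_id, score
                break
        if best_id == row_id:
            canonical.append((row_id, text))
        result.append((row_id, best_id, round(best_score, 2)))
    return result

# Hypothetical company names with spelling variations
names = ["Plaster Group", "Plaster Grp", "Contoso Ltd", "plaster group"]
grouped = fuzzy_group(names)
print(grouped)
```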
Supplemental
• Sampling (training and test sets, uniform representation):
  – Row (Quantity) Sampling Transformation
  – Percentage Sampling Transformation
• Sort Transformation
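Splitting data into training and test sets by percentage can be sketched as below. This is a plain random split with a fixed seed, not the SSIS transformation; the function name `percentage_sample` and the 30% figure are assumptions for the example.

```python
import random

def percentage_sample(rows, percent, seed=0):
    """Randomly split rows into a sample of roughly `percent` % and the remainder."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * percent / 100)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset of 100 row IDs: hold out 30% for testing
rows = list(range(100))
test_set, training_set = percentage_sample(rows, 30)
print(len(test_set), len(training_set))
```

Shuffling before cutting is what gives every row the same chance of landing in either set, which is the "uniform representation" the slide refers to.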
Interesting Links
• Sources of free data for research
  – https://opendata.socrata.com
  – http://datamarket.azure.com
  – http://aws.amazon.com/datasets
  – http://www.google.com/publicdata/directory
• Algorithms
  – http://msdn.microsoft.com/en-us/library/ms174879.aspx
  – http://research.microsoft.com/apps/pubs/default.aspx?id=69669
  – http://academic.research.microsoft.com/Paper/4499824
  – http://academic.research.microsoft.com/Paper/226089.aspx
  – http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/
  – http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
  – http://axon.cs.byu.edu/Dan/678/papers/Cluster/Xu.pdf
  – http://www.epa.gov/bioiweb1/statprimer/tableall.html#multivclustr
  – http://research.microsoft.com/pubs/69669/tr-98-35.pdf
  – http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
  – http://msdn.microsoft.com/en-us/library/dd299424(v=SQL.100).aspx
  – http://msdn.microsoft.com/en-us/library/cc280445.aspx
  – http://www.sqlserverdatamining.com/ssdm/Home/DataMiningAddinsLaunch/tabid/69/Default.aspx
Useful Terms
• Population: a group of cases
  – Valid: purchasers = customers who purchased items
  – Questionable: purchasers = customers who indicated in a survey that they would buy an item. The actual population here is customers who answered surveys; intent does not indicate behavior, plus the sample may be insufficient
• Sample: random subset of data. Correct sample size selection requires knowledge of the data.
• Range: all values, including exceptions and outliers
• Bias: incorrect results, often from incorrect non-random sample selection, e.g. selecting Seattle to represent WA
• Mean or average: sum of values / number of samples
• Distribution: frequency of a value, typically arranged as a graph around the mean
• Variance = average squared deviation from the mean
• Standard Deviation = square root of the variance
• Correlation: degree to which one variable changes as another variable changes
• Overfitting: model accurately fits the sample, but not the real world
• Underfitting: model is not able to establish a useful pattern
• Cross validation: checking the model on a subset of inputs not used in model generation
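The numeric terms above can be worked through on a small example. The sketch assumes the population form of variance (dividing by the number of values, matching the mean as defined above) and the Pearson coefficient for correlation; the data and function names are illustrative only.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def std_dev(values):
    return math.sqrt(variance(values))

def correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std_dev(xs) * std_dev(ys))

# Illustrative values
values = [2, 4, 4, 4, 5, 5, 7, 9]
doubled = [2 * v for v in values]
print(mean(values), variance(values), std_dev(values))
print(correlation(values, doubled))
```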