Artificial Intelligence for Automating Data Analysis

Manuel Martín Salvador, Smart Technology Research Centre, 27th November 2013

description

The need to analyse large volumes of data has grown over the last few decades. The process of selecting, cleaning, modelling and interpreting data is called the Knowledge Discovery in Databases (KDD) process. Decisions on how to approach each step of this process have usually been made manually by experts. However, experts cannot be aware of all methods, nor is it feasible to try all of them. Researchers have proposed different approaches for automating, or at least advising on, the stages of the KDD process. This talk outlines the different types of Intelligent Discovery Assistants described by Serban et al. in “A survey of intelligent assistants for data analysis”, and points out some future directions.

Transcript of Artificial Intelligence for Automating Data Analysis

Page 1: Artificial Intelligence for Automating Data Analysis

Artificial Intelligence for Automating Data Analysis

Manuel Martín Salvador
Smart Technology Research Centre

27th November 2013

Page 2: Artificial Intelligence for Automating Data Analysis

Outline

1. Data and KDD Process
2. Support for Analysts
3. Prior Knowledge
4. Types of IDAs
5. Future Directions
6. References

Presentation based on the paper by Serban et al., “A survey of intelligent assistants for data analysis”, 2013. http://dx.doi.org/10.1145/2480741.2480748

Page 3: Artificial Intelligence for Automating Data Analysis

Data

Many domains: biology, geography, telecommunications, sales, process industry...
Structured and non-structured
Single source and multiple sources
Imperfect data: missing values, outliers...

Page 13: Artificial Intelligence for Automating Data Analysis

KDD process

0. Goal?
1. Selection: Raw Data to Target Data
2. Preprocessing: Target Data to Preprocessed Data
3. Transformation: Preprocessed Data to Transformed Data
4. Data Mining: Transformed Data to Patterns
5. Interpretation / Evaluation: Patterns to Knowledge

Refining
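
The steps of the KDD process can be read as a chain of functions, each turning one data state into the next. The following is a minimal sketch in Python on a toy list of sensor readings. All function names, the data, and the 0.8 threshold are hypothetical illustrations and do not come from the talk.

```python
from statistics import mean

def select(raw, field):
    """1. Selection: keep only the field relevant to the goal."""
    return [row.get(field) for row in raw]

def preprocess(target):
    """2. Preprocessing: replace missing values with the mean of the observed ones."""
    observed = [v for v in target if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in target]

def transform(clean):
    """3. Transformation: rescale to [0, 1] (assumes the values are not all equal)."""
    lo, hi = min(clean), max(clean)
    return [(v - lo) / (hi - lo) for v in clean]

def mine(transformed, threshold=0.8):
    """4. Data mining: a trivial 'pattern', the positions of unusually high values."""
    return [i for i, v in enumerate(transformed) if v >= threshold]

def interpret(patterns):
    """5. Interpretation / Evaluation: turn patterns into a statement for the analyst."""
    return f"{len(patterns)} high reading(s) at positions {patterns}"

# Invented raw data with one missing value.
raw_data = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "a", "temp": None},
    {"sensor": "a", "temp": 22.0},
    {"sensor": "a", "temp": 30.0},
]

knowledge = interpret(mine(transform(preprocess(select(raw_data, "temp")))))
print(knowledge)
```

Every function here hides a decision (which field, which imputation, which scaling, which threshold), and those decisions are exactly where an analyst may need guidance.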

Page 14: Artificial Intelligence for Automating Data Analysis

Starting a KDD process

Novice Analysts: Overwhelmed; trial and error
Advanced Analysts: Comfort area; no further exploration
Problems: Lack of guidance; increasing number of techniques; large volumes of data

Page 15: Artificial Intelligence for Automating Data Analysis

Supporting analysts

Single step of KDD process: Hints and advice for data selection; support in choosing a suitable algorithm and parameters.
Multiple steps of KDD process: Help regarding the sequence of operators and their parameters.
Graphical Design of KDD workflows: GUIs for interactively building the process manually.
Automatic KDD workflow generation: Based on the data and description of their task, the users receive a set of possible scenarios for solving a problem.
Explanations: The rationale behind a decision or a result allows the user to reason about the aid provided.

Page 20: Artificial Intelligence for Automating Data Analysis

Prior knowledge

Meta-data of the input dataset: Data properties such as number of attributes, amount of missing values, or information-theoretic measures.

Meta-data of operators: External (inputs, outputs, preconditions and effects) and Internal (structure and performance).

Case base: Set of successful prior data analysis workflows.
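
As an illustration of dataset meta-data, the sketch below computes three simple meta-features: the number of attributes, the share of missing cells, and the class entropy as one information-theoretic measure. The function name, the row layout (class label in the last column, `None` for a missing value) and the data are assumptions made for this example.

```python
import math
from collections import Counter

def meta_features(rows, class_index=-1):
    """Compute a few simple meta-features of a dataset given as a list of rows."""
    n_rows = len(rows)
    n_attributes = len(rows[0]) - 1  # every column except the class label
    cells = [value for row in rows for value in row]
    missing_ratio = sum(value is None for value in cells) / len(cells)
    class_counts = Counter(row[class_index] for row in rows)
    class_entropy = -sum(
        (count / n_rows) * math.log2(count / n_rows)
        for count in class_counts.values()
    )
    return {
        "n_rows": n_rows,
        "n_attributes": n_attributes,
        "missing_ratio": missing_ratio,
        "class_entropy": class_entropy,
    }

# Invented dataset: two numeric attributes and a class label.
dataset = [
    (5.1, 3.5, "a"),
    (4.9, None, "a"),
    (6.2, 2.9, "b"),
    (5.9, 3.0, "b"),
]
features = meta_features(dataset)
print(features)
```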

Page 21: Artificial Intelligence for Automating Data Analysis

Types of IDAs

Intelligent Discovery Assistant (IDA): System that supports user in the data analysis process.

1. Expert Systems: Apply rules defined by human experts to suggest useful techniques.

[Diagram: Expert System. Components: Experts, Rules, Expert System, User (Q&A), Ranking of useful techniques.]

Page 22: Artificial Intelligence for Automating Data Analysis

Types of IDAs

REX [Gale 1986]: linear regression.
SPRINGEX [Raes 1992]: multivariate and non-parametric statistics.
Statistical Navigator [Raes 1992]: multivariate causal analysis and classification.
KENS [Hand 1987], NONPAREIL [Hand 1990] and LMG [Hand 1990]: manual exploration of rules.
Consultant-2 [Craw et al. 1992]: first IDA for machine learning algorithms.
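
The expert-system idea can be sketched as a set of hand-written rules applied to the user's answers, producing a ranking of techniques. This is only a toy illustration: the rules, scores and question names are hypothetical and are not taken from any of the systems listed.

```python
# Hypothetical expert rules: (condition on the user's answers, technique, score).
RULES = [
    (lambda a: a["target"] == "numeric", "linear regression", 2),
    (lambda a: a["target"] == "numeric" and not a["normal_residuals"],
     "non-parametric regression", 3),
    (lambda a: a["target"] == "categorical", "classification", 2),
    (lambda a: a["n_variables"] > 1, "multivariate analysis", 1),
]

def rank_techniques(answers):
    """Fire every rule whose condition holds and rank techniques by total score."""
    scores = {}
    for condition, technique, score in RULES:
        if condition(answers):
            scores[technique] = scores.get(technique, 0) + score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))

# Answers collected from the user in a question-and-answer session.
answers = {"target": "numeric", "normal_residuals": False, "n_variables": 3}
ranking = rank_techniques(answers)
print(ranking)
```

The quality of the advice depends entirely on the rules the experts wrote down, which is the limitation that motivates learning such rules automatically.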

Page 23: Artificial Intelligence for Automating Data Analysis

Types of IDAs

2. Meta-Learning Systems: Automatically learn such rules from prior data analysis runs.

[Diagram: Meta-Learning System. Training: Evaluations of algorithms, Meta-data of datasets, Meta-database, Meta-learner, Model. Prediction: New dataset, User preferences, Advice/Ranking of algorithms.]

Page 24: Artificial Intelligence for Automating Data Analysis

Types of IDAs

StatLog [Michie et al. 1994]: A decision tree model is built for each algorithm predicting whether or not it is applicable on a new dataset.
The Data Mining Advisor [Giraud-Carrier 2005]: A k-NN algorithm is trained to predict algorithm performance on a new dataset.
NOEMON [Kalousis et al. 2001]: Pairwise models are built and stored in a knowledge base. Scores based on wins/ties/losses are obtained for each algorithm in order to create a ranking.
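
A k-NN meta-learner of the kind described for the Data Mining Advisor could look roughly like the sketch below: the performance of each algorithm on a new dataset is predicted as the average performance on the k most similar past datasets. The meta-database, its values and all names are invented for illustration, and the distance is deliberately naive (a real system would at least normalise the meta-features).

```python
import math

# Hypothetical meta-database: meta-features of past datasets and the accuracy
# each algorithm achieved on them (all numbers invented).
META_DB = [
    ({"n_attributes": 4, "missing_ratio": 0.0}, {"decision tree": 0.94, "naive Bayes": 0.95}),
    ({"n_attributes": 6, "missing_ratio": 0.1}, {"decision tree": 0.90, "naive Bayes": 0.85}),
    ({"n_attributes": 60, "missing_ratio": 0.0}, {"decision tree": 0.70, "naive Bayes": 0.80}),
    ({"n_attributes": 50, "missing_ratio": 0.3}, {"decision tree": 0.72, "naive Bayes": 0.78}),
]

def distance(a, b):
    """Euclidean distance between two meta-feature dictionaries with the same keys."""
    return math.sqrt(sum((a[key] - b[key]) ** 2 for key in a))

def rank_algorithms(new_meta, k=2):
    """Average the performance on the k nearest past datasets and rank the algorithms."""
    neighbours = sorted(META_DB, key=lambda entry: distance(new_meta, entry[0]))[:k]
    algorithms = neighbours[0][1].keys()
    predicted = {
        alg: sum(performance[alg] for _, performance in neighbours) / k
        for alg in algorithms
    }
    return sorted(predicted, key=predicted.get, reverse=True), predicted

ranking, predicted = rank_algorithms({"n_attributes": 5, "missing_ratio": 0.05})
print(ranking, predicted)
```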

Page 25: Artificial Intelligence for Automating Data Analysis

Types of IDAs

3. Case-Based Reasoning Systems: Find and adapt workflows that were successful in similar cases.

[Diagram: Case-Based Reasoning System. Components: User, Experts, Meta-data, Case base, Case-based reasoner, Operators, Workflow editor, Workflow.]

Page 26: Artificial Intelligence for Automating Data Analysis

Types of IDAs

CITRUS [Engels 1996]: A case base of operators and workflows was created by experts. The most similar case is returned based on user needs and data statistics.
MiningMart [Morik et al. 2004]: A case base of workflows in an XML-based language is available online. Cases are described in an ontology. It offers a three-tier graphical editor: case, concept and relation editors.
The Hybrid Data Mining Assistant [Charest et al. 2008]: Combines CBR with the expert rules of expert systems. Apart from meta-features, the case base includes user satisfaction ratings, which are used for case ranking.
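
The retrieve-and-adapt cycle of case-based reasoning can be sketched as follows. The case base, the similarity measure, the use of a user rating as tie-breaker and the adaptation rule are all simplifying assumptions for this example, not the method of any system named above.

```python
# Hypothetical case base: a task description, a stored workflow and a user rating.
CASE_BASE = [
    {"task": "classification", "n_attributes": 10, "has_missing": True,
     "workflow": ["impute missing values", "decision tree"], "rating": 4},
    {"task": "classification", "n_attributes": 12, "has_missing": False,
     "workflow": ["normalise", "k-NN"], "rating": 5},
    {"task": "clustering", "n_attributes": 10, "has_missing": True,
     "workflow": ["impute missing values", "k-means"], "rating": 3},
]

def similarity(query, case):
    """Toy similarity: matching task, matching missing-value flag, close attribute count."""
    score = 1.0 if query["task"] == case["task"] else 0.0
    score += 1.0 if query["has_missing"] == case["has_missing"] else 0.0
    score += 1.0 / (1.0 + abs(query["n_attributes"] - case["n_attributes"]))
    return score

def retrieve(query):
    """Return the most similar case; ties are broken by the higher user rating."""
    return max(CASE_BASE, key=lambda case: (similarity(query, case), case["rating"]))

def adapt(case, query):
    """Adapt the stored workflow: impute only if the new data has missing values."""
    step = "impute missing values"
    workflow = [s for s in case["workflow"] if s != step]
    return [step] + workflow if query["has_missing"] else workflow

query_1 = {"task": "classification", "n_attributes": 11, "has_missing": True}
query_2 = {"task": "clustering", "n_attributes": 10, "has_missing": False}
workflow_1 = adapt(retrieve(query_1), query_1)
workflow_2 = adapt(retrieve(query_2), query_2)
print(workflow_1, workflow_2)
```

In the second query the retrieved clustering workflow is reused, but its imputation step is dropped because the new dataset has no missing values.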

Page 27: Artificial Intelligence for Automating Data Analysis

Types of IDAs

4. Planning-Based Data Analysis Systems: Use AI planners to generate and rank valid data analysis workflows.

[Diagram: Planning-Based Data Analysis System. Components: Experts, Ontology, User, Dataset, Planner, Plans, Ranker, Ranking of plans.]

Page 28: Artificial Intelligence for Automating Data Analysis

Types of IDAs

AIDE [Amant et al. 1998]: Multi-level planning based on hierarchical task network planning. A plan library contains subproblems and primitive operators.
IDEA [Bernstein et al. 2005]: Meta-data is encoded in an ontology. Valid plans are ranked by user preferences.
NExT [Bernstein et al. 2007]: CBR-extension of IDEA approach. Firstly, it retrieves the most suitable cases and then uses the planner for filling gaps.

Page 29: Artificial Intelligence for Automating Data Analysis

Types of IDAs

KDDVM [Diamantini et al. 2009]: A directed graph of operators is iteratively built using a custom algorithm. The operators are chosen from an ontology.
RDM [Zakova et al. 2010]: A two-planner system that uses an ontology formed of knowledge (datasets, constraints...), algorithms and KDD tasks.
eLico-IDA [Kietz et al. 2009]: An ontology with operators and their effects is queried for creating tasks that are sent to the HTN planner. A second ontology is used to rank the resulting plans.
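
To illustrate how operator meta-data (preconditions and effects) lets a planner generate valid workflows, the sketch below enumerates every operator sequence that reaches a goal and ranks the plans by cost. A plain depth-first forward search stands in for the HTN planners used by the systems above, and the operators, facts and costs are hypothetical.

```python
# Hypothetical operators: name -> (preconditions, added facts, removed facts, cost).
OPERATORS = {
    "impute": ({"has_missing"}, {"complete"}, {"has_missing"}, 1),
    "drop incomplete rows": ({"has_missing"}, {"complete"}, {"has_missing"}, 2),
    "discretise": ({"complete", "numeric"}, {"nominal"}, {"numeric"}, 1),
    "naive Bayes": ({"complete", "nominal"}, {"model"}, set(), 1),
    "decision tree": ({"complete"}, {"model"}, set(), 3),
}

def plans(state, goal, max_steps=3):
    """Enumerate every operator sequence (up to max_steps) that reaches the goal."""
    if goal <= state:
        return [[]]
    if max_steps == 0:
        return []
    found = []
    for name, (pre, add, remove, _cost) in OPERATORS.items():
        # Applicable if preconditions hold and the operator adds something new.
        if pre <= state and not add <= state:
            next_state = (state - remove) | add
            found += [[name] + rest for rest in plans(next_state, goal, max_steps - 1)]
    return found

def cost(plan):
    """Total cost of a plan; stands in for a ranking by user preferences."""
    return sum(OPERATORS[name][3] for name in plan)

initial = frozenset({"has_missing", "numeric"})
ranked = sorted(plans(initial, {"model"}), key=lambda plan: (cost(plan), plan))
for plan in ranked:
    print(cost(plan), plan)
```

Invalid workflows, such as running naive Bayes on data that still has missing values, are never generated because their preconditions fail.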

Page 30: Artificial Intelligence for Automating Data Analysis

Types of IDAs

5. Workflow Composition Environments: Facilitate manual workflow creation and testing.

[Diagram: Workflow Composition Environment. Components: User, Dataset, Operators, Workflow editor, Workflow.]

Page 31: Artificial Intelligence for Automating Data Analysis

Types of IDAs

Canvas-Based Tools: IBM SPSS Modeler, SAS Enterprise Miner, Weka, RapidMiner or Knime.
Scripting-Based Tools: MATLAB, R or Python.

Page 32: Artificial Intelligence for Automating Data Analysis

Future directions

Cold start problem: A new dataset is not similar to any of the previous cases.
Adaptivity: Current IDAs are not able to adapt the workflows in the presence of new data.
Predictive models: To predict the effects of the operators given the input data.
Reduce expert dependency: Self-maintenance of case bases.
Combination of approaches: CBR + expert rules, CBR + planning...
Scalability: To deal with large repositories of operators and case bases.

Page 39: Artificial Intelligence for Automating Data Analysis

Thanks

These slides are available at http://slideshare.net/draxus

Page 40: Artificial Intelligence for Automating Data Analysis

References

AMANT, R. AND COHEN, P. 1998. Interaction with a mixed-initiative system for exploratory data analysis. Knowl. Based Syst. 10, 5, 265–273.
BERNSTEIN, A. AND DAENZER, M. 2007. The NExT system: Towards true dynamic adaptations of semantic web service compositions. In The Semantic Web: Research and Applications, Lecture Notes in Computer Science, vol. 4519, Springer, 739–748.
BERNSTEIN, A., PROVOST, F., AND HILL, S. 2005. Toward intelligent assistance for a data mining process: An ontology-based approach for cost-sensitive classification. IEEE Trans. Knowl. Data Eng. 17, 4, 503–518.
CHAREST, M., DELISLE, S., CERVANTES, O., AND SHEN, Y. 2008. Bridging the gap between data mining and decision support: A case-based reasoning and ontology approach. Intell. Data Anal. 12, 1–26.
CRAW, S., SLEEMAN, D., GRANER, N., AND RISSAKIS, M. 1992. Consultant: Providing advice for the machine learning toolbox. In Proceedings of the Annual Technical Conference on Expert Systems (ES). 5–23.
DIAMANTINI, C., POTENA, D., AND STORTI, E. 2009b. Ontology-driven KDD process composition. In Advances in Intelligent Data Analysis VIII, Lecture Notes in Computer Science, vol. 5772, Springer, 285–296.
ENGELS, R. 1996. Planning tasks for knowledge discovery in databases: Performing task-oriented user-guidance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 170–175.
GALE, W. 1986. Rex review. In Artificial Intelligence and Statistics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA. 173–227.
GIRAUD-CARRIER, C. 2005. The data mining advisor: Meta-learning at the service of practitioners. In Proceedings of the International Conference on Machine Learning and Applications (ICMLA). 113–119.
HAND, D. 1987. A statistical knowledge enhancement system. J. Royal Stat. Soc. Series A (General) 150, 4, 334–345.
HAND, D. 1990. Practical experience in developing statistical knowledge enhancement systems. Ann. Math. Artif. Intell. 2, 1, 197–208.
KALOUSIS, A. AND HILARIO, M. 2001. Model selection via meta-learning: A comparative study. Int. J. Artif. Intell. Tools 10, 4, 525–554.
KIETZ, J., SERBAN, F., BERNSTEIN, A., AND FISCHER, S. 2009. Towards cooperative planning of data mining workflows. In Proceedings of the ECML-PKDD Workshop on Service-Oriented Knowledge Discovery. 1–12.
MICHIE, D., SPIEGELHALTER, D., AND TAYLOR, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ.
MORIK, K. AND SCHOLZ, M. 2004. The MiningMart approach to knowledge discovery in databases. In Intelligent Technologies for Information Analysis, N. Zhong and J. Liu, Eds., Springer, 47–65.
RAES, J. 1992. Inside two commercially available statistical expert systems. Stat. Comput. 2, 2, 55–62.
ZAKOVA, M., KREMEN, P., ZELEZNY, F., AND LAVRAC, N. 2010. Automating knowledge discovery workflow composition through ontology-based planning. IEEE Trans. Autom. Sci. Eng. 8, 2, 253–264.

Page 41: Artificial Intelligence for Automating Data Analysis

Acknowledgements

Satellite: http://commons.wikimedia.org/wiki/File:GPS_Satellite_NASA_art-iif.jpg
Industry: http://commons.wikimedia.org/wiki/File:Industry_Texas.jpg
DNA: http://commons.wikimedia.org/wiki/File:DNA_Double_Helix.png
Table: http://www.iconarchive.com/show/ravenna-3d-icons-by-double-j-design/Database-Table-icon.html
Car: http://en.wikipedia.org/wiki/File:Jurvetson_Google_driverless_car_trimmed.jpg
Twitter: http://www.flickr.com/photos/recampaign/5623528621/
Multiple sources: http://www.flickr.com/photos/inl/7895742584/
Thermometer: http://commons.wikimedia.org/wiki/File:Digital_thermometer.jpg
Traffic Control: http://commons.wikimedia.org/wiki/File:Air_Traffic_Control,_Abraham_Lincoln_CVN-72.jpg
Question Mark: http://commons.wikimedia.org/wiki/File:Question_mark_road_sign,_Australia.jpg
Noise: http://www.flickr.com/photos/benleto/3223155821/
Outliers: http://commons.wikimedia.org/wiki/File:Diagrama_de_caixa_com_outliers_and_whisker.png
Bowling: http://en.wikipedia.org/wiki/File:Lawn_Bowling_-_Tim_Mason1.jpg
Baby: http://www.flickr.com/photos/107489497@N06/10671592736/
Library: http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg
Back to the future car: http://lowrider-girl.deviantart.com/art/Back-To-The-Future-206312200
Coquette Icon Set: http://dryicons.com
Roboto font: http://developer.android.com/design/style/typography.html