CSE591 (575) Data Mining
1
CSE591 (575) Data Mining
1/21/2003 - 5/6/2003
Computer Science & Engineering
ASU
2
Introduction
Introduction to this Course
Introduction to Data Mining
3
Introduction to the Course
First, about you - why take this course?
  Your background and strength
    AI, DBMS, Statistics, Biology, …
  Your interests and requests
What is this course about?
  Problem solving
  Handling data
    transform data to workable data
  Mining data
    turn data to knowledge
    validation and presentation of knowledge
4
This course
What can you expect from this course?
  Knowledge and experience about DM
  Problem solving and solution presentation
How is this course conducted?
  Presentations
  Individual projects
Course Format
  Individual Projects 40%
  Exams and/or quizzes 40%
  Class participation 20%
    off-campus students?
5
Projects - Start NOW!
How to start?
  Projects should be sufficiently challenging but reasonable, suitable for one semester
How to choose your individual project
  Real-world problems
  Problems that might make a difference
Two types of projects
  Available projects
  Self-proposed projects (approval is needed)
6
Some project ideas
Dealing with high dimensional data
  Data of supervised, unsupervised learning
Image mining
  Feature extraction, clustering of images
Active sampling
  Various data structures (kd-trees, R-trees, Multi-Dimensional Scaling)
Meta data (RDF, namespace) for mining
Ensemble learning
Sequence mining (HMM learning)
Bioinformatics and applications (feature selection)
Intelligent driving data analysis
  Data integration, data reduction (random projection)
7
How is a project evaluated?
It depends on
  What you want to achieve
  Its impact
  Your effort
The sooner you start, the better
  The beginning is not easy
8
Course Web Site
  http://www.public.asu.edu/~huanliu/cse591.html
My office and office hours
  GWC 342
  T 10:30 - 11:30am and Th 4:00 - 5:00pm
My email: [email protected]
Slides and relevant information will be made available at the course web site
9
Any questions and suggestions?
Your feedback is most welcome!
  I need it to adapt the course to your needs.
  Please feel free to provide yours anytime.
Share your questions and concerns with the class – very likely others may have the same.
No pain no gain – no magic for data mining.
  The more you put in, the more you get
  Your grades are proportional to your efforts.
10
Introduction to Data Mining
Definitions
Motivations of DM
Interdisciplinary Links of DM
11
What is DM?
Or more precisely KDD (knowledge discovery from databases)?
Many definitions
A process, not plug-and-play
  raw data -> transformed data -> preprocessed data -> data mining -> post-processing -> knowledge
One definition is
  A non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data
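The stages of the KDD process listed above can be sketched as a chain of small functions. This is only an illustrative toy, not the course's method: the function names, the comma-separated "basket" data, and the `min_support` threshold are all assumptions introduced for the example, and the mining step is a simple item-frequency count standing in for a real algorithm.

```python
from collections import Counter

def transform(raw_lines):
    """Raw data -> transformed data: parse comma-separated text into item lists."""
    return [[item.strip().lower() for item in line.split(",")] for line in raw_lines]

def preprocess(records):
    """Transformed data -> preprocessed data: drop empty items and empty records."""
    cleaned = [[item for item in record if item] for record in records]
    return [record for record in cleaned if record]

def mine(records):
    """Data mining (toy stand-in): count how many records contain each item."""
    counts = Counter()
    for record in records:
        counts.update(set(record))
    return counts

def postprocess(counts, n_records, min_support=0.5):
    """Post-processing: keep items whose support (fraction of records) meets the threshold."""
    return {
        item: count / n_records
        for item, count in counts.items()
        if count / n_records >= min_support
    }

def kdd(raw_lines, min_support=0.5):
    """Run the whole chain: raw data in, candidate 'knowledge' out."""
    records = preprocess(transform(raw_lines))
    return postprocess(mine(records), len(records), min_support)

# Hypothetical raw data with inconsistent case, a blank field, and a blank line.
raw = ["Milk, Bread", "bread, , eggs", "", "milk, bread, eggs", "Bread"]
patterns = kdd(raw)
print(patterns)
```

Each stage only consumes the output of the previous one, which is the sense in which KDD is "a process, not plug-and-play": most of the code deals with getting data into workable shape rather than with the mining step itself.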
12
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap from stored data to knowledge; and the transition won’t occur automatically.
Manual data analysis is not new but a bottleneck
Fast developing Computer Science and Engineering generates new demands
Seeking knowledge from massive data
  Any personal experience?
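To get a feel for what "double every 9 months" implies, the growth can be worked out directly. The sketch below takes the 9-month doubling period from the slide as given and assumes smooth exponential growth; the function name is made up for the example.

```python
def growth_factor(months, doubling_period_months=9):
    """Factor by which data volume grows after `months`,
    assuming it doubles every `doubling_period_months` (smooth exponential growth)."""
    return 2 ** (months / doubling_period_months)

for years in (1, 3, 5):
    print(f"after {years} year(s): x{growth_factor(12 * years):.1f}")
```

Under this assumption the volume grows by roughly 2.5 times per year, which is why manual analysis becomes a bottleneck so quickly.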
13
When is DM useful
Data rich
  Two invited talks so far have convincingly demonstrated it
Large data (dimensionality and size)
  Image data (size)
  Gene data (dimensionality)
Little knowledge about data (exploratory data analysis)
  What if we have some knowledge?
14
DM perspectives
Prediction, description, explanation, optimization, and exploration
Completion of knowledge (patterns vs. models)
Understandability and representation of knowledge
Some applications
  Business intelligence (CRM)
  Security (Info, Comp Systems, Networks, Data, Privacy)
  Scientific discovery (bioinformatics)
15
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
  Streaming data, multimedia data
Efficient search and data access
Intelligent update and integration
16
Interdisciplinary Links of DM
Statistics
Databases
AI
Machine Learning
Visualization
High Performance Computing
  supercomputers, distributed/parallel/cluster computing
17
Statistics
Discovery of structures or patterns in data sets
  hypothesis testing, parameter estimation
Optimal strategies for collecting data
  efficient search of large databases
Static data
  constantly evolving data
Models play a central role
  algorithms are of a major concern
  patterns are sought
18
Relational Databases
A relational database can contain several tables
  Tables and schemas
The goal in data organization is to maintain data and quickly locate the requested data
  Queries and index structures
Query execution and optimization
  Query optimization is to find the best possible evaluation method for a given query
Providing fast, reliable access to data for data mining
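The role of index structures in query optimization can be seen with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the `customers` table, its columns, and the index name are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's chosen evaluation method for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the plan detail

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [(f"customer{i}", "Tempe" if i % 10 == 0 else "Phoenix") for i in range(1000)],
)

sql = "SELECT name FROM customers WHERE city = ?"

# Without an index, the only available evaluation method is a full table scan.
plan_before = query_plan(conn, sql, ("Tempe",))

# With an index on the filtered column, the optimizer can search the index instead.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
plan_after = query_plan(conn, sql, ("Tempe",))

n_tempe = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city = ?", ("Tempe",)
).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("matching rows:", n_tempe)
```

The query text and its result are the same in both cases; only the evaluation method changes, which is the point of query optimization.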
19
AI
Intelligent agents
  Perception-Action-Goal-Environment
Search
  uniform cost and informed search algorithms
Knowledge representation
  FOL, production rules, frames with semantic networks
Knowledge acquisition
Knowledge maintenance and application
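As an illustration of the uniform cost search mentioned above, here is a short sketch using a priority queue. The graph, node names, and edge costs are made up for the example, and the function assumes non-negative edge costs.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Return (total_cost, path) for a cheapest path from start to goal, or None.

    `graph` maps each node to a dict {neighbor: step_cost}; costs must be non-negative.
    """
    frontier = [(0, start, [start])]  # priority queue ordered by path cost so far
    best_cost = {start: 0}            # cheapest known cost to reach each node
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry: a cheaper route to this node was found later
        for neighbor, step_cost in graph.get(node, {}).items():
            new_cost = cost + step_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical weighted graph.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
result = uniform_cost_search(graph, "A", "D")
print(result)
```

Uniform cost search always expands the cheapest frontier node first and uses no estimate of the remaining distance; informed search algorithms such as A* add a heuristic estimate to that ordering.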
20
Machine Learning
Focusing on complex representations, data-intensive problems, and search-based methods
Flexibility with prior knowledge and collected data
Generalization from data and empirical validation
  statistical soundness and computational efficiency
  constrained by finite computing & data resources
Challenges from KDD
  scaling up, cost info, auto data preprocessing
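"Generalization from data and empirical validation" can be illustrated with a holdout test: fit on part of the data and measure accuracy on examples the learner never saw. The sketch below is an assumption-laden toy, with a one-dimensional 1-nearest-neighbor classifier, a deterministic split, and invented data chosen only to keep the example short.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training example whose value is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def holdout_accuracy(data, test_every=3):
    """Hold out every `test_every`-th example for testing and train on the rest."""
    test = [pair for i, pair in enumerate(data) if i % test_every == 0]
    train = [pair for i, pair in enumerate(data) if i % test_every != 0]
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical (value, label) examples.
data = [
    (0.5, "low"), (1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
    (6.0, "high"), (6.5, "high"), (7.0, "high"), (7.5, "high"),
]
accuracy = holdout_accuracy(data)
print(f"holdout accuracy: {accuracy:.2f}")
```

With so few, well-separated examples the estimate says little; the finite computing and data resources noted above are what limit how reliable such estimates can be in practice.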
21
Visualization
Producing a visual display with insights into the structure of the data with interactive means
  zoom in/out, rotating, displaying detailed info
Various branches of visualization methods
  show summary properties and explore relationships between variables
  investigate large databases and convey lots of information
  analyze data with geographic/spatial location
A pre- and post-processing tool for KDD
22
Bibliography
W. Klosgen & J.M. Zytkow, edited, 2001, Handbook of Data Mining and Knowledge Discovery.