Examples of data mining Marketing & Advertisement (Case of
Bank of America) In the past, each caller would have listened to
the same marketing advertisement, whether or not it was relevant to
the callers interests. we want to be as relevant as possible to
each customer. Bank of Americas customer service representatives
have access to individual customer profiles, so that the customer
can be informed of new products or services that may be of greatest
interest to him or her. Data mining helps to identify the type of
marketing approach for a particular customer, based on the
customers individual profile.
Slide 3
Examples of data mining Medical (Case of Childrens Memorial
Hospital) Brain tumors represent the most deadly cancer among
children, with nearly 3000 cases diagnosed per year in the United
States, nearly half of which are fatal. Director of brain tumor
research at Childrens Memorial Hospital in Chicago, has set the
goal of building a gene expression database for pediatric brain
tumors, in an effort to develop more effective treatment. As one of
the first steps in tumor identification, the director uses the
Clementine data mining software suite published by SPSS, Inc., to
classify the tumor into one of 12 or so salient types.
Slide 4
WHAT IS DATA MINING?
Slide 5
Data mining is the process of discovering: Meaningful new
correlations Patterns Trends by sifting through large amounts of
data stored in repositories, using pattern recognition technologies
as well as statistical and mathematical techniques.
Slide 6
WHAT IS DATA MINING? Data mining is the analysis of (often
large) observational data sets to find unsuspected relationships
and to summarize the data in novel ways that are both
understandable and useful to the data owner
Slide 7
WHAT IS DATA MINING? Data mining is interdisciplinary field
bringing together techniques from machine learning, pattern
recognition, statistics, databases and visualization to address the
issue of information extraction from large data bases
Slide 8
Introduction Data mining skills are in high demand as
organizations increasingly put data repositories online.
Effectively analyzing information from customers, partners, and
suppliers has become important to more companies. Many companies
have implemented a data warehouse strategy and are now starting to
look at what they can do with all that data
Slide 9
WHY DATA MINING? The ongoing remarkable growth in the field of
data mining and knowledge discovery has been fueled by a variety of
factors: The storing of the data in data warehouses, so that the
entire enterprise has access to a reliable current database The
availability of increased access to data from Web navigation and
intranets The competitive pressure to increase market share in a
globalized economy The development of off-the-shelf commercial data
mining software suites The tremendous growth in computing power and
storage capacity
Slide 10
FALLACIES OF DATA MINING Fallacy 1. There are data mining tools
that we can turn loose on our data repositories and use to find
answers to our problems. Reality. There are no automatic data
mining tools that will solve your problems mechanically while you
wait. Data mining is a process. Fallacy 2. The data mining process
is autonomous, requiring little or no human oversight. Reality. As
we saw above, the data mining process requires significant human
interactivity at each stage. Even after the model is deployed, the
introduction of new data often requires an updating of the model.
Continuous quality monitoring and other evaluative measures must be
assessed by human analysts.
Slide 11
FALLACIES OF DATA MINING Fallacy 3. Data mining pays for itself
quite quickly. Reality. The return rates vary, depending on the
startup costs, analysis personnel costs, data warehousing
preparation costs, and so on. Fallacy 4. Data mining software
packages are intuitive and easy to use. Reality. Again, ease of use
varies. However, data analysts must combine subject matter
knowledge with an analytical mind and a familiarity with the
overall business or research model.
Slide 12
FALLACIES OF DATA MINING Fallacy 5. Data mining will identify
the causes of our business or research problems. Reality. The
knowledge discovery process will help you to uncover patterns of
behavior. Again, it is up to humans to identify the causes. Fallacy
6. Data mining will clean up a messy database automatically.
Reality. Well, not automatically. As a preliminary phase in the
data mining process, data preparation often deals with data that
has not been examined or used in years. Therefore, organizations
beginning a newdata mining operation will often be confronted with
the problem of data that has been lying around for years, is stale,
and needs considerable updating.
Slide 13
Knowledge Discovery Process
Slide 14
The Knowledge Discovery in Databases process comprises of a few
steps leading from raw data collections to some form of new
knowledge. The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, it is a phase in which
noise data and irrelevant data are removed from the collection.
Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source. Data selection:
at this step, the data relevant to the analysis is decided on and
retrieved from the data collection. Data transformation: also known
as data consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining procedure.
Slide 15
Knowledge Discovery Process Data mining: it is the crucial step
in which clever techniques are applied to extract patterns
potentially useful. Pattern evaluation: in this step, strictly
interesting patterns representing knowledge are identified based on
given measures. Knowledge representation: is the final phase in
which the discovered knowledge is visually represented to the user.
This essential step uses visualization techniques to help users
understand and interpret the data mining results.
Slide 16
WHAT TASKS CAN DATA MINING ACCOMPLISH? The following list shows
the most common data mining tasks. Estimation Prediction
Classification Clustering Association
Slide 17
Classification There is a target categorical variable Example:
income bracket: partitioned into 3 classes or categories: high
income, middle income, and low income
Slide 18
Classification Classify the income brackets of persons not
currently in the database, based on other characteristics
associated with that person, such as age, gender, and occupation
Classification Task
Slide 19
Classification - Steps First : examine the data set containing
both the predictor variables and the (already classified) target
variable i.e income bracket. The algorithm (software) learns about
which combinations of variables are associated with which income
brackets. For example, older females may be associated with the
high- income bracket. This data set is called the training set.
Second: The algorithm would look at new records, for which no
information about income bracket is available. Based on the
classifications in the training set, the algorithm would assign
classifications to the new records. For example, a 63-year-old
female professor might be classified in the high-income
bracket.
Slide 20
Classification For example, in the medical field, suppose that
we are interested in classifying the type of drug a patient should
be prescribed, based on certain patient characteristics, such as
the age of the patient and the patients sodium/potassium ratio.
Figure 1.4 is a scatter plot of patients sodium/potassium ratio
against patients ages for a sample of 200 patients. The particular
drug prescribed is symbolized by the shade of the points. Light
gray points indicate drug Y; medium gray points indicate drug A or
X; dark gray points indicate drug B or C.
Slide 21
Which drug should be prescribed for a young patient with a high
sodium/potassium ratio? Which drug should be prescribed for older
patients with low sodium/potassium ratios?
Slide 22
Other Classification examples Determining whether a particular
credit card transaction is fraudulent Assessing whether a mortgage
application is a good or bad credit risk Diagnosing whether a
particular disease is present Determining whether a will was
written by the actual deceased, or fraudulently by someone else
Identifying whether or not certain financial or personal behavior
indicates a possible terrorist threat
Slide 23
Estimation Estimation is similar to classification except that
the target variable is numerical rather than categorical. Example:
estimating the systolic blood pressure reading of a hospital
patient, based on the patients age, gender, body-mass index, and
blood sodium levels The relationship between systolic blood
pressure and the predictor variables in the training set would
provide us with an estimation model. We can then apply that model
to new cases.
Slide 24
Examples of Estimation Estimating the amount of money a
randomly chosen family of four will spend for back-to-school
shopping this new school season. Estimating the grade-point average
(GPA) of a masters student, based on that students undergraduate
GPA. Estimating the number of goals per game that Ronaldo will
score during next UEFA, based on his last goals.
Slide 25
Prediction Prediction is similar to classification and
estimation, except that for prediction, the results lie in the
future regardless of numerical or categorical Any of the methods
and techniques used for classification and estimation may also be
used, for prediction. Traditional statistical methods - point
estimation and confidence interval estimations, simple linear
regression and correlation, multiple regression Data mining and
knowledge discovery methods such as neural network
Slide 26
Examples of Prediction Predicting the price of a stock three
months into the future Predicting the percentage increase in
traffic deaths next year if the speed limit is increased Predicting
the winner of this falls baseball World Series, based on a
comparison of team statistics Predicting whether a particular
molecule in drug discovery will lead to a profitable new drug for a
pharmaceutical company
Slide 27
Examples of Prediction Predicting the price of a stock three
months into the future
Slide 28
Clustering Clustering refers to the grouping of records,
observations, or cases into classes of similar objects. A cluster
is a collection of records that are similar to one another, and
dissimilar to records in other clusters. Clustering differs from
classification in that there is no target variable for clustering.
The clustering task does not try to classify, estimate, or predict
the value of a target variable. Instead, clustering algorithms seek
to segment the entire data set into relatively homogeneous
subgroups or clusters.
Slide 29
Slide 30
demographic profile of each of the geographic areas in the
country, as defined by zip code. One of the clustering mechanisms
they use is the PRIZM segmentation system, which describes every
U.S. zip code area in terms of distinct lifestyle types
Slide 31
Clustering Example clusters for zip code 90210, Beverly Hills,
California. Beverly Hills, CA 90210's most common PRIZM NE Segments
are: NumberName 16Bohemian Mix 07Money & Brains 03Movers &
Shakers 01Upper Crust 04Young Digerati
Slide 32
Examples of Clustering Group of residents according to zip
codes. For accounting auditing purposes, to segment financial
behavior into benign and suspicious categories As a
dimension-reduction tool when the data set has hundreds of
attributes For gene expression clustering, where very large
quantities of genes may exhibit similar behavior
Slide 33
Association The association task for data mining is the job of
finding which attributes go together. Most prevalent in the
business world, where it is known as affinity analysis or market
basket analysis The task of association seeks to uncover rules for
quantifying the relationship between two or more attributes.
Association rules are of the form If antecedent, then consequent,
together with a measure of the support and confidence associated
with the rule.
Slide 34
Association Another example, a particular supermarket may find
that of the 1000 customers shopping on a Thursday night, 200 bought
diapers, and of those 200 who bought diapers, 50 bought Formula
Milk. Thus, the association rule would be If buy diapers, then buy
Formula Milk with a support of 200/1000 = 20% and a confidence of
50/200 = 25%.
Slide 35
Examples of Association Investigating the proportion of
subscribers to a companys cell phone plan that respond positively
to an offer of a service upgrade Examining the proportion of
children whose parents read to them who are themselves good readers
Predicting degradation in telecommunications networks Finding out
which items in a supermarket are purchased together and which items
are never purchased together Determining the proportion of cases in
which a new drug will exhibit dangerous side effects
Slide 36
Bohemian Mix Midscale, Middle Age Mix A collection of mobile
urbanites, Bohemian Mix represents the nation's most liberal
lifestyles. Its residents are an ethnically diverse, progressive
mix of young singles, couples, and families ranging from students
to professionals. In their funky row houses and apartments,
Bohemian Mixers are the early adopters who are quick to check out
the latest movie, nightclub, laptop, and microbrew.
Slide 37
Money & Brains Upscale, Older Mix The residents of Money
& Brains seem to have it all: high incomes, advanced degrees,
and sophisticated tastes to match their credentials. Many of these
city dwellers are married couples with few children who live in
fashionable homes on small, manicured lots.
Slide 38
Movers & Shakers Wealthy, Middle Age w/o Kids Movers &
Shakers is home to America's up-and-coming business class: a
wealthy suburban world of dual-income couples who are highly
educated, typically between the ages of 35 and 54. Given its high
percentage of executives and white-collar professionals, there's a
decided business bent to this segment: members of Movers &
Shakers rank number one for owning a small business and having a
home office.
Slide 39
Upper Crust Wealthy Older w/o Kids The nation's most exclusive
address, Upper Crust is the wealthiest lifestyle in America--a
haven for empty- nesting couples between the ages of 45 and 64. No
segment has a higher concentration of residents earning over
$100,000 a year or possessing a postgraduate degree. And none has a
more opulent standard of living.
Slide 40
Young Digerati Upscale, Younger Mix Young Digerati are
tech-savvy and live in fashionable neighborhoods on the urban
fringe. Affluent, highly educated, and ethnically mixed, Young
Digerati communities are typically filled with trendy apartments
and condos, fitness clubs and clothing boutiques, casual
restaurants and all types of bars--from juice to coffee to
microbrew.