THE DATA MINING AUTOMATION COMPANY TM
European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
15 - 19 September 2008, Antwerp, Belgium
ECML PKDD 2008
Industrializing Data Mining, Challenges and Perspectives
Françoise Fogelman Soulié
ECML PKDD 2008, September 17, 2008, Antwerp, Belgium
KXEN
• KXEN is a data mining software vendor
  - The KXEN Analytic Framework™ is a suite of predictive and descriptive modeling engines that create robust analytic models quickly and easily
  - KXEN's products are based upon Vladimir Vapnik’s Statistical Learning Theory (Structural Risk Minimization)
• Our vision
  - KXEN wants to make predictive analytics part of everyday corporate business decisions
• Our mission
  - To embed advanced analytics into existing enterprise applications and business processes
http://www.kxen.com/
Data Mining industrial applications
Data Mining industrial applications are in
• Telecommunications
• Bank & Finance
• Retail
• … and increasingly in « Web » companies
For
• Marketing
• Risk, fraud
• Security
• On-line retail and services, advertising, key-word optimization …
• In my talk, I will rely upon examples taken from KXEN customers in 3 sectors : Banking & Finance, Retail, Telecommunications
Agenda
• Which world is this ?
• Data Mining in the real world
• Some examples
A little bit of history
What has happened ?
Andrew Moore
Data
The volume of data has exploded
[Figure: in the 90s vs. today]
• Web transactions (Fayyad, KDD 2007)
  - At Yahoo !
    • Around 16 B events / day
    • 425 M visitors / month
    • 10 Tb data / day
• RFID (Jiawei, Adma 2006)
  - A retailer with 3,000 stores, selling 10,000 items a day per store
    • 300 million events per day (after redundancy removal)
• Social network (Kleinberg, KDD’07)
  - 4.4-million-node network of declared friendships on blogging community LiveJournal
  - 240-million-node network of all IM communication over one month on Microsoft Instant Messenger
• Cellular networks
  - A telecom carrier generates hundreds of millions of CDRs / day
  - The network generates technical data : 40 M events / day in a large city
Data
Just how big are big data sets ?
• Depth
  - Up to 100 million lines
  - Or a billion ?
• Width
  - Thousands of attributes
  - Or a million ?
If they’re big today, wait for tomorrow
• Size of databases
  - ×2 to ×3 every 2 years
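As a rough worked example of the growth figure above, the sketch below compounds a ×2 or ×3 factor every two years. The 10-year horizon and the function name are illustrative assumptions, not figures from the talk.

```python
def growth_factor(per_period: float, years: float, period_years: float = 2.0) -> float:
    """Compound growth after `years` when size is multiplied by `per_period` every `period_years`."""
    return per_period ** (years / period_years)

# Hypothetical horizon: 10 years, i.e. five doubling (or tripling) periods.
low = growth_factor(2, 10)
high = growth_factor(3, 10)
print(f"After 10 years: x{low:.0f} to x{high:.0f}")
```

Under these assumptions a database grows by a factor of a few tens to a few hundreds in a decade.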
Data
Many different
• Sources
• Types
  - Structured
  - Unstructured
    • Text
    • Image
    • Video
    • Audio …
• Volumes
  - Web dominates !
• Lots more data out there
  - X 10 ? X 100 ?
Russom, TDWI 2007
Agenda
• Which world is this ?
• Data Mining in the real world
• Some examples
What are the issues in the real-world ?
• Data mining provides ways to define actions
  - A model not used for action is a useless cost
• Data volumes grow exponentially : the number of models must grow too
Herschel, Gartner 2006
Data mining in practice
The data mining process is not very efficient in practice
• It does not make use of all variables
• It spends a large part of modeling time on data manipulation
• And as a result, it still takes a long time to build a model
[Figures: time to build a model; project breakdown; number of variables used]
Eckerson, TDWI, 2007
Challenges for the real-world
1. Challenge n°1 : Integration
Data mining is never THE solution : it is only a (small) part of it
  - In the real world, data mining needs to be integrated into a global system
  - Data mining needs to take inputs from, and deliver results to, the rest of the world
  - Key words : openness, standards
2. Challenge n°2 : Productivity
Data mining must bring value
  - Exploit all available data & produce actionable results
  - At the lowest possible cost
  - Be simple to use by non-experts
  - Key words : Return On Investment
3. Challenge n°3 : Scalability
Data mining must cope with data volumes & number of models
  - Handle LARGE data sets
  - Produce AS MANY models as needed
  - Key words : time to produce a model as a function of data set size (width, depth)
4. Challenge n°4 : Automation
Data mining should do all of the above (almost) automatically
  - Produce models, detect problems, retrain …
  - Key words : automation, control
Challenges for the real-world
Integration, Productivity, Scalability & Automation
• Scalability
  - Data volume is characterized by (width, depth)
  - Modeling has 2 phases : build & apply
    • How does time scale with volume ? Number of models ?
  - Is real-time possible ?
    • At apply
• Productivity
  - About 40% of modeling time is spent in data preparation : can this be cut ?
  - Can all data be used ?
    • Volumes ?
    • Structured / Unstructured ?
• Automation
  - Can modeling be done by
    • A machine ?
    • Non-experts ?
  - Can a model be « turned on » and controlled by a machine ?
Russom, TDWI, 2007
Challenges for the real-world
The ability of Data Mining tools to satisfy the real-world challenges is critical for the wide deployment of data mining applications
What are the issues in the real-world ?
Algorithms & theory are NOT central, said Fayyad at KDD’07
• Yes
  - In the real world, the central issue is $
• But
  - Only strong algorithms & theory can bring the $
  - Iff they are up to the issues in the real world
[Figure (Fayyad): three views of research]
• Researcher view : Algorithms and Theory, Database, Systems
• Practitioner view : Systems and integration, Database, Algorithms, Customer
• Business view : Systems, Database, Algorithms, Customer, $$$’s
Agenda
• Which world is this ?
• Data Mining in the real world
• Some examples
A large financial institution
The Dilemma
• Efficiency goals
• Limits on resources
• New data sources
• Time-to-market opportunities
• Skill-specific dependencies
• Need for competitive advantage
• Unique opportunity to mitigate macroeconomic risks including mortgage and housing troubles
Example n° 1
Extreme Granularity Data
• Competitive advantage through enhanced analytics and unique data sources
• Numerous sources of granular data (transaction data, payment data, call data, etc.)
• Granularity and detail create value if you can aggregate intelligently and extract knowledge
• Number of attributes grows exponentially as you consider time series, interactions, and transformations
Example n° 1
The Challenge to KXEN
• Aggregation yields tens of thousands of variables about customers
• How do we select the 10 best variables for predicting credit risk, for example “Will the customer be delinquent in payment 12 months from now?”
• How do we develop and deploy models quickly and efficiently?
• How can we enable business analysts, not statisticians, to build predictive models?
• How do I regain flexibility as the business owner while ensuring production quality?
Example n° 1
Possible usage for enhancing variable selection
• Approach 1: use the same variables we used last year (common in resource-constrained environments)
• Approach 2: based on experience and expertise, select the 500 variables that are most likely to be useful. Then use statistics to pick the best subset of 10 to 20 (common in sophisticated analytic shops with heavy analyst presence)
• Approach 3: use all the variables and let the data tell you which are useful (rare when there are more than 1000 attributes)
• The KXEN proof of concept (POC) focused on enabling approach 3 as the new standard modeling process
Example n° 1
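Approach 3 can be illustrated with a minimal sketch that scores every variable against the target and keeps the best ones. Ranking by absolute Pearson correlation is a simple stand-in chosen here for illustration; it is not KXEN's selection method, and the synthetic data, sizes and names (`rank_variables`, `informative`) are assumptions.

```python
import numpy as np

def rank_variables(X, y, top_k=10):
    """Rank the columns of X by absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant columns get a score of 0
    scores = np.abs(Xc.T @ yc) / denom
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

# Synthetic stand-in for a wide analytic data set: 1000 variables, 3 of them useful.
rng = np.random.default_rng(0)
n_rows, n_vars = 2000, 1000
X = rng.normal(size=(n_rows, n_vars))
informative = [3, 42, 777]
signal = X[:, informative] @ np.array([1.5, -1.0, 0.8])
y = (signal + rng.normal(size=n_rows) > 0).astype(float)  # e.g. "delinquent in 12 months"

top, top_scores = rank_variables(X, y, top_k=10)
print("Top variables:", top.tolist())
```

A univariate ranking like this ignores interactions and redundancy between variables, so it only conveys the idea of letting the data pick the variables.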
The KXEN Fit – Large telco operator
Rapid development and deployment utilizing large volumes of data
• Data enhancements
  - The SNA (social network analysis) dataset added ~1000 extra variables to our existing analytical datasets (external data plus additional info aggregated from SNA data)
  - Analytical datasets average over 2000 attributes
• Short timelines
  - Model development and comparative analysis through to pilot execution were at times done over 1-2 week periods
  - KXEN allowed for quick turnaround on model development
  - Built standard models and standard + SNA models for piloting
Example n° 2
A large American data provider
Target Source Consumer Database
• Largest response survey database in North America
• Precision data on 2 million Canadian households and 14 million US households
• 1000 data variables for targeting :
  - Behavioral, lifestyle, demographics and more
  - Drives solutions for new customer acquisition, modeling, retention / growth and consumer insights
Example n° 3
A large American data provider
The Challenge
• Build 252 models
• Score 167 models on 4 million records
• Score 85 models on 2 million records
• Do it all in a 5-day period…
• And do it with just one analyst…
Example n° 3
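The workload in this challenge can be put into numbers with a few lines of arithmetic. The sketch only restates the figures from the slide; the assumption that the five days are spread evenly is mine.

```python
models_built = 252
scoring_jobs = [(167, 4_000_000), (85, 2_000_000)]  # (number of models, records scored per model)
days = 5

models_scored = sum(n_models for n_models, _ in scoring_jobs)
scores_computed = sum(n_models * n_records for n_models, n_records in scoring_jobs)
models_per_day = models_built / days  # assumes an even spread over the 5 days

print(f"{models_scored} models scored, {scores_computed:,} individual scores")
print(f"{models_per_day:.1f} models to build per day for one analyst")
```

The two scoring jobs cover all 252 models, which amounts to 838 million individual scores and about 50 models to build per day.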
Why use more data ?
Some companies use lots of data
The goal
• Increase the performance of their models
The challenge
• Produce models with thousands of variables
• Exploiting all the available variables
  - 3 000, 5 000, 10 000 ?
• Creating new variables, also
  - Aggregates
  - Behavioral variables
  - Textual variables
  - « Social network » variables …
The results
  - More lift, more returns, more €, $
Number of variables
Sears            900
Large Bank       1 200
Vodafone D2      2 500
Barclays         2 500
Rogers Wireless  5 800
HSBC             8 000
Credit card      16 000
Exploit all variables
The analysis process
• Build ADS (Analytic Data Set)
  - Extract data
  - Transform, aggregate, …
  - Create ADS
• Build model
  - Produce initial model
  - Refine, select variables
  - Produce final model
• Apply model
  - Extract data
  - Transform, aggregate, …
  - Create ADS
  - Apply model
  - Export results to database
[Diagram (adapted; source credit not recoverable): Analytic Data Set including new variables → Initial Model → Validated Model → Scoring Code (SQL, PMML)]
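The "Scoring Code (SQL)" step can be sketched as follows: a fitted linear model is turned into a SQL expression so that the apply phase runs inside the database. This is a minimal illustration using Python's built-in `sqlite3`. The helper `linear_model_to_sql`, the table `ads` and the coefficients are hypothetical, and this is not KXEN's code generator.

```python
import sqlite3

def linear_model_to_sql(intercept, coefficients, table, id_column):
    """Render a linear score as a SQL SELECT (trusted column names only, no escaping)."""
    terms = " + ".join(f"({coef!r} * {col})" for col, coef in coefficients.items())
    return f"SELECT {id_column}, {intercept!r} + {terms} AS score FROM {table}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (individual_id INTEGER, age REAL, n_transactions REAL)")
conn.executemany("INSERT INTO ads VALUES (?, ?, ?)", [(1, 30, 2), (2, 50, 0)])

# Hypothetical model: score = 1.0 + 0.5 * age + 2.0 * n_transactions
sql = linear_model_to_sql(1.0, {"age": 0.5, "n_transactions": 2.0}, "ads", "individual_id")
scores = dict(conn.execute(sql).fetchall())
print(sql)
print(scores)
```

Exporting the model as SQL means no data has to leave the database at apply time; PMML plays a similar role as a portable model description.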
Exploit all variables
• What if the ADS could contain ALL variables ?
  - Example : in Teradata, the ADS is a view. Data are not replicated or moved
[Diagram (adapted; source credit not recoverable): Customer Data Warehouse schema feeding one wide ADS view with ID fields, behavior fields, demographic fields, model scores and contact history]
• Model Description : Model ID, Model Name, Model Date
• Model Scores : Model ID (FK), Individual ID (FK), Score
• Individuals in Household : Individual ID, Household ID (FK), Individual attribute fields, Primary Householder Flag
• Campaign Type Ref : Campaign ID, Description, Start Date
• Contact History Detail : Campaign ID (FK), Individual ID (FK), Promotion Code, Treatment Code, Contact Date
• Contact History Summary : Individual ID (FK), Number of Mail Contacts, Number of eMail Contacts, Number of Telephone Contacts
• Household : Household ID, Street Address, City, State, Zip, Phone
• Demographics / Lifestage : Individual ID (FK), Demographic Data Fields
• Accounts : Account No, Account Information
• Transactions : Transaction ID, Individual ID (FK), Account No (FK), Transaction Details, Transaction Timestamp
• Products : Product ID, Product Details
• Purchases : Transaction ID, Individual ID (FK), Product ID (FK), Purchase Details
Sample rows of the ADS view
HH-ID       CUST-ID     NAME     VALUE_SEG  BEHAV_SEG  LIFESTY_SEG  LIFESTG_SEG  EQUITY_12  EQUITY_24  LTV     …  AGE  INCOME_CD  EDUCATION  …
2347387474  4797978698  Gustavo  2          5          8            3            37.22      28.18      49.8    …  28   7          14         …
7879973979  2439970274  Susan    3          3          6            5            18.88      28.97      154.32  …  42   9          18         …
9870908     879979      Andre    1          1          18           4            -1.38      -12.8      -48.76  …  61   5          12         …
…           …           …        …          …          …            …            …          …          …       …  …    …          …          …
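The idea of an ADS defined as a view can be sketched with Python's built-in `sqlite3` standing in for Teradata. The two tables and their columns are simplified assumptions loosely based on the schema above, not the actual warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE individuals (individual_id INTEGER PRIMARY KEY, household_id INTEGER, age INTEGER);
CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, individual_id INTEGER, amount REAL);
INSERT INTO individuals VALUES (1, 10, 28), (2, 11, 42);
INSERT INTO transactions VALUES (100, 1, 20.0), (101, 1, 30.0), (102, 2, 5.0);

-- The ADS is only a view: one row per individual, aggregates computed on read.
CREATE VIEW ads AS
SELECT i.individual_id,
       i.age,
       COUNT(t.transaction_id)      AS n_transactions,
       COALESCE(SUM(t.amount), 0.0) AS total_amount
FROM individuals i
LEFT JOIN transactions t ON t.individual_id = i.individual_id
GROUP BY i.individual_id, i.age;
""")

rows = conn.execute("SELECT * FROM ads ORDER BY individual_id").fetchall()
print(rows)

# New source data shows up in the ADS without any copy or rebuild step.
conn.execute("INSERT INTO transactions VALUES (103, 2, 15.0)")
rows_after = conn.execute("SELECT * FROM ads ORDER BY individual_id").fetchall()
print(rows_after)
```

Because the view holds no data of its own, adding a variable to the ADS amounts to adding a column expression to the view definition.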
Exploit all variables
The analysis process : « In-database Mining »
• Build ADS (Analytic Data Set)
  - Extract data
  - Transform, aggregate, …
  - Create ADS
• Build model
  - Produce initial model
  - Refine, select variables
  - Produce final model
• Apply model
  - Extract data
  - Transform, aggregate, …
  - Create ADS
  - Apply model
  - Export results to database
[Diagram: Analytic Data Set including new variables → Initial Model → Validated Model → Scoring Code (SQL, PMML); durations shown : < 1 week, 3 days, < 1 day, < 1 day]
Using behavioral data
• From transactional data produce behavioral data
  - Transition from transaction A to transaction B
• Number of variables grows exponentially !
• But lift grows !!
[Figure: price of Customer 2's transactions (types A, B, C) over time, January to June]
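Deriving behavioral variables from transactions can be sketched as counting, per customer, how often a transaction of type A is followed by one of type B. The record layout, the dates and the column naming (`n_A_to_B`) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def transition_counts(transactions):
    """transactions: iterable of (customer_id, timestamp, type). Returns {customer: Counter of (a, b)}."""
    sequences = {}
    for customer, _, kind in sorted(transactions, key=lambda t: (t[0], t[1])):
        sequences.setdefault(customer, []).append(kind)
    return {c: Counter(zip(seq, seq[1:])) for c, seq in sequences.items()}

def to_ads_row(counts, kinds):
    """One column per ordered pair of transaction types: k types give k * k variables."""
    return {f"n_{a}_to_{b}": counts.get((a, b), 0) for a, b in product(kinds, repeat=2)}

# Hypothetical history for one customer, one transaction per month.
transactions = [
    (2, "2008-01", "A"), (2, "2008-02", "A"), (2, "2008-03", "B"),
    (2, "2008-04", "B"), (2, "2008-05", "C"), (2, "2008-06", "C"),
]
counts = transition_counts(transactions)
row = to_ads_row(counts[2], kinds="ABC")
print(row)
```

With k transaction types this yields k² variables, and k to the power n if sequences of n consecutive transactions are used instead of pairs, which is where the rapid growth in the number of variables comes from.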
Using textual variables
• In many applications (surveys, emails, …) there is a textual field
• These fields can be exploited to enhance results of data mining analysis
[Diagram: Textual field → stop lists, stemming, count → bag of words → enrich the Analytical Data Set]
Example of a textual field : “Oil prices rebound to above $67 after fears Hurricane Rita, heading for the Gulf of Mexico, may cause more damage to US oil refineries. One of the refinery…”
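The text-processing chain named on this slide (stop list, stemming, counting) can be sketched as below. The stop list is a tiny illustrative one and `crude_stem` is a deliberately rough suffix stripper standing in for a real stemmer. Neither is taken from KXEN's implementation.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real stop lists are much longer).
STOP_WORDS = {"to", "above", "after", "for", "the", "of", "may", "more"}

def crude_stem(word: str) -> str:
    """Very rough suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize on letters, drop stop words, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

text = ("Oil prices rebound to above $67 after fears Hurricane Rita, heading for the "
        "Gulf of Mexico, may cause more damage to US oil refineries.")
bag = bag_of_words(text)
print(bag.most_common(5))
```

Each retained stem then becomes one additional count column in the Analytical Data Set, alongside the structured variables.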
Using textual variables – DataMining Cup’06
With 1000 added textual variables
• Computing time : from 6 seconds (without text) to 43 seconds (with text)
• Most significant variables : textual
• Lift : goes up
[Figure: results without text vs. with text]
Using « social network » variables
An example in telco
• Build the social network
• Extract « social network » variables
  - Roughly 10 to 100 additional variables
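A minimal sketch of these two steps: build an undirected graph from call records, then derive a few per-customer variables. The record format and the three variables (degree, number and share of churned neighbors) are illustrative assumptions; the talk does not list which variables were actually extracted.

```python
from collections import defaultdict

def social_network_variables(calls, churners):
    """calls: iterable of (caller, callee) pairs; churners: set of customers known to have churned."""
    neighbors = defaultdict(set)
    for a, b in calls:
        if a != b:  # ignore self-calls
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {
        customer: {
            "degree": len(ns),
            "churned_neighbors": len(ns & churners),
            "churned_share": len(ns & churners) / len(ns),
        }
        for customer, ns in neighbors.items()
    }

# Hypothetical call detail records and churn list.
calls = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("a", "b")]
variables = social_network_variables(calls, churners={"c"})
print(variables["a"])
```

Each entry of the returned dictionary is one extra column that can be joined to the customer's row in the analytic data set.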
Using « social network » variables
• « social network » variables increase lift
  - Globally : 40%
  - First decile : 67%
  - Second decile : 47%
[Figure: lift increase by decile (first decile, second decile, the rest)]
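For readers unfamiliar with the measure, lift per decile can be computed as sketched below: customers are sorted by score, split into ten equal groups, and each group's response rate is divided by the overall rate. The data are synthetic and the function name is an assumption.

```python
import numpy as np

def lift_by_decile(scores, outcomes, n_bins=10):
    """Response rate in each score-ordered bin divided by the overall response rate."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    sorted_outcomes = np.asarray(outcomes, dtype=float)[order]
    base_rate = sorted_outcomes.mean()
    return [float(b.mean() / base_rate) for b in np.array_split(sorted_outcomes, n_bins)]

# Synthetic example: 100 customers already listed from highest to lowest score,
# 10 responders in total, 5 of them in the first decile and 3 in the second.
scores = np.arange(100, 0, -1)
outcomes = [1] * 5 + [0] * 5 + [1] * 3 + [0] * 7 + [1] * 2 + [0] * 78
lifts = lift_by_decile(scores, outcomes)
print([round(v, 2) for v in lifts[:3]])
```

A statement such as "first decile : 67%" then compares the first-decile lift of the model with « social network » variables against that of the model without them.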
Why use more models ?
Some companies produce lots of models
The goal
• Increase the performance of their models
The challenge
• Produce thousands of models
• For every campaign
• Refreshed often
  - Data distribution changes fast (Web)
• As « fine grain » as possible
  - Performance on a homogeneous population is better
The results
  - More lift, more returns, more €, $
  - Ex : Response rate + 260%
Number of models / year
Vodafone D2      760
Market research  9 600
Cox Comm.        28 800
Real estate      70 000
Lower My Bills   460 000
[Figure: products (Cable TV, Digital Cable, High-speed Internet, Digital Telephone, Pay-Per-View) by period (January, April, July, October)]
TDWI 03-2005
On-going research projects – Call center supervision
• Key Performance Indicators
  - Number of emails received, on-hold, processed
    • Per time period, category, agent …
  - To make sure SLAs are met
• Issues
  - 100s of KPIs
  - Aggregated in hierarchies
• Predictive models
  - Alerts
  - Optimize resources (capacity, people)
    • Long-term time series forecasting on KPIs
  - Detect deviations (seasonal effects)
    • Short-term time series forecasting on KPIs
• “Predictive cubes” for hundreds of indicators
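Deviation detection on a seasonal KPI can be sketched with a very simple baseline: forecast the next value as the mean of past values at the same position in the season, and raise an alert when the actual value is too far from it. The weekly season, the threshold of 3 standard deviations and the sample figures are assumptions. The talk does not describe the forecasting models used in the project.

```python
from statistics import mean, stdev

def seasonal_forecast(history, season=7):
    """Mean and spread of the past values sharing the next point's position in the season."""
    slot = len(history) % season
    same_slot = history[slot::season]
    return mean(same_slot), stdev(same_slot)

def is_deviation(actual, history, season=7, threshold=3.0):
    """True when `actual` is more than `threshold` standard deviations from the seasonal forecast."""
    expected, spread = seasonal_forecast(history, season)
    return abs(actual - expected) > threshold * spread

# Hypothetical KPI: emails received per day over 4 weeks, with a weekly pattern.
pattern = [100, 120, 130, 125, 110, 40, 30]
history = [value + shift for shift in (0, 2, -2, 1) for value in pattern]

expected, spread = seasonal_forecast(history)
print(expected, round(spread, 2))
print(is_deviation(101, history), is_deviation(160, history))
```

With hundreds of KPIs, one such baseline per indicator (or per cell of a cube) is what makes automated alerting necessary.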
On-going research projects – Recommendation
Models for recommendation
• The catalog can have 1 Million products
• Hundreds / thousands of models with thousands of attributes
www.hmv.co.jp
Conclusion
Data mining can bring more $$
But it needs to satisfy constraints
• Integration
• Productivity
• Scalability
• Automation
Manipulating MORE data and producing MORE models requires
• Automated data manipulation & coding
• Simple & robust algorithms
• Software modules open & in line with market standards
Productivity gains
Rogers Wireless  7x
Vodafone D2      10x
Sears            8x
Belgacom         12x
How does KXEN do it
• KXEN was designed for industrial data mining applications
• KXEN is based upon Vapnik’s SRM (Structural Risk Minimization)
  - Strategy to control the accuracy / robustness trade-off
• KXEN produces
  - An automatic encoding
    • Non-linear
  - Then regression / classification
    • Polynomial
• Which allows
  - Integration
  - Productivity
  - Scalability
  - Automation
[Figure: training error $R_{emp}$ and generalization error $R_{Gen}$ as functions of the VC dimension $h$, with the best trade-off at $h^*$; nested model families of increasing VC dimension, $h_1 \le h_2 \le \dots \le h_k$. Labels : Ridge regression, Polynomials, KI (Gini index)]
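The accuracy / robustness trade-off can be illustrated with a small numerical sketch: polynomial ridge regressions of increasing degree form a nested family, the training error tends to fall as the degree grows, and the degree is chosen where the error on held-out data is lowest. Selecting on a validation set is a practical stand-in used here for illustration. It is neither SRM's bound-based selection nor KXEN's algorithm, and the data, ridge value and function names are assumptions.

```python
import numpy as np

def fit_ridge_poly(x, y, degree, ridge=1e-3):
    """Ridge regression on polynomial features 1, x, ..., x**degree."""
    X = np.vander(x, degree + 1, increasing=True)
    A = X.T @ X + ridge * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data: a nonlinear target with noise, split into training and validation sets.
rng = np.random.default_rng(0)
x_train, x_val = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=200)
y_val = np.sin(3 * x_val) + 0.1 * rng.normal(size=200)

# Nested family: degree 1 is contained in degree 2, which is contained in degree 3, ...
results = {}
for degree in range(1, 10):
    w = fit_ridge_poly(x_train, y_train, degree)
    results[degree] = (mse(w, x_train, y_train), mse(w, x_val, y_val))

best_degree = min(results, key=lambda d: results[d][1])
for degree, (train_err, val_err) in results.items():
    print(degree, round(train_err, 4), round(val_err, 4))
print("Selected degree:", best_degree)
```

The selected degree plays the role of $h^*$ in the figure: complex enough to fit the signal, simple enough not to fit the noise.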
How does KXEN do it
• In practice, for one final model, KXEN builds many models (SRM)
  - Depending upon variable complexity, encoding requires 10 - 30 models
  - Then about 100 models (for regression)
• KXEN uses « data stream » techniques
• There is no data duplication
  - A few sweeps are necessary
• Time to build a model
  - About linear in depth & width
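The idea of building a model in a few sweeps over the data, without duplicating it, can be sketched with a ridge regression whose sufficient statistics are accumulated chunk by chunk. This is a generic streaming technique shown for illustration, not KXEN's implementation; the chunk generator and the true weights are synthetic assumptions.

```python
import numpy as np

TRUE_W = np.array([2.0, -1.0, 0.5])  # hypothetical weights used to generate the data

def make_chunks(n_chunks=10, chunk_rows=1000, seed=0):
    """Stand-in for reading a large table chunk by chunk."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_rows, len(TRUE_W)))
        y = X @ TRUE_W + 0.1 * rng.normal(size=chunk_rows)
        yield X, y

def stream_ridge(chunks, n_features, ridge=1e-3):
    """One sweep over the data: accumulate X'X and X'y, then solve once at the end."""
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    n_rows = 0
    for X, y in chunks:
        xtx += X.T @ X
        xty += X.T @ y
        n_rows += len(y)
    w = np.linalg.solve(xtx + ridge * np.eye(n_features), xty)
    return w, n_rows

w, n_rows = stream_ridge(make_chunks(), n_features=len(TRUE_W))
print(n_rows, np.round(w, 2))
```

Memory use is independent of the number of rows and the time is linear in depth. In this particular sketch the cost per row is quadratic in width, so it does not reproduce the "about linear in width" behavior stated on the slide.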
Questions ?
• Location
  - Marriott Paris Rive Gauche Hotel & Conference Center, 17 Boulevard St Jacques - 75014 Paris, France
• Dates
  - June 28 - July 1, 2009
• Key Submission Dates
  - Due January 19, 2009 : Workshop Proposals
  - Due February 2, 2009 : Paper Abstracts
  - Due February 6, 2009 : Research/Industrial Track Papers
  - Due February 23, 2009 : Tutorials/Panel Proposals
http://www.kdd.org/kdd2009/index.html
For the first time, in 2009, KDD will leave North America for Europe
KDD’09 will be held in PARIS, France
Mark the dates
Conference
• June 28th – July 1st, 2009
Key Submission Dates
• January 19, 2009 Workshop Proposals
• February 2, 2009 Paper Abstracts
• February 6, 2009 Research/Industrial Track Papers
• February 23, 2009 Tutorials/Panel Proposals
• April 10, 2009 Notification
• April 27, 2009 Camera Ready
Visit the Conference site
http://www.kdd.org/kdd2009/
Contact us
General Chairs
• John Elder (Elder Research, Inc.)
• Francoise Fogelman Soulié (KXEN, Inc.)
Program Co-Chairs
• Peter Flach (University of Bristol)
• Mohammad Zaki (Rensselaer Polytechnic Institute)
I am personally counting on you to send lots of good papers and to show a very strong European attendance at KDD’09
See you there