Revie Business Intelligence

Transcript of Revie Business Intelligence

strategic information systems

business intelligence

Crawford Revie
Contents

Preview
1 Introducing business intelligence
2 General concepts
3 Exploratory data analysis and visualisation
4 Techniques and algorithms from data mining and KDD
5 Techniques from ‘intelligent’ systems and artificial intelligence
Appendices


Preview

What is this module all about?

While I was putting together this module the following two headline articles appeared on ABC News and Science Daily:

Dishonorable conduct: Marine in brig for spending spree; government roots out abuse

A Marine Corps lance corporal is in the military brig for illegally using a Pentagon credit card to buy herself a car, motorcycle, furniture and a breast-lift, according to the Marine Forces Reserve…

Data mining helped catch Pierre and other abusive card holders, and will hopefully nab any others in the future, officials say. (ABC News, 14/8/03. Full story at: http://abcnews.go.com/sections/us/Business/pentagonabuse030814.html)

Scientists demonstrate new method for discovering cancer gene function

Using a new approach for dissecting the complicated interactions among many genes, scientists at Dana-Farber Cancer Institute have discovered how a common cancer gene works in tandem with another gene to spur the unchecked growth of cells… Using a statistical tool called the Kolmogorov-Smirnov metric, the scientists showed that the 21 genes identified by the gene chips correlated closely…; it was this ‘data-mining’ process that turned up the C/EBP-beta gene. (Science Daily, 13/8/03. Full story at: http://www.sciencedaily.com/releases/2003/08/030808080345.htm)
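The Kolmogorov-Smirnov statistic mentioned in the quote is, at heart, a simple idea: the largest vertical gap between the cumulative distributions of two samples. The sketch below shows a minimal two-sample version; the `ks_statistic` helper and the sample values are my own invention for illustration (a real analysis would reach for a library routine such as `scipy.stats.ks_2samp` rather than hand-rolling this).

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum vertical distance
# between the empirical cumulative distribution functions (ECDFs) of two
# samples. Sample values are invented for illustration only.

def ks_statistic(a, b):
    points = sorted(set(a) | set(b))  # ECDFs only change at sample values

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

expressed = [2.1, 2.4, 2.9, 3.3, 3.8]   # e.g. expression levels in one condition
control   = [1.0, 1.2, 1.5, 2.0, 2.2]   # and in another
print(f"KS statistic: {ks_statistic(expressed, control):.2f}")  # → KS statistic: 0.80
```

A statistic near 1 means the two samples come from very different distributions; near 0 means they largely overlap, which is how the Dana-Farber team could judge how ‘closely correlated’ gene sets were.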

These articles are fairly typical examples of the way in which what we shall refer to as ‘business intelligence’ (BI) is reported in the press. They tend to relate to examples involving large data sets: 11 million separate credit card purchases by Pentagon staff in 2002 in the first case; and 16,000 different genes, each expressed as a complex sequence of DNA, in the second. They focus on applications that look for unusual patterns of behaviour (‘outliers’), as in the case of Lance-Corporal Pierre in the Marine Corps; or that discover previously unknown associations, as in the case of the two types of genes in the cancer research example. Similar preoccupations can be seen in the reporting of BI applications in the business news media, as can be seen below.

CRM’s secret sauce

[In CRM] the difficulty is not in gathering customers from the range of sales channels, but in making sense of the multitude of customer contact points coming into the enterprise. The web in particular has added a new dimension to the integration problem in terms of sheer data volume. The net result of multiple touch points is a proliferation of data repositories with apparently little connection with each other… The key challenge [of CRM] is not simply to gather vast amounts of data on customers… but to mine it to highlight clues as to how the company could conduct its business better. A good way to achieve this is through CRM analytics. (Computer Business Review Online, 1 April 2002)

Grokking the infoviz

Information visualisation is about to go mainstream. While it may not be the killer application some expect, ‘infoviz’ is going to help users to manipulate data in wholly new ways… Based on their business processes, companies now collect huge amounts of data that they want to [use BI tools to] analyse to gain a competitive edge. (The Economist, 19 June 2003)

Once again the focus is on large data sets, with data warehousing and data mining technologies tending to dominate.


However, the coverage given by Business Week online earlier this year (2003) was interesting and reflects a more comprehensive understanding of the range of techniques involved. Under the heading ‘Smart Tools: Companies in health care, finance and retailing are using artificial-intelligence systems to filter huge amounts of data and identify suspicious transactions’, their article gives a much more rounded presentation of the tools than is often the case when BI is reported. For example, when discussing the case of Wal-Mart and its CRM efforts they say:

Data-mining software typically included neural nets, statistical analysis, and so-called expert systems with if-then rules that mimic the logic of human experts. The results enable Wal-Mart to predict the sales of every product at each store with uncanny accuracy, translating into huge savings in inventories and maximum pay-off from promotional spending. (Business Week online, Spring 2003. Full story at: http://www.businessweek.com/print/bw50/content/mar2003/a3826072.htm?bw)

This represents a more balanced and comprehensive view of the tools and techniques involved in BI activities. It is an understanding that is becoming more widespread and is ultimately much more useful than the single focus on a novel approach, i.e. ‘data mining’, and is one I hope to develop more fully throughout this module. The fairly unstructured cluster shown in the figure below lists some of the major tools, techniques and applications (these are often difficult to separate from one another) to be found in the area of business intelligence.

One of the problems associated with the growth of any new area of technology is what you might call ‘great success’ reporting. This often focuses on the ‘wow’ factor and can be long on jargon and short on explanation. This module sets out to get behind the headlines and show not only what is being achieved within successful BI applications but also how this is being achieved. In this sense the module is very much ‘techniques-oriented’.

Naming and terminology is, at best, confusing in this area – as we shall see in attempting to define ‘business intelligence’ in Chapter 1. The titles considered for this module included ‘data mining’ (too narrow), ‘knowledge management’ (too broad), ‘applied statistics’ (too scary!), etc. Very recently the term ‘business analytics’ has come into more common use as the sub-area of business intelligence that deals particularly with data-driven, quantitative analysis for decision support. In terms of the balance of coverage of material within this module, perhaps business analytics would have been a more accurate description, as much of the content is techniques-oriented, looking particularly at data mining techniques for dealing with quantitative data and at approaches from artificial intelligence for reasoning about data and rules. However, ‘business analytics’ was not in common use when the course structure was formulated (2002) and this module does attempt to give at least a flavour of the range of related issues not strictly linked to the ‘algorithmic’ aspects of business intelligence.

The range of techniques covered in this BI module is described in the table below according to their major ‘discipline’. Even this simple listing would prove controversial to some in the field. For example, some intelligent systems researchers will claim that Bayesian belief networks (BBNs) ‘belong’ to their discipline, and indeed we cover them in Chapter 5 with other techniques from artificial intelligence (their position in my listing here reflects the fact that they are rooted in Bayesian probability theory, which is very much a ‘statistical’ approach). Conversely, there are many statisticians who claim that artificial neural networks (ANNs) are simply a form of non-linear statistical model (and in fact we cover different aspects of ANNs in both Chapter 4 and Chapter 5 – a bit of fence-sitting, perhaps!).

[Figure: an unstructured cluster of tools, techniques and applications relating to Business Intelligence (BI), including decision analysis, data mining, visualisation, customer relationship management, expert systems, data warehousing and text mining]


Core methods and approaches used within BI, listed according to their major discipline

Intelligent systems            Statistics
Induction                      Correlation/association
Deduction                      Clustering
Search                         Regression
Boolean logic                  Classification
Fuzzy set theory               Sensitivity analysis
Machine learning               Simulation
Artificial neural networks     Bayesian belief networks

These are some of the subjects we will cover in the techniques-oriented parts of this module. However, given the breadth of subject matter it will not be possible to do much more than whet your appetite for any one technique. (Even in the cases where slightly more detail is given, for example for linear discriminant analysis in Chapter 4 or predicate calculus in Chapter 5, these are limited ‘tutorial’ style introductions provided to make some general points about the range of techniques of that particular type.) Hopefully you will gain enough of an insight to get behind the jargon of business intelligence and be able to ask the right questions when looking at any proposed solution in, for example, the areas of data mining or knowledge-based systems.

In addition to the techniques-oriented Chapters (4 and 5) there is a general introduction to BI and associated areas (Chapter 1) and an overview defining some ‘technical’ issues of general relevance to the BI context (Chapter 2). The range of techniques which falls within the area of exploratory data analysis – which some authors would consider to be a part of BI and others not – is outlined in Chapter 3. These techniques have certainly been around for quite a bit longer than the ‘BI phenomenon’ (but then so have most of the approaches and algorithms which make up BI) and are of general relevance to a broad set of analyses. Finally, ‘The BI market and future trends’, which can be found on the module website, takes a look at the main players, tools and application areas within the BI marketplace and also proposes some likely trends which will have an impact on how BI is perceived and used in the future.

What this module will not cover

Some will find it a little perverse to begin by describing what will not be covered in this module. However, this can be a useful exercise, particularly for those who have done a little reading in the area and/or come to the subject with a certain set of expectations.

Not strategic intelligence

There appears to be a growing number of references to ‘strategic’ or ‘competitive’ intelligence in the same context as BI. These articles tend to use the word ‘intelligence’ in its loosest sense and draw parallels between the business context and ‘military intelligence’ or the ‘intelligence community’, about which much has been heard recently in the UK since the war in Iraq. This usage appears to be strongest in the pharmaceutical industry, where competitor and product intelligence are particularly important, but examples can be found in other areas. (Indeed I heard one management consultant claim that one of his best personal examples in the area of ‘strategic intelligence’ was the day he discovered a new and much nicer route for his morning walk in to the office!) So, although this form of knowledge tracking and management may be increasingly in vogue, we will not cover it. (The exception would be where the ‘intelligence’ is making obvious use of more traditional BI tools to manage a problem, for example, the use of data mining by the US Department of Homeland Security in its ‘Terrorism Information Awareness’ programme.)


Not business processes/performance (ROI, BPI, BAM, EAI, …)

Like much of IT, the BI area is plagued by jargon and abbreviations, including the whole raft of ‘strategic advantage’ labels, many of which appear to have a particular meaning within BI. Thus we have: ROI, earned value analysis (EVA), balanced scorecards, BPI (the latest in the business process/performance engineering… management… and now Integration, cycle), enterprise application integration (EAI), ETL and most recently business activity monitoring (BAM). In a recent introduction to BAM, I found this sentence, which in some ways summed up the whole jargon-thing rather well:

… because we have the best-of-breeds technologies in ETL, EAI and BI, we can create this infrastructure for the future as well as satisfying their BAM needs. (A quote from Informatica given in ‘Making real-time business decisions with BAM’. Available at: http://www.dwinstitute.com/research/display.asp?id=6755&t=y)

For those who can be bothered to keep up with the ever-changing terminology, BAM attempts to provide executives with information that will help them make decisions in real time, whereas operational BI is seen as being mostly a ‘batch-oriented’ approach requiring data to be uploaded into data warehouses, etc. (To me this sounds eerily like the executive information systems that were much touted in the 1980s.)

BAM is a still-emerging discipline, and although some vendors and analysts disagree about what exactly comprises a BAM solution, a good general description is of an architecture that combines real-time transactional data with historical data, and which provides a context of some kind – often a digital dashboard – to organise and present this data. (‘Business activity monitoring takes center stage’. Available at: http://www.dw-institute.com/research/display.asp?id=6641&t=y)

The same article notes that the research firm Gartner Inc. has defined five distinct components of any BAM solution:

• enterprise application integration (EAI)
• extraction, transformation and loading (ETL)
• data warehousing
• business process modelling
• network systems management.

Gartner have also stated that BAM will be ‘one of the top four initiatives driving IT investment and strategy by 2004’. At least the technology ‘terminologists’ are being consistent this year, as BAM would appear to follow on nicely from last quarter’s ‘next big thing’, which was the real-time enterprise (RTE). (See, for example, the April news article from The Data Warehousing Institute (TDWI) entitled ‘Best Practices for the Real-Time Enterprise’, available at http://www.dw-institute.com/research/display.asp?id=6634&t=y.) Looking through articles on the web over the past two or three years I have noted the same management consultancy firms repeatedly making claims for the next big thing, and this thing has often overlapped with what I would consider to be BI territory. I do not propose to spend much more time on this sort of material.

The following is a typical example of the sort of commentary that I feel gives management consultants a bad name:

Business intelligence means consciously using knowledge coupled with action to effect performance improvement. Creating the awareness of performance potential in everyone’s mind is the first step. (From Knowledge Consultants Inc. Available at: http://www.knowledgebiz.com/bi.html)

Well, aren’t you glad you asked?!


Perhaps a slightly more balanced example is the report produced by TDWI in 2003, Smart Companies in the 21st Century: The Secrets to Creating Successful Business Intelligence Solutions. In the executive summary, after reminding the reader of the ability to ‘derive significant ROI by using BI to devise better tactics and… capitalize more quickly on new opportunities… using BI to become intelligent about the way you do business’, the authors state that successful BI solutions have the following characteristics:

• business sponsors are highly committed and actively involved in the project
• business users and the BI technical team work together closely
• the BI system is viewed as an enterprise resource and given adequate funding and guidance to ensure long-term growth and viability
• firms provide users both static and interactive online views of data
• the BI team has prior experience with BI and is assisted by vendor and independent consultants in a partnership arrangement
• the company’s organizational culture reinforces the BI solution.

(Available at: http://www.dw-institute.com/display.asp?id=6766)

I quote this extract partly because it comes from TDWI, which is an excellent source of BI information, and also because I feel it contains some truth. However – and here you may call me a cynic if you like – I get a strong sense of déjà vu when I read this type of report. I’m sure if you replaced ‘BI’ with ‘MIS’ or ‘EIS’ or ‘BPM’ or…, the script would look like something from the late 1980s, mid-1990s, or whatever. I am not saying that the business (strategic) context is not of importance – indeed whenever this module goes into detail about techniques we must keep an eye on the ultimate use and goal of these novel approaches. It is more a case of feeling that you will have heard this type of ‘strategic’ talk before and that I am not sufficiently eloquent to make it sound any better than it did the last time around. There is an excessive amount of this type of reporting on the web and you can follow some of the directed reading from the BI module web area if you feel the need to put things back into the ‘big picture’. Just in case I (or you) are tempted to forget it, here is a final quote from the Unicom Seminars organisation in London:

In order to get full value of Business Intelligence and Business Analytics, it is necessary to understand how these are positioned within the organisational context. It is not sufficient to be informed only about the technologies underpinning Business Intelligence (BI) and Business Analytics (BA). It is necessary to consider the business needs, the business processes and then ask the right questions as to how Business Analytics can be embedded within Business Process Integration (BPI) and the information systems. (Available at: http://www.unicom.co.uk/bi)

Not ‘architecture’

The reason that I will not cover the systems architecture aspects of BI is definitely not because this is an area of peripheral interest. Indeed it is critical to the success of most BI applications that the systems aspects of the problem are adequately addressed. Nor is this a trivial task, as BI applications often run against very large data sets, collecting data from across the enterprise. I do not plan to cover ‘architecture’ for two reasons. In the first place, we do not have the space to adequately address all the issues required in anything like sufficient depth. Secondly, there are other places within the SIS course where this material is covered.

So while we will not cover architecture in any detail within the module text, there is plenty of material on the web which can be referenced and which we can discuss in the module conference area if participants are interested. As a ‘taster’ for the range of material on the web, as well as being one of the best two-page summaries of BI architecture I have come across, I end this section with some quotes from Athena IT Solutions’ article ‘Mars, Venus, and a Successful BI Architecture’ (with obvious references to the well-known ‘self-help’ book on males, females and relationships).

When it comes to BI architecture, business users are from Mars, and IT people are from Venus. Each approaches a BI project from a totally different viewpoint. As an IT person, you often need to educate your business users to make sure they understand that a successful BI architecture requires more than just a silver-bullet product. (Available at: http://www.athena-solutions.com/bi-brief/may03-architecture.html)

The authors then go on to describe a four-layer BI architecture which has ‘information architecture’ as its foundation, ‘data architecture’ and ‘technical architecture’ as its supporting framework, and finally a ‘product architecture’ to deliver solutions to the end user. As they note, it is often at this last phase in the architectural hierarchy that things can go badly wrong.

This is where you see the gulf between business users and IT. Business users will go off on their own and purchase products based on a great-looking demo and PowerPoint presentation. ‘Architecture?… we don’t need no stinkin’ architecture!’ This throws IT’s four-part architecture totally out of whack. … Regardless of what planet you’re from, you need to establish an architectural framework for your BI projects. Creating a framework and plan increases the likelihood of long-term success. (Op. cit.)

So whether you are from Mars or Venus, you have been warned! The road to successful BI implementation will never be easy. However, it can ultimately be highly rewarding and I hope that these notes and your involvement in this module will increase your understanding of the issues involved.


Aims

The aims of this module are:

• to develop a critical awareness of the range of tools being marketed under the label decision support or, more generally, business intelligence (BI)
• to provide an overview of the core statistical approaches underpinning the subject
• to help participants gain an appreciation of key benefits and limitations, which are all too often over-stated by the prefix intelligent in marketing literature.

Learning outcomes

After completing this module participants will be able to:

• identify how classical statistical techniques have been re-packaged as modern BI products
• discuss the potential application of BI tools to various types of business problem-solving and appreciate their limitations
• describe the range of approaches used by computers to represent and reason with formal and quantitative knowledge
• recognise the role of end users and analysts in BI and the important differences between decision support and decision making.

Topics covered

• modelling the mind: deduction, induction, machine learning and neural networks
• quantitative methods for data analysis and knowledge extraction: classification and regression, clustering, association rules, Bayesian approaches, etc.
• modelling, simulation, optimisation and uncertainty
• BI applications: data mining, knowledge management, decision analysis, text mining, etc.

Indicative readings

Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, Cambridge, MA, MIT Press.

Mallach, E.G. (2000) Decision Support and Data Warehouse Systems, New York, McGraw-Hill.

Turban, E. and Aronson, J.E. (2000) Decision Support Systems and Intelligent Systems, 6th edn, London, Macmillan.

Wisniewski, M. (1997) Quantitative Methods for Decision Makers, 2nd edn, London, Financial Times/Prentice-Hall.

Web references

Referenced sources (URLs) were last checked on 13 December 2003.


1 Introducing business intelligence


Contents

Learning outcomes for this Chapter
Introduction
1.1 From terabytes to petabytes
1.2 Some examples of business intelligence and KDD
1.3 Definition and scope of data mining and knowledge discovery
1.4 Pejorative or negative use of the term ‘data mining’

Activities

Thinking point  Beers and diapers
Thinking point  Customer retention
Exercise 1.1  Statistical analysis
Exercise 1.2  Interpreting the ‘Dogs of the Dow’
Self-assessment questions


Learning outcomes for this Chapter

When you have completed this Chapter you will be able to:

• appreciate the size and scope of data sets typically involved in BI applications
• explain what is meant by ‘business intelligence’ (BI) and its relationship to the areas of data mining and ‘knowledge discovery from/in databases’ (KDD)
• identify the key stages within the BI/KDD process and the models used to describe them.


Introduction

Defining exactly what was meant by ‘business intelligence’ was one of the most difficult parts of writing this entire module. Even once we have accepted, as I have suggested in the preview, that we are focusing on business analytics and will therefore adopt a ‘techniques-oriented’ approach, the next step in terms of definition is not obvious. For example, if I follow Berson, Thearling and Smith’s (1999) definition, which states, ‘the process of discovering meaningful new correlations, patterns and trends…’, I am confused before I even get started. As we shall see, models (which is what correlations are) and patterns are quite different things.

Perhaps a better place to start is to ask, ‘why do we need business intelligence, what is it for?’


1.1 From terabytes to petabytes

Most large organisations now possess transactional databases on customers, products, etc. that run into millions (or even billions) of records, and these are growing all the time. Their owners feel that potentially interesting knowledge must reside in the data warehouses created from these transactions, but often all they see is the data. (We focus for much of this module on well-structured data as our information source, though we will take a brief look at semi- or un-structured information in ‘The BI market and future trends’ when we consider future trends.)

In addition to the well-known example of data sets generated on customer buying habits by large supermarkets (aided by ‘loyalty’ cards), the telecommunications industry also collects vast amounts of data. For example, France Telecom, which has over 90 million customers in over 200 countries, recently completed a new data warehouse to which 500 million records (using 100 GB of storage) will be added each day! At peak times more than 65 million records will be added in one hour. These numbers are truly staggering: approximately every 10 days the database will have grown by a terabyte. This used to be an upper limit that data mining practitioners benchmarked against, and there was a line, by an anonymous data miner, ‘I am terrified by terabytes’ (1 terabyte = 1,024 gigabytes). Yet at a recent conference (KDD 2003) Jim Gray, noting the continued growth of data sets, modified this to ‘I am petrified of petabytes!’ (1 petabyte = 1,024 terabytes). What is all this data for? In the case of France Telecom they were expanding their data warehouse to support additional BI applications:

The company needed to expand its data warehouse, creating a new system and online repository for all telecommunications traffic, to support fraud detection, customer service, network traffic analysis, and marketing. (TDWI News, 3/9/2003)
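The arithmetic behind the France Telecom figures is easy to check. The snippet below reproduces the ‘terabyte every 10 days’ claim and, purely for illustration, extrapolates to a petabyte under the (unrealistic) assumption that the load rate of 100 GB per day stays constant:

```python
# Back-of-envelope check of the growth figures quoted above, using the
# 'binary' definitions given in the text (1 TB = 1,024 GB; 1 PB = 1,024 TB).
GB_PER_DAY = 100
GB_PER_TB = 1024
TB_PER_PB = 1024

days_per_terabyte = GB_PER_TB / GB_PER_DAY                        # ~10.2 days
years_per_petabyte = (GB_PER_TB * TB_PER_PB / GB_PER_DAY) / 365.25

print(f"Days to add one terabyte: {days_per_terabyte:.1f}")       # → 10.2
print(f"Years to add one petabyte: {years_per_petabyte:.1f}")     # → 28.7
```

So even at a rate that petrified data miners in 2003, a single warehouse would still take decades to reach a petabyte; the petabyte era arrived through many such systems and accelerating collection rates, not one warehouse alone.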

Outside of commercial systems there are also many examples of large data sets, particularly in various areas of scientific research. Clearly large projects in areas such as particle physics and astronomy lead to massive data sets. One well-known classification application in this area is the SKICAT system (Fayyad, Djorgovski and Weir, 1996), which automatically catalogues millions of stars and galaxies where each point is represented by a 40-dimensional vector describing its features. Recently the development of genomic and micro-array analysis has led to similarly large data sets, but this time at the opposite end of the ‘size’ spectrum, with data relating to the specific code sequencing of DNA in genes as opposed to astronomical objects.

One of the most recent and controversial examples of large data set collection came about as a direct result of the 11 September 2001 attack on the World Trade Centre in New York. In response to the perceived increase in threat from terrorism the US government launched its TIA program. This initially stood for ‘total information awareness’ but has since been re-labelled the ‘terrorism information awareness’ program to attempt to allay the fears of many civil liberties groups. This is not the place to discuss the politics (nor even the eventual likely success) of the total/terrorism program; suffice it to say that a lot of personal data is collected (exactly how much is not known – that is a ‘national security’ issue – another of the worries of those concerned about freedom of information). However, the proposed use for all this data is known (at least in general terms) and it is exactly the same as in many business intelligence applications:

Attempts to ‘connect the dots’ quickly overwhelm unassisted human abilities. By augmenting human performance using these computer tools, the TIA Program expects to diminish the amount of time humans must spend discovering information and allow humans more time to focus their powerful intellects on things humans do best – thinking and analysis.


This quote is taken from a DARPA report on the terrorism information awareness (TIA) program and was reported in the Washington Post on 20 May 2003; on-line at: http://www.washingtonpost.com/wp-dyn/articles/A17121-2003May20.html. There is always a danger in using ‘breaking’ stories to illustrate a point – it keeps things relevant, but the stories are also liable to change. On 25 September 2003 the US Congress voted down the part of the homeland funding budget dedicated to the TIA program. This was mainly due to civil liberties concerns and the fact that the technology would potentially target US citizens. Senator Ron Wyden stated, ‘The program that would have been the biggest and most intrusive surveillance program in the history of the United States will be no more’. However, it was noted that while ‘Congress is shutting down TIA, we are not forgoing the use of technology to sharpen our homeland security efforts and track terrorists here and around the world’. In particular the legislation ‘does permit TIA or a similar system to be used for data-mining, as long as the targets are not US citizens or residents’. So that is the state of the TIA as we go to press: a controversial and challenging public sector application of BI technology. (Quotes from: http://www.kdnuggets.com/news/2003/n18.)

While these large data sets provide the most impressive success stories, it should be noted that BI and its techniques, such as data mining and classification, are not restricted to massive data sets; the principles also work for much smaller collections of data. For practical reasons we will use quite a few small-scale examples throughout this module, and even on this scale we will (hopefully) see some interesting results.

Thus the goal of business intelligence from an organisational point of view can be seen as the discovery and use of knowledge that resides in databases. This is often referred to in the literature as ‘knowledge discovery from/in databases’ (KDD) and has been defined by Piatetsky-Shapiro and Frawley (1991) as:

…the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data…

The majority of our discussion on business intelligence will focus on the use of analytical methods applied to structured data sets. However, the reason the module is not named KDD or simply data mining is that we will touch on the context and architectures that support the KDD process, such as data warehousing and on-line analytical processing (OLAP) tools. We will also look at the use of the patterns/models that result from the KDD process, which may involve formal representation of and reasoning with (or inference from) the knowledge discovered from the process.

But enough of the philosophical reflection for the moment! I am actually a fairly pragmatic person and often find the best way to understand something is to look at examples. So rather than try to refine our definition of BI any further at the moment let’s take a look at the range of application areas and tasks covered by this approach.


1.2 Some examples of business intelligence and KDD

Association rules

Perhaps the best known (or at least most frequently quoted) example of data mining is the infamous ‘beer and diapers’ sales relationship. This is not such a bad place to start as, in addition to its infamy, the example illustrates one of the simplest types of data mining technique – that of association rules. The story goes that a large retail chain applied data mining techniques to look at customer buying patterns and discovered that customers who bought diapers (that’s nappies to you and me) were much more likely to buy beer than the average customer. This (alleged) association was said to be particularly strong for male customers. By placing the nappies and beer much more closely together within the shop the retailer increased its sales of both. It is also claimed that they placed snack foods (potato crisps, etc.) between the two and significantly increased the sales of all three. Whether or not the story is true is not really the point – I guess in principle it is possible, but then why haven’t all shops adopted this same layout? (I suspect there are other forces at play and that the average young mother or father asking about finding nappies would be less than impressed with the response, ‘Oh, they are over in the wines and spirits aisle’). The original ‘beer and diapers’ association allegedly related to young males shopping at the weekend. As a retailer I guess you would have to ask the ‘bigger’ questions such as: ‘what proportion of my weekly nappy sales are purchased by this customer type?’ (formally referred to as the coverage of the rule); or ‘can we afford to move all the nappies over to the beer aisle each weekend?’, rather than become fixated on a single ‘discovered’ pattern.
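The ‘bigger’ questions just mentioned can be made precise. As a minimal sketch (with invented baskets, not any retailer’s actual data), the following computes the support, confidence and coverage of a ‘diapers → beer’ rule:

```python
# Support, confidence and coverage of the rule 'diapers -> beer',
# computed over a tiny invented set of market baskets.

baskets = [
    {"diapers", "beer", "crisps"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

def rule_metrics(baskets, antecedent, consequent):
    n = len(baskets)
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    support = len(with_both) / n                        # how often the whole rule holds
    confidence = len(with_both) / len(with_antecedent)  # P(consequent | antecedent)
    coverage = len(with_antecedent) / n                 # how much data the rule applies to
    return support, confidence, coverage

support, confidence, coverage = rule_metrics(baskets, {"diapers"}, {"beer"})
print(support, confidence, coverage)
```

A retailer would typically only act on rules whose support and coverage are large enough to matter commercially, however high the confidence.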

Thinking point Beers and diapers What is your interpretation of the ‘beer and diapers’ finding?

While the ‘beer and diapers’ example may be somewhat suspect, what is not in question is the general rule association approach behind it, referred to more generally in the retail domain as market-basket analysis. This approach has affected a wide range of retail practices such as:
! shelf layout (did you know that the Mars company provides a product layout map to guide retailers in how their confectionery should be physically positioned on the shelf?)
! product and advert placement (used in the preparation of catalogue layouts and more recently web-site organisation and promotion) and cross-selling of products (think of Amazon’s ‘people who bought this book also bought…’).

Before moving on to examples of other techniques it is worth noting that the application of association rules is not restricted to the area of retailing. The discovery of the C/EBP-beta gene’s role (and its link to common cancer cell genes), noted in the introduction, was made by a process of searching for association rules (Science Daily, 13 August 2003). Another, slightly more unusual, example using association rules is the Advanced Scout application created by a research team from IBM to assist professional basketball coaches (Bhandari et al., 1997). The Advanced Scout looks at past logs of basketball games to try to discover interesting patterns that may be of use to the coach. It then expresses these patterns in the form of rules of the type, ‘If Player A is in the game during the last period, the rebounds recovered by Player B decrease from 35% to 13%’. There is one final area where association rules can prove useful and this is in


finding associations which are temporal – i.e. changes through time. In general BI tools are not particularly well suited to this type of sequential analysis, but the association rule approach does offer some potential benefits.

Customer relationship management

The ‘market-basket analysis’ discussed above is now seen as being part of the broader business activity of customer relationship management (CRM) – in some uses the ‘M’ stands for marketing. CRM is concerned not simply with product sales but with assessing overall customer profitability, and thus improved customer retention as well as more effectively targeted customer acquisition. To achieve these goals a number of additional techniques are added to that of rule association, notably classification and clustering.

It is a well-accepted generalisation within CRM that the cost of acquiring a new customer is in the order of 5–20 times greater than the cost of maintaining an existing one. This is the fundamental driver behind customer retention, and the technique of classification can be effectively put to work in this area. Assume that you have a large database of historical customer behaviour. It is relatively straightforward to identify (classify) those customers who were profitable and who subsequently ‘defected’ to another supplier (an obvious example being banks/building societies and mortgage provision). It may be possible to create models that help ‘explain’ such defection behaviour. These models are then used to classify existing customers and to target profitable but potentially ‘defecting’ customers with remedial action.
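The defection modelling just described can be illustrated with a deliberately tiny sketch: a single-attribute threshold classifier (a ‘1R’-style model) learned from invented historical data and then applied to current customers. A real CRM model would of course use many attributes and a richer algorithm.

```python
# A '1R'-style defection model: learn a single threshold on months since
# a customer's last purchase from (invented) historical defection data,
# then flag current customers who fall on the 'defector' side.

history = [  # (months_since_last_purchase, defected?)
    (1, False), (2, False), (3, False), (4, False),
    (9, True), (10, True), (12, True), (14, True),
]

def best_threshold(data):
    """Choose the cut-off that misclassifies fewest historical customers."""
    def errors(t):
        return sum((months >= t) != defected for months, defected in data)
    return min(sorted({m for m, _ in data}), key=errors)

threshold = best_threshold(history)

current = {"Ms A": 2, "Mr B": 11, "Ms C": 6}
at_risk = {name for name, months in current.items() if months >= threshold}
print(threshold, sorted(at_risk))
```

The ‘remedial action’ stage would then target the flagged customers before they actually defect.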

Thinking point Customer retention How could customer data be used to improve customer retention?

Customer segmentation

Here the related technique of clustering might well be used. In this case there is no requirement to model the different behaviours of customers as classified by their various group memberships; all that is being attempted is the isolation of interesting groups (clusters) of customers. These may turn out to be a highly profitable group (on the basis that in some sectors 5% of the customers generate 95% of the profits) or they may turn out to be bad debtors, or whatever. The task of the clustering algorithm is simply to identify the existence of potentially interesting clusters and it is thus said to belong to the class of unsupervised learning or discovery algorithms. The earlier case of classification, where the algorithm is working to a set of specified targets (e.g. to classify a particular type of customer – those who are most profitable – and evaluate their chance of becoming defectors), is referred to as a supervised task or algorithm (see Chapter 2 for further discussion on types of machine learning techniques).
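As a minimal illustration of the unsupervised idea, the sketch below runs k-means on invented customer data – no outcome labels are supplied; the algorithm simply isolates groups:

```python
# Minimal k-means: partition customers described by (annual spend, visits
# per month) into k groups with no outcome labels supplied. The data and
# the starting centroids are invented for illustration.

customers = [(100, 2), (120, 3), (110, 2), (900, 12), (950, 14), (870, 11)]

def kmeans(points, centroids, steps=10):
    for _ in range(steps):
        clusters = [[] for _ in centroids]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            nearest = min(range(len(centroids)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else cen
                     for cl, cen in zip(clusters, centroids)]
    return centroids, clusters

centroids, clusters = kmeans(customers, centroids=[(0, 0), (1000, 15)])
print([len(c) for c in clusters])
```

Whether the two groups found here are ‘interesting’ (highly profitable, bad debtors, or whatever) is a judgement left to the analyst – the algorithm only proposes them.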

Unusual patterns

A more restricted form of the general idea of clustering can also be seen in the identification of unusual patterns. The algorithms which operate in this area tend to use either the well-developed set of techniques from statistics which are able to detect outliers (i.e. highly unusual values) or apply logical inference to demonstrate non-conformance to a set of domain rules. The most widely known examples in this area probably relate to the area of credit card fraud, with American Express, Lloyds TSB and many others using these approaches to identify aberrant spend patterns:


! American Express: were perhaps the first to popularise common point-of-compromise techniques, which are ‘analytical data-mining techniques that allow American Express to identify with statistical certainty where card-member data (such as card number, expiration date, name, address, and phone number) was compromised before the start of a fraud episode’. (Available at: http://home5.americanexpress.com/merchant/service/faq/fraudprevention3.asp)

! Lloyds TSB: where the data-mining suite from SPSS was used and the Head of Fraud Strategy reported, ‘within the two week pilot period we had built 24 fully working predictive models with an estimated annual saving of £2.5 million’. (Available at: http://www.spss.com/success/template_view.cfm?Story_ID=83)

The recent case of Lance-corporal Pierre of the Marine Corps, which we referred to earlier, provides an interesting example of both types of algorithm (i.e. statistical outliers and expert rules). In the first place, unusual activity on her, and other Pentagon staff members’, accounts was detected by an ‘outlier identification’ program. Once this candidate set of unusual transactions had been identified (in terms of frequency or amount of spend) a set of domain rules was used to separate those which were genuine from those which were not. Thus ‘transactions between, say, a medical facility and an upholsterer would be looked at more closely than those made between a medical facility and a pharmaceutical supplier’ (http://abcnews.go.com/sections/us/Business/pentagonabuse030814.html). Given that Pierre’s purchases included a car, a motorcycle and breast implants, I am not sure that a particularly clever piece of data mining software was needed, but then we are talking about the US military here.
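The first, statistical, stage can be sketched very simply. The example below uses invented amounts and a plain z-score as a stand-in for whatever techniques the card issuers actually use:

```python
# Flag transactions whose amounts are extreme relative to an account's
# history, using a plain z-score. Amounts are invented. Note: with only
# ten observations the sample z-score cannot exceed about 2.85, so a 2.5
# cut-off is used rather than the textbook 3.

from statistics import mean, stdev

def flag_outliers(amounts, cutoff=2.5):
    m, s = mean(amounts), stdev(amounts)
    return [a for a in amounts if abs(a - m) / s > cutoff]

# routine purchases on a procurement card, plus one very unusual item
history = [42, 55, 38, 61, 47, 52, 44, 58, 49, 12000]
print(flag_outliers(history))
```

The flagged transactions would then go forward to the second stage, where domain rules decide which are genuinely suspicious.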

Regression models

This is perhaps the most long-standing example of BI in a number of sectors and has historically often been referred to as ‘forecasting’. Nakhaeizadeh et al. (2002) note that regression ‘is the most frequently encountered…application of data mining technique in banking and finance’. These models have been used to predict the future behaviour of such elements as stock prices, currency or interest rates, and in this context have most often used statistical approaches such as time series analysis or regression modelling. However, regression models are also used to ‘predict’ the probability that a new product will be successful or to find the likely spend of a new customer (i.e. these do not strictly deal with the time dimension in the same way as classic prediction-oriented approaches).

Within the financial domain regression models are used to aid portfolio management (Hill et al., 1994), currency trading (Diekmann and Gutjahr, 1998), bankruptcy prediction (Poddig, 1995), and identifying other negative trends in corporate finance (Altman et al., 1993). In addition to the use of conventional general linear models these applications have also involved approaches using regression trees, rule-based systems, neural networks, fuzzy logic, genetic algorithms and even chaos theory. In addition to considering how certain variables affect a final outcome, there is also interest in financial systems regarding the common underlying factors which affect key indices, say the Dow Jones and the FTSE – the area of dependency analysis, for which stochastic regression models have been developed (von Hasseln and Nakhaeizadeh, 1997).
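At its simplest, the regression idea amounts to fitting y = a·x + b by ordinary least squares. The sketch below uses invented income/spend figures purely for illustration:

```python
# Ordinary least squares for y = a*x + b: 'predict' a new customer's
# likely annual spend from income. All figures are invented.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

incomes = [20, 30, 40, 50, 60]      # £k per year
spends = [1.1, 1.5, 1.9, 2.3, 2.7]  # £k per year

a, b = fit_line(incomes, spends)
print(round(a, 3), round(b, 3), round(a * 45 + b, 3))  # spend for a £45k income
```

The richer approaches listed above (regression trees, neural networks, etc.) replace the straight line with more flexible functional forms, but the fitting-to-minimise-error principle is the same.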


1.3 Definition and scope of data mining and knowledge discovery

Anand and Buchner (1998) appear to have extended and embellished the earlier definition by Piatetsky-Shapiro and Frawley (1991), noted above, to give the rather long-winded definition:

The efficient, semi-automated process of discovering non-trivial, implicit, previously unknown, potentially useful and understandable information from large, historical and disparate data sets.

While I appreciate the ethos behind this constant (re)development of definitions, i.e. an attempt to be comprehensive and as exhaustive as possible, I am not always convinced they are useful. Indeed, definitions such as the one just noted may lead to inaccuracies and/or controversy. Does all data mining have to be carried out on ‘historical’ data? (If so, those working in the area of real-time data mining had better re-label their work!) Do data sets always have to be ‘disparate’, and what exactly does that mean? How large is ‘large’? And so on.

One of my favourite definitions at the moment comes from Fayyad et al. and appears in their text on information visualisation in data mining. Although Fayyad was an early author in the area and has used much more complex definitions than in this recent book, for me this is very elegant:

The mechanised process of identifying or discovering useful structure in data. (Fayyad, Grinstein and Wierse, 2002)

In addition to its brevity I like this definition because ‘mechanised’ simply implies some form of machine (i.e. computer) rather than ‘automatic’, which would exclude certain human-assisted data mining approaches. Similarly, the inclusion of identification allows for a broader set of activities – such as verification of user hypotheses – than is allowed when only discovery is included in the definition. And, while the data sets used in data mining are typically very large, there is no need to make this a condition for the inclusion of an activity.

Fayyad et al. note that the term ‘structure’ can refer to patterns, models or relations. A relation is perhaps the most straightforward structure to identify and understand, as it simply designates some sort of dependency between attributes over, typically, a subset of the data. Thus a large proportion of individuals who have been identified as being private home-owners may also be seen to have credit cards. A model could then provide the (mathematical) description of a relationship that holds true for the data set, for example:

For all i where house_value_i > £150,000:
    no_of_credit_cards_i = a × house_value_i + b

A pattern is more difficult to define precisely but typically refers to any way of summarising a subset of data points – what is known as a ‘parsimonious description’ in statistics.

One clear characteristic of data mining tools is that they exhibit some level of automation – i.e. the process is driven by the system (though this may be done in a semi-automatic way). This can be contrasted with other, user-driven, tools which have been used to make sense of large data sets, including: database reporting and querying tools or languages; on-line analytical processing (OLAP) tools; classical statistical packages; and even some early machine learning tools.


While the term ‘data mining’ is the one preferred by statisticians and ‘IT/MIS’ types, the term ‘KDD’ tends to be favoured within the AI and machine learning communities. It is also generally accepted that KDD represents a broader set of issues than are covered in data mining. Typically KDD is seen as a process within which data mining will be a stage. The overall process might incorporate the following steps:
! understand the application domain
! goal identification
! data collection
! data selection
! data pre-processing (various cleaning and transformation stages, often referred to in data warehousing as extraction, transformation and load (ETL))
! selection of attributes/feature extraction
! data mining
! post-processing (assessing the importance of patterns together with their robustness and reliability)
! interpretation of results and action
! feedback/use of the knowledge discovered.

The list above represents my relatively informal attempt to summarise the main stages in the KDD process. A number of more formal models have been developed, including the CRISP-DM model (CRoss Industry Standard Process for Data Mining) shown in Figure 1.1. The six main phases of the CRISP model are fairly self-explanatory, and in fact the ‘modelling’ phase potentially allows for other forms of data-driven exploration – i.e. other than just data mining. In the light of this and the notes above, the model might be more correctly named the CRISP-KDD (or even CRISP-BI) model.

[Figure 1.1 is a diagram showing the six CRISP-DM phases – business understanding, data understanding, data preparation, modelling, evaluation and deployment – arranged as a cycle around a central data store.]

Figure 1.1 The cross industry standard process for data mining (CRISP-DM) model (from http://www.crisp-dm.org/)


There are many different ways of attempting to categorise the range of fields associated with data mining, knowledge discovery and business intelligence, which include:
! databases (especially very large and multi-dimensional databases)
! statistics
! artificial intelligence (particularly machine learning and pattern recognition, but in stating this I am demonstrating my own background discipline and bias, as many statisticians would claim either or both of these to be sub-areas of statistics)
! data visualisation, etc.

Rather than fall out over which specific subject areas belong to which fields it is sometimes useful to consider how techniques have historically been categorised. Maimon and Last (2001) identify five main categories:
! logic-based (e.g. inductive models)
! classical statistical (e.g. hypothesis testing and regression models)
! non-linear classifiers (e.g. neural networks and much pattern recognition)
! probabilistic (e.g. Bayesian belief networks)
! information theoretic (e.g. info-fuzzy networks).

Of course these categories are highly subjective and depend on one’s viewpoint. For example, I could also suggest some ‘discipline-centred’ categorisation in which we might have such categories as:
! Philosophy (including induction, deduction, etc.)
! Biologically-inspired (which would include the techniques of neural networks and genetic algorithms), etc.

Ultimately such categorisations may not be that useful, and this certainly appears to have been the view of Cios, Pedrycz and Swiniarski (1998), who simply throw the range of algorithms and approaches into a fairly random diagram not dissimilar to that used earlier in the preview to this module.

Classification is certainly useful in pointing up similarities as well as differences in a group of activities/methods. To this end, I do attempt to cover the material (especially in Chapters 4 and 5) in some sort of meaningful order, while at the same time pointing out that this or that technique can also be seen to exhibit certain characteristics which mean it could be considered to belong to an alternative grouping.


1.4 Pejorative or negative use of the term ‘data mining’

There is one other set of definitions which has been applied to the data mining aspects of BI in particular. These have historically been made by statisticians who feel that many of the tools – and, more importantly, the unwarranted or uncritical application of these tools – will lead to the mining of something very different from gold (as is implied in the KDnuggets name). The terms ‘data dredging’, ‘snooping’ and ‘fishing’ have variously been used as alternatives to ‘mining’ to indicate that this is a much less structured task than is sometimes implied and may end up providing unsavoury results!

To illustrate the potential risks of the unthinking application of a statistical technique, Jeff Ullman of Stanford University relates the (apocryphal?) story of a ‘parapsychologist’ in the USA. In the 1950s this scientist decided to test students’ ‘extrasensory perception’ by getting them to guess 10 cards in a row – red or black? He ran the test on 10,000 students (this must be an apocryphal story!) and found that 10 of them guessed all the cards correctly. These students he declared to have ESP. What he had failed to note was that the likelihood of getting all 10 correct by pure random chance is 1 in 2^10, or 1 in 1,024 (pretty close to the 1/1,000 ‘success’ rate he found). Of course, on retesting his star students he discovered that they performed no better than the average. And his conclusion? Once you tell people that they have ESP they can no longer demonstrate it!

Of course this is a slightly silly story but it demonstrates an important point – if you search long enough you can find an arbitrary model that will fit your data; alternatively, if you have a pre-defined model in mind you will probably find data to fit it! While we may mock the parapsychologist, many people make the same mistake when, for example, carrying out statistical hypothesis testing.
This is hardly surprising, as most people have a poor understanding of what a ‘statistically significant’ test is telling them. Let us say that you are testing to see whether there is a difference between the debt owed by different groups of credit card customers. You begin by considering the attribute gender and find that at the 95% significance level there is no evidence of any difference in the credit balance of your groups (i.e. male vs female). Had you found there was a difference, what the ‘significance level’ tells you is that there is only a 5% chance that this result was due to purely random chance, and so you would feel justified in claiming it was a ‘real’ effect. However, let us assume that you now continue to look for differences between groups of customers based on:
! those under 30 years of age vs those over
! those who are home owners vs those who rent
! those with children vs those with none, etc.
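The danger of repeated testing can be quantified. Assuming the tests are independent, the chance of at least one spurious ‘significant’ result at the 5% level grows rapidly with the number of tests tried:

```python
# Chance of at least one spurious 'significant' result when k independent
# tests are each run at the 5% level: 1 - 0.95**k.

def false_positive_chance(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(false_positive_chance(k), 2))
```

With twenty attribute comparisons, a ‘real’ effect will apparently be found nearly two times out of three even when no genuine difference exists anywhere in the data.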

Exercise 1.1 Statistical analysis For each of these individual tests you would look for a difference at the 95% confidence level – would this mean that you could now feel as justified in claiming ‘significance’ as when you started with gender?

This error of searching in an unconstrained manner for patterns, coupled with the inherent unpredictability of the stock market, provides a ‘cautionary tale’ on the limits of data mining approaches to conclude this introductory chapter.


A cautionary tale on the limits of data mining (and BI)

The Investor Column: data mining can produce fool’s gold
Monday 19 November 2001

Past returns are no guarantee of future performance. We’ve all heard that phrase before, yet many people still ignore it. Last year’s top fund performers always seem to attract the most new customers, and shares that have risen dizzily in price still attract the most attention. Unfortunately, those who invest solely on the basis of recent buoyant performances can end up buying expensive assets. Others who tend to learn about past returns the hard way are investors looking for patterns they hope will be repeated in the future. Chartists1 are in this category, but so are the so-called ‘data miners’. These are people who crunch numbers, looking for correlations or themes that other investors haven't discovered. This has become much easier with the speed and power of modern desktop computers.

One famous example of a data mining outcome is the ‘Dogs of the Dow’ strategy. Around ten years ago, US funds manager Michael O’Higgins discovered that the worst performing shares on the Dow Jones Index tended to outperform the index in the year following. He revealed all in a book and now the strategy is widely implemented. A more recent theory getting a lot of attention is where an investor buys shares with the highest dividend yields in the market. History has found a portfolio of these shares also tends to beat the market in the year following. While these strategies have shown results, sometimes over extended periods, they still attract sceptics. That’s because it is extremely rare to find a pattern that consistently repeats. An example is a die that rolls a six ten times in a row. The consistency of past results does not increase the likelihood of another six on the next throw.

In many cases, events can seem correlated but are not. For example, imagine if research found that over the past ten years the best performing class of shares on the New Zealand share market were those that started with the letter ‘B’. Obviously, there is no way there can be a link between the first letter of the name of an investment and its performance.

In a celebrated study a couple of years ago, a US researcher spent many hours analysing United Nations statistics looking for money-making patterns. At last he found an uncanny correlation between butter production in Bangladesh and the movement of the US share market. Again, it would be a brave or foolhardy investor who bet that a steep rise in butter production would automatically foreshadow the next bull market. One of the ironies of data mining is that, should an overlooked indicator of price rises be discovered, it doesn’t take long before enough people get to know about it to cancel the effect out.

Take the Dogs of the Dow theory. This has become so popular that investors now pour tens of billions of dollars each year into target shares. As Mr O’Higgins pointed out in 1997, after he had stopped using it as an investment tool, too much demand for unloved shares quickly pushes up their price to the extent that they are no longer cheap. Forget data mining; profitable investment comes from realising that markets are inherently unpredictable and taking the appropriate steps to spread your risk.

http://www.sharechat.co.nz/features/investorcolumn/article.php/0b06334e

1 Chartists – also known as technical analysts – use mechanical rules to detect changes in the supply of and demand for a stock, or to capitalise on expected changes.


Exercise 1.2 Interpreting the ‘Dogs of the Dow’
(a) What explanations can you suggest for the phenomena described in this article?
(b) Supposing you found that office rental values were highly correlated with petrol tax levels. What conclusions could you draw?

Self-assessment questions

(a) What are:
– CRM
– clustering
– KDD
– CRISP-DM
– ETL?

(b) Describe the techniques used to catch Lance-corporal Pierre.

(c) From your own experience do you feel the terms ‘data dredging’ or ‘snooping’ are justified in some/many occasions when analysts claim to be carrying out ‘data mining’?


2 general concepts


Contents

Learning outcomes for this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30

2.1 Some important distinctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2 Types of data and variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3 A BI process model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36

2.4 BI process model: pre-processing (the cleaning phase) . . . . . . . . . . . . . . . . . . . . . . 37

2.5 BI process model: pre-processing (the transformation phase) . . . . . . . . . . . . . . 40

2.6 BI process model: application of the data mining algorithm . . . . . . . . . . . . . . . . 47

2.7 BI process model: post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50

Activities
Exercise 2.1 Distinctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Exercise 2.2 Non-integer discrete variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Thinking point Dirty data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Thinking point Reduction of feature and instance space . . . . . . . . . . . . . . . . . 46
Self-assessment questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


Learning outcomes for this Chapter

After working through this Chapter you should be able to:
! explain some of the key concepts and distinctions which apply to the full range of BI algorithms
! appreciate why reducing the data space (both in terms of features and instances) is often a key prerequisite to effective BI learning
! identify the most appropriate techniques to apply at the pre-processing stage to reduce the size of the data space
! explain the use of training and test data sets and the potential problem of over-fitting
! appreciate the importance of post-processing for assessing the true value of any output from BI algorithms.


Introduction

There is a range of issues that we should be aware of when considering any BI solution, such as when we are looking to discover new knowledge and when we are using that knowledge in a predictive or associative way. In addition there are a number of ‘technical’ issues which are relevant to most of the techniques introduced in later chapters (in particular Chapter 4). For example, what options exist to cut down the solution space to make the BI task more manageable? (feature selection, sampling, etc.). Or, how would we ensure that the model remains generalisable so it is effective on unseen data? (over-fitting, sensitivity analysis, etc.). Or, how can we compare the performance of two or more competing models? (‘lift’ and ROC curves). These issues are covered in this Chapter, beginning with some general observations and distinctions, and a reminder of some useful background mathematical definitions, before moving on to the use of a BI process model to consider general techniques of relevance at each stage in that process.


2.1 Some important distinctions

It is important to be aware of a couple of basic definitional distinctions when discussing aspects of BI. Sometimes these are important for distinguishing between tasks (e.g. are we attempting to classify – i.e. decide on a class – or do we wish to regress – i.e. compute a value?), while at other times it is important to understand that although different terminology is being used the basic approach is the same (an inevitable result of the range of disciplines involved in the field of BI). In Chapter 1, I attempted to introduce a couple of key distinctions, so we begin with a reflection on those before introducing a couple more.

Exercise 2.1 Distinctions Try your hand at summarising the difference between:
(a) statistics and data mining
(b) patterns and models.

Later we will look at further sets of distinctions, including those between discovery, prediction and association. One further set of distinctions that is worth making at the moment is between the various types of machine learning that can take place within BI – as many of the techniques and algorithms are often introduced in terms of whether they are ‘supervised’, ‘unsupervised’, etc.

Types of machine learning

The discipline of machine learning has grown up over the past 30 years, mostly having its roots in the AI research community. It has been primarily concerned with attempting to create computers (and algorithms) that can learn things for themselves. Michalski, Bratko and Kubat (1998) give the following useful definition:

The general label of machine learning is usually reserved for artificial intelligence-related techniques, especially for those whose objective is to induce symbolic descriptions that are meaningful and understandable and at the same time help improve performance. In a broader understanding, though, the machine-learning task can be defined as any computational procedure leading to an increased knowledge or improved performance of some process or skill such as object recognition.

(Note that within this definition there is already another implied distinction – that between symbolic and non-symbolic descriptions. We shall come to this distinction in Chapter 5 when we discuss artificial neural networks (ANNs), but would simply note here that if the more restrictive definition of machine learning implied in the first part of the definition above were to be adopted then no ANN-based approaches could be considered to exhibit 'machine learning' – as they, by definition, use sub-symbolic approaches for representation.)

Within data mining, which of course has a much more limited history, some of the distinctions made within machine learning have been adopted and – necessarily – adapted, as we are primarily interested in data mining with tools that enable humans to learn from data sets. The main types of machine learning referred to within the BI context include:
- supervised learning
- unsupervised learning
- reinforcement learning
- ensemble learning.


Supervised learning
This class of learning algorithm requires that the example set contains both inputs and matched outputs (the result class). The objective of the algorithm is then to define a predictive function which will be able to correctly identify the outcome class of as yet unseen examples.

Unsupervised learning
These learning algorithms operate in the absence of any outcome data in the example set. Clearly they cannot make predictions, as they have no 'knowledge' as to what the resulting outcome in any situation might be. Rather, they are used to provide descriptive models which explore the data space and uncover general patterns which may exist. Algorithms which discover clusters within data sets provide a good example of unsupervised learning.

Reinforcement learning
Algorithms within this class are a more general version of supervised learning and are only really applicable to limited domains within AI, such as intelligent agents. Basically, they rely on the presence of a 'teacher' to provide feedback as to how they are performing. So rather than having a complete training set of examples available at the outset (as in the case of supervised learning), these approaches tackle a range of scenarios as they are presented and receive feedback on their performance on each. They then have to adapt their behaviour in the light of this feedback in an attempt to learn how to perform more effectively in the future.

Ensemble learning
Strictly, this is not a different specific type of learning but rather represents a method of bringing together the outcomes of other learning algorithms. Suppose a learning algorithm produces a hypothesis (a method of mapping from the input to output space) which has an error rate of 1 in 10. If we could generate another four hypotheses using different algorithms, then for any given future example we would look at the outcome of the five hypotheses and use majority voting to decide on the outcome. If we assume each of the additional hypotheses has a similar error rate (i.e. 1 in 10) and that the classification errors made are independent, then it can be shown that the error rate from the 'ensemble' of the five hypotheses is less than 1 in 100. Of course the assumption that the errors are independent is unrealistic – some cases are by definition hard to classify and will remain so across algorithms – but the resulting classification will still be better than using any algorithm on its own. Rather than simply using majority voting, most ensemble learning methods use more sophisticated approaches to adding hypotheses. In particular the method of boosting using weighted training sets is commonly used. A useful overview of the area and an introduction to AdaBoost, the most widely used algorithm, can be found at: http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91eaa-da13ff3cfc.
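The 'less than 1 in 100' claim above follows directly from the binomial distribution: the majority vote is wrong only when at least three of the five independent hypotheses are wrong. A short sketch of that calculation (the function name is mine, not from the text):

```python
from math import comb

def majority_error(p: float, n: int) -> float:
    """Probability that a majority of n independent hypotheses,
    each with individual error rate p, are simultaneously wrong."""
    k_min = n // 2 + 1  # smallest number of wrong votes that flips the majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(round(majority_error(0.1, 1), 4))  # one hypothesis alone: 0.1
print(round(majority_error(0.1, 5), 4))  # ensemble of five: 0.0086, i.e. under 1 in 100
```

The exact figure is 0.00856 – a better-than-tenfold improvement over the single hypothesis, provided the independence assumption holds.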


2.2 Types of data and variables

A variety of different types of data can be held within the databases to which BI algorithms are applied. From a database management point of view these would be seen as data types which would be specified during table definition as being in one of a variety of formats. A statistician would tend to think in terms of variables which would be used to represent any particular data attribute we might measure, manipulate or compare. Whatever we call them (we shall use the term 'variable', but see the note below on differences in terminology), the measurement scale used for these variables broadly divides them into one of two major types – qualitative/categorical variables and quantitative variables. Within these two broad groupings there are some slightly more subtle distinctions that we can make. Mostly these are pretty straightforward, but we shall use the data set shown in Table 2.1 for illustrations as we make these definitions.

Starting with the qualitative or categorical data we have two main types: nominal and ordinal. A nominal variable is one which only allows for a simple classification into one of a number of categories. It is not possible to quantify or rank these categories. The variables Gender and Faculty in Table 2.1 are examples of nominal variables. The other type of categorical variable – the ordinal variable – is again allocated into a category, but in this case the categories can be ranked in some way. Thus we cannot quantify how much more/less based on ordinal variables, but we can make rank comparisons – 'high earners' have larger salaries than 'low earners', etc. In our sample data in Table 2.1, Project grade is an ordinal variable allowing us to rank the students in terms of project performance (Mouse, Duck, Pie – in order of highest to lowest using the conventional grading system). Some authors make a distinction between 'order categories' and 'ranked' ordinal variables, but I can't say that I find this distinction useful either in theory or in practice.

As far as quantitative variables go there are a couple of useful sub-divisions, though in practice these are often treated in the same way by data mining and other algorithms. All quantitative variables can be used not only to rank items but also to answer the question 'how much more?'. However, a ratio variable is stated with respect to some assumed absolute zero point. In our sample data in Table 2.1, both Course mark and DoB are examples of ratio variables. In the case of marks the assumed reference level is 0%. Thus we can not only say that Mouse scored 36 more marks than Pie but also that his performance was roughly twice as good. In a similar way, because the DoB variable is set with reference to the year 0 AD, we can not only say that Duck is 20 years older than Mouse but also that he is more than twice his age (i.e. 37.3 vs 17.3 years old – at the time of writing, 1 September 2003). In the case of an interval variable no absolute reference point is known. Thus while the question 'how much more/less?' can still be answered, it may still not be possible to state the proportional difference. For example, in our data set in Table 2.1, I have included a second version of the date of birth field, DoB2. This assumes we had failed to format the variable correctly as a 'normal' date

Table 2.1 Table of student records with attributes (variables) of different types

Name      Gender  Faculty  Course mark  Project grade  DoB        DoB2
D. Duck   M       Arts     65%          B              21/5/1966  22787
M. Mouse  M       Science  71%          A+             21/5/1986  30092
T. Pie    F       Arts     35%          C              01/6/1986  30103
…         …       …        …            …              …          …


format and ended up with the 'Windows' representation of the date. In this interval representation (assuming you realised these referred to days but did not know that all Windows dates start from 1/1/1900) you would still be able to say that Mouse was 11 days older than Pie and that Duck was about 20 years ((30092 – 22787)/365 = 20.01) older than Mouse. However, you would have no idea of the absolute age of any of the three students and therefore would not be able to make statements such as 'Duck is more than twice as old as Mouse', etc.

Another distinction that is sometimes applied to quantitative variables is that between discrete and continuous variables. Discrete variables can be thought of as being applied to things that are 'counted', while continuous variables are used for values that are 'measured'. The quantitative variables shown in Table 2.1 are all continuous, as would be such attributes as student height, weight, etc. On the other hand, if we had variables representing 'number of exams completed' or 'number of children' then these would be discrete variables. Sometimes people think of discrete vs continuous in terms of the data type involved – e.g. integer vs real numbers. This is not really helpful. In one sense even continuous variables are held as 'discrete' values at some level of resolution. Thus if we measured height in metres we might need a real variable (height = 1.96), but if we expressed this as millimetres (height = 1960) an integer value would do – both represent continuous variables. Or, we could choose to express age in terms of years, years and months, days, or even minutes. Whichever representation or resolution we chose, this would be a continuous variable as it represents a way of measuring age, while a 'surrogate' for age in years expressed as 'number of birthdays celebrated' would be a discrete variable as it represents something that is counted.
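The ratio/interval contrast with the DoB2 serials can be made concrete in a few lines. Using the serial values from Table 2.1, differences are meaningful without knowing the epoch, but age ratios require an absolute reference; the `analysis_serial` value below is a hypothetical serial for the 'time of writing' chosen to be consistent with the ages quoted in the text, not a value from the table:

```python
# Serial dates as in Table 2.1 (the epoch is unknown to the analyst)
duck, mouse, pie = 22787, 30092, 30103

# Interval scale: differences are meaningful even without the epoch
print(pie - mouse)                     # 11 -> Mouse is 11 days older than Pie
print(round((mouse - duck) / 365, 2))  # 20.01 -> Duck is ~20 years older than Mouse

# Ratio statements ('twice as old') need an absolute reference point.
# Only once we fix an epoch can ages be computed; 36404 is a hypothetical
# serial standing in for 1 September 2003:
analysis_serial = 36404
print(round((analysis_serial - duck) / 365.25, 1))   # 37.3 -> Duck's age in years
print(round((analysis_serial - mouse) / 365.25, 1))  # 17.3 -> Mouse's age in years
```

Note that the first two statements would be valid for any epoch, while the last two collapse if `analysis_serial` and the DoB2 values were measured from different reference points.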
From a practical point of view the distinctions between the various types of quantitative variables are not that important, as the algorithms tend to treat quantitative data in the same way. If anything, the more important comparison is with ranked categorical data (i.e. ordinal variables), as many algorithms operate by first coding quantitative data into a variable of this form. Whatever the type of variables, the important point in BI algorithms is to look for relationships between them. In the context of relationships one other distinction should be noted: that of independent and dependent variables. The independent variables are those from which we hope to be able to predict or classify, and they are also referred to as features, attributes, explanatory or input variables. The dependent variable is the one which we hope to be able to predict, and this is also known as the target, outcome, response or output variable.

Exercise 2.2 Non-integer discrete variables
Perhaps the reason why the data type of variables is sometimes used to explain the difference between discrete and continuous variables is that discrete variables tend to be of type integer. Can you think of examples of discrete variables which would be represented by non-integer variables?

A further note on terminology
One of the reasons why the terminology associated with data mining and business intelligence has become so confused is that, as we have already noted, the area spans a number of disciplines. A computer scientist might refer to a 'tuple', a database programmer to a 'record', a data analyst to an 'instance', a statistician to a 'sample' and a mathematician to an 'n-dimensional datapoint' or 'vector' – and they all refer to the same thing! I have tried to maintain some kind of consistency in my descriptions of algorithms, etc., sticking in the main to the 'main-stream' database usage with the term records, which are themselves made up of a set of attributes. However, at times I might


slip into 'statistics-speak' and refer to instances and features. Only where the algorithm is fairly mathematical and benefits are gained from using shorthand references such as 'vector' do I use these terms, and I will always try to explain what is meant. The table below summarises some of the common terminology and how terms are used in different disciplines.

A little mathematics
There is inevitably a bit of mathematics involved in most BI algorithms. We do not have space here to provide details of the maths involved, but I will note some key elements and make reference to some resources which can be consulted by those who wish to explore the mathematics in a little more detail.

The most useful basic mathematical knowledge relates to linear algebra – specifically to vectors and matrices. It is often useful to think of data instances as vectors in n-dimensional space (where we have 'n' data attributes). Using vector algebra it is then possible to calculate how similar two items are by measuring the distance between them. This approach is used in clustering and nearest neighbour classification. It can also be used to assess the performance of a predictor (or classifier) by measuring the distance between the predicted and the actual values as they would be represented in the solution 'space'. A variety of aspects of vector algebra, such as vector addition and the scalar product, are useful in this regard. The related area of matrix algebra is also very useful, as vectors tend to be represented as matrix components and transformations such as vector scaling or rotation rely on matrix algebra. In addition there are some slightly more complex elements of matrix algebra which are of use in a number of BI algorithms. These would include the use of eigenvalues within principal components analysis (PCA), among others. An excellent introductory overview of those aspects of linear algebra of value to the BI practitioner is given by David Barber, at: http://anc.ed.ac.uk/~dbarber/lfd1/lfd1_supp_maths.pdf.

In addition, some basic knowledge of probabilities and statistical distributions can be of value in a number of the BI algorithms we cover in Chapters 4 and 5. I will introduce a little of this material, particularly with regard to Bayesian probability, when we come to the relevant algorithms. In the meantime, another summary prepared by David Barber has an excellent section (Section 6) on statistical and probabilistic distributions. He also notes that it is useful to have a knowledge of partial differentiation, especially as applied to problems of finding maxima and minima for functions: http://anc.ed.ac.uk/~dbarber/lfd1/lfd1_2003_prelim.pdf.

Table 2.2 Terminology: records/attributes/data types/etc.

Database administrator  Computer scientist  Statistician   Mathematician
database                table               sample         set of vectors/matrix X
record                  tuple               instance       n-dim datapoint (vector) x1
attribute               attribute           variable       point-value x14 = X41
data type               class               variable type  value restrictions


2.3 A BI process model

You will note that this section is not entitled 'the' BI process model – i.e. there is nothing special or definitive about the model I have selected to use; I just feel that it is a useful description of the main stages involved in applying BI techniques and algorithms which allows us to introduce a range of generic issues in a structured manner. We will be looking at specific approaches in Chapters 4 and 5, but before doing so there are a number of general concepts involving terminology and techniques which are relevant across a range of approaches within BI.

The DM/BI process model illustrated in Figure 2.1 is used to place these in context/order. In the next two sections we look at aspects of data pre-processing. I have chosen to organise the discussion around the various data cleaning and data transformation activities. However, alternative perspectives exist and some authors describe pre-processing in terms of at least the following four stages:
- data cleaning
- data integration
- data transformation
- data reduction.

Figure 2.1 A simple model of the BI process
(raw data → data cleaning → data warehouse → selected data → data transformation → BI algorithm → models and/or rules → post-processing → interesting and useful models, patterns or rules)


2.4 BI process model: pre-processing (the cleaning phase)

Without doubt one of the most time-consuming and tricky areas of BI is the pre-processing of data before any data mining or other analysis can be carried out. The importance of data cleaning depends on the type of learning activity being undertaken. For example, if association rules (or other 'pattern'-oriented outcomes) are the goal, then the inclusion of even small numbers of incorrect records (and in particular outliers) can cause major problems. In other cases it may be possible to use statistical techniques (such as sampling) to effectively reduce the level of 'dirty' data. In general the Pareto principle holds here, in that 80% of your effort will be taken up with the 20% of the data which will need some sort of 'cleaning' attention (in fact the proportions are probably closer to 98% effort for 2% of data).

The scope of exactly what is involved in this phase varies from one author or perspective to another, but typical estimates are that it accounts for between 60% (Cabena et al. 1998) and 90% (Kim et al. 2003) of activity within the BI process. (In a recent poll carried out by the Kdnuggets team, the mean amount of time that data mining researchers estimated they spent on pre-processing activities was 65-70%: http://www.kdnuggets.com/news/2003/n19/1i.html). Indeed, those writing from the perspective of data warehousing have their own term for this important phase – extraction, transformation and load (ETL). There are specialist companies whose sole focus is on providing ETL software and/or services.1 The area of data acquisition within the ETL discussion includes a range of topics which I do not plan to cover here, including:
- data extraction (frequency and timing of data extracts)
- detection of changes in source data
- capture and transformation of meta-data, etc.

In this section we concentrate on data cleaning, which is most closely associated with the 'E' and 'L' elements of ETL, while in the subsequent section we cover a range of transformation-related ('T') issues including data coding, levels of data summary and feature selection/reduction.

Thinking point: Dirty data
When you yourself fill in data capture forms, what sorts of errors do you typically make? Are there ways in which users could be 'constrained' to make fewer such errors?

The basic problem when considering any data set upon which some type of BI analysis is to be performed is that the real world is not perfect – i.e. you will always be working with 'dirty' data. Even more than in other aspects of data processing, the old adage GIGO (garbage in, garbage out) applies here, and so the importance of identifying and fixing (where possible) data problems cannot be over-stated. 'Dirty' data can be broadly classified into three categories:
- missing or incomplete data
- wrongly entered data (often referred to as 'noisy' data)
- non-standard or inconsistent data.

In fact there are many more categories than this, and each of these contains a number of sub-categories. In a paper by Kim et al. (2003), 'A taxonomy of dirty data', the authors identify around 35 specific classes of dirty data and still make the statement that 'we are only confident that our taxonomy is about 95% comprehensive'. For example, they subdivide wrongly entered data into categories where (a) it should have been possible to pick up the problem through integrity constraints in the initial database and where (b) this would not have been possible. Then, within the dirty data classed as being wrongly entered with non-enforceable integrity constraints, they have a sub-division for errors involving a single field, and within this class they have a further three categories:
- erroneous entry (e.g. age mistyped as 26 instead of 25)
- misspelling (e.g. 'principle' instead of 'principal') – though for me this is equivalent to the first class
- extraneous data (e.g. name and title entered in a name-only field).

1 See the document 'The BI market and future trends' on the module website.

While an identification of all types of 'dirty' data may be a useful exercise, and the Kim et al. (2003) paper makes interesting reading, the simpler three-category model mentioned above has the advantage of helping give focus to what can be done about the problem. Techniques exist to address each of the main categories, specifically to:
- handle missing values
- identify outliers and remove, or smooth out, 'noise'
- correct inconsistent entries.

(1) Handling missing values
The most important initial question to ask in this circumstance is whether there is some identifiable (systematic) reason for the omission or whether the data are missing at random (MAR). If there is a systematic reason then it may be possible to remedy the situation. For example, in a recent analysis of Scottish prescription data we found that 'deprivation index' figures were missing for certain GP practices, though there was no initially apparent regularity or location. On closer investigation it became clear that the practices concerned all referred to patients from a particular set of postcode areas. It transpired that these were areas whose postcode boundaries had changed between the 1991 and 2001 censuses (from which data the 'deprivation index' is derived) and that a relatively simple update of the data could be made to reclaim 95% of the missing data of this type. In other cases it may not be possible to get to the root of the systematic cause, or to do anything about it even if the cause is identified. However, if it is likely that the cause is systematic then it is almost never appropriate to use the approaches described below for MAR data.

If data is truly missing at random then there are a number of potential solutions, as outlined in Table 2.3. (In the table, dm denotes data for which a particular field is missing, while dk denotes those cases where the data is known.)

For a more complete discussion of the above, and much more besides, see Little and Rubin (1987) Statistical Analysis with Missing Data.

Table 2.3 Methods to estimate values for missing at random (MAR) data points

Approach                                    Advantages                        Disadvantages
Replace data with its mean value.           Easy to do.                       This simple approach can only be justified in some very limited circumstances.
Look for similar sets of inputs and infer   A heuristic approach to a full    Given the amount of work, why not go to a full density model?
the missing values from these examples.     distribution density model
                                            (easier to achieve).
Find a model for P(dm|dk) – 'missing        Best estimate of MAR values.      Harder to achieve.
given known' – and use the average
estimated value.
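The first two approaches in Table 2.3 can be sketched in a few lines. This is an illustrative toy, not a production imputation routine: the records, field names and the nearest-neighbour helper `similar_fill` are all invented for the example, and the 'similarity' here is simply closeness in age:

```python
from statistics import mean

records = [
    {"age": 25, "income": 21000},
    {"age": 26, "income": 23000},
    {"age": 52, "income": 48000},
    {"age": 53, "income": None},   # income missing at random
]

# (a) Replace the missing value with the field mean: easy, rarely justified
known = [r["income"] for r in records if r["income"] is not None]
mean_fill = mean(known)

# (b) Look for similar sets of inputs: average the income of the k records
#     whose age is closest to the incomplete record's age
def similar_fill(target, k=1):
    donors = sorted(
        (r for r in records if r["income"] is not None),
        key=lambda r: abs(r["age"] - target["age"]),
    )[:k]
    return mean(r["income"] for r in donors)

print(mean_fill)                      # ~30667: dragged down by the younger group
print(similar_fill(records[3], k=1))  # 48000: taken from the most similar record
```

The contrast makes the table's point: the mean fill is trivially cheap but implausible for this record, while the similar-inputs heuristic approximates what a full density model P(dm|dk) would give, at a fraction of the effort.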


(2) Identification of outliers and removing or smoothing 'noise'
The identification and appropriate treatment of 'outliers' is an area fraught with difficulties. For example, clustering techniques can be used to highlight potential outliers, but remember that these are the very same approaches being used to uncover 'unusual but valid' groupings. There are fairly well accepted methods from statistics of defining what an 'outlier' is. For example, when plotting a continuous numerical variable using a box-plot, the statistical package Statistica defines an outlier as 'a value more than 1.5 times larger or smaller than the width of the box' (normally set at ± 2 standard deviations of the mean). When working with two, or certainly more than two, variables simultaneously the identification of outliers can become very tricky and some automated tools may be necessary. However, it should be remembered that an outlier is determined not only by its 'unusualness' but also by its infrequency of occurrence. If an 'unusual' data point begins to appear with any sort of regularity then it is more likely that either there is a systematic measurement problem or that these multiple points are not outliers at all but an interesting cluster of unusual events.
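A box-plot-style rule of the kind described above can be sketched for a single variable. Note this uses the common 1.5 × IQR convention, where the 'box' runs between the quartiles, which is one standard definition rather than Statistica's exact one; the data values are invented:

```python
from statistics import quantiles

def iqr_outliers(xs, k=1.5):
    """Flag values lying more than k times the interquartile range
    (the 'box' width) beyond the quartiles -- a common box-plot rule."""
    q1, _, q3 = quantiles(xs, n=4)   # lower quartile, median, upper quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

marks = [35, 41, 44, 47, 51, 53, 55, 58, 62, 65, 71, 98]
print(iqr_outliers(marks))  # [98]
```

As the surrounding text warns, a flag like this is only a prompt for investigation: if values like 98 start appearing regularly, they are a cluster or a measurement problem, not outliers.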

(3) Correcting inconsistent entries
Most data cleaning tools allow a range of options for the correction of inconsistently entered data. As a very simple example, in a hospital database it may be found that the gender of patients has been entered in a range of different ways – 'M/F'; 'male/female'; 'man/woman'; etc. In this case it would not be difficult to run an automatic transformation routine to change these various representations into the standard ISO gender coding of (0: unknown / 1: male / 2: female / 9: not applicable). In a similar manner, fields dealing with time and other units of measure could be standardised, such that 'mth/dd/yy' formatted fields become 'yyyy/mm/dd', or 'lbs' format fields become 'kg', etc.

The above examples do not necessarily represent inconsistency of entry, since each of the 'formats' may well have been carefully adhered to at the time of input. Rather, they reflect the almost inevitable inconsistency which arises when data 'fusion' (the bringing together and 'blending' of data from disparate sources) takes place within the data warehouse. Where a more trivial level of inconsistency occurs, such as the misspelling of a customer's name, there are again tools that can attempt to spot and rectify these problems. It may be possible to run 'fuzzy matching' on customer names, for example, and to use additional information relating, say, to address or age as a means of confirming that two apparently different people are in fact a single customer.

Assuming that the data has now been cleaned as well as possible, a number of transformations may still need to take place before we are ready to initiate data mining or knowledge discovery activities. A range of possible types of transformation are outlined in the next section.
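The gender standardisation mentioned above is easy to sketch as a lookup-table transformation. The mapping dictionary and the fallback-to-unknown behaviour are my illustrative choices; a real routine would be driven by a fuller synonym list:

```python
# Map the inconsistent encodings from the hospital example onto the
# ISO gender coding (0: unknown / 1: male / 2: female / 9: not applicable)
ISO = {"m": 1, "male": 1, "man": 1, "f": 2, "female": 2, "woman": 2, "n/a": 9}

def standardise_gender(raw):
    # Normalise case and surrounding whitespace before looking up;
    # anything unrecognised falls back to 0 (unknown)
    return ISO.get(str(raw).strip().lower(), 0)

dirty = ["M", "female", " man ", "F", "??"]
print([standardise_gender(g) for g in dirty])  # [1, 2, 1, 2, 0]
```

The same pattern (normalise, look up, fall back to a sentinel) applies to unit and date-format standardisation, with a conversion function taking the place of the dictionary.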


2.5 BI process model: pre-processing (the transformation phase)

Once the data has been cleaned and loaded into the data warehouse, a number of different types of transformation may still be needed before the data mining phase. The data may have to be coded into a structure suited to the learning algorithm, or it may have to be normalised in some way. Following this coding it may still be necessary to reduce the data space being considered by approaches such as feature selection, dimensional reduction or sampling. These various transformation activities are introduced in the following sections.

Coding and normalising data
There are a variety of reasons why the raw data in a data set may require transformation before applying a BI algorithm. For example, you may have a continuous quantitative variable whereas the algorithm requires categorical variables to be able to operate. Thus you might want to use a Chi-square Automatic Interaction Detection (CHAID) algorithm to create a decision-tree model of when to lend on a mortgage request. Some of the attributes relating to new potential loans would already be categorical – such as type of property to be purchased, employment status of applicant, etc. However, others, such as total loan requested, will be continuous. Most CHAID implementations will automatically create discrete categories from continuous variables (referred to as 'bins'). Let's assume that we request that five bins be used and that the total loan values ranged from £1,000 to £500,000 – the default bins, based on 'equal-width-interval' (EWI) criteria, would be £1,000–£100,000 and so on up to £400,000–£500,000. This would almost certainly give us a problem, as 95% of the loan requests are likely to fall within the first 'bin' – i.e. a few very large loan requests have skewed our data, which means that the EWI approach is inappropriate. There are a number of ways to adjust for skewed data and other problems. One of the simplest in the example just given would be to use an 'equal-frequency-interval' (EFI) approach instead of equal-width (EWI). This means that the bins would be set such that the number of instances in each bin is roughly equal. This can, however, lead to different problems, and other mathematically-based normalisation techniques such as 'zero-mean' and 'min-max' strategies exist that would address these. One of the most commonly used is to rescale a variable u to a new variable x such that x will have a mean of zero and a standard deviation of one. This is achieved through the following simple transformation:

x = (u – ū)/s

(where ū is the mean of the initial variable u, in the training set data available, and s is the corresponding standard deviation).

Another example of where coding might be required is in the case where some nominal categorical variable might require a more 'meaningful' mathematical representation. Suppose we were attempting to cluster some mortgage applicants and the attributes available included age and profession. If we used a numerical list to represent the profession category, what values would we use and how would we allocate them? Would a doctor be given the value 9, a lawyer 4 or an architect 1? Would this imply a lawyer was more like a doctor than an architect, etc.? Clearly, allocating numerical values and assigning some implied ranking can lead to problems. Another option is to use a '1-to-m' coding which avoids these problems. In this case we create 'm' different profession attributes and allocate a single profession to each – thus a butcher becomes (1,0,0), a baker (0,1,0) and a candlestick-maker (0,0,1) [here we have only 3 professions and therefore require a '1-to-3' coding]. Thus for instances in our data set we would make a mapping such as that shown in Figure 2.2.

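Both the zero-mean rescaling and the EFI binning discussed above can be sketched briefly. The loan amounts are invented, chosen so that one large value skews the data as in the mortgage example; the `efi_bins` helper is a naive equal-count split, not a production discretiser:

```python
from statistics import mean, stdev

loans = [12_000, 25_000, 40_000, 55_000, 90_000, 140_000, 500_000]

# Zero-mean, unit-standard-deviation rescaling: x = (u - u_bar) / s
u_bar, s = mean(loans), stdev(loans)
z = [(u - u_bar) / s for u in loans]

# Equal-frequency-interval (EFI) binning: roughly equal counts per bin,
# rather than equal widths (which the £500,000 request would dominate)
def efi_bins(values, n_bins):
    ordered = sorted(values)
    size = len(ordered) / n_bins
    return [ordered[int(i * size):int((i + 1) * size)] for i in range(n_bins)]

print([round(v, 2) for v in z])  # mean ~0, spread ~1
print(efi_bins(loans, 3))        # three bins with 2-3 loans each
```

With EWI over the same data, six of the seven loans would fall in the first fifth of the range; the EFI split instead keeps the bin populations balanced, at the cost of very unequal bin widths.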


Various other coding options are frequently used – for example the scaling of any continuous variable to be included within an ANN into the range 0–1 – but we do not have space to cover all of these in detail here, and the notes given should be sufficient to make the point that this is an important pre-processing phase that involves careful consideration of the 'shape' of the raw data.

Feature selection and dimensional reduction
The whole point of data mining tools is to try to discover (uncover) knowledge within the massive databases generated by today's technology. Without automated assistance these data will never be analysed. However, even with DM tools the data sets can be unmanageable. If we have a simple two-dimensional table with F features (attributes) and N instances (records) then our data set is obviously of size F × N and increases linearly with either dimension. To make the task of the data mining algorithm more straightforward we can apply techniques which reduce the space in either the F or N directions. We discuss the Feature dimension first by looking at the techniques of feature selection and feature reduction (or dimensional transformation), before looking at ways to reduce the Instance dimension, principally through sampling and segmentation.

Feature selection
We begin by looking at features because many of the analytical BI routines will look hierarchically or recursively (i.e. all non-included features will have to be reassessed for inclusion in the model on each pass through the data set) at the Feature-space, and it is thus the size of F that is critical. (This is in addition to the obvious point that for many real-world data sets F << N – i.e. there are far more instances/records in our data set than there are features/attributes – so a reduction in the feature-space will be relatively much more significant than a reduction in the instance-space.) The purpose of 'feature selection' is to select a minimal subset of features which represents the data set in such a way that the task at hand can be achieved as well as (or indeed better than) would be the case if the full data set were used. In many ways the rationale of feature selection is captured in the slogan 'less is more'.

Basically, feature reduction is a process of removing features that are redundant (i.e. where one feature is a 'surrogate' of another) or irrelevant (of no consequence to the analysis at hand). If x such features are identified then the data set size is reduced to (F − x) × N, and in addition it may be possible to reduce N due to the likelihood of duplicate instances given the reduced feature set. As well as reducing the total size of the data set, this has the added advantage that the outputs of any analysis will be more concise and therefore should be simpler to visualise and interpret. Liu and Motoda (1998) provide an excellent discussion of a range of issues relating both to feature selection and reduction in their text Feature Selection for Knowledge Discovery and Data Mining. We obviously cannot summarise a 150-page book in half a page, but the diagram shown in Figure 2.3 captures some of the key elements.

Figure 2.2 A ‘1-to-m’ coding transformation on some simple personal data

Age   Profession        Age   Prof 1   Prof 2   Prof 3
23    Baker        →    23    0        1        0
41    Butcher      →    41    1        0        0
…     …            →    …     …        …        …
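The coding in Figure 2.2 can be sketched as follows. The category order (Butcher as Prof 1, Baker as Prof 2) is inferred from the figure’s rows, and ‘Grocer’ is a made-up placeholder for the unnamed Prof 3 category:

```python
# A sketch of the '1-to-m' coding of Figure 2.2: the categorical Profession
# column is replaced by m binary columns, one per distinct category.

def one_to_m(records, column, categories):
    coded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for i, cat in enumerate(categories, start=1):
            row[f"Prof {i}"] = 1 if r[column] == cat else 0  # exactly one flag set
        coded.append(row)
    return coded

people = [{"Age": 23, "Profession": "Baker"},
          {"Age": 41, "Profession": "Butcher"}]
# Category order as inferred from the figure; 'Grocer' is a hypothetical third value.
coded = one_to_m(people, "Profession", ["Butcher", "Baker", "Grocer"])
for row in coded:
    print(row)
```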

Figure 2.3 Some key elements of feature selection – a search strategy, a generation scheme and an evaluation measure – after Figure 3.1 of Liu and Motoda (1998, p44)


Feature reduction/dimensional transformation

Rather than selecting particular features, an alternative technique is to transform them in some way so that the total number of features is reduced. For example, we might be looking at customers’ likely spend on a new product. Rather than using separate variables to represent income, savings, property ownership, etc., we could create a single ‘wealth indicator’ variable. Sometimes background domain knowledge such as this can be used to reduce the number of features but this is not usually the case. (Also, such ‘simplistic’ dimensional transformation can be a mistake. For example, what if ‘high earner/low savings’ customers have very different buying patterns from ‘low earner/high savings’ customers? Our single ‘wealth indicator’ variable will tend to blur any differences between these groups.) More sophisticated mathematical techniques exist to reduce the number of dimensions. Among the most widely used are principal component analysis (PCA), linear discriminants and Kohonen nets (covered in Chapter 4). In fact these are all examples of dimensional reduction using linear projection. Other, more sophisticated, non-linear reduction techniques, such as auto-encoders, exist but are well outside the scope of our discussion here.

Principal component analysis (PCA) is perhaps the most widely known approach to dimensional reduction. PCA involves a non-probabilistic mapping which attempts to find a low-dimension coordinate system which may appropriately represent data from the higher-dimensional space. Formally, the PCA approach is driven by an algorithm which uses matrix algebra to find the most significant eigenvectors of the data set.
We do not have time to go into the mathematics here but those who are interested are directed to some material by David Barber (whose initial mathematical overview – including a brief introduction to eigenvectors – has already been recommended) at the following web site: http://anc.ed.ac.uk/~dbarber/lfd1/lfd1_2003_dim_red.pdf. Informally, the PCA approach is based on the assumption that data in high-dimensional space is unlikely to be randomly allocated throughout that space but will rather tend to be arranged in hyper-planes. The eigenvectors returned by the PCA algorithm identify the key (orthogonal) dimensions in the modified data space. One of the problems with the PCA approach is that it is difficult to interpret the output produced. The eigenvectors/values are generated simply as part of the algorithm and have no intuitive meaning. Thus neither plotting the data points in the reduced main eigenvector space nor using these as a direct model for prediction/classification is useful. However, assuming that the first three eigenvectors account for, say, 95% of the variability in the data, then using this new 3-dimensional (rather than the original F-dimensional) representation may allow a more efficient learning algorithm to be used.

One of the standard output plots of any PCA tool is a graph illustrating the ‘contribution’ made by each eigenvalue, in decreasing order, to the total. These ‘eigenvalue spectrum’ (or ‘scree’) plots, such as the one shown in Figure 2.4, help us decide how many dimensions to include in our reduced linear sub-space. These plots give us an idea of the number of degrees of freedom in the data set and in a sense the number of ‘intrinsic’ dimensions in the data. (In the case of Figure 2.4 and the Cars_by_Country data set there is clearly one dominant eigenvector and three others of moderate importance – in total these four account for around 96%.) However, in selecting the M most significant eigenvectors and ignoring the rest as ‘noise’ we may be making a big mistake – in certain problem domains it may be precisely this ‘noise’ that is useful in making clear classification decisions. In addition the PCA technique is linear and thus looks for dominant hyper-planes. If the data can be best mapped onto low-dimensional curved manifolds this cannot be modelled by a linear approach. Also, if there are groups of clustered instances within the data set these may not be well captured by the single linear transformation. Despite all these limitations and caveats PCA is

widely used (at least for initial analysis) in high-dimensional problems due to the relatively straightforward way in which the linear dimension reduction can be carried out. In David Barber’s introduction to PCA (noted above) he illustrates the use of PCA effectively to reduce a 483-dimensional representation for hand-written characters in the pattern recognition domain to just 10 dimensions!
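The eigenvalue spectrum behind a scree plot such as Figure 2.4 can be computed directly from the covariance matrix. The sketch below uses synthetic data (the Cars_by_Country set is not reproduced here), generated deliberately close to a 2-D hyper-plane so that the first two eigenvalues dominate:

```python
# A minimal PCA sketch: eigen-decompose the covariance matrix and report each
# eigenvalue's contribution to the total -- the numbers behind a scree plot.
import numpy as np

rng = np.random.default_rng(0)
# 200 instances in 5 dimensions, concentrated near a 2-D hyper-plane
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))   # small isotropic noise

cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]              # sorted descending
contribution = eigenvalues / eigenvalues.sum()
print(np.round(contribution, 3))  # the first two components dominate by construction
```

Plotting `contribution` against component index, in decreasing order, gives exactly the eigenvalue spectrum (‘scree’) plot described above.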

Sampling and other approaches to reducing the number of instances

An alternative to reducing the extent of the data space is to cut down on the size of N, the number of records or instances. Obviously we are talking significant reduction here (i.e. by factors of 100s or 1000s) as the size of N is typically very large (unlike F, which is typically relatively small). The area of sampling is well developed within traditional statistics, and knowledge from that domain should be made use of during this part of data pre-processing. (There are some who state that data mining is different from statistics precisely because in data mining all of the data are considered whereas statistics typically operates on samples. I do not subscribe to this view and feel that sampling has a useful role in many BI applications.) There is also a lot of nonsense talked about sampling, for example by those who claim that ‘adaptive’ sampling methods (such as stratified sampling) are required because ‘simple random sampling may have very poor performance in the presence of skew’ (Chen 2001). This is really only true for relatively small data sets (definitely not the case in data mining) and may well lead in turn to problems of over-fitting in the model.

Figure 2.4 Graph showing the eigenvalue spectrum (also known as a ‘scree plot’) for PCA analysis of the Cars_by_Country data set using all 8 independent variables (the nature of the data set is introduced in Chapter 3)


Summarisation

This is not strictly a sampling technique but it probably is the simplest method of reducing the instance space. Taking the example of the retail sales sector and ‘market basket’ analysis we can envisage a number of reasonable summarisations. For example, instead of storing information on all products that were in each customer’s basket, as in Table 2.4(a)...

... we could choose to hold only summary data on sales by product category, as in Table 2.4(b).

If we assume that the ‘average’ customer purchases 150 grocery items spread over 15 product categories this represents a 10-fold reduction in the instance space. This form of the data would be adequate for looking at customers’ overall buying habits or, for example, for clustering certain customers into a ‘very healthy’ grouping to target with leaflets on your chain’s latest range of organic produce. However, the lost level of product information does mean that specific product linkages could not be made (including, sadly, the now-famous ‘beer and diapers’ association). Nor would it be possible to see how an ad campaign for a specific product was affecting its sales, nor to compare ‘own brands’ against brand-leaders. Another example of summarisation in this domain would involve the time dimension. Rather than looking at the sales performance of your shops on a daily basis across the country, you could summarise to a weekly level. An obvious problem with this is that if one shop performed well during the week but did little business at weekends while another was the reverse (i.e. did a roaring trade on Saturday and Sunday) the summary data would totally mask this and indicate average performance over the week in both shops. Daily – and even hourly – data are used by some stores for the efficient control and planning of staffing levels. Thus summarisation can be a useful way to cut down the instance space, but it will always be at the expense of losing certain pieces of information. It should therefore be used with care and with a clear view of the ultimate goal of the data mining exercise. Summarisation should almost never be attempted at the time of data collection as information lost can never be recaptured (and it is almost impossible to know in

Table 2.4(a) An example of data stored for market basket analysis at the level of the individual items

Cust/Basket ID   Date         Product           Quantity   Value
C127093/B122     10/12/2003   Orange Juice      3          £2.50
C127093/B122     10/12/2003   Cola              6          £3.10
C127093/B122     10/12/2003   Sparkling water   1          £0.90
C127093/B122     10/12/2003   Apples            4          £1.20
C127093/B122     10/12/2003   Peaches           8          £1.99
etc.             …            …                 …          …

Table 2.4(b) Data for market basket analysis stored at a level aggregated according to major product categories

Cust/Basket ID   Date         Product area        Quantity   Value
C127093/B122     10/12/2003   Drinks (non-alc.)   10         £6.50
C127093/B122     10/12/2003   Fruit               22         £5.70
C127093/B122     10/12/2003   Dairy               4          £3.15
etc.             …            …                   …          …


advance whether certain data attributes will turn out to be of value in future analyses). Data should be collected in its ‘raw’ form but it is very possible that certain types of data mining will require that summarisation is carried out as part of the pre-processing stage.
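As a sketch, the move from Table 2.4(a) to Table 2.4(b) is a simple group-and-aggregate over an assumed item-to-category mapping. Note that the ‘Fruit’ totals printed here cover only the rows visible in Table 2.4(a); the table’s own Fruit row also includes items elided by ‘etc.’:

```python
# Summarising item-level basket rows (Table 2.4(a)) into category-level rows
# (Table 2.4(b)). The item-to-category mapping is an assumption for illustration.
basket = [
    ("C127093/B122", "10/12/2003", "Orange Juice",    3, 2.50),
    ("C127093/B122", "10/12/2003", "Cola",            6, 3.10),
    ("C127093/B122", "10/12/2003", "Sparkling water", 1, 0.90),
    ("C127093/B122", "10/12/2003", "Apples",          4, 1.20),
    ("C127093/B122", "10/12/2003", "Peaches",         8, 1.99),
]
category = {"Orange Juice": "Drinks (non-alc.)", "Cola": "Drinks (non-alc.)",
            "Sparkling water": "Drinks (non-alc.)",
            "Apples": "Fruit", "Peaches": "Fruit"}

summary = {}
for cust, date, product, qty, value in basket:
    key = (cust, date, category[product])
    q, v = summary.get(key, (0, 0.0))
    summary[key] = (q + qty, round(v + value, 2))   # aggregate quantity and value

for key, (q, v) in summary.items():
    print(key, q, f"£{v:.2f}")
```

The drinks row reproduces Table 2.4(b) exactly: 10 items for £6.50. The detail lost is exactly the item-level product information discussed above.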

Segmentation

Once again this is not really a sampling technique but it does lead to a reduction in the number of instances considered and does not suffer from the problems of loss of information associated with summarisation. This is a very straightforward technique which involves sub-dividing the data into segments and analysing each segment separately. So in the ‘market basket’ example given in the last section we might choose to analyse all the ‘drinks’ or ‘dairy’ data, i.e. segment the data by main product grouping. Notice that this reduces the data space both by one feature (the one on which we have chosen to segment) as well as by the number of instances. On average the reduction in instances will be in proportion to the ratio of the number of instances in each group of the segmenting attribute to the total number of instances.

In addition to the reduction in total size of the data to be analysed, some authors justify segmentation based on the ‘dilution of pattern’ effect that can happen in large data sets. Parsaye (1999), as reported in Chen (2001), noted that the larger the data warehouse the more interesting patterns it may contain. However, he also noted that patterns from different data segments may act to ‘dilute’ each other out and so the number of useful patterns obtained may paradoxically decrease as the warehouse grows. For example, staying with our retail sales example, you may find that in one region where your chain has opened up city centre shops (metro-stores) these have been particularly profitable, while in a different region the shops opened in out-of-town retail ‘parks’ have been the best performers. If you had looked at performance across all shops in your chain without segmentation you may have found that the competing metro/out-of-town rules in each region had cancelled each other out. This is in fact the manifestation within data mining of a well-known phenomenon in statistics referred to as ‘Simpson’s paradox’.
There are many references to this interesting paradox on the web – it appears to fascinate people, and not just statisticians! A good place to start is at Cut-the-Knot, where the related issues of Simpson’s paradox and mediant fractions are elegantly introduced (see: http://www.cut-the-knot.org/blue/Mediant.shtml). The paradox was first noted by Simpson (1951) when looking at the performance of students in higher education and analysing results by gender. He found that when looking at the student population as a whole girls did significantly better than boys, but that when the analysis was carried out on the same students based on their main subject discipline there was no such effect (or in some cases males appeared to perform better). However, this point should not be over-stretched, nor should the claims that segmentation can get around the problem be over-stated, as you have to know in advance the dimensions in which this problem is likely to arise! Thus I do not agree with Chen, who wrote, when encouraging segmentation: ‘looking at all of the data at once often hides the patterns because the factors that apply to distinct business objectives often dilute each other’ (Chen, 2001). Most statistical algorithms are well ‘aware’ of this danger, which is why they attempt to look at all factors together, including the possibility that ‘interaction’ factors may be at work. Thus in a simple analysis of variance (ANOVA) process applied to our retail chain profitability, the factor representing metro/out-of-town may not show up to be significant but there may be a clear significance seen in (metro/out-of-town × region) – the interaction factor (i.e. the factor is not of significance on its own, but can be seen when the analysis incorporates its interaction with other factors).
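A toy illustration of the paradox, in the spirit of Simpson (1951) – all counts are invented: boys out-perform girls within every subject, yet girls out-perform boys overall, purely because of how the two groups are distributed across subjects:

```python
# Simpson's paradox with invented pass counts: the per-segment trend reverses
# when the segments are pooled.
results = {                      # (passes, candidates)
    "easy subject": {"girls": (90, 100), "boys": (10, 10)},
    "hard subject": {"girls": (3, 10),   "boys": (35, 100)},
}

def rate(passes, candidates):
    return passes / candidates

for subject, groups in results.items():
    g, b = groups["girls"], groups["boys"]
    print(subject, "boys do better:", rate(*b) > rate(*g))   # True in every subject

# Pool the two subjects: sum passes and candidates per gender
totals = {sex: [sum(x) for x in zip(*(results[s][sex] for s in results))]
          for sex in ("girls", "boys")}
print("overall, girls do better:",
      rate(*totals["girls"]) > rate(*totals["boys"]))        # ...yet True overall
```

Girls cluster in the easy subject and boys in the hard one, so the pooled rates (93/110 vs 45/110) reverse the within-subject ordering – exactly the metro/out-of-town cancellation described above.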


Sampling

The most common approach to sampling is simple random sampling. As the name suggests this involves a piece of software which randomly selects a sub-set of instances from the total population in the database. There are heuristic rules from statistics which give guidelines as to the appropriate size of such sub-sets. Random sampling can perform poorly in the presence of skewed data but this is normally only a problem with small sets of initial data. Some problems can be addressed by stratified sampling. In this approach the sampling is still random but the additional constraint put on the sampling process is that the proportions in the sampled sub-set for certain attributes must be the same as the proportions in the population. Therefore if the male/female breakdown of the customers in our retail sales example is 42%/58%, then a stratified sample would ensure that this proportion was maintained in the sampled set. The stratification can be over multiple attributes, such that if 11% of our customers are in the Scottish region and within this group the gender proportions are 35%/65% then the proportions in the sampled set will match these profiles.

In its classical usage within statistics, sampling is typically used before the collection of data – for example in sample size calculations to see, say, the number of patients likely to be required in a clinical trial to demonstrate a certain treatment effect. In the case of data mining, however, the data set already exists and therefore sampling can be applied in an iterative manner to the problem of selecting data. The basic approach to using sampling for instance reduction is to train a learning algorithm on increasingly larger random sub-sets of the data, look at the results and stop training when the improvement becomes marginal. Specific approaches include:

Dynamic sampling: at each iteration a small random sample is taken and used to train the mining algorithm.
The success of the resulting model is evaluated after each iteration. After each evaluation the sample size is increased by a constant number of instances until the difference in the evaluated result when compared to the last model is smaller than some pre-defined (by the user) level.

Window sampling: this was used by a number of the early induction tree algorithms. An initial tree is constructed based on a randomly sampled subset of the instances (a ‘window’). The remaining instances are classified using this tree and those which are incorrectly classified are then added to the ‘window’, which becomes the new training set from which a new tree is constructed, and so on. This is obviously only suited to a reasonably limited training set (it would be impractical to check for mis-classification of all instances in a large data warehouse). Windowing also tends to become quite inefficient in noisy environments.

While this discussion is by no means comprehensive, I hope it has been useful as an introduction to the range of techniques that can be used to cut down the ‘data space’ that may be explored in any data mining task. This can be done either through the reduction of feature space (feature selection and/or reduction) or instance space (summarisation, segmentation or sampling).
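The stratified sampling described above can be sketched with the standard library alone (the counts are illustrative; the 42%/58% split echoes the retail example):

```python
# Stratified random sampling: draw the same fraction from each stratum so that
# the population's 42%/58% male/female split is preserved in the sample.
import random

def stratified_sample(population, stratum_of, fraction, seed=0):
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(fraction * len(members))))
    return sample

customers = [("M", i) for i in range(420)] + [("F", i) for i in range(580)]
sample = stratified_sample(customers, lambda c: c[0], 0.10)
print(sum(1 for s in sample if s[0] == "M"),
      sum(1 for s in sample if s[0] == "F"))
# 42 males and 58 females: the population proportions are maintained exactly
```

Stratifying over multiple attributes (e.g. region and gender) just means using a composite key, such as `lambda c: (c.region, c.gender)`, as the stratum function.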


Thinking point: Reduction of feature and instance space

What are the practical benefits of reducing the data space in each of these dimensions?



2.6 BI process model: application of the data mining algorithm

The specifics of each algorithm are not discussed here, as the full range of possible algorithms is covered in Chapters 4 and 5. As a generic process we can say that the coded, cleaned, transformed data is supplied to the algorithm in a suitable format (the input data, in some cases a ‘training set’ and in others the whole data set). The algorithm then operates to find the model and parameters that best fit the input data, and this is presented for validation.

There are a couple of general issues that are pertinent to the application of all learning algorithms and these are briefly outlined here before we discuss the validation and use of the resulting model in the post-processing phase. The first point to note is that the application of many BI/data mining algorithms takes place in two stages: a training phase and a test phase – as illustrated in Figure 2.5.

As the figure illustrates, the initial data set can be broken down into two sub-sets, often roughly on a 70/30 basis, with 30% of the data being kept for the testing phase – this is referred to as the holdout or test sample estimation method. (There are in fact a number of alternative sampling schemes for training and testing that are commonly used, including rotation estimation, stratified cross-validation and bootstrapping. Pointers to discussion articles on these methods can be found at the BI module website. The differences between methods can be important but they share a common objective: to reduce bias in the sampled sets and ensure that the testing phase provides as reliable an estimate of the true accuracy of the model as possible.) As can be seen from Figure 2.5, the results of the testing process are then assessed in a number of ways at the post-processing phase. These assessment methods are introduced in Section 2.7 on page 50. A number of general issues must be considered during the training and testing of any data mining algorithm or the creation of a BI model; some of these are discussed below.
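The holdout method of Figure 2.5 amounts to a shuffle and a 70/30 cut, sketched here on dummy data:

```python
# The holdout (test sample estimation) method: shuffle the initial data set and
# keep roughly 30% back as the test set.
import random

def holdout_split(data, test_share=0.30, seed=1):
    rng = random.Random(seed)
    shuffled = data[:]                      # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)

train, test = holdout_split(list(range(1000)))
print(len(train), len(test))                # 700 300
assert not set(train) & set(test)           # the two sets never overlap
```

The shuffle matters: splitting without it would let any ordering in the database (by date, by region, etc.) bias the two sets.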

Overfitting

The term ‘overfitting’ is used to refer to the situation in which a predictive model is so specific that it picks up idiosyncrasies (‘noise’) that exist in the training data set and modifies its parameters to account for this noise rather than the actual underlying (and more general) relationship. The problem is perhaps best illustrated visually. The graph in Figure 2.6(a) shows some sample data points which come from a known functional relationship between two variables. However, noise has been introduced into the system – as it might be, for example, if poorly calibrated equipment was used for measurement – and so the data points do not lie on the ‘perfect’ functional line. In attempting to build a model

Figure 2.5 Training and testing a BI model: the initial data set is split into a training set and a test set; training produces the trained model, and testing it yields assessments such as the error rate, confusion matrix, gain chart, ROC and lift curve

Figure 2.6(a) A data set and its ‘true’ functional relationship: scatter plot of (x,y) sample points and the known function y = –x^2 + 20x + 10


to relate the two variables shown, one possible approach is to assume an overly simple function and put the rest of the variation down to ‘noise’ – this situation is shown in Figure 2.6(b) and is referred to as under-fitting. The opposite tendency is to give too much significance to small variations (which are in fact noise) and end up with a very complex model. This is referred to as overfitting, and is illustrated in Figure 2.6(c). In practice the correct level of detail at which to model often comes down to a combination of model-building experience and a bit of trial-and-error. The general principle of Occam’s razor (see below) also applies in these circumstances.

While the problems of overfitting are perhaps best illustrated graphically in terms of functional models, another commonly-occurring example is in the area of decision trees. These trees (see ‘Predictive modelling – classification’ on page 104) can grow quite large and a common question is: when should the algorithm stop splitting nodes and adding more branches? In this context the problem of overfitting (growing the tree till it is too large) is dealt with by various tree-pruning strategies. For example, if any leaf nodes in the final tree apply to fewer than n instances in the training set then these nodes are ‘pruned back’ until a leaf with sufficient ‘coverage’ is reached.
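A minimal sketch of this coverage-based pruning rule – the nested-dict tree format, the class counts and the threshold n = 10 are all invented for illustration:

```python
# Coverage-based pruning: any subtree covering fewer than n training instances
# is collapsed into a single leaf labelled with its majority class.
def prune(node, n):
    if "label" in node:                       # already a leaf
        return node
    if node["count"] < n:                     # insufficient coverage: prune back
        label = max(node["classes"], key=node["classes"].get)
        return {"label": label, "count": node["count"]}
    node["left"] = prune(node["left"], n)
    node["right"] = prune(node["right"], n)
    return node

tree = {"split": "income>40", "count": 100, "classes": {"buy": 60, "skip": 40},
        "left": {"label": "buy", "count": 92},
        "right": {"split": "age>30", "count": 8, "classes": {"buy": 2, "skip": 6},
                  "left": {"label": "skip", "count": 5},
                  "right": {"label": "buy", "count": 3}}}
pruned = prune(tree, n=10)
print(pruned["right"])  # the 8-instance subtree collapses to a single 'skip' leaf
```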

Occam’s razor²

This general principle advises that the simplest approach should be adopted wherever possible. In the context of creating BI models this means that the most straightforward model, or one with fewer estimated parameters, should be preferred. For example, linear models have been used successfully in the physical sciences not least because many problems (at least if you exclude the ‘extreme’ ends of the boundaries) are approximately linear. If we attempt to use some non-linear technique, such as artificial neural networks, to model such problems we are likely to end up with difficulties. The model we generate will in fact be too complex for the problem at hand and will typically lead to overfitting problems in later use. (Thus, as an example of the application of Occam’s razor, it is often recommended that the performance of any neural network model be bench-marked against a linear model to check for overfitting.)

2 Often spelled, arguably more correctly, ‘Ockham’s razor’ after William of Ockham (1285–1348). This Franciscan monk and medieval English philosopher is credited with this principle, sometimes also referred to as the law of economy or the law of parsimony. Ockham’s original statement was ‘pluralitas non est ponenda sine necessitate’, literally ‘plurality should not be posited without necessity’.

Figure 2.6(b) A model which under-fits the data in Figure 2.6(a): scatter plot of (x,y) sample points and the under-fitting function y = 0.132x + 70.6

Figure 2.6(c) A model which over-fits the data in Figure 2.6(a): scatter plot of (x,y) sample points and the over-fitting function y = 0.0002x^6 – 0.0166x^5 + 0.4619x^4 – 6.1977x^3 + 39.926x^2 – 97.48x + 103.36


The ‘reject’ option

One of the other issues to consider when building a model is whether to remove any instances from the data set – the ‘reject’ option. Above we discussed methods of sub-sampling to reduce the data set, either in terms of the number of features or the number of instances. In that discussion the focus was on ensuring that the remaining instances were a representative sample of the population, i.e. on avoiding bias. In the case of the ‘reject’ option there may be circumstances in which we actively choose to remove ‘difficult’ cases from the training and test sets – which will typically improve the performance of the predictive model. For example, if you were creating a system to correctly identify the postcode data on letters for the Royal Mail (or whatever it is now called!) there may be a requirement that the system must be 99.9% accurate. To achieve this level of accuracy you may need to remove 2% of the letters from your data set. (However, this high level of performance on the 98% of letters that are left is much more useful than a 99% accuracy on 99.5% of the letters – i.e. a reject rate of just 0.5%.) In general the shape of the reject-error curve will look something like that shown in Figure 2.7 – i.e. the more ‘difficult’ cases we choose not to model (and/or test with) the better the performance.

Of course using the ‘reject’ option during model building can be dangerous, as it relies on being able to accurately identify the ‘difficult’ cases in operation and these may not be easily described in some general format (by their very nature they tend to be maverick points!). In practice, the use of the ‘reject’ option is part of a more complex assessment of the performance of the model which will include an assessment of things such as the confusion matrix and the lift factor, which are discussed in the next section.

Figure 2.7 A typical reject-error curve when testing a BI model: the % error in the model falls as the % of instances rejected rises


2.7 BI process model: post-processing

As noted above, after model training has taken place some data are used to test the performance of the model. The range of post-processing activities goes wider than this relatively narrow and technical focus on model accuracy but we will focus predominantly on it here. (Some of the other post-processing activities, such as evaluating the usefulness of an association or the consistency of a production rule, are discussed in the specific context of the various techniques to which they are related in Chapters 4 and 5.)

Error rates and misclassification

Given that the correct outcomes for all instances in the test set are already known, it is possible to evaluate any model which we create and compare the predicted outcomes to the actual values. (Here we restrict our discussion to the case of simple binary outcome predictions: ‘yes/no’, ‘present/absent’, ‘profitable/loss-making’, etc. – referred to in certain settings as the ‘0-1 loss function’ – see also ‘Predictive modelling – classification’ on page 95. However, this can be extended to the more general case of multiple class prediction or indeed continuous outcome data. A discussion of the continuous case and the use of ‘residuals’ can be found in Chapter 4 in ‘Predictive modelling – regression’ on page 111.) The error rate is often expressed in terms of overall accuracy, which is the rate of correct predictions made by the model over the whole test set. (The coverage represents the proportion of the data set for which the model can make a prediction and relates to the ‘reject’ option discussed in the last section. In the discussion that follows we will assume 100% coverage to ease the illustrative calculation of rates.) Even something as apparently simple as accuracy is a little more subtle than at first might appear, as there are different ways of getting an incorrect answer. The correct answer may be ‘yes’ while the model predicts ‘no’, or the opposite may be the case. To see the importance of this distinction consider the real example case of a model which has to predict the presence of bowel cancer on the basis of clinical evidence. The patient may not have bowel cancer while the model predicts that they do (a ‘false positive’). On the other hand the patient may in fact have bowel cancer but the model predicts that cancer is not present (a ‘false negative’).
In both of these instances the model has ‘got it wrong’ but arguably the second situation is much more serious than the first. To enable us to assess more than just the crude accuracy of a model’s outcome we typically present the results in terms of a misclassification table (also known as a confusion matrix). An example of such a table/matrix is shown in Figure 2.8 for our hypothetical model to predict bowel cancer.

Figure 2.8 An example misclassification table (confusion matrix)

Actual situation                          Model predicts negative   Model predicts positive
Patient does not have cancer (negative)   A                         B
Patient does have cancer (positive)       C                         D


From the misclassification matrix we can calculate the overall accuracy as follows:

accuracy = (A + D)/(A + B + C + D)

However, just as important are a number of other estimates:

sensitivity = D/(C + D) [the true positive rate]
specificity = A/(A + B) [the true negative rate]

Obviously the false positive and negative rates can also be calculated. These terms are typically used in the medical context. In other contexts these ratios are known by slightly different names. For example in information retrieval (IR) the term used for sensitivity is ‘recall’, and a second ratio which is typically stated for IR models is:

precision = D/(B + D)

In other domains these misclassification fractions are known by other names, but the important issues are how they can be reduced and which are the most important to get right in a given context.
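The four rates can be wrapped in a small helper keyed to the cells of Figure 2.8 (the counts in the example call are invented):

```python
# Rates from the confusion-matrix cells of Figure 2.8:
# A, B = actual negatives; C, D = actual positives.
def rates(A, B, C, D):
    return {
        "accuracy":    (A + D) / (A + B + C + D),
        "sensitivity": D / (C + D),   # true positive rate (IR: 'recall')
        "specificity": A / (A + B),   # true negative rate
        "precision":   D / (B + D),
    }

print(rates(A=90, B=10, C=5, D=95))
```

With these (hypothetical) counts the model looks strong on accuracy (0.925) while still letting 5 of the 100 actual positives slip through – the kind of detail the crude accuracy figure hides.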

The significance of ‘getting it wrong’

One of the typical characteristics of the confusion matrix is that the ratios it produces tend to interact with each other, typically in an opposing manner. Thus to increase the level of sensitivity of a clinical diagnostic test (i.e. catch more true positive cases) the price you will pay is reduced specificity (i.e. your true negative rate will drop – that is, you will pick up more false positives). Similarly, in designing IR systems the making of changes to the retrieval algorithm which increase recall will often be at the expense of precision. To enable the model designer to make sensible choices from the enforced trade-offs that are involved it is important to know the significance of getting it wrong. For example, in the area of IR systems a historical goal was to ensure that recall was maximised (e.g. in a bibliographic database, make sure that interesting articles are not missed out). This may have been at the expense of precision but with specialist searches in a relatively restricted document space (say 100,000s of articles) this was an acceptable price to pay. However, consider the case of information retrieval on the internet (say via Google or Alta-Vista). In this case we are talking about hundreds of millions of documents and often very generalist searches. It is now likely that precision will become much more critical and that we will be prepared to compromise on the level of recall as a result.

Or take the example of the medical diagnostic system. Here the significance of misclassifying a patient who has cancer is much more serious than of misclassifying one who does not (though of course such an error may cause that patient some psychological distress until the error is resolved – hopefully through more extensive diagnostic testing). In this type of scenario we can specify a ‘loss matrix’ which attempts to quantify the importance of incorrect predictions. In the example of the bowel cancer test the loss matrix might be:

This represents the fact that we would be prepared to accept 1,000 incorrect (positive) predictions for patients who were in fact negative for every 1 negative prediction for a patient who was in fact positive. Of course, in many other applications the implications of missing positive cases are much less serious. So for example in the case of a mail-shot prediction system for likely high-spending customers our loss matrix would look very different.

0      1000
1      0


Assessing the implications of different choices, their associated trade-offs and the costs and benefits associated with each can be a non-trivial task. To aid the model designer in this post-processing assessment a number of graphical tools exist which summarise the options. These are discussed below.

Gain charts, lift curves and ROC plots

There are a number of graphical tools which enable the model builder to compare the performance of different models. The best-known are receiver operating characteristic (ROC) plots or curves. ROC plots are so named as they were used to compare the performance of different receivers of radar signals at correctly identifying objects in the air or water. (High sensitivity leads to shooting down pigeons, but at the opposite end of the spectrum you don’t get warning of an enemy fighter plane till you can see it!) The ROC plot is typically laid out as shown in Figure 2.9 with sensitivity (in % terms) plotted against (100 – specificity). The object of the exercise in using this type of plot is to try to choose the best trade-off point, which maximises both elements (i.e. somewhere in the top-left of the graph – shown here to be the point at which specificity = 91.1% and sensitivity = 90.9%).
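To make the construction of such a plot concrete, the following Python sketch (with invented scores and labels – not data from the module) computes the sensitivity and (100 – specificity) coordinates of an ROC plot at a few threshold settings:

```python
# Sweep the decision threshold over a set of model scores and compute the
# (100 - specificity, sensitivity) coordinates plotted on an ROC curve.

def roc_points(scores, labels, thresholds):
    """labels: 1 = actually positive, 0 = actually negative."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sensitivity = 100.0 * tp / (tp + fn)
        specificity = 100.0 * tn / (tn + fp)
        points.append((100.0 - specificity, sensitivity))  # (x, y) on the plot
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]   # invented model scores
labels = [1,   1,   0,   1,   0,    1,   0,   0]     # invented true classes
for x, y in roc_points(scores, labels, [0.25, 0.5, 0.75]):
    print(f"100-spec = {x:5.1f}%, sens = {y:5.1f}%")
```

Lowering the threshold moves the operating point up and to the right: sensitivity rises while specificity falls, tracing out the curve shown in Figure 2.9.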

As noted in the previous section, not all misclassification is of equal importance. The incorporation of loss matrix information, as well as the cost implications of different interventions, can be added to the ROC plot to give a more useful summary of the trade-offs in a business context. The graphs which result from this fusion of cost-benefit and model trade-off information are referred to by different names in different contexts. The concept of lift is commonly used in CRM applications and relates to customer targeting models. Basically, lift is a measure of how much better your prediction of ‘interested’ customers is using a model than would have been the case through purely random selection (the approach that appears to be used by most cold-calling centres!). For example, suppose that of all customers sent a catalogue without using any CRM model, 1.5% make a purchase. Further suppose that using the model to select catalogue recipients results in 9% making a purchase. The ‘lift’ would be calculated as 9/1.5, or 6. Lift may also be used as a measure to compare different data mining models. Since the estimate is computed using a data table with actual outcomes, lift compares how well a model performs on this data with respect to predicted outcomes. The graphing of various ‘lift scenarios’ can then be carried out – resulting in lift curves (also referred to as gain charts).
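The catalogue example above reduces to a one-line calculation, sketched here for completeness:

```python
# lift = response rate among model-selected customers / baseline response rate.

def lift(model_response_rate, baseline_response_rate):
    return model_response_rate / baseline_response_rate

# 9% purchase rate with the model vs 1.5% without it:
print(round(lift(0.09, 0.015), 2))  # -> 6.0
```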

Figure 2.9 Typical ROC plot for a clinical test at a variety of parameter settings


In the example shown in Figure 2.10 the relative performance of various modelling algorithms is compared in terms of a lift chart for the area of mail-shot delivery. The x-axis represents the percentage of the total population covered, say 100,000 people. The y-axis presents the cumulative percentage of correctly classified positive cases – say 30,000 who would respond if they received the mail-out. The chart should include the performance of random case selection (a straight line from [0%, 0%] to [100%, 100%]) and the performance of the model under investigation. Other possible lines in the chart include the performance of other competing models, the performance of a perfect classifier, and the quota to be achieved. From Figure 2.10 we notice that kNN (k-nearest neighbour) reaches the 83% quota faster than the DT (decision tree).
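The data behind such a gains chart is straightforward to tabulate. The sketch below (with invented scores and outcomes, not the module's mail-shot data) ranks customers by model score and records what cumulative percentage of all responders is captured within each percentage of the population contacted:

```python
# Build the (% population contacted, % responders captured) points of a
# cumulative gains chart from model scores and known outcomes.

def gains_curve(scores, responded):
    ranked = sorted(zip(scores, responded), key=lambda sr: sr[0], reverse=True)
    total_responders = sum(responded)
    points, captured = [(0.0, 0.0)], 0
    for i, (_, r) in enumerate(ranked, start=1):
        captured += r
        points.append((100.0 * i / len(ranked),
                       100.0 * captured / total_responders))
    return points

scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1,   1,   0,   1,   0,   0,   0,   0,   0,   0]
# Contacting the top 40% of the ranked list captures all three responders here,
# whereas random selection would capture only 40% of them on average.
print(gains_curve(scores, responded)[4])  # -> (40.0, 100.0)
```

Plotting these points against the random-selection diagonal gives exactly the kind of chart shown in Figure 2.10.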

Self-assessment questions

(a) What are:
– tuples
– MAR
– ‘fuzzy matching’
– ‘stratified sampling’
– interval variables.

(b) What are the three broad categories of ‘dirty data’ defined by Kim et al.?
(c) Describe Simpson’s paradox.
(d) Explain the difference between supervised and unsupervised learning.

[Figure residue omitted. Axes: x = % population (0%–100%), y = % of respondents (0–100); plotted series: random, DT, kNN, perfect; additional markers: Quota (80) and Savings.]

Figure 2.10 A lift chart for mail-shot responses


3 exploratory data analysis and visualisation


Contents

Learning outcomes for this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58

3.1 An illustration: how not to do data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

3.2 Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.3 Specific approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62

3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Activities
Research activity  Examples of visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercise 3.1  Advantages of visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercise 3.2  Visual patterns in data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Exercise 3.3  Line graph vs scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Exercise 3.4  Visualising data (which option is best?) . . . . . . . . . . . . . . . . . 68


Learning outcomes for this Chapter

When you have completed this Chapter you will be able to:
! explain the differences between exploratory data analysis (EDA) and visualisation, and discuss whether these are of any real importance
! appreciate the differences between geometric and symbolic approaches to visualisation
! identify the main types of visualisation and their relative strengths and weaknesses
! explain how for certain types of application (particularly those in high dimensional space) non-linear mappings which are not explicitly ‘representational’ can be of great value.


Introduction

A large number of graphical data visualisation techniques exist and, due to their flexibility and intuitive output, are often considered the ‘lowest’ level of BI tool. As these visual representations of data can help identify trends, relations and/or biases in data sets they are often referred to as ‘exploratory data analysis’ (EDA) tools. Whether EDA tools are themselves a BI method or simply a useful adjunct to a number of other BI algorithms is not really a debate worth worrying about; the point is that in either setting they are useful end-user tools. In this Chapter we introduce the EDA approach, a range of EDA graph types and, where necessary, a little supporting mathematics.

Due to the graphical nature of much of the discussion in this Chapter, figures tend to be the best way to illustrate a range of points. However, we are constrained both in terms of space and the use of colour. In view of this the BI module website has full colour versions of all of the figures discussed in this Chapter and in addition has pointers to a number of additional illustrations for many of the graph types located at ‘external’ websites. There are of course many more EDA and visualisation techniques than are covered in this short introduction. However, it is hoped that this discussion will provide some insight both into the use of visualisation as an initial exploratory tool as well as into its incorporation within other BI approaches.


3.1 An illustration: how not to do data mining

Recently I was supervising a student who was looking at using data mining techniques to uncover patterns in detailed drug prescription data from patients across Scotland. Even at the level of aggregation at which GP anonymity is maintained (drugs dispensed by local health area) this data is substantial – we received, from the Scottish Information and Statistics Division (ISD), around 6,000,000 records of prescriptions per year. Clearly a database of this size is potentially well suited to the application of data mining. The student dived straight into this large data set using a sophisticated tool (IBM’s Data Miner, as it happens) and came back after a few weeks enthusiastically reporting an interesting association between a certain class of anti-inflammatory drug and areas of social deprivation. I was interested in this association and probed him further. What proportion of prescriptions in Scotland belonged to this class of drug? (And its parent class?) How many postcode level districts with this low social deprivation score existed within each local health area in Scotland? On asking these and other questions I found that he had no idea of the answers – he had no ‘feel’ for the prescription data picture over Scotland as a whole. Without such a perspective on a data set it is difficult to interpret or even judge the significance of any specific rules, clusters or associations that might result from data mining activity. So while the limits of exploratory data analysis (EDA) must be recognised, the associated techniques have much to offer in terms of an initial approach to describing and getting a ‘feel’ for any data set.


3.2 Visualisation

Finding rules and clusters or creating models to explain a set of relationships is all very well, but can people actually understand what is going on? The ‘end users’ are typically non-specialists in the area of data analysis and often find mathematical formalisms off-putting (at best!). This is where visualisation comes in. The task of the range of tools and approaches which come under the ‘visualisation’ banner is to make patterns in data apparent to users and to give graphical expression to what can often be complex mathematical relationships.

Research activity  Examples of visualisation
You may find it useful to collect examples of varied ways of visualising data, from books, magazines, newspapers, the www, etc.

The point at which Exploratory Data Analysis (EDA) leaves off and visualisation takes over is an indistinct one and we are not going to waste time worrying about such definitional boundaries. However, the first few graphical techniques illustrated below are widely used outside of data mining and some would object to them being labelled as ‘visualisation’. The real strength of the techniques is seen when highly complex data sets, often represented in multi-dimensional space, have to be mapped and explored by users. The ability to ‘interact’ with a data set is also a key feature of both EDA and visualisation. One example of this is the technique of ‘brushing’, where specific points or sub-sets of points are selected by the user and the effect of their removal and/or inclusion can be viewed interactively (how does the line of best fit alter? are the confidence intervals different? does removing this outlier affect the statistical significance of a test for group differences? etc.). There are a number of excellent books on visualisation. My favourites (though they make little explicit reference to business intelligence or data mining) are those of Ed Tufte (Tufte 1983; Tufte 1990; Tufte 1997). If you have not seen these beautifully produced texts and have any interest in graphical presentation you would do well to take a look at them. Even though it is less sophisticated than the others, my personal favourite is still his earliest book, The Visual Display of Quantitative Information (Tufte 1983). For a ‘statistics’ perspective on visualisation you would be hard pressed to do better than consult John Tukey’s classic text, Exploratory Data Analysis (1977) or, for a slightly more graphically oriented slant, Bill Cleveland’s Visualizing Data (1995). Of most direct interest here is an excellent text edited by Fayyad, Grinstein and Wierse (2002) entitled Information Visualization in Data Mining and Knowledge Discovery.

Exercise 3.1  Advantages of visualisation
What are the principal advantages of presenting data in visual form? Are there any disadvantages?


General techniques for visualisation

Many different techniques are used to present data but most can be broadly categorised as either geometric or symbolic. Typically the geometric approach uses a set of axes onto which the graphical elements are projected. These elements may include points, lines, surfaces and, in the case of 2.5D (the name given to 3D images projected onto 2D surfaces – i.e. a computer screen or sheet of paper), even volumes. Symbolic representations are typically used for non-numeric data and so are not usually tied to any set of axes. They include icons, arrow sets, arrays, networks, etc. Because the space on which these are laid out has no immediate ‘meaning’ in terms of Cartesian geometry or scale, topology is of critical importance in the symbolic approach. In more complex visualisation it is often necessary to reduce the number of dimensions; thus even for data which are initially numeric a symbolic representation may be the only feasible projection (see, for example, the World Cities map in ‘Non-linear projections: Sammon plots and multidimensional scaling’ on page 72).


3.3 Specific approaches

The range of techniques illustrated here is of course not exhaustive, but hopefully it will give a feel for the range of options and their importance in ‘selling’ the output of BI tools to end users. Some of the newer and more complex visualisation approaches are also touched on in ‘The BI market and future trends’. A collection of colour images from Chapters 1 and 2 of Fayyad, Grinstein and Wierse (2002) is available at the following address: http://www.bh.com/companions/1558606890/pictures/
The specific graphics are referred to according to the format Chapter_02/Fig2-1.gif, etc. When I refer to these I will use the notation FGW’02 Figx-y.gif.1 Some of FGW’s graphs use the data sets Iris_Flower and Cars_by_Country (among others) and so I will use these two data sets in some of my own examples.

Scatter plots

In some ways these are the simplest of all displays yet they are often very useful. I came across one of my favourite illustrations of the importance of having a graphical representation of data to supplement the statistical one while reading Tufte (1983), though he was in fact using a data set due to Anscombe (1973) and known as ‘Anscombe’s quartet’, having four data sets as shown in Table 3.1. These data sets all have the exact same ‘statistics’ in terms of their number of points (11), the mean values of X and Y (9 and 7.5), standard deviations (3.32 and 2.03), and even correlation coefficient (r(x,y) = 0.82). [In fact, as we shall see in ‘Predictive modelling – regression’ on page 111, the data sets even have the same regression lines and R2 values.] If all we had was the summary statistical ‘model’ of the data we might claim that the sets were all identical – or at least very similar. However, a quick glance at the scatter plots created for these data, as shown in Figure 3.1, clearly illustrates how wrong such a claim of similarity would be!

1 However, all the figures can be accessed through the links provided from the relevant section of the BI module website. This is probably a ‘safer’ bet as they recently moved location on the web when the original publisher was taken over by Elsevier.

Table 3.1 Tables of 4 x/y data points making up Anscombe’s quartet

Data Set 1        Data Set 2        Data Set 3        Data Set 4
 X      Y          X      Y          X      Y          X      Y
10     8.04       10     9.14       10     7.46        8     6.58
 8     6.95        8     8.14        8     6.77        8     5.76
13     7.58       13     8.74       13    12.74        8     7.71
 9     8.81        9     8.77        9     7.11        8     8.84
11     8.33       11     9.26       11     7.81        8     8.47
14     9.96       14     8.10       14     8.84        8     7.04
 6     7.24        6     6.13        6     6.08        8     5.25
 4     4.26        4     3.10        4     5.39       19    12.50
12    10.84       12     9.13       12     8.15        8     5.56
 7     4.82        7     7.26        7     6.42        8     7.91
 5     5.68        5     4.74        5     5.73        8     6.89
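Since Table 3.1 gives the quartet in full, the claim about identical summary statistics is easy to verify. The following sketch uses only the Python standard library (the helper `pearson_r` is our own, written out so nothing beyond `statistics` is needed):

```python
# Verify that all four Anscombe data sets share the same summary statistics.
from statistics import mean, stdev

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # X is shared by sets 1-3
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8] * 7 + [19] + [8] * 3,
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(x, y):
    """Sample correlation coefficient, computed from first principles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

for k, (x, y) in quartet.items():
    print(f"Set {k}: mean(X)={mean(x):.2f}  mean(Y)={mean(y):.2f}  "
          f"sd(X)={stdev(x):.2f}  sd(Y)={stdev(y):.2f}  r={pearson_r(x, y):.2f}")
```

Every set prints the same line to two decimal places – mean(X) = 9.00, mean(Y) = 7.50, sd(X) = 3.32, sd(Y) = 2.03, r = 0.82 – which is exactly why the scatter plots in Figure 3.1 come as such a surprise.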

While this example is a little extreme/contrived it does make the point that going straight to the ‘statistics’ can be a dangerous approach, and in the case of simple 2-dimensional data sets taking a minute to view the scatter plots can reap major dividends.

The graph shown in Figure 3.2 illustrates another use for the scatter plot – the identification of outliers. The data point (3.2, 9.5) clearly shows up as an outlier on this plot. Looking at each axis on its own does not show up the outlier (see Figure 3.3 overleaf) as it effectively ‘hides’ in the marginal distributions. (In fact in the y dimension the point does show up as an outlier – but only just, and there are 6 ‘upper’ outliers of which our point is the least extreme.) Finding a line of best fit for the data with this outlier included results in an R2 of 0.72, while removing the outlier gives an R2 of 0.91 (much more convincing evidence of a strong correlation). The slope of the best-fit line also changed, from y = 1.7x to y = 1.9x. This is counter-intuitive, as the outlier point has a y value of approximately 3 times x, but it merely illustrates the importance of removing outliers before carrying out (in this case) correlation analysis. We do not have space here to go into an explanation of regression or the meaning of R2 values – see ‘Predictive modelling – regression’ on page 111 for more detail.
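The effect of ‘brushing out’ an outlier can be sketched in code. The data below is synthetic (the advertising data behind Figure 3.2 is not reproduced in the text), so the exact R2 figures differ from those quoted, but the pattern – a markedly better fit once the outlier is dropped – is the same:

```python
# Least-squares fit and R^2, computed from first principles, with and
# without a flagged outlier. Data points are invented for illustration.

def fit_and_r2(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - sy / n) ** 2 for _, y in points)
    return slope, 1 - ss_res / ss_tot

data = [(1, 1.6), (2, 4.3), (3, 5.4), (4, 8.2), (5, 9.7), (3.2, 9.5)]  # last = outlier
print("with outlier:    slope=%.2f  R2=%.2f" % fit_and_r2(data))
print("without outlier: slope=%.2f  R2=%.2f" % fit_and_r2(data[:-1]))
```

Interactive brushing tools simply repeat this kind of refit on the fly as points are selected or deselected.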

Figure 3.1 Scatter plots of the data presented in Table 3.1

Figure 3.2 Scatter plot showing the additional return in profit (y) due to spend on advertising (x)



Figure 3.3 Box and whisker plot showing the spread of the cost (x) and profit (y) variables from Figure 3.2

Figure 3.4 Scatter plot of MPG against weight of car from the Cars_by_Country data set



Even when the data are relatively sparse the possibility of visualising lines or curves onto the scattered points exists. As the data become more dense this is not so easy, as can be seen in Figure 3.4 which plots a scatter graph of MPG against Weight from the Cars_by_Country data set.2 What is clear from this denser plot, however, is that there appears to be a strong relationship between increased weight and increased fuel consumption (i.e. reduced MPG). Picking up such ‘visual correlations’ is another strength of the simple scatter plot approach.

Additional dimensions can also easily be added to a scatter plot by, for example, using colour or symbol shape as shown in Figure 3.5. It now becomes clear that in addition to the overall relationship between weight and MPG there is also clear evidence that certain countries have different tendencies.

2 The Cars_by_Country data set contains over 400 models of car manufactured in one of three regions – USA, Japan and Europe. In addition to make and model of car it contains seven other descriptive variables including: year of first manufacture, weight, horsepower, MPG, etc. A sub-set of the data is shown in Table 3.2.

Figure 3.5 Scatter plot of MPG against weight of car from the Cars_by_Country data set, using alternative colour/symbols to indicate the various countries



A very powerful extension to the basic scatter plot is to use a matrix of related plots from which richer sets of patterns and relationships may emerge, as can be seen in Figure 3.6.3

Exercise 3.2  Visual patterns in data
List four interesting relationships or patterns which become apparent to you as you look more closely at Figure 3.6.

Fayyad et al. (2002) note that there are a number of (more complex) variations on the scatter plot matrix theme. These include the Hyper-Slice [FGW’02 Fig2-3.gif] and the HyperBox [FGW’02 Fig2-4.gif]. The Hyper-Slice (van Wijk and van Liere 1993) projects a matrix of panels (‘slices’) taken through the data from a defined point of focus. They note that ‘prospection’ (Spence et al. 1995), which only projects a sub-set of the points according to some user-defined criteria (as is the case in graphic ‘brushing’), may be a more useful version of this approach in the case of data mining.

3 This graph really works best in colour and you should take a look at the version on the website, which also uses alternative colours to indicate the origin of the car.

Figure 3.6 Scatter plot matrix of six variables from the Cars_by_Country data set


Line graphs

Almost as well-known and arguably even more straightforward than scatter plots, these graphs illustrate relationships between variables. These graphs may consist of single lines or curves. They may use colour or line format to compare multiple lines within the same set of axes. As with scatter plots it is possible to arrange sets or matrices of such line plots to illustrate more complex relationships.

Exercise 3.3  Line graph vs scatter plot
I am not always convinced that multiple line graphs work as well as matrices of scatter plots. Take a look at FGW’02 Fig2-1.gif. Does this figure appear convincing to you? In what ways do you feel it is better or worse than the equivalent scatter plot?

Often a line graph is used where no ‘line’ exists. Typically the line is added by the graphing package using a ‘best fit’ algorithm and line format specified by the user. A classic mistake in using this type of graph is to ‘join up’ data points to create a line where such a step is not warranted and may be quite confusing – for example, drawing a line where categorical data is specified on the x-axis. This and a number of other examples of the poor use of line graphs are given at Michael Friendly’s excellent site under the title of ‘Darts: Context – compared to what?’ at http://www.math.yorku.ca/SCS/Gallery/context.html. This is part of a larger site maintained by Friendly entitled the Gallery of Data Visualization. The site attempts to ‘display some examples of the Best and Worst of Statistical Graphics, with the view that the contrast may be useful, inform current practice, and provide some pointers to both historical and current work’ (http://www.math.yorku.ca/SCS/Gallery/). Well worth checking out – it even attempts to identify the ‘best statistical graphic ever drawn’!

Bar-charts, pie-charts and histograms

Once again a familiar set of graphical tools. Bar charts should often be used in place of line graphs when categorical data is being reported on. Histograms can be useful in data mining contexts as they give an easily understood picture of the underlying distribution of the data set. Using a histogram matrix allows for the comparison of a number of distributions at the same time, as illustrated in Figure 3.7 for the range of variables that make up the Cars_by_Country data set.

Figure 3.7 A matrix of histograms illustrating the distribution profiles of the variables which make up the Cars_by_Country data set


Extensions to simple line graphs

The basic idea of drawing lines in a 2D (or 2.5D) space has been extended in a number of ways to provide alternative representations which are particularly suited to certain types of data:
! surface plots
! permutation matrix
! survey plots
! parallel coordinates.

Surface plots use ‘contour’ lines to join up points of equal value (as is the case with heights on a geographical map). Borrowing ideas from existing mapping or graphics design is a strong theme in visualisation. There is a range of research areas such as image perception, semiotics, visual metaphor, and dimensional scaling – to name but a few – which have relevance to visualisation, but the ‘bottom line’ appears to be that sticking to representations (such as physical maps) with which most people are familiar brings cognitive benefits. Moving from 2D contour lines it is possible to project 3D iso-surfaces in a similar way. These images are familiar to many from medical analysis where, for example, a volume visualisation of a human body is constructed from a series of ‘slices’ taken by some type of scanning technology, e.g. in magnetic resonance imagery (MRI) [FGW’02 Fig1-7b.gif]. Another technique for 2.5D surface projection is ‘rubber sheeting’ where the image on screen appears as a 3D ‘latex’ mould [FGW’02 Fig1-3a.gif].

The permutation matrix and survey plot graph each data value as a vertical bar or horizontal line respectively, the bar being scaled relative to the value of the various dimensions of the data point being displayed. I personally prefer the survey plot approach, but this may be a good example of user perception being driven by personal history. The way in which each dimension in a survey plot is stacked into an imaginary cylinder and then ‘trimmed’ according to the values of that dimension reminds me of using the lathe for turning wood at school and is immediately intuitive. Compare [FGW’02 Fig2-7.gif] and [FGW’02 Fig2-7.gif], which are examples of a permutation matrix and survey plot respectively.

The approach of parallel coordinates is somewhat different and can at first look like a confused spider’s web! The data is shown with respect to N parallel (and equally spaced) vertical axes. Each data point (instance) is represented as a line (more accurately a poly-line, as its angle typically alters as it crosses each parallel axis) which takes a position on each axis that is proportional to its value in that particular dimension. See [FGW’02 Fig1-5b.gif and Fig2-25.gif].
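The transformation behind a parallel-coordinates display can be sketched without any plotting library at all: each dimension is rescaled to a common [0, 1] range and each instance becomes a list of (axis, height) vertices for its poly-line. The rows below are invented car-like records, not taken from the Cars_by_Country data set:

```python
# Turn rows of raw data into normalised poly-line vertices, one per instance:
# vertex i of a line is (axis index i, value in dimension i rescaled to [0, 1]).

def to_polylines(rows):
    dims = list(zip(*rows))          # column-wise view of the data
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    return [
        [(i, (v - lo[i]) / (hi[i] - lo[i])) for i, v in enumerate(row)]
        for row in rows
    ]

rows = [            # e.g. (weight, horsepower, MPG) -- invented values
    (1980, 74, 36),
    (2605, 88, 28),
    (2865, 92, 24),
]
for line in to_polylines(rows):
    print(line)
```

Without the per-axis rescaling, a dimension like weight (thousands) would visually swamp one like MPG (tens) – which is why every parallel-coordinates implementation normalises each axis independently.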

Exercise 3.4  Visualising data (which option is best?)
Of the methods introduced so far, which would you choose to visualise the ‘beer and diapers’ phenomenon that we discussed earlier? Is there a risk of distorting perceptions by these means?

Glyphs and icons

The use of glyphs is a pseudo-symbolic approach which maps data values to graphical attributes such as colour, shape and size. Thus while 2.5-dimensional mapping limits us to the presentation of three variables simultaneously, the inclusion of graphical attributes can allow 2 or 3 additional variables to be added while still retaining visual clarity. Formally, scalar glyphs are used to represent a single variable. The most common approach is to add the size of spheres to a 2.5D projection, thus increasing the number of variables shown from three to four. This is illustrated in Figure 3.8 where


we add a scalar glyph using the size of spheres to represent the number of cylinders, in addition to the existing dimensions shown (i.e. Year, Weight and MPG).4 In some ways the use of colour to identify countries in the earlier figures (e.g. website versions of Figures 3.5 and 3.6) already constituted a use of glyphs. (See also FGW’02 Fig1-4a.gif which uses a scalar glyph to illustrate the density of cloud in a meteorological visualisation.) When two dimensions are incorporated into the glyph’s construction this is referred to as a vector glyph. This is often used in engineering applications to represent flow information where both direction and strength (or speed) of flow must be shown (see FGW’02 Fig1-4b.gif for an example of such an application).

Perhaps the most common (but in my view often confusing) use of glyphs is the star glyph. In this approach the ‘star’ has as many points as there are variables being represented and the length of each ‘ray’ from the centre of the star is determined by the value of each variable (normalised to provide rays of potentially equal length). Because I find this representation quite confusing when too many instances are plotted, I have shown star glyphs for a selection of 30 rows from the Car_by_Country data set in Figure 3.9 overleaf. Notice that Country has been omitted from the attributes shown in the star glyphs. It would be possible to code the countries numerically (USA = 1, Europe = 2, Japan = 3) but I find that this artificial allocation of values to be plotted based on categorical data leads to less meaningful graphs. (In some ways a similar argument could be made for Year as it is not measured quantitatively in the same way as the other variables shown.)

4 Once again this looks best in colour and a version on the BI module website also uses different colours to show a 5th dimension – country of origin.

Figure 3.8 A 2.5-dimensional plot of the Car_by_Country data with the added ‘scalar glyph’ to show number of cylinders as represented by sphere size


Recursive table visualisations

This class of visualisations is referred to as ‘recursive’ because the displays impose a hierarchical structure on the data such that some dimensions become embedded within the display ‘into’ other dimensions. They are also referred to by alternative names; Wong and Bergeron (1997), for example, refer to this set of approaches as ‘hierarchical axes’ visualisations. As with much when trying to tie down terminology in this rapidly-changing area, it is not clear whether ‘clustered visualisations’ should belong within this classification. Clustered approaches include dendrograms and Kohonen networks, which we will not discuss here as they are covered in Chapter 4.

Table 3.2 Table showing the first 10 cars from 1982 in the Cars_by_Country data as used in Figure 3.9

Name                         MPG   Eng. size   Horse Pwr   Weight   Acc. 0 to 60   Origin
Chevrolet Cavalier            28      112          88        2605       19.6        USA
Chevrolet Cavalier 2-door     34      112          88        2395       18          USA
Pontiac j2000 se hatchback    31      112          85        2575       16.2        USA
Dodge Aries se                29      135          84        2525       16          USA
Pontiac Phoenix               27      151          90        2735       18          USA
Ford Fairmont Futura          24      140          92        2865       16.4        USA
Volkswagen Rabbit l           36      105          74        1980       15.3        Euro
Mazda glc custom              37       91          68        2025       18.2        Japan
Mazda glc                     31       91          68        1970       17.6        Japan
Mercury Lynx l                36       98          70        2125       17.3        USA

Figure 3.9 A star glyph plot representing 28 of the instances in the Car_by_Country data set


Other than the clustered approaches noted above, recursive table visualisations are not that widely used. Perhaps the most widely known is Dimensional Stacking. In this approach each dimension is divided into a small number of discrete bins, while the display area consists of a grid of small sub-images. The overall number of sub-images is determined by the number of bins in each of the two ‘outer’ dimensions (these are specified by the user). Within each sub-image a further decomposition takes place based on the number of bins in the next two dimensions, and so on until all dimensions have been allocated. (See FGW’02 Fig1-6a.gif for an illustration of this approach.) We do not have space here to describe the range of additional recursive table visualisations that exist, but they include (with relevant website illustrations) the following:
! Temple MVV (FGW’02 Fig2-16.gif)
! N-vision (FGW’02 Fig2-17.gif)
! Worlds within Worlds (FGW’02 Fig2-18.gif)
! and the wonderfully named Fractal Foam (FGW’02 Fig2-19.gif).

Non-linear projections: Sammon plots and multidimensional scalingWhen the number of dimensions exceeds that which can be easily displayed (i.e. morethan four or so) it may still be possible to see patterns, relationships or clusters basedsimply on the distances between points (in N-dimensional space). However, projectingthese intra-instance distances in a linear manner, as would be the case in a scatter plot,tends to provide a poor representation – the true distances between points tend to belost. The multidimensional scaling approach provides a non-linear mapping down totwo dimensions which tends to preserve the distance between all points. (In some waysthese techniques can be seen to resemble the data pre-processing techniques fordimensional reduction – particularly, Principal Components Analysis – introduced inChapter 2.) One of the first authors to propose such plots was Sammon (1969) – a Sam-mon plot of the Iris data can be found at FGW’02 Fig2-22.gif. (The three types of irisflower are shown in the plot using different colours and their clustering can be clearlyseen – though in this case as there are only four dimensions to begin with it does notparticularly illustrate the strength of this approach to high dimensional data.)One of the most impressive (and convincing) demonstrations of MDS I have seen is inthe notes by David Barber where he provides a graph showing the relationship betweenworld cities based only on the distance between them – as reproduced in Figure 3.10overleaf. Of course the mapping is not perfect and you might respond, ‘why not justgive the longitude and latitude figures and get a much more accurate representation?’.However, this is missing the point. We are looking for a method to provide interestingprojections based only on the distance between points. 
Imagine that we wished to map the 'distance' between cities based not on their geographical proximity but on ten other attributes, such as population, total area, number of airports, average income, cost of living, etc. We could find the matrix of inter-city distances based on this 10-dimensional space and the MDS projection would then allow us to visualise the results.
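As an illustrative sketch only, the following shows how 2-D (or 1-D) coordinates can be recovered from nothing but a matrix of pairwise distances. Note the hedge: this is the classical, eigen-decomposition form of MDS (Torgerson scaling), not Sammon's iterative stress-minimising mapping, but it conveys the same idea of embedding from distances alone.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions from a
    symmetric n x n matrix of pairwise distances D, preserving those
    distances as well as a linear eigen-decomposition allows."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]       # keep the k largest
    scale = np.sqrt(np.maximum(vals[order], 0))
    return vecs[:, order] * scale            # n x k coordinate matrix
```

The recovered coordinates are unique only up to rotation and reflection, which is why an MDS 'map' of cities such as Figure 3.10 may come out flipped or tilted relative to a geographic map while still preserving the distances.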


Figure 3.10 An MDS plot showing the 'location' of some major cities in the world, based only on the distance between each – from notes by Barber (2003)


3.4 Conclusion

As mentioned in the introduction to this Chapter, this is by no means an exhaustive list of all visualisation options. Not only do we not have space, but it is pretty difficult to hit what is a rapidly moving target (i.e. new visualisation algorithms are being developed continually). We shall note a couple of recently emerging approaches in the document 'The BI market and future trends'5, including the use of techniques from virtual reality (VR) technology. The area of text mining (see the same document) is also developing some interesting visualisation approaches. However, due to commercial sensitivity it is not clear which algorithms are being used, though many approaches would appear to use some form of MDS, often with an additional pre-clustering phase.

Hopefully this brief overview has been sufficient to provide you with an idea of the usefulness of a range of visualisation approaches for exploratory data analysis. In addition we have introduced a range of techniques which are often incorporated within BI tools to make the interaction with (and particularly the presentation of results to) non-specialist end users more intuitive.

5 See the module website.


4 techniques and algorithms from data mining and kdd


Contents

Learning outcomes for this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78

4.1 KDD: introduction to the techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79

4.2 Descriptive modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.3 Predictive modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94

4.4 Association rules and patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Activities
Thinking point   Babesiosis data . . . . . . . . . . . . . . 82
Thinking point   Forecasting and predicting . . . . . . . . 94
Excel Activity   Iris data . . . . . . . . . . . . . . . . 104
Thinking point   Applied analysis . . . . . . . . . . . . . 113
Exercise 4.1     Overfitting . . . . . . . . . . . . . . . 115
Thinking point   Disjunctive expression . . . . . . . . . . 120
Thinking point   'Interesting' . . . . . . . . . . . . . . 121


Learning outcomes for this Chapter

When you have completed this Chapter you will be able to:
• appreciate the key differences between descriptive and predictive approaches and the links to (un)supervised learning and to models, patterns and relationships
• explain how density estimation can provide a quick summary of the 'shape' of a data set and its use in model construction
• within the area of predictive modelling, explain the difference between classification and regression approaches
• identify the range of clustering algorithms that exist to describe the grouping of similar instances in a data set
• appreciate the differences in classification approaches and the implications for variable inclusion and interpretation of resulting classifiers
• explain how association rules differ from decision trees and identify the types of problem that can benefit from the association rule approach.


Introduction

In this Chapter we consider a range of techniques and algorithms from the area of data mining and knowledge discovery in databases (KDD). Many of these would also appear in 'mainstream' statistics coverage of data analysis tools, and that is where they have their origins. However, they are introduced here in the context of the impact these techniques can make on a range of business intelligence problems. In addition, a number of the techniques used in DM/KDD, for example SOFM or Kohonen networks, would not typically be covered in statistical texts.

The Chapter begins with an overview of the types of BI task to which these techniques can be applied. These are considered under the broad headings of descriptive modelling and rule association, and are typically unsupervised learning tasks applied to uncover patterns within data sets. The more general construction of models for the complete data set is considered to be a supervised learning activity and is covered under classification and regression in a section on predictive modelling.

Given the nature of the material covered in this Chapter it is inevitable that the level of detail provided on any technique is somewhat limited. For more detailed coverage the reader is directed to the four key chapters of the course text, Hand, Mannila and Smyth (2001)1, which are relevant to this Chapter. The techniques introduced in this Chapter are illustrated using a range of examples in the areas of disease diagnosis, plant identification, credit rating, etc., not all of which may seem to have immediate relevance to a 'hard' business context. However, the important point is that you gain an understanding of the underlying approaches.

1 Referred to hereafter as ‘HMS (2001)’.


4.1 KDD: introduction to the techniques

There is no easy way to ensure that a taxonomy of the techniques that can be applied to the knowledge discovery in databases (KDD) task within business intelligence is both logical and consistent. Consideration of different facets of a technique will lead to different classifications. Thus if we are considering techniques which operate as 'models' (i.e. a global summary of the data set) as compared to those which work towards uncovering 'patterns', we would divide the techniques in one manner. If, on the other hand, we were concerned with 'descriptive' as opposed to 'predictive' techniques, we would classify in a different manner.

The difficulty of pigeon-holing algorithms has been acknowledged by Hand, Mannila and Smyth (2001), and in Chapter 5 of their text they specify a five-facet (or, to use their term, 'component') framework within which algorithms can be placed. Table 4.1 illustrates this framework. They note that there are key steps that apply whether we are looking for relationships to build global models or simply to uncover local patterns (the 'task'). We must select the nature of the model representation to be used (the 'structure') and the method by which we will judge the 'fitness' of that model (the 'score function'). From an algorithmic point of view it is important to define the manner in which the technique will seek to optimise this scoring function (the 'search method'), as well as any methods to provide efficient access to the database ('data management').

The framework proposed by HMS (2001) has much to recommend it and I would suggest that you read Chapter 5 of their text. That chapter includes a nicely worked example, introducing some of the algorithms mentioned later in this section, based on the Characteristics of Wines data set.

I propose to cover the techniques using a much simpler framework based essentially on the task involved. The notion of 'supervised' vs 'unsupervised' learning was introduced in Chapter 2. We noted that these terms come from the machine learning literature, while within business intelligence we are primarily interested in 'human learning' (though often with the aid of machines). In any case a summary of the main

Table 4.1 Framework to categorise data mining algorithms (based on HMS (2001) pp142–143)

Facet/component | Description | Example – CART*
Task            | Type of overall task being undertaken (pattern discovery, regression, etc.) | Classification and regression
Structure       | General structure of solutions to be fitted to the problem under consideration | Decision tree
Score function  | Specific mechanism for 'scoring' the outcome of any model generated (sum of squares, etc.) | Cross-validated loss function
Search method   | Algorithm to allow the efficient assessment of a wide range of possible solutions | Greedy search
Data management | Special considerations to allow large data sets to be manipulated (e.g. not all in main memory) | None specified

* CART: classification and regression trees


data mining tasks under these two headings provides a helpful way of addressing the subject matter presented in this section, as shown in Table 4.2. The specific tasks identified have the additional benefit of matching up with chapters in HMS (2001), where you will be directed for further technical details on:
• descriptive modelling (Chapter 9)
• classification (Chapter 10)
• regression (Chapter 11)
• rule association (Chapter 13).

Before beginning our discussion of descriptive modelling, it is worth re-emphasising that certain algorithms or techniques fit into more than one task category. For example, CART (classification and regression trees) can be used in both classification and regression tasks, while artificial neural networks can be used for cluster pattern identification (SOFM/Kohonen networks) as well as for both classification and regression. This is of course the reason why HMS (2001) developed their more complex framework, but working within the simpler structure outlined above allows us to gain key insights into the range of techniques available.

Table 4.2 Simple task-based framework used to review data mining algorithms and techniques

Unsupervised learning | Supervised learning
Aimed at discovering patterns within data sets (global and local) | Aimed at building models to describe the latent structures within data sets (global)
Tasks: Descriptive modelling (Section 4.2, page 81) – probability distributions (page 81); clustering (page 86); Rule association (Section 4.4, page 117) | Tasks: Predictive modelling (Section 4.3, page 94) – classification (page 95); regression (page 111)


4.2 Descriptive modelling

The general purpose of a descriptive model is to describe all or part of a data set. These descriptions can take the form of probability distributions (density estimation), relationships between data attributes (dependency modelling) or the splitting of f-dimensional space into groups (segmentation and clustering). These models involve unsupervised learning in the sense that there is no pre-determined target outcome(s), nor is any particular variable the focus of attention.

Probability distributions (density estimation)

In a sense we have already covered the first approach to descriptive modelling in our discussion of exploratory data analysis (EDA). However, EDA provides a graphical description of a data set and we wish here to develop mathematical descriptions of data. This is not to say that mathematical descriptions are always better; indeed in many cases they involve us in making compromises in describing our data. However, they have a number of benefits, which include the ability to reason and infer from these models, and to build tools such as Monte Carlo simulation models based on the data sets. The topic of density estimation can be fairly dry and formal. We will illustrate a number of the key points through the use of example data sets and leave the interested reader the option of following a more theoretical discussion by looking at Chapter 9.2 of HMS (2001).

We begin by introducing a problem which required the generation of probability distributions and used maximum likelihood estimation to achieve this. Note that the terms density estimation and probability distribution are used interchangeably: to estimate the density of a data set we describe its probability distribution. The example is taken from my own research which, while not explicitly in the area of BI, does involve the creation of a decision support system (DSS) and the capture of expert knowledge for representation and reasoning, all of which are central themes in the development of any BI system.

In the next section ('Clustering' on page 89), we use some output from a cattle disease diagnosis system to illustrate hierarchical clustering (this application – CaDDiS – is also discussed in Chapter 5 ('Uncertainty' on page 157) as it employs Bayesian belief networks, a graphical AI tool which uses formal Bayesian probability theory to represent uncertainty).
We do not need to know the details at this point but will discuss the knowledge acquisition phase involving veterinary disease experts and the use of maximum likelihood estimation (MLE) to summarise their knowledge. Such a summary had to take a mathematical rather than a simple graphical or even textual format, as the knowledge summarised was required as an input into the Bayesian belief system which would simulate the expert's diagnostic processes. (We do not discuss the details of the MLE approach here but readers can find a useful summary in Chapter 4.5.2 of HMS (2001).)

The CaDDiS diagnostic system that was being built covered 20 commonly occurring diseases in sub-Saharan cattle and used 27 signs associated with this set of diseases as the basis for diagnosis. Each expert was asked to provide 'sign prevalence' data for the disease(s) on which they were an authority. Thus an expert on the African cattle disease Babesiosis would be asked, 'of all the cattle you have diagnosed with Babesiosis what proportion of them had a fever, stopped producing milk, exhibited lethargic behaviour?' and so forth for the full set of all 27 signs of interest to the system. The question was asked in this 'direction', i.e. given the presence of a disease what are the chances of seeing particular signs, because it is relatively easy to get experts to respond based on their past experience. Ultimately the system needed to convert this knowledge to operate in a 'diagnostic' direction, i.e. 'given that the animal exhibits this


set of signs what are the chances that a range of diseases will be present'? As we shall see in Chapter 5, this 'directional transformation' can be achieved using Bayesian logic.

While it was relatively easy to get disease experts to express a judgement on the likelihood of signs given that a disease was present, this does not mean that their judgements were in agreement. This is typical of almost all expert knowledge elicitation exercises involving more than one expert, and the challenge to the expert system builder is to capture the level of disagreement in some meaningful way – perhaps in a statistical formalism that can then be used as part of the reasoning process. This is illustrated by looking at the case of Babesiosis, for which nine experts provided opinions. Rather than being asked to specify precisely the percentage of cattle which exhibited a given sign, the experts were given the five 20% bandings plus two additional bandings representing no cattle (0%) and all cattle (100%). The results for the likelihood of seeing the signs (a) fever and (b) lethargy are shown in Figure 4.1. The darker bars represent how many of the nine experts indicated each of the particular likelihood bands, while the lighter bands indicate the proportion of opinion allocated to each band based on the probability distribution discussed below and shown in Figure 4.2.
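The 'directional transformation' mentioned above is just Bayes' theorem. A minimal sketch follows; note that only the 0.86 likelihood of fever given Babesiosis comes from the elicitation described in this section, while the prior of 0.3 and the catch-all 'Other' disease with its 0.20 likelihood are invented purely for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a set of candidate diseases.

    priors:      {disease: P(disease)}
    likelihoods: {disease: P(observed signs | disease)}
    Returns {disease: P(disease | observed signs)}.
    """
    joint = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(joint.values())   # P(signs), by the law of total probability
    return {d: p / total for d, p in joint.items()}

# Hypothetical two-disease illustration: the animal shows fever, which is
# far more likely under Babesiosis (0.86, as elicited) than under 'Other'
post = posterior({'Babesiosis': 0.3, 'Other': 0.7},
                 {'Babesiosis': 0.86, 'Other': 0.20})
```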

Thinking point  Babesiosis data  What aspects of Figure 4.1(b) might repay further investigation?

The graphical description of the experts' opinion allows us to make an initial assessment simply by looking at the more darkly shaded histogram of 'expert opinion'. This assessment of expert responses suggests that the probability of fever being present when an animal has Babesiosis is higher than the probability of lethargy. However, at least as important is the much higher level of agreement between experts ('concordance') for the case of fever than for the case of lethargy. We do not have space at this point to discuss why this might be the case, though one obvious possibility is that the sign 'fever' is more clearly defined than that of 'lethargy' and that the higher degree of ambiguity associated with this second sign is being reflected in the experts' diagnostic views. Alternatively there may be a 'real' difference between the observation of these signs in the case of Babesiosis, with 'fever' being a much more 'indicative' sign.

[Figure 4.1: two histograms, (a) perceived prevalence of fever given Babesiosis and (b) perceived prevalence of lethargy given Babesiosis; x-axis probability bands (0, 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8–1.0, 1.0), y-axis number of experts; dark bars show 'expert opinion', light bars 'modelled opinion'.]

Figure 4.1 Comparison of recorded and modelled expert opinion for the prevalence of (a) fever and (b) lethargy in the situation where Babesiosis is present in cattle within sub-Saharan Africa, based on responses from nine experts


While the initial graphical summary is helpful, we require a mathematical description, which in this case will be a density estimation constructed using the maximum likelihood estimation method. Using this approach we find that the probability of fever being present is centred on 0.86 (with a spread of s = 0.113) while the mean probability for lethargy is 0.55 (with a spread of s = 0.356). This difference in spread represents in some informal sense the level of disagreement, which is about three times as great for lethargy as for fever. The probability distributions derived are shown in Figure 4.2. These represent 2-parameter normal distributions which are continuous but may spread beyond the 0 and 1 bounds associated with any probability estimate.

Describing the expert opinion in this way had a number of useful mathematical consequences. For example, we can sum (integrate) the area under the curve for each of the possible categorical response options originally specified and thus calculate the discrete approximations – which were shown in Figure 4.1 in the lighter shaded histogram of 'modelled opinion'. As can be seen, the probability distribution resulting from the MLE method gives a fairly accurate reflection of the experts' diagnostic views.
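The integration step can be sketched in a few lines of standard-library Python, using the fever parameters given above. Note that, exactly as the text observes, the fitted normal spills a little mass beyond [0, 1], so the band probabilities sum to slightly less than one.

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def band_masses(mu, sigma, edges):
    """Probability mass the fitted normal assigns to each response band."""
    return [norm_cdf(b, mu, sigma) - norm_cdf(a, mu, sigma)
            for a, b in zip(edges, edges[1:])]

# Fever given Babesiosis: N(0.86, 0.113) integrated over the five 20% bands
fever = band_masses(0.86, 0.113, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```

Most of the mass lands in the 0.8–1.0 band, matching the lighter 'modelled opinion' bars in Figure 4.1(a).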

More detail on the methods used can be found in McKendrick et al. (2000), while the way in which these density estimations were used to create a decision support system is discussed in Chapter 5.

The example given above represents a one-dimensional or univariate example of density estimation. The principles extend to as many dimensions as we happen to be using (though more than three are difficult to visualise in the way illustrated here). The example below extends the probability distribution to two and then three variables. It is not discussed in detail, but the reader is referred to Chapter 9.2 of HMS (2001) for a more technical discussion of multivariate distribution estimation. (Figure 9.3 on p286 and the associated discussion are of particular interest.)

The examples given below come from a real data set relating to loans made by a German bank. We will use this data set, which has 1000 loan instances and around 25 descriptive variables, in the practical work during the seminar sessions which look at classification. However, for the moment we simply take three easily understood dimensions of the data and see how density estimation can aid in our understanding of the data set. In Figure 4.3 overleaf, a scatter plot is given which illustrates the relationship between the age of the loan applicant and the bank's assessment of their credit rating (in '000s of DM). Given the fact that there are 1000 data points in the plot, it is quite difficult to make as much sense of this representation as we might wish.

[Figure 4.2: two normal probability density curves, (a) P(fever|Babesiosis) and (b) P(lethargy|Babesiosis), each plotted over the probability range -0.1 to 1.1.]

Figure 4.2 Normal model for the prevalence of (a) fever N(P=0.86, 0.113) and (b) lethargy N(P=0.55, 0.356) in the situation where Babesiosis is present in cattle

One way to assist our understanding is to look at the marginal probability distribution fitted to each of the dimensions. This is done using a normal distribution and is shown in Figure 4.4. Of course other distributions could be fitted – and would almost certainly be more appropriate – this is simply an illustration. We could now remove the data points from the graph and look at the 'shape' of the combined distributions, but instead we will make life a little bit more interesting by adding a third dimension – the duration of the loan.

In Figure 4.5 the scatter plot has been extended so that each point reflects the duration of the loan using a different symbol. Even the addition of colour (as shown in the version of the graph on the website) does not really help in gaining a sensible interpretation of the data. Instead, a 3-D density estimation which attempts to fit the best surface to this data is shown in Figure 4.6 on page 86. Once again this looks better in colour (as shown in the web version) and, while it represents a simplification of the data, the general 'shape' of the relationship between these three variables is reasonably well illustrated.
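For a normal distribution the maximum likelihood fit is simply the sample mean and standard deviation (with an n, not n-1, denominator) of each column, so fitting the marginals shown in Figure 4.4 amounts to something like the sketch below. The ten ages are a made-up stand-in for the real 1000-instance column.

```python
from math import sqrt

def fit_normal(xs):
    """Maximum likelihood estimates (mean, std) of a univariate normal.
    Note the MLE variance uses n, not n - 1, in the denominator."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, sqrt(var)

# Hypothetical mini-sample standing in for the 'age' column of the loan data
mu, s = fit_normal([23, 31, 35, 42, 28, 54, 39, 46, 30, 37])
```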

Figure 4.3 Scatter plot of credit rating by age



Figure 4.4 Scatter plot of credit rating by age (with associated histogram and normal density function for each variable)

Figure 4.5 Scatter plot of credit rating by age with duration of loan illustrated by different point icons


Clustering

I am using the term 'clustering' in its most generic sense. In fact, within this group of techniques some are referred to as 'cluster analysis' and others as 'segmentation'. In both cases the basic aim is the same – to split the instances in the data set into sets of relatively homogeneous objects. If the data set has f features (attributes) then each instance can be seen as occupying some location in f-dimensional space. The task of cluster analysis is to discover as many natural groupings (clusters) as exist in the data set and to describe these to the user. In the case of segmentation (sometimes also referred to as 'dissection') the number of clusters is set by the data analyst in advance and the algorithm attempts to group the instances into this number of clusters.

For example, in a marketing exercise segmentation might be used to identify clusters of customers. The number of clusters would be set by the analyst in such a way that the likely number of customers in each cluster would be roughly in line with the effective audience size as specified by the advertising department – thus if a new mail-shot were to go out to 5,000 customers and a company had 50,000 potential customer contacts, the analyst would look for an algorithm which would create around 10 roughly equally sized clusters (Wedel and Kamakura, 1998). On the other hand, a scientist looking for 'natural' groupings in long-term weather patterns would use cluster analysis because he or she has no idea a priori how many groups there might be (though in fact, looking at 'epochs' of weather in terms of pressure patterns in the Northern hemisphere, it can be seen that three such natural groupings have occurred since the middle of last century – see Smyth, Ide and Ghil, 1999). Strictly speaking, because the second approach attempts to uncover 'inherent' structure in the data it is more likely to be considered a 'knowledge discovery' activity.
This contrasts with the case of segmentation, where we set the group sizes in a manner that is convenient to the problem at

Figure 4.6 Surface plot of credit rating by age with probability distribution of duration shown using a least-squares best fit (axes: age in years, credit rating in '000 DM, duration in months)


hand. Fortunately the techniques used in both cases of 'clustering' are the same, and the more important distinctions lie elsewhere, as described below.

Most introductory examples of clustering tend to use data which are described by two attributes (features). This is simply because it is relatively easy to think in two-dimensional space and we can illustrate the examples by means of an x-y (Cartesian) plot. The principles apply directly to the f-dimensional problem space of more realistic cases. As we noted earlier, with vector/matrix algebra we can calculate the distance between two points in 5-dimensional space (e.g. (2,5,4,7,1) and (3,2,2,6,2) – not something particularly easy for most of us to visualise) just as simply as we can the distance between (1,1) and (5,4) – which can be visualised very easily, as shown in Figure 4.7.
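The calculation is identical whatever the number of dimensions, as a quick sketch using the points quoted above confirms:

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points in f-dimensional space."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d2 = euclidean((1, 1), (5, 4))                    # the 2-D case of Figure 4.7 -> 5.0
d5 = euclidean((2, 5, 4, 7, 1), (3, 2, 2, 6, 2))  # the 5-D case -> 4.0
```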

The notion of distance is of course very important in any cluster analysis, as it enables the algorithm to judge how dissimilar two entities are. In the example given above simple Euclidean distance is used, though this is not the only method of calculating distance, nor is it always the most meaningful. This method assumes some degree of 'commensurability' between the dimensions being considered. So if two variables were being considered – say the length and width of football playing areas – then simply plotting (or calculating) the 'distance' between stadiums based on the metres measured would be fine. However, what if we wanted to cluster stadiums according to their playing surface (m2) and spectator capacity (people)? Clearly these measures are not really commensurate. Even if we scaled according to the maximum values we would not get a very consistent 'spread': for Scottish Football League teams the playing area varies from 7,000 to 8,000 square metres while the (seated) capacity ranges from 200 to 60,000. There are various ways of adjusting for such incompatibilities, one of the most widely used being to standardise the attribute or variable by dividing by its standard deviation. This ensures that all dimensions are treated with equal importance (see also the section in Chapter 2 on data coding).

Of course in some circumstances you may know that one variable is more important than another in measuring 'distance', and so you may choose to weight that variable relative to the others. This would lead to a weighted Euclidean distance. There is a whole range of minor variations on this theme, such as City-block (aka Manhattan), Chebychev and Power distances, which each calculate distance in a slightly different way. For example, in the City-block technique the effect of single unusual values – outliers – is damped. There is also the special case where the variables (dimensions) are binary, i.e. can only take on the value 0 or 1.
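A few of these variations can be sketched directly. The helper names are mine; the Jaccard form shown is the common disagreement distance for binary vectors, 1 minus the ratio of shared terms to terms present in either vector, which is discussed next.

```python
from math import sqrt

def manhattan(p, q):
    """City-block distance: sums absolute differences, damping outliers."""
    return sum(abs(a - b) for a, b in zip(p, q))

def standardise(column):
    """Divide a variable by its standard deviation so that incommensurate
    dimensions (e.g. pitch area vs stadium capacity) carry equal weight."""
    n = len(column)
    mu = sum(column) / n
    sd = sqrt(sum((x - mu) ** 2 for x in column) / n)
    return [x / sd for x in column]

def jaccard_distance(a, b):
    """Disagreement distance between two binary (0/1) term vectors."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 1 - both / either
```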
For example, in document retrieval each main index term can be considered a separate dimension which takes on the value 1 if a document contains that term and 0 if it does not. In this way any document can be located in 'term space' and its distance from any other document measured. In the special case of these binary dimensions there are better options than using Euclidean measures, with the Jaccard and the Dice coefficients being two of the most widely used measures of disagreement distance. (See Chapter 2.3 of HMS (2001) for more detailed coverage.)

Having decided what measure of distance is to be used, it is also important to decide on the 'type' of cluster we are looking for (this is often referred to as the amalgamation/linkage rules). If, for example, we use a method that attempts to minimise the maximum distance between all points in a cluster, known as complete linkage, we will end up with reasonably compact 'spherical' clusters. On the other hand, if we wish to

Figure 4.7 Simple plot showing the distance between the two points (1,1) and (5,4)


ensure that all members are at least similar to some member(s) of the cluster (single linkage), we could end up with dispersed 'sausage' shaped clusters.

Within the discussion of the three main approaches to cluster discovery/generation below we will also attempt to illustrate some of these differences in distance measures and linkage rules. Hierarchical clustering is illustrated in some detail to make a number of generic points, and we also briefly comment on the partition-based and probabilistic approaches.

Hierarchical clusters

Perhaps the most widely used type of clustering, the hierarchical form of the algorithm has become familiar to (though not always understood by!) readers of scientific papers in a variety of disciplines, but particularly in the area of genetics.

The trade-mark output of this type of clustering algorithm is a 'dendrogram', such as the one shown in Figure 4.8. In this case the DNA profiles of 47 salmon sampled from around Scotland have been used as the basis of the clustering. Each profile consists of a multi-locus probe which provides an intensity reading from 26 loci associated with a DNA sequence. (The clustering technique attempts to reduce this complex information – in 26-dimensional space – to a single dimension, the 'distance' between each sample.) The hierarchical clustering algorithm has identified three broad groupings (clusters) – two quite large with around 20 members each and a smaller third group with just five members (shown at the bottom of the dendrogram). The degree of similarity is noted on the horizontal axis, such that some pairs (e.g. at point A) are virtually identical. Indeed all four members of that little sub-cluster would appear to share at least 80% similarity in terms of the DNA 'dimensions' used to measure distance in this example.

What might these clusters mean, or how might they be used? The most obvious example would be to look at some other aspect of the fish samples, say location or type of fish, and see whether these correspond to the DNA clustering. For example, do all of the samples in the first large cluster predominantly come from the West coast? Are the four members of the small cluster all samples of escaped farmed salmon while the others are examples of West and East coast wild fish? Of course in real examples the linkages between other explanatory variables and the clusters will rarely be so neat and tidy, but they may lead to insights not possible before the clustering took place.

The dendrogram is so-called because of its tree-like structure, which is more obvious when the vertical form of the graph is used. The hierarchical algorithm operates by merging points into clusters (agglomerative) or by sub-dividing super-clusters (divisive) until all instances are taken into consideration. The dendrogram provides a graphical illustration of the entire process of merging (or division) and the degree of similarity between all instances, sub-clusters, etc.
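The agglomerative merge loop can be sketched in a deliberately naive form. This is purely illustrative of the single-linkage variant: real implementations cache and update the distance matrix rather than recomputing it, and divisive algorithms work top-down instead.

```python
from math import sqrt

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, n_clusters):
    """Naive agglomerative clustering: start with every instance in its own
    cluster, then repeatedly merge the two clusters whose closest members
    are nearest (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the nearest pair of clusters
    return clusters
```

Recording the merge distance at each step of the loop is exactly the information a dendrogram plots on its similarity axis.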

Figure 4.8 Dendrogram illustrating the clustering of randomly amplified polymorphic DNA profiles from Scottish salmon (Bron, 2003)


It is possible to use a fairly large number of dimensions when creating these sorts of clusters (indeed this is one of their strengths). To illustrate this we return to the cattle disease diagnosis system (CaDDiS) introduced earlier. While it was not the initial intention of this research to use clustering, it proved a useful tool in summarising the knowledge collected from 44 veterinary experts on 27 signs associated with 20 commonly occurring diseases in sub-Saharan cattle. The mean values from the estimation procedure on the likelihood that certain clinical signs will be observed given the presence in tropical cattle of a variety of endemic diseases are shown in Table 4.3. Only a portion of the full 27 sign × 20 disease table is given here. Also, as noted above, the MLE procedure will have estimated a spread for each disease-sign pairing – e.g. of (s = 0.113) for 'fever given Babesiosis' – to allow for the full normal probability distribution with mean 0.86 and standard deviation 0.113.

It seems obvious that certain diseases will be more like one another than others. Indeed the notion of clustering (or classification) is often used in medicine to organise diseases – based on, say, 'sub-system infected' (e.g. cardiovascular, nervous, reproductive, etc.) or 'aetiology' (e.g. viral, bacterial, etc.). However, would it be possible to cluster these tropical diseases according to their relative positions and distances in 'sign space'? The dendrogram resulting from running an agglomerative hierarchical clustering algorithm based on single-linkage in simple Euclidean space is shown in Figure 4.9 overleaf (in this case only the eight most common 'endemic' diseases are illustrated, though the complete 27-dimension sign space was used by the algorithm).

Table 4.3 Results from a survey of 44 African vets on the likelihood of various signs being observed given the presence of a specific disease in sub-Saharan cattle

Disease sign           Anaplas.  Babesiosis  Blackquarter  Brucellosis  ECF    FMD    others

Abortion               0.26      0.29        0.00          0.85         0.15   0.31   …
Anaemia                0.99      0.80        0.00          0.00         0.10   0.00   …
Change in urine        0.44      0.67        0.00          0.00         0.00   0.00   …
Changes in milk        0.00      0.00        0.00          0.00         0.00   0.00   …
Constipation           0.51      0.03        0.08          0.00         0.09   0.00   …
Coughing               0.01      0.00        0.00          0.00         0.86   0.00   …
Dependent oedema       0.01      0.01        0.00          0.00         0.56   0.00   …
Diarrhoea              0.02      0.23        0.00          0.00         0.49   0.00   …
Enlarged lymph nodes   0.01      0.00        0.00          0.00         1.00   0.00   …
Fever                  0.73      0.86        0.92          0.21         1.00   0.89   …
Foot lesions           0.00      0.00        0.00          0.00         0.00   0.78   …
Inappetence            0.75      0.59        0.92          0.16         0.74   0.64   …
Lethargy               0.70      0.55        0.85          0.04         0.86   0.45   …
others                 …         …           …             …            …      …      …
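The clustering of diseases in 'sign space' can be sketched directly from figures of this kind. The Python fragment below is a minimal single-linkage agglomerative procedure, using six of the disease columns from Table 4.3 as 13-dimensional profiles; it is a generic illustration of the technique, not the algorithm actually used to produce Figure 4.9.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Sign-probability vectors taken from Table 4.3 (13 signs per disease).
profiles = {
    "Anaplasmosis": [0.26, 0.99, 0.44, 0.00, 0.51, 0.01, 0.01, 0.02, 0.01, 0.73, 0.00, 0.75, 0.70],
    "Babesiosis":   [0.29, 0.80, 0.67, 0.00, 0.03, 0.00, 0.01, 0.23, 0.00, 0.86, 0.00, 0.59, 0.55],
    "Blackquarter": [0.00, 0.00, 0.00, 0.00, 0.08, 0.00, 0.00, 0.00, 0.00, 0.92, 0.00, 0.92, 0.85],
    "Brucellosis":  [0.85, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.21, 0.00, 0.16, 0.04],
    "ECF":          [0.15, 0.10, 0.00, 0.00, 0.09, 0.86, 0.56, 0.49, 1.00, 1.00, 0.00, 0.74, 0.86],
    "FMD":          [0.31, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.89, 0.78, 0.64, 0.45],
}

def single_linkage(a, b):
    """Single-linkage distance: the smallest pairwise distance between two clusters."""
    return min(dist(profiles[x], profiles[y]) for x in a for y in b)

# Start with each disease in its own cluster and repeatedly merge the closest
# pair until one cluster remains, recording the merge order (the dendrogram).
clusters = [frozenset([d]) for d in profiles]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    d = single_linkage(clusters[i], clusters[j])
    merges.append((sorted(clusters[i]), sorted(clusters[j]), round(d, 2)))
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d}")
```

With these data the first merge pairs Anaplasmosis with Babesiosis – the two tick-borne diseases with the most similar sign profiles in this subset.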

This exercise was initially carried out simply to cross-check that the data which had been collected were internally consistent. However, every vet who has looked at the dendrogram has commented on how much 'sense' the clustering makes (indeed some believe that some kind of 'fixing' was involved – i.e. this was not an automatic computer-generated diagram!). They note in particular that the four diseases in the 'right-hand' cluster are all transmitted by ticks and do indeed show similar signs. They note that the only other disease spread by another agent (Trypanosomosis – which is spread by the tsetse fly) is different from the rest, as is shown by its 'reticent' membership of the left-hand cluster. They also note that the last three diseases are similar in a number of respects and that the most difficult differential diagnosis within endemic cattle diseases is between schistosomiasis and fasciolosis, as they are shown on the diagram as the most alike of all disease pairs.

It is of course just as easy to attempt to cluster in the opposite direction – that is, to see how the various clinical signs might be grouped in 'disease space'. This is arguably a less intuitive and useful approach, but the results based on the same eight endemic diseases are shown in Figure 4.10 to illustrate a slightly different layout. We do not have space here to discuss this diagram or its significance, but note the difference in range on the 'level of similarity' scale – i.e. some signs are very much more like each other than were any diseases. In particular it is not surprising to see that 'dehydration', 'excessive thirst' and 'pica' – the chronic eating of non-food materials – cluster closely together in disease space.

Figure 4.9 Dendrogram showing the clustering of diseases based on the application of a hierarchical clustering algorithm to a sub-set of the data in Table 4.3

Further details of the various agglomerative and divisive algorithms used for hierarchical clustering can be found in Sections 9.5.1 and 9.5.2 of HMS (2001). This includes a discussion of various linkage rules and the application of monothetic and polythetic methods, which use one or many variables at a time respectively.

Partition-based clusters

The tree-like output of the hierarchical approach provides a useful description of the points at which clusters are formed and gives a complete picture of the relationship of every instance to all other instances. In the case of the partition-based approach a more 'natural' interpretation of the word clustering is evident (at least in the two-dimensional case) as we are simply looking for groupings of similar objects. In this approach it is necessary to specify in advance how many clusters are to be formed. The algorithm will then optimally group instances in the space such that they are as similar to other members of their cluster as possible. Typically these algorithms operate by defining an average cluster value (centroid) and measuring the distance of each instance from these centroids so as to optimally separate clusters.

The most widely used approach is the K-means algorithm (not to be confused with the k-nearest neighbour algorithm – see 'Predictive modelling – classification' on page 107 – though some of the ideas used are common to both), which operates by randomly selecting K instances and assuming these to be the cluster centroids. It is an iterative algorithm such that in each iteration the mean vectors for all the instances associated with each of the current K centroids are used as the new centroids in the next iteration. There are other variations of the algorithm which may be more effective if the number of dimensions is particularly high and/or the results are not converging to a stable set of centroids, but they provide the same basic solution. A very elegant illustration of the application of the K-means algorithm to a set of NASA control system data is provided in Chapter 9 of HMS (2001) (pp 304–306).
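The iterative 'assign then re-average' procedure just described can be sketched in a few lines of Python. The points and the choice of K below are invented for illustration; this shows the basic K-means idea only, not the NASA example from HMS (2001).

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=1):
    """Minimal K-means: pick K random instances as initial centroids, then
    repeatedly (1) assign each point to its nearest centroid and
    (2) move each centroid to the mean of the points assigned to it."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Mean vector of each group becomes the new centroid (empty groups keep theirs).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Two made-up blobs of 2-D points; K = 2 should recover one centroid per blob.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
centroids, groups = k_means(points, k=2)
print(centroids)
```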

Figure 4.10 Dendrogram showing the clustering of signs based on the application of a hierarchical clustering algorithm to a sub-set of the data in Table 4.3


Probabilistic model-based clusters

This approach to clustering is the most statistically complex. It considers the probability distributions of the clusters identified and uses the Expectation-Maximisation (EM) algorithm (discussed in detail in Section 8.4 of HMS (2001)) to determine the parameters of the distribution. As they note:

…the formal probabilistic modelling implicit in mixture decomposition [2] is more general than cluster analysis. Cluster analysis aims to produce merely a partition of the available data, whereas mixture decomposition produces a description of the distributions underlying the data. [HMS (2001), p 323]

There is not space here to describe these methods in any detail. They are potentially more flexible than either the hierarchical or partition-based approaches. However, they are also more computationally intensive and the results may be more difficult to present to end users. They also make certain statistical assumptions about the underlying distributions of the data sets being analysed. (Readers who wish to pursue these approaches in more detail are directed to Section 9.6 of HMS (2001).)
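To give a flavour of the EM algorithm mentioned above, the sketch below fits a two-component, one-dimensional Gaussian mixture to synthetic data. The E-step computes each component's 'responsibility' for each point; the M-step re-estimates the weights, means and spreads from the weighted points. All data and starting values are invented for illustration.

```python
import random
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Made-up data: a mixture of two populations with means near 0 and 6.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]

# Initial guesses for the mixture parameters (weights, means, std devs).
w = [0.5, 0.5]
mu = [1.0, 5.0]
sigma = [2.0, 2.0]

for _ in range(50):
    # E-step: for each point, the 'responsibility' of each component.
    resp = []
    for x in data:
        p = [w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])
    # M-step: re-estimate each component's parameters from the weighted points.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk)

print(mu)  # the two estimated component means
```

Unlike K-means, each point here belongs to every cluster with some probability, and the output is a description of the underlying distributions rather than a hard partition.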

Self-organising feature maps/Kohonen networks

We end this discussion of clustering methods with a slightly different technique. In the first place the Kohonen network is different because it is not normally discussed in the context of clustering (for example, it is not covered at all in HMS (2001)). However, although systems using this technique are normally delivered as classification or 'novelty detection' applications, the underlying algorithm is very clearly based on the method of cluster identification. In addition, Kohonen networks, also known as self-organising feature maps (SOFM), are different in that this technique comes from the class of AI tools known as artificial neural networks (ANNs). Almost all ANNs are designed to carry out 'supervised' learning tasks (see for example the Perceptron and MLPs discussed in 'Predictive modelling – classification' on page 95 and in 'Predictive modelling – regression' on page 111) where the network is trained to map a variety of inputs onto output(s). In the 'unsupervised' learning carried out in SOFM approaches only input data are provided. One initial response to this is, 'what is the point?' Without output information what is there for the neural network to learn? The answer is that the SOFM attempts to learn about the structure of the data and in particular to identify clusters/patterns as they appear in the input set.

SOFM/Kohonen networks are sufficiently different from the previously noted approaches to clustering, and also from most other applications of ANNs, that it is worth giving a little more detail on the algorithms involved. Unlike many other ANN architectures a SOFM typically has only two layers, as shown in Figure 4.11 – an input layer and an output layer, also referred to as the 'topological map' layer (most ANNs have a third 'hidden' layer). The units in the output layer are laid out in space – typically in a grid pattern, though a simple 'line' layout (i.e. one-dimensional space) is also used.
In some ways the algorithms used in SOFM are similar to the K-means approach of partition-based clusters in that 'centres' of clusters are identified and associated with particular output nodes. However, the network also operates to ensure that output nodes representing similar elements in the input space are situated close together in the topological map layer. In some ways the whole SOFM approach can be thought of as a crude 2-dimensional grid that can be folded and stretched into the many dimensions of the input space in an attempt to preserve its original structure. In this sense it is inspired by neurology and the known structure of the brain. The cerebral cortex is in fact a flat sheet that is folded into the convoluted shape required to fit into our skulls, and it is known that areas dealing with hand control functions are situated close to those controlling the arm, etc.

[2] For an explanation of mixture decomposition see HMS p 323.


The algorithm itself is relatively straightforward:

• iteratively supply sets of input instances
• for each input see which is the winning node (output neuron)
• adjust the weightings on this node so that it is more like the input case (and as these adjustments are made over time, decrease the effect associated with any one instance)
• in addition use the 'neighbourhood' nodes surrounding the winning neuron to assess the degree of adjustment required to the winning node, but once again through time reduce the scope of the neighbourhood.
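The steps above can be sketched for the simple 'line' layout of the topological map. In the sketch below the learning-rate and neighbourhood schedules are illustrative choices (a simple step neighbourhood rather than the smoother falloff often used in practice), and the two-region input data are invented.

```python
import random
from math import dist

# A one-dimensional SOFM: 10 output nodes in a line, each holding a weight
# vector in the 2-dimensional input space (initialised at random).
rng = random.Random(0)
nodes = [[rng.random(), rng.random()] for _ in range(10)]

# Made-up input instances drawn from two distinct regions of the input space.
inputs = [[rng.gauss(0.2, 0.05), rng.gauss(0.2, 0.05)] for _ in range(100)] + \
         [[rng.gauss(0.8, 0.05), rng.gauss(0.8, 0.05)] for _ in range(100)]

epochs = 30
for epoch in range(epochs):
    rate = 0.5 * (1 - epoch / epochs)        # adjustment effect decays over time
    radius = int(3 * (1 - epoch / epochs))   # neighbourhood scope shrinks over time
    rng.shuffle(inputs)
    for x in inputs:
        # Winning node: the output neuron whose weights are closest to the input.
        winner = min(range(len(nodes)), key=lambda i: dist(x, nodes[i]))
        # Drag the winner and its line neighbours towards the input case.
        for i in range(max(0, winner - radius), min(len(nodes), winner + radius + 1)):
            for d in range(2):
                nodes[i][d] += rate * (x[d] - nodes[i][d])

winners = {min(range(len(nodes)), key=lambda i: dist(x, nodes[i])) for x in inputs}
print(sorted(winners))
```

After training, inputs from the two regions fire different nodes, and the weight vectors of those nodes sit near the two cluster centres – the basis for the labelling and novelty detection described below.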

The distinction between Kohonen networks and other SOFMs is in fact defined in terms of this element of the training. The learning algorithm used in Kohonen networks actually adjusts the weights associated with all the nodes in the winning 'neighbourhood' rather than just the winning node. By the end of a training 'epoch' adjustments will have become very subtle and will affect only the single winning output node, unlike at the start of the process where quite large areas of the network can be dragged towards each of the training examples.

Although there has been no supervised learning during the execution of the algorithm, simply the discovery of clusters in the example set, it is now possible to utilise the SOFM in classification mode or to spot novelty. For any node in the topological map it is possible to retrospectively consider the cases which cause that node to 'fire'. By identifying characteristics of these cases it may be possible to allocate a label to any output node – and thus classify as yet unseen instances. In addition an 'accept threshold' can be set for the nodes in the SOFM which specifies how close any input case must be to the output nodes before they should be associated with them. If this maximum recognised distance is not met for any node then none will fire in the topological map, and this is an indication of a 'novel' case.

This is necessarily a sketchy introduction to SOFM/Kohonen networks, but it illustrates the versatility of ANNs, which can be used in this manner as well as in their more conventional supervised learning (classification and regression).

Figure 4.11 Schematic representation of SOFM/Kohonen network illustrating the two-layer architecture – an input layer and an output (topographic map) layer


4.3 Predictive modelling

While descriptive modelling is an attempt to summarise a data set and has no target goal (i.e. it is unsupervised), the case of predictive modelling is very different. Here there is a specific goal – to try to predict the value of one particular variable of interest given that you know the values of other variables. This type of modelling thus involves a predictive task where the goal is known and some measurement can be made to determine how well the algorithm performs the prediction activity. Thus we might want to know, for example, whether we can predict the value of the FTSE 100 index next month based on current share data and other general economic indicators. Or we might wish to predict the likelihood that a patient has a given disease, given a set of symptoms and a large data bank of patients with this and other disease types. Note that the word 'predict' is used in its most general sense and does not (necessarily) imply any time dimension – i.e. this is not 'forecasting', though predictive models may also be applied in that way.

Thinking point: Forecasting and predicting. Given the comments above, how would you distinguish between these two processes?

Most BI models focus on predicting whether rather than when something will happen, and few support temporal elements. This reflects the fact that temporal elements are typically hard to implement because time does not operate in the same way as many other continuous variables. In particular it is subject to a variety of levels of 'granularity' – a minute, a day, day of week, weekend, month, quarter, year, etc. The types of temporal-based models that are created tend to support:

• timing – e.g. best date to send ski brochures is some time in Week 46
• sequence – e.g. make a customer home call, followed by a mail shot
• timing dependency – e.g. if special offer A is set at £12 this week it should be altered to £9 next week.

The rule association approach – covered in Section 4.4 – can also be used to implement time-related associations.

In predictive modelling the focus of our attention is on one variable – the variable whose value we are attempting to predict. If this variable can take on only categorical values (e.g. Will this customer buy a car: {Yes/No}; Which of these 35 diseases is the patient suffering from: {Disease 1; Disease 2; etc.}) then the task is referred to as classification. If on the other hand we are focusing on a real-valued variable (an interval or ratio variable) the task is referred to as regression. These two types of predictive modelling are discussed below. However, it is worth reminding the reader at this point that certain techniques can be applied to both categorical (classification) and quantitative (regression) variables. For example, the simplest form of artificial neural network, the perceptron, is introduced in the classification section, though the most widespread use of ANNs, in the form of multi-layered perceptrons (MLPs), is in regression tasks. Conversely, while decision trees have some application in the area of regression, their more usual targets are classification problems. I have thus attempted to discuss the techniques under the category where they are most often found in practical BI applications.


Predictive modelling – classification

In all the techniques discussed in this section the variable on which we are focusing is categorical – it is often referred to as a class variable. When looking at the performance of any of the models which can be created, we wish to see whether the correct class is predicted for a given set of input variables for any instance. If the model gets the class right then a loss function of 0 is allocated; if the classification is incorrect a loss score of 1 is given – leading to the simplest, but most widely used, form of scoring in classification models, the '0-1 loss function'. Of course to assess whether or not the model 'got it right' we must have a set of examples whose outcome class is already known. Typically such a set of pre-classified instance data is divided into training data and test data sets. The model is parameterised (supervised learning) on the training set and the resulting classifier (as the classification model is often called) is used on the test set to determine its performance.

While the 0-1 loss function is the most widely used scoring function (and most obviously appropriate in cases with only binary outcomes) it is not always best. For example, in the case of cattle disease diagnosis (described earlier in Section 4.2) we may have a case in our test set that has its class variable (i.e. correct diagnosis) set to Cowdriosis. If on applying the classifier to the set of predictor variables the result is given as anything other than Cowdriosis, then we note this as a 'wrong' outcome and a loss score of 1 is registered. However, from our clustering exercise we know that Theileriosis is much closer to the designated outcome than is, say, Fasciolosis (see Figure 4.9).
Thus we may wish to use an alternative loss function, which penalises a classifier more heavily when the class outcome Fasciolosis is given than a classifier which suggests Theileriosis, on the grounds that the first classification model is 'more wrong' than the second.

One other general point before looking at some specific techniques for building classifiers is that creating various discriminant classes effectively involves introducing some sort of decision boundaries into the solution space. In the 2-dimensional case such boundaries can be easily drawn as lines on an x-y plot. If the function is linear this will be a straight line, or perhaps a series of straight lines if, for example, piecewise linear functions are used. If the functions are non-linear then we may end up with curved (but regular) lines, as in the case of quadratic or polynomial functions, or, in the case of some neural network models which can be highly parameterised, irregular ('wavy') lines. (See Figures 5.3, 6.4 and 5.6 in HMS (2001) for an example which visualises these various options in two dimensions.) In the 3-dimensional case these boundary lines become boundary planes and, as we move to more dimensions, 'hyperplanes'. While these can be more difficult to visualise and can of course add massively to the computational burden of finding adequate classification models, the basic principles remain the same.
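The contrast between the 0-1 loss and a 'more wrong / less wrong' loss can be made concrete with a small sketch. The cost values below are invented for illustration and are not figures from the CaDDiS work.

```python
# A 0-1 loss simply counts misclassifications; a cost-matrix loss lets some
# mistakes count as "more wrong" than others.

def zero_one(actual, predicted):
    return 0 if actual == predicted else 1

costs = {  # illustrative cost of predicting the second class when the first is true
    ("Cowdriosis", "Cowdriosis"): 0.0,
    ("Cowdriosis", "Theileriosis"): 0.4,   # similar signs: mildly penalised
    ("Cowdriosis", "Fasciolosis"): 1.0,    # very different: fully penalised
}

def cost_loss(actual, predicted):
    return costs.get((actual, predicted), 1.0)

print(zero_one("Cowdriosis", "Theileriosis"))   # both wrong answers score alike
print(cost_loss("Cowdriosis", "Theileriosis"))  # a 'less wrong' prediction
```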

Linear discriminant analysis

This was one of the first approaches to be used by statisticians looking to classify objects. The work of Sir Ronald Fisher in the 1930s was at the heart of these early classification attempts (and many of the other useful statistical techniques which we now find packaged within the BI framework). One of Fisher's most famous data sets, first reported in 1937, related to some samples of the iris flower that he worked with. The data related to iris flowers of three different species (Setosa, Virginic and Versicol) for which Fisher had collected a range of physical characteristics, including data relating to the length and width of the flowers' petals and the length and width of their sepals. In total there were 150 instances in his data set with 50 examples of each type of iris – the first four instances are shown in Table 4.4. Fisher was interested in finding a method of being able to classify a new sample flower into one of these three types given that only data on its sepals and petals were available. (Fisher worked in the biological/agricultural domain and so many of his examples were of this sort. However, it is not difficult to see how, within, say, a HRM application, the 'type of iris' might be 'type of employee', the flowers' characteristics could be traits associated with employees, and the outcome – the class variable – their likely promotion prospects!)

The basic idea behind the linear discriminant analysis (LDA) approach is to find a linear combination of the attribute variables, on which a prediction can be based, that produces the best separation into the correct classes of the class variable. This is easiest to illustrate in the simple two-outcome case, say predicting whether a subject of unknown gender is male or female. Let us assume we have only one predictor variable to work with – the height of the subject (in metres). In this case we might create an extremely simple linear discrimination function:

outcome = height/1.7

which would have the following interpretation:

if outcome > 1 then gender = ‘male’; else gender = ‘female’.

Clearly this would not be a particularly good classifier. There are many men who are less than 1.7 m tall and many women who are taller. It would nonetheless give better results than making a purely random guess (which would on average be correct only 50% of the time). It may be that you could add some additional variables, say, 'salary' or 'number of times subject watches football per month' or 'time subject spent shopping' (of course we shouldn't stereotype, but in a sense algorithms in this area depend on members of a class, in this case males, having similar characteristics). So, if we may be permitted to be stereotypical for the sake of this example, these new variables might give us a more complex linear discrimination function:

outcome = 0.35 × height + 0.0001 × salary + 0.05 × football per month − 0.1 × shop hours

with the same interpretation as above.
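As a sketch, the two toy discrimination functions above might be coded as follows; the weights are the deliberately fictitious ones from the text.

```python
# Toy linear discrimination functions: 'outcome > 1' maps to 'male',
# as in the stated interpretation. The weights are illustrative only.

def classify_height_only(height_m):
    return "male" if height_m / 1.7 > 1 else "female"

def classify(height_m, salary, football_per_month, shop_hours):
    outcome = (0.35 * height_m + 0.0001 * salary
               + 0.05 * football_per_month - 0.1 * shop_hours)
    return "male" if outcome > 1 else "female"

print(classify_height_only(1.8))   # taller than 1.7 m
print(classify(1.8, 2000, 4, 2))   # outcome = 0.83, below the threshold
```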

The example above is of course totally fictitious; I have no data on which to either train or test it. We will thus turn back to the iris data, which in addition to other useful characteristics also offers considerably less scope for making politically incorrect statements, and look at it in some detail. In addition to illustrating the use of LDA this allows us to explore a range of issues that are of relevance to the other techniques reviewed later in this section. It also gives us an opportunity to apply some of the principles outlined in Chapter 3 'Exploratory data analysis and visualisation' (i.e. before jumping into the task of creating a classification algorithm for our iris flowers we should 'explore' what the data look like and whether it might be reasonable to attempt to create such a classifier in the first place). The iris data are more complex than our male/female classification in that there are three outcome classes, but Statistica supports a form of discriminant function analysis which allows us to specify any number of decision boundaries, and so we can easily attempt to create a linear function of our four variables to classify the flowers into their three distinct types.

Table 4.4 Sample table showing four instances of iris flowers and their characteristics, presented by Fisher (1937)

Flower  Sepal length  Sepal width  Petal length  Petal width  Iris type

1       5.3           3.7          1.5           0.2          Setosa
2       6.3           3.3          6.0           2.5          Virginic
3       6.7           3.0          5.0           1.7          Versicol
4       5.0           2.3          3.3           1.0          Versicol
…       …             …            …             …            …


This type of linear discriminant analysis is available in most BI tool sets, but I have chosen to use a core data analysis element of the Statistica software package rather than their Data Mining suite, partly due to the fact that it has a wide range of output options which can aid our understanding of the process and partly in recognition of the statistical basis (due to Fisher) of the LDA technique.

We begin with some attempts to visualise our iris data (EDA). Suppose we decided it would be useful to visualise our data as a scatter-plot to see whether the various types of iris separated out from one another in some way. Our first problem would be to decide which of the four variables to plot the iris instances against. In a conventional x-y plot we can of course plot only two variables at a time, giving us six possible scatter-plots to select from. Suppose we chose sepal length and petal length as a starting point, i.e. leaving aside sepal and petal width for the moment; then the plot would look as shown in Figure 4.12.

As can be seen from Figure 4.12 there is some separation, particularly between Setosa and the other two types of iris. However, the separation between Versicol and Virginic is not complete, and to check alternative variable combinations we would need to create the other five scatter-plots. A different approach might be to look first at the 'statistics' of each variable – in the first instance to simply look at the mean values. These are shown in Table 4.5.

Table 4.5 Means for all variables by type of iris

Iris type  Sepal length  Sepal width  Petal length  Petal width  Valid N

Setosa     5.01          3.43         1.46          0.24         50
Versicol   5.94          2.77         4.26          1.33         50
Virginic   6.59          2.97         5.55          2.03         50
All types  5.84          3.06         3.76          1.20         150

Figure 4.12 Scatterplot of sepal length against petal length showing iris type category data



Clearly some of the variables vary a great deal more than others. Sepal width, for example, appears to have similar values across the iris types, while petal length varies much more widely. These differences in variation can be seen very clearly in the box-whisker plot shown in Figure 4.13. It would thus seem likely that petal length will have more chance of discriminating between iris types than will sepal width, which has little inherent variation. This point is borne out by Figures 4.14(a) and 4.14(b), which illustrate the 'range' of values for these two variables when considered by iris type. Once again it would appear that Setosa may be the easiest of the iris types to classify distinctly.

This initial exploration of the data using EDA is valuable – it gives us some feeling for the data as well as the likely factors we might expect to be incorporated into any linear discriminant model for classification. However, the ultimate model produced may still have elements that come as something of a surprise. Remember that in the full linear discriminant model all variables are taken together – thus while sepal width, for example, may have little discriminatory power when used on its own, it may be the very factor which separates out the 'similar' instances of Versicol and Virginic after petal length has been included in the model.

Figure 4.13 Box-and-whisker plot summarising the four attribute variables across all iris types

Figure 4.14 Box-plots by type of iris of (a) sepal width, and (b) petal length


There are a number of ways in which linear discriminant modelling can be carried out. We illustrate here the Forward Stepwise method – which essentially means that the algorithm will add one variable at each step of its processing. This variable will be the one which provides the most discriminatory power to the model at that point in its processing. The step-forward process is halted when one of a number of conditions is met:

(a) there are no more variables to enter into the model
(b) some maximum number of steps, specified at the start of the process, have been carried out
(c) the F-value (a measure of significance) for all remaining variables that could be entered into the model is smaller than the 'F to enter' setting. [3]

Figure 4.15 illustrates the initial dialogue box in Statistica [4] used to carry out the discriminant analysis. Note that there are four independent variables (Sepal_L, Sepal_W, Petal_L and Petal_W) and one 'grouping variable' (Iris_Type) which consists of three outcome groups. (The 'missing data' element can be set either to ignore missing values or to take a mean value – in our case it is of no relevance as we have 150 complete records.)

The next dialogue box, Figure 4.16, allows the algorithm's operation to be specified in more detail. We have selected the 'Forward stepwise' method as described above and will run the process for up to four steps. The 'Tolerance' and 'F to enter/remove' fields are fairly technical issues and relate to the additional discriminatory power of each variable. It is not appropriate to go into a detailed description here, and in many cases adopting the defaults suggested by the algorithm – as we will do here – is a satisfactory approach. By setting the results to be displayed 'at each step' we can take a look at how the algorithm is operating as it progresses.

Figure 4.15 Statistica: Initial dialogue box for specifying elements in discriminant analysis

Figure 4.16 Dialogue box for specifying details of LDA operation

[3] This F-value is a commonly adopted statistic and is the ratio of variation between groups to variation within groups. Only if the variable under consideration for entry into the model to predict the type of iris exhibits large variation from the other iris types, compared to the within-group variation, does it get selected.

[4] All subsequent dialogue boxes illustrated here are also from the Statistica package.

Figure 4.17 illustrates the outcome after the first step in the LDA modelling. Given our initial exploratory data analysis it should not come as a big surprise that the most discriminating variable (and thus the first to be entered into the model) is petal length. We can obtain a range of statistics even at this early stage, as demonstrated by the various options. We will not select any of these at this point but would note that the 'Wilks' Lambda' value (0.059) is an overall measure of the success of the model to this point. A value of 1.0 indicates no discriminatory power while a value of 0.0 would represent a model which gives perfect discrimination.

We will not show the outputs after Steps 2 and 3 but simply note that they resulted in the following:

Step No.  Variable entered  Wilks' Lambda
2         sepal width       0.037
3         petal width       0.025

This of course only leaves sepal length, which may be entered during the 4th step (if it meets the F_to_enter and Tolerance conditions). Following this last step the final model has been produced, as shown in Figure 4.18, and has marginally improved total discriminatory power (Wilks' Lambda = 0.023).

It is possible to see a summary of the contribution of each of the variables which were entered into the model. Figure 4.19 shows that while Sepal_W was the second variable to be introduced to the model, it is in fact Petal_W that provides the second greatest discriminatory power (after Petal_L), followed by Sepal_W and Sepal_L (as seen in the Partial Lambda column).

Figure 4.17 Dialogue in Statistica following first step in LDA


While all this may be interesting to the statistician, what we really want to know is: what are the formulae to apply for future classification, and how good (in terms of correct/incorrect classifications) is the model likely to be? The summarised performance data can be seen by looking at the classification matrix as shown in Figure 4.20. From this it can be seen that for the 150 iris instances provided the discrimination model does a pretty good job. It is 100% correct for all examples of the Setosa iris type and only misclassifies 2 Versicol samples as Virginic and 1 sample of Virginic as Versicol.

Figure 4.18 Dialogue box following final step in LDA

Figure 4.19 Summary following LDA of discriminatory function of each variable included in final model

Figure 4.20 Summary of classification performance of the model
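For readers who want to reproduce this kind of classification matrix outside Statistica, a rough equivalent can be sketched in Python with scikit-learn (an assumption – the module itself works in Statistica, and the exact numbers may differ slightly):

```python
# Sketch: an LDA classification matrix for the iris data using
# scikit-learn (an assumption -- not the tool used in the text).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)   # all four measurements
pred = lda.predict(X)        # NB: same data used for fitting and 'testing'
print(confusion_matrix(y, pred))
print("overall accuracy:", lda.score(X, y))
```

As in the text, fitting and 'testing' on the same 150 instances flatters the model; a held-out test set would give a fairer estimate.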


This leads to an overall correct classification of 98%. However, we must be careful in interpreting this figure. Remember that the same data that created the model have also been used to 'test' its performance, i.e. we had no separate training and test sets of data. This being the case we would expect to get better model performance (though this is unusually good!) and it would be incorrect to state that the model is likely to be able to classify as yet unseen iris flowers with an accuracy of 98% based on the four measures used here.

There is also an output to see specifically which cases were misclassified. These have been placed at the top of the data set and can be seen in Figure 4.21. The lowest value in each row is taken to be the type of iris predicted based on the values given. Thus in Case 5 the lowest score is calculated to be for Versicol while in fact this was an instance of Virginic – the first incorrect prediction. Note that the values are actually expressed in terms of their distance from the centroids calculated by the algorithm for each iris type – 'squared Mahalanobis distance'. While this differs from the case of clustering – there the algorithm tries to create groupings, whereas here we had pre-defined groupings – the notion of estimating a centroid for each group and measuring the distances from this point is the same as was discussed earlier.

The actual formulae that have been used to create the predictive scores are derived from the discriminant functions which were generated during the running of the algorithm. These can in fact be quite complex and we will not dwell on the details here. Using Statistica's 'Canonical Analysis' option a range of information can be found out about these functions.

Figure 4.21 Classification performance for each case of the model with misclassified cases highlighted (*)

Figure 4.22 Dialogue box for analysing various aspects of the final model


In fact two discriminant functions have resulted on this occasion (referred to in Statistica as 'roots'). As you can see, the first function is the most important – having a far larger eigenvalue (32.2) than the second function (0.29). For each of these functions it is possible to get the coefficients associated with each of the output groups – 'coefficients for canonical variables'. These coefficients are given in both their raw (Figure 4.23) and standardised (Figure 4.24) forms:

The raw coefficients are given as these are the values that would be used if multiplying the actual variable values. Thus, for example, for the Root-1 function the Petal_L value would be multiplied by −2.2 and the Sepal_W value by 1.53, etc. (Don't worry about the sign of these values for the moment.) The standardised coefficients are given as these illustrate the 'real' relative effect of each variable in the model – e.g. the relative effect of Petal_L is 0.95 while that of Petal_W is only 0.58. This appears to differ from the raw values, where the value for Petal_W (2.81) is slightly higher than for Petal_L.

However, it must be remembered that the mean value of Petal_L is 3.8cm as compared to Petal_W, which is only 1.2cm, so the effect of multiplying by petal length will be much larger – the standardised values are showing this fact in a quick summary table. These coefficients are used in turn to come up with a final set of classification function parameter values, which are shown in Figure 4.25.

Figure 4.23 Output from final model giving the raw coefficient values for the roots of the two discriminant functions

Figure 4.24 Output from final model giving the standardised coefficient values for the roots of the two discriminant functions

Figure 4.25 Output from final model giving the values to be used as parameters for each variable incorporated in the classification function


Basically the values in the 'Classification Function' table can be used as follows:

- Predicted score (Setosa) = −16.4 × Petal_L + 23.6 × Sepal_W − 17.4 × Petal_W + 23.5 × Sepal_L − 86.3
- Predicted score (Versicol) = 5.2 × Petal_L + 7.1 × Sepal_W + 6.4 × Petal_W + 15.7 × Sepal_L − 72.9
- Predicted score (Virginic) = 12.8 × Petal_L + 3.7 × Sepal_W + 21.1 × Petal_W + 12.4 × Sepal_L − 104.4

Given any un-classified instance of an iris for which you know the four specified measures, you can work out these three predicted scores. Whichever of these scores turns out to be the highest, that is the class of iris which the model predicts you have an instance of. The sign of the coefficients in this table is also of significance in terms of the 'direction' of their impact rather than their relative importance. Those coefficients that are positive increase the predictive score for that species as they grow larger, whereas negative coefficients decrease the predictive scores as they become larger. The raw data together with these formulae and parameter values are available in an Excel spreadsheet in the BI module web area for download and manipulation.
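The scoring rule is easy to check in a few lines of code. The sketch below encodes the three classification functions with the coefficients reported in the text; the sample measurements are invented for illustration (values typical of a Setosa flower):

```python
# Sketch: applying the three classification functions from the text.
# The coefficients come from the 'Classification Function' table;
# the sample measurements below are invented.
def scores(petal_l, sepal_w, petal_w, sepal_l):
    return {
        "Setosa":  -16.4 * petal_l + 23.6 * sepal_w - 17.4 * petal_w + 23.5 * sepal_l - 86.3,
        "Versicol":  5.2 * petal_l +  7.1 * sepal_w +  6.4 * petal_w + 15.7 * sepal_l - 72.9,
        "Virginic": 12.8 * petal_l +  3.7 * sepal_w + 21.1 * petal_w + 12.4 * sepal_l - 104.4,
    }

s = scores(petal_l=1.4, sepal_w=3.5, petal_w=0.2, sepal_l=5.1)
predicted = max(s, key=s.get)   # the highest score wins
print(predicted)                # -> Setosa for these measurements
```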

Excel Activity: Iris data. Use the Excel spreadsheet referred to above to try out some variations on these data and analyses.

We have spent quite a bit of time going through the details of the iris example using linear discriminant analysis. This has allowed us to investigate the data set in some detail as well as to illustrate the fairly complex process of LDA as executed within Statistica. While the LDA approach can be shown to be one of the most efficient of the classification techniques, it is perhaps not the easiest to understand and its outputs do not lend themselves to simple interpretation. The subsequent predictive classification techniques will be treated in less detail, but outputs will be given for the iris example, for sake of comparison, as well as other examples as appropriate.

Decision trees

Arguably this is a more straightforward approach than linear discriminant analysis and it is certainly the case that the outputs produced are more easily grasped by the non-specialist reader. One of the confusions about the decision tree approach is that there are so many competing algorithms: ID3, C4.5, CART (or C&RT), CHAID, etc. In addition it is not always clear how decision trees and association rules differ from one another. In this section we will introduce the approach and help clarify some of this potential confusion.

The independent variables (attributes) on which decision trees are based can be either continuous or categorical. Many of the examples given in introductory texts use categorical attributes and we shall begin with an example of this sort before returning to look at a decision tree approach to the iris data. Non-numeric approaches, using algorithms such as ID3, were popularised within the machine learning and AI community, but they are simply variants of the more general form of tree-structuring algorithms. A well known example given in a number of texts relates to the task of learning whether or not you would play tennis, based on a range of weather conditions. I will borrow this example and simplify it to create a tree which indicates whether to play rugby or not.5 The data set, based on past examples of weather and whether or not a rugby match was played, might look something like the following:


Table 4.6 Initial rows of a data set indicating the status of a range of variables and whether or not rugby was played (the outcome)

Day  Forecast  Temp.  Wind    Humidity  Play rugby?
1    sunny     warm   weak    normal    yes
2    overcast  mild   weak    low       yes
3    rain      cool   strong  normal    no
4    overcast  mild   strong  low       yes
…

From an extended set of data of this type we could run an algorithm which produced a decision tree such as the one shown in Figure 4.26:

The nodes at each branch in the tree represent attributes, with the specific values related to each attribute being shown on the stems connecting each node. The leaves of the tree represent the ultimate outcome/classification. The tree is shown upside-down with the leaves at the bottom and the 'root' node at the top, but this is the convention typically used! Note that not all of the attributes in the sample set (e.g. Humidity) were required in creating the tree – knowing these attributes would not have improved the classificatory power of the tree. Note also that the 'root' node on the tree (Forecast) has three branches. This is important as it implies that the algorithm used in the tree's creation was one which can perform multi-level splits. Many of the best-known algorithms (e.g. QUEST or C&RT) are restricted to using only binary splits, i.e. any parent node can have at most two child nodes, while others, e.g. FACT or CHAID, may make multi-level splits where three or more child nodes are permitted. It is worth noting that any tree containing multi-level splits can in fact be represented as a series of binary splits. One advantage of the multi-way split may be the readability of the tree (and/or rules) produced, which tend to be more compact. However, a number of the multi-split techniques, e.g. THAID, only allow any attribute variable to be used at a single node, which can be an unduly limiting constraint.

Returning to our PlayRugby example, it should be noted that the tree can be represented as a set of rules (a disjunction of conjunctions, to use the formal AI jargon for such representations). Thus rather than representing our outcome as a graphical tree we could create two logical expressions which captured the same information.

- Logical expression for PlayRugby? = NO:
  (Forecast = Rain ∧ Wind = Strong) ∨ (Forecast = Sunny ∧ Temp = Freezing)
  where ∧ is the AND symbol ('conjunction') and ∨ is the OR symbol ('disjunction').

- In a similar way, the second logical expression for PlayRugby? = YES is:
  (Forecast = Rain ∧ Wind = Weak) ∨ (Forecast = Overcast) ∨ (Forecast = Sunny ∧ Temp = ¬Freezing)
  where the additional ¬ symbol has been introduced, representing NOT.
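The two logical expressions can also be written directly as a short function – a sketch in Python, with attribute values spelled as in Table 4.6:

```python
# Sketch: the PlayRugby decision tree expressed as if-then rules.
def play_rugby(forecast, temp, wind):
    if forecast == "rain":
        return wind == "weak"       # rain: play only if the wind is weak
    if forecast == "overcast":
        return True                 # overcast: always play
    if forecast == "sunny":
        return temp != "freezing"   # sunny: play unless it is freezing
    raise ValueError("unknown forecast")

print(play_rugby("rain", "cool", "strong"))   # matches Day 3 in Table 4.6: no
```

Note that, as in the tree, Humidity plays no part in the decision.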

5 You can find another version of this decision in Chapter 4, 'Introduction to handling uncertainty', in the Decision Making module.

Figure 4.26 Decision tree indicating the nodes (variables) with associated values to be included in a decision tree to answer the question 'will rugby be played?' (Root: Forecast; Rain → Wind (Strong: No, Weak: Yes); Overcast → Yes; Sunny → Temp (Freezing: No, ¬Freezing: Yes))


Yet another representation of the tree is to create a set of if-then rules of the following format:

IF (Forecast = Rain AND Wind = Strong) THEN PlayRugby? = NO

Another four rules of this type would be required to describe the full tree. This type of rule is exactly the format used by rule-based reasoning systems and also by the rule association algorithms described in Section 4.4. The main difference with the rules seen here is that they are part of a decision tree, which means that taking the rules together as a set they represent a complete mapping of instances into outcome classes. In addition, because the nodes and branches are tied into a tree structure, they relate to mutually exclusive (disjoint) sets of possible instances. As we shall see later, in the case of association rules such mutual exclusiveness is not required (i.e. two or more rules may be invoked for the same instance), nor are association rules required to cover the whole of the instance space (completeness).

In addition to the ability of algorithms to use multi-level splitting criteria, another key differentiator relates to whether the algorithm can adequately deal with non-categorical predictor variables (attributes). The Classification Tree module of the Statistica package implements the QUEST and C&RT algorithms, which allow for categorical or ordered predictors (or indeed a mix of the two) but are restricted to 'binary univariate splits', i.e. only one variable at any splitting node, which will then have only two child nodes. We will now take the example of Fisher's iris data, see what type of decision tree is created (Figure 4.27), and consider how this compares to the results of the LDA approach.

Figure 4.27 Decision tree produced by Statistica for the Iris data set discussed earlier in this chapter under the LDA approach


The first thing that strikes you is that this appears to be a much simpler model – both in terms of its structure and its interpretation. The classification tree uses only two of the attribute variables: petal length and petal width. Interestingly, the variable which entered the LDA model as the second most discriminating attribute, sepal width, is not included in this model at all. It may be that the LDA model was 'over-fitting' the data, and so in addition to being more complex it would have suffered degraded performance when used on new/unseen examples. In addition to being a simple model, the fact that only one attribute of the iris is required at each split is an attraction of this model. However, in terms of performance it does not do as well as the LDA model. Four of the Virginic samples are misclassified as Versicol and two of Versicol are misclassified as Virginic. As was the case in LDA, all of the Setosa samples are correctly classified.
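A tree of this general shape can be grown in a few lines with scikit-learn's CART-style learner (an assumption – the text uses Statistica's Classification Tree module, so the splits and error counts may differ in detail):

```python
# Sketch: a classification tree for the iris data, analogous to Figure 4.27
# (scikit-learn's CART-style learner, not Statistica).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))
print("accuracy on the same data:", tree.score(iris.data, iris.target))
```

Like QUEST and C&RT, scikit-learn's trees use only binary univariate splits; a depth-2 tree typically splits only on the petal measurements, matching the observation above.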

Nearest neighbour

The basic idea behind this approach is very simple; I call it the 'teenager algorithm' – basically, I'll do whatever my peers do! The point is that for any new instance you simply map the vector into the data set space and find the nearest previously entered instance for which you know the class label – this is then the class which is predicted for this new instance. Thus in the example shown for the two-dimensional case in Figure 4.28 the new point 'X' would be classified as belonging to Group 3.

In fact, although we are only interested in the allocation of a particular instance, this approach implies that a decision boundary exists between the various classes. If we are using simple Euclidean distance to determine 'closeness' of points then the boundary will be determined by a point mid-way along the line joining the two closest points in each class of the training instance set. This is shown in Figure 4.29a and is illustrated for the three class example given in Figure 4.28 in Figure 4.29b. This boundary is referred to as the Voronoi Tessellation. In the more general case this boundary will be a hyperplane rather than a line! Indeed some authors refer to the points within a class as 'Voronoi neighbours'. In many cases the use of simple Euclidean distance between points is not the most effective. The use of Mahalanobis distances was noted in the section above on linear discriminant analysis – this typically involves a transformation on each vector representation of the data points using an inverse covariance matrix transformation which adjusts the distance in accordance with the spread. Another option is to use one of the dimension reduction approaches introduced in Chapter 2 and then measure distances between points in this simplified space.

Figure 4.28 A nearest neighbour approach could be used to allocate the unknown instance 'X', based on a training set for which the class labels are known (1, 2 or 3)

Figure 4.29 (a) The decision boundary between points in different classes, and (b) the boundary illustrated for the training points shown in Figure 4.28


One of the problems with the nearest neighbour approach is that if there are outliers in the training data set these may lead to misclassification of new points. In the example shown in Figure 4.30 the unknown point 'X' would be allocated to Group 2 if a simple nearest neighbour approach were used. To help avoid these effects due to outliers the K nearest neighbours extension to the algorithm is used. This works by expanding the area of 'nearness' to the unknown point until it incorporates not just the nearest neighbour but K neighbours. This process is illustrated by the dotted line in Figure 4.30, which has been stopped at the point where 5 neighbours have been incorporated in the area around the point 'X' (i.e. K = 5). The class of each neighbour within the 'circle' is considered and the most numerous class is deemed to be the 'correct' classification – in the case shown, Group 3 (and of course in the more general case the area of interest around a new data instance will be defined by a hypersphere rather than a circle). The selection of an appropriate value for K is not always obvious. Clearly a value greater than 1 is useful to avoid the problems associated with outliers. At the other extreme a value equal to the total number of points in the training set would not be useful, as this would always classify into the most populous class. In practice a number of heuristic learning approaches can be used which modify the size of K and check the performance of these various levels on a variety of training and test sets.

In addition to the similarity to the K means algorithm (introduced in 'Partition-based clusters' on page 91), this approach shares a number of characteristics with the case based reasoning (CBR) approach of AI (outlined in Section 5.6 on page 169). In CBR the distance is measured in some 'semantic' rather than physical manner, but the central idea of finding the 'closest' example already in existence is directly analogous.

The other interesting thing to note about the nearest neighbour approach is that it does not in fact create a model of the domain. For each classification event it refers back to all of the data in the training set. In this sense it sits uncomfortably in the broad class of 'supervised learning' algorithms, though in the case of K nearest neighbour there is at least the option of using a training/test approach to evaluate the most appropriate level of K.
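Because no model is built, a K nearest neighbours classifier can be sketched in plain Python – the training points below are invented for illustration:

```python
# Sketch: K nearest neighbours by brute force -- no model is built;
# every query scans the whole stored training set. Points are invented.
from collections import Counter
import math

train = [((1.0, 1.1), "1"), ((1.2, 0.9), "1"),
         ((4.0, 4.2), "2"), ((4.1, 3.9), "2"),
         ((7.0, 1.0), "3"), ((7.2, 1.3), "3"), ((6.8, 0.8), "3")]

def knn(x, k=3):
    # sort stored instances by Euclidean distance to the query point
    by_dist = sorted(train, key=lambda p: math.dist(x, p[0]))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]   # majority class among the k nearest

print(knn((6.9, 1.1), k=3))   # -> "3"
```

Setting k=1 gives the simple nearest neighbour rule, with its sensitivity to outliers.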

Figure 4.30 The extension of the approach to consider the K nearest neighbours

The simple perceptron

This was one of the earliest approaches to demonstrate the utility of machine-based classification. It adopts one of the simplest structures and its algorithm concentrates on developing a decision boundary surface, which in this simple form is linear. It was named a 'perceptron' as this was the early name given by AI researchers in the area of artificial neural networks to the core processing units. In fact the perceptron models a single neuron and is based on the 'accumulate and fire when threshold is reached' behaviour found in the human brain (see Chapter 5 for more detail on the neurological basis of the ANN approach). For the purposes of the discussion here we do not need to deal with any of the detail of neural networks, as the function of the perceptron is modelled by the following simple formula:

u(x) = Σi (xi × wi)

where x is a vector representing the i attributes of each instance and w is a set of weights (parameters) associated with these attributes. The values returned by u(x) are compared to a threshold value, T. If a given instance of the value of u(x) exceeds T then the perceptron is 'fired' (indicating an outcome of Class A); otherwise it is not fired (i.e. Class B), thus achieving binary classification. We need to arrange the weightings in


such a way that all instances of Class A will result in u(x) > T, and conversely that all instances of Class B will satisfy u(x) < T. The score function will be the number of misclassifications for a particular set of wi.

For the purpose of simplifying the description of the most widely used perceptron learning algorithm we let the threshold T be equal to 0. In addition, if we transform all the instances in Class B such that xi → −xi, then we simply need to find the set of weights (w) for which u(x) > 0 over all instances. The algorithm starts with some set of weights and uses these to find the value of u(x), and in this way classifies the first instance in the training set. If the instance is correctly classified then the weights remain as they were. If it was not correctly classified (in which case the value of u(x) will be less than zero – remember all instances of Class B have been transformed to their 'negative complements' such that correct classification is determined by u(x) > 0 for all cases) then the weights must be adjusted such that the value of u(x) is increased. One simple way of achieving this increase is to add a small multiple of the misclassified instance vector to the parameter vector, i.e. w = w + dx. The value of d can be a small constant, say 0.1. Thus if the current set of parameters for a four attribute object was w = {4, 2, 7, 3} and x = {2, 1, −7, 3} was an example of an instance in Class A being evaluated, the value of u(x) would be −30 (i.e. a misclassification). This being the case the values of w would be altered by dx to become {4.2, 2.1, 6.3, 3.3}. (This still leads to a value of −23.7 for u(x), which is still a misclassification, but at least the weights have provided an 'improved' estimate.) This evaluation is carried out for all instances in the training set, possibly cycling through the data points a number of times.

Apparently it can be proved (just don't ask me to do it) that if two classes are separable by a linear decision surface then this algorithm will find the values of w to give totally correct classification (always assuming that the value of d is sufficiently small – as this effectively specifies the 'granularity' of the algorithm). If the two classes cannot be linearly separated then it may be that using a sum of squares scoring function will be more effective. It is also possible to extend the basic perceptron approach to allow it to handle cases with more than two classes of outcome. However, the approach is basically limited by its linear decision boundaries. The more general multi-layer perceptron (MLP) model is a much more flexible extension of the basic ideas introduced here, primarily due to the fact that it contains at least one 'hidden' layer which provides for non-linear transformations. The MLP is one of a number of classes of artificial neural networks – it is discussed in 'Clustering' on page 86, as it is used in predictive regression modelling, and also in Section 5.5 on page 164, where ANNs are described in the context of 'intelligent' systems.
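The learning rule w = w + dx can be sketched directly (toy data invented for illustration, with the threshold T = 0 and Class B instances already negated as described above):

```python
# Sketch: the perceptron update rule w = w + d*x (toy data, threshold T = 0).
# Class B instances have already been negated, so every instance should
# satisfy u(x) > 0 once the weights are right.
def train(instances, d=0.1, epochs=100):
    w = [0.0] * len(instances[0])
    for _ in range(epochs):          # cycle through the data several times
        for x in instances:
            u = sum(xi * wi for xi, wi in zip(x, w))
            if u <= 0:               # misclassified: nudge the weights towards x
                w = [wi + d * xi for wi, xi in zip(w, x)]
    return w

# Three (transformed) instances that are linearly separable from the origin.
data = [(1.0, 2.0), (2.0, 1.0), (0.5, 0.5)]
w = train(data)
assert all(sum(xi * wi for xi, wi in zip(x, w)) > 0 for x in data)
```

If the instances were not linearly separable the loop would simply cycle forever (here capped by `epochs`), which is exactly the limitation noted above.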

Support vector machines

In addition to the development of MLPs, and neural networks in general, another technique which grew out of work on perceptrons was the support vector machine (SVM). The perceptron was only able to provide 'correct' classification when the classes were perfectly separable by some decision boundary (the case illustrated in Figure 4.31 for the classes 'solid' and 'unfilled' objects). This line, or in the more general case hyperplane, gave the best performance in terms of generalisability when it was located as far away as possible from all of the data points.

Figure 4.31 Example of a set of 'unfilled' and 'solid' objects in 2-dimensional space perfectly separated by a linear decision boundary


Of course in practice decision boundaries will often not be linear, as in Figure 4.32.

The SVM approach can generalise the separation achieved in the case of the simple perceptron by mapping the original data using a set of mathematical transformations (known as kernels) into an 'enhanced' space. In this enhanced space the mapped data are linearly separable, and this linear decision surface in the enhanced space is equivalent to the non-linear decision surface in the raw data measurement space (as illustrated in Figure 4.33). In addition to it being much simpler to find a linear solution in the enhanced space, the SVM approach uses a scoring function known as the 'margin' which attempts to position the decision boundary in the optimum position between the two classes, thus preserving the benefits of generalisability seen in the simpler linear case with perceptrons.

The SVM approach can in fact support both classification and regression (as can a number of approaches discussed in this section). The specific iterative training algorithm used by SVM models to minimise the error function, together with the regression/classification distinction, leads to four main classes of SVM models, further discussion of which can be found in Vapnik (1998). As noted in HMS (2001):

… practical experience with such methods is rapidly improving, but (parameter) estimation can be slow since it involves solving a complicated optimisation problem… (p358).

As such, support vector machines are a good example of a technique within business intelligence, and specifically in data mining, which may be able to benefit from the advent of 'grid computing' (see 'The BI market and future trends').
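The effect of a kernel can be seen concretely with scikit-learn (an assumption – no package is named in the text): two concentric rings of points cannot be separated by any line in the input space, but an RBF kernel separates them almost perfectly.

```python
# Sketch: kernels let an SVM separate classes that are not linearly
# separable in the input space (scikit-learn; data generated for illustration).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)     # implicit mapping to an 'enhanced' space
print("linear:", linear.score(X, y))  # close to chance
print("rbf:", rbf.score(X, y))        # near-perfect separation
```

The RBF kernel plays the role of the transformation in Figure 4.33: the decision surface is linear in the enhanced space but non-linear in the original measurement space.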

Other approaches to classification

There is in fact a much larger set of techniques that can be applied to the classification problem. As HMS (2001) note, 'many of these are powerful and flexible methods, (developed) in response to the exciting possibilities offered by modern computer power', and 'development and invention have not ended'. The four techniques discussed were chosen as they represented reasonably distinct approaches, while SVMs are both an extension of the basic perceptron idea as well as an example of a 'developing and innovative' approach.

Figure 4.32 Example of a set of 'unfilled' and 'solid' objects in 2-dimensional space for which a more complex (non-linear) decision boundary is required to achieve separation

Figure 4.33 A set of 'unfilled' and 'solid' objects in 2-dimensional space shown in their original input space as well as through an SVM transformation into an 'enhanced' space


Some of the approaches are described in the latter stages of Chapter 10 in HMS (2001), where more detail can be found. Logistic discriminant analysis is based around the relative likelihood of a given instance belonging to Class A compared to the probability that it belongs to Class B. Since there are only two classes, the probabilities must sum to one, so:

p(B) = 1 − p(A)

The logarithm of the ratio p(B|x)/p(A|x) is known as the log odds and can be shown to be a linear function in the xi. This model can be used in most cases where Fisher's linear discriminant model is used. However, it is slightly more flexible in that it allows for multivariate data that are not normally distributed. Whereas LDA assumes normality, logistic discriminant analysis can also provide for 'mixed' models, i.e. where some of the predictor variables are discrete and others are continuous. The use of the logistic method in a regression context is discussed more fully below under 'Generalised Linear models'.

Other classifiers include the naïve Bayes model (which is covered in Section 5.3 on page 148 when we consider uncertainty in AI systems), as well as extensions in the neural network area such as multi-layer perceptrons and projection pursuit (both of which are also applied in regression and are covered in the next section). The development of new mixture models and radial basis functions, as well as the use of 'boosting' to increase the flexibility of classifiers, are covered in Chapter 10.9 of HMS (2001).
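The claim that the log odds is linear in the xi can be verified numerically – a sketch with scikit-learn's logistic regression (an assumption; any logistic fitting routine would do):

```python
# Sketch: logistic discrimination -- the log odds is a linear function
# of the inputs (one coefficient per attribute, plus an intercept).
import math
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
mask = y < 2                      # keep two classes only, as in the text
clf = LogisticRegression().fit(X[mask], y[mask])

x = X[mask][0]
# log odds computed directly from the linear function of the attributes...
log_odds = sum(c * v for c, v in zip(clf.coef_[0], x)) + clf.intercept_[0]
# ...agrees with the log of the ratio of predicted probabilities
p = clf.predict_proba([x])[0]
print(round(float(log_odds), 4), round(math.log(p[1] / p[0]), 4))
```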

Predictive modelling – regression

Models which fall within the definition of 'regression' are concerned with predicting the value of a quantitative variable. However, despite the fact that they try to 'regress' an outcome value for a set of inputs, rather than simply attempting to identify which class an instance might belong to as is the case in classification, there are many similarities between both types of predictive modelling algorithm. Many of the apparent differences are due to an alternative vocabulary. In general the terminology used in regression is closer to mainstream statistics so that, for example, the response variable is often referred to as the dependent (less frequently target) variable, while the predictor variables are known as the independent or explanatory variables.

Another major difference between classification and regression relates to measuring the performance of the generated models. Clearly the 'fitness' of any regression model is just as important as it was in the case of classification. However, the 0-1 loss function is not particularly useful in the case of regression – no model is likely to predict exactly the correct value of the response variable. Thus appropriate score functions must be used to estimate how 'right' any model is. The most widely used measure of model performance is the least squares approach, so called because it tries to minimise the sum of the squares of the differences between the observed values and those predicted or expected by the model. (The squaring is required because the predicted value may be greater or less than the observed value, and by squaring this and adding all such differences we get a consistent estimate of the total amount by which a model 'misses' the correct estimation.) In the case of a linear regression model with a single predictor we end up with a simple regression line.
To illustrate the sum-of-squares calculations we can look again at 'Anscombe's quartet', the data presented in Chapter 3 (see Figure 3.1 on page 63). In addition to having the same mean and standard deviation, the four data sets in fact have the same lines of regression (i.e. the linear function which results in the lowest sum of squares – note that if we allowed quadratic functions a much better fit for Data Set 2 could be found). This is shown for data sets 1 and 2 (Figure 4.34 overleaf). An alternative line through data set 1 is also shown (dashed line) which has an R2 value of 0.32. The fact that this value is lower than the 0.67 of the best fit line illustrates what can be seen visually – i.e. that the new line does not fit the data as well as the regression line.
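This kind of R2 comparison can be reproduced in a few lines (a sketch with invented data points, not Anscombe's actual values; numpy's polyfit finds the least squares line):

```python
# Sketch: least squares line vs an arbitrary alternative, compared by R^2.
# Data points are invented (roughly y = 2x + 1).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

def r_squared(pred):
    residual_ss = np.sum((y - pred) ** 2)    # the 'error sum of squares'
    total_ss = np.sum((y - y.mean()) ** 2)
    return 1 - residual_ss / total_ss

slope, intercept = np.polyfit(x, y, 1)       # minimises the sum of squares
best = r_squared(slope * x + intercept)
alt = r_squared(1.0 * x + 2.0)               # some other line through the data
assert best > alt                             # the regression line fits best
print(round(float(best), 3), round(float(alt), 3))
```

No other straight line can beat the least squares line on this score, which is exactly why it is taken as "the" regression line.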


It is important to note at this point that had we allowed for a quadratic function to be used in the example above, the model would still be linear in its parameters. (The best line of fit in this case is in fact y = −0.127x² + 2.78x − 6, and is an almost perfect fit, i.e. R2 > 0.99.) This is a subtle point that is easy to miss. To give another example involving two predictor variables X and Z, we might find a model:

Y = 2.5 × X + 4.1 × Z³ − 0.4 × X × Z⁻² + 17

This model is still linear in all its parameters and thus Y may have been estimated using a linear regression approach. A nice discussion of these issues with a simple example is given in Chapter 11 of HMS (2001) pp371–373.

Staying with models which are linear in their terms but moving to more than one predictor variable (multiple regression), we will end up with a regression plane. The least squares approach can be reasonably easily extended to this situation, even if illustrating the outcome graphically would be a little tricky, at least for more than two predictor variables. The most common way to express the goodness of fit of the regression plane (or indeed line) is to consider the residuals. Each residual value is the difference between observed and predicted, and the sum of the squares of these residuals is used to calculate the coefficient of multiple determination – more commonly known as the R² (R-squared) value. Formally, the R² value measures the reduction in the total variation of the response (dependent) variable due to the multiple explanatory (independent) variables, and is calculated as follows:

R² = 1 – (Residual_SS / Total_SS)

where Residual_SS is the sum of squares of the residuals (also known as the ‘error sum of squares’) and Total_SS is the total sum of squares.

Understanding what the R² value means and how it is interpreted is critical to all regression analysis, so we will spell this out in a little more detail. The smaller the variability in the residual values relative to the overall variability, the better the ‘fit’ of the regression line/plane. Take the simple case of a single predictor variable (x) on a response variable (y). If x and y were perfectly related there would be no residual variance and thus the ratio would be zero (and the R² value equal to 1). Conversely, if there were no relationship whatsoever between x and y, the ratio of the residual variability of the y variable to the original variance would be 1, thus leading to an R² of zero. In practice the coefficient of determination (R²) will fall somewhere between these extreme values. If, for example, an R² value of 0.7 is found for a regression model, this can be interpreted as meaning that we can explain 70% of the variability based on the explanatory variables in the model – i.e. only 30% is left in the residual variability. Obviously we

Figure 4.34 Scatter plots of the first two data sets in Anscombe’s quartet showing the regression lines and R² values (with an additional line – poorly fitted – for Data Set 1)

Page 113: Revie Business Intelligence

techniques and algorithms from data mining and kdd

113

wish to get the R² value as close to 1.0 as possible, indicating that almost all the variability in the response variable can be predicted by the model.

In most multiple-variable regression models the actual ‘fitting’ value we must use is the Adjusted R². This involves a slight technical adjustment to the initial R² calculation to allow the number of degrees of freedom in the problem to be recognised. Degrees of freedom (dft) are based on sample size – i.e. size of training set minus one – and, in the case of dfr, on the sample size and the number of variables included in the model (i.e. dfr = number of instances minus number of parameters estimated in the model minus one). Formally:

Adjusted R² = 1 – [(Residual_SS / dfr) / (Total_SS / dft)]

In many cases the Adjusted R² and the initial R² will not differ that much, and the interpretation of the Adjusted R² value is precisely the same as discussed.

Before looking at some specifics of regression algorithms a general note of caution should be made. Any predictive model can really only be said to be valid within the limits of the data set for which it has been constructed. Any extrapolation beyond this data set is bound to be problematic. For example, you might find that the average spend per customer (S) is nicely predicted based on your company’s advertising budget (A), according to the model:

S = 2.75 × A – 3

In other words, once you have dealt with the ‘set-up’ costs of the advertising (the –3) you can expect a customer to spend almost three times as much on your product as you spend on marketing it to them. However, let’s assume that this model has been ‘learned’ from advertising spends in the region of £5–£20 per customer (on a variety of products). It would be a foolish advertising executive who would propose that £1,000 was spent advertising a new product (let’s say a new car) per customer on the assumption that the model could be extrapolated to indicate an average spend in return of around £2,750 (we simply have no initial data to inform us of how to expect customers to behave at levels of advertising spend greater than £20).
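One way to honour this caveat in practice is to record the range of the training data alongside the fitted model and refuse to predict outside it. The sketch below uses the hypothetical spend model from the text; the £5–£20 guard is an illustration of the idea, not a feature of any particular library.

```python
# Hypothetical model S = 2.75 * A - 3, 'learned' from advertising spends of
# 5-20 pounds per customer; predictions outside that range are refused.
FITTED_RANGE = (5.0, 20.0)

def predicted_spend(advertising: float) -> float:
    lo, hi = FITTED_RANGE
    if not lo <= advertising <= hi:
        raise ValueError(f"model fitted on {lo}-{hi}; extrapolation is unsafe")
    return 2.75 * advertising - 3

print(predicted_spend(10))  # 24.5
```

A call such as predicted_spend(1000) then fails loudly instead of returning the spurious £2,747 figure discussed above.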

Thinking point Applied analysis Can you think of some work-related situations to which this kind of analysis might usefully apply?

Generalised linear models

We have thus far been focusing on linear models, both in the discussion above and in their use for classification in the case of LDA. These are in fact a special case of the more general class of generalised linear models (GLMs). These types of model can be used to predict responses both for dependent variables with discrete distributions and for dependent variables which are nonlinearly related to the predictors. They have three main advantages over the simpler linear models described up to this point:

(i) non-linear as well as linear effects can be tested
(ii) the predictor variables may be categorical as well as continuous
(iii) the response variable is no longer assumed to be normally distributed, but can instead belong to a family of ‘exponential’ distributions (including the Poisson, binomial and gamma).


These generalisations make the whole regression approach much more flexible. The need to use non-linear relationships (i) is self-evident, but it may be worth considering briefly the advantages of the second and third extensions to the linear model. There are many situations where it is important to be able to give a categorical (or discrete) outcome. For example, if your regression model is attempting to predict which product a customer is most likely to buy, then ultimately the outcome needs to be stated in terms of one of a number of discrete possibilities. Or consider a cargo estimation model in the oil industry. Having a set of cost-benefit outcomes which works from the assumption that the regression model’s output of ‘2.3 oil tankers’ is a reasonable answer is likely to lead to big problems. (It is not really possible to work with 0.1 of an oil tanker. Either we have to go for 2 tankers and leave 15% of the oil undelivered, or we have to go for 3 tankers with the major additional expense this implies. Thus the type of ‘rounding’ that might be quite legitimate in a regression model which estimates the number of trucks needed to deliver cargo in a given month cannot necessarily be scaled up to another problem within the same domain.) Another example which illustrates the utility of both the extensions (ii and iii) noted above is seen in the area of family planning. Despite the fact that we talk of the ‘average family’ having 2.4 children, there is a big difference between 2 and 3 children. Therefore any regression model would have to be able to produce a non-continuous estimate of the number of children. In addition, the distribution of the total number of children per family is highly skewed (i.e. many families have 1 or 2 children, fewer have 3 or 4, very few have 5 or 6, etc.).
This being the case it would not be appropriate to assume a normal distribution for the response variable – in this particular case a Poisson distribution would probably be best suited.

The relationship between the variables in a regression model is sometimes referred to as the link function. As noted, it is important that this is not restricted to being linear in nature. For example, the relationship between the likely cost of your health needs and age is not linear. In early adulthood the average health status of those who are 30 does not vary much from those who are 40 (or even perhaps 50). However, the difference between 60 and 70 year olds is likely to be much more marked. In other words, as we get older the likely costs of maintaining our health do not increase in a linear manner but probably according to some power function. It would thus be imperative in any regression model created to estimate health costs associated with aging that it was possible to model the link function using a non-linear (e.g. power) relationship.

The discussion in Chapter 11.3 of HMS (2001) goes into more technical detail as to the benefits and methods of generalised linear models. In particular they illustrate the use of logistic regression (using a Bernoulli distribution – another of the ‘exponential’ group) for the case where the predicted variable can take on only a binary outcome (actually a probability value between 0 and 1). They also give a nice worked example based on the now infamous Challenger space shuttle ‘O-ring’ data (pp386–387). In addition to illustrating the use of a logistic regression model it also illustrates very clearly the problems of extrapolating from a limited sub-set of data.
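The logistic case can be sketched directly: a predictor that is linear in its parameters is passed through the logistic link so that the output is always a value between 0 and 1. The coefficients below are invented purely for illustration and are not taken from the Challenger data.

```python
import math

# Logistic link: a linear predictor (b0 + b1 * x) squeezed into (0, 1).
def logistic_response(x, b0=-4.0, b1=0.1):
    eta = b0 + b1 * x                # linear in the parameters b0 and b1
    return 1 / (1 + math.exp(-eta))  # the link maps eta to a probability

# The response rises smoothly from near 0 to near 1 as x grows.
for x in (10, 40, 80):
    print(x, round(logistic_response(x), 3))
```

Note that the model is still linear in b0 and b1 – only the link is non-linear, which is what places logistic regression inside the GLM family.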

Multi-layer perceptrons (MLP)

In the section on classification we introduced the simplest form of neural network structure, the ‘perceptron’. The class of models covered in this section consists of sets of perceptrons (neurons) connected together into a network – and thus represents an application of proper ANNs (you cannot really call a single neuron a ‘network’!). Strictly these models belong to the feed-forward, non-linear class of neural networks and are characterised – as their name suggests – by multiple layers. At a minimum the MLP must consist of an input layer of neurons, an output layer, and at least one intermediate or ‘hidden’ layer – i.e. not visible as an input to or output from the ANN. While AI researchers tend to think of MLPs in terms of network architecture and learning algorithms, statisticians tend to think of them as highly parameterised non-linear


models. This makes them amazingly flexible, as they can model even small irregularities on decision boundaries. However, it also makes the parameters of these models difficult to estimate and can lead to problems with overfitting. Many early MLPs were subject to criticism as they made exaggerated claims of performance based on overfitting training set data, only to perform much less effectively on new cases. This danger of overfitting is now much better understood and there are techniques for the appropriate use of training data to guard against overfitting.

Exercise 4.1 Overfitting What is the ‘danger of overfitting’? Give an example.

The mathematical representation of an MLP with just a single hidden layer between a set of input variables x and the output variable y is:

y = Σⱼ wⱼ fⱼ( Σᵢ wᵢ xᵢ )

As was the case for GLM this involves a linear combination of the predictor variables, represented by the i elements. In addition to this, a non-linear transformation is introduced to the model by the fⱼ element. While MLPs may contain many hidden layers, it is actually possible to show that a single intermediate layer is sufficient to model any desired degree of non-linearity (i.e. has the same essential structure as the formula shown). In practice you may want to introduce more than one hidden layer for the purpose of more easily interpreting the resulting network. It is also possible to have network structures where nodes ‘by-pass’ layers – e.g. some of the input nodes are connected directly to nodes in the ‘second’ hidden layer of the model.

While these approaches are highly flexible, the estimation of their parameters can be tricky. While a sum-of-squares loss function is typically used, the number and non-linear combination of parameters can mean that training the network (i.e. fitting suitable parameters) can take a long time. (There are stories told within the MLP community that on occasion the dangers of overfitting have been overcome quite accidentally, simply by the model-builder giving up on additional training of the network!) It is inappropriate to discuss the details of parameter estimation here but some detail is given in HMS (2001) Chapter 11.4.

One other important point to note at this stage relates to the ‘threshold’ logic that operates within the non-linear transformation function. In the perceptron (and the classic neural network model as discussed in Chapter 5) the inputs are summed and if a pre-determined threshold is reached the neuron is ‘fired’; otherwise it is not. This process is implemented as a ‘step function’ with the value 0 until a threshold is reached, at which point the output switches to 1, as shown in Figure 4.35 overleaf. While most MLPs are still sensitive to relatively ‘narrow’ input windows, the output generated in the threshold transformation will typically have a logistic form, as shown in Figure 4.35. This is sometimes known as a ‘sigmoid’ function, or S-shaped curve, and is not the only option – the hyperbolic tangent function is another reasonably common choice. Note that the output is still in the range 0–1 and that it is sensitive to a relatively limited range of inputs, i.e. input values in the range –2 to +2 cover over 95% of the function. However, unlike the step function, the ‘smooth’ output created by these alternatives can be easily differentiated and has other useful mathematical properties when it comes to attempting to parameterise the model.

The fact that the output from an MLP must be in the range 0–1 means that the variable may have to be scaled and/or re-coded to be correctly interpreted. These and other details associated with the general operation of artificial neural networks are covered in Chapter 5.
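The contrast between the hard threshold and the smooth logistic output can be seen by tabulating both. This is a generic sketch of the two standard functions, with the threshold fixed at zero.

```python
import math

def step(x):
    """Hard threshold: fire (1) once the summed input reaches 0."""
    return 1 if x >= 0 else 0

def sigmoid(x):
    """Logistic ('sigmoid') transformation: smooth, output always in (0, 1)."""
    return 1 / (1 + math.exp(-x))

# The step function jumps; the sigmoid moves smoothly through the same range.
for x in (-4, -2, 0, 2, 4):
    print(f"{x:+d}  step={step(x)}  sigmoid={sigmoid(x):.3f}")
```

Because the sigmoid (or tanh) is differentiable everywhere, it is this smoothness that makes gradient-based parameter estimation of the network possible.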


Other techniques for regression

The two approaches described are the most widely known and used in the area of regression analysis. However, in the case of both mainstream statistics (where GLM is the predominant approach) and neural networks (where the MLP dominates) there is a range of alternative methods. Unfortunately these have not gained as much acceptance as they might warrant – partly due to the ‘overselling’ of ANN technology. As HMS (2001) note:

Partly because of this power and flexibility, but probably also partly because of the appeal of their name with its implied promise, they have attracted a great deal of media attention. However, they (ANNs) are not the only class of flexible models. Others, in some cases with an approximating power equivalent to that of neural networks, have also been developed. Some of these have advantages as far as interpretation and estimation goes. (p393)

They go on to mention a number of prime examples of highly parameterised models which are often overlooked. In particular they mention generalised additive models, which extend GLM by allowing for weighting of transformed versions of the predictor variables, and projection pursuit regression which, rather than focusing on one variable at a time when fitting the regression model, provides a mechanism to allow consideration to be given to combinations of predictor variables. We do not have space to discuss these here but the interested reader may look at Chapter 11.5 of HMS (2001) as well as some of the ‘further reading’ suggested by them in Chapter 11.6.

Figure 4.35 Simple input/output response function illustrating the difference between the step function and the logistic function


4.4 Association rules and patterns

This is perhaps the most straightforward of all the approaches discussed in this section. Association rules do ‘exactly what they say on the tin’ (as the advert goes) – they are rules which state how things are associated. Here we turn away from attempting to predict or even describe the whole of our problem domain. We are not looking for any global models but rather for specific local concepts that relate to some aspect of our data set. Thus a Web tracking firm might find that 15% of their target browsing population visited both Yahoo and Google in the past week. A bank might find that if a customer closes her main account but currently has her mortgage with the bank, there is a 40% chance that she will move her mortgage within a year of the account closure date. Notice that while these are potentially useful pieces of knowledge, they tell us nothing about the other 85% of the Web browsers (in the case of the Web tracker), nor about the chances of anyone who remains a bank customer moving their mortgage within the year.

Despite the apparent simplicity of the association rule, there are a number of issues that are not that straightforward and require resolution. The first, and perhaps most obvious, is how are all the potentially interesting relationships in a data set going to be investigated? Take, for example, the ‘market basket’ analysis used by supermarkets. A typical supermarket might stock 6,000 to 10,000 items. Even if we assume that only 1,000 of these are sufficiently widely purchased and that we only consider pairs of jointly purchased products, so as to give rules of the form, ‘2.7% of customers buy both Brie cheese and red wine’ (as opposed to patterns where, say, three items were bought together), there are 1000 × 999/2 – almost 500,000 – possible patterns to investigate. In the case of the Web tracking company, where no ‘closed set’ of possible websites exists, the number of potential patterns is to all intents and purposes infinite. Fortunately life is not quite as random as that and the patterns we tend to find are related in various ways to one another. This section will discuss the strategies and techniques for making this rule discovery both feasible and useful. We begin with some basic principles and definitions.

Representation of rules

Formally, rules consist of a proposition (known as the ‘antecedent’) and a conclusion (the ‘consequent’). Thus we might state:

... if the driver is under 21 and the car is a Ferrari then the insurance will be high.

The antecedent, on the left-hand side of the rule, specifies the conditions while the right-hand side of the rule indicates the outcome. Both ‘sides’ of the rule are in fact Boolean statements (i.e. statements which can only take on the values TRUE or FALSE) about the domain under discussion. The rule links these two statements (propositions) together by stating that if the left-hand side of the rule is true, then the right-hand side must also be true. It is also possible to define probabilistic rules which state that if the left-hand side is true then the right-hand side will be so with probability p. (Rules are one of the most widely used knowledge representations in artificial intelligence and are discussed in more detail in Chapter 5.)

The most usual type of variable to have in the conditional part of a rule is a categorical variable, but there is nothing to stop the creation of Boolean expressions which specify levels of a real-valued variable (potentially on both sides of the rule – though more normally this would be used in the antecedent). So we might state, ‘if spend > £300 then card payment (p = 0.98)’ – i.e. of all the customers who have spent over £300 in our shop, 98% of them have paid with a card rather than cash. This may seem similar to the types of rule we created in the decision trees discussed in ‘Probability distributions (density estimation)’ on page 81 – and indeed it is. We will discuss the similarities and differences between association rules and decision trees later in this section.

Rule support and confidence

Any discussion of how data mining tools go about finding association rules must consider the issue of an itemset and its frequency. This is perhaps best illustrated in terms of an example and we shall use ‘market basket’ analysis. However, I have decided to use a newsagent (rather than a supermarket) just for a bit of variety. Table 4.7 illustrates a series of ten customers who have made purchases across five product types from our newsagent. (I know that you don’t normally use a ‘basket’ in the newsagent’s but in the spirit of the example… Also in market basket analysis it is normal to identify the ‘raw’ product bought while our example illustrates categories of purchases – now I almost wish I had stuck to the supermarket!)

Table 4.7 Illustration of 10 ‘basket’ transactions at a newsagent’s shop indicating the types of goods sold within each transaction

‘Basket’  Paper  Magazine  Cigarettes  Drink  Sweets
b1        1      1         0           0      1
b2        1      0         0           1      1
b3        1      1         1           0      0
b4        0      1         0           0      0
b5        1      1         1           0      0
b6        1      0         0           1      0
b7        0      0         0           0      1
b8        0      1         0           0      1
b9        1      0         0           0      0
b10       1      1         1           0      0

(‘Drink’ refers to soft drinks but I used this header so we can refer to each column by its initial letter – and of course there are many more item categories not shown here.)

In the table a ‘1’ represents that this category of product was bought by that customer in that basket of goods, while a 0 represents that the product category was not purchased. The table shown can be mathematically represented as an ‘indicator matrix’ (of dimension i by j), in our case with 10 transactions (i) and 5 item categories (j). This matrix would typically be very sparse – imagine the typical supermarket with 6,000 items compared to the 50–200 items you might buy in a single transaction, or even the 2–3 items you might buy at the newsagent’s compared to the 50–100 categories of items there might be. I have had to make our indicator matrix a little less sparse to illustrate some points.

An itemset is simply a row in the table for which one or more columns are set to the value 1. Thus the itemsets for row b1 in our table would be {P}, {M}, {S}, {PM}, {PS}, {MS} and {PMS}. For a given itemset pattern α we state that the frequency fr(α) is the number of cases in the data that satisfy α. To isolate frequently occurring patterns we can simply set a frequency threshold and consider all itemsets against this value. If we set the threshold to 0.3 then the frequent sets across the data set would be {P}, {M}, {C}, {S}, {PM}, {PC}, {MC} and {PMC}. We are interested only in itemsets with more than a single item in them (or else there would be no left and right hand sides!) and from these we can see that for a given itemset pattern α we can create rules of the form α → β (i.e. alpha ‘implies’ beta). Thus we could create a rule Paper → Magazine from our purchase data. The frequency with which this itemset occurs, fr(α ∧ β), is also


referred to as the support. The accuracy (often referred to as the confidence) of any association rule α → β is the fraction of the rows which satisfy β out of all those rows that satisfy α, i.e.:

c(α → β) = fr(α ∧ β) / fr(α)

Thus our rule (Paper → Magazine) has a frequency of 0.4 (4 out of 10 customer purchases contain this set of product types, which is how it qualified for frequent set inclusion in the first place) and an accuracy of 4/7. In a similar way:

• fr(Paper → Cigarettes) = 0.3; c(Paper → Cigarettes) = 3/7
• fr(Magazine → Cigarettes) = 0.3; c(Magazine → Cigarettes) = 3/6
• fr(Paper ∧ Magazine → Cigarettes) = 0.3; c(Paper ∧ Magazine → Cigarettes) = 3/4

The rule frequency (support) tells us how often the rule might be applicable, while the rule accuracy (confidence) indicates the level of belief we might have that the right-hand side of the rule is in fact true, given the left-hand side.

Note: These rules can of course be just as easily expressed in the ‘opposite’ direction, e.g. (Magazine → Paper). The frequency will remain the same but the confidence may change, e.g. c(Magazine → Paper) = 4/6. In addition to having moved away from global models to local patterns, we also no longer have a particular variable which receives the focus of our attention – i.e. there is no response variable. Of course any algorithm for finding rule associations can easily be extended to focus on a particular variable, or variables, simply by specifying that only rules with this variable(s) showing as a consequent should be included in the association rule set.
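The support and confidence figures above can be checked mechanically. The sketch below encodes the ten Table 4.7 baskets as sets, with P, M, C, D and S standing for the five column headings:

```python
# The ten newsagent baskets from Table 4.7, one set of item categories each.
baskets = [
    {"P", "M", "S"}, {"P", "D", "S"}, {"P", "M", "C"}, {"M"},
    {"P", "M", "C"}, {"P", "D"}, {"S"}, {"M", "S"}, {"P"}, {"P", "M", "C"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Fraction of baskets satisfying lhs that also satisfy rhs."""
    return support(lhs | rhs) / support(lhs)

print(round(support({"P", "M"}), 3))       # 0.4
print(round(confidence({"P"}, {"M"}), 3))  # 0.571  (= 4/7)
print(round(confidence({"M"}, {"P"}), 3))  # 0.667  (= 4/6)
```

As the last two lines show, reversing a rule leaves its support unchanged but can alter its confidence.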

Generalisations

Although going through all the possible sets of combinations in our newsagent example may have appeared a bit laborious, there were in fact a few generalisations that could be made along the way. For example, as the itemset {D} did not qualify for inclusion in the frequent sets there was no need to consider any combination of ‘Drink plus any other purchase type’, as these could not, by definition, be included in more complex frequent sets. This observation and a couple of rules from set theory combine to give a simple but efficient algorithm known as the Apriori algorithm, which can be used to construct candidate itemsets from large sparse indicator matrices (details of the algorithm can be found in Chapter 13.3.2 of HMS (2001)).

While the Apriori algorithm shares some similarities with rule induction algorithms, it differs fundamentally in that it does not have a pre-defined conclusion to work with. In addition it must consider computational efficiency as it typically expects to work on very large sparse matrices. While Ross Quinlan and others have been working on induction algorithms (e.g. ID3) since the late 1960s, most of the work on rule association is much more recent. The Apriori algorithm, for example, was first documented in 1994. While the algorithms may differ, it is nevertheless a fact that the rule representation format of association rules is identical to that which can be used to codify decision trees. We now turn our attention to these similarities and, perhaps even more importantly, differences.
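The downward-closure observation (‘no superset of an infrequent set can itself be frequent’) is the heart of Apriori, and a minimal level-wise sketch of it fits in a few lines. The basket data repeats Table 4.7; the candidate-generation step here is deliberately naive compared with the full algorithm described in HMS (2001).

```python
from itertools import chain

baskets = [
    {"P", "M", "S"}, {"P", "D", "S"}, {"P", "M", "C"}, {"M"},
    {"P", "M", "C"}, {"P", "D"}, {"S"}, {"M", "S"}, {"P"}, {"P", "M", "C"},
]
threshold = 0.3

def freq(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

# Level 1: frequent single items ({D}, at 0.2, is eliminated here and so
# never appears in any larger candidate).
level = [frozenset([i]) for i in sorted(set(chain(*baskets)))
         if freq(frozenset([i])) >= threshold]
frequent = list(level)

# Level k: candidates are unions of frequent (k-1)-sets, then re-checked.
while level:
    size = len(level[0]) + 1
    candidates = {a | b for a in level for b in level if len(a | b) == size}
    level = [c for c in candidates if freq(c) >= threshold]
    frequent.extend(level)

print(sorted("".join(sorted(s)) for s in frequent))
# ['C', 'CM', 'CMP', 'CP', 'M', 'MP', 'P', 'S'] - the eight frequent sets above
```

Because {D} falls at level 1, none of its supersets is ever generated, which is exactly the pruning the text describes.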


Association rules vs decision trees

Any decision tree of the type introduced in ‘Probability distributions (density estimation)’ on page 81 can be represented as a set of rules. For example, Figure 4.36 shows the relationship between Y and a series of variables {A, B, C, D} in the form of a tree. Basically, rules are formed as a conjunction of all the nodes (with their respective conditions) as the left-hand side and the class label associated with any ‘leaf’ of the tree as the right-hand side. Thus, for example, the following rules can be derived from the tree, taking the left-most and right-most branches respectively:

A ∧ (B = ‘a’) ∧ ¬D → Y = .T.

¬A ∧ (C ≥ 4) → Y = .T.

If the format of the representation of these relationships is equivalent, then are not decision trees and association rules more or less the same thing? No, they are in fact very different – most fundamentally in their focus on models as opposed to patterns. A decision tree (as used for predictive classification) and the rules which make up the tree are both mutually exclusive and exhaustive. This means that:

(a) for any given instance within the data set there will be only one rule that specifies what its outcome class will be, and

(b) every instance in the data set will be covered by a rule.

(Note: due to lack of space the tree shown in Figure 4.36 does not cover the whole input space and is therefore not exhaustive, though it is mutually exclusive. It would only be exhaustive if we had set some additional rules such as, ‘when A is true it is never valid to consider a value for C’, etc.) Neither of these properties is a necessary characteristic of a set of association rules – in fact, almost by definition, given the sparse nature of the types of data for which this technique is used, it is most unlikely that they will ever prove to be exhaustive.

However, the very exhaustiveness of decision trees can be a problem when it comes to interpretation of the rule set. Take for example the following simple disjunctive rule:

if (V ∧ W) ∨ (X ∧ Y) → Z = 1

This can be transformed into the following two simple rules:

if (V ∧ W) → Z = 1

if (X ∧ Y) → Z = 1

Thinking point Disjunctive expression Why is the following expression described as ‘disjunctive’? (V ∧ W) ∨ (X ∧ Y) → Z = 1


Figure 4.36 Relationship between Y and the set of variables {A, B, C, D} represented as a tree, for which a set of rules can also be constructed


The tree version is unfortunately not so easily separable as it requires a single root node (illustrated in Figure 4.37 as ‘V’), even though that node (e.g. V, and for that matter, W) is not strictly required in rules that refer only to X and Y.

An alternative method for creating association rule sets is to generate a classification tree and then prune the branches. To begin with, all branches of the tree are considered as possible rules. Each branch/rule is then checked to see whether dropping conditions from the left-hand side of the rule affects the accuracy of the rule with respect to the instance set the rule would now cover. If the accuracy is unaffected – or indeed improves – then that condition may be removed from the left-hand side to produce a simpler rule that is at least as accurate. In addition to the case of disjunctive conditions such as the example shown above, where (V ∧ W ∧ X ∧ Y) → Z = 1 will simplify to (V ∧ W) → Z = 1, there will typically be many other conditions which are added during the tree-growing phase of a classification algorithm due to their ‘average’ improvement of the solution as a whole. When looking only at a sub-set of the instances, however, it may become clear that these conditions are not required (and indeed may lead to decreased accuracy). Using this method a large fraction of the potential rules (as coded in the tree) will disappear, leaving a relatively simpler association rule set.

The obvious question to which the previous paragraph begs an answer is, ‘why not just look for the association rules directly?’ (i.e. using, for example, the Apriori algorithm). The answer is that the methods developed within tree classification algorithms are very efficient and in particular have good automatic methods of dividing up variables which are not categorical into discrete partitions. Given this fact and the relatively straightforward nature of the post-processing that has to be carried out on the tree to generate the rules, this is often a reasonably sensible approach. However, the method of rule searching will be biased in a particular way, and the definition of a rule’s ‘interestingness’ is taken out of the hands of the analyst and constrained to the relatively technical definition imposed by the rules of tree pruning.
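The condition-dropping step can be sketched concretely. The toy instances below (invented for illustration) make Z = 1 exactly when (V and W) or (X and Y); starting from the over-long branch rule, conditions are dropped whenever the rule’s accuracy on the instances it covers does not fall. Depending on the order in which conditions are tried, either (V ∧ W) or (X ∧ Y) survives.

```python
from itertools import product

# All sixteen truth assignments; Z = 1 exactly when (V and W) or (X and Y).
data = [dict(V=v, W=w, X=x, Y=y, Z=int((v and w) or (x and y)))
        for v, w, x, y in product([0, 1], repeat=4)]

def accuracy(conditions):
    """Accuracy of 'conditions -> Z = 1' on the instances the rule covers."""
    covered = [d for d in data if all(d[c] for c in conditions)]
    return sum(d["Z"] for d in covered) / len(covered) if covered else 0.0

rule = ["V", "W", "X", "Y"]        # the branch (V and W and X and Y) -> Z = 1
pruned = True
while pruned:
    pruned = False
    for cond in list(rule):
        trimmed = [c for c in rule if c != cond]
        if trimmed and accuracy(trimmed) >= accuracy(rule):
            rule, pruned = trimmed, True  # condition dropped, accuracy kept

print(rule, accuracy(rule))  # a two-condition rule with accuracy 1.0 remains
```

With the left-to-right order used here the pass settles on the (X ∧ Y) conjunct; trying Y or X first would instead leave (V ∧ W). Either is a valid simplification of the original four-condition branch.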

Thinking point ‘Interesting’ What are the criteria for a rule to be ‘interesting’?


Figure 4.37 Tree representation of a simple disjunctive rule-set, illustrating the inefficiency of this format


Ways of defining the ‘interestingness’ of rulesThere are some fairly obvious examples of rules which we can identify as not being ofinterest. For example, if a building society knows that 5% of its customer base (total-ling 2.5 million customers) will alter their mortgage over the course of a year then thefollowing left-hand sides for the rule ‘if antecedent then customer will alter mortgage’are not interesting:

Clearly we can guard against this type of uninteresting rule by setting appropriate‘threshold’ levels for accuracy (confidence) and frequency (support). The accuracythreshold should be set at the very least a little higher than the ‘background’ accuracy(i.e. the a priori chance of finding the consequent). Setting the threshold for frequency isa little more tricky – clearly rules with very low coverage are seldom of much interestbut occasionally these ‘rare’ cases may provide useful insights. In general interesting-ness increases with accuracy, assuming that coverage is fixed, but the converse is notalways the case. HMS (2001) suggest a number of additional algorithmic tricks whichcan be used as criteria for interestingness. These include constructing a 2 ! 2 contin-gency table for the frequencies of what amount to the true positives, false positives,true negatives and false negatives, and computing their various Chi-square scores. Theyalso introduce a ‘particularly useful measure of interestingness’, the J-measure. Inter-ested readers are directed to Chapter 13.6.3 of HMS (2001).Ultimately the interestingness of association rules will be dictated by the domain andthere is no substitute for local problem understanding in defining and assessing thefinal utility of a rule. Rules which the user already knows about or which should not beconsidered for legal reasons (e.g. the inclusion of gender or ethnicity as a condition of acredit-scoring rule) can only be defined based on domain knowledge. In some cases itmay be possible to link the rules to their likely value. For example, in a market basketanalysis the rule could be scored in proportion to the sum of the sale value of all theitems in its antecedent (on the assumption that rules relating to high value items willbe of greatest interest). In all of this search for rules the basic principle of Occam’s razorshould also be adhered to – i.e. the simpler the rule the better.

Table 4.9 Some examples of rules which are not interesting in the context of a building society and the likelihood of its customers altering their mortgage

Condition                    Conclusion         Accuracy   Coverage
None                         altered mortgage   5%         100%
if balance to pay > £25k     altered mortgage   5%         75%
if detached house            altered mortgage   5%         25%
if Account No. = 12AZ 6789   altered mortgage   100%       1 in 2,500,000
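The support/confidence screening described above can be sketched in a few lines of Python. The transaction representation, the function names and the 0.05 'margin' over background accuracy are all invented for illustration; they are not from the module:

```python
# A sketch of screening association rules by support (coverage) and
# confidence (accuracy) against the 'background' accuracy discussed above.

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule 'if antecedent then consequent'.

    Support here is the fraction of transactions in which the antecedent
    holds; confidence is the fraction of those that also contain the
    consequent."""
    n = len(transactions)
    covered = [t for t in transactions if antecedent <= t]   # antecedent holds
    hits = [t for t in covered if consequent in t]           # consequent too
    support = len(covered) / n
    confidence = len(hits) / len(covered) if covered else 0.0
    return support, confidence

def is_interesting(transactions, antecedent, consequent,
                   min_support=0.01, margin=0.05):
    """Keep a rule only if its coverage is non-trivial and its confidence
    beats the a priori chance of the consequent by some margin."""
    n = len(transactions)
    background = sum(1 for t in transactions if consequent in t) / n
    support, confidence = rule_stats(transactions, antecedent, consequent)
    return support >= min_support and confidence > background + margin

# Tiny worked example: 'a' is a good predictor of 'b'; 'c' is not.
tx = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"a"}, {"c"},
      {"c"}, {"b", "c"}, {"c"}, {"c"}, {"c"}]
print(is_interesting(tx, {"a"}, "b"), is_interesting(tx, {"c"}, "b"))   # True False
```

Note that this filter would reject all four rules in Table 4.9: the first three because their confidence never beats the 5% background rate, and the last because its coverage (1 in 2,500,000) falls below any sensible support threshold.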



5 techniques from ‘intelligent’ systems and artificial intelligence


Contents

Learning outcomes for this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.1 Reasoning and logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.2 Knowledge representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.3 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.4 Rule-based inference systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .160

5.5 Neural networks (ANN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .164

5.6 Other techniques from AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169

5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

Activities

Exercise 5.1     Induction of rules . . . 129
Thinking point   Soluble pills . . . 130
Thinking point   Implication . . . 134
Thinking point   A proverb . . . 135
Exercise 5.2     Using predicate calculus . . . 135
Thinking point   Some questions easier than others . . . 142
Thinking point   Uncertainty vs ignorance . . . 151
Exercise 5.3     CaDDiS . . . 158
Thinking point   Similarity and difference . . . 171


Learning outcomes for this Chapter

When you have completed this Chapter you will be able to:

• describe the range of knowledge representation formalisms which exist within AI and their potential use in BI applications
• appreciate the formal schemata which exist for logical reasoning and the importance of defining semantic and syntactic issues in this context
• explain the ways in which uncertainty can be modelled within intelligent systems
• appreciate the difference between symbolic and sub-symbolic processing, and in particular the relative strengths of rule-based and ANN systems
• identify and describe novel or less widely adopted approaches to intelligent system design and describe their potential impact on future BI systems (including fuzzy logic, case-based reasoning, genetic algorithms, etc.)


Introduction

While the word 'intelligence' appears in the title of our BI module and we have mentioned the links between BI and the areas of artificial intelligence (AI) and machine learning, we have yet to describe formally what these linkages are. In much of the BI literature you will search in vain for a mention of 'AI' or 'expert systems'. This is largely because these technologies are seen to have failed in terms of delivery on the promises made in the late 1980s, and BI suppliers do not wish to be tarred with that particular brush. It is true that the capabilities of AI in the 1980s were over-stated by a number of suppliers, with a consequent sense of disappointment when 'intelligent' systems failed to deliver on their promises. However, in BI the word 'intelligence' is not simply there as a marketing hook to appear advanced and sophisticated; rather it is used because of the attempt to incorporate a range of techniques which come from AI into the typical suite of BI applications.

Of course within the field of AI and more generally (see, for example, Steven Spielberg's film of the same name) there is much debate as to exactly what 'intelligence' might be. We do not have enough space (or, perhaps, enough interest) to get involved in philosophical debate at this point. Instead we adopt a fairly pragmatic definition that allows for any action which, if carried out by a human, we might describe as 'intelligent'. This approach was first popularised by Alan Turing in his famous Turing Test, which adopted this 'black box' approach and judged a system to be intelligent in terms of its outputs rather than by how it happened to be operating internally.

In this Chapter we describe the types of reasoning which are possible within an AI context as well as the various methods by which knowledge can be represented. Given the importance of dealing with uncertain knowledge within BI systems, a section dealing specifically with this issue is included. The most successful applications from AI – which are, arguably, rule-based expert systems and artificial neural networks – are then described in some detail, together with notes on how these approaches have been incorporated within BI products.


5.1 Reasoning and logic

Many approaches to logical reasoning have been adopted by AI researchers with a variety of strengths and weaknesses apparent in each. A range of specific representation schemes will be discussed in Section 5.2, but first it is worth giving an overview of the general types of reasoning that can take place. These are broadly based on the types of reasoning we might find in human intelligence.

Induction: reasoning from the specific to the general

We have already come across a number of techniques that make use of the inductive technique. Many of the predictive modelling techniques covered in Chapter 4 would fall into this category – indeed any type of supervised learning algorithm would notionally qualify, as they all involve taking a set of specific examples and working out a general rule/model/relationship. This has obvious parallels in the area of human reasoning when you look at, for example, the way young children develop rules/models. A child falls off a chair; off a step and off the side of the bed. On each occasion the child may be hurt (with luck only in very minor ways). Now the child has to think about the consequences of falling off this wall – is it likely to hurt or not? Well, she has never actually fallen off a wall before but she has learned (inductively) from a number of other instances of falling off objects that it is likely to hurt and so takes precautionary action. Aristotle summed this up in his simple statement:

What we learn to do, we learn by doing… (Aristotle, Ethics)

There are many examples of this type of generalisation which we engage in as humans. Often the examples we have to learn from are fairly straightforward and do not require any complex algorithm to work out the model/rules. At other times we are not sure how exactly we have come to a particular model of the world but feel fairly sure it is accurate, or at least good enough to influence our actions in some way. In all cases we see the following relationship at work:

Facts + Conclusions → Rules

Take for example the simple case of children and whether or not they are currently attending school. We have a series of facts (instances, examples) and the conclusion associated with each: attending/not attending school.

Name    Age   Gender   At school?
Fiona   7     female   yes
Joe     3     male     no
Frank   9     male     yes
Linda   8     female   yes
Betty   2     female   no

Exercise 5.1 Induction of rules From these facts and conclusions what rules might we induce?


Notice that the quality and specificity of the induced rules will depend on the range of example cases provided. As we had no examples of 4–6 year olds our rules are not as useful as they could have been. In addition we did not use the attribute 'gender'. Had Betty been male and Frank female then we might have come up with a rule about males not attending school which would have been incorrect. Even with this set of data we could have 'overfitted' the rules to specify different starting ages for girls and boys. This would again have been unhelpful and was avoided in our informal algorithm by contextual knowledge that we brought to bear. Hopefully in an automatic inductive learning environment the volume and range of examples would help guard against such inadequate rules.
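The informal induction we just performed on this table can be sketched as a tiny Python search for a consistent age threshold. This is an assumed illustration, not the module's algorithm, and note that it only tries the ages actually observed – which is precisely why the missing 4–6 year olds leave the true school starting age ambiguous:

```python
# A sketch of inducing a single age-threshold rule
# 'at school iff age >= t' from the example cases above.

examples = [
    ("Fiona", 7, "female", True),
    ("Joe",   3, "male",   False),
    ("Frank", 9, "male",   True),
    ("Linda", 8, "female", True),
    ("Betty", 2, "female", False),
]

def induce_age_threshold(cases):
    """Return the smallest observed age t such that the rule
    'at school iff age >= t' classifies every example correctly."""
    candidate_ts = sorted({age for _, age, _, _ in cases})
    for t in candidate_ts:
        if all((age >= t) == at_school for _, age, _, at_school in cases):
            return t
    return None  # no consistent single-threshold rule exists

print(induce_age_threshold(examples))   # 7
```

The search returns 7 because that is the smallest observed age consistent with all five cases; any threshold between 4 and 7 would fit equally well, which the data simply cannot distinguish.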

Deduction: reasoning from the general to the specific

The second type of reasoning is perhaps the sort we most readily associate with the notion of logic. For example in logic puzzles or detective novels we are expected to use our general knowledge about the world to come to specific conclusions. Thus, for example, we find Dr Watson (of Sherlock Holmes fame) on seeing pills on a bedside table at the scene of a crime, making the following deduction:

From their lightness and transparency, I should imagine that they are soluble in water. (From A Study in Scarlet, 1887)

Thinking point Soluble pills What is the rule in this example?

Dr Watson has never seen these particular pills before. He is using his general knowledge about the domain of pharmacy medicines to make an observation specific to this case. Thus we can summarise the deductive process in terms of:

Facts + Rules → Conclusions¹

Thus in our school-children example, if we knew the fact that:

Susan is 7 years of age (Fact 1)

together with the general rule that:

after a child's 5th birthday they will attend school (Rule 1)

we could reasonably deduce that 'Susan is attending school'.

Of course not all deduction is quite so simple. In many instances there may be a large number of rules that are potentially applicable to provide a conclusion. In addition there may be rules which when taken to their logical conclusion end up contradicting other sets of rules. The issues surrounding searching through sets of rules and dealing with conflicts, etc. are considered later in a discussion of types of logic as well as in rule inferencing. For the moment we move on to the third general type of reasoning – abduction – which is a little more subtle than the first two. Abduction is an important type of reasoning to be able to handle if any system is to attempt to mirror human reasoning processes to some extent.
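The Facts + Rules → Conclusions pattern can be sketched in Python for the Susan example above; the tuple encoding and predicate names are invented for illustration:

```python
# A sketch of deduction: applying a general rule to specific facts.

facts = {("age", "Susan", 7)}                    # Fact 1: Susan is 7 years of age

def school_rule(facts):
    """Rule 1: after a child's 5th birthday they will attend school."""
    return {("at_school", name)
            for kind, name, age in facts
            if kind == "age" and age > 5}

conclusions = school_rule(facts)
print(conclusions)   # {('at_school', 'Susan')}
```

Given Joe (age 3) instead, the same rule would fire no conclusion at all, which is exactly how deduction differs from the inductive step: the rule is fixed and the specifics are derived from it.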

1 Note that Watson's conclusion is to some degree probabilistic: 'I should imagine…'


Abduction: 'reasoning backwards'

This final general type of reasoning enables us to reason in the 'opposite' direction from the initial logical implication. Formally, the process of abduction allows that given A → B and B, it is possible to infer A. Strictly abduction is referred to as an 'unsound' form of inference as the conclusion may not be true for all interpretations in which the premises are true. This is perhaps best illustrated by an example.

A typical rule from an expert system designed to diagnose faults in domestic electrical appliances might be as follows:

• if TV picture is fuzzy
• and video recording is poor
• then aerial is not attached.

This looks like a normal predicate relationship (see 'Predicate calculus' on page 132) from which we can use deductive reasoning to draw sound inferences (known as modus ponens – again see next section). However, this is not the case – the rule above is simply a heuristic (a rule of thumb) which may be useful in a diagnostic situation. It is possible that the electronic circuits that control the TV monitor and the video tuning functions are both malfunctioning (it may be more likely that aerial attachment is the source of the problem but this is not a necessary – 'sound' – inference). It is interesting to note that the converse/opposite of the above rule is true (i.e. a normal predicate relation):

• if aerial is not attached
• and TV picture is fuzzy
• then video recording is poor.

Effectively what we have done in the case of the initial rule is apply abductive reasoning to come up with a new rule which, while potentially useful, may not always be correct. Obviously the whole point of a diagnostic expert system is to use 'symptoms' to attempt to reason about their cause. However, it is not the symptoms which cause the fault or disease, but rather the other way around, and thus we are constrained to using abduction with all its potential pitfalls.

There are ways to address some of the problems of 'unsound reasoning' involved in abductive reasoning. For example, it is possible to attach a certainty factor to any rule, in the form A → B (0.95). In our heuristic rule above this would be used to indicate that in 95% of the cases where the TV picture is fuzzy and the video is not recording the cause will be a badly attached aerial (thus avoiding the strict causal interpretation to our 'converse' rule that was implied by the initial abductive rule). There are also alternative representations of the knowledge – such as Bayesian belief networks – which help to address these problems (see 'Bayesian belief networks (BBNs)' on page 157).
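A sketch of how such certainty-weighted heuristics might be stored and applied follows. Only the 0.95 figure comes from the text above; the second rule, its 0.40 certainty factor, and the function names are invented for illustration:

```python
# A sketch of abductive heuristics carrying certainty factors, as in
# A -> B (0.95): symptoms suggest, but do not soundly entail, a cause.

# heuristic rules: (set of required symptoms) -> (suspected cause, certainty)
heuristics = [
    ({"tv picture fuzzy", "video recording poor"}, ("aerial not attached", 0.95)),
    ({"tv picture fuzzy"},                         ("tv tuner fault", 0.40)),
]

def abduce(symptoms):
    """Return suspected causes ranked by certainty factor, for every
    heuristic whose symptom set is fully observed."""
    matches = [(cause, cf) for required, (cause, cf) in heuristics
               if required <= symptoms]
    return sorted(matches, key=lambda m: -m[1])

print(abduce({"tv picture fuzzy", "video recording poor"}))
```

The ranking makes the 'unsoundness' explicit: the badly attached aerial is only the most likely explanation, never a logically guaranteed one.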


5.2 Knowledge representation

To this point we have taken a relatively informal view of what is meant by reasoning and introduced the three main types by reference to the way we commonly imagine human reasoning might operate. In doing so we have in fact glossed over a number of issues that are important in any formal logic processing. Unlike humans, computers are not at all good at dealing with incomplete or poorly specified statements – they have no 'common sense' with which to fill in the gaps. When we talk of computer logic and reasoning we must therefore introduce a degree of rigour if we are to achieve any results whatsoever.

In the following sections we will introduce a number of formal knowledge representation schemes used by computer-based reasoning systems. We will begin by looking at the most fundamental of all reasoning representations – the propositional and predicate calculi. This brief introduction attempts to provide sufficient detail to give the reader a feel for the subject. However, due to the excessively symbolic nature of this formal representation it is not possible to cover the entire set of definitions and inference rules. Instead guidance is given on appropriate further reading and a worked example is provided to illustrate the use of predicate calculus in a simple automated financial advisor. Rules are a simpler and more commonly used mechanism to represent knowledge in practical applications and these are introduced next. Finally an overview is given of some 'richer' representational approaches such as semantic nets, frames and scripts.

Predicate calculus

The predicate calculus is a representational schema used within AI to enable automated reasoning. It achieves this by relying on a clearly defined set of formal semantics together with sound and complete rules of inference. Strictly speaking, the discussion below relates to first-order predicate calculus (i.e. there are other and more complex types of predicate calculus). It is sometimes, incorrectly, referred to as 'predicate logic'. Although it enables logical inferences to be made it is strictly a calculus – that is, the formal definition of a representational language.

The predicate calculus is in fact based on the even more basic propositional calculus, which is in many ways a set of formalisms that describe the way that many of us would think about simple computer logic (assuming we had even done so!). As with any formal language we need to describe the set of symbols which have meaning in its 'universe of discourse'. This set of symbols is in fact very limited, consisting of:

• propositions: U, V, W, etc.
• truth symbols: TRUE and FALSE
• connectives: ∧, ∨, ¬, →, = (explained by example below)

and that is basically it! The use of brackets is also permitted – see below.

A proposition is any statement about the 'world' which is being reasoned about that may take on the value true or false. For example, 'the coffee is hot' or 'the train is late'. These propositions are represented by uppercase letters (normally from towards the end of the alphabet). Thus U is the proposition 'the coffee is hot', W denotes 'the train is late', and so forth. It is possible to use the set of symbols to create 'sentences' from the atomic elements of the calculus. ('Sentence' is used here not in the ordinary grammatical sense, but in logical terms, i.e. 'the verbal expression of a proposition etc.'.) To do this a set of rules exists which define all instances of what a valid sentence is. In the discussion of predicate calculus we will not go into this level of detail but to illustrate the definitional rigour required we will list the full set of valid sentence definitions here


(you will also note a degree of internal referencing, or recursion, which is common in such definitions).

Valid 'sentences' within the propositional calculus can be defined according to the following rules:

• every proposition and the truth symbols are in themselves sentences
  e.g. U, W, FALSE are all sentences
• the negation of a sentence is a sentence
  e.g. ¬W, ¬TRUE are sentences
• the disjunction of two sentences is a sentence (OR operator)
  e.g. U ∨ W is a sentence
• the conjunction of two sentences is a sentence (AND operator)
  e.g. ¬W ∧ W is a sentence
• the implication of one sentence to another is a sentence ('IMPLIES' operator)
  e.g. U → W is a sentence
• the equivalence of two sentences is a sentence
  e.g. U ∧ W = Z is a sentence.

The symbols can also be grouped together, which will mean that, for example, the sentence (U ∧ W) = Z is entirely different from the sentence U ∧ (W = Z). Only well-formed formulae – i.e. sentences adhering to the rules outlined – can be used in logical argument if we wish to ensure sound and complete inference.

While we would appear to have defined quite a lot in our discussion thus far, we have in fact only dealt with the syntax of the propositional calculus – i.e. the set of rules defining legal sentences. We have still to define the semantics of the language, which is what gives 'meaning' to the sentences. This is quite a tricky distinction for us to grasp at first as we already know something of the semantics of Boolean logic – i.e. we know that if U is TRUE then the negation of U (¬U) will be false. However, our rules as defined thus far would not let us make such an assertion – they simply state that if U is a sentence then ¬U is also a lawful sentence. The assignment of a truth value to propositions is called an interpretation and our definitions thus far state that this mapping must be into the set {T, F}. These symbols are used to denote the truth value assigned to a sentence and formally are distinct from the truth symbols TRUE and FALSE which are part of the 'atomic' set of legal sentences.

Just as the syntax of propositional calculus had to be explicitly defined so do its semantics. Remember that the assignment of a truth value is referred to as its interpretation and must take on the value T or F.

The interpretation of sentences using the various propositional operators can be formalised as follows:

• the truth assignment for negation, ¬U, is F if the assignment to U is T, and T if the assignment to U is F
• the truth assignment for conjunction, ∧, is T only when both conjuncts (the name used to refer to the propositions being combined using the AND operator) have the truth value T, otherwise it is F
• the truth assignment for disjunction, ∨, is F only when both disjuncts have the truth value F, otherwise it is T
• the truth assignment for implication, →, is F only when the premise before the implication is T and the truth value of the consequent after the implication is F, otherwise it is T


• the truth assignment for equivalence, =, is T only when both expressions have the same truth assignment for all possible interpretations, otherwise it is F.

Many of us already have an intuitive feel for the meaning of these Boolean operators and may think of them most readily in terms of a 'truth table' – thus for two propositions U and W connected by a disjunction (the OR operator) we would have:

U   W   (U ∨ W)
T   T   T
T   F   T
F   T   T
F   F   F

This is of course equivalent to the third definition in our list above. (Arguably the most surprising of the semantic definitions is the fourth one dealing with implication.)
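The five semantic definitions can be rendered directly as small Python functions over truth values – a sketch, with the uppercase names chosen simply to avoid clashing with Python's own operators:

```python
# The propositional semantics above as Python functions on booleans.

def NOT(u):        return not u
def AND(u, w):     return u and w
def OR(u, w):      return u or w
def IMPLIES(u, w): return (not u) or w      # F only when u is T and w is F
def EQUIV(u, w):   return u == w

# Truth table for implication -- the 'surprising' fourth definition:
for u in (True, False):
    for w in (True, False):
        print(u, w, IMPLIES(u, w))
```

Running the loop shows why implication surprises people: whenever the premise is False the implication is True, regardless of the consequent.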

Thinking point Implication Can you think of an example to illustrate why this fourth semantic definition might lead to counter-intuitive results?

There is a whole set of propositional statements which can be shown to be equivalent and these have led to a set of 'laws' which enable sentences to be transformed from one expression to another. These include the associative law, the commutative law, etc. Each of these identity relationships can be verified through the use of truth tables, as illustrated below for de Morgan's law, which states:

¬(U ∧ W) = (¬U ∨ ¬W) and also ¬(U ∨ W) = (¬U ∧ ¬W)

We can verify the first of these two equivalences using the following truth table:

U   W   (U ∧ W)   ¬(U ∧ W)   ¬U   ¬W   (¬U ∨ ¬W)   ¬(U ∧ W) = (¬U ∨ ¬W)
T   T   T         F          F    F    F           T
T   F   F         T          F    T    T           T
F   T   F         T          T    F    T           T
F   F   F         T          T    T    T           T

I am aware that although two pages back this section was labelled predicate calculus, thus far we only appear to have discussed propositional calculus. I would justify this in terms of the fact that the latter is a simpler variant of the former and as such is better suited to providing an introductory discussion of the syntax and semantics of any formal logical reasoning language. However, a major limitation of propositional logic is the fact that it only allows for reasoning involving atomic assertions – assertions that we cannot break into individual components. In predicate calculus we can create 'predicates' which describe relationships between atomic concepts. Thus rather than having to use a proposition to denote the full sentence 'the coffee is hot' we could create a predicate temperature which describes the relationship between a liquid and its state: temperature(coffee, hot). The most important consequence of this ability to define predicates is that it allows for the manipulation of variables within the language. Thus we could state that for every value that X may take on, where X is a liquid dispensed from our vending machine, the statement temperature(X, hot) is true (i.e. we only dispense hot drinks).
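The truth-table verification of de Morgan's first law above can also be checked exhaustively in a few lines of Python – a sketch, not part of the module's formal treatment:

```python
# Exhaustively checking de Morgan's first law over every interpretation.
from itertools import product

for u, w in product((True, False), repeat=2):
    lhs = not (u and w)        # corresponds to ¬(U ∧ W)
    rhs = (not u) or (not w)   # corresponds to ¬U ∨ ¬W
    assert lhs == rhs          # equivalent in every interpretation

print("de Morgan's first law holds for all interpretations")
```

Because the calculus has only two truth values, four interpretations suffice to establish the equivalence – the same exhaustive reasoning the truth table performs by hand.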


The symbols, syntax and semantics of predicate calculus must be defined in much the same way as we illustrated for propositional calculus – except of course that they are a little more complex. We do not have space here to discuss these in detail. However, as a short worked example is provided at the end of this section it is necessary to introduce a few additional elements of predicate calculus. The fact that we are now allowed to work with variables leads to the need to define two new symbols referred to as variable quantifiers. The first symbol ∀ is called the universal quantifier and indicates that a sentence is true for all values of its quantified variable. Thus:

∀Y watches(Y, television)

represents the assertion that for all values that may be taken on by Y within the domain of reasoning watches(Y, television) is true (i.e. informally, 'everyone watches television'). The second symbol ∃ is the existential quantifier, which essentially means that the sentence is true for some of the values that can be taken on by the quantified variable. Thus:

∃X is related(X, John)

implies (informally) that there are some persons in the set of people that can be represented in variable X to whom John is related.

There is not space here to describe the full set of rules defining the syntax and semantics of predicate calculus. Instead a number of, hopefully, intuitive examples of first-order predicate calculus representations of English phrases are given. Most grammatically correct English sentences can be represented in this way, and the same applies to other European languages.

• All fashion models are beautiful

  ∀W (fashion model(W) → beautiful(W))

• Some children like pizza
  ∃W (child(W) ∧ likes(W, pizza))

• No-one likes a cheat
  ¬∃W likes(W, a cheat) [literally: 'there are not some W who like a cheat']

• If it's freezing tomorrow, Craig will not go to Glen Coe
  weather(freezing, tomorrow) → ¬go(Craig, Glen Coe).
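Over a finite universe of discourse the two quantifiers behave like Python's built-ins all() and any() – a sketch with invented example data, not part of the calculus itself:

```python
# Quantifiers over a finite domain: ∀ behaves like all(), ∃ like any().

people = ["Anna", "Bob", "Carol"]
watches_television = {"Anna": True, "Bob": True, "Carol": True}
related_to_john = {"Anna": False, "Bob": True, "Carol": False}

# ∀Y watches(Y, television): true for every Y in the domain
everyone_watches = all(watches_television[y] for y in people)

# ∃X is related(X, John): true for at least one X in the domain
someone_related = any(related_to_john[x] for x in people)

print(everyone_watches, someone_related)   # True True
```

The analogy only holds for finite domains, of course; the calculus itself quantifies over domains that need not be enumerable at all.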

Thinking point A proverb Can you figure out which English proverb is being represented in the following predicate calculus statement?

(∀X, ∃Y fool(X) ∧ time(Y)) ∨ (∃X, ∀Y fool(X) ∧ time(Y))
but ¬(∀X, ∀Y fool(X) ∧ time(Y))

Exercise 5.2 Using predicate calculus Try writing some other well-known phrases or sayings – such as 'you can't make bricks without straw' – on similar lines.


In addition to the syntax and semantics of predicate calculus we must also define the rules of inference. Once again we do not need to go into the details of these rules, which must adhere to a set of logical principles such as soundness, consistency, completeness, etc. We simply illustrate the nature of these rules by giving the example that is perhaps best known (at least by name if not in terms of its meaning!), modus ponens, and its corollary, modus tollens.

• By the logical inference rule modus ponens, if we know that the sentences U and U → W are true, then we can infer W.
• If the sentence U → W is known to be true and W is known to be false, then modus tollens lets us infer ¬U.

As modus ponens (and indeed all rules of inference in first-order predicate calculus) can be used with variables this gives us the well-known example of logical reasoning involving Socrates' mortality:

All men are mortal and Socrates is a man, therefore Socrates is mortal.

Formally we would state:

∀W (man(W) → mortal(W))
and man(Socrates)

By substitution and the use of the universal operator we would then have:

man(Socrates) → mortal(Socrates)

and we can now (at last!) apply modus ponens to infer the conclusion:

mortal(Socrates) – i.e. 'Socrates is mortal'.
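The substitution-then-detachment steps just performed by hand can be sketched in Python; the tuple representation and the '?' variable convention are invented for illustration:

```python
# A sketch of modus ponens with a universally quantified rule: substitute a
# constant for the variable, then detach the conclusion.

# ∀W (man(W) → mortal(W)), stored as a (premise, conclusion) template
rule = (("man", "?W"), ("mortal", "?W"))

def substitute(template, binding):
    """Replace variables (terms starting with '?') using the given binding."""
    return tuple(binding.get(term, term) for term in template)

def modus_ponens(rule, fact):
    """If the fact matches the rule's premise, infer the conclusion."""
    premise, conclusion = rule
    if premise[0] == fact[0]:                   # same predicate name
        binding = {premise[1]: fact[1]}         # e.g. {'?W': 'Socrates'}
        return substitute(conclusion, binding)
    return None                                 # premise not established

print(modus_ponens(rule, ("man", "Socrates")))   # ('mortal', 'Socrates')
```

Real inference engines generalise the one-variable binding step into full unification, but the shape of the argument – bind, then detach – is exactly the Socrates derivation above.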

We have probably now seen more than enough of predicate (and propositional) calculus to last us for some time. The important thing I hope the reader will take away is that creating formal reasoning mechanisms is not a trivial task. It is why many of the activities we would like to undertake within artificial intelligence prove to be intractable. It is also why we should be wary of companies or products which make certain claims as to the 'intelligent' features they exemplify.

As a footnote it is worth mentioning that attempts to implement practical systems using the approaches of predicate calculus are few and far between. When this has been attempted the most likely tool would be the declarative language PROLOG, which makes the definition of assertions and relationships as straightforward as possible. Thus while predicate calculus is largely a scheme for discussing (some would say endlessly debating the trivia of) logical inference, many of its key principles will also be seen in the more 'practical' knowledge representation schemes we discuss later in this section such as rules, frames and semantic networks.

We end this section with an example that illustrates how the knowledge required for a simple financial advisor program might be represented in predicate calculus. (I fully accept that the description given above has skipped over certain aspects of the syntax, etc. of predicate calculus but hopefully enough detail has been presented to ensure that the example makes some sense.) The example is based, with full acknowledgement, on Section 2.4 of Luger and Stubblefield (1998). I have altered their example to reflect the UK context and have also simplified it slightly to facilitate a briefer discussion. (In addition to providing a worked example of the application of predicate calculus I have also used the same example in our later discussion of rule-based systems, which allows for some comparison to be made.)


Example of a Simple Financial Advisor
(knowledge represented in the form of predicate calculus, with acknowledgement to Luger and Stubblefield, 1998)

The basic objective of this simple financial advisor is to assess the financial situation of an individual and advise them as to whether they should be putting spare cash into a savings account or investing in shares. (Note that this example does not attempt to model the up-to-date conditions of the financial investment world.)

There are basically only three outcomes (conclusions) that the advisor can come up with:

(i) if you have inadequate savings then you will always be advised to increase your savings, regardless of other factors (this is a sensible initial priority)
(ii) if you have adequate savings and also an adequate income then you will be advised to think about the riskier (but potentially more rewarding) option of investing in shares
(iii) if you are on a lower level of income but already have adequate savings you will be advised to split any surplus income between savings and shares (the 'combination' option) to increase your savings 'buffer' while also trying to increase your income through the stock market.

These conclusions are represented using a unary predicate (i.e. one that has only one argument) investment which can take on one of these three outcomes: savings, shares or combination. The main conditions which will determine the conclusion can also be represented by unary predicates savings_account and income_level which can both have one of two values: adequate and inadequate. Thus we can specify four possible 'sentences':

• savings_account(adequate)
• savings_account(inadequate)
• income_level(adequate)
• income_level(inadequate)

Taking these together with the three basic advisor rules stated above we can represent the main investment implications as follows:

• savings_account(inadequate) → investment(savings)
• savings_account(adequate) ∧ income_level(adequate) → investment(shares)
• savings_account(adequate) ∧ income_level(inadequate) → investment(combination)

Now the rules which allow us to assert whether savings and income are adequate or not must be specified. These are both linked to the number of dependents that the individual being assessed has to support. An adequate level of savings is deemed to be at least £4,000 in the account for each dependent. An adequate level of income is specified as being a base level of £12,000 per annum plus an additional £3,000 per dependent.

To enable us to codify the assertions first for savings we define a function min_savings which returns 4,000 times its argument – the number of dependents.

i.e. min_savings(U) = 4000 ! U

This can be used in the predicate sentence implications to determine the adequacy ofthe savings account:

) X amount_saved(X) # * Y (dependents(Y) # greater(X, min_savings(Y)) % savings_account(adequate)) X amount_saved(X) # * Y (dependents(Y) # ¬greater(X, min_savings(Y)) % savings_account(inadequate)


where the additional definitions amount_saved(U) and dependents(U) assert the current amount in an individual's savings account and the number of dependents he or she has respectively (the function greater represents the normal arithmetic comparator). In a similar fashion we can create definitions and implications to allow us to make valid assertions about the individual's level of income. The predicate earnings represents the amount earned and a function min_income is given as its argument the number of dependents:

i.e. min_income(U) = 12000 + (3000 × U)

allowing us to determine the adequacy of income by:

∀X earnings(X) ∧ ∃Y (dependents(Y) ∧ greater(X, min_income(Y))) → income_level(adequate)
∀X earnings(X) ∧ ∃Y (dependents(Y) ∧ ¬greater(X, min_income(Y))) → income_level(inadequate).

We have now fully defined our set of predicates representing the knowledge of the simple financial advisor. To infer the outcome for a specific individual we would have to specify values for the 'input' predicates amount_saved, earnings and dependents. For an individual with £15,000 in her/his savings account, an annual income of £19,000 and three dependents we can now specify the full set of sentences that define the situation:
1 amount_saved(15000)
2 earnings(19000)
3 dependents(3)
4 savings_account(inadequate) → investment(savings)
5 savings_account(adequate) ∧ income_level(adequate) → investment(shares)
6 savings_account(adequate) ∧ income_level(inadequate) → investment(combination)
7 ∀X amount_saved(X) ∧ ∃Y (dependents(Y) ∧ greater(X, min_savings(Y))) → savings_account(adequate)
8 ∀X amount_saved(X) ∧ ∃Y (dependents(Y) ∧ ¬greater(X, min_savings(Y))) → savings_account(inadequate)
9 ∀X earnings(X) ∧ ∃Y (dependents(Y) ∧ greater(X, min_income(Y))) → income_level(adequate)
10 ∀X earnings(X) ∧ ∃Y (dependents(Y) ∧ ¬greater(X, min_income(Y))) → income_level(inadequate).

We can take assertions 1 and 3, which after substitution in assertion 7 and the use of min_savings(3), which gives the value 12,000, will lead to:

amount_saved(15000) ∧ dependents(3) ∧ greater(15000, 12000) → savings_account(adequate)

This leads to the new assertion (11):
11 savings_account(adequate)

In a similar way we can use assertions 2, 3 and 10 to give:

earnings(19000) ∧ dependents(3) ∧ ¬greater(19000, 21000) → income_level(inadequate)

and thus an additional assertion (12):
12 income_level(inadequate)

Finally we can use assertions 11 and 12 as the premises of implication 6 and by modus ponens come to the conclusion investment(combination).
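The chain of inference just traced is mechanical enough to mirror in ordinary code. The sketch below is a Python illustration only (the function and variable names are invented for the example, not part of the module materials): it hard-codes the two adequacy functions and the three investment rules, whereas a real predicate-calculus system would discover the same conclusion by unification and search over the ten sentences.

```python
# A minimal sketch of the simple financial advisor, assuming the
# thresholds given above: savings of £4,000 per dependent and income
# of £12,000 plus £3,000 per dependent. Names are illustrative.

def min_savings(dependents: int) -> int:
    return 4000 * dependents

def min_income(dependents: int) -> int:
    return 12000 + 3000 * dependents

def advise(amount_saved: int, earnings: int, dependents: int) -> str:
    savings_adequate = amount_saved > min_savings(dependents)
    income_adequate = earnings > min_income(dependents)
    if not savings_adequate:
        return "savings"        # mirrors implication 4
    if income_adequate:
        return "shares"         # mirrors implication 5
    return "combination"        # mirrors implication 6

# The worked example: £15,000 saved, £19,000 earned, 3 dependents.
print(advise(15000, 19000, 3))  # combination
```

Note that the control flow is fixed by the programmer here; the point of the declarative formulation above is precisely that the order of reasoning is left to the inference mechanism.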


Classification of alternative knowledge representation approaches

Many different approaches to representing knowledge have been discussed and implemented by researchers in the field of AI over the past 50 years. We have given a detailed description of predicate calculus as this is the most fundamental of the logical approaches, with many variants and extensions. In addition, it differs sufficiently from the rule-based approaches more commonly referenced in introductory material on reasoning in BI to warrant discussion, because it raises a number of different issues in terms of knowledge acquisition and delivery.

The approach of predicate calculus is one of a number (and arguably the 'parent') of logical representation schemes. It can also be seen as belonging to the broader classification of declarative representations. Declarative approaches attempt simply to state ('declare') facts about the world. They can be contrasted with procedural representation approaches, which incorporate information about how to carry out the process of solving the knowledge problem within a domain. Take a simple illustration from everyday life. Assume I call a taxi to my office in Glasgow. On entering the taxi, I might instruct the driver as follows: 'drive ahead for 100m and at the T-junction turn right. Go over the brow of the hill and at the traffic lights turn left (always assuming they have gone to green – we shall assume some "operational" knowledge on the part of our driver!). Drive straight for about half a mile, through three sets of traffic lights, and turn right at the fourth set. In 200m stop on the right-hand side of the road and let me out.' On the other hand, it is much more likely that I would get into the taxi and say, 'take me to the GFT cinema'!

Sorry to drag the first part of our illustration out for so long, but that is the nature of the procedural approach – each set of actions has to be specified in careful detail. Of course this is not to say that the declarative approach is without its problems. If, for example, I had jumped into the taxi and said, 'take me to see Gandhi', the result may have been less predictable. In the first place I would need to assume that my taxi driver was aware that there was a film of this title (rather than assuming that it was the statue of Gandhi in London's Tavistock Square that I wanted to see!). Secondly, he or she would have to be aware of the existence of the GFT cinema, which shows 'low demand' older films, and also of the fact that in that particular month there was a retrospective of Ben Kingsley's major film roles (and of course the fact that Kingsley starred in the film 'Gandhi'). Assuming the driver was aware of the second fact but not the first, there is nothing in the declarative request which would stop her from heading towards the M77 and a journey to London, where she happens to know the film is being shown. On the other hand, ignorance of the second fact may find us on the way to Glasgow airport on the assumption that I am planning to take a trip to India. The point is that to utilise declarative approaches we require much more contextual knowledge to make sense of the statements/requests being made. If the reasoning system does not have access to this level of detailed contextual knowledge – as in the case of our taxi driver and her inability to respond appropriately to the term 'Gandhi' (referred to in the literature as 'semantic disambiguation') – then unpredictable results may well ensue.

Mylopoulos and Levesque (1984) proposed a simple four-category classification for types of knowledge representation schema, which is still largely valid. Their first three categories are all examples of the declarative approach.

1 Logical representation schemes: this is the class of schemes which use formal logic and inference rules to represent and reason with knowledge. The best-known example of this approach is first-order predicate calculus, though there are many other logics. As noted above, the most commonly used programming language in which these various logics are implemented is PROLOG.


2 Network representation schemes: these representations attempt to capture knowledge in the form of a graph where the nodes are used to represent concepts (most often objects) while the arcs between these nodes represent the relationships or associations between these objects. There are a number of examples of the network representation approach, the most well known of which is semantic networks (described below), but others include conceptual graphs (Sowa, 1984) and conceptual dependency theory (Schank and Rieger, 1974). While not normally included in this category (perhaps because they were not in common use in 1984 when these classifications were first proposed), there is a case to be made that both Bayesian belief networks and causal networks belong to this class (see Section 5.3).

3 Structured representation schemes: the approaches used within this class extend the representations used within the network schemes by allowing the nodes within a network to take on more complex structures. While nodes in a network schema are simple (or atomic) objects, the nodes in the structured approach will typically consist of data structures consisting of a set of named slots to which slot-values ('fillers') may be attached. The most widely used schemes to utilise this approach are frames and scripts (see below).

4 Procedural representation schemes: the schemes classified as belonging to this approach differ from the first three in that, as the name implies, they use a procedural approach in the representation of knowledge. This means that there is information stored within the knowledge itself which can be used to drive the procedure of solving the problem. Sets of if-then rules (sometimes referred to collectively as a 'production system') are an example of such an approach, as the order in which the rules are 'fired' is defined within the rules and meta-rules themselves (see below for a more detailed explanation). The well-known AI programming language LISP is also an example of a procedural approach. However, LISP is referred to as a functional language, denoting the fact that its syntax is derived from the mathematical schema of recursive functions. Somewhat confusingly, LISP can also be used as an implementation basis for a number of the declarative approaches introduced above, such as semantic nets and frames. As was the case with PROLOG we do not have space to properly introduce LISP here, but a number of useful pointers are given in the references and on the module website.

You may be wondering why this list of approaches contains no mention of artificial neural networks (ANNs). This is because there is a sense in which ANNs do not use any knowledge representation approach at all (at least not an explicit one). All of the techniques listed above belong to the class of symbolic reasoning approaches. That is to say, they look at what the human brain appears to be doing while it is engaged in the reasoning process and attempt to symbolise this knowledge and the associated rules of inference. The neural network approach has an entirely different focus. Work in this area was driven by developments in domains such as neurology, which was interested in understanding how the brain was functioning – i.e. at its 'lowest' level. As we shall see in greater detail in Section 5.5, ANNs were designed to mirror the neurons and synapses of the brain, with no attempt being made to capture any external symbolic knowledge about a domain. They are therefore often referred to as taking a non-symbolic (or perhaps more accurately, 'sub-symbolic') approach to representing knowledge.

We will now give a little detail on some of the main approaches to symbolic knowledge representation which have been used for practical applications within AI over the past 20 years. (This is not to imply that no practical applications of the more formal logical representations exist, but they tend to be in specialist domains and typically only involve end users and knowledge capture in an indirect manner.)


Rules

Perhaps the most common and certainly the simplest form in which to reason with deductive logic is to use if-then rules. These are also sometimes referred to as condition-action pairs: basically, if the premises of the rule (the 'condition'), expressed as an 'if' statement, are true then the 'action' follows. This 'action' is normally that the conclusion is also deemed to be true. In this way most of the predicate sentences in our simple financial advisor could be expressed in the form of rules, for example:

IF cash in savings account > £20,000
THEN savings are adequate

or

IF savings are adequate
AND income is significant
THEN consider investing in stock market

The 'IF' statement may incorporate any number of AND or OR connectives to create more complex conditions. Obviously in our example above there would have to be a rule which had certain premises and the conclusion that 'income is significant'. The rule-based approach is deceptively simple, but decisions as to how to search the rule space, including directions of inference, meta-rules and so forth, can lead to unexpected complexity. It is also possible to introduce some level of uncertainty into rules by asserting the conclusion with less than absolute certainty, for example:

IF annual income > £50,000
THEN income is significant (0.95)

(which would be interpreted as meaning that we are '95% sure' that this level of annual income indicates a 'significant income'). Of course exactly what 95% sure might mean, and how we would use this information in further inference processing, is not clear (see certainty factors in Section 5.3 for further discussion). More details on rule-based expert systems are provided in Section 5.4.
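To illustrate how such rules might be processed, here is a minimal forward-chaining rule interpreter in Python. It is a sketch under stated assumptions: the fact names are invented for the example, and the certainty-combination rule used (multiply the rule's factor by the smallest premise certainty) is one common MYCIN-style convention, not the only possibility.

```python
# A minimal forward-chaining production system with simple certainty
# factors (an illustrative sketch, not a production implementation).
# Each rule pairs a set of premise facts with a conclusion and a CF;
# a derived fact's CF is the rule CF times the minimum premise CF.

RULES = [
    ({"annual_income_over_50000"}, "income_significant", 0.95),
    ({"savings_adequate", "income_significant"}, "consider_stock_market", 1.0),
]

def forward_chain(facts: dict) -> dict:
    """facts maps fact name -> certainty in [0, 1]; returns the closure."""
    facts = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion, cf in RULES:
            # fire a rule once all its premises are known
            if premises <= facts.keys() and conclusion not in facts:
                facts[conclusion] = cf * min(facts[p] for p in premises)
                changed = True
    return facts

result = forward_chain({"annual_income_over_50000": 1.0,
                        "savings_adequate": 1.0})
print(round(result["consider_stock_market"], 2))  # 0.95
```

Even this toy shows where the 'unexpected complexity' mentioned above comes from: the loop repeatedly rescans all rules, and a real system would need a conflict-resolution strategy to decide which eligible rule fires first.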

Semantic nets, frames and scripts

As noted above these are all examples of declarative representations which use simple or structured network schema. The nodes in these networks represent concepts, often objects or facts – in the case of semantic nets these concepts are simple 'atomic' entities, while for frames and scripts they may be more complex structures. The arcs within the networks represent relationships between two or more nodes. These relationships may be of a variety of types, but two of the most widely used are the 'is_a' and 'has_a' arcs, which are used to represent instance/class and part/whole relationships respectively. The knowledge contained, and the inferences that can be made, are totally dependent on the network's interpreter. This being the case, the network is only as useful as its semantics are well defined. The network is a graphical construction which can be relatively easily drawn and understood. However, this can lead to a false sense of confidence in its utility if the interpreter cannot reason with such structures automatically. A number of useful properties, now more familiar to programmers due to the practices of object-orientation and found in modern programming environments, are associated with networked representations. These include inheritance, encapsulation and polymorphism. Semantic networks of the type shown in Figure 5.1 overleaf are well known as they are loosely based on work first reported in 1969 by Collins and Quillian. The fact that such graphs have been reproduced in many texts on knowledge representation ever since is testament to their usefulness in illustrating a number of important points.


The majority of nodes in this semantic net are classes of objects, the exception being the node 'Joey', which is an actual object (or more correctly an instance of a penguin). An important property of semantic networks is that of inheritance. Information about objects/concepts is stored at the highest level of abstraction possible. This means that it is not necessary to store all knowledge explicitly but rather to deduce certain facts as they are required. Thus we do not need to store the information that Joey has wings or the fact that he can breathe, as these are inherited from his membership of the Bird and Animal classes respectively. The principle of inheritance leads to 'cognitive economy' and also helps to prevent update inconsistencies. In addition to these practical benefits it would appear to mirror the way in which we humans store such knowledge. Researchers, including Collins and Quillian, developed 'associationist' theories of human knowledge in which they showed, for example, that humans took less time to answer the question 'can a penguin fly?' than the question 'can a penguin breathe?' – seemingly implying that non-inherited knowledge is more readily available in human memory than inherited inferences.
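The inheritance lookup described above can be sketched in a few lines of Python. The data structure and function below are illustrative inventions (they are not from the module): each node stores only its own properties, and a query walks up the 'is_a' chain, with the nearest assertion taking precedence.

```python
# A sketch of inheritance in a semantic network like Figure 5.1
# (the dictionary layout and node names are illustrative only).

NET = {
    "animal":  {"is_a": None,      "can": {"breathe"}, "cannot": set()},
    "bird":    {"is_a": "animal",  "can": {"fly"},     "cannot": set()},
    "fish":    {"is_a": "animal",  "can": set(),       "cannot": set()},
    "canary":  {"is_a": "bird",    "can": {"sing"},    "cannot": set()},
    "penguin": {"is_a": "bird",    "can": set(),       "cannot": {"fly"}},
    "joey":    {"is_a": "penguin", "can": set(),       "cannot": set()},
}

def can(node: str, action: str) -> bool:
    """Walk up the is_a hierarchy; the nearest assertion wins."""
    while node is not None:
        if action in NET[node]["cannot"]:
            return False
        if action in NET[node]["can"]:
            return True
        node = NET[node]["is_a"]
    return False   # the property is asserted nowhere on the chain

print(can("joey", "fly"))      # False: 'cannot fly' is found at penguin
print(can("joey", "breathe"))  # True: inherited all the way from animal
```

Notice the 'cognitive economy': nothing about breathing is stored at joey, penguin or bird, yet the query still succeeds.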

Thinking point: Some questions easier than others
Can you think of any other explanation for why it appears to be easier to answer the question 'can a penguin fly?' than 'can a penguin breathe?'

(It should be noted that this example reflects the simplest form of hierarchical inheritance, where each node has only one parent. More complex hierarchical structures exist, such as the 'lattice', where objects may belong to more than one class, and in this case more than one parent may lead to 'multiple inheritance', including the possibility of inheriting contradictory or inconsistent properties. There are mechanisms for dealing with such cases, for example 'nearest parent where property is noted takes precedence', but we do not have space to discuss them fully here.)

The relationships within our 'animals' example are relatively simple and in many ways these are the most useful types of networks, as they produce few problems with respect to interpretation. In addition it is relatively easy to enter new sub-classes within the hierarchical structure due to its regularity. For example, it would be easy to create a sub-class of Birds called Song_Birds. This sub-class would have the relationship 'can: sing' and thus the various examples of song-birds – canaries, robins, etc. – would not be required to store this knowledge but could inherit it from their parent class. Some semantic networks incorporate more complex and varied types of relationships, as the naming conventions on the arcs shown in Figure 5.2 illustrate. This semantic network is just a small section of the knowledge represented within the Unified Medical Language System (UMLS) developed by the National Library of Medicine in the USA.

Figure 5.1 A simple semantic network, based on an original example from Collins and Quillian, reported in Harmon and King (1985). [The figure shows the instance JOEY linked by an 'is a' arc to PENGUIN, which is linked in turn to BIRD and then ANIMAL; FISH and CANARY are further 'is a' children of ANIMAL and BIRD respectively. Property arcs record that ANIMAL has skin and can breathe, BIRD has wings and can fly, CANARY can sing, and PENGUIN cannot fly.]


The problem with this type of representation lies in building an interpreter to 'understand' the various types of relationships indicated on the arcs. Also, once such an interpreter is created, how generalisable will it be? This is one of the reasons that simple relationships such as 'is_a', 'has_a' and 'can' were used in early networks. However, the need to represent more complex relationships led to a search for 'deep structures' within language. Fillmore (1968) and Simmons (1973) focused on the 'case structure' of English verbs and then linked in the roles played by noun phrases to create case relationships with the verb as central node. Case relationships provide a representation for agents, instruments, objects, time and location as they relate to the main verb. (This whole approach is in fact very similar to that of 'faceted classification', first proposed in information science by Ranganathan as early as the mid 1930s (Chan, 1994).)

A simple example of a case relationship used in car maintenance is shown in Figure 5.3. Note that the verb 'tighten' forms the central node of the case structure, but that it is not really clear whether the sentence should be seen as referring to the future or to the present. Even more critically, it may be that the object of the sentence has been misunderstood or misrepresented. While the effect of carrying out the overall procedure may indeed 'tighten up the gasket'

Figure 5.2 Part of the semantic network available within the UMLS, illustrating the use of various types of relations (http://www.nlm.nih.gov/research/umls/META3.HTML)

Figure 5.3 A case relationship representation of the sentence 'the mechanic should tighten up the gasket using a spanner'. [The verb 'tighten' is the central node, with case arcs agent: mechanic, object: gasket, instrument: spanner and time: future.]


it is more likely that the objects which must actually be manipulated by the mechanic with the spanner are nuts attached to the bolts on which the gasket is placed. This merely illustrates the difficulty of getting such representations right and highlights the importance of domain knowledge (something it is notoriously difficult to provide to automated reasoning systems).

The interest in deep semantic structures led to much more complex attempts to formalise this type of human reasoning. Perhaps the best known is Roger Schank's theory of conceptual dependency, which begins with a simple model (actions and objects, and modifiers on these two basic primitives). However, the actions are then sub-divided into a dozen or more basic types of action, for example:
• PTRANS – transfer the physical location of an object (e.g. send)
• MTRANS – transfer mental information (e.g. tell)
• INGEST – ingest an object by an animal (e.g. eat), etc.

We do not have space here to discuss these approaches in detail. (An example of the extension of Schank's approach as applied to scripts is given in the section below – see Figure 5.5.) Because they are mostly used for qualitative reasoning they have not as yet seen significant use within the area of business intelligence. However, they are important within the growing area of text mining (see 'The BI market and future trends'). In addition, any long-term attempts to add 'common sense' understanding to improve BI user interface design will require approaches of this type to be adopted.

In completing our discussion of network representations it is worth making a few brief remarks on frames and scripts. As noted earlier, the main difference between semantic nets and frames is that the nodes themselves will typically have more complex structures than the atomic nodes seen in semantic nets. Thus a more structured version of the semantic net shown in Figure 5.1 above might be represented using frames, as shown in Figure 5.4. Note that in the absence of specific information ('slot fillers') the default value of the super-class is assumed to be inherited. Thus it is assumed that Fred can sing, as this is true for all canaries. However, Fred cannot fly (a property he would normally have inherited from all birds via the sub-class canary) – presumably because he has only one wing. We do not need to explicitly specify that Joey cannot fly as this is assumed to apply to all penguins and, as this sub-class is 'closer' than bird, it over-rules the general flying property.

Figure 5.4 Frame representation of the domain introduced in Figure 5.1. [The frames and their slots are:
Bird – is a: animal; can fly: Yes; has wings: 2; owned by: (person) [default: NULL]
Robin – is a: bird
Canary – is a: bird; can sing: Yes
Penguin – is a: bird; can fly: No
Joey – instance: penguin
Fred – instance: canary; can fly: No; has wings: 1; owned by: Mary
Mary – instance: person; age: 13; gender: female; lives at: .....]


The slots within each frame are in many ways similar to the attributes in a traditional record structure. In addition to taking on explicit values, these attributes may also have default values (or use an 'inheritance flag') and may be associated with procedural attachments. These are pieces of self-standing code (also referred to as 'daemons') which are fired based on certain conditions associated with the slot being met: e.g. 'if-added', 'if-needed', etc. In addition to the 'cognitive economy' associated with the hierarchical inheritance of knowledge, these procedural attachments can aid 'economy of processing' as they are only activated when conditions dictate that they are required. Luger and Stubblefield summarise the strengths and weaknesses of frames as follows:

Frames add to the power of semantic nets by allowing complex objects to be represented as a single frame, rather than as a large network structure. This also provides a natural way to represent stereotypic entities, classes, inheritance, and default values. Although frames, like logical and network representations, are a powerful tool, many of the problems of acquiring and organising a complicated knowledge base must still be solved by the programmer's skill and intuition. (Luger and Stubblefield, 1998, p324)
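The slot, default and daemon machinery described above can be sketched in a few lines of Python. This is an illustrative toy, not a standard frame library: the class design and the daemon signature are assumptions made for the example.

```python
# A sketch of a frame system with slot inheritance and a simple
# 'if-added' procedural attachment (daemon). Names are illustrative.

class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, dict(slots)
        self.if_added = {}          # slot name -> daemon function

    def get(self, slot):
        """Return the local slot value, else inherit from the nearest ancestor."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        return None

    def set(self, slot, value):
        self.slots[slot] = value
        if slot in self.if_added:   # fire the daemon only when triggered
            self.if_added[slot](self, value)

bird = Frame("bird", can_fly=True, wings=2)
canary = Frame("canary", parent=bird, can_sing=True)
fred = Frame("fred", parent=canary, wings=1, can_fly=False)

# Daemon: log whenever an owner is recorded for Fred.
log = []
fred.if_added["owned_by"] = lambda f, v: log.append(f"{f.name} now owned by {v}")
fred.set("owned_by", "Mary")

print(fred.get("can_sing"))  # True, inherited from canary
print(fred.get("can_fly"))   # False, the local value overrides bird's default
print(log)                   # ['fred now owned by Mary']
```

The daemon illustrates 'economy of processing': the logging code costs nothing until the owned_by slot is actually filled.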

In many ways scripts are simply an extension of frames, their distinguishing feature being that they attempt to capture and represent contextual knowledge about certain types of typical situations. Thus when we buy a train ticket, eat at a restaurant, attend a concert, etc. there are certain sets of actions and sequences of events that we might typically expect to take place. Scripts were first developed by Schank and his team and are in many ways a natural extension of conceptual dependency theory (the extension being the addition of contextual knowledge). Without doubt the best known example of a script is Schank and Abelson's 'restaurant script' (1977). The label 'script' was chosen to indicate that this representation has much in common with a theatre or film script – i.e. it involves actors (or 'roles'), 'props', 'scenes', etc. Schank and Abelson argued that all (or at least the vast majority of) behaviours that will be seen in terms of people making a visit to a restaurant can be expressed as variants of the script shown in Figure 5.5 overleaf. (In this particular case the specific 'track' of the restaurant script is a 'coffee shop' – in the case of a formal dining experience the script may be extended to include additional actors, say a wine waiter, and more elaborate actions, such as 'wait to be seated' or 'enter the cocktail lounge', etc.)

Because of the additional contextual knowledge represented in scripts, the programs built to utilise them were quite impressive and could often appear to make 'intelligent' assumptions about the situation. For example, Schank's program might be given the following story:

Bill went to the coffee shop at lunchtime today. Bill asked the waiter for a menu. He ordered a burger and chips. On leaving he noticed he was running out of cash.

From these facts and the restaurant script we might expect to get sensible answers to questions such as: 'Did Bill eat lunch today?' (this is not explicitly stated but could reasonably be inferred); 'Who gave the menu to Bill?'; 'Did Bill pay by credit card or cash?'; 'Who ordered the burger and chips?'. All of these are entirely obvious to us using human 'commonsense' but are normally very tricky for a computer to get right. For example, the last question implies that the program knows who the 'he' was in the third sentence. In fact the last person referred to in the previous sentence is the waiter – who may well be a 'he'. Such unpicking of ambiguous pronoun references (anaphoric reference) is just one of the many tasks the program must engage in.
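The question-answering idea can be caricatured very simply: the script supplies defaults for anything the story leaves unsaid. The sketch below is a deliberate simplification of Schank's representation (the dictionary-based 'script', the fact names and the fallback rule are all invented for this illustration), but it shows why a program with a script appears to 'know' things the story never states.

```python
# A toy illustration of script-based default reasoning: answers come
# from the story where stated, and from the script's typical-case
# defaults otherwise. All names here are illustrative assumptions.

SCRIPT_DEFAULTS = {
    "gives_menu_to_customer": "waiter",   # scene 2: W ATRANS menu to S
    "customer_eats": True,                # scene 3 on the normal path
    "payment_method": "cash",             # scene 4: S ATRANS money to M
}

story_facts = {
    "customer": "Bill",
    "ordered": "burger and chips",
    # the story never says who brought the menu, or whether Bill ate
}

def answer(question: str):
    """Prefer what the story states; otherwise fall back on the script."""
    return story_facts.get(question, SCRIPT_DEFAULTS.get(question))

print(answer("customer_eats"))            # True (a script default)
print(answer("gives_menu_to_customer"))   # waiter (a script default)
print(answer("ordered"))                  # burger and chips (from the story)
```

What this toy cannot do, of course, is the anaphora resolution discussed above: deciding that 'he' in the third sentence is Bill and not the waiter requires reasoning the dictionary lookup does not capture.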


Many game-playing packages use these types of approaches to add 'AI' to the interaction with players. As with semantic networks, the practical application of such approaches within BI has to date been limited, but it is possible to envisage circumstances where such contextual knowledge could be extremely valuable.

Mixed representations/architectures

While we have introduced each of the knowledge representation schemes as a separate approach, there are of course possibilities that a variety of schemes can be used in conjunction with one another. Many knowledge-based programming environments provide such 'hybrid' approaches, typically allowing for both rule-based and frame-type (object-oriented) representations. There are also 'blackboard architectures' which, though primarily designed as search space control mechanisms, also enable knowledge taken from different underlying representations to be brought together for the solution of some overall problem.

Figure 5.5 Part of the Restaurant script (after Schank and Abelson, 1977). [Reconstructed layout:]

Script: Restaurant
Track: Coffee Shop
Props: Tables, Menu, Bill, Money, F = Food
Roles: S = Customer, W = Waiter, C = Chef, M = Cashier, O = Owner

Entry conditions: S is hungry; S has Money
Exit conditions: S is not hungry; S has less Money; O has more Money; S is pleased
(no pay path: S is hungry; S is not pleased)

Scene 1: Entering
S PTRANS S into restaurant
S ATTEND eyes to tables
S MBUILD place to sit
S PTRANS S to table
S MOVE S to sitting position

Scene 2: Ordering
(Menu on table) OR (W ATRANS menu to S)*
S MBUILD choice of F
S MTRANS signal to W
W PTRANS W to table
S MTRANS 'I would like F' to W
W PTRANS W to C
W ATRANS F to C
then either:
  C MTRANS 'no F available' to W
  W PTRANS W to S
  W MTRANS 'no F available' to S
  (go back to *) OR (go to Scene 4 at 'no pay path')
or:
  C do (Prepare_Food Script), then to Scene 3

Scene 3: Eating
C ATRANS F to W
W ATRANS F to S
S INGEST F
Options: (a) return to Scene 2 to order more; (b) go to Scene 4

Scene 4: Exiting
W PTRANS W to S
W ATRANS bill to S
S PTRANS S to M
S ATRANS money to M
(no pay path: S PTRANS S out of restaurant)


Knowledge or reasoning – which is more important?

We end this section describing the range of mechanisms for representing knowledge with a quote from Edward Feigenbaum, one of the 'wise old men' of artificial intelligence:

The first principle of knowledge engineering is that the problem-solving power exhibited by an intelligent agent's performance is primarily the consequence of its knowledge base, and only secondarily a consequence of the inference method employed. Expert systems must be knowledge-rich even if they are methods-poor. This is an important result and one that has only recently become well understood in AI. For a long time AI has focused its attentions almost exclusively on the development of clever inference methods; almost any inference method will do. The power resides in the knowledge.2

In the context of business intelligence this is an important point. While the debates and developments within AI surrounding knowledge representation may ultimately bring benefits to BI, in the meantime the most important issue is to discover business rules, associations, regressions (and other types of knowledge) and represent these using some formal representation – indeed any formal representation. The point is that the hard part of the process is uncovering the knowledge and representing the context in which it can be used. If this is done effectively then the necessary inferences will fall out as a by-product of the process. In practice the most likely form in which BI knowledge will be coded is as part of a production rule system (association rules are already in this format; decision trees are easily transformed into rule sets; as are, with a little more effort, sets of linear equations). In view of this, rule-based systems are discussed in more detail in Section 5.4. However, before focusing on rules per se we turn to an issue of interest in the context of business knowledge – which is seldom 100% confirmed – the issue of uncertainty.

2 This quote was found at: http://www.isat.jmu.edu/common/coursedocs/isat640-sp03/ which in fact has a number of interesting resources and lectures that are relevant to our current discussion.


5.3 Uncertainty

The idea of strictly logical reasoning which produces correct assertions and conclusions is appealing. However, in the 'real' world such types of knowledge are rare and the more usual circumstance is to find uncertain evidence that must be handled using unsound reasoning to come up with 'likely' conclusions. It is here that much of the theoretical potential of formal logics comes unstuck and a variety of practical compromises must be made if we are to be able to perform reasoning. As Albert Einstein noted:

So far as the laws of mathematics refer to reality they are not certain. And so far as they are certain they do not refer to reality.

A number of approaches to incorporating and dealing with the reality of imperfect and incomplete knowledge have been developed and these are outlined below.

Nonmonotonic logics

One of the key characteristics of basic mathematical logic (e.g. predicate calculus) is that it is monotonic. This is to say that it begins with a set of true statements (or axioms) and as new knowledge is provided to the inference mechanism we may add to the ‘facts’ which we know to be true about the world. Adding knowledge can never result in a decrease in the set of true statements about the world. This monotonic characteristic of logic systems does not match human reasoning well, as we often modify our beliefs (including adopting the exact opposite stance from that initially held) in the light of new evidence.

A number of logics have been developed which allow for nonmonotonic reasoning. This means that certain beliefs (or ‘facts’) are assumed to be true at present (i.e. in the light of the most reasonable set of assumptions based on our knowledge to date) but that this may change in the future if these assumptions are found to be incorrect. Should an initially held belief change, this will force the inference process to check all conclusions which may be based on, or derived from, this belief – not an easy task given a complex set of logical assertions. We do not have space here to describe these approaches, sometimes referred to as ‘circumscriptive logics’, which can be said to ‘reason over minimum models’. The most commonly used approach to ensure the integrity of logical inference in such nonmonotonic systems is the truth maintenance system (TMS). The interested reader is directed to Jon Doyle’s work on justification-based truth maintenance systems (JTMS) for one of the earliest descriptions of such an approach (Doyle, 1979).

Fuzzy sets

Perhaps it is their unusual name, but when I have asked undergraduate students which AI techniques they have heard of they often respond with ‘fuzzy logic’ (more correctly referred to as ‘fuzzy set theory’). This is a relatively recent approach popularised in the 1980s by Lotfi Zadeh (see for example Zadeh, 1983). One of the interesting things about being involved in this area of research is that there are still so many active issues – it is not uncommon to find conversations between Zadeh and others on the various discussion lists that exist to further research in the area, most notably the UAI (Uncertainty in AI) list – http://www.auai.org/.

Traditional logic is based on two key and linked assumptions. Firstly, given any entity and a set defined for some universe, the entity is either a member of that set or of its complement. The related second rule is that the entity cannot belong both to the set and to its complement, the so-called ‘law of the excluded middle’. In Zadeh’s fuzzy sets both of these assumptions may be violated – they are part of a traditional logic which Zadeh finds inadequate and refers to as crisp logic.

Page 149: Revie Business Intelligence

techniques from ‘intelligent’ systems and artificial intelligence

149

Entities in fuzzy set theory are expressed as belonging to a set with different degrees of membership. The set membership function can take on values between 0 and 1. Thus if we are talking about the fuzzy set colloquially known as ‘small numbers’ we might decide on the following membership functions:

Number    Degree of membership of ‘small numbers’
1         0.99
2         0.90
3         0.90
…
10        0.40
…
100       0.01

Of course the actual allocation of membership level will depend on the context. If, in the example above, we were talking about the selling price of houses then I guess all numbers below around 15,000 might have a degree of membership close to 1. In addition to allowing partial membership of a set, fuzzy logic also allows that an object may belong to more than one set. The graph below (Figure 5.6) represents the height of females and the fuzzy sets representing short, medium and tall females. Notice that females of a height between 1.7 and 1.8m have partial membership of both the set ‘medium’ and the set ‘tall’. (Once again even this more constrained example depends on context. The representation would change when referring to the heights of men, or perhaps even of women from different ethnic groups.)

The rules for combining membership weightings where objects are defined in terms of a combination of AND, OR and NOT statements are very similar to those employed in certainty factor algebra as defined in the next section. The remaining approaches in our discussion of uncertainty can all be considered to be attempts to incorporate probability theory in some way or another, and as such address the issue of randomness or ‘noise’ in information. Zadeh contends that many of the problems of imprecision and ambiguity in descriptive knowledge are caused by vagueness and not randomness. He therefore has proposed fuzzy sets to form the basis of what he refers to as ‘possibility theory’ – an attempt to handle vagueness in information in a way analogous to that in which probability theory handles randomness. (It should be noted in leaving this subject that there is still much debate about the general utility or indeed need for fuzzy set theory. Zadeh has a reasonably broad set of detractors both within the statistical and AI communities.)
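The max/min/complement combination rules just described can be sketched in a few lines of Python. The trapezoidal membership functions and their breakpoints below are invented purely for illustration; as the text stresses, real membership values depend entirely on context.

```python
# Fuzzy membership and the max/min/complement combination rules.
# The breakpoints are illustrative only (loosely modelled on Figure 5.6).

def trapezoid(x, a, b, c, d):
    """Membership rises from a to b, stays at 1 until c, falls to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def small(h):  return trapezoid(h, 1.2, 1.3, 1.4, 1.5)
def medium(h): return trapezoid(h, 1.4, 1.5, 1.7, 1.8)
def tall(h):   return trapezoid(h, 1.7, 1.8, 2.0, 2.1)

h = 1.75  # a height with partial membership of both 'medium' and 'tall'
print(medium(h), tall(h))       # both strictly between 0 and 1
print(max(medium(h), tall(h)))  # membership of 'medium OR tall'
print(min(medium(h), tall(h)))  # membership of 'medium AND tall'
print(1 - tall(h))              # membership of 'NOT tall'
```

Note that, unlike a probability distribution, the memberships of ‘medium’ and ‘tall’ at a given height need not sum to 1: partial membership of several sets at once is exactly what fuzzy set theory permits.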

[Figure 5.6 Graphical representation of fuzzy set membership for females of different heights: membership from 0 to 1 on the vertical axis, height from 1.3m to 1.9m on the horizontal axis, with curves for the sets ‘small’, ‘medium’ and ‘tall’]

Certainty factor algebra

One of the best known early rule-based expert systems was MYCIN, a diagnostic medical system developed at Stanford University by Ed Shortliffe and colleagues. While this was an interesting and useful system in its own right, its most significant contribution was to provide an alternative method of dealing with uncertainty. As we shall see, using ‘true’ statistical approaches (as considered by Bayesian statisticians) the correct approach to measuring confidence in an assertion is to somehow measure the probability that the evidence supports the conclusion. However, in medical diagnostic systems such as MYCIN there is often an attempt to codify much more heuristic knowledge from medical specialists. In many cases these rules of thumb cannot be given a precise probabilistic weighting but can be described using terms such as ‘possible’, ‘unlikely’, ‘very likely’, etc. The Stanford certainty theory (more often referred to as the ‘certainty factors’ approach after the main notation used) attempted to capture this level of uncertainty associated with heuristic rules.

Thus an expert helping us to build our financial advisor might state that, ‘it is highly likely that the best option for a high income earner with adequate savings is to invest in the stock market’. Using the certainty factors approach this could be coded as:

IF High_Earner AND Adequate_Savings
THEN Invest_in_stock_market (CF 0.95)

Of course in addition to having a degree of belief (CF) attached to this rule, the premises themselves may be associated with certainty factors (e.g. ‘high_earner’ may itself be the conclusion of another rule with a CF of, say, 0.80).

Knowing how to combine the CF values of premises and rules is the key to being able to use certainty factors and to the whole certainty theory approach. Unlike the Bayesian approach there are no first principles on which to base the algebra, so a set of heuristics exists to guide the reasoning. These state that given two propositions P1 and P2 with associated CF values:

CF (P1 OR P2) = MAX(CF(P1), CF(P2))
CF (P1 AND P2) = MIN(CF(P1), CF(P2))

These make intuitive sense in that if either one of two things can justify an assertion as being true (the ‘OR’ situation) then we can take the highest confidence level, while in the situation that all individual statements must be true (the ‘AND’ situation) we take the lowest CF value. An additional rule is that if the premise(s) and the rule itself both have CF values then the CF of the conclusion is found by multiplying these two values together. Using a slightly extended version of our financial rule given above we can illustrate these rules.

IF High_Earner AND (Adequate_Savings OR Large_Inheritance)
THEN Invest_in_stock_market (CF 0.95)

Now assume that we have worked through the rules which lead to this statement and know that the CF values of the premises are as follows:

CF (High_Earner) = 0.6
CF (Adequate_Savings) = 0.9
CF (Large_Inheritance) = 0.25

What would be our overall CF for Invest_in_stock_market? We can use the heuristic rules of certainty theory as follows:

CF (Adequate_Savings OR Large_Inheritance) = MAX(0.9, 0.25) = 0.9
CF (High_Earner AND (above)) = MIN(0.6, 0.9) = 0.6
CF (Invest_in_stock_market) = 0.95 × CF(premises) = 0.95 × 0.6 = 0.57
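The certainty-factor algebra is easy to mechanise. The following short Python sketch applies the MAX/MIN/product heuristics to the CF values from the worked example and reproduces the 0.57 result:

```python
# The certainty-factor combination heuristics, applied to the extended
# financial-advisor rule from the text.

def cf_or(cf1, cf2):
    return max(cf1, cf2)   # either premise alone can justify the conclusion

def cf_and(cf1, cf2):
    return min(cf1, cf2)   # the weakest conjunct limits our confidence

def cf_rule(rule_cf, premises_cf):
    return rule_cf * premises_cf   # attenuate the rule's CF by that of its premises

high_earner, adequate_savings, large_inheritance = 0.6, 0.9, 0.25

premises = cf_and(high_earner, cf_or(adequate_savings, large_inheritance))
invest = cf_rule(0.95, premises)
print(premises)           # 0.6
print(round(invest, 2))   # 0.57
```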

There is in fact one more combination heuristic which comes into effect when two or more rules (with associated CFs) support the same conclusion. We will not go into the details of this extension, or indeed the more fundamental assumption that all confidence factors may in fact be expressed as providing support for or against a particular conclusion. This again goes against the rules of probability theory, where the evidence for and against must sum to 1. However, it reflects the practical situation that the Stanford scientists found when interviewing experts, who might state that they were ‘pretty sure’ a relationship was true (say a CF = 0.75) but had no intuitive feeling as to the likelihood of the statement not being true. (Readers who are interested in learning more about this approach and the success of the MYCIN system are directed to Buchanan and Shortliffe (1984).)


In summary, certainty factors provided a pragmatic approach to dealing with probability values in a relatively informal and intuitive manner. They were not based on any underlying sound statistical theory and were largely ad hoc. However, bearing in mind the earlier quote from Feigenbaum, they were fairly effective within the context in which they were initially adopted. Recently, as more practical approaches for dealing with formal probability statements (in particular Bayesian belief networks – see below) have become more widespread and the potential pitfalls of poorly controlled certainty factors solutions exposed, they have largely fallen into disuse.

Dempster-Shafer theory of evidence

Before looking at the use of conventional probability theory, there is one other approach which makes use of subjective probabilities (as was the case in certainty theory above). The theory of evidence proposed by Dempster and Shafer (Dempster, 1968) introduces the interesting concept of plausibility. Instead of assigning a specific probability to a proposition, which they argue is too restricting and in many cases inaccurate, they state that any assertion can be seen as having a ‘truth’ value lying in the interval [belief, plausibility]. The degree of belief (BL) we might have will range from 0 (no evidence to support the proposition) to 1 (certainty of belief). The level of plausibility (PL) of a proposition will also range from 0 to 1 based on the following definition:

PL(a) = 1 – BL(¬a)

Thus if we had certain evidence to support NOT(a) the value of BL(¬a) would be 1 and so the value of PL(a) would be 0. By definition, the value of BL(a) would also be 0, therefore indicating that our belief in the proposition (a) would be expressed on the belief-plausibility interval [0,0].

To see the practical difference which this approach makes to reasoning, it is perhaps easiest to look at the situation of entire ignorance – e.g. we have no evidence for the truth (or otherwise) of each of two competing hypotheses h1 and h2. This being the case, in the Dempster-Shafer approach both of these hypotheses would be stated as having a belief-plausibility interval of [0,1] (i.e. we have no reason to believe that either of these hypotheses is true but equally they are both entirely plausible). This differs from the approach of Bayesian probability, where in the absence of any evidence these two hypotheses would typically be allocated p(h1) = p(h2) = 0.5, i.e. the same, but non-zero values. The theory of evidence approach would argue that such probability values are unwarranted and that the formal probabilistic calculus fails to distinguish between the notions of uncertainty and ignorance.
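The belief-plausibility interval can be sketched directly from the single definition given above, PL(a) = 1 − BL(¬a). The BL values in this sketch are illustrative only:

```python
# The Dempster-Shafer [belief, plausibility] interval, using only the
# definition PL(a) = 1 - BL(not a). All BL values here are invented.

def interval(bl_a, bl_not_a):
    """Return the [belief, plausibility] interval for proposition a."""
    return (bl_a, 1 - bl_not_a)

print(interval(0.0, 0.0))  # (0.0, 1.0): total ignorance -- a is entirely plausible
print(interval(0.0, 1.0))  # (0.0, 0.0): certain evidence against a
print(interval(0.4, 0.3))  # partial evidence: belief 0.4, plausibility 0.7
```

The first case is the one discussed in the text: with no evidence either way, both competing hypotheses sit on [0,1], rather than being forced to an arbitrary non-zero probability.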

Thinking point: Uncertainty vs ignorance. How would you define the difference between uncertainty and ignorance?

We do not have space here to get into a full discussion of Dempster-Shafer reasoning with its associated combination rules, etc. It is however interesting to note that as far back as 1968 (very early days indeed in terms of AI) these issues were being raised, and they have still not been adequately resolved. In many ways the interest in and debate surrounding the use of fuzzy set theory can be seen as a reflection of these early concerns regarding the adequacy of ‘objective probability’ theory as expressed in Bayesian representation and reasoning, the subject to which we now turn to complete our discussion of uncertainty.


The ‘true’ statistical approach – Bayes’ theorem and conditional probabilities

A number of the approaches outlined above, in particular certainty factors and Dempster-Shafer theory, can be considered to involve ‘subjective’ probabilities in their reasoning mechanism. As we noted, they tend not to adhere to all the ‘classical’ laws of probability theory, relaxing some in a way that appears to create systems better able to mimic genuine human reasoning. However, in moving away from a formal mathematical basis they also become open to incoherence and inconsistent application. The ‘Bayesian’ approach to uncertainty, on the other hand, attempts to adhere to an ‘objective’ view of probability. It conforms to all the laws of probability theory and specifies mathematically axiomatic (rather than heuristic) methods of propagating uncertainty from one proposition to any of its dependents. The historical problem with this approach was that it carried a heavy computational overhead to maintain fully consistent sets of probabilities and also required the pre-definition of an unrealistic set of conditional probability estimates (or involved making unrealistic assumptions about conditional independence). Indeed it was for these very reasons that the more ‘subjective’ approaches were suggested – i.e. the Bayesian approach was considered to be infeasible in the 1970s and 1980s. More recently the development of Bayesian Belief Networks (BBNs) has made feasible the definition and manipulation of complex sets of dependent probabilities. We shall come to BBNs in a moment, but first it is important to remind ourselves of some basic definitions linked to ‘classic’ probability theory.

I always approach an introduction to probability with some apprehension – for some reason this seems an area particularly poorly covered in school and introductory higher education teaching. I will introduce the key concepts by means of illustrative examples and keep the theory to a minimum. In the discussion I will refer again to the BBN called CaDDiS (Cattle Disease Diagnosis System) which was built by a group of researchers at the University of Strathclyde. This is the implementation we chose to represent the expert knowledge collected on African cattle diseases (introduced in Chapter 4 to illustrate aspects of density functions and then hierarchical clustering). The decision support application is available as a stand-alone Windows program or on the CaDDiS website, at:

http://vie.dis.strath.ac.uk/vie/CaDDiS/about.html

This site also contains a number of help/background pages, including our ‘Dummies Guide to Bayesian Belief Networks’. Some of the material from these pages has been selected and is used in the explanations given below.

Prior and posterior probabilities

The basic notion of probability is not difficult to grasp (notice I did not say that its implications were obvious or intuitive – if people could achieve even a crude understanding of those then it would be very bad news indeed for Camelot and other lottery operators the world over!). The example of the lottery, and games of chance in general, are often used when discussing probability – partly because they are familiar to people but also because they fulfil a basic assumption required by probabilistic analysis, which is that they are genuinely random. (Which is one reason why the application of data mining – or any other fancy analysis – to predict winning numbers or sequences in the National Lottery will always be a fool’s errand.) The probability that an event will happen can therefore on occasions be stated a priori, based on our knowledge of the world of possible outcomes. Thus:

P (2 on a single roll of a ‘fair’ dice) = 1/6
P (Heart from a full deck of cards) = 13/52 = 1/4
etc.


There is also the possibility that where we have a long series of data relating to a given event we can use the ‘frequency of occurrence’ to come up with a fair approximation of its likelihood, e.g.:

P (train being on time) = 1070/1150 ≈ 90%

There are a few other basic definitions and axioms of probability:

P (certain event) = 1 [e.g. P (sun will rise today)]
P (impossible event) = 0 [e.g. P (me win lottery)]
Σ P (all events) = 1

Linking together multiple events and calculating their combined probability is where it begins to get tricky. If two events, say A and B, can be considered to be independent then the probability of both of the events happening is simply the product of the two separate probabilities, e.g.

P (2 on a single roll of a ‘fair’ dice) = 1/6
P (4 on a single roll of a ‘fair’ dice) = 1/6
thus
P (2 followed by a 4) = 1/6 × 1/6 = 1/36

Strictly, when we have more than one event happening we state that we have a conditional probability, expressed as P(A|B) [‘probability of A given B’], that is the chance that A will occur may be affected (‘conditioned’) by the fact that B has occurred.³ This can be interpreted as the posterior probability, as our prior belief has been modified in the light of some evidence. Thus:

P (train being on time | icy weather) = 60% (or some other number quite a bit lower than the initial 90% figure given above, which did not state any particular conditions)

The case where joint events are conditionally independent is again simple to deal with; indeed this is how we define conditional independence. Two events, A and B, are said to be conditionally independent if:

P(A and B both happening|C) = P(A happening|C) × P(B happening|C)

That is to say, where we have some information about a situation (through our knowledge that event C has happened), the knowledge that event A has also happened has no effect on our estimate of the likelihood that B will occur.

In our ‘late train’ example we could propose an additional event that might affect whether I get to work on time, ‘alarm clock operates’, thus:

P (train being on time and alarm clock operates|icy weather) = P (train being on time|icy weather) × P (alarm clock operates|icy weather)

The chance that my alarm clock might malfunction has no effect on the probability that the train will be on time. Indeed arguably the weather has no effect on my alarm clock and thus:

P (alarm clock operates|icy weather) = P(alarm clock operates)

Given that this is the case I could have simplified the above statement immediately, and this type of independence is used within Bayesian networks to simplify and reduce the number of joint probabilities that must be estimated. If the additional event had been ‘leaves on the line’ then clearly the situation would have been much more tricky. In that case we would NOT have had conditional independence and the simple product rule would not have held. The chance of getting leaves in the first place is linked to the weather (more likely in late autumn, when it is also more likely to be wet and/or icy) and then the chance of the train being on time would also be altered by our knowledge that leaves were a ‘given’.

In the case where we do not have conditional independence we must use Bayes’ Rule to manipulate the conditional probabilities. However, before moving on to Bayes’ theory it may be useful to give one further example (from our ‘Dummies Guide’) as this area can be quite confusing if you have not met this type of reasoning before.

³ See also Chapter 1 of the Risk Management module.

Other examples involving conditional probabilities

You wish to guess the nationality of a male tourist sitting at a neighbouring table in a café in Glasgow. What nationality would you suspect if you realised that he was speaking in Finnish? How would the conclusion differ if instead you heard him speaking in French?

Finnish is a language which has no relation to most of the other languages of Europe. It is correspondingly more difficult for non-Finns to learn the language than to learn other Indo-European languages. This is reflected in the statement that:

P (He can speak Finnish|He is not Finnish) is close to zero.

This is a conditional probability, i.e. as we noted above, it states the chance of something being true conditional on something else being true. However, it is also true that:

P (He can speak Finnish|He is Finnish) will be close to 1,

since almost all citizens of Finland will be able to speak their national language. It is intuitively obvious that if these two probability statements are true then our observation that the tourist can speak Finnish is strong evidence that he is, indeed, a Finn.

By contrast, French is an international language which is taught as a second language in many other countries. In addition, Belgium, Switzerland and Canada (as well as many French protectorates and former colonies) have sizeable populations whose mother-tongue is French. Hence it follows that:

P (He can speak French|He is not French)

may be relatively high. Hence, although:

P (He can speak French|He is French)

will be close to one, it is intuitively clear that since speaking French is not such a distinctive mark of nationality as is speaking Finnish, an observation that the tourist is speaking French is not particularly strong proof that he is French.

Life rarely involves the isolated consideration of single pieces of information. We might hear the tourist speaking English to the waiter, and recognise that he is speaking with a Scandinavian accent. If we had previously heard him speak in Finnish, this would add little to our knowledge. However, if we had earlier heard him speaking in French, we might adjust our ideas to conclude that he is probably a Scandinavian who happens to speak excellent French (though he might be a Frenchman who was taught English by a Swede!)

Examples such as these can get very complicated, and the human brain becomes overloaded when asked to evaluate large quantities of conditional information. In particular, we tend to give too much weight to new information which confirms earlier evidence, even where the new information follows as a consequence of the earlier evidence. Thus if the tourist had earlier spoken in Finnish, it is very likely that he will


have a Scandinavian accent when speaking English. Our observation of this fact does not therefore add much (anything?) to our earlier assumptions about his nationality. Fortunately, for at least some of these situations Bayes’ Law can assist us in dealing with the ‘conditionality’ of the statements, while BBNs can help in the structural representation and manipulation of them.

Note that simply to state that we are using conditional probabilities is not in itself sufficient to ensure correct manipulation and interpretation – we must be sure what it is that we are conditioning on. Perhaps the most famous (mis)use of conditional probability in recent history occurred in relation to the OJ Simpson murder trial in the mid-1990s, and was due to the well-known Harvard law professor Alan Dershowitz, who was an advisor to Simpson’s defence team. He stated on a US television programme that evidence relating to domestic abuse should not be admissible in the murder trial. His reasoning was that only 1 in 1,000, or 0.1%, of wife batterers actually murder their partners, i.e.:

P (Husband murders wife|Husband has abusive history) = 0.001

But Dershowitz had not conditioned his probability on the correct evidence, because in this case not only was it known that OJ Simpson had a history of domestic abuse, it was also known that Nicole Simpson was dead. The correct conditional probability that the good professor should have tried to determine was therefore:

P (Husband murders wife|Husband has abusive history AND Wife has been murdered)

As it happened the most recent data that existed on public record in the USA at that time was for 1992, which showed that of the 4,936 women who were murdered that year, 1,430 had been killed by their current or former partner (i.e. even in the absence of any evidence regarding domestic abuse the prior probability that the mate was the murderer was around 29%!). The data on abusive relationships given that murder had taken place was even more staggering. Of the women murdered in 1992, 890 had documentation of domestic abuse and of these 715 were murdered by their current/former mate. Thus the conditional probability figure Prof Dershowitz should have used in the Simpson trial was around 80% rather than 0.1% – which would have put a rather different slant on the case for (in)admissibility!
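The figures quoted above can be checked directly. This short sketch recomputes both conditional probabilities from the 1992 counts given in the text:

```python
# Checking the conditioning error against the 1992 figures quoted in the
# text: 4,936 women murdered; 1,430 killed by a current/former partner;
# of 890 victims with documented abuse, 715 were killed by that partner.

p_partner_given_murdered = 1430 / 4936
print(round(p_partner_given_murdered, 2))   # around 0.29, even with no abuse evidence

p_partner_given_murdered_and_abuse = 715 / 890
print(round(p_partner_given_murdered_and_abuse, 2))   # around 0.80, not 0.001
```

The second value conditions on both pieces of known evidence (abusive history and the murder itself), which is what makes it the relevant probability for the argument.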

Bayes’ theorem

While we have defined conditional independence and also noted how joint probabilities are defined when two events are independent, we have yet to specify how to calculate joint probabilities from events that are not independent. To do this we apply an axiom on probability known as the product rule. This states:

P (A and B) = P (A|B) × P(B) = P (B|A) × P(A)

Remember that the ‘|’ sign should be read as ‘given’, so that if we have specified a value for the probability that ‘A will happen given that B has happened’ and we know the probability of ‘B’ we can calculate the probability of the joint event ‘A and B’ – more normally simply written as P(A,B).

In addition to the fact that this rule holds in the ‘opposite’ direction – i.e. if alternatively we know the probabilities ‘B given A’ along with ‘A’ then we can also calculate P(A,B) – the rule can be extended for more than two events. This is referred to as the chain rule and, for example, with four elements would work as follows:

P (A,B,C,D) = P (D|A,B,C) × P(C|A,B) × P (B|A) × P(A)
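The product rule and its chain-rule extension can be illustrated with a short sketch; all the probability values here are invented purely for the example:

```python
# The product rule P(A,B) = P(A|B) x P(B) = P(B|A) x P(A), and the chain
# rule that extends it. All probability values are invented for illustration.

p_b = 0.3
p_a_given_b = 0.5
p_ab = p_a_given_b * p_b      # P(A,B) via P(A|B) x P(B)

p_a = 0.25
p_b_given_a = p_ab / p_a      # the product rule read the 'opposite' way
assert abs(p_b_given_a * p_a - p_ab) < 1e-12   # both routes agree

# Chain rule over four events: each factor conditions on all earlier events
p_abcd = (0.40 *   # P(D|A,B,C)
          0.50 *   # P(C|A,B)
          0.60 *   # P(B|A)
          0.25)    # P(A)
print(p_ab, p_abcd)
```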


However, the most important rule in our discussion here comes from a rearrangement of the product rule above and is referred to as Bayes’ Theorem or Rule – after the English mathematician and theologian, Thomas Bayes (1702-61). Bayes’ Theorem states:

P (A|B) = P (B|A) × P(A)/P(B)

Why is this a particularly interesting rule? One of the most significant results is that it lets us assess diagnostic probabilities from causal probabilities. In other words:

P (cause|effect) = P (effect|cause) × P(cause)/P(effect)

Often the most difficult probabilities to estimate, even for experts, are the ‘diagnostic’ ones – i.e. ‘what was the likely cause given we see this set of effects?’ Using Bayes’ Rule these do not need to be specified but can be calculated from a series of ‘causal’ probabilities which are typically easier to estimate.

Take a medical example – perhaps the most obvious example of diagnostic reasoning, but the same approach holds for diagnosing faults in products or processes, or indeed for ‘diagnosing’ the cause of lost customers or cancelled mortgages. Let’s say we wish to estimate the probability that a patient has malaria given that they have a fever. If we let Mal be ‘has Malaria’ and Fev be ‘fever present’ then what we wish to estimate for a patient is:

P (Mal|Fev)

This ‘diagnostic’ value may be difficult to estimate but an expert could provide us with the ‘causal’ probability, i.e.

P (Fev|Mal) = 0.98

In other words, if the patient does in fact have malaria there is a pretty high chance they will have a fever – but what does this tell us about the likelihood that they have malaria? Well, we need two additional pieces of non-conditional information:

P (Mal) = 0.0001 (i.e. only 1 in 10,000 people picked at random in the UK will have malaria at any given time)

P (Fev) = 0.05 (i.e. around 1 in 20 people will have a fever at any time).

Using Bayes’ Rule we can now work out the probability we were originally interested in, that is:

P (Mal|Fev) = P (Fev|Mal) × P (Mal)/P (Fev)
= 0.98 × 0.0001/0.05
= 0.00196

We would thus say that our prior belief that a patient walking in off the street had malaria was 0.0001, while in the light of the diagnostic evidence (i.e. a fever) our posterior belief is around 0.002. Note that while this has altered our belief, the chance that the patient has malaria is still very small (less than 1 in 500). Note also that the above example is totally fictitious and may make little sense in a real medical situation. If we were talking about sub-Saharan Africa, for example, our prior belief in ‘malaria’ (based on its prevalence in the population) may be as high as 1 in 10, i.e.

P (Mal) = 0.1

Or, if we discovered that our patient had recently made a trip to a tropical destination, once again our ‘prior’ would be altered, for example:

P (Mal|‘recent tropical trip’) = 0.005
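Bayes’ Rule as used in the malaria example is a one-line calculation. The sketch below reproduces the 0.00196 posterior from the (admittedly fictitious) figures in the text:

```python
# Bayes' Rule applied to the malaria example: converting the 'causal'
# probability P(Fev|Mal) into the 'diagnostic' probability P(Mal|Fev).
# All figures come from the fictitious example in the text.

def bayes(p_effect_given_cause, p_cause, p_effect):
    """P(cause|effect) = P(effect|cause) * P(cause) / P(effect)."""
    return p_effect_given_cause * p_cause / p_effect

p_fev_given_mal = 0.98   # causal: fever is very likely given malaria
p_mal = 0.0001           # prior: 1 in 10,000 in the UK
p_fev = 0.05             # around 1 in 20 people have a fever at any time

posterior = bayes(p_fev_given_mal, p_mal, p_fev)
print(round(posterior, 5))   # 0.00196 -- roughly 20 times the prior, still under 1 in 500
```

Changing the prior (to the sub-Saharan or ‘recent tropical trip’ values above) changes the posterior proportionally, which is exactly the point the text is making: the evidence is the same, but the conclusion depends on what we believed beforehand.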

We cannot take further time to discuss more about Bayes’ theorem here but would simply note that a clear understanding of its implications is important in many areas. Take, for example, a test for HIV status. While the accuracy of the test may seem very high – e.g. there is a 99% chance that if you have HIV this test will be positive – this is


not really the most important issue. If other ‘external’ factors would lead you to believe that your prior likelihood of being HIV positive was very low, then the fact of a positive test may only marginally alter that belief. Similar arguments hold for applications such as compulsory drug testing of employees. While the statistical principles are clear, it gives some cause for concern that human inference often radically deviates from Bayesian inference. For example, in a study of physicians Eddy (1982) asked for an estimation of the probability that a woman with a positive mammogram actually had breast cancer, given a 1% base rate for breast cancer in the population, a hit rate of 80% and a false alarm rate of 10% for the mammogram. Of the 100 physicians involved in the study, 95 estimated that the probability of the woman having cancer was between 70–80% – the correct probability estimate using Bayes’ rule is only around 8%.⁴

We now turn from the theory of Bayesian inference to its application in an important and growing class of AI system, the Bayesian belief network.
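Eddy’s mammogram question can be answered mechanically with Bayes’ rule plus the law of total probability. The sketch below reproduces the roughly 8% figure from the rates quoted in the text:

```python
# Eddy's (1982) mammogram question: base rate 1%, hit rate 80%,
# false-alarm rate 10% -- what is P(cancer | positive mammogram)?

base_rate = 0.01     # P(cancer)
hit_rate = 0.80      # P(positive | cancer)
false_alarm = 0.10   # P(positive | no cancer)

# Law of total probability gives P(positive)
p_positive = hit_rate * base_rate + false_alarm * (1 - base_rate)

# Bayes' rule
p_cancer_given_positive = hit_rate * base_rate / p_positive
print(round(p_cancer_given_positive, 3))   # 0.075 -- around 8%, not 70-80%
```

The intuition behind the low answer: with a 1% base rate, false alarms from the 99% of healthy women vastly outnumber the true positives from the 1% with cancer.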

Bayesian belief networks (BBNs)

A BBN (also known as a ‘belief network’, ‘Bayes network’ or ‘causal probabilistic network’) provides a method to represent propositions (variables) and the relationships between them, when these relationships involve uncertainty. Each node in the network represents a single variable. These nodes are connected into a network formally known as a directed acyclic graph (DAG), which means that any connection between two nodes must be directed (and interpreted as ‘directly influences’) and that no set of nodes may be connected in such a way that the graph ‘loops’ round to influence itself (i.e. it is not cyclical but acyclic).

The diagram in Figure 5.7 shows a DAG for the example introduced earlier in this section, relating to whether or not I will be on my train at 6.52am on any given morning. Of course, Figure 5.7 is a simplification and ignores such things as the driver not turning up, high winds knocking out power cables, etc. In addition to creating the DAG, which indicates the influences between variables, it is necessary to create a conditional probability table for each node/variable given its parents. The conditional probability tables have been added to the DAG to create a simple BBN using the Hugin package, as illustrated in Figure 5.8 (overleaf).
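To make the DAG idea concrete, the train example can be encoded as a simple adjacency map and checked for acyclicity. The arcs below are my own plausible reading of the node names in Figure 5.7, not a definitive transcription of the figure.

```python
# A hypothetical encoding of the Figure 5.7 DAG: each node maps to the
# nodes it 'directly influences'. The exact arcs are assumptions.
dag = {
    "Winter": ["Icy weather", "Leaves on line"],
    "Icy weather": ["Me at station", "Train on time"],
    "Leaves on line": ["Train on time"],
    "Alarm operates": ["Me at station"],
    "Me at station": ["Me on the 6.52"],
    "Train on time": ["Me on the 6.52"],
    "Me on the 6.52": [],
}

def is_acyclic(graph):
    """Depth-first search; a back-edge to a node on the current path means a cycle."""
    seen, path = set(), set()
    def visit(node):
        if node in path:      # back-edge found: the graph 'loops' round
            return False
        if node in seen:
            return True
        seen.add(node)
        path.add(node)
        ok = all(visit(child) for child in graph[node])
        path.discard(node)
        return ok
    return all(visit(n) for n in graph)

print(is_acyclic(dag))  # True -- a valid DAG, as a BBN requires
```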

4 Note: Such systematic deviations from Bayesian reasoning have been called 'cognitive illusions', analogous to stable and incorrigible visual illusions (von Winterfeldt & Edwards, 1986; for a discussion of the analogy, see Gigerenzer, 1991).

Figure 5.7 BBN representing the nodes which will affect the chances that I make it onto my regular train to set off in time for work (nodes: Winter; Icy weather; Leaves on line; Alarm operates; Train on time; Me at station; Me on the 6.52)


The fact that you only need to specify the conditional probabilities that 'matter' (i.e. have an influence in the BBN) greatly reduces the task of thinking about the problem. For example, in the case of CaDDiS we had 20 diseases and 27 possible signs, all of which could be either true or false – i.e. the exhaustive joint probability distribution would contain 2⁴⁷ (over 140 trillion!) entries. While CaDDiS did involve quite a few probability estimates, the total number was in the order of hundreds. In addition, because the BBN can use Bayes' theorem to reason from causal knowledge to diagnostic, the definition of those probabilities that were needed was a reasonably straightforward exercise for the veterinary experts.
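The parameter-count argument can be made concrete with a little arithmetic. The parent counts below are hypothetical (the text does not give the CaDDiS network structure); they are chosen only to show how a sparse BBN stays in the hundreds of probabilities.

```python
# Exhaustive joint distribution over 47 binary variables (20 diseases + 27 signs)
full_joint = 2 ** 47
print(f"{full_joint:,}")   # 140,737,488,355,328 -- over 140 trillion entries

# BBN: one probability per parent configuration of each node.
# Hypothetical structure: 20 root disease nodes (one prior each) and
# 27 sign nodes, each with at most 3 disease parents (2**3 configurations).
bbn_params = 20 * 1 + 27 * (2 ** 3)
print(bbn_params)          # 236 -- on the order of hundreds, as the text reports
```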

Exercise 5.3 CaDDiS

Have a look at CaDDiS (at http://vie.dis.strath.ac.uk/vie/CaDDiS/). What assumption has been made about the prior probabilities of the diseases included in this diagnostic disease system? Is this a valid assumption? From your (admittedly limited) knowledge of BBNs, should it be possible and/or straightforward to take an alternative approach to the definition of priors?

Figure 5.8 The 'train on time' BBN illustrated with probability distributions as it might be implemented in a software package (in this case Hugin)


We can add decision variables (representing things over which we have some control) and utility variables (what we wish to optimise) into the relationships within a BBN to create a decision network (also known as an 'influence diagram'). You will have talked about decision/utility theory in other classes but have probably not considered in any detail how the uncertainty implicit in these networks/diagrams is handled. BBNs provide one of the best representation/reasoning mechanisms for building decision networks (as we shall see during the second module seminar day). For the moment we shall end this discussion with what I consider to be one of the most significant benefits of the BBN approach, and one which sets it apart from the other approaches discussed for dealing with uncertainty.

Because it uses a mathematical formalism, the BBN can automatically update (propagate) the beliefs associated with every node in the network as each new piece of evidence is gathered. In addition, although the construction of the graph is focused in a 'causal' direction (i.e. join all the nodes where A 'is the cause of'/'has a direct influence on' B), the end product (the BBN) can be used to reason in either a causal or a diagnostic fashion. For example, in the CaDDiS system not only can you see the effect on the likelihood of each disease of adding another diagnostic sign, you can also query the network in the 'opposite' direction and ask 'given that I suspect that the disease may be X, what are the next most confirmatory signs to help prove this is the case, given all the evidence I already have at this point'.

Causal networks

Before moving on to discuss the practical creation of expert systems using rule-based approaches it is worth adding a 'footnote' on one final uncertainty representation mechanism. The same causal modelling approach adopted in the case of Bayesian belief networks is also used in a more informal manner in a class of systems referred to as 'causal networks'. Here nodes which have a causal influence upon one another are given a numerical weighting to indicate a range of 'confidence' levels. Thus if it was deemed that A 'almost always causes' B, these two nodes would be connected by a directed arc with a weighting of, say, 10; while the fact that B 'occasionally is the cause of' C would lead to a directed arc with a weighting of 2. (Assuming that A never directly causes C, there would be no arc linking these two nodes.) Again we have no space here to discuss the details of this approach (and in my view they are of largely historic interest, as the BBN approach incorporates most of their advantages and none of their limitations). The interested reader is directed to two well-known and successful implementations of diagnostic medical systems built using causal networks: CASNET (Weiss et al., 1977), which diagnosed types of glaucoma; and ABEL (Patil et al., 1981), which reasoned about acid base and electrolyte imbalances in human patients.


5.4 Rule-based inference systems

We now turn our attention to the most commonly used approach in practical AI implementations to date: rule-based systems. In many ways these can be seen as a 'friendly' way of using predicate calculus to reason with knowledge in a format that feels more natural to most users. Although some rule-based systems have tried to incorporate uncertainty, mostly through the use of a certainty factors approach, in practice the majority have not, and so in this section we will largely ignore the subject we have just dealt with in some detail. (This also reflects the nature of much of the debate on the practical implementation of expert systems: the rule-based protagonists are keen to get on and build systems, while those worried about representational integrity and in particular uncertain knowledge tend to look at more complex inference options – and end up building nothing. That is perhaps a little unfair, as the growth of commercially available development environments using techniques such as BBNs has altered the picture over the past 2–3 years.)

One question which arises when discussing rule-based systems is, how do these vary from a simple flow-chart which has been implemented in a conventional programming language such as C or Basic? At one level a set of rules can be seen as a large decision tree (or trees) and can thus be viewed as a connected series of 'if-then-else' statements. While this is true at one level it misses the point. Firstly, the 'tree' for any realistic expert system application would be very large and unwieldy. More importantly, having to manage the whole set of rules as a large tree would negate one of the key benefits of the rule-based approach – the separation of the knowledge from the control aspects of the system. In a knowledge-based system it is important to be able to add new rules (and indeed facts) without worrying too much about how they fit in with the rest of the rule-set; that is the job of the inference engine, which looks after the control knowledge (as opposed to the subject domain knowledge).

This distinction is a reflection of the more general point that most techniques used within AI operate on separation of the knowledge structures from the control elements of the program. This difference was nicely illustrated by Levesque and Brachman (1985) through the use of two versions of the same logic schema:

Version A

print_color(rose) ← write(red)
print_color(sky) ← write(yellow)
print_color(grass) ← write(green)
print_color(_) ← write('color unknown')

Version B

print_color(X) ← color(X, Y), write(Y)
print_color(_) ← write('color unknown')
color(rose, red)
color(sky, yellow)
color(grass, green)

These two programs have exactly the same behaviours. However, the second version, which has added the predicate color (if you can excuse the American spelling and the 'yellow' sky!), is much more flexible. By 'wrapping up' any implementation details associated with print_color in the first line we can forget about these procedural issues later. When adding new objects to the system in future we can therefore think simply about the 'knowledge' dimension (i.e. what colour are they) without being concerned about any procedural control associated with their printing. (Of course these principles are not unique to AI programs – they are seen in the functions and procedures of conventional programming environments and in the functional polymorphism of object-oriented approaches. In these cases, however, their use is associated with improved efficiency of coding and of maintenance. In many AI programs, where focusing on building the 'knowledge base' can equate to 90% of the effort, this division becomes essential.)


The diagram in Figure 5.9 illustrates this division between the control aspects of the system (the inference engine) and the core knowledge base. A number of other elements necessary for an operational knowledge-based system are shown, but before we discuss these it is important to understand a little more about the rules themselves. To do this we will return to the financial advisor introduced earlier in the chapter when we discussed predicate calculus. A set of rules which incorporates the knowledge held in the financial advisor would be as follows:

R1 IF savings_adequate = .F.
   THEN investment = 'Savings'
R2 IF savings_adequate = .T. AND income_adequate = .T.
   THEN investment = 'Shares'
R3 IF savings_adequate = .T. AND income_adequate = .F.
   THEN investment = 'Combination'
Rn etc.

For a realistic application the knowledge, such as that illustrated above, would be comprised of hundreds and perhaps even thousands of rules. In this situation the question as to which rules should be fired, and in what order, becomes a non-trivial task. There are a number of strategies which rule-based reasoning systems employ to tackle the issue of rule execution. These strategies include the general 'direction' of the rule inferencing procedure (forward or backward chaining), the specific search strategy adopted and the use of meta-rules.

In the case of forward chaining we begin with the facts that we have and attempt to add to these until we have reached some conclusion. Backward chaining employs precisely the reverse approach in that it begins by selecting a potential end-point (a final conclusion) and then reasons 'back' to see whether the facts that would support this conclusion hold. The backward chaining approach is akin to the process of hypothesis testing as carried out in human reasoning. Does this mean that we naturally reason in a 'backwards' direction? Not necessarily. Take the case of a doctor trying to diagnose what is wrong with you – does she use backward or forward chaining? Well, certainly at some point there is the likelihood that she will use backward chaining (i.e. she will test certain hypotheses to see whether or not she can conclude that you have a certain disease). However, this would be a very inefficient way to begin the process – i.e. 'let's choose one disease from the potentially thousands of diseases you might have and uncover facts until we have proven that this is (or most likely is not) the disease you have'. In fact the initial consultation is more likely to involve 'forward chaining', with the doctor trying to uncover as many facts about your situation as possible before thinking of any ultimate conclusion/diagnosis.

This simplistic example illustrates the fact that the 'backward chaining' approach is most efficient when there are few conclusions and many possible facts, while 'forward chaining' operates more effectively given the opposite circumstance. It also illustrates the fact that humans often alternate between the two approaches at different stages in the reasoning process – a technique which is not so easy to emulate in rule-based systems where the basic inference engine tends to operate in only one of these two main 'directions'.
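The forward-chaining idea can be sketched in a few lines: fire any rule whose conditions are all satisfied, add its conclusion to the fact base, and repeat until nothing new emerges. This is a minimal illustrative sketch, not a production inference engine; the fact encoding, based on the financial advisor rules above, is my own.

```python
# Rules as (conditions, conclusion) pairs, mirroring R1-R3 in the text.
rules = [
    ({"savings_adequate:False"}, "investment:Savings"),
    ({"savings_adequate:True", "income_adequate:True"}, "investment:Shares"),
    ({"savings_adequate:True", "income_adequate:False"}, "investment:Combination"),
]

def forward_chain(facts):
    """Fire rules until no new facts can be derived (naive forward chaining)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all its conditions are already known facts
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

result = forward_chain({"savings_adequate:True", "income_adequate:False"})
print("investment:Combination" in result)  # True
```

A backward chainer would instead start from a candidate conclusion such as "investment:Shares" and work back through the rule conditions to see whether the supporting facts hold.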

Figure 5.9 'Classic' diagram of the ES/KBS layout (components: inference engine, knowledge base, explanation, interface, workspace)


Even once the basic direction of reasoning has been decided, there is still a requirement to apply some thought to the various search strategies that exist to explore the rule space. We do not have space here to go into the details of the various options which exist, including depth-first, breadth-first, etc. Selection of the best option will normally be a matter of the overall 'shape' of the rule space that has to be searched. The reasoning process may also be aided by the inclusion of meta-rules in the knowledge base. Strictly, these meta-rules do not contain knowledge about the domain – rather they are 'rules about rules'. They encode knowledge about when to use certain types of rules in preference to others and as such say nothing of relevance regarding the domain knowledge. In a more complex version of our financial advisor we might thus have the following type of meta-rule:

Rx IF rules exist that refer to property and others to bonds in regard to savings
   THEN execute the 'property' rules before looking at the rules relating to bonds

Of course this would have to be stated in a more formal manner to be understandable to the inference system, but it does illustrate the nature of meta-rules as well as their function in the reasoning process.

Various other elements are often present in an expert system, as shown in the schematic overview in Figure 5.9. One which is not always present is a mechanism to enable the experts to add to and update rules within the system. This is a useful facility and avoids the need to refer back to the system builder, but the flexibility of allowing expert update may have to be off-set by less sophisticated inference options within the reasoning itself. In the most advanced case the expert system may have facilities to build new rules itself and so engage in a form of 'learning'. This will normally happen in one of two situations. First, the system may come to the end of a consultation and be unable to make any conclusion. In this case the system must ask the user/expert whether it should have been possible to reach any conclusion – and, if so, proceed to attempt to formulate the missing rule. The second case is where the user/expert disagrees with the conclusion provided by the system. This may be a genuine case of expert disagreement or it may be that some rules were not specific enough to deal with some 'special' case situations. In either case the expert system does not learn automatically (as in the case of ANNs) but is involved in expert-mediated learning as a result of evidence of some inadequacy in the current rule-set. For this activity, as well as the more general user interface, the facility of introspection is very valuable.

Introspection simply refers to the ability of an expert system to self-analyse its reasoning process and provide users with a justification as to how it reached certain conclusions or why it 'thinks' that certain propositions are true or false. This is a critical element of the user interface in any expert system as it enables the user to build up trust in the system's performance. In other words, by seeing that reasonable and accurate inferences are being made (when queried), the user comes to feel that the system 'knows' something about the domain of interest. This is one major distinction between a rule-based approach and the essentially black-box approach of ANNs. It is also why such systems are often referred to as 'decision support' tools, rather than decision-making tools. The most successful use of expert systems has often been to augment the reasoning options available to users and/or experts rather than to replace/supplant them. For example, an expert system used in a number of hospitals within the US for diagnosing potential diseases in the light of basic biochemical lab data is used to suggest alternative diagnoses to the physician and in this sense acts more as an expert 'colleague'.


The only part of the expert system schematic we have not discussed is the mechanism to access external data sources. This is a relatively straightforward module within the system which enables reference to databases or other information of relevance to the expert consultation. In many applications, for example patient data in a hospital-based diagnostic system, a range of 'facts' of relevance to the inferencing process is already available in electronic format. Rather than wasting expert time during a consultation to ascertain these facts, it is much more efficient to request them from related databases. This does of course add complexity to the overall system, as the developer is unlikely to have direct control over the data entries made into the related systems.


5.5 Neural networks (ANN)

The basic operation of neural networks (e.g. perceptrons and feed-forward multi-layer networks) and their application to classification and regression tasks has been discussed earlier. The reason that they were considered under that section on 'techniques from data mining' is clear from Figure 5.10. Artificial neural nets operate in a different manner from other approaches in AI – they do not have any explicit mechanism by which to represent knowledge. In as much as knowledge representation can be said to exist within an ANN, it can be seen as being embodied in the structure of the interconnected nodes and their linkage weightings. Some may ask whether this is that far away from certain types of semantic network. On one level they could be seen as having at least a superficial similarity; however, there is one major difference. In the case of ANNs both the details of the structure and the weightings have to be learned by the system, and in contrast to semantic networks there is no explicit attempt to model the expert's (or indeed anyone else's) knowledge of the domain.

The differences between the neural network approach and that of most of the rest of AI can be summarised as follows:

Most AI (ES/KBS/etc.)                    ANN
– symbolic knowledge                     – non/sub-symbolic knowledge
– focus is on reasoning                  – focus is on learning
– mental/abstract: 'modelling the mind'  – physical: 'modelling the brain'

However, despite the fact that the approach of neural networks can be seen as very different to others in AI, and that many would consider it an essentially 'statistical' technique, there are a number of very valid reasons why we should discuss it as part of our section on 'intelligent techniques'. For the record it is worth noting that, long before the debate about what constituted 'data mining' as opposed to traditional statistics, the statistical community had the same argument about ANNs, claiming they were 'merely' well-known statistical techniques. This view is summarised by HMS as follows:

The vast amount of work on neural networks in recent years, which has been carried out by a diverse range of intellectual communities, has led to the rediscovery of many concepts and phenomena already well known and understood in other areas. It has also led to the introduction of unnecessary terminology. (HMS 2001, p 393)

Figure 5.10 Comparing the processing approaches of conventional programming, expert systems (as a general example of most AI approaches) and artificial neural networks (ANN): conventional programming algorithmically transforms data inputs (1, 2, 3, 'a', etc.) into output data; knowledge-based/expert systems logically evaluate facts (e.g. Credit is .F.) against rules via an inference engine to produce a suggested conclusion; an ANN's network structure statistically learns to identify patterns and output a recognised pattern


Anyone interested in following up this debate in more detail is directed to Ripley's (1996) discussion, in a text which also offers comprehensive coverage of a wide range of techniques from statistics, neural networks, pattern recognition and machine learning. The most compelling reason that ANNs should be considered here is that their use and development was initially encouraged almost exclusively, and sometimes against active opposition, by the AI research community (or at least a section of it). There has always been an interdisciplinary grouping of scientists working in AI, including people from the areas of neurology and physiology as well as psychology. So while the psychologists and 'symbolic' workers were moving forward on one front, those involved in ANN research were taking their inspiration from the knowledge contributed by neurologists and physiologists about what went on within the brain – they saw computers as a medium through which the brain (as opposed to the 'mind', or the thinking done by the brain) could be modelled.

In looking at modelling the brain, those working on ANNs also had as their main focus the activity of learning. Once again this places the work squarely within the area of AI, which has had the subject of automated or 'machine learning' as a core goal for over 50 years. One of the reasons that the term ANN is used most often in the AI literature (as opposed to the simple 'neural network' often found in statistics and data mining literature) is precisely to highlight the fact that this is an artificial network that is being referred to – i.e. not an actual 'neural network', which neurologists would remind us is situated in the middle of our skulls and is often referred to as our brain!

So to understand the structures used within an ANN we must first take a (necessarily simplistic) look at the physiology of the human brain. The brain is composed of cells called neurons. There are around 100 billion of these cells in the average human brain. Each neuron consists of a nucleus (the main body of the cell), an axon (a nerve fibre taking signals away from the nucleus) and a number of dendrites (smaller branched fibres bringing input signals to the nucleus). The axon carries nerve impulses (signals) from one neuron and then branches to terminate on the dendrites of neighbouring neurons. At the junction between an axon and a dendrite, i.e. the inter-neuron connection, is the synapse (with the junction sometimes being referred to as the synaptic gap). It is reckoned there are over a trillion (10¹⁵) synapses in the human brain. The passing of 'messages' between the neural connections happens in an electrochemical manner, with the synaptic gap acting as the 'gating' mechanism that determines whether or not signals are passed. As incoming signals are received by the neuron they act together to determine the overall strength of signal. When a certain threshold is exceeded the neuron 'fires', sending a message out along its own axon, as illustrated diagrammatically in Figure 5.11. The synapses can act in an 'excitatory' or 'inhibitory' way, thus determining not only how strong the passed signal will be but also whether it will increase or decrease the likelihood of the neuron 'firing'. The speed with which

Figure 5.11 A simple representation of the accumulation of signals and 'firing' which takes place within the neuron (incoming signals are summed against the threshold of the neuron; when exceeded, a fire signal is sent along the axon to connected synapses)


signals can be passed in an electrochemical manner is not particularly fast – certainly an order of magnitude slower than the pure electrical signals used in computer circuits. What is postulated to give the brain its amazing learning power is not its raw processing speed but the richness of its interconnections. Each neuron will connect to several thousand synapses, with those cells connected to the cerebral cortex having more than 200,000 connections each. This leads to an incredibly complex network with an astronomical number of connections.

While the overall complexity of the human brain is amazing (and far from fully understood), the mechanisms operating within single neurons appear to be reasonably simple. Their structure is as outlined in Figure 5.12, and it would appear that their 'learning' potential is related to the way in which the synapses moderate and pass on incoming signals, together with the internal 'threshold' function of the neuron itself. It was this simple view of the detailed structure of the brain which inspired researchers to create networks of artificial neurons and attempt to train them to exhibit learned behaviour, often with surprising success. (Surprising at least to AI researchers using conventional techniques in, say, the areas of speech or image recognition, where these simple structures out-performed very complex sets of algorithms which attempted to address the problem from a 'symbolic' perspective.) The structure of the elements within an artificial neural network and their counterparts in a 'real' brain are shown schematically in Figure 5.12.

Figure 5.12 The structure of a neural cell and its connections within the human brain (dendrites, synapses, axon, nucleus) and its simple equivalent artificial neuron (inputs X₁…Xₙ are multiplied by weights Wᵢ₁…Wᵢₙ in a summation, I = Σⱼ Xⱼ · Wᵢⱼ, and passed through a transfer function to give the output Yᵢ = f(I))


Taking the example of a single neuron such as the one shown in Figure 5.12, we can see that the total strength of signal within the neuron will be the sum of the inputs times the weights, i.e. S = Σᵢ (inputᵢ × weightᵢ).

If this value, S, is greater than the threshold, T, set for activation of the neuron, then it will 'fire' and pass a message to the neurons connected to its output. Taking a specific example, let's assume that the weighting matrix W is {0.4, 0.6, 0.1, 0.7} and that the threshold T is 0.85. Given the set of inputs {1, 0, 1, 1}, the value of S will be 1.2, i.e. (1 × 0.4 + 1 × 0.1 + 1 × 0.7). This is greater than T and therefore the neuron will fire. If this were to result in an incorrect result in the ultimate output of the network, then it may be that the weights would be adjusted in such a way as to ensure that the neuron did not fire for this set of inputs, e.g. {0.2, 0.6, 0.09, 0.55}. (In practice it is more complicated than this because the whole set of neurons would have to be adjusted as a group, which typically results in a much more subtle alteration of the set of weights at each iteration of the learning algorithm.)

A point worth noting at this juncture relates to the implementation of ANNs. As these artificial networks are meant to be based on the structure of the brain, the most obvious way to create them would appear to be one which produced a number of electronic 'neurons' and then wired them together. This approach has indeed been used and has certain benefits, particularly in terms of speed of calculation and learning, where the inherent parallelism of the network structure can be put to its best use. However, the more commonly used approach to creating ANNs may come as something of a disappointment after all the talk of 'building an artificial brain'. In most ANN applications the network does not actually exist in the physical form of neurons and connections. Instead, software is used to simulate the existence and operation of such a network, and this runs on traditional computer architectures, including desktop personal computers. While this may be a little disappointing from an 'artificial brain' perspective, it is of course important for individuals or companies who wish to use the technique but would be reluctant to invest in a whole new set of hardware tools. In addition, implementing the network as software means that it is much easier to change its initial structure and experiment with various configurations. For some complex networks the initial ANN has been created and its performance checked using a 'soft' implementation before constructing an 'actual' ANN in bespoke hardware to gain the benefits of speed and parallelism.
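Such a 'soft' simulation of a single threshold neuron is a few lines of code. The sketch below uses exactly the weights and threshold quoted in the worked example above; the function name is my own.

```python
# A single threshold neuron: fire if the weighted sum of inputs exceeds T.
def fires(inputs, weights, threshold):
    s = sum(x * w for x, w in zip(inputs, weights))  # S = sum(input_i * weight_i)
    return s > threshold

inputs = [1, 0, 1, 1]
print(fires(inputs, [0.4, 0.6, 0.1, 0.7], 0.85))    # True  (S = 1.2 > 0.85)
print(fires(inputs, [0.2, 0.6, 0.09, 0.55], 0.85))  # False (S = 0.84 < 0.85)
```

Learning then amounts to adjusting the weight values, across the whole network at once, so that wrongly firing (or wrongly silent) neurons move towards the correct behaviour.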

Figure 5.13 The architecture of an artificial neural network for a hypothetical predictor of domestic property loan risk (input layer: bathrooms, rooms, price, area (m²); a hidden layer; output layer: low loan risk (accept) or high loan risk (reject))


The single neuron, as shown in Figure 5.12, does not of course exist on its own; it is part of a network, as shown in Figure 5.13. The actual arrangement of the network (its architecture) will depend on the task being carried out and the learning algorithm involved. We have already noted the differences between perceptrons, which may vary from simple single-layer options for binary classification (Chapter 4.2.1) to the more general multi-layer perceptrons for regression analysis (Chapter 4.2.2), while in the case of Kohonen/SOFM networks a two-layer architecture is always used. The tree in Figure 5.14 attempts to summarise some of the major learning algorithms which are used in ANN applications, many of which are associated with particular network architectures.

The application of ANNs has been particularly successful in the areas of pattern recognition (classification) and prediction through functional synthesis. In many practical cases there is not one single function which relates the inputs to the output but a range of functions which relate subsets of the inputs to the output – to achieve an effective model these various functions must be synthesised in some way, a task which is relatively complex in many mathematical formalisms but well handled by the ANN approach. In pattern recognition the most obvious examples relate to image or speech recognition. There have also been notable successes in the area of handwriting recognition, both for simple signature verification (which is in fact merely a specialist example of image recognition) and for intelligent character recognition (ICR – the successor to OCR). Some applications of ICR use a combination of AI techniques in their overall solutions, starting with ANNs to recognise letters and other handwritten marks and then utilising techniques from computational linguistics (NLP) to carry out 'higher level' processing such as syntactic, morphologic and semantic analysis of phrases, sentences and indeed full texts. However, recognition is also applied in a more abstract sense by ANN applications which, for example, 'recognise' a good potential mortgage customer or a fraudulent credit card transaction.

The strength of ANNs in functional synthesis has led to successful applications in areas as disparate as weather prediction, engine management systems and currency trading. Essentially, anywhere that a range of input variables is functionally linked in a complex manner to produce some output (the likelihood of rain tomorrow, the best engine modification to minimise fuel consumption, etc.) may be a candidate problem for an artificial neural network.

While ANNs are mostly applied to supervised learning tasks, we have noted the use of Kohonen or SOFM networks for tackling unsupervised tasks in the area of clustering. In addition to providing clustering/segmentation in a large database, these approaches can also be used to detect outliers (unusual values) and aid in the process of feature extraction.

[Figure 5.14 shows a tree diagram grouping neural network learning algorithms: artificial neural networks divide into feedback and feedforward types, and further into linear vs. non-linear, constructed vs. trained, and supervised vs. unsupervised groupings. Named examples include the perceptron, linear associator, backprop, neocognitron, Kohonen/SOFM, counterpropagation, adaptive resonance, and the Hopfield CAM/BAM/TSP networks.]

Figure 5.14 Some of the possible learning algorithms used within neural networks, organised into broad groupings

Page 169: Revie Business Intelligence

techniques from ‘intelligent’ systems and artificial intelligence

169

5.6 Other techniques from AI

Rule-based systems, ANNs and Bayesian networks are the most commonly used AI-related approaches in BI applications. However, a number of additional options exist. We first consider two approaches which, like rule-based systems, are derived from the way we interpret the manner in which human reasoning is carried out. The first, model-based reasoning, assumes that human knowledge cannot be condensed into a range of relatively simple if-then statements (i.e. the 'heuristic' approach of rule-based systems) but that reasoning must be driven by more complex models of the world. The second, case-based reasoning, emulates an alternative human problem-solving strategy: that of searching through a collection of previous problem-solving situations to try to infer the best solution to current circumstances. We then introduce two approaches which, like ANNs, are based on biological analogues: genetic algorithms and autonomic computing.

An excellent discussion of the main symbolic approaches, namely rule-, case- and model-based reasoning, is given in Chapter 6 of Luger and Stubblefield (1998). In particular, in section 6.5, 'The knowledge-representation problem', they compare and contrast these three main approaches and outline the key strengths and weaknesses of each. They also consider the option of using hybrid designs in knowledge-representation schemes to capitalise on the strengths of each approach, though this mixing is not without its own set of problems.

Model-based reasoning

The sort of reasoning embodied in the rule-based system approach is sometimes referred to as 'weak' (or 'shallow'). There is an implied criticism in this label, but it is also fairly descriptive of the type of knowledge being represented, which is based on rule-of-thumb heuristics about a domain. These systems are 'weak' in the sense that as soon as the expert system is asked about something for which the specific rule-set provides no answer it is stumped. This trait is also referred to as 'brittleness', in the sense that the system totally breaks down when the 'edges' of its capacity are reached. In contrast, there are those who propose a model-based approach, attempting to create 'strong' reasoning systems based on 'deep' knowledge. By this they mean that the expert system will contain elements which actually simulate (model) important structures and functions in the domain of interest. Take, for example, the area of integrated circuit design. In a rule-based system we might have a rule which states that:

If   line1 carries signal
And  line2 carries signal
And  gap-between(line1, line2) < 5 microns
And  substrate = gallium-oxide
Then advise 'Gap between signal-carrying lines is too small'.

The problem with this rule is that while it can provide useful feedback to the circuit design engineer, it can do so only within a very specific context. The rule cannot help us if the substrate we are using is something other than gallium-oxide, and it may switch advice between a 4.99 and a 5.01 micron gap design. It does this based on its 'weak' knowledge that 5 microns acts as some sort of magic cut-off. On the other hand, a model-based approach to the circuit design problem would attempt to create a system that simulated the way in which such circuits operate. This simulation would incorporate deep knowledge such as the processes associated with the flow of electrons in different types of conductive substrate. The model would 'understand' some basic laws of physics and electronics. In this way it could provide a more comprehensive set of answers to a range of problems surrounding the IC design activity.
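The brittleness of the rule can be made concrete by transcribing it as code. This is only a sketch: the function name and signature are my own invention, not part of any real design-checking system.

```python
# The 'weak' heuristic rule above, written as a function.
# gap_advice is an illustrative name, not from any real expert-system shell.
def gap_advice(line1_signal, line2_signal, gap_microns, substrate):
    if (line1_signal and line2_signal
            and gap_microns < 5
            and substrate == "gallium-oxide"):
        return "Gap between signal-carrying lines is too small"
    return None  # outside the rule's scope the system is simply silent

print(gap_advice(True, True, 4.99, "gallium-oxide"))  # warns
print(gap_advice(True, True, 5.01, "gallium-oxide"))  # silent: magic cut-off
print(gap_advice(True, True, 4.99, "silicon"))        # silent: brittleness
```

The 0.02 micron difference between the first two calls flips the advice completely, and a substrate the rule-writer never anticipated produces no advice at all – exactly the 'edges of capacity' problem described above.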


Similar examples can easily be imagined in the medical domain. Here a weak system would simply link an observed symptom (such as a headache) with some possible diagnosis. A strong model-based system, on the other hand, would reason from some deeper knowledge base which modelled the fact that the disease was caused by the presence of some infective agents, that these agents were likely to cause inflammation of the cell linings and the potential for intracranial pressure build-up, and thus the headaches.

We do not really have space to develop our discussion of the model-based approach here. In some ways it represents a philosophy rather than a specific technique. A good, if slightly dated, introduction to the approach can be found in Fulton and Pepe (1990). There is also a sense in which it represents an unattainable set of objectives. As the old adage goes, 'all models are wrong, but some are useful'. Thus any model of a domain must remain simply that: it is an abstraction of the system and therefore will at some level be incorrect. However, I am in danger of slipping into another philosophical aside, so I will stop at that – except to note that the growth in the use of Bayesian networks and other 'causal' approaches can in some ways be seen as precisely the move from shallow to deeper knowledge-based systems that model-based proponents have always advocated.

Case-based reasoning (CBR)

The basic idea behind case-based reasoning is fairly straightforward and appears to mirror the way that humans reason in a range of different problem situations. Rather than attempting to apply any sort of heuristic rules or underlying models, we often try to think of similar situations which we have faced in the past and use the action taken on those occasions to inform our behaviour now. Thus the case-based reasoning system starts with a set of cases which represent solutions to specific past problem instances. When reasoning about a new problem, the case-based approach searches through this set of cases to identify solutions to similar problems and then attempts to modify these to address the current situation. Since the core of this technique lies in determining which problems from the existing case-set are 'closest', the approach is analogous to the 'nearest neighbour' classification technique introduced in Chapter 4. The main difference is that with case-based reasoning we are attempting to measure 'semantic distance' between cases rather than the Euclidean distance between two n-dimensional vectors (though if the problem is to be solved by a computer then at some point some mathematical formulation of the various cases will need to be created for comparison).

If we consider the medical domain then we can see that a variety of reasoning approaches are used by human experts. There are certain types of diagnosis which can utilise heuristic rules, for example the use of diagnostic signs or the results from clinical tests to infer disease cause. There are other types of diagnostic reasoning which might use knowledge of anatomy or the relationships between cell biology and viruses to infer disease cause. However, in addition to this rule- and model-based reasoning, most junior doctors learn a great deal of their diagnostic skill by making ward visits and studying patient histories – in other words, by looking at cases. The CBR approach has been used within the medical domain in systems such as CASEY and PROTOS (these two systems and a range of other CBR issues are discussed in Aamodt and Plaza (1994)).


The basic stages in the case-based reasoning process are as follows:
• define/describe a new problem
• retrieve 'matching' cases from the case set/database (all cases are indexed on inclusion in the database according to their significant factors)
• modify the 'best' retrieved case(s) to meet the circumstances of the current problem
• apply the preferred transformed case to the new problem (and observe the outcome)
• add the new case with its result (negative or positive) to the case database for future use.
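The stages above can be sketched in a few lines of code. The case base, features and similarity measure below are entirely invented for illustration, and the 'modify' stage is reduced to reusing the retrieved solution unchanged, which a real CBR system would not do.

```python
# A minimal retrieve-apply-retain sketch of the CBR cycle.
# Cases and features are hypothetical; 'semantic distance' is crudely
# approximated by counting shared features.
case_base = [
    {"features": {"fever", "headache", "rash"}, "solution": "test for measles"},
    {"features": {"fever", "cough"},            "solution": "test for influenza"},
]

def retrieve(problem_features):
    # pick the stored case with the largest feature overlap
    return max(case_base, key=lambda c: len(c["features"] & problem_features))

def solve(problem_features):
    best = retrieve(problem_features)
    solution = best["solution"]  # adaptation step omitted in this sketch
    # retain: add the new case (with its outcome) for future use
    case_base.append({"features": problem_features, "solution": solution})
    return solution

print(solve({"fever", "headache"}))
```

Even this toy version shows why indexing and similarity measures dominate CBR research: everything hinges on `retrieve` picking a genuinely relevant precedent.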

One of the key problems in this whole process is the issue of 'matching'. In a reasonably complex domain it is unlikely that a new case will have an exact match in the case-base, nor would looking for exact matches necessarily always be the best option, as 'near' matches may also provide useful knowledge. One of the things which humans do very effectively – arguably one of the key distinguishing characteristics of human intelligence – is to note that two or more situations are actually very similar despite their apparent differences. For example, in writing these notes and delivering them in a distance-mediated way I am involved in higher education within the area of BI. When I give a lecture to 500 first-year students on Introductory CIT in a large auditorium, or when I meet with 10 of my postgraduates for a tutorial in my own office, I am also involved in higher education. We see the similarities within these situations because we understand something of the context of higher education; however, an AI program would have great difficulty in spotting the 'missing link' if all we did was describe these three situations.

Thinking point: similarity and difference
The corollary of the above is also true – i.e. humans are reasonably good at identifying what are very different situations despite their apparent similarities, but again computers are not. Can you think of a couple of examples to illustrate this in the human context?

Due to the problem of (exact) 'matching', the search conditions are often relaxed to ensure that a reasonable number of candidate cases are found – but some of these may be 'high noise' options. To select between these candidates a number of preference heuristics have been developed. This is where the approach differs from nearest neighbour search, in which some mathematical form would be used to measure absolute distance, while here we are interested in 'semantic distance'. Kolodner (1993) suggests a number of preference heuristics (in some ways these are analogous to the meta-knowledge stored in meta-rules within a production system) to aid in the storage and retrieval of cases, including:
• salient-feature preference: look for cases that match the most important features, or have the largest number of such matching features
• specificity preference: prefer cases with just the same number of matching features before looking at more general (super-set) cases
• goal-directed preference: use the ultimate goal description of the case as part of the index framework and prefer cases which have matching goals
• recency preference: check first the most recently used cases
• frequency preference: prefer the most frequently matched cases
• ease-of-adaptation preference: choose first those cases which will be most easily adapted to the current situation.
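To show how such preference heuristics might combine in practice, here is an illustrative scoring scheme mixing the salient-feature and recency preferences. The weights, case descriptions and field names are my own assumptions, not from Kolodner.

```python
# Hypothetical candidate cases retrieved by a relaxed search.
# 'matching' lists the features each case shares with the new problem;
# 'last_used' is how many retrievals ago the case was last applied.
candidates = [
    {"id": "A", "matching": {"goal", "budget"},    "last_used": 3},
    {"id": "B", "matching": {"goal"},              "last_used": 1},
    {"id": "C", "matching": {"budget", "deadline"}, "last_used": 9},
]

salient = {"goal"}  # features deemed most important for this problem

def preference_score(case):
    salient_hits = len(case["matching"] & salient)  # salient-feature preference
    recency = 1.0 / case["last_used"]               # recency preference
    # weight salient matches double; all weights are arbitrary choices
    return 2.0 * salient_hits + len(case["matching"]) + recency

best = max(candidates, key=preference_score)
print(best["id"])
```

Here case A wins: it matches the salient `goal` feature and has two matches overall, outweighing B's greater recency. Changing the weights changes the ranking, which is precisely why these heuristics are a matter of domain judgement rather than fixed formulae.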

We do not have space here to develop this approach in any more detail, but the interested reader is directed to Ralph Bergmann's excellent website, 'Introduction to case-based reasoning', at http://www.cbr-web.org/CBR-Web/cbrintro/, while Janet Kolodner's (1993) book provides more in-depth coverage. In addition to the medical domain, mentioned above, another major area of application for CBR has been the legal domain (case-law and the role of 'precedent' making this an obvious area). CBR has also been applied in the area of design, where recalling previous cases and learning from or building on them is important. This has ranged from the reuse of programming code within software design, to architecture and structural building design problems, up to large engineering design projects such as bridges or motorways.

Genetic algorithms

Another area which has developed within AI and shows clear parallels with artificial neural networks is that of genetic programming or algorithms.

Genetic programming is a systematic method for getting computers to automatically solve a problem… (it) starts from a high-level statement of what needs to be done and automatically creates a computer program to solve the problem.5

As in the case of ANNs, this approach appears to be 'biologically' inspired and therefore gains an immediate sense of 'cool' and unquestioned credibility ('it must be good, it works just like nature'). Unfortunately this tends to lead to lazy thinking about what is actually happening. Perhaps I am being a little over-critical, but from comments I have heard it would seem that some people take a view something like the following: 'It's a bit like evolution. I don't really understand all the details, but natural selection and "survival of the fittest" seem to be pretty effective, and that's what genetic algorithms can bring to my data mining tasks – so they must be good!'

5 Image and quote both from: www.genetic-programming.com/ – used with permission.

Figure 5.15 A conception of genetic programming


Such attitudes are not discouraged by certain websites which promote the approach, as can be seen in the following quote:

One of the central challenges of computer science is to get a computer to do what needs to be done, without telling it how to do it. Genetic Programming addresses this challenge by providing a method for automatically creating a working computer program from a high-level statement of the problem. Genetic programming achieves the goal of automatic programming by genetically breeding a population of computer programs using the principles of Darwinian natural selection and biologically inspired operations. The operations include reproduction, crossover (sexual recombination), mutation and architecture-altering operations patterned after gene duplication and gene deletion in nature. (www.genetic-programming.com/gpanimatedtutorial)

There is much truth in this statement, but perhaps more telling is what goes unsaid. How are such programs set up, and what exactly is meant by a 'high-level statement' of the problem? We find that these are in fact similar to the functional-dependency statements you might make in any mathematical model, together with such things as fitness (scoring) measures and parameters to control the various iterations of the search and termination conditions. Seen in this way, genetic programming looks more like a 'normal' set of mathematical constructs which just happen to use a novel approach to generating alternative options and selecting between them. (I am not saying that genetic algorithms do not have value, and would point the interested reader to the website from which the above quotes are taken as it is an excellent source of more detailed information. However, by over-selling the approach I feel that its exponents may be laying themselves open to the same 'hype and disappointment' cycle that has plagued other areas of AI.)

As David Barber points out, genetic algorithms are simply an approach to optimisation. They happen to deal with the particularly difficult area of discrete parameter space – i.e. the solutions depend on 'stepping' from one set of parameters to another rather than assuming that each parameter lies at some point on a continuous line. The process of evolution can be seen in this light, where the optimisation function is survival and the parameters which give the best 'solution' are the discrete sequences of genes selected. Barber notes that genetic algorithms in general adopt a rather simple approach to discrete optimisation and should not be assumed to provide the best option:

Why should optimisation mechanisms based on sexual (natural) selection have anyrelevance to finding the best construction of a bridge? My point is that you shouldn’tbe swayed into applying ‘sexy’ methods to areas which are not obviously related.(http://anc.ed.ac.uk/~dbarber/lfd1/lfd1_2003_prelim.pdf p7)
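Barber's framing of genetic algorithms as plain discrete optimisation can be illustrated with the classic 'one-max' toy problem: maximise the number of 1s in a bit string. All the parameters below (population size, tournament size, mutation rate, and so on) are arbitrary choices for the sketch, not recommendations.

```python
import random

random.seed(0)
LENGTH, POP, GENS, MUT = 20, 30, 60, 0.1

def fitness(bits):
    return sum(bits)  # 'one-max' fitness: the all-1s string is optimal

def tournament(pop):
    # selection pressure: best of three randomly drawn individuals
    return max(random.sample(pop, 3), key=fitness)

# random initial population of bit strings
pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for _ in range(GENS):
    nxt = []
    while len(nxt) < POP:
        a, b = tournament(pop), tournament(pop)
        cut = random.randrange(1, LENGTH)   # single-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < MUT:           # occasional mutation
            i = random.randrange(LENGTH)
            child[i] = 1 - child[i]
        nxt.append(child)
    pop = nxt

best = max(pop, key=fitness)
print(fitness(best))
```

Nothing here is mysterious or specifically 'biological': it is a stochastic search over a discrete space with a fitness measure, selection scheme and termination condition – exactly the 'normal set of mathematical constructs' described above.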

Autonomic computing

Related to, and in some ways even more ambitious than, genetic programming or artificial neural networks is the recently developing area of 'autonomic computing'. As with the two earlier technologies, the name is inspired by a biological system, in this case the human 'autonomic nervous system' (that part of the nervous system that controls key functions, such as respiration or heart rate, without conscious awareness). Thus the goal of autonomic computing is to create self-managing systems that operate with a minimum of human involvement.


A key assumption driving developments in this area can be seen to relate to the mathematician Alfred Whitehead's comment early last century:

Civilization advances by extending the number of important operations which we can perform without thinking about them. (Alfred North Whitehead, 1911, Introduction to Mathematics)

In many ways autonomic computing is more of a vision (an 'agenda', as we might say in the political context) than a technology. However, it is mentioned here as it will utilise ANNs and genetic algorithms, as well as other AI-inspired approaches, in attempting to achieve its goals. It is a far-reaching and, some might feel, far-fetched programme, but it is not without support in the mainstream of computing. We do not have space to discuss it further here, but close with a quote from the IBM Research website, which indicates just how seriously at least one major IT player is taking the area:

This quote (i.e. by Whitehead) holds both the lock and the key to the next era of computing. It implies a threshold moment surpassed only after humans have been able to automate increasingly complex tasks in order to achieve forward momentum. IBM believes that we are at just such a threshold right now in computing. The millions of businesses, billions of humans that compose them, and trillions of devices that they will depend upon all require the services of the I/T industry to keep them running. And it's not just a matter of numbers. It's the complexity of these systems and the way they work together that is creating a shortage of skilled I/T workers to manage all of the systems. It's a problem that's not going away, but will grow exponentially, just as our dependence on technology has. In short, we can't keep computing as we have for years (…we need a new model). This new model of computing is called autonomic computing… (http://www.research.ibm.com/autonomic/)


5.7 Conclusion

We have seen a range of techniques which have grown up within the field of artificial intelligence and which have immediate relevance to the implementation of BI systems. We have also noted a few more (thus far) speculative approaches which may have a major impact on future BI solutions.


appendices


Contents

Feedback to exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181


Feedback to exercises

Chapter 1

Exercise 1.1
Clearly not. While you had only a 5% chance that any one difference discovered was due purely to chance, this was clearly no longer true once you had carried out 3 or 10 such tests. Indeed, as an increased number of tests is carried out, it would be more surprising if you did not find a 'significant' difference in at least one of the variables. The assumptions of random chance are made by a number of the algorithms used in BI, and if we systematically operate against (or in ignorance of) these assumptions we may 'discover' surprising but ultimately invalid knowledge.
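The arithmetic behind this point is worth making explicit: if each test carries a 5% false-positive rate, the chance of at least one spurious 'significant' result over n independent tests is 1 − (1 − 0.05)^n, which grows quickly.

```python
# Probability of at least one spurious 'significant' result when
# carrying out n independent tests at the 5% level.
alpha = 0.05
for n_tests in (1, 3, 10, 20):
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    print(n_tests, round(p_at_least_one, 3))
```

With 10 tests the chance of a false 'discovery' is already about 40%, which is why finding no 'significant' difference anywhere would itself be surprising.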

Exercise 1.2
Correlation is not causation!
(a) There are a number of possible explanations for the phenomena observed in the 'Dogs of the Dow' article (and you may have thought of more than the three listed here). These would include: (i) exceptional performances can simply be a matter of chance, and things tend to average out over the longer term; (ii) very poor performances result in management action, including dramatic restructuring, which may in turn affect performance; (iii) portfolios of shares are designed to smooth out the effect of extreme fluctuations (up or down) in a small number of share prices.
(b) It is always bad practice to assume a causal link between two variables just because they happen to correlate. In the absence of a rational causal mechanism it is more likely that the correlation is coincidental (it is quite likely that correlation coefficients will be statistically significant for large data samples) or that there is some common underlying cause – in the case of our office rental and fuel tax example, a strongly performing economy or simply inflation.

Chapter 3

Exercise 3.1
The major advantages should be relatively clear. They include:
• comprehension: making complex data quickly and easily accessible to the lay reader
• economy: showing a large amount of data in a small 'space'
• aesthetics: getting the message over in a pleasing manner.

The main disadvantages are less often acknowledged, but they may include:
• hiding information: for example, showing an apparent trend in time without any note of the variability in the data (which may in fact 'swamp' any apparent time-based change)
• reinforcing wrong assumptions: for example, the rule that 'correlation does not imply causation' is just as true of a pretty-looking graphic of a relationship as it is of the dry correlation coefficient!
• lying with data: for one of the most fascinating insights into the whole range of possibilities for misrepresenting the truth using graphical displays, see Chapter 2 of Tufte (1983).


Exercise 3.2
There are many patterns (most of which are fairly intuitive) in the Cars_by_Country data as represented in Figure 3.6 (though for the ones related to country of origin the colour version of the graph on the website is required). Some would include:
(i) engine size, horsepower and weight are all strongly correlated
(ii) all three of the above appear to be negatively correlated with fuel consumption (MPG)
(iii) there is some evidence of similar relationships for the 0-60MPH acceleration figures (but it is not as clear cut)
(iv) cars with a US origin tend to have larger engines, are heavier and have more horsepower than those from Europe or Japan
(v) not surprisingly, the US-based cars also tend to have lower MPG figures
(vi) over time the MPG figures have gone up and the average weight and engine sizes have come down (in all countries, but particularly the USA – again the web version of the graph is required for this, as it includes the variable Year), etc.

Exercise 3.3
You may disagree – it is one of the facts of life when dealing with graphics that some will appeal to certain people more than others – but personally I find it difficult to say much that is positive about FGW'02 Fig1-2.gif. Certainly it tells me nothing that I could not already see (and in most cases see much more clearly) in the scatter plots. I can just about pick up the increase in fuel consumption performance in the late European data, as well as a reduction in weight in the last American data – apart from that it mostly looks like a lot of 'noise'! This particular graph is also wasteful in the sense that 'type' and colour appear to be providing the same distinguishing functions. You could view the use of colour as redundant in the graph as it stands.

Exercise 3.4
This is actually quite tricky! The data to show the beer-diapers association rule is based on analysis at a number of levels. First you have to select out those who actually bought either beer or diapers at the weekend ('coverage'), then you have to screen this data to notice that most of the interesting candidate cases are males. Finally you have to look at the buying habits of this sub-group (i.e. both products compared to all item-sets: the 'accuracy' of the rule). In fact a graphical display may not be that helpful in this context – except perhaps to use pie-charts to show the relative proportions for coverage and accuracy.

Chapter 5

Exercise 5.1
Well, we could use inductive reasoning to state a number of options, the most obvious of which would be:
(a) children who are 7 or older attend school; or
(b) children who are 3 or younger do not attend school.


References

Chapter 1
Altman, E. I., Marco, G. and Varetto, V. (1993) Corporate Distress Diagnosis: comparisons using linear discriminant analysis and neural networks. Working Paper Series, New York University Salomon Centre, NY.
Anand, S. and Buchner, A. (1998) Financial Times Management Briefings: Decision Support Using Data Mining, London, Financial Times Prentice Hall.
Berson, A., Thearling, K. and Smith, S. J. (1999) Building Data Mining Applications for CRM, New York, McGraw-Hill.
Bhandari, I., Colet, E., Parker, J., Pines, Z., Pratap, R. and Ramanujam, K. (1997) 'Advanced Scout: data mining and knowledge discovery in NBA data', Data Mining and Knowledge Discovery, 1(1): pp121–125.
Cios, K., Pedrycz, W. and Swiniarski, R. (1998) Data Mining Methods for Knowledge Discovery, Boston, MA, Kluwer Academic Publishers.
Diekmann, A. and Gutjahr, S. (1998) 'Prediction of the Euro-Dollar future using neural networks – a case study for financial time series prediction', in Xu, et al. (eds), Intelligent Data Engineering and Learning, Singapore, Springer.
Fayyad, U. M., Djorgovski, S. G. and Weir, N. (1996) 'Automating the analysis and cataloguing of sky surveys', in Fayyad, et al. (eds), Advances in Knowledge Discovery and Data Mining, Menlo Park, CA, AAAI Press.
Fayyad, U., Grinstein, G. G. and Wierse, A. (eds) (2002) Information Visualisation in Data Mining, Palo Alto, CA, Morgan Kaufmann.
Hill, T., Marquez, M., O'Conner, O. and Remus, W. (1994) 'Artificial neural network models for forecasting and decision making', International Journal of Forecasting, 10: pp5–15.
Maimon, O. and Last, M. (2001) Knowledge Discovery and Data Mining: The Info-fuzzy Network (IFN) Methodology, Boston, MA, Kluwer.
Nakhaeizadeh, G., Steurer, E. and Bartlmae, K. (2002) 'Banking and finance', in Klosgen and Zytkow (eds), Handbook of Data Mining and Knowledge Discovery, Oxford, OUP.
Piatetsky-Shapiro, G. and Frawley, W. J. (eds) (1991) Knowledge Discovery in Databases, Menlo Park, CA, AAAI/MIT Press.
Poddig, T. (1995) 'Bankruptcy prediction: a comparison with discriminant analysis', in Refenes, A. P. (ed), Neural Networks in the Capital Markets, Chichester, Wiley.
von Hasseln, H. and Nakhaeizadeh, G. (1997) 'Dependency analysis and learning structures for data mining', in Zimmermann (ed), Proceedings of the 5th European Congress on Intelligent Techniques and Soft Computing, Aachen, Germany, Verlag Mainz, Wissenschaftsverlag.

Chapter 2
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A. (1998) Discovering Data Mining: From Concept to Implementation, Englewood Cliffs, NJ, Prentice Hall.
Chen, Z. (2001) Data Mining and Uncertain Reasoning, New York, John Wiley and Sons.
Kim, W., Choi, B., Hong, E., Kim, S. and Lee, D. (2003) 'A taxonomy of dirty data', Data Mining and Knowledge Discovery, 7(1): pp81–99.
Little, R. J. A. and Rubin, D. B. (1987) Statistical Analysis with Missing Data, New York, Wiley.
Liu, H. and Motoda, H. (1998) Feature Selection for Knowledge Discovery and Data Mining, New York, Kluwer Academic Publishers.


Michalski, R. S., Bratko, I. and Kubat, M. (1998) Machine Learning and Data Mining, Chichester, John Wiley.
Simpson, E. H. (1951) 'The interpretation of interaction in contingency tables', Journal of the Royal Statistical Society, Series B, 13: pp238–241.

Chapter 3
Barber, D. (2003) 'Learning from data: data visualisation'. Notes from a Masters course run by Edinburgh University School of Informatics. Available at: http://anc.ed.ac.uk/~dbarber/lfd1/lfd1_2003_visual.pdf
Cleveland, W. (1993) Visualizing Data, Summit, NJ, Hobart Press.
Fayyad, U., Grinstein, G. G. and Wierse, A. (2002) Information Visualisation in Data Mining, San Francisco, CA, Morgan Kaufmann.
Spence, R., Tweedie, L., Dawkes, H. and Su, H. (1995) 'Visualisation for functional design', IEEE Visualization '95 – Information Visualization Symposium, pp4–10, Los Alamitos, CA, IEEE Computer Society Press.
Tufte, E. (1983) The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press.
Tufte, E. (1990) Envisioning Information, Cheshire, CT, Graphics Press.
Tufte, E. (1997) Visual Explanations: Images and Quantities, Evidence and Narrative, Cheshire, CT, Graphics Press.
Tukey, J. W. (1977) Exploratory Data Analysis, Reading, MA, Addison-Wesley.
van Wijk, J. J. and van Liere, R. (1993) 'HyperSlice', in Nielson, G. M. and Bergeron, R. D. (eds), IEEE Visualization '93, pp119–125, Los Alamitos, CA, IEEE Computer Society Press.
Wong, P. and Bergeron, R. D. (1997) '30 years of multidimensional multivariate visualisation', in Scientific Visualization: Overviews, Methodologies and Techniques, pp3–33, Los Alamitos, CA, IEEE Computer Society Press.

Chapter 4
Fisher, R. A. (1937) The Design of Experiments, Edinburgh, Oliver and Boyd.
Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, Cambridge, MA, MIT Press.
McKendrick, I. J., Gettinby, G., Gu, Y., Reid, S. W. J. and Revie, C. W. (2000) 'Using a Bayesian belief network to aid differential diagnosis of tropical bovine diseases', Preventive Veterinary Medicine, 47: pp141–156.
Smyth, P., Ide, K. and Ghil, M. (1999) 'Multiple regimes in northern hemisphere height fields via mixture model clustering', Journal of the Atmospheric Sciences, 56: pp3704–3723.
Vapnik, V. (1998) Statistical Learning Theory, Chichester, Wiley.
Wedel, M. and Kamakura, W. (1998) Market Segmentation: Conceptual and Methodological Foundations, Boston, MA, Kluwer.

Chapter 5
Aamodt, A. and Plaza, E. (1994) 'Case-based reasoning: foundational issues, methodological variations and system approaches', available at: http://www.iiia.csic.es/People/enric/AICom.html
Buchanan, B. G. and Shortliffe, E. H. (1984) Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Reading, MA, Addison-Wesley.
Chan, L. M. (1994) Cataloging and Classification: An Introduction, 2nd ed., New York, McGraw-Hill.


Dempster, A. P. (1968) 'A generalisation of Bayesian inference', Journal of the Royal Statistical Society, Series B, 30: pp1–38.
Doyle, J. (1979) 'A truth maintenance system', Artificial Intelligence, 12(3): pp231–272.
Eddy, D. M. (1982) 'Probabilistic reasoning in clinical medicine: problems and opportunities', in Kahneman, Slovic and Tversky (eds), Judgment under Uncertainty: Heuristics and Biases, Cambridge, Cambridge University Press.
Fillmore, C. J. (1968) 'The case for case', in Bach and Harms (eds), Universals of Linguistic Theory, New York, Holt, Rinehart and Winston.
Fulton, S. L. and Pepe, C. O. (1990) 'An introduction to model-based reasoning', AI Expert, 5(1): pp48–55.
Gigerenzer, G. (1991) 'On cognitive illusions and rationality', Poznan Studies in the Philosophy of the Sciences and the Humanities, 21: pp225–249.
Harmon, P. and King, D. (1985) Expert Systems: Artificial Intelligence in Business, New York, Wiley.
Kolodner, J. L. (1993) Case-Based Reasoning, San Francisco, CA, Morgan Kaufmann.
Lauritzen, S. L. and Spiegelhalter, D. J. (1988) 'Local computations with probabilities on graphical structures and their application to expert systems', Journal of the Royal Statistical Society, Series B, 50(2): pp157–224.
Levesque, H. J. and Brachman, R. J. (1985) 'A fundamental tradeoff in knowledge representation and reasoning', in Brachman and Levesque (eds), Readings in Knowledge Representation, Los Altos, CA, Morgan Kaufmann.
Luger, G. F. and Stubblefield, W. A. (1998) Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 3rd ed., Harlow, Addison-Wesley.
Mylopoulos, J. and Levesque, H. J. (1984) 'An overview of knowledge representation', in Brodie, Mylopoulos and Schmidt (eds), On Conceptual Modeling, New York, Springer-Verlag.
Patil, R., Szolovits, P. and Schwartz, W. (1981) 'Causal understanding of patient illness in medical diagnosis', Proceedings of the International Joint Conference on Artificial Intelligence, Palo Alto, CA, Morgan Kaufmann.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge, Cambridge University Press.
Schank, R. C. and Rieger, C. J. (1974) 'Inference and the computer understanding of natural language', Artificial Intelligence, 5(4): pp373–412.
Simmons, R. F. (1973) 'Semantic networks: their computation and use for understanding English sentences', in Schank and Colby (eds), Computer Models of Thought and Language, San Francisco, Freeman.
Sowa, J. F. (1984) Conceptual Structures: Information Processing in Mind and Machine, Reading, MA, Addison-Wesley.
von Winterfeldt, D. and Edwards, W. (1986) Decision Analysis and Behavioral Research, Cambridge, Cambridge University Press.
Weiss, S. M., Kulikowski, C., Amarel, S. and Safir, A. (1977) 'A model-based method for computer-aided medical decision-making', Artificial Intelligence, 11(1-2): pp145–172.
Zadeh, L. A. (1983) 'Commonsense knowledge representation based on fuzzy logic', Computer, 16: pp61–65.
