What is Business Analytics


Transcript of What is Business Analytics

What is Business Analytics?

Business Analytics (BA) is not a new phenomenon. It has been around for many years, but predominantly in companies operating in technically oriented environments. Only recently has it made its breakthrough, and we can see more and more companies, especially in the financial and telecom sectors, using business analytics to support business processes and improve performance.

Business Analytics is translating data into the information that business owners need to make informed decisions and investments. It is the difference between running a business on a hunch or intuition versus looking at collected data and predictive analysis. It is a way of organizing and converting data into information to help answer questions about the business. It leads to better decision making by looking for patterns and trends in the data and by being able to forecast the impact of decisions before they are taken.

BA can serve the whole company, and all C-level executives can take advantage of it. For example, Chief Marketing Officers (CMOs) can use BA to get better customer insight and enhance customer loyalty. Chief Financial Officers (CFOs) can better manage financial performance and use financial forecasts. Chief Risk Officers (CROs) can get a holistic view of risk, fraud and compliance information across the organization and take action. Chief Operating Officers (COOs) can get better insight into supply chains and operations and enhance efficiency.

Companies use business analytics to enable data-driven decision making. To be successful, they need to treat their data as a corporate asset and leverage it for competitive advantage. Successful business analytics depends on data quality, skilled analysts who understand the technologies and the business, and organizational commitment to data-driven decision making. Examples of BA uses include:

- Exploring data to find new relationships and patterns (data mining)
- Explaining why a certain result occurred (statistical analysis, quantitative analysis)
- Experimenting to test previous decisions (A/B testing, multivariate testing; see the sketch after this list)
- Forecasting future results (predictive modeling, predictive analytics)
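As a small, hedged illustration of the experimentation item, the sketch below runs a chi-square test on made-up conversion counts for two website variants; the counts and variant names are invented for the example, and a real A/B test would of course need a proper experimental design.

```python
# Hypothetical A/B test: did variant B convert better than variant A?
# All counts below are made up for illustration.
from scipy.stats import chi2_contingency

#                [converted, did not convert]
observed = [[120, 880],   # variant A: 1,000 visitors
            [150, 850]]   # variant B: 1,000 visitors

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p-value={p_value:.3f}")
# A small p-value (e.g. < 0.05) would suggest the difference in conversion
# rates is unlikely to be due to chance alone.
```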

Why Is Business Analytics important?

Becoming an analytics-driven organization helps companies extract insights from their enterprise data and achieve cost reduction, revenue increase and improved competitiveness. This is why business analytics is one of the top priorities for CIOs. An IBM study [4] shows that CFOs in organizations that make extensive use of analytics report growth in revenues of 36 percent or more, a 15 percent greater return on invested capital and twice the rate of growth in EBITDA (earnings before interest, taxes, depreciation and amortization). Business Analytics helps you make better, faster decisions and automate processes. It helps you address your questions and ensures you stay one step ahead of your competition. Some of the basic questions in a retail environment could be:

- How big should a store be?
- What market segments should be targeted?
- How should a certain market segment be targeted in terms of products, styles, price points, store environment and location?
- Who are our customers?
- How should space in the store be allocated to the various product groups and price points for maximum profitability?
- What mitigation strategies are effective and cost efficient, for example changes to packaging, fixtures or placement of product?
- What is the best customer loyalty program for our customers?
- What is the optimal staffing level on the sales floor?
- How many checkouts are optimal in a store?
- Would acquisition of a particular store brand improve profitability?
- Would creation of a new store brand improve profitability?

Business Analytics versus Business Intelligence

Business Analytics and Business Intelligence (BI) are two terms heavily used when people talk about data and what it can do for their company. Some people use these terms interchangeably, some strictly distinguish between them. Some (IBM, SAP) consider business intelligence a subset of BA, others think the opposite. There are many different opinions. For example, the research Defining Business Analytics and Its Impact on Organizational Decision-Making conducted by Computerworld says: "Business intelligence (BI) is an important aspect of an organization's strategic framework. But what is beyond BI? Some indicators point to business analytics, a progression from BI, as the next step. Business analytics is predictive as well as historical, which requires a cultural shift to the acceptance of a proactive, fact-based decision-making environment, providing organizations with new insights and better answers faster." Through this research, we can see that IT and business professionals mainly align business analytics with BI products. In fact, more than half of respondents (54%) cited BI as the category of products that first comes to mind when they think of the term business analytics.

Traditional BI has been associated with providing executive dashboards and reporting to monitor assumptions and key performance metrics, answering the question "how are we doing?". The opinion that BA has a broader character and offers deeper insight is also supported in the book Competing on Analytics: The New Science of Winning, where Thomas Davenport and Jeanne Harris say that Business Analytics can answer questions like why is this happening, what if these trends continue, what will happen (prediction) and what is the best that can happen (optimization). In this paper, I will consider BA an umbrella term trying to answer BI questions like "What happened? When? Who? How many?" as well as questions like "Why did it happen? Will it happen again? What will happen if we change x? What else does the data tell us that we never thought to ask?" that need advanced analytics.

How does Business Analytics work?

A high-level architecture of business analytics starts with data sources representing many business systems such as the point of sale system, accounting system, order processing system, online system, and so on. These systems produce the data we need for analytics. Since the data is often stored in different formats, in different locations, and in many cases cannot be accessed on a real-time basis, we need to set up a single repository that is able to store all this data. This repository is called a data warehouse.

Figure 1: High-level architecture of business analytics

The Extract, Transform, and Load (ETL) job extracts data from one or more sources on a scheduled basis, performs any required data cleansing to transform the data into a consistent format, and loads the cleansed data into the data warehouse or data mart. The data warehouse is a pool of historical data that doesn't participate in the daily operations of the organization. Instead, this data is purposefully used for business analytics. The data in a data mart usually applies to a specific area of the organization. Once we have collected all necessary data, analysis is typically performed. Analytics tools range from spreadsheets with statistical functions to complex data mining and predictive modeling applications. As patterns and relationships in the data are uncovered, new questions are asked and the analytic process iterates until the business goal is met.

DATA WAREHOUSE

Data Warehouse (DWH) systems represent a single source of information for analyzing the development and results of an organization (List and Machaczek 2004). Measures, such as the number of transactions per customer or the increase of sales during a promotion, are used to recognize warning signs and to decide on future investments.

By describing events and statuses of business processes, products and services, goals or organizational units, the data in the data warehouse mirrors the structure and behavior of the organization. In the organization, information about this relationship between the data warehouse data and the business processes, products, etc. is usually available in the form of enterprise models and documents, and is used during the design phase of the data warehouse.

Surprisingly, the knowledge about this relationship is not made available to the data warehouse users or even recorded in a suitable way. Due to the data-oriented nature of Data Warehousing, the knowledge of how the data warehouse measures relate to business processes or products is not easily accessible to data warehouse users. As it is mainly implicit knowledge, it is also more likely to be lost or forgotten.

Business users are accustomed to their own vocabularies and concepts, and data interpretation is greatly improved by knowledge of context. Using and understanding traditional data-oriented Data Warehousing frontends therefore requires additional effort from the users. If knowledge about the business context is left to chance, data analysis is bound to miss important points and becomes more error prone.

We identify a need for describing the relationship between the data in the data warehouse and the organization that surrounds it. We propose an approach that allows us to show:

- Which parts of the data warehouse data are created by which business process or part of it
- How business processes impact the values of the data warehouse data
- Which parts of the data warehouse data measure which (sub)process or products
- Which organizational units and roles are measured along with the processes
- Where the products and deliverables of the processes are mirrored in the data warehouse data

Structure

Indeed, adding context and background information to a data warehouse has been an open question in Data Warehousing for years. The term business metadata is used for data that describes the business context of the core data, its purpose, relevance, and potential use. There is general agreement on the usefulness and desirability of business metadata. But how to create or derive business metadata is still very much an open question.

BACKGROUND

Enterprise Models

An enterprise model formally represents the basic building blocks of an organization, its structure, behavior and goals. It is usually organized into several aspects that can be modeled individually but also related to each other (Whitman & Ramachandran & Ketkar 2001). The Architecture of Integrated Information Systems (ARIS) (Scheer 1999) is a typical example of such an enterprise model. Other similar approaches include CIMOSA (Kosanke & Vernadat 2002) and MEMO (Frank 2001). The figure below shows the outline of a generic enterprise model, organized into five aspects: the enterprise strives to achieve goals, acts through processes, has an organizational structure, produces products and uses software applications. In the enterprise model, an organization chart can be used to describe the organizational structure, i.e. the dependencies between the departments, groups and roles that exist within the organization. Similarly, business process models describe the structure of business processes with control flows, inputs and outputs. The products, applications, and strategic goals can also be modeled separately, as well as connected to the other aspects in a single model. Such an overview model can connect all models to show, for example, how processes fulfill goals, are performed by organizational roles, fall into the responsibility of departments, and use applications to produce products for other departments.

Example of a generic enterprise model

The Architecture of Integrated Information Systems (ARIS) and the Event-Driven Process Chain (EPC)

The Architecture of Integrated Information Systems (ARIS) concept (Scheer 1999) involves dividing complex business process models into separate views in order to reduce complexity. There are three main views focusing on functions, data, and the organization, and an additional view focusing on the integration of the other three.

Figure 3: Multi-dimensional data model of a 5-dimensional cube, i.e. a fact table with 5 dimensions and 4 measures, in UML notation (cf. Luján-Mora et al. 2002). Aggregation levels are shown only for the Time and Policy dimensions.

A multidimensional model, also called star schema or fact schema, is basically a relational data model (as originally introduced by Chen in 1976) in the shape of a star or snowflake (see Figure 3 for an example). At the center of the star there is the fact table. It contains data on the subject of analysis (e.g., policy transactions, sales, repairs, admissions, expenses, etc.). The attributes of the fact table (e.g., amount, duration, cost, revenue, etc.) are called measures. The spokes/points of the star represent the dimensions according to which the data will be analyzed (e.g., by employee and month, or by customer group and covered items). The dimensions can be further broken down into hierarchies that are useful for aggregating data (e.g., day, month, year). Several stars can share their dimensions, thus creating a web of interconnected schemas that makes drill-across operations possible.
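To make the star shape concrete, here is a minimal sketch in Python/SQLite with invented table and column names: a small fact table with two measures is joined to its dimension tables, and the amount measure is rolled up along the time hierarchy (day to month) per customer segment.

```python
# A tiny star-schema sketch (hypothetical tables): one fact table, two dimensions,
# and a roll-up query that aggregates a measure along a dimension hierarchy.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, segment TEXT)")

# Fact table: each row references the dimensions and carries the measures
cur.execute("CREATE TABLE fact_sales (time_id INTEGER, customer_id INTEGER, amount REAL, cost REAL)")

cur.executemany("INSERT INTO dim_time VALUES (?,?,?,?)",
                [(1, "2024-01-05", "2024-01", 2024), (2, "2024-02-10", "2024-02", 2024)])
cur.executemany("INSERT INTO dim_customer VALUES (?,?,?)",
                [(1, "Acme", "Retail"), (2, "Beta", "Wholesale")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 100.0, 60.0), (1, 2, 250.0, 200.0), (2, 1, 80.0, 50.0)])

# Roll up the amount measure from day to month, broken down by customer segment
for row in cur.execute("""
    SELECT t.month, c.segment, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY t.month, c.segment"""):
    print(row)
```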

DERIVING BUSINESS METADATA

In this section, we present a weaving model for connecting enterprise models to data warehouse data, in order to derive business metadata. Business metadata allows the data warehouse users to access the context of the data warehouse data. The weaving model stores information about the relationship between the data warehouse and the structure and behavior of an enterprise organization, e.g., which business process or part of it impacts which part of the data warehouse.

RELATED WORK

There are a lot of conceptual models available for business processes, databases or data warehouses. But there are no models available that focus on the relationship between the data warehouse and the business processes. EPCs (Scheer 1999) incorporate a data view targeting operational databases. EPC functions perform read or write operations on the databases and their entities, but they do not take the specific characteristics of data warehouses into account. The Business Process Modeling Notation (BPMN 2004) provides data objects, which are used and updated during the process. The data object can be used to represent many different types of object, both electronic and physical. An integrated view on Data Warehousing and business processes was introduced by Stefanov, List and Korherr (2005) in terms of a model that allows one to show where and how a DWH is used by business processes, and which parts of the business processes depend on which parts of the DWH. Mazon, Trujillo, Serrano and Piattini (2005) applied the MDA framework to data warehouse repository development, and aligned multidimensional conceptual data models with code. Speaking in MDA terms, they aligned a Platform Independent Model (PIM) with a Platform Specific Model (PSM) and defined a transformation between them. Our approach can be seen as sitting on top of this work, targeting the Computation Independent Model (CIM) level, as we align the enterprise context with the data warehouse conceptual data model.

Data Warehouse Design Process

The overarching guideline in the design of extracts, data entities and metadata is: "Know thy data." Methodologies, models, tools, and other modern conveniences help improve quality and delivery speed but are no substitute for knowledge of and experience with the data. Exhibit 4 is a high-level model of the design process, showing the most important steps in the three main aspects of the data warehouse design process. A more detailed representation might specify inputs, outputs, check-points and many repetitions of each step within each aspect of the process. It might also show the interactions between one aspect and the others.

Exhibit 4: The three aspects of the warehouse design process, and the hardware they all rest on.

- Warehouse Design: understand queries; E-R diagramming; set data types; name tables, views, and elements; design tables/views; plan/design security.
- Extract Design: understand application; choose data source; design normalization; design edits; synchronize with Oracle; code and document; test and verify; plan production.
- Metadata Design: understand business; map user's language; map application language; define context, tables, elements, and values; write "solutions" queries; design delivery to user.
- Hardware: PCs, mainframe, LAN, servers (warehouse, gopher, e-mail).

To meet the integrity goals of the data warehouse concept, the right connections must be made between the three aspects of the design process. For example, the design of data edits in the extract process might suggest that initial column designs in the warehouse should be revised because a mainframe data element is too complex (codes two distinct pieces of information); as a result, the original and the new column in the warehouse must be named appropriately and the element definitions in the metadata must be re-written. All three aspects of the design process rest on an inter-related set of computing platforms. The software and the skill-sets necessary to work on these different machines are not shown in the model but are fundamental to the success of the process. Each of the three design processes is based on a slightly different understanding of the word "data." Throughout the rest of these guidelines the discussion tries to separate and differentiate each of the three aspects as much as possible. Each aspect of the design is itself a fairly complex process with many separate steps. The steps in Exhibit 4 are shown in a logical order that is progressively more detailed and more closely bound to the specific technology being used. To some extent this is also a chronological order. The sequence of steps within each aspect has been used to sequence the topics in these data design guidelines.

Moving data from the mainframe to the warehouse

Exhibit 5 represents the data warehouse from a production perspective, showing the flow of data from a live mainframe application system to the data warehouse and then on to the data user's desktop.

Exhibit 5: Moving data from the mainframe to the desktop via the warehouse.

- Actions: archive, normalize, edit, format, load/verify, query.
- Stages: "live" database, archived file, "normal" data, "edited" data, .CSV file, Oracle tables, user's tables.
- Platforms: the mainframe (through the .CSV stage), the server (Oracle tables), and the client (the user's desktop).

The first list in Exhibit 5 represents the successive steps in the process of moving the data, the second represents the different stages between steps, and the third shows the platform on which each step occurs. The processes and stages are connected as follows.

The "live" data (e.g., on an IDMS database) is continually updated as University business is conducted (through interaction with a screen or through voice-response, for example). Regularly scheduled (daily, weekly, once a semester, etc.) jobs extract some or all of the data from the live database and write archive files (which are usually stored on magnetic tape in standard formats). ETL process

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

- extracting the data from source systems (SAP, ERP, other operational systems); data from different source systems is converted into one consolidated data warehouse format which is ready for transformation processing

- transforming the data, which may involve tasks such as:
  - applying business rules (so-called derivations, e.g., calculating new measures and dimensions)
  - cleaning (e.g., mapping NULL to 0, or "Male" to "M" and "Female" to "F", etc.)
  - filtering (e.g., selecting only certain columns to load)
  - splitting a column into multiple columns and vice versa
  - joining together data from multiple sources (e.g., lookup, merge)
  - transposing rows and columns
  - applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)

- loading the data into a data warehouse, data repository or other reporting applications
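A minimal, hedged sketch of such a pipeline in Python/pandas is shown below; the table names, column names and business rules are all invented for illustration, and in practice the extract step would read from the real source systems (e.g. pd.read_csv or pd.read_sql) rather than from hard-coded frames.

```python
# A tiny, hypothetical ETL sketch: extract, apply a few of the transformations
# listed above, and load the result into a warehouse table.
import sqlite3
import pandas as pd

# Extract: data as it might arrive from two different source systems
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "gender": ["Male", "Female", "Male"],
    "amount": [100.0, None, 250.0],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["North", "South"]})

# Transform: cleaning, derivation, filtering, joining
orders["amount"] = orders["amount"].fillna(0)                          # map NULL to 0
orders["gender"] = orders["gender"].map({"Male": "M", "Female": "F"})  # recode values
orders = orders[orders["amount"] > 0].copy()                           # simple validation/filter
orders["amount_net"] = orders["amount"] * 0.8                          # derived measure (business rule)
fact = orders.merge(customers, on="customer_id", how="left")           # lookup/merge across sources

# Load: write the consolidated result into the warehouse (SQLite stands in for the DWH)
with sqlite3.connect("warehouse.db") as con:
    fact.to_sql("fact_orders", con, if_exists="append", index=False)
```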

OLTP vs. OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database contains detailed and current data, and the schema used to store transactional data is the entity model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used by data mining techniques. An OLAP database contains aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).
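As a rough, hedged contrast (table and column names invented for the example), the sketch below runs an OLTP-style transaction that touches a few current rows, followed by an OLAP-style query that aggregates over historical data.

```python
# OLTP vs. OLAP in miniature, using SQLite and made-up tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("CREATE TABLE sales_history (region TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500.0), (2, 120.0)])
con.executemany("INSERT INTO sales_history VALUES (?, ?, ?)",
                [("North", 2023, 900.0), ("North", 2024, 1100.0), ("South", 2024, 700.0)])

# OLTP: a short transaction updating a few current rows (fast, preserves integrity)
with con:
    con.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    con.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

# OLAP: an aggregating query over historical data (read-mostly, larger scans)
for row in con.execute(
        "SELECT region, year, SUM(amount) FROM sales_history GROUP BY region, year"):
    print(row)
```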

The following table summarizes the major differences between OLTP and OLAP system design.

OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data.
  OLAP: Consolidated data; OLAP data comes from the various OLTP databases.

Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.

What the data reveals
  OLTP: A snapshot of ongoing business processes.
  OLAP: Multi-dimensional views of various kinds of business activities.

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.

Queries
  OLTP: Relatively standardized and simple queries returning relatively few records.
  OLAP: Often complex queries involving aggregations.

Processing speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

Space requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.

Database design
  OLTP: Highly normalized with many tables.
  OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.

Backup and recovery
  OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Data Mining: What is Data Mining?

Overview

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Continuous Innovation

Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Example

For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And, they could make sure beer and diapers were sold at full price on Thursdays.

Data, Information, and Knowledge

Data: Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

- operational or transactional data such as sales, cost, inventory, payroll, and accounting
- nonoperational data, such as industry sales, forecast data, and macroeconomic data
- metadata - data about the data itself, such as logical database design or data dictionary definitions

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

Data Warehouses

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis.
Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments. For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can suggest products to its cardholders based on analysis of their monthly expenditures.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.

The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game. By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks.
Generally, any of four types of relationships are sought:

- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining (a worked support/confidence sketch follows this list).
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
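As a minimal, hedged sketch of associative mining, the snippet below computes the support and confidence of a hypothetical "diapers implies beer" rule over a handful of invented transactions; real association-rule miners (e.g., Apriori-style algorithms) search for all rules above chosen thresholds rather than checking one rule by hand.

```python
# Support and confidence of a "diapers -> beer" rule over made-up transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "beer"} <= t)
diapers = sum(1 for t in transactions if "diapers" in t)

support = both / n            # share of all transactions containing both items
confidence = both / diapers   # share of diaper transactions that also contain beer
print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.40, confidence=0.67
```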

Data mining consists of five major elements:

- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data with application software.
- Present the data in a useful format, such as a graph or table.

Different levels of analysis are available:

- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset (see the sketch after this list). Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
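To ground the decision-tree item, here is a minimal CART-style sketch using scikit-learn on invented records; the feature names and labels are hypothetical, and the point is only to show a tree being trained, its if-then rules read back, and a new record classified.

```python
# A small CART-style decision tree: train, print the induced rules, classify a new record.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [years_renting, age] and whether the customer bought a property
X = [[1, 22], [3, 30], [4, 41], [2, 27], [5, 24], [3, 52], [1, 35], [6, 33]]
y = [0, 1, 1, 0, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # binary (2-way) splits, as in CART
tree.fit(X, y)

print(export_text(tree, feature_names=["years_renting", "age"]))  # the induced if-then rules
print(tree.predict([[3, 28]]))                                    # classify a new, unseen record
```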

What technological infrastructure is required?

Today, data mining applications are available on all size systems for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. NCR has the capacity to deliver applications exceeding 100 terabytes. There are two critical technological drivers:

- Size of the database: the more data being processed and maintained, the more powerful the system required.
- Query complexity: the more complex the queries and the greater the number of queries being processed, the more powerful the system required.

Relational database storage and management technology is adequate for many data mining applications of less than 50 gigabytes. However, this infrastructure needs to be significantly enhanced to support larger applications. Some vendors have added extensive indexing capabilities to improve query performance. Others use new hardware architectures such as Massively Parallel Processors (MPP) to achieve order-of-magnitude improvements in query time. For example, MPP systems from NCR link hundreds of high-speed Pentium processors to achieve performance levels exceeding those of the largest supercomputers.

Examples of data mining applications

Data Mining Techniques

There are four main operations associated with data mining techniques:

- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection

Techniques are specific implementations of the data mining operations. However, each operation has its own strengths and weaknesses. With this in mind, data mining tools sometimes offer a choice of operations to implement a technique.

Predictive Modeling

Predictive modeling is designed on a similar pattern to the human learning experience of using observations to form a model of the important characteristics of some task, so that it corresponds to the 'real world'. It is developed using a supervised learning approach, which has two phases: training and testing (a minimal sketch of these two phases appears after the list below). The training phase is based on a large sample of historical data called a training set, while testing involves trying out the model on new, previously unseen data to determine its accuracy and performance characteristics. Predictive modeling is commonly used in customer retention management, credit approval, cross-selling, and direct marketing. There are two techniques associated with predictive modeling:

- Classification
- Value prediction
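A minimal, hedged sketch of the training and testing phases, using scikit-learn on invented churn records (the feature names and labels are hypothetical): the model is fit on a training set and then scored on held-out, previously unseen data.

```python
# Supervised learning in two phases: train on one part of the history, test on the rest.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical historical records: [years_as_customer, purchases_last_year] -> churned?
X = [[1, 2], [5, 20], [2, 3], [7, 25], [1, 1], [6, 18], [3, 5], [8, 30]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Training phase uses part of the historical data; testing uses held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # accuracy on previously unseen data
```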

Classification

Classification is used to assign each record to one of a finite set of possible class values. There are two specializations of classification: tree induction and neural induction. An example of classification using tree induction is shown in the figure.

In this example, we are interested in predicting whether a customer who is currently renting property is likely to be interested in buying property. A predictive model has determined that only two variables are of interest: the length of time the customer has rented property and the age of the customer. The model predicts that those customers who have rented for more than two years and are over 25 years old are the most likely to be interested in buying property. An example of classification using neural induction is shown in the figure.

A neural network contains collections of connected nodes with input, output, and processing at each node. Between the visible input and output layers may be a number of hidden processing layers. Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship. This approach is an attempt to copy the way the human brain works in recognizing patterns by arithmetically combining all the variables associated with a given data point.

Value Prediction

Value prediction uses the traditional statistical techniques of linear regression and nonlinear regression. These techniques are easy to use and understand. Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot. The problem with linear regression is that the technique only works well with linear data and is sensitive to data values which do not conform to the expected norm. Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot. This is where traditional statistical analysis methods and data mining methods begin to diverge. Applications of value prediction include credit card fraud detection and target mailing list identification.

Database Segmentation

A segment is a group of similar records that share a number of properties. The aim of database segmentation is to partition a database into an unknown number of segments, or clusters. This approach uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles. Applications of database segmentation include customer profiling, direct marketing, and cross-selling.
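As a minimal, hedged illustration of database segmentation, the sketch below clusters a few invented customer records with k-means from scikit-learn; no class labels are supplied, so the segments are discovered in an unsupervised way.

```python
# Database segmentation via k-means on made-up records: [annual_spend, visits_per_month].
from sklearn.cluster import KMeans

customers = [[200, 1], [220, 2], [1500, 8], [1600, 9], [900, 4], [950, 5]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)  # unsupervised: no class labels are given

print(labels)                   # the segment assigned to each customer
print(kmeans.cluster_centers_)  # the profile of each discovered segment
```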

As shown in the figure, using database segmentation we identify the clusters that correspond to legal tender and forgeries. Note that there are two clusters of forgeries, which is attributed to at least two gangs of forgers working on falsifying the banknotes.

Link Analysis

Link analysis aims to establish links, called associations, between individual records, or sets of records, in a database. There are three specializations of link analysis:

- Associations discovery
- Sequential pattern discovery
- Similar time sequence discovery

Associations discovery finds items that imply the presence of other items in the same event. Association rules are used to define these associations. For example, 'when a customer rents property for more than two years and is more than 25 years old, in 40% of cases the customer will buy a property. This association happens in 35% of all customers who rent properties.' Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior. Similar time sequence discovery is used in the discovery of links between two sets of data that are time-dependent. For example, within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines. Applications of link analysis include product affinity analysis, direct marketing, and stock price movement.

Deviation Detection

Deviation detection is a relatively new technique in terms of commercially available data mining tools. However, deviation detection is often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. This operation can be performed using statistics and visualization techniques. Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
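As a minimal, hedged illustration of deviation detection by statistics, the snippet below flags invented transaction amounts that lie more than two standard deviations from the mean; real fraud detection would use richer features and methods.

```python
# Flag records that deviate strongly from the norm using z-scores (made-up amounts).
import statistics

amounts = [52.0, 48.5, 50.2, 49.9, 51.3, 250.0, 47.8]  # one suspicious value

mean = statistics.mean(amounts)
std = statistics.stdev(amounts)

outliers = [(i, a) for i, a in enumerate(amounts) if abs(a - mean) / std > 2]
print(outliers)  # records more than 2 standard deviations from the mean
```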

Data Mining and Data Warehousing

Data mining requires a single, separate, clean, integrated, and self-consistent source of data. A data warehouse is well equipped to provide data for mining for the following reasons:

- Data mining requires quality and consistency of input data, and a data warehouse provides it.
- It is advantageous to mine data from multiple sources to discover as many interrelationships as possible, and a data warehouse contains data from a number of sources.
- The query capabilities of the data warehouse help in selecting the relevant information.

Due to the integration of data mining and data warehousing, many vendors are investigating a number of techniques to support it.

Data Mining Concepts (Analysis Services - Data Mining)

Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data. These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific business scenarios, such as:

- Forecasting sales
- Targeting mailings toward specific customers
- Determining which products are likely to be sold together
- Finding sequences in the order that customers add products to a shopping cart

Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined by using the following six basic steps:

1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models

The following diagram describes the relationships between each step in the process, and the technologies in Microsoft SQL Server that you can use to complete each step.

Although the process illustrated in the diagram is circular, each step does not necessarily lead directly to the next step. Creating a data mining model is a dynamic and iterative process. After you explore the data, you may find that the data is insufficient to create the appropriate mining models, and that you therefore have to look for more data. Alternatively, you may build several models and then realize that the models do not adequately answer the problem you defined, and that you therefore must redefine the problem. You may have to update the models after they have been deployed because more data has become available. Each step in the process might need to be repeated many times in order to create a good model.

SQL Server 2008 provides an integrated environment for creating and working with data mining models, called Business Intelligence Development Studio. This environment includes data mining algorithms and tools that make it easy to build a comprehensive solution for a variety of projects. For more information about using BI Development Studio, see Developing and Implementing Using Business Intelligence Development Studio. After you have created a data mining solution, you can maintain and browse it by using SQL Server Management Studio. For more information, see Managing Data Mining Structures and Models. For an example of how the SQL Server tools can be applied to a business scenario, see the Basic Data Mining Tutorial.

Defining the Problem

The first step in the data mining process, as highlighted in the following diagram, is to clearly define the business problem, and consider ways to provide an answer to the problem.

This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining specific objectives for the data mining project. These tasks translate into questions such as the following:

- What are you looking for? What types of relationships are you trying to find?
- Does the problem you are trying to solve reflect the policies or processes of the business?
- Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
- Which attribute of the dataset do you want to try to predict?
- How are the columns related? If there are multiple tables, how are the tables related?
- How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?

To answer these questions, you might have to conduct a data availability study, to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you might have to redefine the project. You also need to consider the ways in which the results of the model can be incorporated in key performance indicators (KPIs) that are used to measure business progress.

Preparing Data

The second step in the data mining process, as highlighted in the following diagram, is to consolidate and clean the data that was identified in the Defining the Problem step.

Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries. For example, the data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home. Data cleaning is not just about removing bad data, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis. For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, total price, or a discounted price? Incomplete data, wrong data, and inputs that appear separate, but are in fact strongly correlated, can influence the results of the model in ways you do not expect. Therefore, before you start to build mining models, you should identify these problems and determine how you will fix them. Typically, you are working with a very large dataset and cannot look through every transaction. Therefore, you have to use some form of automation, such as in Integration Services, to explore the data and find the inconsistencies. Microsoft Integration Services contains all the tools that you need to complete this step, including transforms to automate data cleaning and consolidation. For more information, see Integration Services in Business Intelligence Development Studio.

It is important to note that the data you use for data mining does not need to be stored in an Online Analytical Processing (OLAP) cube, or even in a relational database, although you can use both of these as data sources. You can conduct data mining using any source of data that has been defined as an Analysis Services data source. These can include text files, Excel workbooks, or data from other external providers. For more information, see Defining Data Sources (Analysis Services).

Exploring Data

The third step in the data mining process, as highlighted in the following diagram, is to explore the prepared data.

You must understand the data in order to make appropriate decisions when you create the mining models. Exploration techniques include calculating the minimum and maximum values, calculating mean and standard deviations, and looking at the distribution of the data. For example, you might determine by reviewing the maximum, minimum, and mean values that the data is not representative of your customers or business processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis for your expectations. Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help you improve the model. Data that strongly deviates from a standard distribution might be skewed, or might represent an accurate picture of a real-life problem, but make it difficult to fit a model to the data.

By exploring the data in light of your own understanding of the business problem, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems or gain a deeper understanding of the behaviors that are typical of your business. Data Source View Designer in BI Development Studio contains several tools that you can use to explore data. For more information, see Designing Data Source Views (Analysis Services) or Exploring Data in a Data Source View (Analysis Services). Also, when you create a model, Analysis Services automatically creates statistical summaries of the data contained in the model, which you can query to use in reports or further analysis. For more information, see Querying Data Mining Models (Analysis Services - Data Mining).

Building Models

The fourth step in the data mining process, as highlighted in the following diagram, is to build the mining model or models. You will use the knowledge that you gained in the Exploring Data step to help define and create the models.

You define which data you want to use by creating a mining structure. The mining structure defines the source of data, but does not contain any data until you process it. When you process the mining structure, Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure. For more information about how mining structures are related to mining models, see Logical Architecture (Analysis Services - Data Mining).

Before the model is processed, a data mining model is just a container that specifies the columns used for input, the attribute that you are predicting, and parameters that tell the algorithm how to process the data. Processing a model is also called training. Training refers to the process of applying a specific mathematical algorithm to the data in the structure in order to extract patterns. The patterns that you find in the training process depend on the selection of training data, the algorithm you chose, and how you have configured the algorithm. SQL Server 2008 contains many different algorithms, each suited to a different type of task, and each creating a different type of model. For a list of the algorithms provided in SQL Server 2008, see Data Mining Algorithms (Analysis Services - Data Mining). You can also use parameters to adjust each algorithm, and you can apply filters to the training data to use just a subset of the data, creating different results. After you pass data through the model, the mining model object contains summaries and patterns that can be queried or used for prediction. You can define a new model by using the Data Mining Wizard in BI Development Studio, or by using the Data Mining Extensions (DMX) language. For more information about how to use the Data Mining Wizard, see Data Mining Wizard (Analysis Services - Data Mining). For more information about how to use DMX, see Data Mining Extensions (DMX) Reference.

It is important to remember that whenever the data changes, you must update both the mining structure and the mining model. When you update a mining structure by reprocessing it, Analysis Services retrieves data from the source, including any new data if the source is dynamically updated, and repopulates the mining structure. If you have models that are based on the structure, you can choose to update the models that are based on the structure, which means they are retrained on the new data, or you can leave the models as is. For more information, see Processing Data Mining Objects.

Exploring and Validating Models

The fifth step in the data mining process, as highlighted in the following diagram, is to explore the mining models that you have built and test their effectiveness.

Before you deploy a model into a production environment, you will want to test how well the model performs. Also, when you build a model, you typically create multiple models with different configurations and test all models to see which yields the best results for your problem and your data. Analysis Services provides tools that help you separate your data into training and testing datasets so that you can accurately assess the performance of all models on the same data. You use the training dataset to build the model, and the testing dataset to test the accuracy of the model by creating prediction queries. In SQL Server 2008 Analysis Services, this partitioning can be done automatically while building the mining model. For more information, see Validating Data Mining Models (Analysis Services - Data Mining).

You can explore the trends and patterns that the algorithms discover by using the viewers in Data Mining Designer in BI Development Studio. For more information, see Viewing a Data Mining Model. You can also test how well the models create predictions by using tools in the designer such as the lift chart and classification matrix. To verify whether the model is specific to your data, or may be used to make inferences on the general population, you can use the statistical technique called cross-validation to automatically create subsets of the data and test the model against each subset. For more information, see Validating Data Mining Models (Analysis Services - Data Mining). If none of the models that you created in the Building Models step perform well, you might have to return to a previous step in the process and redefine the problem or reinvestigate the data in the original dataset.

Deploying and Updating Models

The last step in the data mining process, as highlighted in the following diagram, is to deploy the models that performed the best to a production environment.

After the mining models exist in a production environment, you can perform many tasks, depending on your needs. The following are some of the tasks you can perform:

- Use the models to create predictions, which you can then use to make business decisions. SQL Server provides the DMX language that you can use to create prediction queries, and Prediction Query Builder to help you build the queries. For more information, see Data Mining Extensions (DMX) Reference.
- Create content queries to retrieve statistics, rules, or formulas from the model. For more information, see Querying Data Mining Models (Analysis Services - Data Mining).
- Embed data mining functionality directly into an application. You can include Analysis Management Objects (AMO), which contains a set of objects that your application can use to create, alter, process, and delete mining structures and mining models. Alternatively, you can send XML for Analysis (XMLA) messages directly to an instance of Analysis Services. For more information, see Development (Analysis Services - Data Mining).
- Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables. For example, if a database is continually updated with potential customers, you could use a mining model together with Integration Services to split the incoming data into customers who are likely to purchase a product and customers who are likely to not purchase a product. For more information, see Typical Uses of Integration Services.
- Create a report that lets users directly query against an existing mining model. For more information, see Reporting Services in Business Intelligence Development Studio (SSRS).
- Update the models after review and analysis. Any update requires that you reprocess the models. For more information, see Processing Structures and Models (Analysis Services - Data Mining).
- Update the models dynamically, as more data comes into the organization. Making constant changes to improve the effectiveness of the solution should be part of the deployment strategy. For more information, see Managing Data Mining Structures and Models.
