
TDWI CHECKLIST REPORT 2018

Digital Transformation Using Machine Learning

By Fern Halper

Sponsored by MicroStrategy

Page 2: TDWI Checklist Report: Digital Transformation Using ...

tdwi.org1 TDWI RESEARCH

TABLE OF CONTENTS

FOREWORD

NUMBER ONE: ENSURE TRUSTED DATA

NUMBER TWO: SUPPORT FLEXIBILITY IN TOOLS AND METHODS

NUMBER THREE: DEPLOY MODELS FOR ACTION

ABOUT OUR SPONSOR

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

ABOUT TDWI CHECKLIST REPORTS

DECEMBER 2018

© 2018 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to [email protected].

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies. Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management. Sponsorship of a publication should not be construed as an endorsement of the sponsor organization or validation of its claims.

555 S. Renton Village Place, Ste. 700, Renton, WA 98057-3295
T 425.277.9126  F 425.687.2842  E [email protected]
tdwi.org


FOREWORD

WE ARE NOW IN A NEW WAVE OF DIGITAL TRANSFORMATION, AND THIS ONE CENTERS ON DATA AND AI

Digital transformation changes how businesses operate, often by embracing new technologies and competencies. Whereas past waves were about the Internet (e.g., the Internet/e-commerce boom of the 1990s/2000s) or mobile (e.g., the mobile tsunami of the 2010s), this wave is about transforming organizations, processes, and products with data and analytics. Now, the vision is that organizations have insight at their fingertips, with data and analytics infusing applications, processes, and customer touch points. In this wave, organizations become more intelligent by utilizing and deploying analytics; this powers personalized recommendations, predictions, and services and delivers the best experiences for customers.

AI/machine learning is at the heart of this latest wave (see Figure 1). The term AI is commonly used by vendors for solutions that include machine learning (ML) or natural language processing (NLP). TDWI defines AI as the theory and practice of building computer systems able to perform tasks that normally require human intelligence, such as visual perception or decision making. Machine learning is a subset of AI in which systems learn from examples with minimal human intervention.

Machine learning technology is driving digital transformation in three important ways:

• ML IS TRANSFORMING HOW INSIGHTS ARE BUILT. Machine learning is becoming more infused in intelligent tooling across the analytics life cycle, from data collection and preparation to model building. In other words, ML is being used to help automate parts of the analytics process, whether automatically building a predictive model or automatically deriving other descriptive insights from data. For instance, some tools automatically build predictive models from data and outcomes of interest. New tools provide natural language interfaces to ask questions and may be powered by an enterprise semantic graph. Other tools highlight insights in data automatically, opening up BI and analytics tools to more users. Easier-to-use tools reduce time to insight and have the potential to transform the way people across the organization glean insights from data.

• ML IS TRANSFORMING ANALYTICS CONSUMPTION. Models built using machine learning techniques are also transformative because they are often embedded into business systems and processes, streamlining processes and improving operations. Popular examples include models that make recommendations to customers on the fly based on where they are, what they are doing, and who they are, surfaced through zero-click applications such as pop-ups and chat and voice assistants. Other examples include predictive maintenance that alerts operations teams when repairs are needed. Machine learning is used in fraud and cybersecurity operations, and some utility companies are looking at self-healing grids. Many of these systems learn as they gather more data. At TDWI, we see that the top two areas organizations are transforming with AI/ML technologies are operations and customer engagement, but transformation reaches across organizations in IT, sales, and even HR. This transformation affects how analytics is consumed, becoming much more proactive and in context, two important attributes for driving value.


• ML IS TRANSFORMING PRODUCTS. Machine learning is also being utilized in products; popular examples include Alexa, Google Home, chatbots, and Siri. Businesses are also embedding AI/ML in the products they sell and the services they provide, making those products more intelligent. For instance, automobile manufacturers are embedding ML algorithms into cars to learn about driver preferences. New (and old) companies are providing products that assess patient health, alert farmers when their crops need attention, or adjust environmental conditions on factory floors. Across industries, machine learning is transforming products as we know them and driving revenue for organizations deploying intelligent apps.

FIGURE 1. Three critical areas that are part of an enterprise's digital transformation. With AI/ML at the center, the figure shows digital transformation spanning: transforming insights (intelligent data processing, automated model building, natural language interfaces); transforming business processes (embedded ML models, proactive alerting, proactive intervention); and transforming products (talking machines, virtual assistants, situational awareness).

This TDWI Checklist Report discusses three broad areas of best practices for helping to make machine learning, and those involved with it, successful.

OVER HALF OF SURVEY RESPONDENTS SAY DATA QUALITY IS A PROBLEM FOR ADVANCED ANALYTICS SUCH AS MACHINE LEARNING.


NUMBER ONE: ENSURE TRUSTED DATA

THE NEED FOR CLEAN DATA

Noisy and dirty data is a common problem in analytics, and machine learning is no exception. For instance, in a recent TDWI study, more than half (52%) of respondents cited data quality as a problem for advanced analytics such as machine learning.1 In another study,2 42% of respondents were not satisfied with how their data management platforms handle data quality. TDWI routinely sees poor data quality at the top of the list of challenges for advanced analytics. Contrary to what some believe, data scientists do like to work with clean data.

Clean data is necessary to build a reliable model. The old adage "garbage in, garbage out" applies here. If the data used to train a model is not complete, consistent, correct, and timely, then the model itself may not accurately predict an outcome of interest when new data flows through it. For example, a recommendation engine built using dirty data may make suboptimal recommendations to a customer. Although some data scientists might clean up the data and then iterate and tune the model again, this can lead to an even less accurate model than one built with no data cleansing at all.3 A bad machine learning model put into production can undermine a company's digital transformation because the organization won't trust the results or future models.

Some intelligent platforms on the market automate parts of the data cleansing process. This includes automating specific rule-based data cleansing flows as well as using machine learning to detect problems with data. Some tools go beyond detection to automatically remediate data quality issues based on ML models or encoded business rules. Common examples include using ML to detect and correct missing data, duplicate data, and data outliers.

As with any technology, automating data cleansing using machine learning has pros and cons. As the number of data sources (and amount of data in each source) increases, it will be harder to manually explore and cleanse all data sources needed for analytics. The good news is that automated processes can be used repeatedly. On the flip side, machines make mistakes, so users should not assume that the algorithms are 100% accurate. Typically, an enterprise needs to ensure a balance between manual and automated processes, which often depends on the use case.
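The kind of automated detection described above can be approximated with open source Python libraries. The following is a minimal sketch, not a description of any vendor's product; it assumes pandas and scikit-learn are installed, and the file, table, and column names (customers.csv, age, annual_income) are hypothetical.

```python
# Minimal sketch: flag (rather than silently fix) common data quality issues
# before model training. The DataFrame and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import IsolationForest

def profile_quality(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Return the frame with indicator columns for missing values,
    duplicate rows, and statistical outliers in the numeric columns."""
    report = df.copy()

    # Missing data: any null in the row
    report["has_missing"] = df.isna().any(axis=1)

    # Duplicate data: exact duplicate rows
    report["is_duplicate"] = df.duplicated(keep="first")

    # Outliers: an unsupervised model scores unusual numeric rows
    clean_numeric = df[numeric_cols].dropna()
    iso = IsolationForest(contamination=0.01, random_state=42)
    labels = pd.Series(iso.fit_predict(clean_numeric), index=clean_numeric.index)
    report["is_outlier"] = labels.reindex(df.index) == -1

    return report

# Hypothetical usage with a customer table
customer_df = pd.read_csv("customers.csv")
quality = profile_quality(customer_df, ["age", "annual_income"])
print(quality[["has_missing", "is_duplicate", "is_outlier"]].mean())
```

In practice, flagged rows would typically be routed to a data steward rather than corrected automatically, reflecting the balance between manual and automated processes discussed above.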

TRUSTED AND GOVERNED DATA IS KEY FOR MACHINE LEARNING

As mentioned, trusted data is key to analytics success. Trusted data is data that is accurate, reliable, timely, and reasonable. Enabling trust in data involves putting policies and processes in place to make sure the data can be relied upon, both for model building and for models put into production. Although some data scientists like to explore raw data, that data still needs to be trusted in order to build trusted models. Even so, about a third of organizations in TDWI surveys don't govern their data at all.4
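As a lightweight illustration of what such policies can look like in practice, the sketch below encodes a few trust checks (completeness, timeliness, and value reasonableness) that a data set might have to pass before it is used for model building or scoring. It is a generic sketch assuming pandas; the column names, freshness window, and thresholds are hypothetical.

```python
# Minimal sketch: a trust gate applied before data is used for model
# building or production scoring. Column names and thresholds are hypothetical.
import pandas as pd

REQUIRED_COLUMNS = {"customer_id", "annual_income", "updated_at"}
MAX_AGE_DAYS = 7          # timeliness: data must be recent
MAX_MISSING_RATE = 0.02   # completeness: at most 2% nulls per column

def is_trusted(df: pd.DataFrame) -> bool:
    """Return True only if the data meets the agreed trust policies."""
    if not REQUIRED_COLUMNS.issubset(df.columns):
        print("FAIL: required columns are missing")
        return False

    age_days = (pd.Timestamp.now() - pd.to_datetime(df["updated_at"]).max()).days
    checks = {
        "fresh_enough": age_days <= MAX_AGE_DAYS,
        "complete_enough": df.isna().mean().max() <= MAX_MISSING_RATE,
        "reasonable_values": (df["annual_income"] >= 0).all(),
    }
    for name, passed in checks.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())
```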


1 TDWI Best Practices Report: Practical Predictive Analytics, online at tdwi.org/bpreports.

2 TDWI Best Practices Report: What it Takes to Be Data-Driven: Technologies and Best Practices for Becoming a Smarter Organization, online at tdwi.org/bpreports.

3 For instance, see Krishnan, S., Franklin, M., Goldberg, K., Wang, J., and Wu, E., "ActiveClean: An Interactive Data Cleaning Framework for Modern Machine Learning," SIGMOD '16, June 26-July 1, 2016, San Francisco, CA. https://www.ocf.berkeley.edu/~sanjayk/wp-content/uploads/2016/03/demo-final.pdf

4 TDWI Best Practices Report: What it Takes to Be Data-Driven: Technologies and Best Practices for Becoming a Smarter Organization, online at tdwi.org/bpreports.


Key principles to enable trust include:

• METADATA. Metadata, which can be part of the enterprise semantic graph, provides information about the data: who created it, when it was created, how it is structured, where it originated, and how it has been used. This information can help different users better understand the data available to them and access that data faster. Oftentimes this metadata is stored in a data catalog. In a recent TDWI survey, 48% of respondents stated that they either currently use or plan to use a data catalog or glossary development technologies to enable them to view or access diverse data sources for BI and analytics. Modern catalogs use ML to help parse and deduce credible data from a data set. They also offer crowdsourcing features where users can rate and certify data. This helps build trust in data that will be used for machine learning.

• DATA LINEAGE. Data can move through many systems before it is processed or analyzed. Data lineage is metadata that describes where data originated and how it has been transformed, consumed, and shared. This is obviously important for building trust; just think how crucial it is to know whether data has been altered in any way before it is used as input to a machine learning model. Data lineage is also a feature of some modern catalogs. Some data catalog providers offer visual interfaces to view data lineage, and these can be annotated. Some data catalogs can automatically discover and suggest missing lineage between data sets.

• CERTIFIED DATA. As data continues to grow in size and scope, organizations have more data to use to build models. Some modern data catalogs enable data stewards (or others involved in data quality) to "certify" a data set as trusted, typically with an icon next to the data set's name in the catalog. This way, users can see which data sets those governing the data consider trustworthy. (A minimal sketch of such a catalog record follows this list.)

Of course, governance processes (including data profiling and quality) need to be in place to deal with any future data that will flow through a machine learning model. This might include federated analytics from external sources and third parties and might even include cloud providers. Yet, in a recent TDWI study, only 12% of respondents governed cloud-based data.6 Clearly, there is room for improvement here.
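To make the catalog concepts above concrete, here is a minimal sketch of the kind of record a data catalog might keep for a data set, covering descriptive metadata, lineage, and certification. It is a generic illustration under assumed field names, not the schema of any particular catalog product.

```python
# Minimal sketch of a catalog entry combining metadata, lineage, and
# certification. Field names are hypothetical, not any vendor's schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class LineageStep:
    source: str          # upstream data set or system
    transformation: str  # how the data was changed on the way in

@dataclass
class CatalogEntry:
    name: str
    owner: str                       # who created or stewards the data
    created: date
    schema: dict[str, str]           # column name -> type
    lineage: list[LineageStep] = field(default_factory=list)
    certified: bool = False          # set by a data steward
    certified_by: str | None = None

    def certify(self, steward: str) -> None:
        """Mark the data set as trusted for model building."""
        self.certified = True
        self.certified_by = steward

# Hypothetical usage: register and certify a customer table
entry = CatalogEntry(
    name="customer_master",
    owner="data_engineering",
    created=date(2018, 10, 1),
    schema={"customer_id": "string", "annual_income": "float"},
    lineage=[LineageStep("crm_export", "deduplicated and standardized")],
)
entry.certify(steward="jane.steward")
```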

5 TDWI Best Practices Report: Accelerating the Path to Value with Business Intelligence and Analytics, online at tdwi.org/bpreports.

6 Internal 2018 TDWI study of cloud data warehouses

HALF OF SURVEY RESPONDENTS SAY THEY EITHER USE OR PLAN TO USE A DATA CATALOG OR GLOSSARY DEVELOPMENT TECHNOLOGIES TO VIEW OR ACCESS DIVERSE DATA SOURCES.


NUMBER TWO: SUPPORT FLEXIBILITY IN TOOLS AND METHODS

SKILLS, SATISFACTION, AND SCIENTISTS

Data scientists come in different shapes and sizes. They are developers, statisticians, and mathematicians. Some come from other quantitative disciplines. Still others have learned skills on the job or have recently been trained in data science at a university.

Two things are clear when it comes to data scientists. First, these skills are hard to come by and hard to keep. Second, they are needed for digital transformation to succeed. That means these people need to be happy in their jobs, which involves providing challenging work as well as enabling data scientists to use the tools they are comfortable with.

PROVIDE THE TOOLS DATA SCIENTISTS WANT

To attract and retain talent, organizations are making sure data scientists have the tools they need. This includes open source tools such as R and Python, which data scientists are comfortable using because, more often than not, they are what they were trained on. Universities frequently teach these same open source languages, so an increasing number of people know how to program in open source analytics tools. The philosophy here is that data scientists should "bring the tools they use on the platforms they trust." This includes using notebook environments such as Jupyter notebooks (open source web applications that allow users to create and share documents containing live code).

At TDWI, we see that R and Python are among the top technologies that organizations use or plan to use to meet their analytics goals.

In many cases, vendors are opening up their platforms to these tools via products with an open architecture. For example, most analytics vendors now support R in data science workbenches or as part of a full analytics life cycle product. The same is true for Python. Typically, if a vendor offers a drag-and-drop visual interface, it will let a user connect to a model developed in R as long as R is installed on the machine.

PROVIDE THE RIGHT TOOLS FOR THE JOB

Data scientists often want to use open source tools; however, different open source options are better suited to some jobs than others. At TDWI, we see many organizations building models in open source tooling and then deploying them into production as part of a commercial platform environment. Of course, many organizations still use commercial tools for machine learning.

Vendors have made an effort to make these tools easy to use, with intuitive graphical interfaces for a range of personas, including data scientists, business analysts, data engineers, and others involved in machine learning. These tools often address the full life cycle of analytics, from data collection and preparation to model deployment and monitoring, and are often more robust than open source tools. The choice of tooling will depend on the organization, its use cases, and its skill sets.

• R OR PYTHON. Developed in the 1990s, R is a widely used open source statistical environment. It includes data handling and storage facilities, a large set of tools for data analysis (including ML and NLP), tools for graphical analysis, and a programming environment. Python was also created in the 1990s, but it is an interpreted, interactive, object-oriented scripting language with ML libraries. Developers appreciate its flexibility and simplicity and often use it to develop web applications. Python tends to be more portable and scalable than R, and many data scientists recommend it over R for production apps. Although Python may present a steeper coding barrier than R, many data scientists know both and choose the right tool for the job.



• OTHER ANALYTICS TOOLS AND PACKAGES. R and Python are not the only open source tools available for analytics; Spark also contains machine learning libraries. Organizations interested in deep learning often make use of TensorFlow and Caffe. Apache OpenNLP is a popular machine learning-based toolkit for processing natural language text. For those programming in Java, Weka (Waikato Environment for Knowledge Analysis) is a widely used suite of machine learning software.

• APIS. As stated, vendors understand that organizations may be using a wide range of tools to develop machine learning models. To support this, they are providing APIs to many of these environments to create an open architecture and enable using these models with their commercial tools.
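As a generic illustration of the API-based integration described in this list, the sketch below wraps a previously trained scikit-learn model in a small HTTP scoring service using Flask. The model file name, route, and feature names are hypothetical, and commercial platforms provide their own deployment mechanisms; this is only one simple pattern under those assumptions.

```python
# Minimal sketch: expose a trained model over HTTP so other tools can call
# it through an API. Assumes Flask, scikit-learn, and joblib are installed;
# "churn_model.joblib" and the feature names are hypothetical.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")   # classifier trained offline in Python
FEATURES = ["tenure_months", "monthly_spend", "support_tickets"]

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    # Order the incoming fields to match the model's training features
    row = [[payload[name] for name in FEATURES]]
    probability = model.predict_proba(row)[0][1]
    return jsonify({"churn_probability": float(probability)})

if __name__ == "__main__":
    app.run(port=5000)
```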

MANY ORGANIZATIONS BUILD MODELS IN OPEN SOURCE TOOLS, THEN DEPLOY THEM AS PART OF A COMMERCIAL PLATFORM ENVIRONMENT.


NUMBER THREE: DEPLOY MODELS FOR ACTION

PUTTING MODELS TO WORK

As illustrated in Figure 1, an important component of digital transformation is putting ML models into production (that is, operationalizing them). What good is an ML model if it isn't used to make a decision or take action? Organizations can put models into production in many ways. They can embed them internally in their own business processes and systems. They can also embed the models into external apps, devices, and other products. In this way, the organization can actually change how it operates.

ANTICIPATE CHALLENGES

Deploying models can be challenging. For the majority of respondents we survey at TDWI, it typically takes at least a few months to put models into production. For 40%, it can take more than six months. Some models are never deployed, wasting money and assets. Additionally, when we survey those using machine learning, at least 25% of respondents are dissatisfied with the amount of effort needed to put models into production in their organization. Less than half are satisfied; the rest are undecided.

Challenges to think about include:

• MODEL VALIDATION. Any model that will be used for decision making or put into production needs to be validated to ensure it is accurate and reasonable. Organizations building ML models typically have a process in place to do so. For instance, some organizations require data scientist sign-off on a model built by a business analyst before putting it into production. Some organizations have checklists for this purpose while others run surrogate models. Of course, it is also important that someone with knowledge of the business can interpret the model for reasonableness if the data scientist doesn't have this knowledge. This might be a business person involved in operations, for instance. (A minimal validation sketch follows this list.)

• MODEL REWRITE. Many organizations rewrite ML models for use on a particular target platform because the original code can't be integrated into environments with closed architectures. Sometimes a model is rewritten for political reasons because a group feels that "my way is the best way." Rewriting, of course, can delay deployment and introduce errors into the process. Occasionally there is no choice but to rewrite a model if the system demands it. However, models can be deployed in several ways, such as in-database or using APIs. When selecting tools, consider your systems and what they can support.

• NO DESIGNATED GROUP FOR DEPLOYMENT. Oftentimes models take a long time to be put into production because no individual or group has been identified or assigned to the task. Analytically sophisticated organizations typically designate a group, often called DevOps (sometimes DataOps or a similar name), for this purpose. The key is that someone (or a group of people) is assigned responsibility. This person or group may also be responsible for monitoring the model once it is in production (see below).
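The sketch below illustrates one simple form the validation step can take: evaluate the candidate model on held-out data, compare it against a naive baseline, and approve it only if it clears agreed thresholds. It assumes scikit-learn; the thresholds and hold-out data are hypothetical placeholders, and each organization will define its own checklist.

```python
# Minimal sketch of a pre-deployment validation gate. Assumes scikit-learn;
# the thresholds and hold-out data are hypothetical.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.70                  # absolute bar the candidate must clear
MIN_LIFT_OVER_BASELINE = 0.05   # must beat a naive model by this margin

def validate_for_production(candidate, X_holdout, y_holdout) -> bool:
    """Return True only if the candidate model is accurate enough to deploy."""
    candidate_auc = roc_auc_score(y_holdout, candidate.predict_proba(X_holdout)[:, 1])

    # Naive baseline that predicts the class prior for every record
    baseline = DummyClassifier(strategy="prior").fit(X_holdout, y_holdout)
    baseline_auc = roc_auc_score(y_holdout, baseline.predict_proba(X_holdout)[:, 1])

    passed = (candidate_auc >= MIN_AUC and
              candidate_auc - baseline_auc >= MIN_LIFT_OVER_BASELINE)
    print(f"candidate AUC={candidate_auc:.3f}, baseline AUC={baseline_auc:.3f}, "
          f"approved={passed}")
    return passed
```

Automated checks like these complement, rather than replace, the business-reasonableness review described above.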

A QUARTER OF ENTERPRISES USING MACHINE LEARNING ARE DISSATISFIED WITH THE AMOUNT OF EFFORT IT TAKES TO PUT MODELS INTO PRODUCTION.


USE BEST PRACTICES FOR MODELS IN PRODUCTION

• REGISTER/VERSION THE MODEL. As organizations build more models, it becomes more important to register them so they can be tracked and versioned. It is one thing to keep track of a model if your organization is managing a few models, but think about what happens once your organization starts to manage dozens of them. Available tools allow organizations to register a model and collect metadata about it, including its age, the developer's name, what data was used to develop the model, and how it is being used. A registry can help keep track of the models, ensuring that users and processes are deploying the most up-to-date model.

• MANAGE MODEL WORKFLOWS. It is important to manage the workflows associated with ML models. As we've discussed, many vendors have drag-and-drop interfaces that let users build out a workflow, from data sourcing and preparation through model building and export. Alternatively, they may allow packaging of a notebook environment. In either case, the workflows should be shareable, reusable, and able to be scheduled. This helps ensure consistency and efficiency.

• MONITOR THE MODEL. Models get stale. Enterprises must monitor a model once it is in production to see if it needs to be retrained. Some models automatically retrain themselves with fresh data, but that is not the norm. Some tools have features allowing a model builder to specify rules to alert the organization that the model is degrading and needs retraining. Other tools go further and automatically detect model degradation based on lift or some other parameter.

• FEATURE ENGINEERING. A hallmark of ML model development is identifying and building the features used in the model. These features may be raw attributes or they may be engineered (such as a ratio or an engagement metric). Once the ML model goes into production, fresh data that is scored must be transformed before it flows through the model; the same features may need to be computed on the fly with new data. Make sure your tooling and processing can handle this. (A minimal sketch of scoring-time feature computation and model monitoring follows this list.)
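As a minimal illustration of the monitoring and scoring-time feature engineering points above, the sketch below recomputes an engineered ratio feature on fresh, labeled scoring data, measures recent accuracy, and flags the model for retraining if it has degraded against its validation baseline. It is a generic sketch, not a description of any vendor's feature; the file names, feature names, and thresholds are hypothetical.

```python
# Minimal sketch: recompute an engineered feature on fresh data, then check
# the production model for degradation. Assumes pandas, scikit-learn, and
# joblib; names and thresholds are hypothetical.
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.78      # recorded when the model was validated
MAX_DEGRADATION = 0.05   # allowed drop before retraining is flagged

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same transformations used at training time."""
    out = df.copy()
    out["spend_per_visit"] = out["monthly_spend"] / out["visits"].clip(lower=1)
    return out[["tenure_months", "spend_per_visit", "support_tickets"]]

def needs_retraining(model, recent: pd.DataFrame) -> bool:
    """Score recent, labeled production data and compare against the baseline."""
    X = engineer_features(recent)
    scores = model.predict_proba(X)[:, 1]
    recent_auc = roc_auc_score(recent["churned"], scores)
    degraded = BASELINE_AUC - recent_auc > MAX_DEGRADATION
    if degraded:
        print(f"ALERT: AUC fell from {BASELINE_AUC:.2f} to {recent_auc:.2f}; retrain.")
    return degraded

# Hypothetical usage against last month's scored records with known outcomes
model = joblib.load("churn_model.joblib")
recent = pd.read_csv("scored_last_30_days.csv")
needs_retraining(model, recent)
```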

GETTING STARTED

Most of the organizations we see at TDWI are either building machine learning models or plan to do so in the near future. This is an important part of their digital transformation.

To get started, it makes sense to identify a real business problem with clear objectives and measurable impact. Make sure you use data that is easily accessible and can be trusted. A bad model can stop future ML efforts dead in their tracks, but success will build on itself.

Trust will be key: trust in the data, trust in the models, and trust by the organization. To ensure trust, models will also need to be registered, monitored, and managed.

Initially, many organizations hire a few data scientists who work with business stakeholders and others to make sure that the models being built provide business value.

ENTERPRISES MUST MONITOR MODELS IN PRODUCTION TO DETERMINE WHICH MUST BE RETRAINED.


ABOUT OUR SPONSOR

MicroStrategy (Nasdaq: MSTR) is a worldwide leader in enterprise analytics and mobility software. A pioneer in the BI and analytics space, MicroStrategy delivers innovative software that empowers people to make better decisions and transform the way they do business. MicroStrategy provides enterprise customers with world-class software and expert services so they can deploy unique intelligence applications.

With MicroStrategy, data scientists can utilize an enterprise platform with an open architecture to create machine learning algorithms in the coding language of their choice and deploy those results at scale, all powered by a single version of the truth. Using MicroStrategy and popular statistical libraries, data scientists can access trusted data to build complex models that benefit the entire enterprise.

To learn more about MicroStrategy’s machine learning capabilities and see them in action, visit https://www.microstrategy.com/us/product/analytics/advanced-analytics.

ABOUT THE AUTHOR

Fern Halper, Ph.D., is vice president and senior director of TDWI Research for advanced analytics. She is well known in the analytics community, having been published hundreds of times on data mining and information technology over the past 20 years. Halper is also coauthor of several Dummies books on cloud computing and big data. She focuses on advanced analytics, including predictive analytics, text and social media analysis, machine learning, AI, cognitive computing, and big data analytics approaches. She has been a partner at industry analyst firm Hurwitz & Associates and a lead data analyst for Bell Labs. Her Ph.D. is from Texas A&M University. You can reach her by email ([email protected]), on Twitter (twitter.com/fhalper), and on LinkedIn (linkedin.com/in/fbhalper).

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for BI professionals worldwide. TDWI Research focuses exclusively on analytics and data management issues and teams up with industry practitioners to deliver both broad and deep understanding of the business and technical issues surrounding the deployment of business intelligence and data management solutions. TDWI Research offers reports, commentary, and inquiry services via a worldwide membership program and provides custom research, benchmarking, and strategic planning services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, analytics, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.