Data Science for Business PDF Free Ebook Textbook

Category: Education



Presentation Transcript

slide 1:


slide 2:

Foster Provost and Tom Fawcett Data Science for Business

slide 3:

Data Science for Business
by Foster Provost and Tom Fawcett

Copyright © 2013 Foster Provost and Tom Fawcett. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Kiel Van Horn
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Mark Paglietti
Interior Designer: David Futato
Illustrator: Rebecca Demarest

July 2013: First Edition

Revision History for the First Edition:
2013-07-25: First release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. Data Science for Business is a trademark of Foster Provost and Tom Fawcett.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-36132-7
LSI

slide 4:

Dream no small dreams for they have no power to move the hearts of men.
(Johann Wolfgang von Goethe)

CHAPTER 1
Introduction: Data-Analytic Thinking

The past fifteen years have seen extensive investments in business infrastructure, which have improved the ability to collect data throughout the enterprise. Virtually every aspect of business is now open to data collection and often even instrumented for data collection: operations, manufacturing, supply-chain management, customer behavior, marketing campaign performance, workflow procedures, and so on. At the same time, information is now widely available on external events such as market trends, industry news, and competitors’ movements. This broad availability of data has led to increasing interest in methods for extracting useful information and knowledge from data: the realm of data science.

The Ubiquity of Data Opportunities

With vast amounts of data now available, companies in almost every industry are focused on exploiting data for competitive advantage. In the past, firms could employ teams of statisticians, modelers, and analysts to explore datasets manually, but the volume and variety of data have far outstripped the capacity of manual analysis. At the same time, computers have become far more powerful, networking has become ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses than previously possible. The convergence of these phenomena has given rise to the increasingly widespread business application of data science principles and data-mining techniques.

Probably the widest applications of data-mining techniques are in marketing, for tasks such as targeted marketing, online advertising, and recommendations for cross-selling.

slide 5:

Data mining is used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value. The finance industry uses data mining for credit scoring and trading, and in operations via fraud detection and workforce management. Major retailers from Walmart to Amazon apply data mining throughout their businesses, from marketing to supply-chain management. Many firms have differentiated themselves strategically with data science, sometimes to the point of evolving into data mining companies.

The primary goals of this book are to help you view business problems from a data perspective and understand principles of extracting useful knowledge from data. There is a fundamental structure to data-analytic thinking, and basic principles that should be understood. There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear. A data perspective will provide you with structure and principles, and this will give you a framework to systematically analyze such problems. As you get better at data-analytic thinking, you will develop intuition as to how and where to apply creativity and domain knowledge.

Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to data science and data mining. The terms “data science” and “data mining” often are used interchangeably, and the former has taken a life of its own as various individuals and organizations try to capitalize on the current hype surrounding it. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term, “data science” often is applied more broadly than the traditional use of “data mining,” but data mining techniques provide some of the clearest illustrations of the principles of data science.

It is important to understand data science even if you never intend to apply it yourself. Data-analytic thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by extracting knowledge from data, you should be able to assess the proposal systematically and decide whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually succeed (for data mining projects, that often requires trying), but you should be able to spot obvious flaws, unrealistic assumptions, and missing pieces.

Throughout the book we will describe a number of fundamental data science principles, and will illustrate each with at least one data mining technique that embodies the principle. For each principle there are usually many specific techniques that embody it, so in this book we have chosen to emphasize the basic principles in preference to specific techniques. That said, we will not make a big deal about the difference between data

slide 6:

science and data mining, except where it will have a substantial effect on understanding the actual concepts.

Let’s examine two brief case studies of analyzing data to extract predictive patterns.

Example: Hurricane Frances

Consider an example from a New York Times story from 2004:

Hurricane Frances was on its way, barreling across the Caribbean, threatening a direct hit on Florida’s Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons … predictive technology.

A week ahead of the storm’s landfall, Linda M. Dillman, Wal-Mart’s chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes’ worth of shopper history that is stored in Wal-Mart’s data warehouse, she felt that the company could ‘start predicting what’s going to happen, instead of waiting for it to happen,’ as she put it. (Hays, 2004)

Consider why data-driven prediction might be useful in this scenario. It might be useful to predict that people in the path of the hurricane would buy more bottled water. Maybe, but this point seems a bit obvious, and why would we need data science to discover it? It might be useful to project the amount of increase in sales due to the hurricane, to ensure that local Wal-Marts are properly stocked. Perhaps mining the data could reveal that a particular DVD sold out in the hurricane’s path, but maybe it sold out that week at Wal-Marts across the country, not just where the hurricane landing was imminent. The prediction could be somewhat useful, but is probably more general than Ms. Dillman was intending.

It would be more valuable to discover patterns due to the hurricane that were not obvious. To do this, analysts might examine the huge volume of Wal-Mart data from prior, similar situations (such as Hurricane Charley) to identify unusual local demand for products. From such patterns, the company might be able to anticipate unusual demand for products and rush stock to the stores ahead of the hurricane’s landfall.

Indeed, that is what happened. The New York Times (Hays, 2004) reported that: “… the experts mined the data and found that the stores would indeed need certain products, and not just the usual flashlights. ‘We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane,’ Ms. Dillman said in a recent interview. ‘And the pre-hurricane top-selling item was beer.’” [1]

1. Of course! What goes better with strawberry Pop-Tarts than a nice cold beer?
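The idea of "unusual local demand" in the passage above can be sketched in a few lines. This is only an illustration under stated assumptions, not Wal-Mart's actual method: the function name `unusual_local_demand`, the product list, and every sales figure are hypothetical. The sketch compares sales growth in the storm's path with growth elsewhere, so that a product that sold out everywhere (the DVD case in the text) is not mistaken for a hurricane effect.

```python
def unusual_local_demand(local, elsewhere):
    """Score each product by local sales growth relative to growth elsewhere.

    Both arguments map product -> (normal_week_units, storm_week_units).
    A score near 1.0 means the local change simply mirrors the change
    elsewhere; a large score suggests demand specific to the storm's path.
    """
    scores = {}
    for product, (local_normal, local_storm) in local.items():
        else_normal, else_storm = elsewhere[product]
        local_growth = local_storm / local_normal
        elsewhere_growth = else_storm / else_normal
        scores[product] = local_growth / elsewhere_growth
    # Largest score first, so the most unusual products lead the list.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical weekly unit sales: (normal week, week before landfall).
local = {
    "strawberry Pop-Tarts": (100, 700),
    "new-release DVD": (200, 600),
    "bottled water": (500, 1500),
}
elsewhere = {
    "strawberry Pop-Tarts": (1000, 1000),
    "new-release DVD": (2000, 6000),
    "bottled water": (5000, 5000),
}

scores = unusual_local_demand(local, elsewhere)
for product, score in scores.items():
    print(f"{product}: {score:.1f}x the change seen elsewhere")
```

With these made-up numbers the DVD scores 1.0, because its sales tripled everywhere, while the Pop-Tarts stand out as a purely local effect.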

slide 7:

Example: Predicting Customer Churn

How are such data analyses performed? Consider a second, more typical business scenario and how it might be treated from a data perspective. This problem will serve as a running example that will illuminate many of the issues raised in this book and provide a common frame of reference.

Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States. They are having a major problem with customer retention in their wireless business. In the mid-Atlantic region, 20% of cell phone customers leave when their contracts expire, and it is getting increasingly difficult to acquire new customers. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Communications companies are now engaged in battles to attract each other’s customers while retaining their own. Customers switching from one company to another is called churn, and it is expensive all around: one company must spend on incentives to attract a customer while another company loses revenue when the customer departs.

You have been called in to help understand the problem and to devise a solution. Attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to prevent churn. Marketing has already designed a special retention offer. Your task is to devise a precise, step-by-step plan for how the data science team should use MegaTelCo’s vast data resources to decide which customers should be offered the special retention deal prior to the expiration of their contracts.

Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially. We will return to this problem repeatedly through the book, adding sophistication to our solution as we develop an understanding of the fundamental data science concepts.

In reality, customer retention has been a major use of data mining technologies, especially in telecommunications and finance businesses. These more generally were some of the earliest and widest adopters of data mining technologies, for reasons discussed later.

Data Science, Engineering, and Data-Driven Decision Making

Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data. In this book, we will view the ultimate goal

slide 8:

of data science as improving decision making, as this generally is of direct interest to business.

Figure 1-1. Data science in the context of various data-related processes in the organization.

Figure 1-1 places data science in the context of various other closely related and data-related processes in the organization. It distinguishes data science from other aspects of data processing that are gaining increasing attention in business. Let’s start at the top.

Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition. For example, a marketer could select advertisements based purely on her long experience in the field and her eye for what will work. Or, she could base her selection on the analysis of data regarding how consumers react to different ads. She could also use a combination of these approaches. DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or lesser degrees.

The benefits of data-driven decision-making have been demonstrated conclusively. Economist Erik Brynjolfsson and his colleagues from MIT and Penn’s Wharton School conducted a study of how DDD affects firm performance (Brynjolfsson, Hitt, & Kim, 2011). They developed a measure of DDD that rates firms as to how strongly they use

slide 9:

If you can’t explain it simply, you don’t understand it well enough.
(Albert Einstein)

CHAPTER 14
Conclusion

The practice of data science can best be described as a combination of analytical engineering and exploration. The business presents a problem we would like to solve. Rarely is the business problem directly one of our basic data mining tasks. We decompose the problem into subtasks that we think we can solve, usually starting with existing tools. For some of these tasks we may not know how well we can solve them, so we have to mine the data and conduct evaluation to see. If that does not succeed, we may need to try something completely different. In the process we may discover knowledge that will help us to solve the problem we had set out to solve, or we may discover something unexpected that leads us to other important successes.

Neither the analytical engineering nor the exploration should be omitted when considering the application of data science methods to solve a business problem. Omitting the engineering aspect usually makes it much less likely that the results of mining data will actually solve the business problem. Omitting the understanding of process as one of exploration and discovery often keeps an organization from putting the right management, incentives, and investments in place for the project to succeed.

The Fundamental Concepts of Data Science

Both the analytical engineering and the exploration and discovery are made more systematic, and thereby more likely to succeed, by the understanding and embracing of the fundamental concepts of data science. In this book we have introduced a collection of the most important fundamental concepts. Some of these concepts we made into headliners for the chapters, and others were introduced more naturally through the discussions

slide 10:

(and not necessarily labeled as fundamental concepts). These concepts span the process from envisioning how data science can improve business decisions, to applying data science techniques, to deploying the results to improve decision-making. The concepts also undergird a large array of business analytics.

We can group our fundamental concepts roughly into three types:

1. General concepts about how data science fits in the organization and the competitive landscape, including ways to attract, structure, and nurture data science teams, ways for thinking about how data science leads to competitive advantage, ways that competitive advantage can be sustained, and tactical principles for doing well with data science projects.

2. General ways of thinking data-analytically, which help us to gather appropriate data and consider appropriate methods. The concepts include the data mining process, the collection of different high-level data science tasks, as well as principles such as the following.
• Data should be considered an asset, and therefore we should think carefully about what investments we should make to get the best leverage from our asset
• The expected value framework can help us to structure business problems so we can see the component data mining problems as well as the connective tissue of costs, benefits, and constraints imposed by the business environment
• Generalization and overfitting: if we look too hard at the data, we will find patterns; we want patterns that generalize to data we have not yet seen
• Applying data science to a well-structured problem versus exploratory data mining require different levels of effort in different stages of the data mining process

3. General concepts for actually extracting knowledge from data, which undergird the vast array of data science techniques. These include concepts such as the following.
• Identifying informative attributes: those that correlate with or give us information about an unknown quantity of interest
• Fitting a numeric function model to data by choosing an objective and finding a set of parameters based on that objective
• Controlling complexity is necessary to find a good trade-off between generalization and overfitting
• Calculating similarity between objects described by data
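As one illustration of the expected value framework listed above, the sketch below ranks customers for a retention offer by expected benefit and stops when an incentive budget runs out. Everything in it is an assumption made for illustration: the `Customer` fields, the probabilities and values, and the simple rule "expected benefit = p(churn) × p(offer retains) × customer value − offer cost" are hypothetical and are not the book's worked solution, which the authors develop with more care.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    p_churn: float    # assumed estimate: probability of leaving at contract end
    p_retain: float   # assumed estimate: probability the offer keeps a would-be churner
    value: float      # assumed value of keeping this customer


def expected_benefit(customer, offer_cost):
    """Hypothetical rule: expected value saved by the offer, minus its cost."""
    return customer.p_churn * customer.p_retain * customer.value - offer_cost


def choose_targets(customers, offer_cost, budget):
    """Pick customers in order of expected benefit until the budget is spent."""
    ranked = sorted(
        customers, key=lambda c: expected_benefit(c, offer_cost), reverse=True
    )
    chosen, spent = [], 0.0
    for customer in ranked:
        if expected_benefit(customer, offer_cost) <= 0:
            break  # remaining customers are not worth the offer
        if spent + offer_cost > budget:
            break  # budget exhausted
        chosen.append(customer)
        spent += offer_cost
    return chosen


# Made-up customers, purely for illustration.
customers = [
    Customer("A", p_churn=0.8, p_retain=0.5, value=1000),
    Customer("B", p_churn=0.2, p_retain=0.5, value=400),
    Customer("C", p_churn=0.5, p_retain=0.5, value=600),
    Customer("D", p_churn=0.9, p_retain=0.1, value=300),
    Customer("E", p_churn=0.3, p_retain=0.5, value=2000),
]

targets = choose_targets(customers, offer_cost=50, budget=100)
print([c.name for c in targets])
```

Note that customer D has the highest churn probability but is not selected: ranking by churn probability alone would ignore both the value at stake and the chance that the offer works, which is the kind of structure the expected value framework makes visible.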

slide 11:

Once we think about data science in terms of its fundamental concepts, we see the same concepts underlying many different data science strategies, tasks, algorithms, and processes. As we have illustrated throughout the book, these principles not only allow us to understand the theory and practice of data science much more deeply, they also allow us to understand the methods and techniques of data science very broadly, because these methods and techniques are quite often simply particular instantiations of one or more of the fundamental principles.

At a high level we saw how structuring business problems using the expected value framework allows us to decompose problems into data science tasks that we understand better how to solve, and this applies across many different sorts of business problems.

For extracting knowledge from data, we saw that our fundamental concept of determining the similarity of two objects described by data is used directly, for example to find customers similar to our best customers. It is used for classification and for regression, via nearest-neighbor methods. It is the basis for clustering, the unsupervised grouping of data objects. It is the basis for finding documents most related to a search query. And it is the basis for more than one common method for making recommendations, for example by casting both customers and movies into the same “taste space,” and then finding movies most similar to a particular customer.

When it comes to measurement, we see the notion of lift (determining how much more likely a pattern is than would be expected by chance) appearing broadly across data science, when evaluating very different sorts of patterns. One evaluates algorithms for targeting advertisements by computing the lift one gets for the targeted population. One calculates lift for judging the weight of evidence for or against a conclusion.
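For the ad-targeting case mentioned above, lift can be expressed as the response rate in the targeted group divided by the response rate in the whole population. The sketch below assumes that formulation; the function name `targeting_lift` and the toy data are hypothetical.

```python
def targeting_lift(responded, targeted):
    """Lift = response rate among targeted people / response rate overall.

    `responded` and `targeted` are parallel lists of 0/1 flags, one pair
    per person. A lift of 1.0 means targeting did no better than chance.
    """
    if len(responded) != len(targeted) or not responded:
        raise ValueError("need two non-empty lists of equal length")
    overall_rate = sum(responded) / len(responded)
    targeted_responses = [r for r, t in zip(responded, targeted) if t]
    if not targeted_responses or overall_rate == 0:
        raise ValueError("lift is undefined without targets or responders")
    targeted_rate = sum(targeted_responses) / len(targeted_responses)
    return targeted_rate / overall_rate


# Made-up example: 10 people, 2 responders, 4 people targeted.
responded = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
targeted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

lift = targeting_lift(responded, targeted)
print(f"lift: {lift:.2f}")
```

In this toy data the targeted group responds at 50% against 20% overall, so the targeting is about two and a half times as good as picking people at random.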
One calculates lift to help judge whether a repeated co-occurrence is interesting, as opposed to simply being a natural consequence of popularity.

Understanding the fundamental concepts also facilitates communication between business stakeholders and data scientists, not only because of the shared vocabulary, but because both sides actually understand better. Instead of missing important aspects of a discussion completely, we can dig in and ask questions that will reveal critical aspects that otherwise would not have been uncovered.

For example, let’s say your venture firm is considering investing in a data science-based company producing a personalized online news service. You ask how exactly they are personalizing the news. They say they use support vector machines. Let’s even pretend that we had not talked about support vector machines in this book. You should feel confident enough in your knowledge of data science now that you should not simply say “Oh, OK.” You should be able to confidently ask: “What’s that exactly?” If they really do know what they are talking about, they should give you some explanation based upon our fundamental principles (as we did in Chapter 4). You also are now prepared to ask, “What exactly are the training data you intend to use?” Not only might that impress data scientists on their team, but it actually is an important question to be asked to see

slide 12:

whether they are doing something credible, or just using “data science” as a smokescreen to hide behind. You can go on to think about whether you really believe building any predictive model from these data (regardless of what sort of model it is) would be likely to solve the business problem they’re attacking. You should be ready to ask whether you really think they will have reliable training labels for such a task. And so on.

Applying Our Fundamental Concepts to a New Problem: Mining Mobile Device Data

As we’ve emphasized repeatedly, once we think about data science as a collection of concepts, principles, and general methods, we will have much more success both understanding data science activities broadly, and also applying data science to new business problems. Let’s consider a fresh example.

Recently (as of this writing), there has been a marked shift in consumer online activity from traditional computers to a wide variety of mobile devices. Companies, many still working to understand how to reach consumers on their desktop computers, now are scrambling to understand how to reach consumers on their mobile devices: smart phones, tablets, and even increasingly mobile laptop computers, as WiFi access becomes ubiquitous. We won’t talk about most of the complexity of that problem, but from our perspective, the data-analytic thinker might notice that mobile devices provide a new sort of data from which little leverage has yet been obtained. In particular, mobile devices are associated with data on their location.

For example, in the mobile advertising ecosystem, depending on my privacy settings, my mobile device may broadcast my exact GPS location to those entities who would like to target me with advertisements, daily deals, and other offers. Figure 14-1 shows a scatterplot of a small sample of locations that a potential advertiser might see, sampled from the mobile advertising ecosystem. Even if I do not broadcast my GPS location, my device broadcasts the IP address of the network it currently is using, which often conveys location information.

slide 13:

Figure 14-1. A scatterplot of a sample of GPS locations captured from mobile devices.

As an interesting side point, this is just a scatterplot of the latitude and longitudes broadcast by mobile devices; there is no map! It gives a striking picture of population density across the world. And it makes us wonder what’s going on with mobile devices in Antarctica.

How might we use such data? Let’s apply our fundamental concepts. If we want to get beyond exploratory data analysis (as we’ve started with the visualization in Figure 14-1), we need to think in terms of some concrete business problem. A particular firm might have certain problems to solve, and be focused on one or two. An entrepreneur or investor might scan across different possible problems she sees that businesses or consumers currently have. Let’s pick one related to these data.

Advertisers face the problem that in this new world, we see a variety of different devices and a particular consumer’s behavior may be fragmented across several. In the desktop world, once the advertisers identify a good prospect, perhaps via a cookie in a particular consumer’s browser or a device ID, they can then begin to take action accordingly; for example, by presenting targeted ads. In the mobile ecosystem, this consumer’s activity
