Big Data: A Revolution That Will Transform How We Live, Work, and Think, the 2013 book by Viktor Mayer-Schönberger and Kenneth Cukier, which I read recently, can be divided into three broad parts (the division is mine).
Part 1 : "By changing the amount, we change the essence"
The authors begin by arguing that the sheer volume of data today brings about three fundamental shifts, which they cover in three succinctly titled chapters: (a) "More": The ability to collect all or nearly all of the data points makes sampling-based extrapolation redundant. With ever more data available, the sample is the whole population, or N = All. (b) "Messy": With that much data, the need for exactitude recedes, because individual errors matter less and less as the dataset grows. You can live with messy datasets as long as they are big enough to deliver insights and results that smaller ones can't. (c) "Correlation": What matters are associations, not their explanations. Predictive quality is everything and causality nothing. It is good enough to know that there is a correlation between variables even if it is not clear why.
Part 2 : Data as the 'oil of the information economy'
The second part covers the increasing 'datafication' of our world in general, the value created by the datafication of business specifically, and the implications of that for the ecosystem. The book draws an interesting distinction between 'datafication', the process of quantifying the world in analyzable formats, and 'digitization', the means that "turbocharges" that process. The value of data is likened to an iceberg: most of it lies below the surface. Value accrues both from the primary use for which the data was collected and from its reuse and extension beyond that purpose. This is the "option value of data", and it is a key driver of the ecosystem today. 'Data exhaust' - information generated by users' activity and interactions online - is being used to "train" systems and drive improvements in areas like speech recognition and translation. The book identifies three types of players in the data value chain: those who own the data (often only incidentally), the analytics experts who apply their skills to others' data, and those with the 'mindset' - the entrepreneurs who see the opportunity and build businesses around data. All are trying to position themselves at the point of maximum leverage, and data owners are unlocking value by processing and selling information to outside parties. The authors argue that over time, as data skills and mindsets become more common, it is the data itself and the data owners who will be the winners in the chain. The authors also discuss the end of domain specialists, with data scientists 'letting the data speak' and making the decisions.
Part 3 : The dark side of big data
The third section covers privacy, data protection and related legal and regulatory issues. It discusses the risks that a big data world poses to individual liberty. On the controls side, since it cannot be known in advance exactly how individuals' data is going to be used, the authors argue for a move from "privacy by consent" to "privacy by accountability", whereby data users take on the responsibility of ensuring that the data is not misused. Interesting but iffy! This last section didn't grip me as much as the first two.
The book is laced throughout with interesting examples that fuel the arguments and concepts very well. Some are of the familiar generic type - recommendations based on predictive analysis on sites like Amazon or Netflix, sentiment analysis of social media data, etc. Then there are specific well-known examples - Google and flu trends, Walmart stocking Pop-Tarts before hurricanes, Target and the pregnant teenager. And there are some lesser-known but equally fascinating examples. Consider how the pressure applied by a person on a car seat can be mapped through sensors to assign a unique digital ID and used as an anti-theft feature (and for other purposes, like safety, to boot). Or how in 2008 MIT's Billion Prices Project in the US used web-crawling software to track five times as many product prices in a day as the official CPI system did in a month, and predicted the post-Lehman deflationary swing a couple of months in advance. A standout example was the automaker that, having discovered a faulty part through data collected from its cars, went on to sell the patent for the fix to the supplier!
On the flip side, the book tends to be a bit repetitive. I can also see how it could be too basic for hard-core practitioners. Finally, the book has a bit of an all-or-nothing ring to it, especially with some of the bolder arguments like the deprioritization of causality or the end of specialists. That these arguments are valid only within a context is not made clear, and the exceptions are perhaps glossed over. However, it presents those contexts as they are and doesn't tip over into exaggeration, as a book like this easily could have. A paradigm shift is undeniably under way, and the authors bring out very well that this is only the starting point.
Overall, I found the book to be an excellent layperson's introduction to this important and very current topic - and a very enjoyable read at that!