The perils of big data

A shorter version of this article first appeared in issue 238 of .net magazine – the world's best-selling magazine for web designers and developers.

In case you missed it, Barack Obama won the US election a while back and will be in power for another four years. The news seemed to come as a shock to many, though not to Nate Silver of the New York Times who, along with a handful of others, predicted the result with 100% accuracy. And it wasn't a fluke; he had declared himself 90% confident before the election that Obama wouldn't need to pack up his toaster. His secret? Big data.

Well, big data and some surgically honed analytical chops, because the difference between Silver's predictions and those of less fortunate crystal-ball-gazers is that he took better account of the sentimental and prejudicial biases in his source data. As he tweeted when the results came in:

Indeed. He deserves to capitalise on his current fame. But, much as I admire Nate Silver, would you want to be him at the next US election? The pressure will be enormous. Could he get it 100% right next time? The probability is he'll do very well. But 100%? People are fickle. Vote-influencing issues can come out of nowhere. Hurricane Sandy, anyone?

At the 1992 UK general election, the Conservative party landed back in office with 8% more of the vote than the polls had predicted (had the polls been accurate, they would have lost by 1%). The phenomenon is well known as the “Shy Tory Factor” or the “Bradley Effect” – people telling pollsters that they'll vote for the candidate they perceive to be more socially acceptable, when in truth they intend to vote for someone else. As Robert L. Glass might have said, if he hadn't been talking about reuse: data-in-the-large remains a mostly unsolved problem.

Small amounts of data can certainly be misleading: a sample population of one that's inaccurate is 100% inaccurate. But big data sets provide a whole new scope for delusive conclusions. Data collected about human beings can be infused with bias and self-conscious moderation. Data analysed by humans can be tainted with politics, subjectivity, confusion over correlation and causation, and subtle shifts that lead us inexorably to the very answers we wanted at the beginning.

We have a name for that: it's called wishful thinking. And nowhere is wishful thinking more commercially pernicious than in Enterprise IT.

In the 1990s, architecture teams told us that point-to-point integration – application A talks directly to application B – was bad. The future was Enterprise Application Integration (EAI) – connecting applications to one another via an asynchronous message bus.

Unfortunately, though well intentioned, EAI was a disaster. The centralised command and control, overbearing governance, frailty of using messages for all communication, vendor lock-in, and cost of writing bus adaptors created unmanageable complexity.

But before the cost of this could be counted, we got Service Oriented Architecture (SOA) – deriving a set of abstract business services across all applications. Sharp vendors added XML-powered web services to their old EAI products and called them Enterprise Service Buses. And complexity increased.

With all this confusion over how to integrate systems it was pretty hard for businesses to get a consolidated picture of just what was going on. The architecture team's answer to that was Enterprise Data Warehousing (EDW) – an all-knowing repository based on a build-it-and-they-will-come mentality, promoting the extraction and standardisation of data from disparate business systems. Again, not a bad idea in principle, as long as the cost of doing so is lower than the value of the insight obtained. It isn't. But CIOs and vendors need not worry because “Big Data”, the latest in a procession of over-hyped initiatives, is here to stick another layer of obfuscation in the way of those pesky financial controllers.

The Big Data team

Businesses are generating more data than ever. It's cheaper to store, and with the rise of NoSQL we're finally breaking free of the dominion of the RDBMS. Any product manager worth their salary knows that insight into market forces, the social web, consumer behaviour and the competition is critical to success. Ideal conditions, then, for vendors – and architects coming up for their annual performance review – to espouse the seductive promise of perspicacity. Thus the Enterprise Data Warehouse team becomes the Big Data team; the vendors throw in some Hadoop integration and they're off and running again, befuddled financial controllers in their wake.

Businesses are more complex these days, no doubt about it. Society's cultural ebbs and flows are subtler, yet more consequential. It's also a tough time financially, which is why the basic tenet of “solutions must cost less than the problem they solve” is, to me, so critical.

Point-to-point integration between applications can cause serious issues, but EAI didn't make that better; it made it worse. SOA was a valuable concept with a great deal of promise, but it was hijacked by vendors and large systems integrators, and all but destroyed. Businesses that are complex unfortunately have to face the fact that they will also have complex IT infrastructures; the best they will ever get is an infrastructure that is only as complex as their business model. Businesses whose IT is complex, but appropriately so, might take that as a sign that the business model itself needs simplifying.

One way to simplify management information infrastructure, and to avoid the pitfalls of the Big Data Illusion, is to be consistently problem-driven. What are the business questions that make all the difference to trading effectiveness in the early 21st century? For most it won't be on the scale of “Who will win the next election?”

What about who your best customers are? Where they live? What your top products are? What customers like about competing products? These questions don't require big data; they require accurate data – data that probably resides in only a handful of systems.

That's not to say big data is a crazy idea. It's a natural progression that well-run businesses should be able to take in their stride – not use to explain away the failure of their Enterprise Data Warehouse. Large and seemingly unconnected data sets, coupled with careful analysis, can tell us remarkable things, but the investment has to be targeted at the questions we need answered, so that we don't lose sight of the goal in the pursuit of the solution.
