Big data is getting bigger: each day, Facebook processes 2.5 billion pieces of content. Numbers like these are impressive, but the story behind the trend is more subtle, and so is the way that forward-thinking organizations handle it.
Tony Lock, program director at strategic consulting company Freeform Dynamics, cites the “three Vs” when describing big data: volume, variety and velocity.
There is certainly far more data than there was before. Everything from social media to components of the growing “Internet of Things” is contributing to the data pile.
The data being produced is also vastly different. Tweets and LinkedIn updates rub shoulders with geolocation data from phones and GPS devices. Sensors in everything from cars to airplane engines contribute different kinds of data (there are currently 100.4 million machine-to-machine connections worldwide). Documents, audio and video calls add to the mix, with 72 hours of video being uploaded to YouTube alone every minute.
Finally, velocity is at an all-time high. The data is being produced faster than ever as devices gather information in real time and Internet users leave a constant stream of information about what they’re doing.
making sense of it all
These three factors are what make big data analytics fundamentally different from conventional business intelligence based on data warehouses, explains Shishir Garg, director of IT strategy development at Orange Silicon Valley.
Whereas data warehouses work with a well-understood selection of enterprise data, big data systems are designed to gather everything, says Garg. “The logic of a big data project is to try and not throw away anything,” he says. “It’s for people that are looking to ask more questions of the data.”
Velocity also plays a part. “The type of things you want to ask big data systems needs a quick turnaround,” he says. “Data warehousing has a longer cycle. Every time you need to bring up a new set of questions, there’s a project owner and an IT owner. It’s an involved process, and at a minimum, it’s a few weeks before you get those answers.”
Complex queries on large data volumes have generally been the purview of scientists, who use supercomputers in academic settings to map genomes and simulate the insides of stars. Now these techniques are moving into the commercial world, as forward-thinking companies tune into the possibilities of big data analytics.
customer insight
Walmart is a major pioneer in big data. The company is building its own big data platform using an open-source analysis tool called Hadoop, which it will roll out globally across the data centers serving its online store. This will unify large sets of data from each location and give the company better analytical capabilities.
Walmart’s big data project will give it insights into its customers and their buying patterns, providing more accurate feedback on what products are successful. The company also wants to incorporate social media information about the customers.
Online retail projects like this are good examples of the potential for big data, says Jason Alexander, Senior VP at London, Ontario-based firm Info-Tech Research. “At Amazon, they know what you bought, what you didn’t buy, what you looked at before you bought, and what you want to buy,” he says. Big data can render that information useful immediately.
If that intelligence were married with social media sentiment, data from your phone about what you’re doing, information about the weather in your area, which flights you have booked, and what kinds of people you have recently befriended on LinkedIn, who knows what insights a retailer might gain?
a new way of looking at data
That’s the point with big data: it enables you to find new patterns and correlations between datasets that you didn’t know about before. But to find them, you need as much data as possible, including perhaps data from outside your own organization – and you need to be able to query it in new ways.
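As a toy illustration of the kind of cross-dataset question this makes possible, the sketch below (in Python, using the pandas library) joins a retailer’s sales figures with external weather readings and checks whether the two move together. The file names, columns and data are invented for the example.

```python
# Illustrative only: combine two unrelated datasets and look for a correlation.
# File names and columns ("sales.csv", "weather.csv", "units_sold", "temp_c")
# are hypothetical.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])      # date, store_id, units_sold
weather = pd.read_csv("weather.csv", parse_dates=["date"])  # date, store_id, temp_c

# Combine internal and external data on a shared key...
combined = sales.merge(weather, on=["date", "store_id"])

# ...then ask a question neither dataset could answer alone:
# do sales of this product track the local temperature?
print(combined["units_sold"].corr(combined["temp_c"]))
```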
The hardware needed to support this concept isn’t revolutionary, say experts. “You could be talking about putting together a collection of Wintel boxes,” muses Garg, adding that many companies may simply reuse hardware decommissioned in a technology refresh.
That approach will make sense for many businesses, which will want to start small and test the benefits of big data on non-mission-critical data. It will also give companies insight into the kinds of skills they need to build as they scale up their big data operations.
The real difference comes in the software. A new generation of software has emerged to handle big data queries, designed specifically for the distributed processing of large datasets. MapReduce, developed at Google, is a programming model for this kind of distributed computation. Hadoop, created by the Apache Software Foundation, is an open-source implementation of the MapReduce approach and is being used in large-scale projects by companies like Walmart.
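For a sense of what the MapReduce model looks like, here is a minimal single-machine sketch of its classic example, a word count, written in Python. A real Hadoop job runs the same map, shuffle and reduce steps in parallel across a cluster; the documents here are stand-ins.

```python
# A minimal, single-machine sketch of the MapReduce pattern (word count).
# Real Hadoop jobs distribute the map and reduce phases across a cluster;
# this only illustrates the shape of the computation.
from collections import defaultdict

def map_phase(doc):
    # Emit (key, value) pairs: one ("word", 1) per word.
    for word in doc.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Combine all values emitted for the same key.
    return word, sum(counts)

documents = ["Big data is getting bigger", "Big data needs big storage"]

# Shuffle step: group intermediate values by key.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'big': 3, 'data': 2, ...}
```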
Clive Longbottom, co-founder of analyst firm Quocirca, points to MapReduce and Hadoop as a means of dealing with non-standard data. They can filter extremely large datasets in near real time, he says. “Analytic and intelligence engines can now reach across the different data stores and produce meaningful and useful visualizations of what is going on.”
The technology used to query this data also extends beyond traditional relational databases. NoSQL databases, a broad class of non-relational stores, organize data as flexible collections rather than fixed tables and rows, allowing them to house far greater volumes of varied data. They are particularly well suited to statistical and real-time analyses.
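As a small illustration, the sketch below uses MongoDB, one widely used NoSQL document store, through its pymongo driver. The database name, fields and values are hypothetical; the point is that each record is a flexible document rather than a row in a fixed schema.

```python
# Illustrative sketch of a document store (MongoDB via pymongo).
# Connection details, database and field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["retail"]["orders"]

# Each document can carry a different shape: there is no fixed schema of rows and columns.
orders.insert_one({
    "customer": "c-1042",
    "items": [{"sku": "snowshoe-xl", "qty": 2}],
    "tweet_sentiment": 0.7,          # enrichment from social media
    "geo": {"lat": 43.6, "lon": -116.2},
})

# Query by nested fields just as easily as by top-level ones.
for doc in orders.find({"items.sku": "snowshoe-xl"}):
    print(doc["customer"])
```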
succeeding with big data analytics
Companies can build these tools themselves from readily available components, although doing so takes extensive expertise. Alternatively, big data appliances can handle the initial extraction and loading, data processing and analytics. Such appliances are starting to use large amounts of solid-state memory for in-memory data processing instead of pulling data from disk. This makes data processing faster, enabling companies to cope with the velocity of incoming data feeds.
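A minimal sketch of the in-memory idea, using Python’s built-in SQLite module: the data is loaded from disk once, and every subsequent ad-hoc query is answered from RAM. The table, columns and file name are invented for the example, and real appliances do this at far larger scale on dedicated hardware.

```python
# A toy sketch of in-memory processing: load the data once into RAM,
# then serve repeated ad-hoc queries without going back to disk.
# The "events.db" file and its schema are hypothetical.
import sqlite3

mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE events (store TEXT, product TEXT, qty INTEGER)")

# One-off load from the slower disk-based source...
with sqlite3.connect("events.db") as disk:
    rows = disk.execute("SELECT store, product, qty FROM events").fetchall()
mem.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# ...after which every new question is answered from memory.
for store, total in mem.execute("SELECT store, SUM(qty) FROM events GROUP BY store"):
    print(store, total)
```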
Many talk about big data as a discrete market, but the line between big data and conventional solutions is blurred, as many companies adapt their existing infrastructures to cope with big data demands. Gartner predicts that $34 billion of IT spending will be big data-related this year, but that only $4.3 billion of it will be spent on new software.
This transition from traditional relational database systems to big data systems will be a slow one. Many companies won’t want to run before they can walk. “We asked organizations whether they were fully exploiting their traditional datasets, and almost no one thinks that they are,” says Tony Lock. “They are trying to get more value out of their Oracle, DB2, SQL tools. The vast majority of businesses still see big data as a niche sport.”
What will it take for companies to truly grasp big data and use it to their advantage? This concept, perhaps more than any other in corporate computing, has to be a business issue as well as an IT one. Big data can provide the answers to a lot of questions, but the business must understand which questions to ask and have the appetite to ask them. Most important of all, it must be willing to act on the answers.
“If you don’t have that corporate will to let the system suggest how many snowshoes you ship to Boise, Idaho,” concludes Alexander, “then there is no point.”