We’ve all heard of Big Data, a buzzword thrown around alongside IoT and Cloud Computing. But what exactly is big data?
There’s a running joke about it: “Big Data is data that you cannot fit into an Excel spreadsheet.” There is some truth to that statement, but there’s much more to Big Data than most people realize.
Let’s start to understand Big Data by breaking it down. IBM has done a great job categorizing its four primary characteristics:
- Velocity: The rate at which you receive the data – it could be 500MB of data an hour or 20 petabytes of data in a day.
- Volume: The amount of data that needs to be processed and stored. Many companies in the U.S. have at least 100 terabytes of data stored, and much of it has to be processed.
- Variety: The different types of data that are received – the statistics, images, YouTube videos, and audio recordings that are posted.
- Veracity: The reliability of your data and how many false positives are present in your data set. Imagine making a large business investment based on false data!
Why does it matter? Data is exploding
We are awash in data. And a lot of it is valuable to individuals and companies alike. Google received over 2 million search queries per minute in 2012. Today, Google receives over 4 million search queries per minute from the 2.4 billion strong global Internet population. That’s just one statistic that tells the story of the data explosion that we’re living through. Here are some more statistics, each from a single Internet minute, that might surprise you:
- Twitter users tweet nearly 300,000 times
- Instagram users post nearly 220,000 new photos
- YouTube users upload 72 hours of new video content
- Apple users download nearly 50,000 apps
- Email users send over 200 million messages
- Amazon generates over $80,000 in online sales
- Facebook users send on average 31.25 million messages and view 2.77 million videos
Imagine being one of the people who has to help manage these servers…
Data is everywhere
With the exponential growth of the Internet, many corporations have built their businesses around managing and presenting this data. To give you an idea of how much data is out there – according to Cisco’s Visual Networking Index initiative, the Internet is now in the “zettabyte era.” A zettabyte equals 1 billion terabytes. By the end of 2016, global Internet traffic will reach 1.1 zettabytes per year. All of that data comes at a cost, especially when it comes to being discovered.
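To make those unit conversions concrete, here is a quick arithmetic sketch in Python using the decimal (SI) definitions of the byte units:

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
TB = 10**12  # 1 terabyte
ZB = 10**21  # 1 zettabyte

# A zettabyte really is 1 billion terabytes.
print(ZB // TB)  # 1000000000
```

Note that storage vendors and operating systems sometimes use binary units (1 TiB = 2**40 bytes) instead, so the exact figures can differ slightly depending on convention.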
How to be discovered, manage, and scale with big data
Google has changed the way that most companies present their data. A Google search has become the doorway to discovering new websites on the Internet. To rank on the first page, your website must have some SEO tactics in place in order to draw in a larger crowd and stand out in the pool of data.
How do we manage all of these transactions involving large amounts of data? Let’s talk about Hadoop. Hadoop is a scalable framework developed by the Apache Software Foundation. It is used to distribute and process large amounts of data across clusters of computers. Hadoop is NOT a database management system – it is closer to a data warehousing system. Why do we use Hadoop for big data? Hadoop is a very robust system that allows your Big Data applications to continue running even when an individual server or node fails. It can also retrieve data from your SQL or NoSQL databases and perform analysis on your data.
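The core idea Hadoop distributes across a cluster is the MapReduce pattern: a map step that emits key–value pairs, and a reduce step that combines them by key. Here is a minimal single-machine sketch in Python – an illustration of the pattern only, not the actual Hadoop API (Hadoop jobs are typically written in Java against its MapReduce or YARN interfaces):

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document.
    In real Hadoop, this runs in parallel on the nodes holding the data."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each word.
    Hadoop would first shuffle pairs so each reducer sees one key's values."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data is everywhere"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The value of the real framework is everything this sketch leaves out: splitting the input across machines, rerunning map or reduce tasks when a node fails, and moving the computation to where the data lives.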
Welcome to the world of Big Data!