
Hadoop Tutorial

Characteristics of Big Data

We can characterize Big Data by the 3Vs: the volume of data, the wide variety of data, and the velocity at which the data must be processed.

Volume:

Volume refers to the enormous amount of data generated every second. It is no longer measured in Terabytes, but in Zettabytes or even Brontobytes. The volume of data generated in this world between the beginning of time and 2008 will soon be generated every minute. This makes most data sets too large to store and analyze using traditional relational database technology. Big Data technologies instead use distributed systems, so that data can be stored and analyzed across databases located anywhere in the world.

By the year 2020, we will have 50 times the amount of data that we had in the year 2011. A substantial contributor to this ever-expanding digital universe is the Internet of Things (IoT), with sensors embedded in devices all over the world creating data every second.

Variety:

Variety refers to the different formats of data that do not lend themselves to storage in traditional structured relational database systems. This includes documents, social media text, emails, messages, still images, audio, video, graphs, and the output of all types of machine-generated data: sensors, cell phone GPS signals, RFID tags, machine logs, DNA analysis devices, and more. Such data is primarily characterized as unstructured or semi-structured, and it has existed all along; some studies estimate that it accounts for 90% or more of the data in organizations.

Variety also refers to data coming from many different sources, both inside and outside of the company.

Fig: 3Vs of Big Data

Velocity:

Data scientists like to distinguish between data-at-rest and data-in-motion. Velocity describes data-in-motion, for example, the stream of readings taken from a sensor, or the weblog history of page visits and clicks by each visitor to a web site. This can be thought of as a fire hose of incoming data that needs to be captured, stored, and analyzed. Ensuring the consistency and completeness of fast-moving streams of data is the first concern; matching them to a specific outcome, a challenge raised under Variety, is another. Velocity also encompasses the characteristics of timeliness, or latency.
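As a minimal sketch of data-in-motion, the snippet below simulates a sensor stream and computes a rolling average over it, keeping only a small window in memory rather than the full history. The sensor readings and window size here are made-up illustrations, not part of any real system.

```python
from collections import deque

def sensor_stream():
    # Hypothetical stand-in for a real sensor feed: yields one reading at a time.
    for reading in [21.0, 21.5, 22.0, 25.0, 21.8, 21.9]:
        yield reading

def rolling_average(stream, window=3):
    # Process data-in-motion: hold only the last `window` readings,
    # emitting an updated average as each new reading arrives.
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

averages = list(rolling_average(sensor_stream()))
print(averages[-1])  # average of the last 3 readings: 22.9
```

Because the stream is consumed one reading at a time, this pattern works the same way whether the source is a short list, as here, or an unbounded fire hose of events.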

The second dimension of Velocity is how long the data remains valuable. Is the data permanently valuable, or does it rapidly lose its importance and meaning? Recognizing this dimension, we should retain data only until we confirm that it is no longer meaningful; beyond that point it may in fact mislead.