In today’s technological world, almost everything depends on computer systems and the internet. Technology has entered every field, resulting in huge data growth, and all of this data is valuable. Enormous amounts of data are created and used every day. One machine can’t store and process this huge amount of data, hence the need to understand big data and the methods to store and process it. Big data is a huge amount of data that can’t be processed by a traditional approach (a single computer system) within a given time frame.
Now, how big does this data need to be? There is a common misconception around the term big data: that there is some threshold above which data counts as big data, and that it refers only to data measured in gigabytes, terabytes, petabytes, exabytes, or sizes even larger. This definition is wrong. Big data depends purely on the context in which it is used; even a small amount of data can be referred to as big data. For example, you can’t attach a 100 MB file to an email, so for the email system, 100 MB is big data. Some more examples are listed below:
Suppose there are 100 TB of videos to be resized and edited within a given time frame. Using a single storage system, we won’t be able to accomplish this within the given time frame.
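A back-of-the-envelope sketch makes this concrete. The numbers below (per-machine throughput, ideal linear speedup) are assumptions for illustration, not figures from the text:

```python
# Back-of-the-envelope sketch (all numbers are assumptions): how long
# would it take to process 100 TB of video on one machine versus many?

TOTAL_GB = 100_000        # 100 TB of video to resize and edit
RATE_GB_PER_HOUR = 50     # assumed per-machine processing throughput

def hours_needed(machines: int) -> float:
    """Ideal linear speedup: the work is split evenly across machines."""
    return TOTAL_GB / (RATE_GB_PER_HOUR * machines)

print(hours_needed(1))    # 2000.0 hours (~83 days) on a single machine
print(hours_needed(100))  # 20.0 hours spread across 100 machines
```

Real clusters never scale perfectly linearly (coordination and data movement add overhead), but the gap between one machine and many is the core motivation for distributed processing.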
Popular social networking sites have a lot of data coming in every minute: Facebook receives about 100 TB of data per day, around 400 million tweets are posted on Twitter every day, and almost 48 hours of video are uploaded to YouTube every minute. This data is very important, and it needs to be processed within a given time frame.
Data is classified into three main categories:
1. Structured data
– databases, XLS files etc.
2. Semi structured data
– email, log files, doc files etc.
3. Unstructured data
– images, videos, music files etc.
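The three categories above can be illustrated with a small sketch. The mapping from file extensions to categories below is a simplified example based on the list above, not a complete rule:

```python
# Illustrative sketch: classify files into the three big-data categories
# by extension. The extension sets are examples, not an exhaustive rule.

CATEGORIES = {
    "structured":      {".xls", ".csv", ".sql"},   # databases, XLS files
    "semi-structured": {".xml", ".json", ".log"},  # logs and tagged formats
    "unstructured":    {".jpg", ".mp4", ".mp3"},   # images, videos, music
}

def classify(filename: str) -> str:
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "unknown"

print(classify("accounts.xls"))  # structured
print(classify("server.log"))    # semi-structured
print(classify("holiday.mp4"))   # unstructured
```

In practice the boundary is about how much schema the data carries: structured data fits fixed tables, semi-structured data carries its own tags or markers, and unstructured data has no predefined model at all.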
Managing big data requires new techniques, new tools, and new architecture for storage and processing within a time frame.
V’s of big data
Originally there were 3 V’s of big data; 4 more have recently been added, making 7 in total.
Volume
It refers to the huge amount of data being created, ranging from data generated by social networking sites to data held by banks (accounts, credit and debit cards).
Variety
It refers to the different types of data being used, as discussed above (structured, semi-structured, and unstructured).
Velocity
It refers to the speed at which data arrives: while processing, more and more data keeps coming in, and it has to be processed efficiently and within the time frame. For example, every minute new videos are being uploaded to YouTube.
Veracity
It refers to the authenticity of the data. For example, Twitter users put hashtags and abbreviations in their tweets, and Twitter checks the accuracy of all this content.
Visibility
It refers to the type of data that is visible.
Validity
It refers to the validity of data over time. For example, the kinds of files used in 1998 are different from those being used now.
Solutions for big data
1. Increased processing speed
2. Increased network speed
3. Distributed servers / cloud, e.g. Amazon EC2
4. File system – scalable, distributed, e.g. HDFS
5. Programming model/paradigm – Hadoop
6. NoSQL databases, e.g. HBase
7. Indexing and analytics, e.g. data mining, information retrieval
An algorithm was developed that allowed a large data calculation to be chopped up into smaller chunks, mapped out to many computers, and then, after the calculations were done, brought back together to produce the resulting data set. This algorithm was called MapReduce, and Hadoop was created around it. In this way, data is processed in parallel rather than serially.
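The map-then-reduce pattern described above can be sketched in a few lines. This is a single-process, illustrative word count, not Hadoop itself; on a real cluster each chunk would be mapped on a different machine:

```python
# Minimal MapReduce-style word count (single process, illustrative only):
# the input is chopped into chunks, each chunk is mapped independently,
# and the partial results are reduced back into one result set.

from collections import Counter
from functools import reduce

def map_chunk(chunk: str) -> Counter:
    """Map phase: count the words within one chunk."""
    return Counter(chunk.split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce phase: merge two partial counts into one."""
    return a + b

chunks = ["big data big", "data needs hadoop", "big hadoop"]
partials = [map_chunk(c) for c in chunks]  # on a cluster, runs in parallel
totals = reduce(reduce_counts, partials, Counter())
print(totals["big"])  # 3
```

Because each chunk is mapped independently, the map phase parallelizes trivially; only the reduce phase has to combine results, which is what lets the same calculation scale from one machine to thousands.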