We will try to describe big data from different angles: the basic principles of working with data, the tools, and examples of solving practical problems. We will pay special attention to machine learning.
It makes sense to start simple and move toward the complex, so this first article covers the principles of working with big data and the MapReduce paradigm.
Historical background and definition of the term
The term Big Data appeared fairly recently. Google Trends shows that active growth in the use of the phrase began in 2011 (link):
By now practically everyone uses the term, and marketing specialists use it especially often, and often beside the point. So what is Big Data, really? Since I have decided to present the subject systematically, I first need to settle on a definition.
In practice I have come across different definitions:
· Big Data is when there is more than 100 GB of data (500 GB, 1 TB, pick your favorite threshold)
· Big Data is data that cannot be processed in Excel
· Big Data is data that cannot be processed on a single computer
And even these:
· Big Data is any data at all.
· Big Data does not exist; it was invented by marketing specialists.
In this series of articles I will stick to the definition from Wikipedia:
Big data — a series of approaches, tools, and methods for processing structured and unstructured data of huge volume and considerable variety in order to obtain results perceivable by humans; effective under conditions of continuous data growth and distribution across numerous nodes of a computer network; formed in the late 2000s as an alternative to traditional database management systems and Business Intelligence solutions.
Thus by Big Data I will understand not some specific volume of data, and not even the data itself, but the processing methods that allow information to be processed in a distributed way. These methods can be applied to huge data arrays (such as the content of all pages on the Internet) as well as to small ones (such as the content of this article).
Here are a few examples of data sources that call for big data techniques:
· Logs of user behavior on the Internet
· GPS signals from the cars of a transport company
· Data read from the sensors of the Large Hadron Collider
· Digitized books in the Russian State Library
· Information about the transactions of all clients of a bank
· Information about all purchases in a large retail chain, etc.
The number of data sources is growing rapidly, so technologies for processing them are becoming more and more in demand.
The principles of working with big data
Based on the definition of Big Data, we can formulate the basic principles of working with such data:
1. Horizontal scalability. Since there can be arbitrarily much data, any system that involves processing big data must be extensible. If the data volume doubles, you double the amount of hardware in the cluster, and everything keeps working.
2. Fault tolerance. The principle of horizontal scalability implies that a cluster can contain many machines. For example, Yahoo's Hadoop cluster has more than 42,000 machines (this link shows cluster sizes in various organizations). With that many machines, some of them are guaranteed to fail. Big data techniques must take such failures into account and survive them without any significant consequences.
3. Data locality. In large distributed systems, data is spread across many machines. If data physically resides on one server but is processed on another, the cost of transferring the data can exceed the cost of processing it. Therefore one of the most important design principles of BigData solutions is data locality: whenever possible, process data on the same machine that stores it.
All modern tools for working with big data follow these three principles in one way or another. To follow them, it was necessary to invent methods and paradigms for building data-processing tools. In today's article I will examine one of the most classic of these methods.
MapReduce has already been covered on Habré (one, two, three), but since this series of articles aims at a systematic treatment of Big Data, the first article cannot do without MapReduce :)
MapReduce is a model of distributed data processing proposed by Google for processing large volumes of data on computer clusters. MapReduce is well illustrated by the following picture (taken from this link):
MapReduce assumes that the data is organized as a set of records. Processing happens in 3 stages:
1. Map stage. At this stage the data is preprocessed with a user-defined map() function. The job of this stage is preprocessing and filtering; it works much like the map operation in functional programming languages: the user function is applied to each input record.
The map() function applied to a single input record produces a set of key-value pairs. "A set" means it may produce just one pair, it may produce nothing at all, or it may produce several pairs. What goes into the key and the value is up to the user, but the key is very important: all data with the same key will later end up in a single invocation of the reduce function.
2. Shuffle stage. This stage is invisible to the user. Here the output of the map function is "sorted into baskets": each basket corresponds to one key produced at the map stage. These baskets then serve as input for reduce.
3. Reduce stage. Each "basket" of values formed at the shuffle stage becomes the input of a reduce() call.
The reduce function is defined by the user and computes the final result for an individual "basket". The set of all values returned by reduce() is the final result of the MapReduce job.
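The three stages can be sketched as a toy, single-machine simulation (a real framework distributes each stage across a cluster; the function name `run_mapreduce` and the code itself are illustrative, not any framework's API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine illustration of the three MapReduce stages."""
    # Map stage: apply the user map function to every input record;
    # each call yields zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))

    # Shuffle stage: group the values into "baskets" by key.
    baskets = defaultdict(list)
    for key, value in pairs:
        baskets[key].append(value)

    # Reduce stage: the user reduce function turns each basket
    # into the final result for that key.
    return {key: reduce_fn(key, values) for key, values in baskets.items()}
```

For example, mapping each number to its parity and reducing with a sum groups and totals the inputs by key.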
A few additional facts about MapReduce:
1) All invocations of the map function work independently and can run in parallel, including on different machines of the cluster.
2) All invocations of the reduce function work independently and can run in parallel, including on different machines of the cluster.
3) The shuffle itself is a parallel sort, so it too can run on different machines of the cluster. Points 1-3 make the principle of horizontal scalability possible.
4) The map function is usually executed on the same machine where the data is stored, which reduces data transfer over the network (the data locality principle).
5) MapReduce always performs a full scan of the data; there are no indexes. This means MapReduce is a poor fit when an answer is needed very quickly.
Examples of tasks effectively solved with MapReduce
Let's begin with the classic task: Word Count. It is formulated as follows: there is a large corpus of documents; for each word that occurs at least once in the corpus, count the total number of times it occurs.
Since we have a large corpus of documents, let one document be one input record for the MapReduce job. In MapReduce we can only define the user functions, so that is what we will do (we will use python-like pseudo-code):
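A minimal sketch of the two user functions (assuming one document per input record and simple whitespace tokenization; the function names are illustrative):

```python
def map_word_count(document):
    # one input record = one document;
    # emit a (word, 1) pair for every word in it
    for word in document.split():
        yield (word, 1)

def reduce_word_count(word, counts):
    # counts is the "basket" of ones collected for this word
    # at the shuffle stage, e.g. [1, 1, 1]
    return (word, sum(counts))
```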
The map function turns an input document into a set of pairs (word, 1); the shuffle stage, transparently to us, turns that into pairs of the form (word, [1,1,1,1,1,1]); and reduce sums up those ones, returning the final answer for the word.
Processing the logs of an advertising system
The second example comes from the real practice of the Data-Centric Alliance.
Task: there is a CSV log of an advertising system of the following form:
<user_id>,<country>,<city>,<campaign_id>,<creative_id>,<payment>
11111,RU,Moscow,2,4,0.3
22222,RU,Voronezh,2,3,0.2
13413,UA,Kiev,4,11,0.7
…
We need to calculate the average cost of showing an ad across the cities of Russia.
The map function checks whether we need a given record, and if so, keeps only the necessary information (the city and the payment amount). The reduce function computes the final answer for a city, given the list of all payments in that city.
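A sketch in the same python-like style (assuming the six-column format shown above, payments parsed as floats, and "average cost" meaning the arithmetic mean of payments per city; the function names are illustrative):

```python
def map_ads(line):
    # parse one CSV line of the log; keep only records from Russia,
    # emitting (city, payment) pairs
    user_id, country, city, campaign_id, creative_id, payment = line.split(",")
    if country == "RU":
        yield (city, float(payment))

def reduce_ads(city, payments):
    # payments is the basket of all payments observed in this city;
    # the answer is the average cost of an impression there
    return (city, sum(payments) / len(payments))
```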
In this article we covered some introductory points about big data:
· What Big Data is and where it comes from;
· What basic principles all tools and paradigms for working with big data follow;
· We looked at the MapReduce paradigm and examined several tasks where it can be applied.
This first article was more theoretical; in the second we will move on to practice, look at Hadoop, one of the best-known technologies for working with big data, and show how to run MapReduce jobs on Hadoop.
In subsequent articles of the series we will consider more complex problems solved with MapReduce, discuss the limitations of MapReduce, and look at the tools and techniques that can be used to work around those limitations.
Thank you for your attention; we are happy to answer your questions.
Links to other parts of cycle:
Big Data from A to Z. Part 2: Hadoop
This article is a translation of the original post at habrahabr.ru/post/267361/