Developers Club geek daily blog

2 years, 11 months ago
We bring to your attention materials based on Alexander Serbul's performance at the BigData Conference conference. I as the author and the speaker, the text edited a little and added modern thoughts and actual problems therefore I hope a post will bring to you as additional practical useful knowledge in the industries, and food for reflections — where to move with the knowledge. So — in fight!


In we wash understanding, BigData is something mad, i.e. the term is, and there are no crinkles in it —. On the one hand, administrators puzzle where to put considerable information volume. On the other hand, it is the high-loaded development. It is necessary to create a cluster, to look for hardkorny developers, very experienced. Besides, it is necessary to use the higher mathematics: differential calculations, linear algebra, probability theory, machine learning, columns and training at them, logistic regression, linear discriminant analysis and so on. How to survive in this situation what tools to take what else crude (unfortunately, the majority) how to bring to firm money. It is a main objective of use of BigData, and all the rest — populism.


MapReduce is the paradigm of convolution of functions for execution of big and advanced queries by data offered Google. It allows to process in parallel information in the spirit of data parallelizm. Ideally, it is necessary all algorithms used for work with big and not only big data to remake on MapReduce. But nobody does it free of charge ;-) In the old book, at the grandmother on the shelf you quickly enough will find good algorithm of a clustering of K-means in a web and it reliable will work. But suddenly, before release or when data appear more, than you expected, you will find out that this algorithm does not work in "the parallel mode", does not work through MapReduce — they can load only one kernel of the processor. Therefore to you it will be necessary to invent urgently and again one more algorithm which is able to work in parallel and to think up as it has to work in MapReduce paradigm. And not everyone manages it - it is destiny of Computer Science (actually sometimes it is possible to remake without PhD of knowledge algorithm on MapReduce algorithm by method of strong coffee and a board with markers).
We in the projects began with use of algorithms on the Hadoop MapReduce platform, but then passed to Spark because it appeared more reasonably and is practical arranged. Classical Hadoop MapReduce works very simply: performed a task, put result in the file. Took, executed, put. Spark — a beret, performs all tasks, and then unloads result. In my opinion and not only mine, Hadoop MapReduce as do not twist — an outdated conglomerate which constantly and konvulsionno tries to change because of what developers and system administrators constantly should be retrained, and to business – to play the Russian tape measure with "crude technologies". But … we have not enough choice (yes, we watched Apache Storm — but it absolutely from other area: task parallel computng).

Alternatives of Apache Spark if it is fair, are not visible yet. At it and the most active open-source the project in infrastructure of Apache, it and object for imitation — look at least on Prajna from Microsoft.

It is possible to throw in reply to Apache Tez or to find something small in the zoo Apache — but, believe, for decrease in risks it is better to use mainstream-technologies which develop in step with the market.

Somewhere is nearby, not absolutely from this area, but from interesting and if there is a strong wish — look also at Apache Giraph and TensorFlow. Also ask a question: it is TaskParallel or DataParallel of technology — and everything will become clear.

For what tasks we use Spark

We use technologies of parallel processing of data approximately so. On MySQL-servers hundreds of thousands of databases of clients companies, with staff from units to thousands of employees turn. Spark is used generally in service of personal recommendations about which it was told in one of last publications.

We organize events — viewings of orders, adding in a basket, payments of orders — we process, we put in Amazon Kinesis, we process worker'ami, we save in NoSQL (DynamoDB) and, at last, we send to Spark for generation of models and analytics.

At first sight, the scheme looks excessively complicated, but the understanding gradually comes what all this is necessary for. It is a minimum of necessary tools which are necessary for a survival.

We unload the received results in Apache Mahout and on an output we receive specific recommendations for the client. The total amount of the events registered by service of recommendations reaches tens of million a day.

So far we intensively use and we develop algorithms of collaborative filtering, but we see approximately such road map on development of algorithms:
• Multimodality
• Recommendation Content-based
• Clustering
• Machine learning, deep learning
• Targetting

Now, more than ever earlier, the multimodality — i.e. use of several, different algorithms, for the answer to a question is appreciated (in our case — issues of the personal recommendation). It is insufficiently simple to start service based on Apache Mahout, it will not give any prize before competitors. Today you will surprise nobody with collaborative filtering any more. It is necessary to consider also tag cloud of the user when he went on shop and obtained some information. The clustering allows to organize targeting of information more flexibly. Here, of course, not to do without machine learning. Deep learning is, simple words, the "high-quality" machine learning meaning very detailed studying of a problem by machine and, often, use of multilayer recurrent neural network. At competent application it helps to increase even more conversion, to work more effectively with clients.

Reverse side of a variety

Today in the market there is a set of software environments, means, tools and products for development and data analysis. opensource thanks (yes, it is complete of crude glyukovaty vampires there, but there are also excellent solutions)! On the one hand, a variety is an undoubted benefit. With another — development process can become quite chaotic and confused. For example, at first try to use Apache Pig when something is impossible, address Apache Hive. Looked, were played, begin to write on Spark SQL. When requests begin to fall — rush to Impala (and there it is still worse). Under the threat of a suicide some damn the world of BigData and return to old kind RSUBD. It sometimes impression that the heap of toys for "specialists" is created, frequent same "specialists". And business does not understand all these searches, it needs to earn money, it demands a reality in time.
Today, perhaps, only Apache Hive can consider as the classical and reliable tool for work with SQL queries on the distributed data, as well as HDFS is classics among clustered file systems (yes, there is of course still GusterFS, CEPH). Nevertheless, many pass to Spark SQL because this framework is written (as there is a wish to trust) taking into account interests of business. Also, HBase, Casandra, Mahout, Spark MLLib are more and more actively used. However to demand that each developer and/or the system administrator freely was guided in all these tools — silly. It is profanation. Technologies — deep, with a heap of settings and everyone demands deep immersion on a month. You do not hurry — quality will inevitably suffer because of race behind quantity.

What to esteem

First of all I want to recommend to all who work or intend to begin to work with parallel algorithms and MapReduce, to read the book "Mining of Massive Datasets" which is in open access. It should be read several times, with a notebook and a pencil, otherwise not to avoid a mental jam. At first nothing most likely will be clear (to me began to open time with 3). But it is one of basic and that is important, available to engineers not possessing a black belt on mathematics, the book on algorithms of work with big data. In particular, chapter 2.3.6 is devoted to relational algebra and methods of projection of its operations on MapReduce. Very useful material, in fact, ready councils for developers are provided here, it is only enough to implement of them attentively.

Reading the literature complete of mathematical parts, remember a joke and smile :-)
Двое летят на воздушном шаре, попали в туман, заблудились. Вдруг их
прижимает к земле, и они видят человека. Один из них кричит вниз: "Где
мы?". Человек, подумав, отвечает: "Вы на воздушном шаре...". Очередным
порывом ветра шар уносится ввысь.
- Он идиот?
- Нет, математик.
- ??????
- Во-первых, он подумал, прежде чем ответить. Во-вторых, дал абсолютно
точный ответ. А в-третьих, этот ответ совершенно никому не нужен.

Apache Spark delicacies

• DAG (directed acyclic graph) vs Hadoop MapReduce vs Hadoop Streaming. It is possible to write the big SQL query representing several MapReduce-operations in a chain which will be correctly executed. Streaming is implemented in Spark much better, than in Hadoop, it is much more convenient to them to use and works often more effectively, at the expense of a caching of data in memory.
• Spark Programming Guide + API. Documentation very sensible and useful.
• It is possible to program on Java. With ++ was considered as difficult language, but Scala much … is not present, not more difficult, rather vysokoumny in my opinion. Scala — cool language, despite certain academic protukhshest and inorganic communication with living dead persons with widely protruding eyes like Haskel. I very much love Scala, but from it it is possible to go crazy, and compilation time leaves much to be desired by itself and the children. Therefore, at desire, it is possible to be connected to Spark both from Java, and from Python and from R.
• Convenient abstraction: Resilient Distributed Dataset (RDD). The fine, just divine concept from the world of functional programming allowing to parallelize files of huge volume — on hundreds of gigabytes, or even terabytes.
• Convenient collections: filter, map, flatMap. Convenient approach in Spark — a collection and operation over them: filter, map, flatMap. From where it undertook? It came from functional programming which is actively preached by Scala now (and not only in it).

Java7 and Spark

Historically it developed so that we write to Spark on Java 7, but not on Java 8. Unfortunately, in "seven" there is no normal support of function entities therefore we are forced to be engaged in a sado-masochism and to create objects like PairFunction, Tuple2, TupleN. Generally prepare — when Java7 is integrated from Scala, terribly unreadable code turns out. I, of course, exaggerate a little, but in is mute all is mixed and there is a wish to put on glasses with 13 eyepieces.
JavaRDD<String> combinedRowsOrdered = combinedRows.mapToPair(new PairFunction<String,String,String>() {…
	public Tuple2<String,String> call( String row ) {
		…return new Tuple2<String, String>…

If you do not want to climb in Scala jungle, then use Java 8 better. The code turns out more readable and well.

A little more on Scala

The name Scala came from the English scalable. It is the language for development of big component applications created by scientists-mathematicians. I personally have an impression that Scala is a trend. Trend of revival of functional programming (again hi Haskel, F#...). As that "suddenly" turned out (though scientists guessed it much earlier) that to process data arrays in Data-Parallel paradigm – more conveniently in functional style, wow! :-) Spark actively uses also Scala Actors (hi Erlang). They allow to write a simple, readable, one-line code which is executed on a large number of servers. You get rid of risks of multithreaded programming in which are forced to be engaged in Java — it is difficult, long and expensively (but abruptly). Besides, because of complexity of multithreaded programming there are many errors. And thanks to actors life "suddenly" becomes simpler.


For expansion in Amazon Spark offers us a script under the name Spark-es2. It downloads a half of a repository of Berkeley, something creates on Ruby under a cowl (hi Japan) and sets a lot of some software. The turned-out system is quite fragile, sensitive to configuration changes of machines. Also there are complaints on logging and updating of components of system.
Some time we existed with Spark-ec2 script, but it turned out that it is better to write Spark installer independently. Besides, the installer will be able to consider a possibility of connection of new spot-machines.
All this for us is painful as we have no big staff of system administrators — we are more programmers If we had 30 system administrators, I would tell: "Children, I will program on Scala, and you here, please, do not sleep at the nights and be engaged in Spark clusters". The sheaf from Spark and Elastic MapReduce was much more attractive option. Also colleagues praise solutions from Spark from Cloudera and HortonWorks — can they to you too will be useful.

Amazon EMR

Amazon suggests us not to waste time and to unroll Spark cluster at them with ElasticMapReduce service use. Here almost everything will work from a box. Spark is integrated into a Yarn-cluster, there is a lot of software, there is a podglyuchivayushchy monitoring, it is possible to add the machine, to scale HDFS, to change cluster size, to increase and reduce quantity of tasks at the expense of spot-machines. Spot-machines in Amazon cost 5-10 times cheaper. We use them always because it is convenient, cheap and quickly.

Spark is professionally integrated into EMR with S3. It is correct, exactly there you will store most likely files in Amazon. We compared storage of big files in S3 and HDFS, and it turned out that access rate approximately identical. Then why to contact HDFS, to suffer with clustered file system if there is a ready service. Also in Elastic MapReduce to giblets of Sparlk/Hadoop it is possible to prokinut through a ssh-tunneling web adminki and to get used to them (though I did not get used).

Amazon EMR cost

Normal machines, difference about 10% turn out a little more expensively, than. At the same time you receive "in a set" a lot of ready, the truth of a little buggy software (Hue is buggy most) and an opportunity to scale a cluster. At the same time even the administrator is not necessary to you — you as developers, there God-almighty.

Types of machines

Machines happen three types here:
• Master-machines which control in general all cluster. On them Spark Master is set.
• Core-machines on which the file system — HDFS is unrolled. Them there can be several pieces. However, it is only recommended to increase the number of core-machines, but not to reduce, otherwise data are lost.
• For all the rest task-machines are used. These are normal Spark-servers at which vorker work. The number of spot-machines can be changed freely, creating park though from hundreds of machines.


• Spark. In the previous versions of images of Amazon spark.dynamicAllocation.enabled so you have to speak to it how many it is necessary machines for execution of a task is not supported yet. If the cluster partially stands idle, then Spark will not occupy the remained machines for execution. You have to register strictly how many machines are necessary for it. It is inconvenient. But since AMI 4 this function already works.
• Hadoop/Yarn/HDFS. In Yarn-clusters, as well as in Oracle, the set of settings, and, on good is used, the administrator who very well understands it is necessary. But, despite pain, Hadoop-clusters surely solve the problems. Most of all I do not like in Yarn and Hadoop the fact that there a mess with logging. In a log everything is written, settings of levels of logging are scattered by different parts of cluster giblets and therefore their quantity very quickly expands. And there is no normal solution of this problem. It is quite good to learn from old kind inhabitants of unix – for example at mysql, apache.
• Ganglia. It is time-series software which builds diagrams on different metrics: loading, quantity of tasks, etc. Helps to gain an impression about a cluster status. The convenient thing, but is shortcomings – the machines "killed" a spot continue to hang and encumber diagrams.
• Hive. It is support of the SQL commands which works at files in HDFS and S3. The quite good tool, but its opportunities sometimes are not enough. We use. But when is necessary more — come into Spark and directly we solve problems of relational algebra.
• Pig. We do not use it therefore I find it difficult to give some assessment.
• Hbase. NoSQL option, we do not use yet.
• Impala. Very interesting thing about which it is possible to write a separate post. So far makes impression of crude software. So use at own risk.
• Hue. It is the adminka to "bigdata" written on Python. Its GUI allows to integrate together both Impala, and Hbase, and Pig, and Hive. That is it is possible to make the corner of the analyst in a web :-) I used it week, it began to be buggy, hang up, then ceased to open in general — generally, the left unfinished software

The main problems of Spark which we met

Falling on memory

What is Map? We take something, we scatter on keys, and already we scatter them on a cluster. Here nothing has to fall — algorithmically.
What is Reduce? When in one worker data grouped in one key are collected. If it is good to program, then it is possible to transmit a la carte to reducer all values in one key and nothing will fall. But in practice it turned out that Spark can fall in different points – there was not enough buffer for serialization, for a vorker of memory was not enough. And, in my opinion, it is the main problem. But nevertheless it is possible to be attached accurately. Our Spark does not fall now though we achieved it by means of magic.

It is necessary to put reasonable Executor Memory: - executor-memory 20G, - conf spark.kryoserializer.buffer.mb=256, - conf spark.default.parallelism=128, - conf
KryoSerializer allows to squeeze objects (spark.serializer org.apache.spark.serializer.KryoSerializer). Without it they consume much more memory. Also I do not recommend to reduce value of the constant spark.default.parallelism=128, otherwise can often fall on memory. As for memoryFraction, we do not use a caching.

Unloading of results

Let's say you need to unload from Spark data in model. If volume is big, then it will be executed very long.
• Thanking - driver-memory 10G you understand that you unload from the driver.
• When using Collect () all result gathers in memory in the driver and it can fall. Therefore I recommend to use toLocalIterator (). Alas, its performance is very much low. Therefore we had to write a code for assembly of partition. To whom it is interesting, I will tell in more detail.


This code — the only thing that helped us to cope with a logging problem:
export SPARK_LOCAL_DIRS="/mnt/spark,/mnt2/spark"
export SPARK_WORKER_DIR="/mnt/spark_worker"
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=259200"
#Worker executor stdout/stderr logs
spark.executor.logs.rolling.maxRetainedFiles 2
spark.executor.logs.rolling.strategy time
spark.executor.logs.rolling.time.interval daily


I hope was and it is useful and it is interesting. It is more and more and more actively in our life drive parallel algorithms on MapReduce. It is not enough of them, they are looked for, but there is no other output probably (well something can it will turn out to count one Apache Giraph faster and TensorFlow and through Task-Parallel paradigm). The platform which became classics – Hadoop MapReduce, gives way functionally written and in the modern language and mathematics to the Apache Spark platform. Most likely you will be forced to begin to understand, at least at the level of terms, inhabitants of a Hadoop-zoo: Hive, Pig, HDFS, Presto, Impala. But constantly to study – and to be ahead of competitors it is necessary to know our everything more, to write quicker and to think – more brightly. All of good luck and with coming New Year!

This article is a translation of the original post at
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here:

We believe that the knowledge, which is available at the most popular Russian IT blog, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus