After a short break, our FlyElephant team resumes publication of the digest — a selection of links to news and materials on artificial intelligence, big data, and high-performance computing. We are also running a survey among scientists, asking a few questions about their research process. We will be grateful to everyone who takes part in the poll, and we wish you pleasant reading of the fresh digest!
3 years ago
Hi, Habr! In the previous article we looked at MapReduce, a paradigm of parallel computing. In this article we move from theory to practice and examine Hadoop, a powerful toolkit for working with big data from the Apache Foundation.
The article describes which tools and components Hadoop includes, how to install Hadoop on your own machine, and gives instructions and examples for developing MapReduce programs for Hadoop.
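The article's own examples are in the linked post; purely as an illustration of the MapReduce model it teaches, here is a minimal word-count mapper and reducer in plain Python, with the shuffle phase (which Hadoop normally performs for you) simulated by a sort-and-group step:

```python
import itertools
import re

def mapper(line):
    """Map phase: emit (word, 1) for every word in an input line."""
    for word in re.findall(r"\w+", line.lower()):
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for a single word."""
    yield word, sum(counts)

def run_job(lines):
    """Simulate the shuffle Hadoop performs between map and reduce:
    sort all (key, value) pairs, then group them by key."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    result = {}
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        for w, total in reducer(word, (count for _, count in group)):
            result[w] = total
    return result
```

For example, `run_job(["hadoop is big", "big data is big"])` counts each distinct word across both "split" lines, just as a real job would across HDFS blocks.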
3 years ago
Analysts sometimes need to answer questions like: how many sites run WordPress and how many run Ghost; what coverage does Google Analytics have compared to Yandex.Metrica; how often does site X link to site Y. The most honest way to answer them is to walk over every page on the Internet and count. That is not as mad an idea as it may seem: the Common Crawl project publishes a fresh dump of the Internet every month as gzip archives totaling about 30 TB. The data lives on S3, so Amazon's Elastic MapReduce is usually used for processing, and there is a mass of guides on how to do it. But at the current dollar exchange rate this approach has become somewhat expensive. I would like to share a way to cut the cost of the computation roughly in half.
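The per-platform counting described above boils down to scanning gzipped archives for a fingerprint of each engine. A heavily simplified sketch (the real Common Crawl WARC record format and S3 access are omitted; the `wp-content` fingerprint and the one-document-per-line layout are illustrative assumptions only):

```python
import gzip
import io

def looks_like_wordpress(html: str) -> bool:
    """Crude fingerprint: WordPress pages normally reference wp-content."""
    return "wp-content" in html

def count_wordpress_pages(gzipped_blob: bytes) -> int:
    """Count WordPress-looking documents in a gzip archive, assuming
    one HTML document per line (a simplification of the WARC format)."""
    count = 0
    with gzip.open(io.BytesIO(gzipped_blob), "rt", encoding="utf-8") as f:
        for page in f:
            if looks_like_wordpress(page):
                count += 1
    return count

# A tiny synthetic archive stands in for the real ~30 TB dump.
sample = "\n".join([
    '<link href="/wp-content/themes/x/style.css">',
    "<html><body>a ghost blog</body></html>",
]).encode("utf-8")
blob = gzip.compress(sample)
```

The point of the sketch is that each archive can be streamed and decompressed on the fly, so the per-worker memory footprint stays constant regardless of archive size.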
3 years ago
Translator's note: In our blog we write a lot about building the 1cloud cloud service, but there is also much to learn from other companies' experience with infrastructure. Today we present the second and final part of an adapted translation of a note by Twitter's engineering team on building a file system for working with Hadoop clusters. The first part is available at the link.
A highly available environment across multiple data centers
In addition to everything described in the first part of the article, Twitter's engineers created a project code-named Nfly (N elements in N data centers), which implements most of the HA functionality and multi-data-center support inside ViewFs, avoiding code duplication. Nfly can map a single ViewFs path to a whole set of clusters.
With Nfly, clients interact with what appears to be a single file system, while in reality every write is applied to all linked clusters, and reads are served either from the nearest cluster (according to NetworkTopology) or from the one holding the freshest copy.
Once, on the recommendation of friends, I took a course on BigData and was lucky enough to take part in a competition. I won't talk about the course itself; instead I'll talk about the MyMediaLite library for .NET and how I used it.
3 years ago
Translator's note: In our blog we write a lot about building the 1cloud cloud service, but there is also much to learn from other companies' experience with infrastructure. Today we present the first part of an adapted translation of a note by Twitter's engineering team on building a file system for working with Hadoop clusters.
Twitter runs a set of large Hadoop clusters that are among the biggest in the world. Hadoop forms the core of the platform that works with the microblogging service's data and provides bulk storage of analytics on the actions of Twitter's users. In today's material we will talk about working with ViewFs, Hadoop's client-side file system.
ViewFs simplifies the interaction of components in Twitter's HDFS infrastructure by creating a single namespace covering all data centers and clusters. HDFS Federation helps scale the file system, and NameNode High Availability improves the reliability of the namespace. Used together, these features add considerable complexity to administering and using multiple Hadoop clusters of different versions. ViewFs spares Twitter's engineers from memorizing complicated URIs; simple path aliases are used instead.
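In stock Hadoop, such path aliases are declared through a ViewFs mount table in the client configuration. A minimal illustrative fragment (the cluster name and NameNode hosts below are invented for the example):

```xml
<configuration>
  <!-- Clients address the federated view, not a specific NameNode. -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <!-- /user in the view resolves to one namespace... -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./user</name>
    <value>hdfs://nn1.example.com:8020/user</value>
  </property>
  <!-- ...while /logs lives in another. -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./logs</name>
    <value>hdfs://nn2.example.com:8020/logs</value>
  </property>
</configuration>
```

With a mount table like this, engineers use paths such as `viewfs://clusterX/user/alice` regardless of which NameNode actually serves the data.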
At Twitter's scale, configuring ViewFs is itself a complex challenge, so the company's specialists developed their own version of the file system, TwitterViewFs. It generates new configuration dynamically, making it possible to obtain a complete view of the file system's state.
3 years ago
In recent years NoSQL and BigData have become very popular in the IT industry, and thousands of projects have been successfully built on NoSQL. At various conferences and forums, listeners often ask how to upgrade or migrate old (legacy) systems to NoSQL. Fortunately, we gained that experience moving from SQL to NoSQL in a large project, SMEV 2.0, which I will describe under the cut.
3 years ago
Lately, both inside my team and outside it, I often encounter different interpretations of the terms "Big Data" and "Data Mining". Because of this, misunderstanding grows between the Contractor and the Customer about the technologies being offered and the result both parties want. The situation is aggravated by the lack of precise definitions from any recognized standards body, and by the wildly varying cost of the work in the eyes of a potential buyer.
An opinion has formed in the market that "Data Mining" means shipping a database dump to the Contractor, who finds a couple of trends in it, generates a report, and collects his million rubles. With "Big Data" everything is even more interesting: people think it is something out of black magic, and magic is expensive.
The goals of this article are to show that there is no essential difference between the interpretations of these terms, and to clear up the main points of confusion around the subject.
Visiting my favorite sites once again, I found a great article by Tom Hayden on using Amazon Elastic MapReduce (EMR) and mrjob to compute win/loss statistics over a chess dataset he downloaded from the millionbase archive and started playing with using EMR. Since the data volume was only 1.75 GB, describing 2 million chess games, I was skeptical about using Hadoop for the task, although his intentions were clear: simply to play around and explore, on a real example, the usefulness of mrjob and the EMR infrastructure.
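Hayden's actual mrjob code is in his article; purely to illustrate the kind of win/loss tally involved, here is a minimal plain-Python sketch over PGN-style `[Result "..."]` tags (the parsing is deliberately simplified, and the sample games are invented):

```python
from collections import Counter

def game_results(pgn_lines):
    """Extract each game's outcome from PGN [Result "..."] tags."""
    for line in pgn_lines:
        if line.startswith('[Result "'):
            yield line.split('"')[1]  # "1-0", "0-1" or "1/2-1/2"

def win_loss_stats(pgn_lines):
    """Tally white wins, black wins and draws, reducer-style."""
    return Counter(game_results(pgn_lines))

games = [
    '[Result "1-0"]',
    '[Result "1/2-1/2"]',
    '[Result "1-0"]',
    '[Result "0-1"]',
]
```

On a 1.75 GB file this whole computation fits comfortably on one machine, which is exactly the author's point about Hadoop being overkill here.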
3 years, 1 month ago
Dear colleagues! We are glad to announce that in a month, on October 21, the Dell Solutions Forum will be held in Moscow for the fourth time! The Radisson Slavyanskaya will once again be the venue of our meeting, and we invite everyone to take part.