2 years, 6 months ago
Hi, Habr! This article is about a rather unpleasant aspect of machine learning: hyperparameter optimization. Two weeks ago the vw-hyperopt.py module was merged into the well-known and useful Vowpal Wabbit project. It can find good hyperparameter configurations for Vowpal Wabbit models in high-dimensional spaces. The module was developed at DCA (Data-Centric Alliance).
To search for good configurations, vw-hyperopt uses algorithms from the Python library Hyperopt and can optimize hyperparameters adaptively using the Tree-Structured Parzen Estimators (TPE) method. With the same number of iterations, this makes it possible to find better optima than simple grid search.
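To make the idea of adaptive search more concrete, here is a minimal one-dimensional sketch in the spirit of TPE. It is not the code of Hyperopt or vw-hyperopt: the `objective` function, the parameter names and the fixed kernel bandwidth are all assumptions made for illustration. Past trials are split into "good" and "bad" groups by loss, a kernel density estimate is built for each, and the next point is the candidate that looks most likely under the good density relative to the bad one.

```python
import math
import random


def objective(x):
    # Hypothetical loss to minimize; stands in for a model's validation loss.
    return (x - 2.0) ** 2


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_like_search(loss, low, high, n_trials=60, n_startup=10,
                    gamma=0.25, n_candidates=24, seed=0):
    """Simplified TPE-style minimization over [low, high].

    Returns (best_loss, best_x). All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_startup = max(2, n_startup)        # need at least one good and one bad point
    bandwidth = (high - low) / 10.0      # fixed bandwidth, chosen for simplicity
    trials = []                          # list of (loss, x) pairs

    for i in range(n_trials):
        if i < n_startup:
            x = rng.uniform(low, high)   # warm-up phase: plain random search
        else:
            ranked = sorted(trials)      # lowest loss first
            n_good = min(len(ranked) - 1, max(1, int(gamma * len(ranked))))
            good = [x for _, x in ranked[:n_good]]
            bad = [x for _, x in ranked[n_good:]]
            good_density = kde(good, bandwidth)
            bad_density = kde(bad, bandwidth)
            # Sample candidates near good points, clipped to the search range.
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(n_candidates)
            ]
            # Keep the candidate with the highest good/bad density ratio.
            x = max(candidates,
                    key=lambda c: good_density(c) / (bad_density(c) + 1e-12))
        trials.append((loss(x), x))

    return min(trials)


if __name__ == "__main__":
    best_loss, best_x = tpe_like_search(objective, -10.0, 10.0)
    print(f"best x = {best_x:.3f}, loss = {best_loss:.5f}")
```

Unlike grid search, which fixes its points in advance, each new point here depends on the results of the previous trials. That dependence is what "adaptive" means in the description above.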
This article will interest everyone who works with Vowpal Wabbit, especially those who were annoyed that the source code offered no way to tune the models' numerous knobs, and who either tuned them manually or coded the optimization themselves.
2 years, 6 months ago
Today Yandex.Metrica is not only a web analytics system but also AppMetrica, an analytics system for applications. The input to Metrica is a data stream: events that take place on websites or in applications. Our task is to process this data and present it in a form suitable for analysis.
But processing the data is not the problem. The problem is how, and in what form, to store the processing results so that they are convenient to work with. During development we had to completely change our approach to data storage several times. We started with MyISAM tables, used LSM trees, and eventually arrived at a column-oriented database. In this article I want to explain what forced us to do this.
Yandex.Metrica has been running since 2008, more than seven years. Each change of approach to data storage was caused by a given solution working too poorly: it had too little performance headroom, was insufficiently reliable and caused many operational problems, used too many computing resources, or simply did not let us implement what we wanted.
If you have long been interested in how Big Data is applied in different areas of business, science and public administration, and have wanted to hear about it from the people who work with it, then welcome to the Festival of Data, which will take place on December 19 at the SMIT high technologies exhibition in the Museum of Moscow.
Over several hours of the Festival's business program, leading industry experts from Yandex, the Beeline Data School, Data-Centric Alliance, Avito, the state unitary enterprise "NI and PI of the General Plan of Moscow", and the Higher School of Economics National Research University will tell exhibition guests about the prospects for using data analysis over the next several years.
So-called machine learning never ceases to amaze, yet for mathematicians the reason for its success is still not entirely clear.
A few years ago, at a dinner to which I had been invited, the outstanding differential geometer Eugenio Calabi volunteered to let me in on the subtleties of a rather ironic theory about the difference between pure and applied mathematicians. When pure mathematicians reach a dead end in their research, they often narrow the problem, trying to get around the obstacle that way. Their colleagues in applied mathematics, by contrast, conclude that the situation calls for learning more mathematics in order to build more effective tools.
I have always liked this approach; it makes clear that applied mathematicians will always find a use for the new concepts and structures that keep appearing in fundamental mathematics. Today, when the agenda includes the study of "big data" (blocks of information too large or too complex to be understood using traditional data processing methods alone), this tendency is as relevant as ever.
This post is about the quality of the air we breathe. It is generally believed that the air in big cities is unhealthy. That is understandable: there is traffic, there are factories, and who knows what else. All of this keeps residents of the metropolis in constant worry about "an adverse ecological situation".
2 years, 6 months ago
In this article I will describe using the non-relational database MongoDB to monitor log files. There are many tools for monitoring log files, from shell scripts run by cron to an Apache Hadoop cluster.
Monitoring text files with scripts is convenient only in the simplest cases, for example when problems are detected by the presence in the log file of lines containing "ERROR", "FAILURE", "SEVERE", and so on. For monitoring large files it is convenient to use the Zabbix system, where the Zabbix Agent (active) reads only new data and sends it to the server at a set interval.
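The two ideas in this paragraph, matching marker strings and reading only newly appended data, can be sketched in a few lines of Python. This is an illustration under assumptions, not how Zabbix Agent is implemented: the function name `read_new_alerts`, the byte-offset bookkeeping and the UTF-8 decoding are all hypothetical choices.

```python
MARKERS = ("ERROR", "FAILURE", "SEVERE")


def read_new_alerts(path, offset=0):
    """Scan data appended to `path` since byte `offset`.

    Returns (alert_lines, new_offset). Pass new_offset to the next call so
    that lines already seen are not read again.
    """
    alerts = []
    with open(path, "rb") as log:
        log.seek(offset)
        for raw in log:
            if not raw.endswith(b"\n"):
                # The writer has not finished this line yet; retry on the next poll.
                break
            offset += len(raw)
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            if any(marker in line for marker in MARKERS):
                alerts.append(line)
    return alerts, offset


if __name__ == "__main__":
    import sys

    found, position = read_new_alerts(sys.argv[1])
    for entry in found:
        print(entry)
    print("next offset:", position)
```

A script run from cron would have to persist the returned offset between runs, for example in a small state file. The sketch also ignores log rotation, which is one of the reasons this approach only suits the simplest cases.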
Good morning and welcome to GovCon7. My name is Abdulli's Hundreds, I am a lead deployment engineer at Palantir Technologies, and this is Palantir 101. Over the next half an hour to forty-five minutes I would like to tell you who we are, what Palantir is, and what it does for the organizations we work with; closer to the end, we will also give a short presentation.
Before getting to all that, I want to start with a couple of stories that should shed light on how we at Palantir think about the problem of analysis in the world of Big Data.
The first story is about chess.
Many of you know that in 1997 I took part in developing the chess supercomputer Deep Blue, which defeated Garry Kasparov, at that time the best chess player in the world. Today a tournament-level chess program can be installed on an ordinary mobile phone, and the question of who is stronger at chess, a human or a computer, is no longer relevant.
A new, more interesting question is: "What happens if a human and a computer play chess together as a team?"
First, such teams showed strong results, which is quite expected: people are good at chess and computers are very good at chess, but for different reasons. Computers have a serious tactical advantage, as they can evaluate many thousands of combinations every second; people have experience, a capacity for tricks, intuition, and the ability to read the opponent, which is hard for a computer.
When these strengths are combined, a human/computer team is able to beat both teams of the strongest players and groups of the strongest supercomputers.
2 years, 6 months ago
In the first part I said that a hash table is a little bit of LIST, SET and SORTED SET. Judge for yourself: LIST consists of ziplist/linkedlist, SET consists of dict/intset, and SORTED SET is ziplist/skiplist. We have already looked at the dictionary (dict), and in the second part of the article we will look at the structure of ziplist, the second most commonly used structure under the hood of Redis. We will also look at LIST, the other half of whose internals is a simple implementation of a linked list. This will help us take a careful look at the frequently repeated advice to optimize hash tables by replacing them with lists. We will calculate how much memory these structures need for overhead, and what price you pay for saving memory. Finally, we will sum up the results for working with hash tables when the ziplist encoding is used.
Last time we ended on the fact that 1,000,000 keys stored using ziplist occupied 16 MB of RAM, whereas the same data in a dict required 104 MB (ziplist is 6 times smaller!). Let's figure out at what price.
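As a quick check on these figures, the arithmetic can be written out. The only inputs are the numbers quoted above; treating 1 MB as 2^20 bytes is an assumption, and the helper name `bytes_per_key` is made up for this sketch.

```python
MB = 1024 * 1024          # assumption: 1 MB = 2**20 bytes
KEYS = 1_000_000          # number of keys quoted in the text

ZIPLIST_TOTAL = 16 * MB   # memory used with the ziplist encoding
DICT_TOTAL = 104 * MB     # memory used with dict


def bytes_per_key(total_bytes, keys=KEYS):
    """Average memory footprint of one key, in bytes."""
    return total_bytes / keys


ratio = DICT_TOTAL / ZIPLIST_TOTAL

if __name__ == "__main__":
    print(f"ziplist: {bytes_per_key(ZIPLIST_TOTAL):.1f} bytes per key")
    print(f"dict:    {bytes_per_key(DICT_TOTAL):.1f} bytes per key")
    print(f"dict / ziplist = {ratio}")
```

The ratio comes out at 6.5, which the text rounds down to "6 times".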
2 years, 6 months ago
Kudu was one of the new products presented by Cloudera at the Strata + Hadoop World 2015 conference. It is a new big data storage engine created to fill the niche between two existing engines: the distributed file system HDFS and the column-oriented database HBase.
The existing engines are not without shortcomings. HDFS copes very well with scanning large volumes of data but performs poorly on lookup operations. With HBase it is exactly the opposite. In addition, HDFS has a further restriction: it does not allow already written data to be modified. The new engine, according to its developers, has the advantages of both existing systems: fast-response lookups, the ability to modify data, and high performance when scanning large volumes of data.