The forecast is calculated separately for each point you request it from, and it is recalculated every time you look at it, so that it is as up to date as possible.

In this post I want to tell you a little about how the world of weather models is currently arranged, how our approach differs from the usual one, why we decided to build our own forecast, and why we believe ours will turn out better than everyone else's.

We built our own forecast using a traditional atmospheric model on the most detailed grid, but we also tried to collect every possible source of data on atmospheric conditions and statistics on how the weather actually behaves, and we applied machine learning to these data to reduce the probability of errors.

Several main models are used to forecast the weather around the world today. Examples are the open source WRF model and the GFS model, both of which started as American developments. GFS is now developed by the NOAA agency.

The WRF model is supported and developed by scientists worldwide, but it also has an official version: the American research institute NCAR, located in Boulder, Colorado, is responsible for its development and support. WRF was initially developed as two parallel branches, ARW and NMM, the latter of which has since been discontinued. The GFS and WRF models are developing in somewhat different directions: GFS focuses on global and US-oriented products, while WRF is first of all a local model that can be configured for a particular area.

## Part 1. About classical meteorology

In essence, WRF is an open source program written in Fortran (see the digression below) that reflects scientists' current understanding of the physics and dynamics of the atmosphere and, accordingly, of the weather. Like any self-respecting piece of open source software, WRF does not work out of the box. Most Linux users will probably manage to get it running, but only after spending a fair amount of time reading manuals and compiling. Even then, the quality of forecasts produced by an untuned version can be an unpleasant surprise. WRF was created to describe a complex dynamic system, the Earth's atmosphere, and therefore needs careful tuning.

A lyrical digression about Fortran

Clearly, Fortran is probably not the best choice for building large open source systems. But there are two serious reasons not to rewrite WRF in another language. First, the code is what is called time-tested: more than one generation of scientists has contributed to shaping the physical model, and the code is widely supported by research groups worldwide. Second, describing a system as complex as the environment around us requires considerable computing resources. Modern compilers such as Intel Fortran can build executables that run with maximum performance.

The model's work can be roughly divided into two parts: predicting physics and predicting dynamics. The physics modules of WRF track the amount of heat released and absorbed in the atmosphere, as well as the formation of precipitation at the right time and in the right place. Dynamics is the movement of air masses, wind patterns, the formation of cyclones and so on. Physics is handled by a set of semi-empirical models, each responsible for a particular process; dynamics is handled by a parametrized version of the Euler equations.

The picture shows a slice of the model's computational grid. The calculated temperatures are shown by the color of the cells and are mostly a consequence of physical processes, namely heating and cooling. The arrows show the transport of air masses, that is, the result of the dynamics calculation.

The Euler equations are partial differential equations. Clearly, a computer cannot solve partial differential equations without assistance. The assistance in this case consists in rewriting the equations of the mathematical model as finite difference schemes. That is, by representing derivatives as differences, one can obtain a numerical solution of the equation that is as accurate as possible.
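To make the idea of replacing derivatives with differences concrete, here is a minimal sketch in Python. It is not WRF code: it assumes a much simpler one-dimensional advection equation, du/dt + c * du/dx = 0, and a first-order upwind scheme, and the function name `advect_upwind` and all the numbers are made up for illustration.

```python
def advect_upwind(u, c, dx, dt, steps):
    """Advance du/dt + c * du/dx = 0 with a first-order upwind scheme (c > 0)."""
    courant = c * dt / dx
    if not 0 < courant <= 1:
        raise ValueError("scheme is stable only for 0 < c*dt/dx <= 1")
    u = list(u)
    n = len(u)
    for _ in range(steps):
        # The derivative du/dx is replaced by the backward difference
        # (u[i] - u[i-1]) / dx; index -1 wraps around (periodic boundary).
        u = [u[i] - courant * (u[i] - u[i - 1]) for i in range(n)]
    return u


# Hypothetical example: a warm "bump" carried by a steady 10 m/s wind
# on a grid with 1 km cells and a 100 s time step.
field = [0.0] * 20
field[5] = 1.0
result = advect_upwind(field, c=10.0, dx=1000.0, dt=100.0, steps=3)
```

With these particular numbers the ratio c*dt/dx equals 1, so each step simply shifts the bump one cell downwind. With a smaller time step the bump also smears out, which is one of the reasons real models use more elaborate schemes.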

The difficulty, however, lies not only in bringing the numerical solution as close as possible to an analytical solution that has not yet been found. It also lies in adequately parametrizing the processes that drive the atmosphere from outside. Solar radiation, thermal radiation from the soil, the influence of greenhouse gases, phase transitions of water vapor: this is an incomplete list of what must be taken into account when trying to predict the weather.

This task would probably be completely intractable if not for observations. Over the years that humankind has been interested in future weather conditions (and that is a long time), considerable experience in atmospheric measurement has accumulated: weather stations, satellite spectrometers, airborne instruments, radars, lidars, and much else besides. To obtain the minimum volume of data needed for a forecast at a level of accuracy that meets modern standards, all available sources of information must be used: more than 10,000 weather stations worldwide, more than 80 satellites in Earth orbit, and about 1,500 radiosonde stations.

Today, the atmospheric observations collected over the whole globe at a single fixed point in time amount to terabytes of station observations, radar scans and satellite images. They are not enough to describe the current state of the atmosphere completely, but they can be used to refine the model's initial conditions.

*Some of the atmospheric observation methods available to humankind*

Since the accuracy of a solution to differential equations depends largely on how accurately the initial conditions are specified, all observational data are used to build the most accurate possible field of atmospheric parameters. To combine model calculations with individual measurements there is data assimilation, a technique that inherited its well-developed mathematical apparatus from control theory and the calculus of variations. Thus, in addition to numerical models, weather forecasting is helped by synchronous observations from orbit and from ground stations.
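The simplest possible illustration of data assimilation is blending one model value with one observation according to how much each is trusted. The sketch below assumes a single scalar quantity with known error variances. Real assimilation systems work with whole fields and covariance matrices, and nothing here is taken from the actual Meteum or WRF pipeline; the function name and the numbers are hypothetical.

```python
def assimilate(background, background_var, observation, observation_var):
    """Blend a model value with one observation, weighting by error variances."""
    # The gain is close to 1 when the model is uncertain and the observation
    # is precise, and close to 0 in the opposite case.
    gain = background_var / (background_var + observation_var)
    analysis = background + gain * (observation - background)
    # The blended estimate is more certain than the model value alone.
    analysis_var = (1.0 - gain) * background_var
    return analysis, analysis_var


# Hypothetical numbers: the model says 12 degrees (variance 4),
# a nearby station reports 10 degrees (variance 1).
t, var = assimilate(background=12.0, background_var=4.0,
                    observation=10.0, observation_var=1.0)
```

Here the station is trusted four times more than the model, so the analysis lands at about 10.4 degrees, much closer to the measurement, and its variance drops to about 0.8.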

Our system of computational domains is currently designed to cover the territory of our vast homeland with forecasts of two types: on a grid with coarse resolution (6 by 6 km) and on a grid with fine resolution (2 by 2 km). These grids are nested in one another and interact, exchanging data on boundary and initial conditions.

Computing, processing and storing atmospheric parameters at such scales requires enormous computing power. More than 10 TB of forecasts arrive in storage every day. A 48-hour weather forecast at this level of detail for the Moscow region takes about 6 hours even on Yandex clusters.

The weather forecast is calculated every time a user accesses the service. This is dictated not only by the desire to give the most accurate, up-to-date forecast specifically for the user's coordinates, but also by sheer necessity. The volume of high-resolution weather data is so large that precomputing such a forecast for the whole territory of our homeland would take several hours, and no runtime database would be able to answer requests in acceptable time. This is true not only of Meteum's results but also of some of the models that feed into it. For example, the GFS and WRF models produce so much information that a dedicated API microservice was built to serve them. Unlike many databases, it stores, updates and serves data directly from the memory of the machines that make up the microservice.

This is how the layout of the computational domains in the central region of the Russian Federation looked at one of the early stages of work. Red is the outer domain with a 6 by 6 km grid; blue marks the nested domains with higher resolution (2 by 2 km).

The system that acquires, processes and analyzes data, runs the model and combines the two in the data assimilation algorithm is a chain of many links. Besides processing the incoming data correctly, their assimilation into the initial conditions for the differential equations must be configured correctly. Data assimilation theory, the science of correctly combining observations with the forecasts of mathematical models, is responsible for developing and physically justifying these algorithms. However, the observations of the state of the atmosphere that we obtain from various sources can be useful for other purposes as well.

## Part 2. About machine learning

WRF is the industry standard in weather forecasting, but there are other models too, and we receive forecasts made with them from our partners. Despite all of humankind's knowledge of atmospheric processes, its satellites and its supercomputers, all of these models, as you know, make mistakes. What is interesting is that their forecasts contain systematically recurring patterns.

In addition to the WRF model itself, which is run on Yandex clusters, we receive forecasts for 12,000 cities worldwide from one of our partners, the company Foreca. More detailed information on the global state of the atmosphere comes to us from the American Global Forecast System model, which is considered one of the most accurate global models in the world and has a resolution of 0.25 degrees.
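To relate a resolution given in degrees to the kilometer grids mentioned earlier, the spacing can be estimated with a little spherical geometry. The sketch below assumes a perfectly spherical Earth with a mean radius of 6371 km; the function name is made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def grid_spacing_km(step_deg, latitude_deg):
    """Return (north-south, east-west) spacing in km for a regular lat/lon grid."""
    km_per_degree = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = step_deg * km_per_degree
    # Meridians converge toward the poles, so east-west spacing shrinks
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 0.25 degrees at roughly the latitude of Moscow.
ns, ew = grid_spacing_km(0.25, 55.75)
```

Under these assumptions a 0.25 degree step is a little under 28 km from north to south and roughly 16 km from east to west near Moscow, which is several times coarser than the 6 km and 2 km regional grids.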

The behavior of these models in different meteorological situations lets Meteum estimate more accurately the corrections that need to be made to the forecast and choose the optimal combination of input data.

Some weather models overestimate the amount of precipitation that has fallen, while others underestimate nighttime temperatures in cities. With an archive of model forecasts at hand, one can identify many such patterns, including ones far less obvious than those I mentioned above. This is quite hard for a person because of the huge data volumes, but machine learning algorithms cope with it very well. To find patterns and relationships between model forecasts and the actual weather, we use the MatrixNet machine learning algorithm, which you already know.

MatrixNet takes specially processed archives of weather forecasts as input and compares them with data on the actual weather. Observations from thousands of professional weather stations worldwide serve as the data on actual weather.

The result of these comparisons is a forecast correction formula that, depending on the meteorological situation, produces an optimal combination of the forecasts of the models provided by our suppliers.
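The general idea of learning a correction from an archive can be shown with a toy example. This is not MatrixNet and not the Meteum formula: the sketch swaps in a far simpler technique, removing each model's average bias and weighting models by the inverse of their remaining error variance. The model names, the data and the function names are all hypothetical.

```python
from statistics import mean


def fit_blend(archive, observed):
    """archive: {model_name: [past forecasts]}, observed: [actual values]."""
    params = {}
    for name, forecasts in archive.items():
        errors = [f - o for f, o in zip(forecasts, observed)]
        bias = mean(errors)
        # Error variance that remains once the systematic bias is removed.
        variance = mean((e - bias) ** 2 for e in errors)
        params[name] = (bias, 1.0 / max(variance, 1e-9))
    total = sum(weight for _, weight in params.values())
    # Normalize the weights so that they sum to 1.
    return {name: (bias, weight / total) for name, (bias, weight) in params.items()}


def blend(params, forecasts):
    """Combine fresh forecasts using the fitted biases and weights."""
    return sum(w * (forecasts[name] - bias) for name, (bias, w) in params.items())


# Hypothetical archive: model_a runs about 2 degrees too warm but is
# consistent; model_b is unbiased but noisier.
observed = [10.0, 12.0, 8.0, 11.0]
archive = {
    "model_a": [12.5, 13.5, 10.5, 12.5],
    "model_b": [9.0, 13.0, 7.0, 12.0],
}
params = fit_blend(archive, observed)
corrected = blend(params, {"model_a": 14.0, "model_b": 11.0})
```

In this toy archive the consistently warm model ends up with most of the weight once its bias is subtracted. A real system additionally makes the correction depend on the meteorological situation, which a fixed set of weights cannot do.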

Meteum uses a huge amount of data, and this is only the beginning. For example, in a number of regions of Russia accurate meteorological measurements are rather sparse.

They are not enough to build a hyperlocal forecast. That is why Meteum is able to use a large amount of data that indicate the weather situation indirectly. We already use data from Yandex.Maps, which help us take the terrain into account. Here is one more example. Many phone models have built-in barometers. They are not as accurate as those at weather stations, but there are millions of them, and they are located where people live. Next year we will start using their data to refine the actual state of the atmosphere.

Besides, the Yandex.Weather apps already let you report what the weather is like where you are. From the start we built Meteum to be extensible to different classes of data that indirectly indicate the weather, so we will be able to take your observations into account as well.

## Part 3. Behind blue eyes

Habr readers probably have an especially clear idea that the processes described above must be backed by a properly designed infrastructure.

The model forecasts that Meteum needs are collected in real time from several different sources. The WRF and GFS forecasts held in the microservice weigh more than 60 GB and are updated every minute, and they are updated atomically, in large chunks. These requirements made traditional runtime databases unusable. Foreca's forecasts are stored in PostgreSQL, since their volume and update rate are much lower. After processing and display to the user, the results of the formula, together with its components (the suppliers' forecasts and the other factors passed to MatrixNet), go to a MapReduce cluster. These data are later used to verify and further tune Meteum.
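The post does not describe how the microservice is implemented, so the following is only a sketch of one common way to serve data from memory while replacing it atomically in large chunks: build the complete new snapshot off to the side and then swap a single reference. The class name `ForecastStore` and the cell keys are hypothetical.

```python
import threading


class ForecastStore:
    """Keeps the latest forecast snapshot in memory and swaps it in one step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = {}

    def publish(self, new_snapshot):
        # Copy the full replacement first, then swap a single reference,
        # so readers see either the old forecast or the new one, never a mix.
        frozen = dict(new_snapshot)
        with self._lock:
            self._snapshot = frozen

    def get(self, cell):
        # Grab the current reference under the lock, read outside it.
        with self._lock:
            snapshot = self._snapshot
        return snapshot.get(cell)


store = ForecastStore()
store.publish({(55, 37): 3.5, (59, 30): 1.0})   # first model run
before = store.get((55, 37))
store.publish({(55, 37): 4.0, (59, 30): 0.5})   # next run replaces everything
after = store.get((55, 37))
```

Because a snapshot is never modified after it is published, a reader that picked up the old reference just before the swap still gets a consistent answer from the previous run.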

## Part 4. Total runtime

All the processes described above happen every time a user opens Yandex.Weather. When you make a request, you send your geographic coordinates to Meteum. It collects all the data needed for the forecast, analyzes the meteorological situation and the type of underlying surface, and on the basis of these data produces its own forecast specifically for your location.

## Part 5. How it turned out

By our own estimates (alas, there are no independent measurement tools in this area yet), our weather forecast is currently more accurate than those of the competitors known to us. For example, our 24-hour temperature forecast errs 35% less than that of the closest competitor.

But we understand that we are still far from ideal, and we hope that in time we will be able to improve accuracy further thanks to data from users of the app and to additional sources of atmospheric data.

This article is a translation of the original post at habrahabr.ru/post/271725/
