
Many people face the question of buying or selling property, and an important criterion is not to pay more, or sell for less, than other comparable options would warrant. The simplest method is comparative: take the average price per square meter in a given location and then add or subtract a percentage of the cost for the merits and shortcomings of the specific apartment, based on expert judgment. But this approach is labor-intensive, imprecise, and cannot account for the full variety of differences between apartments. So I decided to automate the selection of real estate, using data analysis to predict a "fair" price. This post describes the main stages of that analysis: the best predictive model is chosen from eighteen candidates on the basis of three quality criteria, the best (undervalued) apartments are immediately marked on a map, and all of it runs in a single web application built with R.



Data collection


With the problem defined, the next question is where to get the data. In Russia there are several major real estate search sites, and there is also the WinNER database, which holds the largest number of listings, has a reasonably convenient interface, and supports export to CSV. But while that database used to offer short-term access (by the minute, hour, or day), the minimum subscription is now three months, which is excessive for an ordinary buyer or seller (though if you approach the matter seriously, the expense may be justified). That route would have been too easy anyway, so we take another: parsing one of the real estate websites. Of the several options I chose the most convenient and well-known one, cian.ru. At the time its listings were presented in a simple tabular form, but when I tried to parse the pages directly in R, I failed: some cells could be only partially filled, I could not anchor on keywords or characters with standard R functions, and the remaining choices were either loops (a poor fit for R on such a task) or regular expressions, which I did not know at all. The alternative turned out to be the excellent import.io service, which extracts information from pages (or whole sites), has a REST API, and returns results as JSON. And R can both call the API and parse the JSON.
Having quickly gotten used to the service, I built an extractor on it that pulls all the required information (every parameter of each apartment) from a single page. R then walks through all the pages, calling this API for each one and gluing the JSON results into a single table (a minimal sketch of such a call follows the list below). import.io can also build a fully autonomous API that would walk all the pages itself, but I decided it was more logical to delegate to a third-party API only what I could not do properly myself (parsing a single page) and to keep everything else in R. So the data source is chosen; now the constraints of the future model:
  1. The city of Moscow
  2. One apartment type per model (that is, one-room, two-room, or three-room flats)
  3. Within a single metro station (geocoding is used, along with the distance to the nearest station)
  4. Resale properties only (new buildings differ from one another fundamentally and in quality, and the listings either say nothing about this or bury it in the comments; neither allows an adequate model to be built)
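
For illustration, a call to the import.io API from R comes down to one HTTP request and one JSON parse. Below is a minimal sketch using the httr and jsonlite packages; the extractor GUID, query parameters, and response fields are hypothetical placeholders, not the actual extractor from this project.

library(httr)
library(jsonlite)

# Fetch one listings page through a (hypothetical) import.io extractor
get.page <- function(page.url, api.key) {
  resp <- GET("https://api.import.io/store/data/EXTRACTOR-GUID/_query",
              query = list(input = paste0("webpage/url:", page.url),
                           `_apikey` = api.key))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  as.data.frame(parsed$results)  # one row per apartment on the page
}

# Walk all result pages and glue them into a single table
# pages     <- paste0(search.url, "&page=", 1:n.pages)
# raw.flats <- do.call(rbind, lapply(pages, get.page, api.key = my.key))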


Data overview


As often happens, the data contain outliers, missing values, and outright deception, such as new buildings passed off as resales, or land plots offered instead of apartments.
So the first task is to bring the data to a "tidy" form.

Checking for deception

The main focus here is the listing description: observations whose text contains words clearly irrelevant to my search are excluded, which is checked with R's grep function. And since R computations are vectorized, the function returns the indices of all valid observations at once, so applying it filters the sample in one step.
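
As a sketch, the filter can look like this (the stop-word list is illustrative; grepl returns a logical vector over all descriptions at once):

# Drop listings whose description mentions obviously irrelevant things
stop.words <- "new building|developer|land plot|plot for sale"
flats <- flats[!grepl(stop.words, flats$desc, ignore.case = TRUE), ]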

Checking for missing values

Missing values are fairly rare (and mostly occur in the dubious listings with new buildings and land plots, which have already been excluded by this point), but something must be done with the ones that remain. There are essentially two options: drop such observations or impute the missing values. Since I did not want to sacrifice observations, and working from the assumption that fields are left blank on the principle of "why fill it in, it's obvious, it's like everything around it," I decided to replace missing qualitative variables with the mode of their values and quantitative ones (the areas) with the median. Of course, this is not entirely correct; the proper approach would be to measure the correlation between observations and fill the gaps accordingly, but for this task I considered that excessive, especially since there are few such observations.
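
A sketch of that imputation step, assuming illustrative column names:

# Mode of a qualitative variable: its most frequent level
mode.value <- function(x) names(which.max(table(x)))

for (v in c("house.type", "balcony", "wc")) {               # qualitative
  flats[[v]][is.na(flats[[v]])] <- mode.value(flats[[v]])
}
for (v in c("total.area", "living.area", "kitchen.area")) { # quantitative
  flats[[v]][is.na(flats[[v]])] <- median(flats[[v]], na.rm = TRUE)
}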

Checking for outliers

Outliers are rarer still, and can only be quantitative, namely in price and in the areas. Here I made the assumption that the buyer (me) knows fairly precisely at what price, and roughly what area, his apartment should be; so by setting initial upper and lower bounds on the price and limits on the area, we get rid of outliers automatically. And even if not (say the bounds are left wide), then on seeing an outlier in the results or on the scatter plot, one can rerun the query with tighter limits, removing those observations and improving the model.
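
In code the bounds filter is a one-liner; the limits below are examples of what the UI sliders would supply:

# Price and area bounds double as outlier removal
flats <- subset(flats,
                price >= 4e6 & price <= 15e6 &
                total.area >= 30 & total.area <= 80)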

Key assumptions of the Gauss-Markov theorem

So, the invalid observations have been removed; before running the analysis, let us assess the data against the key assumptions of the Gauss-Markov theorem.
Running ahead, I will say that we do not need the model for interpreting coefficients or for correctly estimating their confidence intervals; we need it to forecast the "theoretical" price, so some of the assumptions are not critical for us.
  1. The model is correctly specified. Broadly yes: after excluding outliers and invalid listings and imputing missing values, the model is quite adequate. Some mild multicollinearity may be present (for example, a five-story building and the absence of an elevator, or total vs. living area), but as noted above this is not critical for forecasting, and moreover it does not violate the core assumptions. When building the test models, all values were properly converted to dummy variables, so strict multicollinearity is ruled out.
  2. All regressors are deterministic and not all identical. Yes, this holds too.
  3. The errors have no systematic component. True, since OLS includes an intercept term, which absorbs any systematic error.
  4. The error variance is constant (homoscedasticity). Since bounds are imposed on the regressors and the dependent variable (keeping the scales comparable), heteroscedasticity is minimal, and again it is not critical for forecasting (it makes the standard errors inconsistent, but we are not interested in those).
  5. The errors are uncorrelated (no endogeneity). Here, no: endogeneity is most likely present (for example, "neighbor" apartments in the same building or entrance), i.e., there is some unaccounted external factor. But again, for forecasting, endogeneity is not fundamental, and besides, we do not know that unaccounted factor.

On the whole the data satisfy the assumptions, so we can proceed to building models.


The set of regressors


In addition to the variables obtained directly from the site, I decided to add extra regressors: whether the building is a five-story one (usually a very telling qualitative difference) and the distance to the nearest metro station (likewise). For distances, the Google geocoding API is used (chosen as the most accurate and generous with limits, and a ready-made function exists in R): first the apartment and metro addresses are geocoded with the geocode function from the ggmap package, and the distance is then computed with the haversine formula via the ready distHaversine function from the geosphere package.
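A sketch of how this regressor is built (the addresses are illustrative; current ggmap versions additionally require a Google API key):

library(ggmap)      # geocode()
library(geosphere)  # distHaversine()

metro.xy <- geocode("Moscow, Taganskaya metro station")  # station lon/lat
flat.xy  <- geocode(paste("Moscow,", flats$address))     # lon/lat per flat

# distHaversine() takes c(lon, lat) pairs and returns meters
flats$metro.dist <- distHaversine(cbind(flat.xy$lon, flat.xy$lat),
                                  c(metro.xy$lon, metro.xy$lat))
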
The final count of regressors came to 14:
  1. Distance to the metro
  2. Total area
  3. Living area
  4. Kitchen area
  5. House type
  6. Presence and types of elevators
  7. Presence and type of balcony
  8. Number and types of bathrooms
  9. Window orientation
  10. Presence of a phone line
  11. Sale type
  12. First floor
  13. Last floor
  14. Five-story building


The predictive models tested


Beyond its practical personal value, it was also interesting to test different models, so to choose the best one I decided to run different samples through all the simple regression models I know. The following models were tested:

1. OLS on all regressors
2. OLS with log transformations (various combinations: log of the price and/or the areas and/or the distance to the metro)
3. OLS with inclusion and exclusion of regressors
a) sequential stepwise elimination of regressors
b) forward selection
4. Penalized models (to reduce the influence of heteroscedasticity)
a) lasso regression (with two ways of choosing the shrinkage parameter: minimizing Mallows' Cp and cross-validation)
b) ridge regression (with three ways of finding the penalty parameter: the HKB and LW methods and cross-validation)
5. Principal component regression
a) with all regressors
b) with stepwise elimination of regressors
6. Quantile (median) regression (to reduce the influence of heteroscedasticity)
7. The random forest algorithm
In total, 18 models were tested.
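
A condensed sketch of fitting several of these candidates (assuming a data frame d whose first column is price and whose remaining columns are the dummy-coded regressors):

library(MASS)          # lm.ridge()
library(lars)          # lasso
library(pls)           # pcr()
library(quantreg)      # rq()
library(randomForest)

m.ols   <- lm(price ~ ., data = d)                                 # 1: full OLS
m.log   <- lm(log(price) ~ ., data = d)                            # 2: log-price OLS
m.step  <- step(m.ols, direction = "backward", trace = 0)          # 3a: stepwise
m.lasso <- lars(as.matrix(d[, -1]), d$price, type = "lasso")       # 4a: lasso
m.ridge <- lm.ridge(price ~ ., data = d, lambda = seq(0, 10, 0.1)) # 4b: ridge
m.pcr   <- pcr(price ~ ., data = d)                                # 5: principal components
m.med   <- rq(price ~ ., tau = 0.5, data = d)                      # 6: median regression
m.rf    <- randomForest(price ~ ., data = d)                       # 7: random forest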

In preparing the models, the following material was used in part:
Mastitsky S.E., Shitikov V.K. (2014) Statistical Analysis and Data Visualization with R.
E-book, available at: r-analytics.blogspot.com


Model performance criteria


The models are fundamentally different in nature, and many of them have no likelihood function, so internal quality criteria cannot be defined; nor would it be quite correct to rely on such criteria for choosing an effective model, since they serve primarily to assess a model's adequacy. We will therefore judge model quality by the average difference between the observed and predicted values. To make things more interesting, and because the mean squared error is not always indicative for judging benefit (value for money), since a few observations can distort it, I used not one criterion but three: the usual RMSE (root mean square error), MAE (mean absolute error), and MPE (mean percentage error).
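
The three criteria are easy to define by hand (y is the observed price, yhat the prediction on held-out data; sign conventions for MPE vary, a signed version is shown):

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))    # root mean square error
mae  <- function(y, yhat) mean(abs(y - yhat))         # mean absolute error
mpe  <- function(y, yhat) mean((y - yhat) / y) * 100  # mean percentage error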


Model testing results


Since the models are fitted by different functions, and the syntax of their predictions differs too, simply declaring that some regressors are factors does not work for all of them. So an additional data frame was created in which all qualitative variables were converted to dummy variables, and the models were built on that. This allows all the models to be fitted uniformly, new prices to be predicted, and errors to be computed.
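
A sketch of that conversion; model.matrix() expands every factor into 0/1 columns, and dropping the intercept column keeps the reference-level coding that rules out strict multicollinearity:

X <- model.matrix(price ~ ., data = tidy.flats)[, -1]  # drop the intercept column
d <- data.frame(price = tidy.flats$price, X)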
On various random test runs (different metro stations, different apartment types, other parameters), all the models above were evaluated on the three quality criteria. The almost undisputed winner (92% of the time) on all three criteria was the random forest algorithm. On different samples, decent results by some criteria were also shown by median regression, OLS with a log-transformed price, full OLS, and occasionally ridge and lasso. The results are a little surprising: I had expected the penalized models to beat full OLS, but that was not always the case. So a simple model (OLS) can be a better alternative than a more complex one. Since on different samples and criteria the places from second downward were taken by different models while random forest stayed the winner, I decided to use it for further work.
With only one model in use, there is no longer any need to spell out the dummy variables, so we return to the original data frame, declaring the qualitative variables as factors. This simplifies later interpretation on the plots, and it is easier for the algorithm (though it is essentially indifferent). For test modeling the randomForest function was used (from the package of the same name) with default values; after experimenting with the key tree-complexity parameters nodesize, maxnodes, and nPerm, I found that slightly better forecast errors across samples are achieved by setting nodesize (the minimum node size) to 1. So, the model is chosen.
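
A sketch of the final fit and of the undervaluation columns used later on the map (the column names mirror those in the map code below; treat them as assumptions):

library(randomForest)

rf <- randomForest(price ~ ., data = tidy.flats, nodesize = 1)
tidy.flats$predicted    <- predict(rf, tidy.flats)
tidy.flats$abs.discount <- tidy.flats$price - tidy.flats$predicted  # negative = undervalued
tidy.flats$otn.discount <- 100 * tidy.flats$abs.discount / tidy.flats$predicted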


Displaying results on the map


The winner has been chosen (random forest), and this model predicts "theoretical" prices for all observations with minimal errors. We could now compute the absolute and relative undervaluation and show the result as a sorted table, but beyond a tabular view we want immediate visual clarity, so we put the several best results straight onto a map. For this, R has the googleVis package, which integrates with Google Maps (there is a package for Leaflet as well). I stayed with Google because coordinates obtained from its geocoder may not be displayed on other maps. Rendering on the map takes a single gvisMap call from the googleVis package.
map rendering code
output$view <- renderGvis({  # view is an htmlOutput element
  if (err() != "") return(NULL)

  formap3 <- formap()
  formap3$desc <- paste0(row.names(formap3),
                         ". №",
                         formap3$number,
                         " ",
                         formap3$address,
                         " undervalued by ",
                         format(-formap3$abs.discount, big.mark = ""),
                         " rubles (",
                         as.integer(formap3$otn.discount),
                         "%)")
  gvisMap(formap3, "coord", "desc", options = list(
    mapType = 'normal',
    enableScrollWheel = TRUE,
    showTip = TRUE))
})



Web-based graphical user interface


Passing all the required parameters through the console is slow and inconvenient, so I wanted to automate everything. Traditionally, R can be used here too, with the shiny and shinydashboard frameworks, which offer sufficient input and output controls.
full code of the client side of the interface
dashboardPage(
  dashboardHeader(title = "Mining Property v0.9"),

  dashboardSidebar(
    sidebarMenu(
      menuItem("Source data",  tabName = "Source"),
      menuItem("Summary",      tabName = "Summary"),
      menuItem("Raw data",     tabName = "Raw"),
      menuItem("Tidy data",    tabName = "Tidy"),
      menuItem("Predict data", tabName = "Predict"),
      menuItem("Plots",        tabName = "Plots"),
      menuItem("Result map",   tabName = "Map")
    )
  ),
  dashboardBody(
    tags$head(tags$style(HTML('.box { overflow: auto; }'))),

    tabItems(
      tabItem("Source",
        box(width = 12,
          fluidRow(
            column(width = 4,
              selectInput("Metro", "Metro station", "", width = '60%'),
              hr(),
              # checkboxInput("Kind.home0", "all", TRUE),
              checkboxGroupInput("Kind.home", "House type", c(
                "panel" = 1,
                "Stalin-era" = 7,
                "panel-board" = 8,
                "brick" = 2,
                "monolithic" = 3,
                "brick-monolithic" = 4,
                "block" = 5,
                "wooden" = 6), selected = c(1, 2, 3, 4, 5, 6, 7, 8)),
              hr(),
              sliderInput("Etag", "Floor", min = 0, max = 100, value = c(0, 100), step = 1),
              checkboxInput("EtagP", "not the last"),
              sliderInput("Etagn", "Floors in the house", min = 0, max = 100, value = c(0, 100), step = 1),
              submitButton("Analyze", icon("refresh"))
            ),

            column(width = 4,
              selectInput("Rooms", "Rooms", c(
                "",
                "1" = "&room1=1",
                "2" = "&room2=1",
                "3" = "&room3=1"), width = '45%'),
              hr(),
              selectInput("Balcon", "Balcony",
                c("balcony optional" = "0",
                  "only with a balcony" = "&minbalkon=1",
                  "only without a balcony" = "&minbalkon=-1"),
                width = '45%'),
              br(),
              hr(),
              br(),
              sliderInput("KitchenM", "Kitchen area", min = 0, max = 25, value = c(0, 25), step = 1),
              sliderInput("GilM", "Living area", min = 0, max = 100, value = c(0, 100), step = 1),
              sliderInput("TotalM", "Total area", min = 0, max = 150, value = c(0, 150), step = 1)
            ),

            column(width = 4,
              sliderInput("Price", "Price", min = 0, max = 50000000, value = c(0, 50000000), step = 100000, sep = ""),
              selectInput("Deal", "Transaction type",
                c("any" = "0",
                  "free sale" = "&sost_type=1",
                  "alternative" = "&sost_type=2"),
                width = '45%'),
              br(),
              hr(),
              radioButtons("wc", "Bathroom",
                c("doesn't matter" = "",
                  "separate" = "&minsu_r=1",
                  "combined" = "&minsu_s=1")),
              hr(),
              selectInput("Lift", "Elevators (at least)",
                c("0" = 0,
                  "1" = "&minlift=1",
                  "2" = "&minlift=2",
                  "3" = "&minlift=3",
                  "4" = "&minlift=4"),
                width = '45%'),
              hr(),
              selectInput("obs", "Apartments to show on the map:", c(1:10), selected = 5, width = 250),
              textOutput("flat")
            )
          ),
          fluidRow(htmlOutput("hyperf1")),
          fluidRow(textOutput("testOutput"))
        )
      ),
      tabItem("Raw",     box(dataTableOutput("Raw"), width = 12, height = 600)),
      tabItem("Summary", box(verbatimTextOutput("Summary"), width = 12, height = 600)),
      tabItem("Tidy",    box(dataTableOutput("Tidy"), width = 12, height = 600)),
      tabItem("Predict", box(dataTableOutput("Predict"), width = 12, height = 600)),
      tabItem("Plots",   box(width = 12, plotOutput("RFplot", height = 275), plotOutput("r2", height = 275))),
      tabItem("Map",     box(width = 12, htmlOutput("view"), DT::dataTableOutput("formap2"), height = 600))
    )
  )
)



The result of all this is a convenient application with a graphical interface whose side menu has essentially two key items (the rest are for inspection): the first and the last. The first item (Source data, fig. 1) sets all the required search and evaluation parameters for the apartments (mirroring cian's own).

Fig. 1. The Source data tab

The other menu items display:
  • a summary report (Summary) on the regressors
  • data tables: raw (Raw data) as parsed, tidy (Tidy data) after cleaning, parameter adjustment, and geolocation, and the final table (Predict data) with the predicted prices
  • three plots (Plots, fig. 2): the model's accuracy, the importance of the regressors in the random forest (almost always all regressors matter), and a scatter plot of observed vs. predicted prices


Fig. 2. The Plots tab

And the last item (Result map, fig. 3) displays what all this was started for: the map with the selected best results, and below it a table with the computed predicted price and the apartments' main characteristics.

Fig. 3. The Result map tab

The table also includes a link (*) that jumps straight to the original listing. The DT package makes this integration possible (embedding JS elements in the table).
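
A sketch of that link column (the url field is hypothetical); escape = FALSE tells DT to render the <a> tags instead of printing them as text:

library(DT)

formap2$link <- paste0('<a href="', formap2$url, '" target="_blank">*</a>')
datatable(formap2, escape = FALSE)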

Conclusion


To sum up how it all works:
  1. The initial query is set on the first page using the input controls
  2. Based on these inputs, the query string is formed (and shown as a hyperlink for checking)
  3. The string, with the page indicated, is passed to the import.io API (in the middle of this project, cian changed its output layout; thanks to import.io I retrained the extractor in literally five minutes)
  4. The JSON received from the API is processed
  5. All pages are traversed (a progress bar is shown while this runs)
  6. The tables are glued together, checked (invalid values excluded, missing values imputed), and brought to a uniform form suitable for analysis
  7. Addresses are geocoded and distances computed
  8. A random forest model is built
  9. Predicted prices and the absolute and relative deviations are computed
  10. The best results are shown on the map and in the table beneath it (the number of apartments shown is set on the first page)

The whole run (from submitting the query to rendering on the map) takes less than a minute (most of that time goes to geocoding, due to Google's limits for non-commercial use).

With this post I wanted to show how, for a simple everyday need, many small but substantially different and interesting subtasks were solved within a single application:
  • crawling
  • parsing
  • integration with a third-party API
  • JSON processing
  • geocoding
  • working with different regression models
  • evaluating their quality by different measures
  • geolocation
  • display on a map

On top of all that, everything is implemented in a convenient graphical application that can run locally or be hosted online, and it is all done in R alone (apart from import.io), with a minimum of code in a simple and elegant syntax. Of course, some things are not accounted for, such as a house next to a highway or the apartment's condition (listings do not carry it), but the final ranked list of options, shown at once on the map and linked to the original listings, makes choosing an apartment considerably easier, and as a bonus I learned a lot of new things about R.

This article is a translation of the original post at habrahabr.ru/post/264407/