Developers Club geek daily blog

Contingency tables and non-negative matrix factorization

1 year, 11 months ago
Non-negative matrix factorization (NMF) represents a matrix V, exactly or approximately, as the product of matrices W and H, where all elements of the three matrices are non-negative. This decomposition is used in many fields, for example biology, computer vision, and recommender systems. This post looks at contingency tables of sociological and marketing data, where factorization helps to reveal the structure of the data in these tables.
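As a rough illustration of the idea, here is a minimal Python sketch that factorizes a small contingency table using the classic multiplicative update rules. Everything in it is an assumption made for illustration: the table values are invented, the rank of 2 is arbitrary, and the post itself may use a different algorithm or an R package.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate v by w * h with non-negative factors (multiplicative updates)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so the factors stay non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Frobenius norm of the residual v - w * h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw)) ** 0.5

# Hypothetical contingency table: rows are respondent groups, columns are answers.
table = [
    [30, 20, 5],
    [25, 25, 10],
    [5, 15, 40],
    [2, 10, 45],
]

w, h = nmf(table, rank=2)
err = frobenius_error(table, w, h)
print("W =", [[round(x, 2) for x in row] for row in w])
print("H =", [[round(x, 2) for x in row] for row in h])
print("residual =", round(err, 3))
```

Each row of H can be read as a latent answer profile, and each row of W shows how strongly a respondent group loads on those profiles.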


Read more »


Analysis of open data in R, part 1

1 year, 11 months ago

Introduction


At the time of writing, most applications built on open data (on the official sites data.mos.ru/apps and data.gov.ru) are interactive reference guides to the infrastructure of a city or settlement, with visualization and often with an option to choose the optimal route. The aim of this and subsequent posts is to draw the community's attention to strategies for analyzing open data, including analysis aimed at forecasting, building statistical models, and extracting information that is not given explicitly. The tools used are the R language and the RStudio development environment.

Read more »


Visualization of static and dynamic networks in R, part 1

1 year, 11 months ago
Many systems and phenomena can be represented as networks, i.e. sets of objects and the connections between them. A network is not only an abstraction but also a visual tool for presenting data. It can show the importance of a particular object and the weight of each connection, identify key groups of elements, highlight them, and emphasize the connections between them. The main task of visualization is to present key information about the properties of a system or phenomenon in the form that is easiest to perceive. Ideally, analyzing a system and visualizing the results can be done within a single tool. R, with its extensive set of packages, makes this possible.

Read more »


R and Python — worthy rivals?

1 year, 12 months ago


Happy Friday, dear readers!

In the history of the computer books department of the St. Petersburg publishing house there have been few books as successful as Michael Dawson's "Programming in Python", and few topics as controversial as the amazing R language, which is firmly established among Amazon's bestselling subjects. We are now negotiating with the rights holders for a remarkable new book on Python, but at the same time we wanted to gauge public opinion about R: is it worth publishing new books about this elite language for gurus of big statistics, or will Python easily overpower it, and not just Apollo?

Read on below the cut!

Read more »


Statistical analysis of association rules in survey results

2 years ago
The previous part of this article covered a method for mining association rules in data from the European Social Survey. This part is about the statistical analysis of the rules obtained. The key point is that classical statistical techniques, for example the chi-square goodness-of-fit test, are not justified for survey results. But why? And how can hypotheses be tested instead? That is the subject of this post.



Read more »


Black archaeology of data mining: how dangerous are "leaks" of big data?

2 years ago
In 2014 a large database of passwords for various mail services, about 6 million records, leaked online. Let's see how many of these passwords are still valid now, in 2015.


Read more »


Mining association rules in survey results

2 years ago
Association rule mining is a well-known method of data analysis. Habr has already published a post with the historical background of this method and general definitions. This article covers adapting the association rule mining algorithm to data obtained from surveys of respondents. The algorithm's results are demonstrated on data from the European Social Survey (ESS).


Photo: Owen Humphreys/AP

Read more »


Sentiment analysis of tweets: implementation with an example in R

2 years ago
Social networks (Twitter, Facebook, LinkedIn) are perhaps the most popular free platforms available to the general public for expressing thoughts on all kinds of occasions. Millions of tweets (posts) are published daily, and they hold a huge amount of information. In particular, Twitter is widely used by companies and ordinary people to describe the state of affairs and to promote products or services. Twitter is also an excellent data source for text mining, ranging from the logic of behavior and events to the sentiment of statements and the prediction of stock market trends. It holds a huge array of information for intelligent and contextual text analysis.

In this article I will show how to carry out a simple sentiment analysis. We will download Twitter messages on a particular topic and compare them against a database of positive and negative words. The ratio of the positive to negative words found is called the sentiment ratio. We will also create functions for finding the most frequent words. These words can give useful contextual information about public opinion and the sentiment of statements. The dataset of positive and negative opinion words (sentiment words) is taken from Hu and Liu, KDD-2004.

The implementation is in R and uses twitteR, dplyr, stringr, ggplot2, tm, SnowballC, qdap, and wordcloud. Before use, these packages must be installed and loaded with the install.packages() and library() commands.
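The core of the approach described above can be sketched in a few lines. This is a Python stand-in rather than the article's R code, and it rests on several assumptions: the two word sets are tiny hypothetical substitutes for the Hu and Liu lexicon, the messages are invented instead of being downloaded from Twitter, and the helper names are made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons standing in for the full opinion word lists.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_counts(messages):
    """Count how many tokens fall into the positive and negative lexicons."""
    pos = neg = 0
    for message in messages:
        for word in tokenize(message):
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return pos, neg

def sentiment_ratio(messages):
    """Ratio of positive to negative words; infinite if no negative words occur."""
    pos, neg = sentiment_counts(messages)
    return pos / neg if neg else float("inf")

def most_frequent_words(messages, n=3, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping any given stopwords."""
    counts = Counter(
        word for message in messages for word in tokenize(message)
        if word not in stopwords
    )
    return counts.most_common(n)

# Invented sample messages, used only to exercise the functions.
messages = [
    "I love this phone, the camera is great",
    "Terrible battery, bad support",
    "Great screen and great price",
]

pos, neg = sentiment_counts(messages)
ratio = sentiment_ratio(messages)
top = most_frequent_words(messages, n=1)
print(pos, neg, ratio, top)
```

A ratio above 1 means positive words outnumber negative ones in the sample. A real analysis would also need stemming and stopword removal, which the tm and SnowballC packages named above provide in R.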

Read more »


Black archaeology of data mining: what can be more effective than a dictionary attack?

2 years ago
For those too lazy to read further, here is the answer right away: the "login equals password" attack. According to the statistics, a login equal to the password occurs more often than the most common password from the dictionary. The rest of the article presents some statistical research on this topic and the story that started it all.



Read more »


Visualization of results in R: first steps

2 years ago
In one of our previous posts we wrote about a central concept of statistics, the p-value. While debates about how to interpret the p-value continue in the scientific community, a considerable share of studies still use it to determine the significance of the differences they find. Today we will talk about the most creative stage of data processing: how to visualize significant differences.

Read more »