Developers Club geek daily blog

3 years, 3 months ago
The previous part of this series considered non-negative matrix factorization (NMF) as a way to reduce the dimension of contingency tables and to visualize them. In this part, the resulting maps are analyzed statistically using loglinear models. As a reminder, the examples are built on complex survey data: stratified, clustered, and weighted samples. This requires special methods for estimating and selecting models. Markov networks, a convenient tool for graphically representing the interactions between the factors of loglinear models, are used to visualize the results.

A brief recap of the previous part. Using the ESS 2012 data for the population "men aged 25-40", a table was built showing the extent of support for human values in each surveyed country. An NMF of rank 5 was computed to reduce the dimension of the 29×21 matrix defined by that table. For convenience, here again is the final heatmap that positions all 29 countries in the resulting space.

Problem definition
The heatmap suggests for which countries (or clusters of countries) the hypothesis that the distribution of shares of the value variables does not depend on the country (cluster) can be rejected. These hypotheses need to be confirmed statistically. As examples we use the following groups of countries:
• Russia and Slovakia, which are neighbors according to the hierarchical clustering;
• France and Russia, as an example of countries with different profiles.

Of course, the choice is not limited to these examples; the researcher can select whichever countries or clusters of countries match their interests.
Beyond hypothesis testing there is a further question: how do the value factors interact, depending on the group of selected countries? These possible differences need to be identified.

A little about contingency tables
For the NMF, all value variables in the table were treated as a single multiple response variable. This was needed to represent the data as a two-dimensional table, that is, a table formed by two variables. In fact, the situation is somewhat different: the full set of 21 value variables together with 1 variable specifying the country defines a 22-dimensional contingency table.
It may seem surprising, but from the point of view of building statistical models, multidimensional contingency tables (with single-response variables and no missing answers) are a simpler case than tables with a multiple response variable. Moreover, NMF has reduced the dimension of the table to 6: 5 latent variables + 1 variable with the country.

Loglinear models
The classical method for analyzing a multidimensional contingency table is to build its loglinear model. Loglinear analysis can be seen as a generalization of the chi-square test to multidimensional tables. The definition of loglinear models can be found in the English Wikipedia. Materials on this subject with examples are available in Russian, and detailed lectures are available in English.
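For orientation, the simplest case can be written out. This is a sketch in standard textbook notation, which is an assumption here and not taken from the original: $m_{ij}$ denotes the expected count in cell $(i, j)$ of a two-way table with factors $A$ and $B$, and sum-to-zero constraints are used for identifiability.

```latex
% Saturated loglinear model for a two-way table
\log m_{ij} = \lambda + \lambda^{A}_{i} + \lambda^{B}_{j} + \lambda^{AB}_{ij},
\qquad
\sum_i \lambda^{A}_{i} = \sum_j \lambda^{B}_{j}
  = \sum_i \lambda^{AB}_{ij} = \sum_j \lambda^{AB}_{ij} = 0 .

% Independence model: all interaction terms vanish
\lambda^{AB}_{ij} = 0 \ \text{for all } i, j
\quad\Longleftrightarrow\quad
m_{ij} = \frac{m_{i+}\, m_{+j}}{m_{++}} .
```

Testing the independence model against the saturated one is the multidimensional counterpart of the usual chi-square test of independence.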

Before moving on to the calculations, note that in general a multidimensional contingency table defines a multinomial distribution. But when the marginal sums of this distribution are fixed along one or several dimensions, we obtain the so-called product-multinomial distribution. Additional restrictions must therefore be imposed on the parameters of loglinear models for such tables. Details can be found in chapter 12 of the book [1]. In our case the marginal sums are fixed along one dimension: the population sizes in each of the countries are constants. This means that the main effect corresponding to the country variable cannot be excluded from the model.
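The role of the country main effect can be illustrated with a minimal sketch in base R. The counts below are invented for illustration, and the model is fitted with an ordinary Poisson `glm` rather than with the survey package. While the main effect `cntry` is in the model, the fitted counts reproduce the fixed country totals exactly.

```r
# Hypothetical 2x2 table: answer (yes/no) by country (RU/SK); counts are invented
d <- data.frame(
  answer = factor(c("yes", "no", "yes", "no")),
  cntry  = factor(c("RU", "RU", "SK", "SK")),
  n      = c(120, 180, 90, 60)
)

# Independence model that keeps the country main effect
fit <- glm(n ~ answer + cntry, family = poisson, data = d)

# Country totals: observed versus fitted
observed_by_cntry <- tapply(d$n, d$cntry, sum)
fitted_by_cntry   <- tapply(fitted(fit), d$cntry, sum)

print(rbind(observed = observed_by_cntry, fitted = fitted_by_cntry))
```

Dropping `cntry` from the formula would no longer guarantee this agreement, which is why the term has to stay when the country margins are fixed by design.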

A final note. We leave aside the question of which tables for survey data should be considered sparse and, consequently, do not carry out the corresponding checks.

Defining and comparing models
As before, we use the survey package [2] for R to account for the stratification, clustering, and weighting of the sample. This was described in more detail in one of the previous publications. The parameters of loglinear models for complex survey data are exactly the same as for tables without a survey design. What needs correcting are the formulas that compute the significance of the model parameters (both individually and jointly).

We load the data, select the target population, add the latent variables to the dataset, and specify the survey design.
``````
library(foreign)
library(data.table)
library(survey)

# srv.data is assumed to hold the ESS 2012 data; loading it is not shown here
srv.variables <- data.table(name = names(srv.data), title = attr(srv.data, "var.labels"))
srv.data <- data.table(srv.data)
setkey(srv.data, cntry)
setkey(srv.variables, name)

# fr.dt, ru.dt, sk.dt: sample design data for each country, assumed to be loaded already
ru.dt[, psu := psu + 150]  # shift psu values so that they do not overlap between countries

sddf.data <- rbind(fr.dt, ru.dt, sk.dt)
setkey(sddf.data, cntry, idno)

cntries.data <- srv.data[J(c("FR", "RU", "SK"))]
cntries.data[ ,weight:=dweight*pweight]
setkey(cntries.data, cntry, idno )

cntries.data <- cntries.data[sddf.data]
cntries.data <- cntries.data[gndr == 'Male' & agea >= 25 & agea <= 40, ]

# add the latent variables a.1, a.2, ..., a.5 to cntries.data
answers <- c('Very much like me', 'Like me')

# define survey design
srv.design.data <- svydesign(ids = ~psu, strata = ~stratify, weights = ~weight, data = cntries.data)
``````

Example 1, the simplest one: the table for Russia and Slovakia with a single latent variable, "money | success".

We build two models: one that assumes independence of the factors, and the saturated one.
The calculations show...
``````
ru.sk.data <- subset(srv.design.data, cntry %in% c("RU", "SK"))
srv.loglin.model.ind <- svyloglin(~a.1+cntry, ru.sk.data)
srv.loglin.model.sq <- update(srv.loglin.model.ind, ~.^2)
anova(srv.loglin.model.ind, srv.loglin.model.sq)
``````

Analysis of Deviance Table
Model 1: y ~ a.1 + cntry
Model 2: y ~ a.1 + cntry + a.1:cntry
Deviance = 0.1240613 p = 0.4737981
Score = 0.1217862 p = 0.4778766

that the saturated model is not significantly better than the model assuming independence.
That is, we cannot reject the null hypothesis that the variables in the table are independent.
For comparison, here is the table with the results of the independence model.

Example 2. Consider the table with all five latent variables for France and Russia.
The loglinear model that assumes independence of all factors is rejected. The model with all second-order terms is acceptable. This model can (and should) be simplified: based on the Wald and likelihood-ratio tests, the second-order parameters linking the country variable with the last two latent variables of the heatmap can be dropped.
Calculations:
``````
fr.ru.data <- subset(srv.design.data, cntry %in% c("FR", "RU"))

# independence model and models with all terms up to order 2, 3 and 4
srv.loglin.model.ind <- svyloglin(~ a.1 + a.2 + a.3 + a.4 + a.5 + cntry, fr.ru.data)
srv.loglin.model.sq <- update(srv.loglin.model.ind, ~.^2)
srv.loglin.model.tri <- update(srv.loglin.model.ind, ~.^3)
srv.loglin.model.four <- update(srv.loglin.model.ind, ~.^4)

# p-value of the deviance test: independence vs. second-order model
anova(srv.loglin.model.ind, srv.loglin.model.sq)$dev$p[3]  #5.745843e-50
# p-values: second-order model vs. third- and fourth-order models
c(anova(srv.loglin.model.sq, srv.loglin.model.tri)$dev$p[3],
  anova(srv.loglin.model.sq, srv.loglin.model.four)$dev$p[3])  #  0.7335668 0.7427429

# Wald tests for the country-by-latent-variable interactions
sapply(paste('cntry:a.', 1:5, sep = ""), function(x) round(regTermTest(srv.loglin.model.sq, x)$p, 3))
``````

cntry:a.1 cntry:a.2 cntry:a.3 cntry:a.4 cntry:a.5
0.000 0.000 0.000 0.437 0.524

``````
anova(update(srv.loglin.model.sq, ~. -cntry:(a.4 + a.5)), srv.loglin.model.sq)$dev$p[3]
``````

0.6066181

Conditional independence. Why are mathematical ability and shoe size dependent factors?
This is a variation on a classical example. Suppose the mathematical ability of a respondent is graded as high, average, or low, and shoe size is recorded as well. We build the contingency table of these two variables, say, for the whole population of Russia. The hypothesis that these variables are independent can be safely rejected: people with a larger shoe size have higher mathematical ability. What is the reason? A missing latent variable: age. Clearly, up to a certain point, age correlates positively both with mathematical ability and with shoe size. If age is fixed (Age = k), then for any k the table of the joint distribution of M (mathematical ability) and S (shoe size) will show no significant dependence between them. In that case M and S are said to be conditionally independent. This result is naturally expressed as a Markov network, an undirected graphical model.
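The shoe-size example can be sketched in base R. All numbers below are invented for illustration, the three age groups are an assumption, and the table holds exact expected counts rather than a random sample, so the result is deterministic. `loglin` from the stats package fits the models by iterative proportional fitting.

```r
# Hypothetical conditional distributions given age group (all numbers are invented)
ages <- c("child", "teen", "adult")
p_m <- rbind(child = c(0.7, 0.2, 0.1),    # P(M | age)
             teen  = c(0.3, 0.4, 0.3),
             adult = c(0.1, 0.3, 0.6))
colnames(p_m) <- c("low", "average", "high")
p_s <- rbind(child = c(0.80, 0.15, 0.05), # P(S | age)
             teen  = c(0.30, 0.50, 0.20),
             adult = c(0.05, 0.35, 0.60))
colnames(p_s) <- c("small", "medium", "large")
n_per_age <- 2000

# Expected counts under conditional independence of M and S given age
tab <- array(0, dim = c(3, 3, 3),
             dimnames = list(M = colnames(p_m), S = colnames(p_s), A = ages))
for (a in ages) tab[, , a] <- n_per_age * outer(p_m[a, ], p_s[a, ])

# Marginal M x S table (age ignored): independence model [M][S]
marg <- apply(tab, c(1, 2), sum)
fit_marg <- loglin(marg, margin = list(1, 2), print = FALSE)

# Full table: conditional independence model [MA][SA]
fit_cond <- loglin(tab, margin = list(c(1, 3), c(2, 3)), print = FALSE)

p_marg <- pchisq(fit_marg$lrt, fit_marg$df, lower.tail = FALSE)
p_cond <- pchisq(fit_cond$lrt, fit_cond$df, lower.tail = FALSE)
print(c(marginal = p_marg, conditional = p_cond))
```

In this construction the independence model is clearly rejected for the marginal table, while the model with the generators MA and SA fits the three-way table essentially perfectly. In the Markov network this corresponds to the edges M-A and S-A with no edge between M and S.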

I will add that there is an excellent article on Habr about Bayesian networks, which are directed graphical models.

Graphical representation of loglinear models
The previous example can be generalized and extended to arbitrary hierarchical loglinear models, as was done in [3]. Consider several possible cases for three variables A, B, and C.

These Markov networks correspond to the following loglinear models

Note that not every hierarchical loglinear model can be represented as a Markov network; the AB/AC/BC model is an example. But any model can be uniquely embedded in a minimal Markov network. Details on the correspondence between loglinear and graphical models can be found in the book [1] or the article [3].
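The embedding into a minimal Markov network can be sketched in base R. The helper `interaction_graph` is hypothetical and written only for this illustration: it connects two variables whenever they appear together in some generator of the model.

```r
# Adjacency matrix of the interaction graph of a hierarchical loglinear model,
# given by its generators (maximal interaction terms)
interaction_graph <- function(generators) {
  vars <- sort(unique(unlist(generators)))
  adj <- matrix(FALSE, length(vars), length(vars), dimnames = list(vars, vars))
  for (g in generators) {
    if (length(g) > 1) {
      pairs <- combn(g, 2)  # all pairs of variables inside one generator
      for (k in seq_len(ncol(pairs))) {
        adj[pairs[1, k], pairs[2, k]] <- TRUE
        adj[pairs[2, k], pairs[1, k]] <- TRUE
      }
    }
  }
  adj
}

g_pairwise  <- interaction_graph(list(c("A", "B"), c("A", "C"), c("B", "C")))  # AB/AC/BC
g_saturated <- interaction_graph(list(c("A", "B", "C")))                       # ABC
g_cond      <- interaction_graph(list(c("A", "B"), c("A", "C")))               # AB/AC

print(identical(g_pairwise, g_saturated))  # same graph for AB/AC/BC and ABC
print(g_cond["B", "C"])                    # is there an edge between B and C in AB/AC?
```

AB/AC/BC and the saturated model ABC produce the same complete graph, so the graph alone cannot distinguish them; the minimal Markov network that contains AB/AC/BC is the one for ABC. For AB/AC the edge between B and C is absent, which encodes the conditional independence of B and C given A.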

Summary of results
Markov networks make it fairly easy to navigate the relationships between variables and to compare the results for different tables.

We see that for Russia and Slovakia there is a significant association between the country and the variable "it is important to seek adventure and risk or to have opportunities for fun". The country variable is conditionally independent of the other value variables.
For France and Russia, by contrast, the difference in attitude is significant for three statements: "it is important to be rich or to be successful", "it is important to have a good time", and "it is important to be simple and modest".
Both conclusions are consistent with the results of the heatmap.
As for the relationships between the latent variables, the graphs for these two pairs of countries differ in only one edge. For Russia and Slovakia, the variables "it is important to have a good time" and "it is important to follow the rules or to help people around" are conditionally independent.

In conclusion, I note that stepwise model selection based on AIC or BIC is not yet implemented for loglinear models on complex survey data. Papers adapting these criteria to such data have begun to appear only in recent years. In particular, paper [4] was published this year; one of its coauthors is T. Lumley, the creator of the survey package.

References:
[1] G. Tutz (2011) Regression for Categorical Data, Cambridge University Press.
[2] T. Lumley (2014) survey: analysis of complex survey samples. R package version 3.30.
[3] J. N. Darroch, S. L. Lauritzen, and T. P. Speed (1980) Markov fields and log-linear interaction models for contingency tables. Annals of Statistics 8(3), 522–539.
[4] T. Lumley, A. Scott (2015) AIC and BIC for modeling with complex survey data. J. Surv. Stat. Method. 3(1), 1–18.