Developers Club geek daily blog

1 year, 4 months ago
How to define officials, the most suspicious from the point of view of corruption? Simplest — having compared their income and standard of living.

I want to show possibilities of the websites with open information on officials in this article, to look at how these officials live and to try to define those who are most suspicious from the point of view of corruption.

Why open information on the income of officials is important? Because it allows to control them.

image
Photo from an Instagram of the daughter of the former head of GAI of Ukraine Alexander Yershov. On a photo the daughter of Yershov in Cannes near Paris Hilton. As a result of scandal because of mismatch of the declared income and a way of life of a family Yershov resigned.

From where data?


Data on declarations of officials are taken from the website declarations.com.ua, and about owners of elite real estate — from the website garnahata.in.ua. Both websites — projects of the community of journalists and volunteers "Clerical one hundred" which is initially organized for digitization of documents from Yanukovych's house.

At the moment on the websites well to about 11 thousand declarations of officials from different departments and about 9 thousand records about owners of elite real estate. Among customs applicants there are generally representatives of the different ministries (including services on places), employees of courts and prosecutor's office. Data do not apply for a representativeness (officials in Ukraine about 400 thousand), but to rummage in them all the same interestingly.

Both websites have open API, given in the JSON format it is possible to download by means of a script on python. The scheme of a data object for declarations is on github here, and the scheme of object for data on owners of elite real estate — here. For an example and understanding of a data structure — the scan copy of one of declarations of the website.

Having loaded data, I rasparsit them in R, aggregated some and left only declarations for 2013 and 2014.

Code sample for data acquisition from the JSON file
#Загружаем в R данные из файла JSON — получаем объект типа list
decl_raw<-rjson::fromJSON(file="feed.json")


#Создаем dataframe с количеством строк, равным кол-ву деклараций
decl_df<-data.frame(matrix(NA,nrow=length(decl_raw), ncol = 0))

#
#Простой случай: должность и место работы декларанта
#

#Добавляем колонки
  decl_df$general.post.region<-“”
  decl_df$general.post.office<-“”
  decl_df$general.post.post<-“”

#Считываем данные из списка

for (i in 1:length(decl_raw))
{
  #
  #ДАННЫЕ О ДОЛЖНОСТИ
  #
 
  #Регион, в котором работает декларант
  decl_df$general.post.region[i]<-decl_raw[[i]]$general$post$region
  #Учреждение
  decl_df$general.post.office[i]<-decl_raw[[i]]$general$post$office
  #Должность
  decl_df$general.post.post[i]<-decl_raw[[i]]$general$post$post
}

#
#Более сложный случай: подсчет кол-ва транспортных средств и считывание их названий
#

#Добавляем колонки
decl_df$vehicle35<-0
decl_df$vehicle36<-0
decl_df$vehicle37<-0
decl_df$vehicle38<-0
decl_df$vehicle39<-0
decl_df$vehicle40<-0
decl_df$vehicle41<-0
decl_df$vehicle42<-0
decl_df$vehicle43<-0
decl_df$vehicle44<-0

#Считываем данные из списка
for (i in 1:length(decl_raw))
{ 
  #
  #Кол-во транспортных средств по статьям декларации (пп.35-44)
  #
  
  for (unit in 35:44)
  {

    j = 0
    col_name<-paste0("vehicle", unit)
    raw_col_name<-paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`")
    
    if (length(eval(parse(text=raw_col_name)))!=0)  
    {
      for (k in 1:length(eval(parse(text=raw_col_name))))
      {
        if (length(eval(parse(text=paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`[[",k, "]]$brand"))))!=0 &&eval(parse(text=paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`[[",k, "]]$brand")))!="")
        {j = j+1}
      }
    }
    decl_df[i, grep(col_name, colnames(decl_df))]<-j
  }

}  

#Добавляем колонку для названия всех ТС
decl_df_all$vehicle_names<-""

for (i in 1:length(decl_raw))
{ 
  
  vname<-""
  
  for (unit in 35:44)
  {
    col_name<-paste0("vehicle", unit)
    raw_col_name<-paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`")
    
    if (length(eval(parse(text=raw_col_name)))!=0)  
    {
      for (k in 1:length(eval(parse(text=raw_col_name))))
      {
        if (length(eval(parse(text=paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`[[",k, "]]$brand"))))!=0 &&eval(parse(text=paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`[[",k, "]]$brand")))!="")
        {
          vname=paste(vname,eval(parse(text=paste0("decl_raw[[",i,"]]$vehicle$`",unit,"`[[",k, "]]$brand"))), sep=";")
        }
      }
    }
  }
  decl_df$vehicle_names[i]<-vname
}



Let's begin with the most obvious and simple — the income.

Income of officials


Customs applicants specify the income and the income of family members. For a start we will look at the income counting on one family member.

Code for calculation of the income on the family member
#decl_df — это dataframe с данными деклараций из JSON файла
#Тут и далее цифры в названии переменной обычно указывают на пункт декларации
#Чтобы избежать ошибок, когда суммарный доход подсчитывается неправильно, 
#суммируем все статьи личного дохода декларанта

decl_df$income.own<-decl_df$income.own.6+decl_df$income.own.7+decl_df$income.own.8+
decl_df$income.own.9+decl_df$income.own.10+decl_df$income.own.11+
decl_df$income.own.12+decl_df$income.own.13+decl_df$income.own.14+
decl_df$income.own.15+decl_df$income.own.16+decl_df$income.own.17+
decl_df$income.own.18+decl_df$income.own.19+decl_df$income.own.20+
decl_df$income.own.21

#Если декларант указал общую сумму, но не указал доход по статьям, считаем указанную общую сумму суммарным доходом
#В иных случах считаем суммарным доходом подсчитанную сумму по статьям

for (i in 1:nrow(decl_df))
{
  if (decl_df$income.own[i]==0 &&decl_df$income.own.5[i]>0)  
  {decl_df$income.own[i]<-decl_df$income.own.5[i]}
}

#Аналогично считаем общий доход семьи

decl_df$income.family<-decl_df$income.family.6+decl_df$income.family.7+
decl_df$income.family.8+decl_df$income.family.9+decl_df$income.family.10+
decl_df$income.family.11+decl_df$income.family.12+
decl_df$income.family.13+decl_df$income.family.14+
decl_df$income.family.15+decl_df$income.family.16+
decl_df$income.family.17+decl_df$income.family.18+
decl_df$income.family.19+decl_df$income.family.20+
as.numeric(gsub(",", ".", decl_df$income.family.22))

for (i in 1:nrow(decl_df))
{
  if (decl_df$income.family[i]==0 &&decl_df$income.family.5[i]>0)  
  {decl_df$income.family[i]<-decl_df$income.family.5[i]}
}

#Считаем доход на одного члена семьи

decl_df$income_per_member<-rowSums(cbind(decl_df$income.own,decl_df$income.family), na.rm=TRUE)
decl_df$income_per_member<-decl_df$income_per_member/decl_df$number_of_family_members_incl_decl

#Переводим его в тыс.грн.
decl_df$income_per_member_ths<-decl_df$income_per_member/1000


The glance for 10% percentiles shows that in upper 10% there are some super-rich men: the average income for upper 10% — 305,8 million UAH on the family member (about 12 million dollars), at the same time on a 90% percentile value of only 382 thousand UAH.

quantile(decl_df$income_per_member_ths, probs=seq(0,1,0.1))

Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

By departments:

Code for the diagram
qplot(data=decl_df, x=office_g, y = income_per_member_ths, 
      geom="boxplot",
      xlab="Ведомство",
      ylab="Доход на члена семьи, тыс.грн.",
      main="Все декларанты")


Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

Suddenly parliament. On all other departments the income on the family member does not exceed 50 million UAH a year. Let's eliminate autlayer with very high income and we will look on officials with the income at up to 1 million UAH on the family member in a year (such 97%):

Code for the diagram
qplot(data=decl_df[decl_df$income_per_member_ths<1000,], 
      x=office_g, y = income_per_member_ths, geom="boxplot",
      xlab="Ведомство",
      ylab="Доход на члена семьи, тыс.грн.",
      main="Доход до 1 млн.грн.")


Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

It is visible that the average income is higher in vessels (231 thousand) and in parliament (209 thousand). In other departments the average income about 75-100 thousand UAH on the family member.

Income of officials vs income of families


Let's look how the income of a family and the income of the customs applicant corresponds. Here already we look at the absolute sums without binding to the number of family members.

Code for the diagram
#Создаем dataframe только семейных чиновников
decl_family<-decl_df[decl_df$number_of_family_members_incl_decl>1,]

qplot(data=decl_family, y=income.own/1000, x=income.family/1000,
      xlim=c(0,800000), ylim=c(0,800000),
      xlab="Доход семьи, тыс.грн.", ylab="Доход декларанта, тыс.грн.")


Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

The same parliamentarians-autlayery disturb perception. Let's look how the income of customs applicants and members of families for bulk of officials corresponds (we will be limited to 1 million UAH of a revenue both for the customs applicant, and for members of families — among family customs applicants of such 94%):

Code for the diagram
nrow(decl_family[decl_family$income.own<1000000 &decl_family$income.family<1000000,])/nrow(decl_family)

qplot(data=decl_family, y=income.own/1000, x=income.family/1000,
      xlim=c(0,1000), ylim=c(0,1000),
      xlab="Доход семьи, тыс.грн.", ylab="Доход декларанта, тыс.грн.",
      main="Доход до 1 млн.грн.")


Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

It is visible that the income of the customs applicant is more often above the income of a family (accumulation of points along a vertical axis), but it can be explained also with the fact that 77% of family customs applicants — men, and the income of men in Ukraine above the income of women on average for 30% (according to International Labour Organization)

In different departments a ratio approximately identical (see the diagram below). In parliament there are slightly more people whose families earn more. In vessels — on the contrary (perhaps, because of rather high salary of judges).

Code for calculations and the diagram
#Создаем переменную-фактор из 4 категорий:
#1.Нет дохода у семьи
#2.Доход семьи меньше 75% дохода декларанта
#3.Доход семьи соизмерим с доходом декларанта (составляет от 75% до 150% дохода декларанта)
#4.Доход семьи превышает доход декларанта в 1,5 и больше раза

decl_family$family.own.income.ratio<-""

for (i in 1:nrow(decl_family))
{
  if (decl_family$income.family[i]==0) 
  {decl_family$family.own.income.ratio[i]<-"1.Нет дохода у семьи"}  
  
  else
  {
    if (decl_family$income.family[i]<=0.75*decl_family$income.own[i]) 
    {decl_family$family.own.income.ratio[i]<-"2.Доход семьи меньше (<0.75x)"}
    
    else
    {
      if (decl_family$income.family[i]<=1.5*decl_family$income.own[i]) 
      {
        decl_family$family.own.income.ratio[i]<-"3.Доход семьи соизмерим (0.75-1.5х)"
      }
      if (decl_family$income.family[i]>1.5*decl_family$income.own[i]) 
      {
        decl_family$family.own.income.ratio[i]<-"4.Доход семьи больше, >1.5x"
      }
    }
  }
}

decl_family$family.own.income.ratio<-as.factor(decl_family$family.own.income.ratio)

#Создаем таблицу с % по каждому ведомству
y<-as.data.frame(100*prop.table(table(decl_family$family.own.income.ratio,decl_family$office_g), margin=2))

#Строим график
ggplot(y, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat="identity")+
  ylab("%") +
  xlab("")+
  theme(text = element_text(size=14), legend.title=element_blank(),axis.text.x = element_text(angle=90, size=12,vjust=1,hjust=1))+
  geom_text(aes(label = round(Freq,0),ymax=100),size=4,vjust=1.5,position="stack")+
  scale_fill_brewer()



Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

From what sources gain income of a family of officials?


Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

On the first place dividends and percent — but it only thanks to parliament. Further there is income from business activity, and this type is more characteristic of local government and prosecutor's office that casts some suspicions.

The salary only in the third place — though in general across Ukraine it wins first place in structure of the income of members of the family of officials, making about 40% of the income of the population (according to Goskomstat).

Index of suspiciousness of the official


So, we passed the short overview of the income of officials and their families.

However the purpose at us another — to evaluate probability that the official is a corrupt official. It is clear, that it is impossible to make only of these declarations of it, even mismatch of the income and expenses is not the proof of bribery. Therefore a task now — to define a certain index of corruption suspiciousness of the official.

Unfortunately, the task cannot be solved by methods of machine learning as information on a target variable — the official the corrupt official is or not — at us is not present. It is necessary to work in the expert way.

What can demonstrate corruption and can be checked on data of declarations? There are some options. For simplicity of calculations for each point there will be 1 point.

  • The large sums on accounts in banks at the low income of the customs applicant and family members

The penal point is appropriated to those who has a sum on accounts in five or more times exceeds the total family revenue. Such 294 persons.

Code for calculations
#Считаем суммарный доход (личный и семьи)
decl_df$income.own.and.family<-decl_df$income.own+decl_df$income.family

#Считаем суммы на счетах в банках (пп.45-53 декларации)
decl_df$banks<-decl_df$banks45+decl_df$banks47+decl_df$banks49+
              decl_df$banks51+decl_df$banks52+decl_df$banks53

#Делим суммы на счетах на доход
decl_df$banks.income.ratio<-decl_df$banks/(decl_df$income.own.and.family+1)

#Подсчитываем переменную подозрительности по этому пункту
#Присваиваем ее единице в тех случаях, когда сумма на счетах
#в 5 и более раз выше, чем суммарный годовой доход

decl_df$susp1<-0

for (i in 1:nrow(decl_df))
{
  if (decl_df$banks[i]>5*decl_df$income.own.and.family[i])
  {decl_df$susp1[i]<-1}
}


  • Zero income of a family and customs applicant. It, of course, can be a sign of inaccurately completed declaration — but it, first, too is bad, and secondly, can say about what to the official is what to hide.

Here everything is simple. Such there were 50 people.

Code for calculations
decl_df$susp2<-0

for (i in 1:nrow(decl_df))
{
  if (decl_df$income.own.and.family[i]==0)
  {decl_df$susp2[i]<-1}
}


  • If the property is written generally on family members, especially if they have low income

The penal point was appropriated to those at whom the family owns real estate the area more than average on selection, and at the same time the customs applicant of real estate has less, than at a family.

Such 478 people turned out. If at the same time the income of a family was in the lower 25% of the income of families, then the point was multiplied on 2 — such 49 people.

Here I considered apartments, at home, dachas, garages, etc. — but did not consider the land plots because because of prohibition on sale of the agricultural earth many natives of villages have shares of the former collective-farm lands in ownership, and in fact the person can possess earth hectares, without having an opportunity to receive from it benefit.

Code for calculations
#Суммарная площадь недвижимости в собственности декларанта
decl_df$estate.own<-decl_df$estate24+decl_df$estate25+
                    decl_df$estate26+decl_df$estate27+decl_df$estate28
#Суммарная площадь недвижимости в собственности семьи
decl_df$estate.family<-decl_df$estate30+decl_df$estate31+
                    decl_df$estate32+decl_df$estate33+decl_df$estate34

#Считаем верхнюю границу первых 25% по семейному доходу
x<-quantile(decl_df[decl_df$number_of_family_members>0,]$income.family, probs=seq(0,1,0.25))[2]


#Считаем среднюю по выборке суммарную площадь недвижимости семьи
y<-mean(decl_df[decl_df$number_of_family_members>0,]$estate.family)

#Если недвижимости у семьи больше среднего по выборке и больше чем у декларанта
#присваиваем штрафной балл

decl_df$susp3<-0

for (i in 1:nrow(decl_df))
{
  if (decl_df$estate.family[i]>y &decl_df$estate.family[i]>decl_df$estate.own[i])
  {
    #Если при этом доход семьи находится в нижнем квартиле, умножаем балл на два
    if (decl_df$income.family[i]<x)
    {decl_df$susp3[i]<-2}
    else
    {decl_df$susp3[i]<-1}
  }
  
}


  • The large sums of the income from abroad (can confirm washing of means)

In total 128 people who had income from abroad were found (personal or family). From them at 44 people this income exceeded the income in Ukraine — them and we recognize suspicious.

Code for calculations
#Доходы из-за границы (декларанта и семьи)
decl_df$income.from.abroad<-decl_df$income.own.21+as.double(decl_df$income.family.22)

decl_df$susp4<-0

for (i in 1:nrow(decl_df))
{
#Если доход из-за границы больше, чем доход внутри страны — присваиваем балл  
if (decl_df$income.from.abroad[i]> 
decl_df$income.own.and.family[i]-decl_df$income.from.abroad[i])
  {decl_df$susp4[i]<-1}
}


  • Existence of several cars in the absence of housing

Let's take those who have more than two cars and there is no housing. Such 31 person.

Code for calculations
#Количество авто (легковых и грузовых) в семье
decl_df$vehicles<-decl_df$vehicle35+decl_df$vehicle36+
                  decl_df$vehicle40+decl_df$vehicle41

decl_df$susp5<-0

for (i in 1:nrow(decl_df))
{
 if (decl_df$vehicles[i]>2 & decl_df$estate.own[i]==0 &decl_df$estate.family[i]==0)
 {decl_df$susp5[i]<-1}
}



  • Existence of luxury cars

I did not find any approved classification of cars with the list of brands and models which it is possible to carry luxury to a class. Therefore used the viki-article Luxury vehicle.

As a result the list turned out such: Acura, Alfa Romeo Giulia, Audi A4, Audi A6, Audi A7, Audi A8, Bentley, BMW 3, BMW 5, BMW 7, Cadillac, Ferrari, Hummer, Infinity, Jaguar, Lamborghini, Land Rover, Lexus, Maserati, Mercedes-Benz C, Mercedes-Benz E, Mercedes-Benz GL, Mercedes-Benz S, Porsche, Rolls-Royce, Saab 9-3, Saab 9-5, Volkswagen Phaeton, Volvo S60, Volvo S80.

The penalty was charged by that who has at least one of these cars, but not charged if this only car in a family (you never know, suddenly saved all life). In total such 653 persons.

Code for calculations
#Вектор с названиями авто
luxury_cars<-c('Acura',	'Lexus',	'Cadillac',	'Alfa Romeo Giulia',	'Jaguar',	'Volvo S60',	'Infinity',	'Saab 9-3',	'BMW 3',	'Audi A4',	'Mercedes-Benz C',	'Volvo S80',	'Audi A6',	'Audi A7',	'Mercedes-Benz E',	'Saab 9-5',	'Maserati',	'BMW 5',	'BMW 7',	'Audi A8',	'Mercedes-Benz S',	'Porsche',	'Volkswagen Phaeton',	'Rolls-Royce',	'Bentley',	'Ferrari',	'Lamborghini',	'Mercedes-Benz GL',	'Hummer',	'Land Rover')

for (j in (1:nrow(decl_df)))
{
  decl_df$susp5.1[j]<-0
  for (i in (1:length(luxury_cars)))
  {
    #Если в списке машин встречается название из вектора
    if (grepl(luxury_cars[i], decl_df$vehicle_names[j], 
                  ignore.case=TRUE)==TRUE)
        {
            #Считаем кол-во таких машин
            decl_df$susp5.1[j]<-decl_df$susp5.1[j]+
            length(gregexpr(luxury_cars[i], decl_df$vehicle_names[j],ignore.case=TRUE)[[1]])
        }
  }
}

decl_df$susp5.2<-0

#Если есть элитные авто - присваиваем штрафной балл
for (i in (1:nrow(decl_df))) {if (decl_df$susp5.1[i]>0) decl_df$susp5.2[i]<-1}

#Если это единственное авто - снимаем штрафной балл
for (i in (1:nrow(decl_df))) {if (decl_df$vehicles[i]==1) decl_df$susp5.2[i]<-0}


  • High income of family members from business activity.

The penal point was charged by that who to total income had a ratio of the income of a family from business activity above an average on selection. Such there were 419 people.

Code for calculations
#Изначально присваиваем коэффициенту значение 0
decl_df$familyPE.own.income.ratio<-0

#Для тех, у кого не нулевой доход, считаем соотношение дохода 
#семьи от предпринимательской деятельности к общему доходу семьи и декларанта
decl_df[decl_df$income.own.and.family>0,]$familyPE.own.income.ratio<-
decl_df[decl_df$income.own.and.family>0,]$income.family.17/decl_df[decl_df$income.own.and.family>0,]$income.own.and.family

#Среднее по выборке соотношение дохода семьи от предпринимательской деятельности к общему доходу.Считаем только для тех, у кого в семье есть доход от предпринимательской деятельности
x<-mean(decl_df[decl_df$income.family.17>0,]$familyPE.own.income.ratio)

decl_df$susp6<-0

#Если соотношение больше среднего — присваиваем штрафной балл

for (i in 1:nrow(decl_df))
{
  if (decl_df$familyPE.own.income.ratio[i]>x)
  {decl_df$susp6[i]<-1}
}


  • Ownership of elite real estate (on the basis of these garnahata)

The Garnakhata project collects data on owners of expensive real estate - it is official data on the basis of the State register of property rights.

For our purposes I compared the Full Name of owners to the Full Name of customs applicants — at complete coincidence (such there were 80 people) to the customs applicant 1 point was added to suspiciousness.

Besides, I made verification only on a surname (without name and a middle name) the customs applicant or a surname of relatives whom he specified in declarations. As surnames happen widespread, there was a lot of coincidence (more than 2 thousand), but also only 0,5 points were added to an indicator of suspiciousness.

Verification became in Excel therefore without code

Results


Having put together points on all suspicious points, I received the general indicator of suspiciousness.

Code for calculations
decl_df$suspicious<-decl_df$susp1+decl_df$susp2+
                    decl_df$susp3+decl_df$susp4+decl_df$susp5+decl_df$susp5.2+
                    decl_df$susp6+decl_df$hata_own+decl_df$hata_family*0.5


From 10 346 customs applicants it was more than zero for 3971, but it generally due to coincidence of a surname from the register of real estate — the indicator higher than 0,5 is recorded for 1461 customs applicants. The maximum value of an indicator — 5 (from theoretically possible maximum 9,5).

Distribution on departments indicates parliament again:

Find the corrupt official. Data analysis of officials from projects of Clerical one hundred (with examples on R)

This article is a translation of the original post at habrahabr.ru/post/271773/
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.

We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus