Developers Club geek daily blog

1 year, 1 month ago
Consider the following task: there are 1000 news websites, for example: it was written by me and the guys from DCA several months ago. The graph of the news websites looks roughly like this:
Semi-automatic classification of the websites

Indeed, some classes can be picked out automatically, for example "games" and "technologies".

But the websites about military affairs, space, and science, for example, merged into a single cluster for some reason. One could even find some logic in that, but the customer most likely will not understand it, and will be right not to.

What is needed is a hybrid technique in which classification is neither fully automatic nor fully manual. I found such a method for myself, have used it successfully several times, and now want to share it. The idea is this: we manually classify several dozen top websites, then for each subsequent website we guess a class based on the ones already classified and check the result manually. The interface can be built on Jupyter widgets.

How do we guess a website's class? We look at the vertices the website is connected to, keep only those that already have a class assigned, and count which class occurs most often. Voilà, the class is found. The most remarkable property of this algorithm is that the more websites have been classified, the better it guesses classes for new ones.
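The guessing step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the graph is stored as a plain dictionary of neighbors, and the names `guess_class`, `neighbors`, and `labels`, as well as the example site names, are hypothetical and not taken from the original notebook.

```python
from collections import Counter


def guess_class(site, neighbors, labels):
    """Return the most common class among already-labeled neighbors, or None."""
    votes = Counter(
        labels[n] for n in neighbors.get(site, ()) if n in labels
    )
    if not votes:
        # No labeled neighbors yet: the annotator has to decide from scratch.
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy graph: site -> sites it is connected to.
neighbors = {
    "newsite.example": [
        "games-a.example",
        "games-b.example",
        "tech-a.example",
        "unlabeled.example",
    ],
}
# Classes assigned manually so far.
labels = {
    "games-a.example": "games",
    "games-b.example": "games",
    "tech-a.example": "technologies",
}

print(guess_class("newsite.example", neighbors, labels))  # prints: games
```

Unlabeled neighbors are simply ignored, which is why the guesses improve as more of the graph gets annotated.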

For 60-70% of the websites the guess can simply be accepted, because the class is already specified correctly. This speeds up classification considerably. In practice, 1000 websites can be processed in one day.
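To make the workflow concrete, here is a minimal sketch of the review loop that also counts how many guesses were accepted unchanged. It is an assumption-laden illustration: the `review` callback stands in for the human clicking through the Jupyter interface, and `majority_class`, `annotate`, `truth`, and the toy data are all hypothetical.

```python
from collections import Counter


def majority_class(site, neighbors, labels):
    """Most common class among labeled neighbors, or None if there are none."""
    votes = Counter(labels[n] for n in neighbors.get(site, ()) if n in labels)
    return votes.most_common(1)[0][0] if votes else None


def annotate(sites, neighbors, labels, review):
    """Label `sites` in order; return how many guesses were accepted unchanged.

    `review(site, guess)` plays the role of the human and returns the final class.
    """
    accepted = 0
    for site in sites:
        guess = majority_class(site, neighbors, labels)
        final = review(site, guess)
        if guess is not None and final == guess:
            accepted += 1
        # The new label is available immediately for later guesses.
        labels[site] = final
    return accepted


# Hypothetical data: `truth` simulates the annotator's judgment.
truth = {"b": "games", "c": "games", "e": "science", "g": "science"}
neighbors = {"b": ["a"], "c": ["a", "b"], "e": ["d"], "g": ["a", "b"]}
labels = {"a": "games", "d": "science"}  # manually classified seed websites

todo = ["b", "c", "e", "g"]
accepted = annotate(todo, neighbors, labels, review=lambda site, guess: truth[site])
print(f"{accepted}/{len(todo)} guesses accepted without correction")
```

In this toy run only site `g` needs a manual correction, since all of its labeled neighbors belong to a different class.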

At the end, you can color the vertices of the graph according to the manual annotation and confirm that the military and science websites are separated correctly.
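A rough sketch of this final check, assuming the annotation is a dictionary from site to class and the graph is a list of edges. The palette, the site names, and the edge list are hypothetical; the color strings are standard matplotlib color names.

```python
# Hypothetical manual annotation, edge list, and palette.
labels = {"a": "military", "b": "military", "c": "science", "d": "science"}
edges = [("a", "b"), ("c", "d"), ("b", "c")]
palette = {"military": "tab:red", "science": "tab:blue"}

# One color per vertex, in a fixed vertex order.
nodes = sorted(labels)
node_colors = [palette[labels[n]] for n in nodes]

# Edges that join vertices of different classes hint at how well
# the classes are separated in the graph.
cross = [(u, v) for u, v in edges if labels[u] != labels[v]]

print(node_colors)
print(f"{len(cross)} of {len(edges)} edges connect different classes")
```

If the graph is drawn with networkx, a list like `node_colors` can be passed as the `node_color` argument together with `nodelist=nodes` so that colors and vertices stay aligned.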

This article is a translation of the original post at