Indeed, some classes can be picked out automatically, for example "games" and "technologies".
But websites about military affairs, space and science, for example, for some reason merged into a single cluster. One could even find some logic in that, but the customer most likely would not accept it, and would be right not to.
This calls for a hybrid technique in which classification is neither fully automatic nor fully manual. I found such a method for myself, have used it successfully several times, and now want to share it. The idea is this: we manually classify a few dozen top websites, then for each subsequent website we guess a class based on the ones already classified, and check the result manually. The interface can be built on Jupyter widgets.
How do we guess a website's class? We look at the vertices the website is connected to, keep only those that already have a class assigned, count which class occurs most often, and voila, the class is found. The most remarkable property of this algorithm is that the more websites have been classified, the better it guesses classes for new ones. After a while the process settles into the following pattern.
For 60-70% of the websites the guessed class is already correct, so they can be confirmed without any changes. This speeds up classification considerably. In practice, about 1000 websites can be processed in one day.
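The guessing step and the manual check described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the original notebook code: the graph is assumed to be a plain dictionary of neighbor lists, the names `guess_class`, `annotate` and `review` are hypothetical, and the `review` callback stands in for the human who confirms or corrects each guess in the Jupyter interface.

```python
from collections import Counter


def guess_class(site, neighbors, labels):
    """Return the most common class among already-labeled neighbors, or None.

    Ties are broken by whichever class was counted first.
    """
    votes = Counter(labels[n] for n in neighbors.get(site, ()) if n in labels)
    if not votes:
        return None  # no labeled neighbors yet, nothing to guess from
    return votes.most_common(1)[0][0]


def annotate(sites, neighbors, labels, review):
    """Label sites one by one, updating `labels` in place.

    `review(site, guess)` returns the final class (assumed callback).
    Returns how many guesses were accepted unchanged.
    """
    accepted = 0
    for site in sites:
        if site in labels:
            continue  # already classified manually
        guess = guess_class(site, neighbors, labels)
        final = review(site, guess)
        if guess is not None and final == guess:
            accepted += 1
        labels[site] = final  # new labels improve later guesses
    return accepted


# Toy data (hypothetical site names): three manually labeled seed sites.
neighbors = {
    "a.example": ["seed1.example", "seed2.example", "seed3.example"],
    "b.example": ["seed1.example", "a.example", "seed3.example"],
}
labels = {
    "seed1.example": "games",
    "seed2.example": "games",
    "seed3.example": "technologies",
}

# Stand-in for the human reviewer: corrects one guess, accepts the rest.
corrections = {"b.example": "technologies"}


def review(site, guess):
    return corrections.get(site, guess)


accepted = annotate(["a.example", "b.example"], neighbors, labels, review)
print(accepted, labels["a.example"], labels["b.example"])
```

Because every confirmed label is written back into `labels` immediately, each later site is guessed from a larger set of labeled neighbors, which is the property the text points out.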
At the end, we can color the vertices of the graph according to the manual annotation and make sure that the military and scientific websites are now separated correctly.
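As a rough sketch of the coloring step, the manual labels can be turned into one color per vertex. The post does not say which drawing tool was used, so the helper `node_colors`, the palette and the site names below are assumptions made for illustration. The color names are matplotlib's "tab" colors, and the resulting list could be passed, for instance, as the `node_color` argument when drawing the graph with networkx.

```python
from itertools import cycle

# Assumed palette; reused cyclically if there are more classes than colors.
PALETTE = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]


def node_colors(nodes, labels, unlabeled="lightgray"):
    """Return one color per node, in the order of `nodes`.

    Nodes without a manual label get the `unlabeled` color.
    """
    classes = sorted(set(labels.values()))
    color_of = dict(zip(classes, cycle(PALETTE)))
    return [color_of.get(labels.get(n), unlabeled) for n in nodes]


# Hypothetical example: two labeled sites and one that was never annotated.
nodes = ["army.example", "lab.example", "new.example"]
labels = {"army.example": "military", "lab.example": "science"}
colors = node_colors(nodes, labels)
print(colors)
```

Sorting the class names before assigning colors keeps the mapping stable between runs, which makes it easier to compare pictures taken at different stages of annotation.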
This article is a translation of the original post at habrahabr.ru/post/265431/