When we are children, our parents teach us to recognise the things that surround us. We quickly learn to distinguish, without too much error, a cat from a dog, a fork from a spoon, or a parent from other human beings. As we grow up, we learn more complicated categories, such as true and false, good and bad, or beautiful and ugly, all of which can be completely subjective rationalisations. But putting people and things in boxes can be a dangerous and unfair activity, especially when it leads to unfavourable treatment of certain categories of people or things on unjustified grounds.
Yet, putting things into categories like this is a fundamental activity for humans; we need it to understand and interact with the world around us. For humans, it is not necessarily simple:
- The categories are not necessarily well defined
- There is not necessarily a consensus on what to classify in which category
- Some categories are subjective
- We do not always know what we are trying to categorize
In short, categorization is an activity that is both central and difficult for humans. For machines, it is about the same: one can teach a machine to categorize objects, but it will encounter the same problems as the human that created it, as well as its own difficulties. However, it has considerable advantages: its computing power and its ability to perform repetitive tasks without the slightest sign of weariness.
So, what does Proxem have to do with this? Well, categorization was the subject presented by our founder a few weeks ago at the TALN 2013 scientific conference.
When analysing documents, there are many reasons for wanting to organize them according to predefined categories. We may want to know which emails are the most important, or to create alerts for serious accidents. We may want to highlight tweets that severely criticise a brand, or to find scientific publications that could interest us… All these problems come down to defining criteria and, from those criteria, categories of documents. In semantic analysis, these criteria are broadly divided into two main families: what we speak about (theme) and what we say about it (polarity). To put it simply, we want to know whether a document is about a subject that interests us, and whether what it says is positive or negative.
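To make the two families of criteria concrete, here is a minimal sketch in Python. It is only an illustration, not Proxem's method: the word lists and the `analyse` function are hypothetical, and real semantic analysis goes far beyond counting keywords.

```python
# Hypothetical word lists, chosen only for this illustration.
THEME_TERMS = {"battery", "screen", "delivery"}   # what we speak about
POSITIVE_TERMS = {"great", "love", "excellent"}   # what we say (positive)
NEGATIVE_TERMS = {"terrible", "broken", "hate"}   # what we say (negative)


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,!?;:").lower() for word in text.split()]


def analyse(text):
    """Return the two criteria for a text: is it on theme, and its polarity."""
    tokens = tokenize(text)
    on_theme = any(token in THEME_TERMS for token in tokens)
    # Polarity score: positive words count +1, negative words count -1.
    score = sum(t in POSITIVE_TERMS for t in tokens) - sum(
        t in NEGATIVE_TERMS for t in tokens
    )
    if score > 0:
        polarity = "positive"
    elif score < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return {"on_theme": on_theme, "polarity": polarity}


if __name__ == "__main__":
    print(analyse("I hate this broken battery!"))
```

A tweet flagged as both on theme and negative is the kind of document one would want to highlight when monitoring a brand.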
In content analytics, one of the difficulties of categorization is that the world is vast, and it is very hard to determine in advance all the criteria and categories that may arise. Yet this is an essential step before selecting the categories that interest us. It is precisely this problem of generic categorization that Proxem is now able to address: using Wikipedia’s knowledge organization (ontologies), we can automatically determine what a document is about, that is to say, automatically attach a document (a tweet, an email, an article…) to a set of interconnected categories.
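The idea of attaching a document to a set of connected categories can be sketched as follows. This is a toy illustration under stated assumptions, not the patented approach: the small category graph, the keyword lists and the function names are all hypothetical, and simple keyword matching stands in for whatever learning method is actually used.

```python
from collections import deque

# Hypothetical fragment of a category graph: each category lists its parents.
CATEGORY_PARENTS = {
    "Machine learning": ["Artificial intelligence"],
    "Natural language processing": ["Artificial intelligence", "Linguistics"],
    "Artificial intelligence": ["Computer science"],
    "Linguistics": [],
    "Computer science": [],
}

# Hypothetical keywords that signal a leaf category (an assumption of this sketch).
CATEGORY_KEYWORDS = {
    "Machine learning": {"classifier", "training", "learning"},
    "Natural language processing": {"corpus", "tweet", "semantic", "language"},
}


def direct_categories(text):
    """Return the categories whose keywords appear in the text."""
    words = set(text.lower().split())
    return {cat for cat, keywords in CATEGORY_KEYWORDS.items() if words & keywords}


def with_ancestors(categories):
    """Breadth-first walk up the graph to collect every connected parent."""
    seen = set(categories)
    queue = deque(categories)
    while queue:
        category = queue.popleft()
        for parent in CATEGORY_PARENTS.get(category, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def categorize(text):
    """Attach a document to its direct categories and all their ancestors."""
    return with_ancestors(direct_categories(text))


if __name__ == "__main__":
    print(sorted(categorize("Training a classifier on a large corpus")))
```

Because the categories are linked, a document matched only on specific topics is also reachable from broader ones, which is what makes it possible to look at relations between subjects rather than a flat list of labels.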
Therefore, even if you have not read the 226-page thesis by François-Régis Chaumartin (Proxem’s founder), you can, thanks to this graph generated by our tools, see what it is about and explore the relations between the different subjects.
Where it is usually necessary to manually tell the machine which thematic categories one wishes to follow, for example when monitoring a specific subject, our approach automates the configuration of categories and the identification of relevant documents. Once the documents have been classified, the user only has to bring their expert view to the subject, freed from the whole phase of sorting and discarding uninteresting documents.
This technology is exclusive to Proxem and has been patented.
Chaumartin, F.-R. (2013). Apprentissage d’une classification thématique générique et cross-langue à partir des catégories de la Wikipédia. Actes de la 20e conférence sur le Traitement Automatique des Langues Naturelles (TALN’2013) (Vol. 8, pp. 659–666). Les Sables d’Olonne.