The Grand Débat National: an opportunity for semantic analysis (Episode 1)
The “Grand Débat National” (GDN) is a French initiative aimed at collecting the opinions and attitudes of French citizens on four main themes of French public life. Most readers, even non-French ones, have probably heard of it, but just in case, here are a few links to general-purpose articles explaining the whole exercise:
From our point of view, this exercise represents a unique opportunity to apply our natural language processing technologies outside the traditional “customer-oriented” context where we usually operate.
There has been a lot of controversy over the interpretation of the hundreds of thousands of spontaneous contributions that this democratic exercise generated, as pointed out by this report (https://www.francetvinfo.fr/politique/grand-debat-national/grand-debat-national-polemique-sur-la-restitution_3396177.html) (in French). I found it quite surprising that most of the criticism was about how the contributions were read. But what does “read” mean in a context where one is facing more than 9 million sentences? Read by humans? For what purpose? Interpreted by a computer? But then, with which algorithms? Let’s be clear on this point: without a specification of the interpretation algorithm, any claim of “correct interpretation” of French citizens’ attitudes and wishes is just a handful of meaningless words.
This is the first episode of a series of posts in which we will try to provide a transparent analysis of the texts produced by the GDN. I underline the word “analysis” as opposed to “interpretation”: given our skills, we are not in a position to say what French citizens want; that is a question for analysts, sociologists, politicians, etc. We just want to make explicit what they said and how they said it.
The starting point is, of course, the official interpretation of the GDN, which is provided here: https://granddebat.fr/pages/syntheses-du-grand-debat. The analysis was performed by the French company OpinionWay, supported by Qwam for the part concerning the interpretation of textual information. The crucial methodological question was, of course: how can one make sense of such a large volume of text? OpinionWay adopted a classical categorization approach. The official slides state (translated from the French):
QWAM has developed QWAM Text Analytics, a tool for the automatic analysis of massive textual data (big data), relying on natural language processing technologies coupled with artificial intelligence techniques (deep learning).
Thanks to powerful algorithms, the notions mentioned by respondents were captured, analyzed, sorted and classified into different categories and sub-categories.
This categorization approach (a classical one) essentially mimics a quantitative analysis based on closed questions: textual expressions are translated into answers to a virtual closed-answer survey. While mainstream, this approach has some major drawbacks:
- The identification of the categories can be dramatically biased by a priori choices. Even though there are clustering techniques that can help produce categorization trees reflecting the nature of the data, at the end of the day the decision to introduce a category or not is always a matter of human choice, with all the associated bias. Incidentally, we notice here a crucial difference with respect to closed questions: if an option is missing from a closed question, everybody notices. If a category is missing from a categorization schema (in the sense that people talk about a certain topic but the topic does not appear as a category), nobody notices, unless they read all the answers, which is not feasible with corpora of this size.
- In order to be readable, the categorization approach must privilege statistically prominent topics. This means that all signals which are important but not statistically significant (weak signals) are simply lost.
- Once we ask people to express themselves, it is extremely reductive to analyze their answers just by sorting them into a predefined set of boxes. Most of the information contained in natural language utterances is in fact relational. For instance, for the concept “reduction of privileges” we have a relation of the kind reduce(PRIVILEGE_TYPE, ROLE_TYPE), which people might express as “reduce the salary of senators”, “reduce special regimes for public officers”, “reduce the indemnity of parliamentarians”, etc. As the combinations of PRIVILEGE_TYPE and ROLE_TYPE are extremely numerous (potentially unbounded), a categorical approach necessarily flattens this richness into a few categories such as SALARY_REDUCTION, INDEMNITY_REDUCTION, etc.
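To make the last drawback concrete, here is a minimal sketch, in Python, of what preserving the relational structure might look like. The cue words and gazetteers below are invented for illustration; a real pipeline would rely on syntactic parsing and large linguist-curated lexicons, not hard-coded lists.

```python
import re

# Hypothetical gazetteers (tiny invented samples).
PRIVILEGES = ["salaire", "régimes spéciaux", "indemnités"]
ROLES = ["sénateurs", "fonctionnaires", "parlementaires"]

def extract_reduce_relations(sentence):
    """Return reduce(PRIVILEGE_TYPE, ROLE_TYPE) tuples found in a sentence."""
    relations = []
    # A reduction cue must be present for the relation to fire.
    if re.search(r"\b(réduire|diminuer|supprimer)\b", sentence, re.I):
        text = sentence.lower()
        for p in (p for p in PRIVILEGES if p in text):
            for r in (r for r in ROLES if r in text):
                relations.append(("reduce", p, r))
    return relations

print(extract_reduce_relations("réduire le salaire des sénateurs"))
# → [('reduce', 'salaire', 'sénateurs')]
```

The point is that the output keeps both arguments of the relation, where a flat category such as SALARY_REDUCTION would discard the ROLE_TYPE.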
On top of all this, the categorization approach specifically adopted in the case of the GDN is not made explicit. Of course, the sentence “thanks to powerful algorithms the notions mentioned by respondents have been captured, etc.” does not shed any light on how the texts have been analyzed. Some workflow detail is given here (https://granddebat.fr/media/default/0001/01/f73f9c2f64a8cf0b6efa24fdc80179e7426b8cc9.pdf), but, again, we could not find any mention of the specific clustering and categorization algorithms. This being the case, even with the above-mentioned drawbacks of a categorization approach, we tend to prefer an approach such as that of https://grandeannotation.fr/: a traditional annotation approach, based on crowdsourcing and human classification, which will at least have the merit of creating a gold standard to be exploited in future research. We also tend to share their criticism of Artificial Intelligence, even though we think the opposition “completely human” vs. “completely artificial” is somewhat forced, at least as long as computer algorithms are driven by human-written rules.
In this post, as well as in the posts that will follow, we will dig into the interaction between human researchers and computer algorithms in the difficult task of extracting insights from the GDN corpus. We will also point out the difficulties raised by certain kinds of language and emphasize to what extent parameterization is an important aspect of Natural Language Processing. In this post, however, we will just present the method we will use (not the algorithms, as these will be detailed in dedicated posts) and provide a first snapshot of the corpus in terms of volume of concepts. At the end of each post, we will include links to Excel spreadsheets containing the data on which tag clouds, pies, and other visualizations are based. We will also describe the conditions under which we will give full access to all the analytics derived from the GDN.
A Quali-quantitative Approach
In this post, we describe what we call a quali-quantitative approach to text analytics. The basic idea is to avoid reducing language expressions to a set of categories, as if they were the poor relatives of closed-answer questionnaires. On the contrary, we will try to show how we can extract meaningful insights from verbatims without forcing them into predefined categories. This approach also has the advantage of keeping a tight connection with the core of each textual expression, i.e. language. We might agree that two users saying “limiter cumul des mandats” (limit the accumulation of mandates) and “Interdiction du cumul abusif des mandats” (prohibition of the illegal accumulation of mandates) are expressing basically the same thing in terms of categories, but the way the second one is formulated tells us much more: the apparent contradiction (if something is illegal, it is already forbidden) is the clear expression not (or not only) of a rational judgment but of an attitude of exasperation towards politicians with multiple offices. Capturing this kind of insight is what we call quali-quantitative.
Concretely, our approach will be based on the following types of analysis:
- Entity identification: the first phase of the analysis, in which we identify the “objects” (persons, services, roles, etc.) people talk about. It is important to stress that these are in no sense “topics” but mentions of things that exist in real life, like a special committee, a certain person, or a local tax. For instance, the top ten entities identified in the GDN are impôt, taxe, transports en commun, déchets, chauffage, pollution, vélo, vote blanc, aides sociales, migrants.
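As an illustration only (not our actual engine), entity spotting against a gazetteer can be sketched as follows; the lexicon is a tiny invented sample of the entities listed above:

```python
from collections import Counter

# Toy gazetteer of entity surface forms; a real system would use
# morphological analysis and a much larger curated lexicon.
ENTITY_LEXICON = {"impôt", "taxe", "transports en commun", "vélo", "vote blanc"}

def spot_entities(sentence):
    """Return the lexicon entities mentioned in a sentence (longest match first)."""
    text = sentence.lower()
    return [e for e in sorted(ENTITY_LEXICON, key=len, reverse=True) if e in text]

corpus = [
    "Il faut baisser l'impôt sur le revenu",
    "Développer les transports en commun et le vélo",
]
counts = Counter(e for s in corpus for e in spot_entities(s))
print(counts.most_common())
```

Counting spotted mentions over the whole corpus is what produces rankings like the top-ten list above.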
- Feature identification: how people characterize the entities we just retrieved, using which adjectives, which nouns, which verbs, etc. For instance, the features associated with the entity “réseaux sociaux” in the GDN are the following:
It goes without saying that these features can be grouped into normalized semantic groups. For instance, we could state that “contrôle” (control), “régulation” (regulation) and “réglementation” (rules) all contribute to the abstract feature REGULATION.
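A minimal sketch of such a normalization table (the mapping itself is hypothetical; in practice it would be curated by linguists):

```python
# Hypothetical mapping from surface features to abstract semantic groups.
FEATURE_GROUPS = {
    "contrôle": "REGULATION",
    "régulation": "REGULATION",
    "réglementation": "REGULATION",
}

def normalize_feature(surface):
    """Map a surface feature to its semantic group; unknown features
    become their own (upper-cased) singleton group."""
    return FEATURE_GROUPS.get(surface.lower(), surface.upper())

print(normalize_feature("Régulation"))  # → REGULATION
```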
- Sentiment identification: the user’s appreciation of a certain entity, possibly with respect to a certain feature, characterized simply in terms of negative or positive polarity. For instance, in our example, a sentence such as “il faut absolument supprimer l'anonymat des réseaux sociaux qui favorise honteusement la lâcheté” (the anonymity of social networks, which shamefully encourages cowardice, absolutely must be abolished) would be characterized as «réseaux sociaux»-«anonymat»-negative.
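A toy version of this entity-feature-polarity triple could look like the sketch below. The cue lexicons are invented, and a real system would use syntactic attachment to link each cue to the right target rather than a bag-of-cues score:

```python
# Invented polarity cue lexicons for illustration only.
NEGATIVE_CUES = {"supprimer", "honteusement", "lâcheté"}
POSITIVE_CUES = {"favoriser", "améliorer"}

def polarity(sentence, entity, feature):
    """Return an (entity, feature, polarity) triple for a sentence."""
    words = sentence.lower().split()
    score = (sum(w in POSITIVE_CUES for w in words)
             - sum(w in NEGATIVE_CUES for w in words))
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return (entity, feature, label)

triple = polarity(
    "il faut absolument supprimer l'anonymat des réseaux sociaux "
    "qui favorise honteusement la lâcheté",
    "réseaux sociaux", "anonymat")
print(triple)  # → ('réseaux sociaux', 'anonymat', 'negative')
```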
- Emotion identification: both in marketing and in societal analysis, it is important to capture the emotions a certain entity (or its features) might trigger. In the case of the GDN, we have adopted the six basic Ekman emotions plus some more “text-oriented” emotions, such as TRUST. For instance, in the following sentence, “Je suis les informations et ça me rend triste, malheureuse, et inquiète.” (I follow the news and it makes me sad, unhappy, and worried.), we detected both SADNESS and FEAR.
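Lexicon-driven emotion detection can be sketched like so; the cue entries are invented (only the Ekman-style labels are standard), and the tokenization is deliberately naive:

```python
# Toy emotion cue lexicon (invented entries) over Ekman emotions plus TRUST.
EMOTION_LEXICON = {
    "triste": "SADNESS",
    "malheureuse": "SADNESS",
    "inquiète": "FEAR",
    "confiance": "TRUST",
}

def detect_emotions(sentence):
    """Return the set of emotions whose cue words appear in the sentence."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    return {EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON}

emotions = detect_emotions(
    "Je suis les informations et ça me rend triste , malheureuse , et inquiète .")
print(emotions)
```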
- Concepts: these are simply the concepts appearing in the text, with no a priori domain restriction. Concepts are important for co-occurrence analysis, as they allow the identification of links between entities and provide a fast summary of large portions of a corpus. This is, for instance, the concept cloud built from all the sentences mentioning Castaner:
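Once concepts have been extracted per sentence, co-occurrence analysis reduces to counting concept pairs; the concept lists below are invented placeholders, not actual GDN extractions:

```python
from collections import Counter
from itertools import combinations

# Invented per-sentence concept lists, standing in for the extractor's output.
sentences = [
    ["castaner", "police", "manifestation"],
    ["castaner", "sécurité"],
    ["police", "manifestation"],
]

# Count how often each unordered pair of concepts shares a sentence.
cooc = Counter()
for concepts in sentences:
    for a, b in combinations(sorted(set(concepts)), 2):
        cooc[(a, b)] += 1

print(cooc.most_common(3))
```

The resulting pair counts are exactly the weights behind link visualizations and concept clouds.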
- Insight identification: this is the phase where qualitative analysis skills matter most. The crucial step is to start asking questions of the corpus and gathering answers. However, the researcher/analyst must pose the “right” questions, and those cannot be retrieved by a purely statistical analysis of the corpus. It is a highly interactive process in which the analyst and the linguist work closely together to extract high-quality insights, where the quality is guaranteed by the fact that the whole corpus is syntactically and semantically preprocessed and the extraction rules are finely tuned by skilled linguists. For instance, we could ask something such as “which kinds of persons/roles would French respondents get rid of?” We could include in this category both explicit demands for resignation (mostly in the case of a single person) and reductions in the number of members of certain institutions. These are the top entities people are asking to reduce or remove:
| Entity | Mentions |
| --- | --- |
| nombre de députés | 4,380 |
| nombre de parlementaires | 1,897 |
It is evident that the main concern here is not individual personalities; rather, people feel that the political infrastructure of the state is too heavy, especially as far as the Senate is concerned. Coming to individual personalities, the demand for the resignation of the French president Macron appears only in 14th position, immediately followed by the prime minister in 15th position. However, these data must be handled carefully. The statistics come from a fine-grained analysis of the whole GDN corpus, but no normalization has been performed yet. So, for instance, “Sénat” and “sénateurs” are not considered the same entity, nor are “Président”, “Macron” and “chef”. At the time of writing, we are waiting for a pool of experts to decide which kinds of categories should be grouped together. This was just an example of what we mean by “high-quality insights”.
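The normalization step just mentioned amounts to merging counts under expert-decided aliases. A sketch (the figures for “nombre de députés” and “nombre de parlementaires” come from the table above; the “Sénat”/“sénateurs” counts and the alias table are invented):

```python
from collections import Counter

# Raw, un-normalized entity counts; starred values are invented examples.
raw = Counter({
    "nombre de députés": 4380,
    "nombre de parlementaires": 1897,
    "Sénat": 1200,        # invented
    "sénateurs": 800,     # invented
})

# Hypothetical alias table, as a pool of experts might decide it.
ALIASES = {"sénateurs": "Sénat"}

normalized = Counter()
for entity, n in raw.items():
    normalized[ALIASES.get(entity, entity)] += n

print(normalized["Sénat"])  # → 2000
```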
In the coming weeks, we will provide a fine-grained analysis of the GDN corpus from different perspectives. For the time being, we will just provide some coarse-grained facts about it. The dataset was downloaded from https://granddebat.fr/pages/donnees-ouvertes. No data cleansing was performed during import, but, in general, we have analyzed only texts that answer questions asking for free text only. Mixed questions such as “if not in the list, please mention another one” have not been taken into account (the list of all questions taken into account is downloadable at the end of this article). Even with this limitation, we end up with 4,572,896 answers, which can be further split into 9,229,607 sentences written between 22/01/2019 and 19/03/2019.
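The filtering just described (keep free-text answers, drop mixed questions, split into sentences) can be sketched as follows; the sample records and the naive full-stop sentence splitter are illustrative only, a real pipeline would use a proper segmenter:

```python
# Hypothetical records of (question_type, answer_text).
records = [
    ("text", "Il faut réduire les impôts. Et simplifier l'administration."),
    ("mixed", "Autre: voir ci-dessus"),       # dropped: mixed question
    ("text", "Développer le vélo en ville."),
]

# Keep only answers to free-text questions.
answers = [a for kind, a in records if kind == "text"]

# Naive sentence split on full stops.
sentences = [s.strip() for a in answers for s in a.split(".") if s.strip()]

print(len(answers), len(sentences))  # → 2 3
```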
There were four “discussion tracks”. In terms of sentences, these are the relative shares:
And in absolute terms :
In terms of concepts (meaningful words or groups of words, i.e. noun phrases), this is the overall tag cloud (the underlying data tables are provided as an attachment at the end of this post):
According to the different tracks of the GDN, these are the emerging concept clouds:
Make your own analysis
We would like to make our platform with the analyzed GDN available to everybody. Unfortunately, our application is designed for a few hundred simultaneous users and would not withstand the traffic generated by massive access. However, if you are a researcher, a market research agency, an analyst or an institution for societal studies, please send me an email (from your professional address) and we will be happy to grant you access to the platform. My email is firstname.lastname@example.org, except that my first name is not Philippe and my last name is not Marlowe ;)
More to come on the GDN in the next weeks!
Links to the raw values on which concept clouds were built: