One of my PhD advisors, Hans Uszkoreit, used to say “text is the fabric of the web”. It was true in the nineties and it is even truer now, with the expansion of social media. If text, i.e. language, is the backbone of the web, it is natural that the automatic analysis of online conversations takes on an outstanding importance. It goes by many names: “semantic analysis”, “text mining”, “text analytics”, “Natural Language Processing” (NLP), etc. In the end the goal is the same: to transform text into structured data that computers can understand and programs can use.
Recently, text analytics has been boosted by advances in artificial intelligence (AI), with special importance attributed to deep learning. I will not comment on the over-optimistic attitudes these technologies have inspired, nor on the largely unjustified buzz they have raised. Here you can find a talk on the matter (in French, with English slides). A more important consequence is that nowadays almost every company in the domain of marketing, customer relationship management, media monitoring, etc. claims to have semantic analysis integrated into its platform. In this article I will try to show that the analysis of human conversations is a difficult task, one that cannot be seriously tackled just by adopting some available open-source technology.
I will start with an almost trivial statement: semantic analysis is about a specific language; there is no such thing as “language-independent semantic analysis”. This makes me suspicious when I see relatively small companies claiming “we deal with 30+ languages”. You what?! Either by “dealing” you mean counting the words on a page, or you are Google, IBM, or Nuance. Dealing with 30 languages means having at least two native-speaker employees per language, just for maintenance operations, which is a substantial investment. The answer to this objection is invariably: “no, we use machine learning technologies, which are language independent”. Well, in this article I will argue two points:
- The language independence of machine learning is an unsupported claim, even when we take into account the most recent AI technologies such as deep learning.
- To be effective in real contexts, deep learning must be coupled with symbolic analysis, which presupposes having native-speaker researchers inside the organization.
The Myth of Language Independence
The concept behind machine learning (including deep learning) is quite simple: you give a set of positive and negative examples to an algorithm, and the algorithm learns how to classify subsequent unseen examples. For instance, you give your algorithm a set of 10,000 tweets with positive sentiment and 10,000 with negative sentiment, and you get a program able to tell you whether any new tweet is positive or negative. As simple as that? Yes, but you must have the 20,000 manually classified tweets, and building such a set (we call it a “corpus”) costs a lot of money. And you must have one for every language you deal with, which is much more money. Possible objection: you buy it once and then you are done. False: machine learning suffers from a problem called “corpus overfitting”. Let’s see an example.
One of the best systems for sentiment analysis is probably Stanford University’s (cf. “Deeply Moving: Deep Learning for Sentiment Analysis” and “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank”). It scored very high in several international competitions and is surely among the best systems in the world. I also personally know some of the researchers involved, and I can assure you that they are very serious people. Still, if you test the system with the sentence “the watch is always 4 minutes ahead, I will ask for a reimbursement.”, this is what you get:
The sentence is considered positive. Why? Well, the model was trained on (i.e. learnt from) a corpus oriented towards movie reviews, so it works spectacularly well at guessing the tonality of a movie review, but it underperforms in domains different from the one it was trained on. This is what we call corpus overfitting, and it brings us back to the idea of “buy a corpus in language X and use the learnt model forever”. No: you must buy or build one for every domain you are going to deal with! One for watches, one for food, one for cars, etc. And this, of course, is not sustainable.
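The failure mode is easy to reproduce in miniature. The sketch below assumes a hypothetical word-weight lexicon learnt from movie reviews, where “ahead” received a positive weight (think of phrases like “ahead of its time”); applied to the watch sentence, those weights give the wrong answer:

```python
# Hypothetical word weights learnt from movie reviews: "ahead" is positive there
# because of phrases like "ahead of its time" (all weights invented for illustration).
movie_lexicon = {"ahead": 1.0, "masterpiece": 2.0, "boring": -2.0, "reimbursement": 0.0}

def score(text, lexicon):
    """Sum the lexicon weights of the words in the text (unknown words count as 0)."""
    return sum(lexicon.get(w.strip(".,"), 0.0) for w in text.lower().split())

review = "the watch is always 4 minutes ahead, I will ask for a reimbursement."
print("positive" if score(review, movie_lexicon) > 0 else "negative")  # positive: wrong!
```

Nothing in the model is broken; the weights are simply tuned for the wrong domain.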
There is another, more academic argument against the alleged language independence of machine learning methods. If it were real, then in international evaluations covering many languages we should expect grosso modo equivalent results across languages. This is not the case, and everyone can notice that semantic analysis of English texts is systematically better than that of other languages. Why? The question is largely debated in the scientific community, but the short answer is: language resources. There are tons of lexica, thesauri, dictionaries, grammars, corpora, etc. that have been developed for English (each one at a high development cost), but the situation is different for other languages. And without those resources, the precision of any semantic analysis is necessarily lower, sometimes to the point of being unusable.
The Myth of Corpus-in Solution-out
The other crucial point about the blind adoption of machine learning and deep learning technologies is their reliability in a real production context. The context here is important: in language evaluation campaigns, machine learning (ML) methods have already proved to be highly effective. The point is: are they effective enough to deliver reliable semantic analysis services?
Again, I would like to start from a concrete example. Vector-based semantics (aka “matrix semantics”, aka “word embeddings”) is one of the nicest applications of neural networks in natural language processing (strictly speaking, it is not deep learning, as only one layer is used). Word embeddings have the enormous advantage of capturing the semantics of words without the need for an annotated corpus: all you need is a reasonably big corpus. After a training phase, you just type a word and you get, for instance, its synonyms. A good demo is available on the Rare Technologies site (at the end of the page). Now, type the word “time” and click “Get Most Similar”: you get quite nice near-synonyms, such as “day”, “moment”, “period”, etc. So far so good. Now try the same with the word “gin”. What you get are not synonyms but merely related spirits: “whiskey”, “rum”, “brandy”, etc. What about an application that relies entirely on these word embeddings to provide results? Well, it will evidently confuse opinions about gin with opinions about rum, which is far from satisfactory.
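Under the hood, “Get Most Similar” is just a nearest-neighbour search by cosine similarity in the vector space. A minimal sketch with hand-made toy vectors (the numbers are invented for illustration, not real embeddings) reproduces the effect: the words closest to “gin” are related spirits, not synonyms:

```python
import math

# Hand-made toy vectors (invented for illustration; real embeddings have hundreds of dimensions)
vecs = {
    "gin":     [0.90, 0.10, 0.00],
    "whiskey": [0.85, 0.15, 0.00],
    "rum":     [0.80, 0.20, 0.05],
    "time":    [0.00, 0.10, 0.95],
    "day":     [0.05, 0.15, 0.90],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_similar(word, k=2):
    # Rank every other word by cosine similarity to the query word
    sims = sorted(((cosine(vecs[word], v), w) for w, v in vecs.items() if w != word),
                  reverse=True)
    return [w for _, w in sims[:k]]

print(most_similar("gin"))  # ['whiskey', 'rum']: related spirits, not synonyms
```

The geometry only knows that these words occur in similar contexts; it has no notion of “same thing” versus “related thing”, which is exactly the gin/rum confusion above.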
Another weak point of word embeddings is their inability (so far) to deal with one of the most complex problems of language, i.e. semantic ambiguity. Take the case of understanding the voice of the customer from a corpus of watch reviews. One part of the watch is evidently the case, which is a source of judgment with respect to its material, solidity, weight, aesthetics, etc. Unfortunately, if we ask our word-embedding web application for synonyms of “case”, we get things like “habeas_corpus_proceeding”, “prosecution”, “acquittal”… The explanation is evident: the system conflates the meaning of “case” as a mechanical part with that of “case” as a trial. There are researchers working on this problem, but the results are still not applicable in a production context.
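A toy sketch shows why a single vector per word fails here. A standard embedding assigns one vector to “case”, effectively a frequency-weighted mix of its senses; if the legal sense dominates the training corpus, legal neighbours win even in a watch-review query (all vectors below are invented for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Axis 0 = "legal" direction, axis 1 = "physical object" direction (toy 2-D space).
neighbors = {"prosecution": [0.95, 0.05], "bezel": [0.05, 0.95]}

# One embedding per word means one vector for all senses of "case".
# If the legal sense accounts for, say, 80% of its corpus occurrences:
case_vec = [0.8 * 1.0 + 0.2 * 0.0, 0.8 * 0.0 + 0.2 * 1.0]  # -> [0.8, 0.2]

best = max(neighbors, key=lambda w: cosine(case_vec, neighbors[w]))
print(best)  # prosecution: the legal sense wins even in a watch-review context
```

The watch-part sense is still present in the vector, but it is drowned out; no amount of querying can separate what training has already merged.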
The Reality of Hybrid Systems
All I said so far is not meant as a criticism of technologies such as deep learning or word embeddings. On the contrary, both are widely used behind the scenes. But the human must always stay in the loop to prevent “surprising” results. Of course, by human I do not mean someone revising the results of the system. Such a human is needed, no doubt; she is called an “analyst” and her work is crucial to making sense of the results. I rather mean professionals in the domain of Natural Language Processing, able to track language “traps” in specific domains and write rules to circumvent them. This integration of rules and statistical technologies yields what we call a hybrid system, i.e. a system 1) whose results are always predictable, 2) whose behavior is understandable, and 3) whose outcomes can be modified at any time without retraining. As rule writing takes time and money, it is tempting to downgrade to the statistics-only option. This might work for specific tasks such as document classification, but it will simply fail in a task as delicate as capturing user attitudes from social media for new product development. I will return to this topic in a future post.
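A minimal sketch of such a hybrid pipeline (all names, the placeholder model, and the rule itself are hypothetical): hand-written domain rules are checked first and override the statistical model, so a known trap is handled predictably and the behavior can be changed without retraining:

```python
import re

# Placeholder statistical model: pretend it learnt from movie reviews
# that "ahead" signals a positive opinion ("ahead of its time").
def statistical_sentiment(text):
    return "positive" if "ahead" in text.lower() else "negative"

# Hand-written domain rule for watch reviews: "N minutes ahead/behind" describes a defect.
RULES = [
    (re.compile(r"\d+\s+minutes?\s+(ahead|behind)"), "negative"),
]

def hybrid_sentiment(text):
    for pattern, label in RULES:
        if pattern.search(text.lower()):
            # A rule fired: the outcome is predictable, explainable,
            # and can be edited at any time without retraining the model.
            return label
    return statistical_sentiment(text)

print(hybrid_sentiment("the watch is always 4 minutes ahead"))  # negative
print(hybrid_sentiment("this film was ahead of its time"))      # positive
```

The rule layer is exactly where the native-speaker NLP professional earns her keep: each rule encodes a domain-specific language trap that no amount of out-of-domain training data will catch.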