Fast, contextualized answers to your prescribing questions
Tuesday, July 28, 2020
When treating a patient, physicians have to choose the most suitable treatment among 3,600 different molecules. This requires going through long product descriptions to identify, for instance, the right drug for a given diagnosis that is compatible with the patient’s profile.
Much of this information is publicly available on the French Public Drug Database or the European Medicines Agency, but finding the relevant piece of information is a time-consuming task.
At Posos, we aim to make this information readily available in as little time as possible.
One of the major challenges of finding the correct piece of information for a natural language query is accurately identifying the named entity, or set of named entities, that the query mentions. This entails detecting dosages, drugs and diseases, and linking each of them to a unique identifier. This is exactly the job of a Named Entity Recognition and Linking (NERL) model, so this is what we will dive into here.
But first, let’s go over the peculiar challenges of building a Machine Learning model based on French medical texts!
The main reason to prefer Machine Learning models over rule-based systems is their ability to generalize to situations that you have not encountered or envisioned. In our case, queries come in varied forms and a disease may be mentioned in many different ways.
For example, given the query ulcère veineux (venous ulcer), the model should recognize Varices ulcérées (ulcerated varicose veins), and it should propose céphalée when asked for mal de tête (headache). Spelling mistakes and diverse phrasings for a given entity make the space of possibilities too large to rely solely on manually defined rules or lists.
The trained model must meet the criteria listed below:
To meet these criteria, we selected a standard Machine Learning pipeline (with a few twists). Its first part determines how words are translated into vectors so that they can be processed by the Named Entity Recognizer.
One major challenge in the definition of the NERL model was the choice of word representation. The rest of this post will focus on the choice we ultimately settled on, in consideration of the following aspects and constraints:
Many more examples of diseases (‘encéphalopathie’ and ‘céphalée’) and drug classes (‘déméclocycline’ and ‘chlortétracycline’) show that subwords carry significant information. This information can be used to learn useful representations of rare words from their morphological similarity to more frequent words.
These two important aspects of the medical domain led us to choose the FastText model [Bojanowski et al., 2016].
Put simply, this model builds on the Word2Vec model introduced in [Mikolov et al., 2013], which learns embeddings by iterating over each sentence in a corpus and learning to predict the context of each word. Words that are semantically similar and appear in similar contexts end up in a close region of the space, which helps the NERL model detect different types of entities.
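To make "predicting the context of each word" more concrete, here is a minimal sketch of how (word, context word) training pairs can be generated from a sentence in the skip-gram flavour of Word2Vec. The example sentence, the function name skipgram_pairs and the window size are assumptions chosen for illustration, not details of our pipeline.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each word and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up sentence: "intense headache for three days"
sentence = ["céphalée", "intense", "depuis", "trois", "jours"]
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

During training, the model adjusts the vectors so that each center word is good at predicting its context words, which is why words used in similar contexts end up with similar vectors.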
With the Word2Vec model, each word is associated with a vector. FastText extends this method by introducing a vector for each encountered n-gram. An n-gram is a sequence of n characters in a word, e.g. the word coeur (heart) is decomposed into the 3-grams <co, coe, oeu, eur, ur>. The ‘<’ and ‘>’ characters are added to identify n-grams that start and end a word.
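The decomposition described above can be sketched in a few lines of Python. The function name char_ngrams and its parameters are illustrative, not part of the FastText library.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Return the character n-grams of `word` after adding '<' and '>' boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n characters over the padded word.
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


trigrams = char_ngrams("coeur")
print(trigrams)  # ['<co', 'coe', 'oeu', 'eur', 'ur>']
```

FastText typically combines several n-gram lengths (3 to 6 characters by default), which corresponds here to calling char_ngrams(word, 3, 6).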
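Since the representation of a word is built from the vectors of its n-grams, words sharing several n-grams will be close in space, and we can infer that they are probably semantically similar. Furthermore, this makes the model robust to spelling mistakes, as a misspelled word still shares most of its n-grams with the correct spelling.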
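As a rough illustration of why shared n-grams matter, the sketch below measures the overlap between the 3-gram sets of two words with a Jaccard similarity. This is only a proxy for closeness in the embedding space, not how FastText computes similarity, and the misspelling "céphalé" is a made-up example.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of `word`, with '<' and '>' boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two words (0 = disjoint, 1 = identical)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


misspelled = ngram_overlap("céphalée", "céphalé")        # final letter missing
related = ngram_overlap("céphalée", "encéphalopathie")   # shared root
unrelated = ngram_overlap("céphalée", "coeur")           # nothing in common

print(f"{misspelled:.2f} {related:.2f} {unrelated:.2f}")  # 0.67 0.21 0.00
```

The misspelled form keeps most of its n-grams, the morphologically related word keeps a few, and the unrelated word keeps none, which mirrors the ordering we would like to see between the corresponding vectors.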
Finally, in the interest of time, we have omitted the description of several important steps we added to the pipeline such as
Here is a short example of how the query “contre-indication à l’insuffisance rénale” is processed in this first step, from a sequence of characters into a sequence of vectors:
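A minimal sketch of this first step, under strong assumptions: the tokenization rule (lowercase, split on anything that is not a letter, digit or hyphen), the toy dimension of 8 and the pseudo-random n-gram vectors are all placeholders. A real pipeline would look up n-gram vectors learnt by a trained FastText model instead of generating them.

```python
import random
import re
import zlib

DIM = 8  # toy dimension (assumption); trained embeddings are usually much larger


def tokenize(query):
    """Lowercase the query and split it on anything that is not a word character or hyphen."""
    return re.findall(r"[\w-]+", query.lower())


def ngram_vector(gram):
    """Deterministic pseudo-random vector standing in for a learnt n-gram embedding."""
    rng = random.Random(zlib.crc32(gram.encode("utf-8")))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def word_vector(word, n=3):
    """Sum the vectors of the word's character n-grams."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    vector = [0.0] * DIM
    for gram in grams:
        for k, value in enumerate(ngram_vector(gram)):
            vector[k] += value
    return vector


query = "contre-indication à l’insuffisance rénale"
tokens = tokenize(query)
vectors = [word_vector(token) for token in tokens]

print(tokens)  # ['contre-indication', 'à', 'l', 'insuffisance', 'rénale']
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The output is one vector per token, in the order of the query, which is the form of input a sequence model such as a Named Entity Recognizer can consume.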
In this part, we learnt that FastText requires only a moderate amount of unlabeled text data and computing power, and that it is well adapted to the particular aspects of the medical domain, which makes it a very good candidate. It is the heart of the encoding pipeline, since the resulting word vectors are the cornerstone of the ML pipeline, but additional steps are needed to make the model as effective as possible. These vectors will be used in the NERL model to detect entities, recognize their type and determine which entity they refer to. All this will be described in a post coming soon!
If you have any feedback or questions regarding this first step of the pipeline, our corpus or anything else, don’t hesitate to contact me :)
References