
Building a medical search engine — Step 1: medical word embeddings

When treating a patient, physicians have to choose the most suitable treatment among 3,600 different molecules. This requires going through long product descriptions to identify, for instance, the right drug for a given diagnosis that is compatible with the patient’s profile.


Much of this information is publicly available on the French Public Drug Database or the European Medicines Agency, but finding the relevant piece of information is a time-consuming task.

One of the major challenges in finding the correct piece of information for a query in natural language is accurately identifying the named entities in the query. This entails detecting dosages, drugs and diseases and linking each of them to a unique identifier. This is exactly the job of a Named Entity Recognition and Linking (NERL) model, so this is what we will dive into here.

But first, let’s go over the peculiar challenges of building a Machine Learning model based on French medical texts!

Why use Machine Learning?

The main reason to use Machine Learning models rather than a rule-based system is their ability to generalize to situations that you have not encountered or envisioned. In our case, queries come in varied forms and diseases may be mentioned in a variety of ways.

For example, given the query ulcère veineux (venous ulcer), the model should recognize Varices ulcérées (ulcerated varicose veins), and it should propose céphalée when asked for mal de tête (headache). Spelling mistakes and diverse phrasings for a given entity make the space of possibilities too large to rely solely on manually defined rules or lists.

The trained model must meet the following criteria:

  • Trainable with a low amount of text data
  • Fast inference: <100 ms
  • Able to detect several types of entities
  • Able to link a detected named entity to a specific entity among several thousand entities
  • Able to detect previously unseen entities

To meet these criteria, we selected a standard Machine Learning pipeline (with a few twists). Its first part determines how words are translated into vectors so that they can be processed by the Named Entity Recognizer.

Training Medical word embeddings. Not in English

One major challenge in the definition of the NERL model was the choice of word representation. The rest of this post will focus on the choice we ultimately settled on, in consideration of the following aspects and constraints:

  • Relatively low amount of freely available medical text data in French. Training recent transformer-based language models from scratch requires a very large amount of text (upwards of 10 GB). The alternative is fine-tuning, i.e. adapting a model pre-trained on a general domain (such as Wikipedia or news reports) to the biomedical domain, which is necessary because biomedical text has a highly specific vocabulary. Our attempts at fine-tuning a pre-trained transformer-based model have not been fruitful so far, possibly because we have too little text data; this remains an avenue for further improvement.
  • Crucial morphological information. In the medical domain, the approximate meaning of a word can frequently be inferred from its subwords. This is especially useful for rare words, because word embeddings trained at the word level need many occurrences of a word to represent it well. For example, the words ‘hypothyroïdie’ (hypothyroidism) and ‘thyroïdite’ (thyroiditis) share the subword ‘thyroïd’, which indicates that they are diseases related to the same body part, the thyroid.

Many more examples among diseases (‘encéphalopathie’ and ‘céphalée’) and drug classes (‘déméclocycline’ and ‘chlortétracycline’) show that subwords carry significant information. It can be used to learn useful representations of rare words from their morphological similarity to more frequent words.

These two important aspects of the medical domain led us to choose the FastText model [Bojanowski et al., 2016].

The FastText model

In short, this model is an extension of the Word2Vec model introduced in [Mikolov et al., 2013], which learns embeddings by iterating over each sentence in a corpus and learning to predict the context of each word. Words that are semantically similar and appear in similar contexts then end up in the same region of the vector space, which helps the NERL model detect different types of entities.
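To make "predicting the context of each word" concrete, here is a minimal sketch of how (word, context) training pairs can be extracted from a sentence in the skip-gram flavour of Word2Vec. The function name, the window size and the example sentence are assumptions made for illustration; they are not taken from our actual training code.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


# Hypothetical, already tokenized sentence.
sentence = ["le", "cardiologue", "examine", "le", "coeur"]
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

During training, the model is then asked to give a high score to these observed pairs and a low score to randomly sampled ones, which is what pulls words with similar contexts together.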

Figure: PCA representation of the vectors for a few medical specialties and their related parts of the anatomy. The figure highlights two properties of this representation. First, the word vectors group together entities of the same type, with medical specialties on the left and body parts on the right. Second, both groups of entities have similar layouts, so one can infer the specialty related to an organ from the word vector of that organ and the vectors of a known (organ, specialty) pair.

With the Word2Vec model, each word is associated with a single vector. FastText extends this method by also learning a vector for each character n-gram it encounters. An n-gram is a sequence of n consecutive characters in a word, e.g. the word coeur (heart) is decomposed into the 3-grams <co, coe, oeu, eur, ur>. The ‘<’ and ‘>’ characters are added to mark the n-grams that start and end a word.
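The decomposition can be sketched in a few lines of Python. This is only an illustration of the idea with a hypothetical helper name; FastText itself extracts n-grams over a range of lengths rather than a single value of n.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("coeur"))
# ['<co', 'coe', 'oeu', 'eur', 'ur>']
```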

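Because words that share several n-grams also share the corresponding vectors, their representations end up close in space, which reflects that they are probably semantically similar. It also makes the model more robust to spelling mistakes, since a misspelled word still shares most of its n-grams with the correct spelling.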
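The toy sketch below illustrates this effect. It builds a word vector as the sum of its n-gram vectors, which is how FastText composes them, but the n-gram vectors here are random and untrained, and the dimension, the n-gram range and the example words are all assumptions chosen for the demonstration. Even without any training, a misspelling lands close to the correct word simply because most n-grams are shared.

```python
import math
import random
from typing import List

DIM = 100  # arbitrary embedding size for this toy example


def char_ngrams(word: str, min_n: int = 3, max_n: int = 5) -> List[str]:
    """Character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]


def ngram_vector(ngram: str) -> List[float]:
    """Stand-in for a learned vector: random, but always the same for a given n-gram."""
    rng = random.Random(ngram)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def word_vector(word: str) -> List[float]:
    """Sum of the vectors of all n-grams of the word."""
    vec = [0.0] * DIM
    for ngram in char_ngrams(word):
        for k, value in enumerate(ngram_vector(ngram)):
            vec[k] += value
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


correct = word_vector("insuffisance")
typo = word_vector("insufisance")      # one 'f' missing
unrelated = word_vector("cardiologue")  # no n-gram in common

print("typo similarity:     ", round(cosine(correct, typo), 2))
print("unrelated similarity:", round(cosine(correct, unrelated), 2))
```

In a trained model the n-gram vectors also carry meaning learned from context, so the effect goes beyond spelling and extends to morphologically related words.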

Finally, in the interest of brevity, we have omitted the detailed description of several important steps we added to the pipeline, such as:

  • Tokenization, which separates the input text into a sequence of tokens
  • Lemmatization, which decreases noise and the number of different words
  • Phrasing, which enables learning more accurate representations for sets of words that often occur together, such as insuffisance rénale (kidney failure), following [Mikolov et al., 2013]
  • Ontology Sequence Generation, which complements the text corpus with sequences of similar entities, generated with a random walk on medical ontologies such as ICD-10 (International Classification of Diseases). The representations of entities missing from the text corpus can then be learnt from these sequences, while integrating the knowledge of these ontologies into the word vectors. This is inspired by [Zhang et al., 2019]

Figure: Graph representation of the International Classification of Diseases, by [Garcia-Albornoz and Nielsen, 2015], licensed under CC BY 4.0. Extracting sequences of diseases with a random walk improves the detection of entities rarely mentioned in our text corpus.
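As an illustration of the ontology sequence generation step, here is a minimal sketch of a random walk over a graph of related entities. The graph below is a tiny hand-written toy with made-up node labels, not an extract of ICD-10, and the function names and parameters are assumptions; the actual step may differ in how neighbours are chosen and how walks are sampled.

```python
import random
from typing import Dict, List

# Toy ontology: each node lists its neighbours (parent and children).
ONTOLOGY: Dict[str, List[str]] = {
    "maladies_thyroide": ["hypothyroidie", "thyroidite", "maladies_endocriniennes"],
    "hypothyroidie": ["maladies_thyroide"],
    "thyroidite": ["maladies_thyroide"],
    "maladies_endocriniennes": ["maladies_thyroide", "diabete"],
    "diabete": ["maladies_endocriniennes"],
}


def random_walk(graph: Dict[str, List[str]], start: str,
                length: int, rng: random.Random) -> List[str]:
    """Walk from `start`, moving to a uniformly chosen neighbour at each step."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk


def generate_sequences(graph: Dict[str, List[str]], walks_per_node: int,
                       length: int, seed: int = 0) -> List[List[str]]:
    """Generate `walks_per_node` walks starting from every node of the graph."""
    rng = random.Random(seed)
    return [random_walk(graph, node, length, rng)
            for node in graph
            for _ in range(walks_per_node)]


sequences = generate_sequences(ONTOLOGY, walks_per_node=2, length=4)
for seq in sequences[:3]:
    print(" ".join(seq))
```

Each walk can then be appended to the training corpus as if it were a sentence, so that entities that are close in the ontology appear in each other's context.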

The encoding pipeline in action

Here is a short example of how the query “contre-indication à l’insuffisance rénale” is processed in this first step, from a sequence of characters into a sequence of vectors:

  1. Tokenized: contre-indication, a, l’, insuffisance, renale
  2. Lemmatized: contre-indication, a, le, insuffisance, renal
  3. Phrased: contre-indication, a, le, insuffisance_renal
  4. Encoded: [v_0, v_1, v_2, v_3], where v_i is the word embedding of token i
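The first three steps above can be sketched as follows. The lemma table, the phrase list and the tokenization rule are tiny hypothetical stand-ins written for this one query; a real pipeline would rely on a proper lemmatizer and on phrases learnt from the corpus. The final encoding step, not shown here, simply looks up the FastText vector of each resulting token.

```python
import re
import unicodedata
from typing import List

# Hypothetical lookup tables, just large enough for the example query.
LEMMAS = {"l'": "le", "renale": "renal"}
PHRASES = {("insuffisance", "renal")}


def strip_accents(text: str) -> str:
    """Remove combining accent marks (é -> e, à -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def tokenize(text: str) -> List[str]:
    """Lowercase, strip accents, and split while keeping hyphenated words and elisions."""
    text = strip_accents(text.lower()).replace("’", "'")
    return re.findall(r"\w+'|\w+(?:-\w+)*", text)


def lemmatize(tokens: List[str]) -> List[str]:
    return [LEMMAS.get(token, token) for token in tokens]


def merge_phrases(tokens: List[str]) -> List[str]:
    """Join known two-word phrases with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(query: str) -> List[str]:
    return merge_phrases(lemmatize(tokenize(query)))


print(preprocess("contre-indication à l’insuffisance rénale"))
# ['contre-indication', 'a', 'le', 'insuffisance_renal']
```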

In this part, we saw that FastText requires only a moderate amount of unlabeled text data and computing power, and that its use of subwords is well adapted to the particular aspects of the medical domain, which makes it a very good candidate. It is the heart of the encoding pipeline, since the resulting word vectors are the cornerstone of the ML pipeline, but the additional steps listed above are important to make the model most effective. These vectors will be used in the NERL model to detect entities, recognize their type and identify which entity they refer to. All this will be described in a post coming soon!

If you have any feedback or questions regarding this first step of the pipeline, our corpus or anything else, don’t hesitate to contact me :)


  1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).
  2. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
  3. Garcia-Albornoz, M., & Nielsen, J. (2015). Finding directionality and gene-disease predictions in disease associations. BMC Systems Biology, 9, 35.
  4. Zhang, Y., Chen, Q., Yang, Z., Lin, H., & Lu, Z. (2019). BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6, 52.

François Plesse
Data Scientist
