Building a medical search engine — Step 2: Identifying medical entities in text.
In this article, you will learn about one of the most important applications of medical word embeddings: a machine learning model that identifies medical entities (such as drug names or diseases) in text.
As seen in our previous article, word embeddings can be used to obtain numerical representations of words that capture syntactic and semantic information. If these embeddings are trained on medical text, they give us a good vector representation of medical entities: words in text that have a medical meaning. For instance, “Lyme borreliosis” is a medical entity that belongs to the class of diseases, and “amoxicillin” is an entity in the class of drugs (see figure below). The process of identifying entity types in text is known in the Natural Language Processing (NLP) community as Named-Entity Recognition (NER).
In practice, an NER model predicts a label for each word of a given sentence. To formalise the output of the NER model, we can tag each word of the sentence “Lyme borreliosis can be treated with antibacterials” as follows:
- Lyme → B-Disease
- borreliosis → I-Disease
- can → O
- be → O
- treated → O
- with → O
- antibacterials → B-TherapeuticClass
This is known as the IOB tagging format (for Inside, Outside, Begin). Disease and TherapeuticClass indicate entity types. B- indicates the beginning of an entity (its only word, or the first word of an entity composed of several words, such as “Lyme borreliosis”); I- indicates that the current entity has not ended yet; and O tells us that a given word is not a medical entity.
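To make the tagging format concrete, here is a minimal sketch of how a sequence of IOB tags can be grouped back into entities. The function name `iob_to_entities` and the choice to silently drop a stray I- tag are assumptions made for this illustration, not a description of the Posos code.

```python
from typing import List, Optional, Tuple


def iob_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if there is one.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


tokens = ["Lyme", "borreliosis", "can", "be", "treated", "with", "antibacterials"]
tags = ["B-Disease", "I-Disease", "O", "O", "O", "O", "B-TherapeuticClass"]
entities = iob_to_entities(tokens, tags)
print(entities)
```

On the example sentence, the two words tagged B-Disease and I-Disease are merged into the single entity “Lyme borreliosis”, and “antibacterials” becomes an entity of its own.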
Common NER models are able to identify non-medical entity types such as “person” or “location”. At Posos, thanks to a thorough annotation campaign covering hundreds of thousands of medical extracts, we are able to detect entities that are relevant to medical practitioners.
Some of the entities that we detect at Posos are:
- Drugs (e.g., paracetamol; aspirin)
- Therapeutic classes (e.g., antihistamine; antibiotic)
- Diseases or symptoms (e.g., acute pancreatitis; Alzheimer’s disease)
- Doses (e.g., 500 mg/l)
- Dose forms (e.g., capsules; liquid solution)
- Routes of administration (e.g., intradermal; sublingual)
- Side or adverse effects (e.g., fever; headache)
- Population (e.g., age, weight)
From a machine learning point of view, NER can be seen as a classification problem for every word of the input text, where the classes are all the entity types plus an extra class O for non-medical words. The complexity of the task comes from the fact that the classification of a word depends on its context in the text. For instance, the word “Parkinson” should be tagged as Disease in the sentence “A 60-year-old patient with Parkinson’s disease”, but as O in “Mr. Parkinson is 60 years old”.
After identifying entity types in text, and in order to gain real knowledge about its content, we need to link the detected entities to entries in medical ontologies. This linking process allows us to know that “Lyme borreliosis” is a specific infectious disease that corresponds, for example, to the code A69.2 in the International Classification of Diseases, 10th Revision (ICD-10). This second task is known as Named-Entity Linking (NEL) (see figure below).
The NER and NEL tasks are very important for the Posos search engine, as they allow us to filter documents by the medical entities detected in the query. This makes Posos resilient to typos and semantic variations of medical expressions (see the example below).
Now that we have a rough idea of how to identify medical entities in text with a two-step process, we can describe in more detail the main components of the NER and NEL algorithms currently used in the Posos search engine.
Machine learning algorithms for NER and NEL
The NER algorithm that we implemented at Posos is a supervised learning model based on [Lample, 2016]. Supervised models require labeled (annotated) data to learn. One drawback of supervised models is that, in general, the data annotation process is expensive and time-consuming, as it requires manual annotation by experts.
At Posos, we launched an annotation campaign with healthcare professionals to manually annotate several hundred thousand medical extracts. The data is then transformed into the IOB tagging format and used to train an NER algorithm that relies on deep learning.
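The conversion of annotations into IOB tags can be sketched as follows. The annotation format used here, a list of `(start, end, label)` token spans with an exclusive end index, is an assumption made for the example; the article does not describe the actual format produced by the annotation campaign.

```python
from typing import List, Tuple


def spans_to_iob(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn token-level spans (start, end_exclusive, label) into IOB tags."""
    tags = ["O"] * n_tokens  # every token is outside an entity by default
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


tokens = ["Lyme", "borreliosis", "can", "be", "treated", "with", "antibacterials"]
# Hypothetical expert annotations for the sentence above.
annotations = [(0, 2, "Disease"), (6, 7, "TherapeuticClass")]
iob_tags = spans_to_iob(len(tokens), annotations)
for token, tag in zip(tokens, iob_tags):
    print(f"{token} -> {tag}")
```

The sketch assumes that annotated spans do not overlap; overlapping annotations would need an explicit resolution rule before training.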
The main components of the NER algorithm are bidirectional Long Short-Term Memory (bi-LSTM) layers [Huang, 2015], a type of neural network that considers the context of each token in the text (i.e., the surrounding words), followed by a Conditional Random Field (CRF) [Lafferty, 2001], a probabilistic model that scores transitions between the labels of consecutive words. For example, once the word “Parkinson’s” has been tagged B-Disease, the tag I-Disease is far more likely for the next word than it would be after a word tagged O.
To improve results, the algorithm takes as input both the syntax of words (an embedding based on their character structure) and their semantics (our medical word embeddings), as well as the position of tokens in the sentence (see figure below).
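In the architecture of [Lample, 2016], the CRF layer combines a per-word score for each label (produced by the bi-LSTM) with a score for each transition between consecutive labels, and the final prediction is the label sequence with the highest total score. The sketch below shows that decoding step (the Viterbi algorithm) in plain Python. All the scores are invented for illustration; in a trained model they would be learned, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the label sequence with the highest total score.

    emissions[t][label] is the score of `label` for the word at position t.
    transitions[(prev, cur)] is the score of going from `prev` to `cur`;
    pairs that are not listed score 0.
    """
    # best[t][label]: best score of any path that ends with `label` at position t.
    best = [{lab: emissions[0][lab] for lab in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-Disease", "I-Disease"]
# Invented scores for the words "Lyme", "borreliosis", "can".
emissions = [
    {"O": 1.0, "B-Disease": 0.9, "I-Disease": 0.1},
    {"O": 0.2, "B-Disease": 0.3, "I-Disease": 1.5},
    {"O": 2.0, "B-Disease": 0.1, "I-Disease": 0.1},
]
# Invented transition scores: I-Disease should follow B-Disease, not O.
transitions = {("B-Disease", "I-Disease"): 1.0, ("O", "I-Disease"): -5.0}

greedy = [max(labels, key=lambda lab: scores[lab]) for scores in emissions]
decoded = viterbi(emissions, transitions, labels)
print(greedy)   # per-word choice, ignores transitions
print(decoded)  # sequence-level choice
```

With these scores, choosing the best label for each word independently yields the inconsistent sequence O, I-Disease, O, whereas taking the transition scores into account yields B-Disease, I-Disease, O.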
Named-Entity Linking (NEL)
As in the NER task, our NEL algorithm uses our custom medical word embeddings to create an embedding representation of medical entities. Because medical entities are often composed of several words, we rely on sentence embedding techniques. The simplest sentence embedding strategy is to average the word embeddings of all the words in the sentence. For example, imagine that we have the following 3-dimensional embeddings:
- Chronic → (0.1, 0.7, 0.6)
- Conjunctivitis → (0.5, 0.9, 0.2)
Then, the sentence embedding (with the averaging strategy) of the disease “Chronic conjunctivitis” will be:
- Chronic conjunctivitis → (.1, .7, .6)*.5 + (.5, .9, .2)*.5 = (0.3, 0.8, 0.4)
This strategy is not optimal, as it assumes that all words in an entity are equally important. In the example, the word “chronic” is less important because many diseases contain it (e.g., chronic pancreatitis, chronic meningitis, etc.). Better sentence embeddings can be obtained if we consider the inverse document frequency (idf) of words, a weighting strategy that gives more importance to infrequent words in a document corpus, as they are usually more informative. For example, if the weights of “chronic” and “conjunctivitis” are 0.2 and 0.8 respectively (based on their frequency in the ICD-10 ontology), then:
- chronic conjunctivitis → (.1, .7, .6)*.2 + (.5, .9, .2)*.8 = (0.42, 0.86, 0.28)
Notice that this last entity embedding is really close to the embedding of the word conjunctivitis! This improves the performance of the NEL algorithm.
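The two strategies above can be reproduced with a few lines of Python. This is only a sketch using the toy 3-dimensional embeddings and the example weights from the text; the function name `entity_embedding` is hypothetical.

```python
from typing import Dict, List, Optional


def entity_embedding(
    words: List[str],
    embeddings: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of word embeddings; a plain average if no weights are given."""
    if weights is None:
        weights = {word: 1.0 for word in words}
    total = sum(weights[word] for word in words)
    dim = len(embeddings[words[0]])
    return [
        sum(weights[word] * embeddings[word][i] for word in words) / total
        for i in range(dim)
    ]


# Toy embeddings and weights taken from the example in the text.
embeddings = {"chronic": [0.1, 0.7, 0.6], "conjunctivitis": [0.5, 0.9, 0.2]}
idf_weights = {"chronic": 0.2, "conjunctivitis": 0.8}
entity = ["chronic", "conjunctivitis"]

plain = entity_embedding(entity, embeddings)
weighted = entity_embedding(entity, embeddings, idf_weights)
print([round(x, 2) for x in plain])     # [0.3, 0.8, 0.4]
print([round(x, 2) for x in weighted])  # [0.42, 0.86, 0.28]
```

Dividing by the sum of the weights keeps the result a proper weighted average even when the weights do not add up to 1.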
Once the sentence embedding strategy has been chosen, we can compute the embedding for every entity in our ontology and index them to perform fast linking on the detected entities in the NER step. For this, we use a nearest neighbour search on the entity embeddings: given an input embedding, this algorithm gives you the closest medical entity in terms of the cosine similarity between the query and the indexed entity embeddings.
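As an illustration of the linking step, here is a brute-force nearest neighbour search based on cosine similarity. The tiny index below is invented: only the vector of “chronic conjunctivitis” comes from the example above, the other vectors are made up, and a real ontology would call for a dedicated index rather than a linear scan. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_entity(query: List[float], index: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the indexed entity that is most similar to the query embedding."""
    best = max(index, key=lambda name: cosine_similarity(query, index[name]))
    return best, cosine_similarity(query, index[best])


# Toy index of entity embeddings (illustrative values only).
index = {
    "chronic conjunctivitis": [0.42, 0.86, 0.28],
    "chronic pancreatitis": [0.1, 0.3, 0.9],
    "Lyme borreliosis": [0.9, 0.1, 0.2],
}
# Query: the embedding of the single word "conjunctivitis" from the example above.
query = [0.5, 0.9, 0.2]
name, score = nearest_entity(query, index)
print(name, round(score, 3))
```

Here the query built from the word “conjunctivitis” alone is linked to “chronic conjunctivitis”, the closest entry in the toy index.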
Identifying medical entities in text is a crucial step towards a robust and reliable medical search engine. To accomplish this, Posos uses an NERL model (NER followed by NEL) that has been trained on hundreds of thousands of expert-annotated medical extracts.
Automatic entity detection allows us to process the queries made by medical practitioners in the Posos search engine. The NERL model is also used to tag our official sources, in order to support one of Posos’ main missions: finding the most pertinent documents for the questions asked by medical practitioners.
Now that we have a better idea of how the NERL algorithm works, we can move on to the next step of the ML pipeline: finding the most pertinent documents for a query.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.