
I am Na Li, a PhD student in computer science at the University of Amsterdam, currently focusing on dataset search and research asset discovery. In this post I will give you a basic introduction to word embeddings in light of semantic medical term representation.

In the CLARIFY project, natural language understanding is necessary when processing metadata, such as textual annotations and clinical data, for computational analysis. Word embeddings customized for medical term representation are an important and interesting direction that I am going to address in my research project.

Although natural language is one of the most powerful tools humans have for representing and passing on knowledge, computers speak in numbers. To enable computers to understand natural language, words, phrases and sentences must first be translated into the computer's "language". Representing words with numerical vectors is the basis for downstream natural language processing (NLP) tasks. Conventionally, many NLP models simply use one-hot encoding to represent words: each word becomes a vector whose length equals the vocabulary size and which contains a single non-zero element. However, one-hot encoding does not retain the semantic similarity of words in the vector representations. Mathematically speaking, the dot product between any two such vectors is always zero, which means the vectors are orthogonal and thus the words are treated as independent. But this assumption is obviously incorrect! For example, the word "car" is semantically closer to "vehicle" than to "apple". A good word representation should take semantics into account to better understand natural language.
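A minimal sketch makes this limitation concrete. Using a tiny three-word vocabulary (an assumption for illustration; real vocabularies contain tens of thousands of words), every pair of distinct one-hot vectors has a dot product of zero:

```python
import numpy as np

# Toy vocabulary chosen for illustration only.
vocab = ["car", "vehicle", "apple"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

car, vehicle, apple = (one_hot(w, vocab) for w in vocab)

# Every pair of distinct one-hot vectors is orthogonal, so "car" looks
# exactly as unrelated to "vehicle" as it does to "apple".
print(np.dot(car, vehicle))  # 0.0
print(np.dot(car, apple))    # 0.0
```

Because the encoding carries no notion of meaning, the geometry of these vectors cannot reflect that "car" and "vehicle" are related.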

Word embeddings, a special kind of vector representation based on the distributional hypothesis, model the linguistic relationships between words and integrate their semantics into the generated vectors. Therefore, semantically similar words usually have similar vectors. Word embedding methods generally produce dense, distributed, fixed-length vectors and have been widely used in NLP tasks. Well-known models for producing word embeddings include Word2vec, GloVe, ELMo and BERT. Furthermore, word embedding methods can be distinguished by whether the context of the target word is considered. Context-independent embeddings produce a single global representation for each word in a large corpus, whereas context-dependent embeddings map words to representations based on their context. Therefore, the same word will be assigned different representations when it appears in different contexts.
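To see how dense vectors capture similarity, consider the sketch below. The 4-dimensional embeddings are made-up toy values (real models such as Word2vec learn vectors of typically 100 to 300 dimensions from a corpus); cosine similarity between them is the usual way to compare word vectors:

```python
import numpy as np

# Hypothetical toy embeddings for illustration; real embeddings are
# learned from large text corpora, not hand-written.
embeddings = {
    "car":     np.array([0.8, 0.1, 0.6, 0.2]),
    "vehicle": np.array([0.7, 0.2, 0.5, 0.3]),
    "apple":   np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically similar words get vectors pointing in similar directions,
# so "car" vs "vehicle" scores much higher than "car" vs "apple".
print(cosine(embeddings["car"], embeddings["vehicle"]))  # close to 1
print(cosine(embeddings["car"], embeddings["apple"]))    # much lower
```

Unlike one-hot vectors, these dense representations can express graded degrees of relatedness between any pair of words.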

The benefits of movements and initiatives in digital pathology will be amplified by information retrieval technologies such as data search. Data search in this context will enable searching for similar whole-slide images (WSIs), annotations, etc. Though still in its infancy, it can provide reference cases to pathologists in tricky clinical diagnoses and supply image analysis research with relevant datasets.

The photo below shows my workplace: Science Park 904, Amsterdam.

Na Li – ESR1