The authors have declared that no competing interests exist.
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of these entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, that is, vector representations of concepts learned by machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed both intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation datasets and 8 million instances from three extrinsic evaluation datasets), which is, to the best of our knowledge, by far the most comprehensive evaluation of its kind. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings, by a large margin, in identifying similar and related concepts.
More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at
Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology, such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics, via vector representations (i.e., embeddings), of over 400,000 biological concepts mentioned in the entire collection of PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, going beyond exact term matching or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec outperforms existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
All the models and datasets are publicly available via
In the biomedical domain, one primary application of text mining is to extract knowledge within the biomedical literature automatically [
Since 2014, word embedding models have revolutionized how to represent text. In these models, each word is represented as a high dimensional vector [
It is known that biomedical concepts have a high degree of ambiguity [
We present a detailed summary of the existing bio-concept embeddings in
Repository: the scope of concepts. Corpora: the training collection. Note that for EHR (electronic health records) and Claims (medical claims), the size is the number of patients, whereas for Wikipedia, PubMed (abstracts), and PMC (full-text articles), the size is the number of documents. #Concepts: the number of distinct concepts in the embedding. Method: the method for training embeddings. PCA: principal component analysis. PMI: pointwise mutual information. Intrinsic evaluation: a focus on applications that directly use the similarity between the vectors produced by word embeddings, such as word-pair similarity and relatedness. Extrinsic evaluation: a focus on downstream applications that use word embeddings only as an intermediate component. For example, the last study evaluated the effectiveness of concept embeddings for heart-failure prediction. Availability: whether the studies made the embeddings publicly available (accessed on 04/20/2019).
Study (year) | Repository | Corpora (size) | #Concepts | Method | Intrinsic evaluation | Extrinsic evaluation | Availability
---|---|---|---|---|---|---|---
Vine et al. (2014) [ | UMLS | EHR (<20K) | 52,102 | skip-gram | Concept similarity | N | N
Choi et al. (2016) [ | ICD9CM | EHR (0.55M); Claims (0.85M) | 49,873 | skip-gram | Concept clustering | N | N
Choi et al. (2016) [ | UMLS | EHR (20M); Claims (4M) | 22,705 | skip-gram | Concept clustering | N | Y
Yu et al. (2017) [ | UMLS | PubMed (22M) | 310,403 | cbow | Concept similarity | N | Y
Beam et al. (2018) [ | UMLS | EHR (60M); Claims (20M); PMC (1.7M) | 108,477 | skip-gram; GloVe; PCA | Concept similarity | N | Y
Cai et al. (2018) [ | UMLS | EHR (2M) | 47,873 | cbow | Concept clustering | N | N
Nguyen et al. (2018) [ | UMLS | Wikipedia (5M); PubMed (24M); PMC (3M) | 659,873 | cbow | Concept similarity | N | N
Xiang et al. (2019) [ | UMLS | EHR (50M) | 30,348 | skip-gram; PMI; fastText | Concept clustering | Y | N
Despite these recent efforts, past studies share some limitations. As shown in
Second, almost no studies had evaluated the effectiveness of concept embeddings in extrinsic evaluations. The evaluation of word embeddings can be broadly categorized into two types (i.e., intrinsic and extrinsic) [
Furthermore, and importantly, the existing concept embeddings are designed primarily for concepts and applications in the clinical domain, whereas concept embeddings for the biological domain remain to be developed. As shown in
In response, we propose BioConceptVec, a collection of concept embeddings on primary biological concepts mentioned in the biomedical literature.
To our knowledge, this is the first study to use machine learning-based NER tools to recognize and normalize biological concepts for training bio-concept embeddings. Specifically, we employed PubTator, a state-of-the-art NER system with concept annotations for the entire PubMed abstracts [
We conducted large-scale intrinsic and extrinsic evaluations to quantify the validity and utility of BioConceptVec. The intrinsic evaluations contain ~18 million instances from six datasets. BioConceptVec has significantly higher performance (up to 10% improvement) than the existing concept embeddings and is consistent across multiple datasets. The extrinsic evaluations cover two downstream applications: protein-protein interaction (PPI) prediction, consisting of ~8 million PPIs from the STRING database [
We make all of the embeddings and evaluation datasets publicly available. The embeddings and datasets can be downloaded via
BioConceptVec was trained on the PubMed abstracts, a corpus of ~30 million documents. (1) We employed PubTator, which integrates four NER tools, to annotate and normalize the concepts. (2) We trained four concept embeddings on the normalized corpus. (3) We conducted both intrinsic evaluations, on drug-gene and gene-gene interactions, and extrinsic evaluations, on protein-protein interaction prediction and drug-drug interaction extraction, to assess the effectiveness of BioConceptVec.
We trained concept embeddings on the ~30 million abstracts in the entire PubMed. We followed the preprocessing pipeline from [
We trained concept embeddings on the full collection of PubMed abstracts after concept recognition via PubTator, i.e., identified named entities are replaced with bio-entity types and IDs (e.g., Disease_MESH_D008288) before training. To our knowledge, there is no agreement on which embedding model is the most effective in biomedical domains. For example, Wang et al. [
In general, the methods to train word embeddings can be categorized into two groups: window-based and matrix factorization-based [
As mentioned, fastText represents each word as a set of character n-grams. In the case of bio-concept embeddings, however, each bio-concept should be considered a unit. Thus, when training with fastText, we disabled the n-grams representation for bio-concepts (in contrast, for the words that are not bio-concepts, we still used the default n-grams representation in fastText).
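The concept-normalization step described earlier, in which recognized entity mentions are replaced with type-prefixed concept identifiers (e.g., Disease_MESH_D008288) before training, can be sketched as follows. The offset-based annotation format here is a simplified assumption for illustration, not PubTator's actual output format.

```python
# Sketch: replace annotated entity spans with type-prefixed concept IDs
# so that each bio-concept becomes a single training token.
# Annotation format (start, end, type, concept_id) is a simplified assumption.

def normalize(text, annotations):
    """Replace annotated spans with type_ID tokens, working
    right-to-left so earlier character offsets stay valid."""
    for start, end, etype, cid in sorted(annotations, key=lambda a: -a[0]):
        text = text[:start] + f"{etype}_{cid}" + text[end:]
    return text

abstract = "Obesity is associated with leptin resistance."
annotations = [(0, 7, "Disease", "MESH_D009765"), (27, 33, "Gene", "3952")]
print(normalize(abstract, annotations))
# -> "Disease_MESH_D009765 is associated with Gene_3952 resistance."
```

After this replacement, each concept identifier is treated as an atomic unit by the embedding models (including fastText, for which the character n-gram decomposition is disabled on concept tokens).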
The values of hyperparameters for training embeddings are summarized in
 | Hyperparameter | Default value | Other values
---|---|---|---
Shared hyperparameters | Vector dimension | 200 | 100, 300
 | Window size | 20 | 5, 10
 | Negative samples | 5 | 2, 3
 | Down-sampling threshold | 0.001 | 0.0001, 0.00001
 | Minimal word occurrence | 5 | -
 | Learning rate | 0.025 | -
 | Training epochs | 10 | -
fastText-specific hyperparameters | Minimal character n-gram length | 2 | -
 | Maximum character n-gram length | 3 | -
To directly compare with the existing concept embeddings, we used the exact hyperparameter values from Yu et al. [
Yu et al. [
In addition, we trained and assessed BioConceptVec (cbow) under different hyperparameters, while keeping the same values for minimal word occurrence (so that the embeddings share the same vocabulary), learning rate, and training epochs (so that the embeddings share the same optimization procedure). For each of the other hyperparameters, we selected two representative values that were used in previous studies on embeddings [
Furthermore, different studies show that performance can vary by different embedding methods [
To ensure a fair comparison, the evaluation datasets described below contain only concepts shared among these baseline methods and BioConceptVec. We also measured the coverage of concepts using human genes as an example.
We posit that concept embeddings should give higher similarity to related concepts than to unrelated concepts. The intrinsic evaluations in our study quantify the effectiveness of concept embeddings in terms of identifying related genes. We concentrate on genes because genes are a central focus of biological studies; the interactions between genes (or genes and other biological concepts) are essential for understanding the structures and functions of a cell [
We adopted six datasets for creating evaluation datasets. The detailed statistics of these datasets are summarized in
There are six datasets in total. #groups: the number of groups in a dataset. Each group has a related set and an unrelated set of genes, based on drug-gene interactions provided by CTD or gene sets provided by MSigDB. #distinct concepts: the total number of distinct genes in a dataset. Avg #concepts per group: the average number of genes in a group; note that one gene may appear in multiple groups. #pairs: the total number of pairs in a dataset. Avg #pairs per group: the average number of pairs per group.
Dataset | #groups | #distinct concepts | Avg #concepts per group | #pairs | Avg #pairs per group |
---|---|---|---|---|---|
CTD | 6,383 | 14,654 | 22.39 | 2,146,482 | 358.88 |
MSigDB datasets | |||||
C1 positional gene sets | 326 | 11,709 | 63.30 | 431,254 | 1447.16 |
C2 curated gene sets | 4,762 | 13,783 | 66.21 | 6,171,976 | 1621.21 |
C3 motif gene sets | 836 | 9,553 | 115.63 | 910,722 | 3976.95 |
C4 computational gene sets | 858 | 8,637 | 85.84 | 1,452,542 | 2392.99 |
C5 GO gene sets | 5,917 | 13,627 | 62.71 | 6,697,736 | 1455.08 |
Total | 19,082 | 14,998 | - | 17,810,712 | - |
For the first category, we used the Comparative Toxicogenomics Database (CTD) [
For the second category, we used five gene sets (C1–C5) of MSigDB [
We computed the similarity of a set by averaging the cosine similarity of all of the pairs in the set, using concept embeddings. Cosine similarity is the most popular similarity measure used by embeddings [
We used the similarity score difference between the related set and the unrelated set at the group level as the final evaluation metric. As noted, a more effective concept embedding should yield a greater similarity score difference between the related set and the unrelated set of a group. For computational efficiency, we restricted the maximum number of genes in a set to 100, i.e., a group has at most 200 genes in total. Note that MSigDB has other gene sets, such as C6 and C7. We did not use them because they contain fewer than 100 shared genes. Collectively, our intrinsic evaluation datasets contain over 13,000 genes and over 17 million instances across six datasets.
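The group-level metric described above, the average pairwise cosine similarity of the related set minus that of the unrelated set, can be sketched as follows. The random vectors stand in for the real concept embeddings.

```python
# Sketch of the intrinsic evaluation metric: for one group, average the
# pairwise cosine similarity within the related set and within the
# unrelated set, then report their difference. In practice the vectors
# come from the concept embeddings; random data is used here for illustration.
import numpy as np
from itertools import combinations

def avg_pairwise_cosine(vectors):
    """Mean cosine similarity over all pairs of vectors in a set."""
    sims = [
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in combinations(vectors, 2)
    ]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
related = [rng.standard_normal(200) for _ in range(5)]
unrelated = [rng.standard_normal(200) for _ in range(5)]

# A more effective embedding should make this difference larger.
score = avg_pairwise_cosine(related) - avg_pairwise_cosine(unrelated)
```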
We further evaluated the utility of BioConceptVec in two downstream applications: protein-protein interaction (PPI) prediction on the STRING database [
Analyzing functional interactions between proteins, which facilitates the understanding and characterization of cellular processes, is a routine task in molecular systems biology [
Existing studies have used STRING for training and testing machine learning models for PPI prediction [
We followed this study [
#Concepts: the number of concepts in the dataset. #Training: the number of training instances; the same applies to #Validation and #Testing.
Dataset | #Concepts | #Training | #Validation | #Testing | Total |
---|---|---|---|---|---|
combined-score | 13,802 | 5,245,358 | 582,818 | 2,497,790 | 8,325,966 |
experimental-700 | 13,290 | 24,684 | 2,743 | 11,755 | 39,182 |
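A minimal sketch of embedding-based PPI prediction, assuming the common formulation of concatenating the two protein vectors and training a binary classifier. The classifier choice, the `Gene_*` identifiers, and the synthetic data are illustrative assumptions, not the paper's exact model.

```python
# Sketch: represent a protein pair as the concatenation of its two
# concept vectors and train a binary interaction classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
dim = 200

# Hypothetical embedding lookup: concept id -> vector.
embeddings = {f"Gene_{i}": rng.standard_normal(dim) for i in range(100)}

def pair_features(p1, p2):
    """Concatenate the two concept vectors into one feature vector."""
    return np.concatenate([embeddings[p1], embeddings[p2]])

# Synthetic protein pairs with random binary interaction labels.
pairs = [(f"Gene_{i}", f"Gene_{(i + 1) % 100}") for i in range(100)]
X = np.stack([pair_features(a, b) for a, b in pairs])
y = rng.integers(0, 2, size=len(pairs))

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = clf.predict(X[:5])
```

In the actual evaluation, the STRING pairs are split into the training/validation/testing partitions shown above, and precision, recall, F1, and AUC are computed on the test partition.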
We also examined the usefulness of concept embeddings in a text-mining task. Specifically, we evaluated the performance of concept embeddings on the SemEval 2013 Task 9 DDI extraction corpus [
In this task, the input is a sentence that contains a pair of drugs. If the pair of drugs represents a true DDI, the model needs to output the DDI type; otherwise, the model needs to indicate the pair is not a true DDI [
Mechanism, Effect, Advice, Int are four types of DDIs. Negative means that the instance does not contain a DDI.
Class | #Training | #Testing |
---|---|---|
Mechanism | 1,319 | 302 |
Effect | 1,621 | 360 |
Advice | 826 | 221 |
Int | 188 | 96 |
Negative | 23,772 | 4,737 |
We implemented a simple averaged sentence embedding neural network model (SEN) for DDI classification.
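A minimal sketch of such an averaged sentence-embedding classifier: each sentence is represented as the mean of its token vectors, which feeds a small feed-forward classifier over the DDI classes. The layer size and the toy vocabulary and data are illustrative assumptions; the actual SEN architecture details are not reproduced here.

```python
# Sketch: averaged sentence-embedding classifier (SEN-style) for DDI.
import numpy as np
from sklearn.neural_network import MLPClassifier

dim = 200
rng = np.random.default_rng(0)
# Hypothetical token -> vector lookup (would be BioConceptVec in practice).
vocab = {w: rng.standard_normal(dim)
         for w in ["drug_a", "drug_b", "inhibits", "the", "of"]}

def sentence_vector(tokens):
    """Average the embeddings of known tokens (zeros if none known)."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy sentences for two of the five classes.
sentences = [["drug_a", "inhibits", "drug_b"]] * 20 + [["the", "of"]] * 20
labels = ["mechanism"] * 20 + ["negative"] * 20

X = np.stack([sentence_vector(s) for s in sentences])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X, labels)
pred = clf.predict(X[:1])
```

Concept vectors can be incorporated by averaging them together with the word vectors of the sentence, which is the comparison reported in the results below.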
The number of human genes covered by each embedding is shown individually. In total, these four embeddings cover 18,881 human genes. Note that the embeddings from Beam et al. and Choi et al. were mainly trained on EHR. The results mainly demonstrate that the biomedical literature and EHR contain significantly different concepts.
Notably, the embeddings from Beam et al. and Choi et al. were primarily trained on EHR, and these embeddings are designed mainly for clinical applications. Hence, they cover only a small number of gene and protein concepts. This comparison thus further illustrates that the biomedical literature contains significantly different bio-concepts from clinical notes.
The embeddings were trained using the same default parameters. Direct comparison: the results of baseline embeddings and BioConceptVec trained using cbow.
In
Combined-scores: PPIs that have combined scores are considered positive cases. Experimental-700: PPIs that have experimental scores over 700 are considered positive cases. Direct comparison: the results of embeddings using the same method (cbow) and same hyperparameters. Different embedding methods: the results of BioConceptVec (skip-gram), BioConceptVec (GloVe) and BioConceptVec (fastText). The highest results of each section are marked as bold.
 | Combined-score dataset | | | | Experimental-700 dataset | | | |
---|---|---|---|---|---|---|---|---|
 | Precision | Recall | F1 | AUC | Precision | Recall | F1 | AUC |
BioAvgWord (cbow) | 0.8195 | 0.7935 | 0.8063 | 0.8941 | 0.8851 | 0.7422 | 0.8074 | 0.9123 |
Yu et al. (cbow) | 0.8236 | 0.8017 | 0.8125 | 0.9029 | 0.9130 | 0.7686 | 0.8346 | 0.9283 |
BioConceptVec (cbow) | ||||||||
BioConceptVec (skip-gram) | 0.8279 | 0.8097 | 0.8187 | 0.9074 | 0.8525 | 0.8850 | 0.9522 |
BioConceptVec (GloVe) | 0.8116 | 0.8109 | 0.9004 | 0.8656 | 0.8289 | 0.8468 | 0.9218 |
BioConceptVec (fastText) | 0.8100 | 0.9076 |
SOTA: state-of-the-art. P: Precision. R: Recall. The SOTA results are extracted from [
Model | F1-score on each relation type | | | | Overall performance | | |
---|---|---|---|---|---|---|---|
 | Int | Advice | Effect | Mechanism | P | R | F |
Zhang et al. (SOTA) | 0.8000 | 0.7200 | 0.7400 | 0.7400 | 0.7200 | 0.7300 | |
SEN | 0.3569 | ||||||
SEN + BioAvgWord (cbow) | 0.3150 | 0.7787 | 0.8000 | 0.8824 | 0.7883 | 0.7814 | 0.7731 |
SEN + Yu et al. (cbow) | 0.4285 | 0.8263 | 0.8133 | 0.8559 | 0.7948 | 0.7961 | 0.7916 |
SEN + BioConceptVec (cbow) | |||||||
SEN + BioConceptVec (skip-gram) | 0.4090 | 0.8626 | 0.8025 | 0.7941 | | | |
SEN + BioConceptVec (GloVe) | 0.8100 | 0.8160 | 0.8046 | | | | |
SEN + BioConceptVec (fastText) | 0.4382 | 0.8153 | 0.8200 | 0.8571 | 0.7999 | 0.7998 | 0.7930
We also measured the performance of SEN when adding concept vectors. The direct comparison results show that BioConceptVec outperforms the baseline approaches. Adding BioConceptVec improves the F1-score significantly, and BioConceptVec (cbow) appears to be the most effective in this task. The results of BioConceptVec (cbow) using different hyperparameters are summarized in
We further qualitatively analyzed the errors by comparing the results of the SEN model with and without BioConceptVec. We found that the SEN model failed to classify challenging cases in which the definitions of relation types are somewhat similar. For example, the sentence, “Zidovudine competitively inhibits the intracellular phosphorylation of stavudine,” contains the relation “zidovudine-stavudine.” The annotator classified it as the effect type, but the SEN model wrongly classified it as the mechanism type. According to the annotation guidelines, both effect and mechanism types can describe pharmacological effects. The effect type, however, focuses on the change of the effect, whereas the mechanism type focuses on the underlying reason for the change. For this case, inhibiting the intracellular phosphorylation describes the change rather than the mechanism. There are ~20 similar erroneous cases for which the SEN model mixed the effect type with the mechanism type. Adding BioConceptVec (cbow) to the SEN model correctly classified all of them. This is likely because BioConceptVec provides additional information learnt from the entire PubMed abstracts, making the classification of the two related types easier as a result. Collectively, the results confirm the hypothesis that adding concept representations improves the performance of downstream deep learning models, and suggest that BioConceptVec has the potential to facilitate the development of deep learning models in the biomedical domain.
In this work, we propose BioConceptVec, concept embeddings that focus on primary biological concepts mentioned in the biomedical literature. We employed SOTA biological NER tools and trained four concept embeddings on the full collection of ~30 million PubMed abstracts. We evaluated the effectiveness of BioConceptVec in intrinsic and extrinsic settings, consisting of ~25 million instances in total. The results demonstrate that BioConceptVec consistently achieves the best performance in multiple datasets and in a range of applications. We hope that it can facilitate the development of deep learning models in biomedical research. In the future, we plan to leverage both PubMed abstracts and PMC full-text articles for training BioConceptVec.
This study focused its evaluation on human genes because rich resources are readily available to serve as a gold standard. We plan to evaluate BioConceptVec embeddings on different concept types in the future. Also, the quality of our concept embeddings depends on the accuracy of the NER tools. Improving NER tools such as PubTator would help enhance the quality of BioConceptVec. Finally, in this work, we did not apply retrofitting, a fine-tuning step that further optimizes the embeddings for specific tasks with gold-standard labels. For example, one of the most common retrofitting procedures is to optimize the performance of the generated embeddings on identifying synonyms and acronyms. We did not employ it because such datasets are very limited for biomedical concepts. We plan to develop related datasets and apply the approach to further enhance BioConceptVec.
The authors thank Dr. Alexis Allot and Dr. Robert Leaman for helpful discussions. We also thank Dr. W. John Wilbur for proofreading the manuscript.