CC BY-NC-ND 4.0 · Methods Inf Med 2021; 60(S 01): e56-e64
DOI: 10.1055/s-0041-1731390
Original Article

Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT

Faith Wavinya Mutinda, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki
Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Nara, Japan
Funding This work was supported by a Japan Science and Technology Agency PRISM Grant (Grant No. JPMJCR18Y1).

Abstract

Background Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, few resources exist for STS tasks in the clinical domain and in languages other than English, such as Japanese.

Objective The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating publicly available Japanese datasets.

Materials We created two datasets for Japanese clinical STS: (1) a case report (CR) dataset, built from publicly available Japanese case reports extracted from the CiNii database, and (2) an electronic medical record (EMR) dataset, built from Japanese electronic medical records.

Methods We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular approach to transfer learning and has proven effective at achieving high accuracy even on small datasets. We applied two pretrained Japanese BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while the clinical Japanese BERT is pretrained on Japanese clinical texts.
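To make this setup concrete, below is a minimal sketch of BERT-based STS as sentence-pair regression, using the Hugging Face transformers library. The checkpoint name (cl-tohoku/bert-base-japanese), the example sentences, the 0-to-5 score scale, and the hyperparameters are assumptions for illustration, not the authors' exact configuration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint for a general Japanese BERT (pretrained on Japanese
# Wikipedia); a clinical checkpoint would be swapped in the same way.
# (cl-tohoku/bert-base-japanese needs the fugashi and ipadic packages.)
MODEL_NAME = "cl-tohoku/bert-base-japanese"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 makes the classification head a regression head; transformers
# then trains it with mean-squared-error loss against the gold score.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step on a hypothetical annotated pair. BERT encodes the
# pair jointly as [CLS] sentence_a [SEP] sentence_b [SEP].
batch = tokenizer(
    "患者は発熱を訴えた。",      # "The patient complained of fever."
    "患者に発熱が認められた。",  # "Fever was observed in the patient."
    truncation=True,
    return_tensors="pt",
)
gold_score = torch.tensor([4.5])  # hypothetical human rating on a 0-5 scale
loss = model(**batch, labels=gold_score).loss
loss.backward()
optimizer.step()

# At inference time, the single regression logit is the predicted similarity.
model.eval()
with torch.no_grad():
    predicted = model(**batch).logits.squeeze().item()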

Results The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT, achieving a high correlation with the human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). That the general model outperformed the clinical model on clinical domain datasets was unexpected; a possible explanation is that the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT.
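For reference, the metric behind these numbers is the correlation between model predictions and human-annotated scores; STS work conventionally reports Pearson's r. A short sketch with hypothetical stand-in scores (not the paper's data):

from scipy.stats import pearsonr

human_scores = [0.0, 1.5, 3.0, 4.0, 5.0]  # gold annotations (0-5 scale)
model_scores = [0.2, 1.1, 2.8, 4.3, 4.9]  # model predictions

r, _ = pearsonr(human_scores, model_scores)
print(f"Pearson r = {r:.3f}")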


Publication History

Received: 02 February 2021

Accepted: 18 May 2021

Article published online: 08 July 2021

© 2021. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License (CC BY-NC-ND 4.0), permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed, or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

 