CC BY-NC-ND 4.0 · Appl Clin Inform 2024; 15(02): 357-367
DOI: 10.1055/a-2282-4340
Research Article

Examining the Generalizability of Pretrained De-identification Transformer Models on Narrative Nursing Notes

Fangyi Chen (1), Syed Mohtashim Abbas Bokhari (1), Kenrick Cato (2, 3), Gamze Gürsoy (1), Sarah Rossetti (1, 3)

1 Department of Biomedical Informatics, Columbia University, New York, New York, United States
2 School of Nursing, University of Pennsylvania, Philadelphia, Pennsylvania, United States
3 School of Nursing, Columbia University, New York, New York, United States
Funding This study was supported and funded by the National Institute of Nursing Research (1R01NR016941) and the American Nurses Foundation (ANF) Reimagining Nursing Initiative. The authors are solely responsible for the content of this work, and it does not necessarily reflect the official view of the National Institutes of Health.
 

Abstract

Background Narrative nursing notes are a valuable resource in informatics research with unique predictive signals about patient care. The open sharing of these data, however, is appropriately constrained by rigorous regulations set by the Health Insurance Portability and Accountability Act (HIPAA) for the protection of privacy. Several de-identification models have been developed and evaluated on the open-source i2b2 dataset, but their generalizability to nursing notes remains understudied.

Objectives The study aims to understand the generalizability of pretrained transformer models and investigate the variability of personal protected health information (PHI) distribution patterns between discharge summaries and nursing notes with a goal to inform the future design for model evaluation schema.

Methods Two pretrained transformer models (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries were evaluated on our inpatient nursing notes and compared with the baseline performance. Statistical testing was used to assess differences in PHI distribution between discharge summaries and nursing notes.

Results RoBERTa achieved the optimal performance when tested on an external source of data, with an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. Overall, discharge summaries contained a higher number of PHI instances and categories of PHI compared with inpatient nursing notes.

Conclusion The study investigated the applicability of two pretrained transformers on inpatient nursing notes and examined the distinctions between nursing notes and discharge summaries concerning the utilization of personal PHI. Discharge summaries presented a greater quantity of PHI instances and types when compared with narrative nursing notes, but narrative nursing notes exhibited more diversity in the types of PHI present, with some pertaining to patient's personal life. The insights obtained from the research help improve the design and selection of algorithms, as well as contribute to the development of suitable performance thresholds for PHI.



Background and Significance

The National Institutes of Health promotes the sharing of scientific data to accelerate discovery, enable external validation of research results, increase accessibility to high-value datasets, and promote data reuse and replicability.[1] Meeting these expectations while protecting patient privacy requires a deeper understanding of the types and distributions of identifiable data within different unstructured notes from electronic health records (EHRs).

The widespread adoption of EHRs and advanced health information technologies in the United States has led to the sharing of increasing volumes of complex health data from various sources, and some of these data have become openly available to drive reproducible, open science.[2] Most EHRs contain both structured (well-formatted tables) and unstructured data (free-text, images, etc.) providing valuable information about patients and populations to facilitate knowledge discovery and development of innovative clinical tools.[3] Roughly 80% of the health data stored in EHRs remain unstructured.[4] Compared with structured data from the EHR, these unstructured data are understudied due to methodological challenges in processing and analyzing narrative text that only recent methods have advanced.[5] Sharing of unstructured data (i.e., narrative notes) enables comprehensive analysis of current research findings and methods to ensure reproducibility, reliability, and accountability.[6] Our team has demonstrated the unique value and predictive signals embedded within nursing notes,[7] [8] [9] but nursing notes are vastly understudied compared with physician notes, possibly in part due to the lack of access to curated, de-identified datasets of nursing notes. 
For privacy protection, de-identification of clinical notes per the Health Insurance Portability and Accountability Act (HIPAA) is required prior to making a dataset openly available or directly sharing datasets across different health care institutions and organizations.[10] [11] De-identification involves removing all personal protected health information (PHI) from health records to protect patient privacy while enabling secondary use of the data.[11] Under the HIPAA specification, two methods for de-identifying PHI are suggested: “Expert Determination” (domain experts certify that the risk of any anticipated recipient using the available information to identify an individual is small) and “Safe Harbor,” under which clinical records are considered de-identified when all 18 PHI types ([Table 1]) are removed. Biometric identifiers and photographs were not applicable in our study. Practical and efficient methods to reliably remove all PHI types in notes are currently lacking.

Table 1

Types of PHI defined by HIPAA Safe Harbor Legislation (items 16 and 17 were not applicable in our dataset; for the interest of our study, we only focused on the rest of the PHI categories)

PHI category and examples (if applicable)

1. Names: patient and family member names, health care providers, etc.
2. All geographic subdivisions smaller than a state: city, street address, county, zip code, etc.
3. Dates
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social security numbers
8. Medical record numbers
9. Health plan numbers
10. Account numbers
11. Certificate or license numbers
12. Vehicle identification
13. Device identification or serial numbers
14. Uniform Resource Locators (URLs)
15. Internet Protocol addresses
16. Biometric identifiers
17. Full-face photographs and comparable images
18. Any other unique identifying number, characteristic, or code

Abbreviation: PHI, protected health information.


Automated methods for “Safe Harbor” de-identification are preferred given the large volume of notes. Currently, there are two main approaches to automated de-identification of clinical notes: (1) rule-based/pattern-matching approaches[12] [13] [14] [15] [16] [17] that apply specialized dictionaries and complex hand-crafted regular expression rules, and (2) machine learning (ML) techniques (support vector machines, decision trees, conditional random fields, etc.).[18] [19] [20] [21] [22] ML techniques generally outperform rule-based methods but involve extensive feature engineering and time-consuming annotation.[23] [24] Recently, some studies have tackled this problem with artificial neural networks (long short-term memory architectures, bidirectional transformer models).[23] [25] [26] Other studies[27] [28] [29] have explored hybrid models (combining ML and rule-based methods), reporting high average recall and accuracy (above 90%) but requiring an exhaustive feature extraction process and raising questions about their generalizability to external sources.[30] However, to make data publicly accessible, all PHI must be removed as mandated by law, which in effect demands perfect system performance.

Despite the substantial advances made so far, most research studies in this domain have neglected how PHI characteristics are distributed across note types, which can inform the selection of de-identification algorithms. Clinical notes serve different purposes (e.g., billing, decision-making, care assessment)[31] and exhibit varying content and structure, which implies the need for tailored de-identification algorithms and evaluation criteria. Much of the literature[19] [21] [29] leveraged specific corpora, like i2b2 (Informatics for Integrating Biology and the Bedside) discharge summaries,[32] for training and evaluation but overlooked the diversity of EHR note types. This may hinder insights into algorithm adaptability across different note types or datasets. Understanding the PHI patterns across note types and examining any correlation in the frequency of PHI types within the same note type may help in the selection and refinement of de-identification models, as well as in the development of evaluation schema. In the domain of natural language processing (NLP), pretrained transformer models have emerged as foundational tools, achieving state-of-the-art performance on various NLP tasks.[33] One widely recognized transformer is RoBERTa, which was trained on an extensive text corpus (over 160 GB) and exhibits significant improvement on several benchmarks.[33] [34] Additionally, ClinicalBERT, trained on clinical documentation, stands out as a promising representation of clinical notes.[35] Our study aims to examine the generalizability of pretrained transformer models on inpatient nursing notes and explore the variability of PHI distribution between i2b2 discharge summaries[32] and narrative inpatient nursing notes, with implications for optimizing de-identification processes in nursing notes.



Objectives

In this study, we aim to address the following questions: (1) What is the generalizability of the current pretrained transformer models on different sets of data? (2) How does the pattern distribution of PHI in nursing notes differ from nonnursing clinical notes? (3) How does distribution variability impact our downstream analysis?



Methods

The study entails the following steps: data preprocessing, pretrained model implementation, evaluation, and lastly statistical analysis, shown in [Fig. 1]. The details for each stage are discussed in the subsections below.

Fig. 1 Overview of the study.

Data Source

The main data source consists of a total of 1,334 raw inpatient nursing notes covering acute care and intensive care units within cardiac, surgical, and neuro specialties between October 2021 and April 2022 at an academic medical center in the Northeast United States as part of the CONCERN study.[9] This study was approved by the Institutional Review Board. The raw nursing notes, comprising both structured and unstructured sections, were processed to extract only the narrative components of interest to our study. Details are described in a separate publication from our team.[36] Additionally, we included i2b2 2014 discharge summaries for comparison; access to this dataset can be requested at https://www.i2b2.org/NLP/DataSets/Main.php. The i2b2 dataset was manually de-identified and validated by the i2b2 team, and annotated PHI entities were replaced with realistic surrogates.



Data Preprocessing

One fundamental preprocessing step for NLP tasks is tokenization, which breaks text into subunits for input into transformer models. In this step, the raw narrative texts were split into sentences using a spaCy model for biomedical text processing (“en_core_sci_lg”). Additional custom clinical rules were applied during tokenization to reduce common missing-space errors and to handle common abbreviations in clinical notes (e.g., R.N., Dr., q.n.).
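The custom rules themselves are not listed in the text. As a rough illustration only, the sketch below uses plain regular expressions (not spaCy, and not the study's actual rules) to show the two ideas mentioned above: restoring a dropped space, and avoiding a sentence break after a known abbreviation. The abbreviation list, the function names, and the age-suffix pattern are hypothetical.

```python
import re

# Hypothetical abbreviation list; the rules actually used in the study are not given here.
ABBREVIATIONS = {"r.n.", "dr.", "q.n."}


def repair_missing_spaces(text: str) -> str:
    """Insert spaces that are commonly dropped in free-text notes."""
    # "bedside.Family" -> "bedside. Family" (lowercase letter, period, uppercase letter)
    text = re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)
    # "65yo" -> "65 yo" (number glued to an age suffix)
    text = re.sub(r"(?<=\d)(?=(?:yo|y/o|yrs?)\b)", " ", text)
    return text


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? followed by whitespace, unless the token
    ending at that punctuation mark is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # not a real sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Running `split_sentences(repair_missing_spaces(note))` on a note keeps "Dr. Smith" and "R.N. Jones" inside their sentences while still separating sentences that were fused by a missing space.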



Model Implementation

In this study, we deployed two pretrained transformers (RoBERTa,[34] ClinicalBERT[35]) fine-tuned for de-identification tasks using i2b2 2014 discharge summaries.[32] The fine-tuned models were developed by Kailas et al and can be downloaded from Hugging Face: https://huggingface.co/obi. We followed their training instructions[37] and retrained the base models on the i2b2 2014 training set.[32] The training details are summarized in [Table 2]. The trained models classify tokens into PHI entities. The steps and additional information on model implementation are shown in [Fig. 2].

Fig. 2 Illustration of the de-identification process. The raw nursing note is split into sentences and tokenized, as described in the “Data Preprocessing” section. The input sequence contains a sentence with a maximum length of 128 tokens plus 32 tokens on each side of the sentence for contextual enrichment, and is passed into the pretrained transformers (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries.[32] RoBERTa and ClinicalBERT are both variants of the BERT (Bidirectional Encoder Representations from Transformers) model.[33] [34] [35] RoBERTa (Robustly Optimized BERT Pretraining Approach)[34] was trained on much larger datasets (over 160 GB of uncompressed text found on Web sites) than the BERT base model, and ClinicalBERT[35] was trained on MIMIC-III clinical texts. The two models were fine-tuned on i2b2 discharge summaries for the de-identification task, which is treated as named entity recognition. The de-identification models output a PHI label for each token, where the label “O” represents non-PHI tokens. PHI, protected health information.

Table 2

Training hyperparameters for RoBERTa and ClinicalBERT models

| Hyperparameter | RoBERTa | ClinicalBERT |
|---|---|---|
| Input length | 128 tokens | 128 tokens |
| Batch size | 32 | 32 |
| Optimizer | AdamW | AdamW |
| Learning rate | 4e-5 | 5e-5 |
| Dropout | 0.1 | 0.1 |
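
The input construction described in the Fig. 2 caption (a sentence of at most 128 tokens, with 32 tokens of context on each side) can be sketched as follows. This is a simplified illustration that works on already-tokenized sentences rather than on the models' subword tokens; the function name, the dictionary layout, and the choice to cut overlong sentences into consecutive chunks are assumptions, not details taken from the released code.

```python
def build_inputs(sentences, max_sentence_tokens=128, context_tokens=32):
    """Return one entry per sentence chunk: left context, chunk, right context.

    `sentences` is a list of token lists, one per sentence, in note order.
    """
    # Flatten the note while remembering where each sentence starts and ends.
    flat, spans = [], []
    for sent in sentences:
        spans.append((len(flat), len(flat) + len(sent)))
        flat.extend(sent)

    inputs = []
    for start, end in spans:
        # Assumption: sentences longer than the limit are cut into consecutive chunks.
        for chunk_start in range(start, end, max_sentence_tokens):
            chunk_end = min(chunk_start + max_sentence_tokens, end)
            inputs.append({
                "left": flat[max(0, chunk_start - context_tokens):chunk_start],
                "sentence": flat[chunk_start:chunk_end],
                "right": flat[chunk_end:chunk_end + context_tokens],
            })
    return inputs
```

Context is taken from neighboring sentences of the same note, so the first sentence has no left context and the last has no right context.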



Evaluation and Statistical Analysis

Evaluations were performed on the i2b2 2014 testing set[32] and our inpatient nursing notes. We manually reviewed the entire set of inpatient nursing notes alongside the model outputs. The predicted PHI labels were grouped into the following 10 categories: (1) Date, (2) Hospital Name, (3) Location (including all addresses), (4) Patient or Care Partner Name, (5) Staff Name, (6) Identification Number, (7) Phone Number, (8) Age, (9) Organization Name, and (10) Email Address. To evaluate the models under the same standard, we annotated our dataset following the i2b2 2014 annotation guideline, which is consistent with HIPAA standards and adds specifications for the Age and Date categories. The Date category covers all calendar dates, years, holidays, and days of the week, including seasons (“Fall'02”). The Age category includes any ages mentioned, whereas HIPAA requires only ages 90 and above to be annotated. The Hospital Name category pertains to health care organizations. The Location category refers to any addresses or components of an address (street, zip code). The Patient Name category refers to either patients' names or their family members' names, while the Staff Name category covers all names of hospital employees. The Identification Number category includes all medical record numbers, Social Security Numbers, provider numbers, etc. The Organization Name category encompasses all non-health care delivery organizations (e.g., a patient's employer). The remaining PHI categories are self-explanatory and follow HIPAA's description.

Each model-identified PHI token was labeled as TP (true positive if correctly categorized), FP (false positive if misidentified or misclassified), or FN (false negative for a missed PHI token). We computed the performance metrics (precision, recall, and F1 score) at the token level to evaluate the generalizability of the pretrained de-identification models. In this task, it is inconsequential whether PHI was detected as separate tokens or combined. For example, the tokens “James” and “Smith” are equivalent to the entity “James Smith” if their PHI labels are identical. A one-sided Mann–Whitney test was applied to compare the distribution of PHI in discharge summaries with that in inpatient nursing notes. To mitigate sample size bias, we randomly sampled inpatient nursing notes to match the size of the i2b2 discharge summary testing set. Lastly, we conducted an error analysis on the model with the highest F1 score, in which error cases were categorized based on shared characteristics and patterns.
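
The token-level counting rule described above can be written out as a short sketch. It follows the stated definitions (a wrong-category prediction counts as an FP for the predicted category, and only PHI tokens that receive no PHI label count as FNs), but the function names, label strings, and the handling of empty denominators are assumptions made for illustration.

```python
from collections import Counter


def token_level_scores(gold, pred, outside="O"):
    """Per-category precision, recall, and F1 from aligned token labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if p == outside:
            if g != outside:
                fn[g] += 1      # PHI token missed entirely
        elif p == g:
            tp[p] += 1          # PHI token with the correct category
        else:
            fp[p] += 1          # non-PHI tagged as PHI, or wrong category

    scores = {}
    for cat in sorted(set(tp) | set(fp) | set(fn)):
        predicted, actual = tp[cat] + fp[cat], tp[cat] + fn[cat]
        precision = tp[cat] / predicted if predicted else 0.0
        recall = tp[cat] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[cat] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-category F1 scores."""
    return sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0
```

Under this rule a phone number labeled as an ID lowers the precision of the ID category without lowering the recall of the phone category, which is one reason per-category precision and recall can diverge sharply.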



Results

Data Description

The raw dataset consisted of 1,334 raw nursing notes with a total of 711,829 whitespace-separated tokens. The median number of PHI instances in a nursing note was 1, with an interquartile range (IQR) of 3. Of these 1,334 nursing notes, 38.8% (n = 519) did not contain any of the PHI specified under HIPAA standards. Among the remaining notes, the median number of PHI instances was 2 (IQR: 4). A total of 3,336 PHI instances were found, which were grouped into 10 categories (see “Methods” section).



Evaluation

The macro-averaged F1 score across PHI categories was calculated and recorded in [Table 3]. For the PHI binary task, which classifies tokens as either PHI or non-PHI, the RoBERTa model achieved an F1 score of 0.932, approximately 4.6 percentage points lower than the baseline model evaluated on the i2b2 test set. In our dataset, there were 160 instances misclassified as PHI by the RoBERTa model, and the optimal model (RoBERTa) failed to capture 57 PHI instances. Additionally, in the PHI token classification task, the best F1 score was 0.887 measured across all PHI categories, whereas on the i2b2 test set the model achieved an F1 score of 0.956. The total number of error cases found in our nursing notes was 542, of which 89.5% were either misclassification of PHI into the wrong category or capture of non-PHI tokens. In general, the RoBERTa model outperformed the ClinicalBERT model on our dataset across most of the PHI categories. The precision and recall were generally higher across all datasets (i2b2, nursing notes) for the RoBERTa model ([Table 4]).

Table 3

Trained RoBERTa and ClinicalBERT models evaluated on the i2b2 test dataset and inpatient nursing notes under the F1-measure

| Model (dataset) | F1-measure in PHI binary task (PHI vs. non-PHI) | Macro-averaged F1-measure over all PHI categories |
|---|---|---|
| RoBERTa (i2b2 test) | 0.978 | 0.956 |
| RoBERTa (inpatient nursing notes) | 0.932 | 0.887 |
| ClinicalBERT (i2b2 test) | 0.963 | 0.820 |
| ClinicalBERT (inpatient nursing notes) | 0.834 | 0.615 |

Abbreviation: PHI, protected health information.


Table 4

Precision and recall across PHI categories evaluated in inpatient nursing notes

| PHI category | Model | Precision | Recall |
|---|---|---|---|
| Patient or care partner name | RoBERTa | **0.902** | **0.997** |
| | ClinicalBERT | 0.766 | 0.957 |
| Staff name | RoBERTa | **0.951** | **0.973** |
| | ClinicalBERT | 0.812 | 0.905 |
| Age | RoBERTa | **0.998** | 0.928 |
| | ClinicalBERT | 0.964 | **0.949** |
| Date | RoBERTa | **0.992** | **0.997** |
| | ClinicalBERT | 0.964 | 0.967 |
| Phone number | RoBERTa | **0.922** | **0.941** |
| | ClinicalBERT | 0.367 | 0.333 |
| Email address | RoBERTa | **1.0** | **1.0** |
| | ClinicalBERT | 0.0 | 0.0 |
| Location | RoBERTa | **0.903** | **0.984** |
| | ClinicalBERT | 0.708 | 0.903 |
| Hospital[a] name | RoBERTa | **0.666** | **0.981** |
| | ClinicalBERT | 0.323 | 0.824 |
| Identification number | RoBERTa | **0.118** | **0.966** |
| | ClinicalBERT | 0.105 | 0.917 |
| Organization[b] name | RoBERTa | **0.785** | **0.944** |
| | ClinicalBERT | 0.323 | 0.759 |

Abbreviation: PHI, protected health information.


Note: Higher precision and recall are highlighted in bold.


a All health care delivery organizations.


b Non-health care organizations.




Statistical Analysis

The PHI distribution among i2b2 discharge summaries and multi-site nursing notes is summarized in [Table 5], which lists the count for each PHI category along with its share of all PHI in the respective dataset. The dominant PHI category in both i2b2 discharge summaries and inpatient nursing notes was Date, followed by the names of health care providers or hospital staff. Email Address had the lowest number of occurrences in both data sources. The proportions of occurrences in the Location and Hospital categories were similar across both datasets. The table suggests different levels of PHI exposure across different types of clinical documentation.

Table 5

Summary of PHI entities across datasets

| PHI category | i2b2 discharge summaries[32] (n = 514) | Inpatient nursing notes (n = 1,334) |
|---|---|---|
| All PHI | 11,283 | 3,336 |
| Names: patient, family member names | 879 (7.79%) | 424 (12.71%) |
| Names: HCPs or any hospitals/organization staff names | 2,004 (17.76%) | 476 (14.27%) |
| Ages (all ages being mentioned) | 764 (6.77%) | 459 (13.76%) |
| Date (all calendar dates, years, holiday, etc.) | 4,980 (44.14%) | 890 (26.68%) |
| Contact: phone/fax numbers | 217 (1.92%) | 347 (10.40%) |
| Contact: email | 1 (0.01%) | 7 (0.2%) |
| Locations (street, city, state, zip code, etc.) | 856 (7.59%) | 329 (9.86%) |
| Hospitals (hospital names/abbreviations, pharmacy names, etc.) | 875 (7.76%) | 277 (8.30%) |
| IDs (SSNs, driver licenses, MRNs, translators/nurses/doctors IDs) | 625 (5.54%) | 40 (1.20%) |
| Organizations (non-health care organizations) | 82 (0.73%) | 87 (2.61%) |

Abbreviations: HCP, health care provider; PHI, protected health information.


The frequency correlation between PHI groups is visualized in [Fig. 3], where a darker color indicates a stronger correlation between variables. There appear to be some strong pairwise correlations (>0.5) between the frequency of patients' names or their care partners' (e.g., friend, family) names and phone/fax numbers within our dataset. Therefore, when a patient's or care partner's name is detected in a note, there is a good chance that a phone number appears in the same note. In contrast, no similar frequency correlation was observed in the i2b2 discharge summaries.

Fig. 3 Correlation frequency pair plot for PHI types. PHI, protected health information.
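
The text does not state which correlation coefficient underlies Fig. 3. Assuming a Pearson correlation over per-note PHI counts, a pairwise frequency correlation could be computed along the following lines; the function names, the category labels, and the input layout (one dictionary of counts per note) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a category never varies
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(counts_per_note, categories):
    """Map each (category, category) pair to the correlation of its counts.

    `counts_per_note` is a list of dicts: category -> count in that note.
    Categories absent from a note are counted as zero.
    """
    columns = {c: [note.get(c, 0) for note in counts_per_note] for c in categories}
    return {(a, b): pearson(columns[a], columns[b]) for a in categories for b in categories}
```

A value above 0.5 for the pair (patient or care partner name, phone number) would correspond to the pattern reported for the nursing notes.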

The pattern visualization and statistical summary of PHI distributions in the i2b2 corpus and inpatient nursing notes are shown in [Fig. 4]. The distribution heatmaps (A, B) highlight a consistently higher sparsity in the types of PHI found in inpatient nursing notes compared with discharge summaries. In the i2b2 corpus, each note contains a substantial number of PHI instances (median: 18, IQR: 14) and PHI types (median: 6, IQR: 1), encompassing at least two distinct PHI categories (e.g., Date, Staff Name), whereas inpatient nursing notes contained few PHI instances (median: 1, IQR: 3) and a limited number of PHI types (median: 1, IQR: 3). The results of the one-sided Mann–Whitney test for each PHI category are shown in [Table 6]. Overall, i2b2 2014 discharge summaries contained a significantly greater number of PHI instances than inpatient nursing notes in 9 out of 10 PHI categories. No statistically significant difference was found in the Email category.

Table 6

One-sided statistical testing result, displaying the average counts per note in i2b2 2014 discharge summaries (n = 514) and randomly sampled inpatient nursing notes (n = 514)

| PHI category | i2b2 2014[32] | Inpatient nursing notes | p-Value |
|---|---|---|---|
| Date | 9.689 | 0.667 | p < 0.001[a] |
| Hospital name | 1.702 | 0.208 | p < 0.001[a] |
| Location | 1.665 | 0.247 | p < 0.001[a] |
| Patient or care partner names | 1.71 | 0.319 | p < 0.001[a] |
| Staff name | 3.90 | 0.357 | p < 0.001[a] |
| Identification number | 1.216 | 0.030 | p < 0.001[a] |
| Phone number | 0.422 | 0.260 | p < 0.001[a] |
| Age | 1.487 | 0.344 | p < 0.001[a] |
| Organization name | 0.160 | 0.065 | p < 0.001[a] |
| Email address | 0.005 | 0.0019 | 0.5 |

Abbreviation: PHI, protected health information.


a The p-value is below the significance level (0.01), indicating a statistically significant difference.
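
For readers unfamiliar with the test, the sketch below shows how a one-sided Mann–Whitney comparison of per-note PHI counts can be computed. It is a self-contained illustration using midranks for ties and the large-sample normal approximation with tie and continuity corrections; it is not the authors' analysis code, and in practice a library routine such as SciPy's `mannwhitneyu` with `alternative="greater"` would normally be preferred. The function name is hypothetical.

```python
import math


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of H1: values in x tend to exceed values in y.

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2       # average of ranks i+1 .. j
        t = j - i                        # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0                    # all values identical: no evidence either way
    z = (u - mean_u - 0.5) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability
    return u, p_value
```

Here `x` would hold the per-note counts of one PHI category in the discharge summaries and `y` the counts in the equally sized sample of nursing notes, with the test repeated for each category.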


Fig. 4 Pattern comparison of PHI distribution in discharge summaries and inpatient nursing notes. Visualization of PHI distribution for discharge summaries (A) and inpatient nursing notes (B). The plots (C and D) displayed the median (displayed as circle) and the ranges (min, max) of PHI counts for discharge summaries (C) and inpatient nursing notes (D). PHI, protected health information.


Qualitative Error Analysis

To further understand the cases that the model (RoBERTa) failed to classify correctly, we grouped the error cases by common characteristics (see [Table 7]). The total number of error cases was 542 in the 1,334 inpatient nursing notes, of which around 89% were either labeling of non-PHI tokens as PHI or misclassification of PHI tokens into incorrect categories. Among false-negative errors, ages were the most common PHI that the model failed to capture (n = 33), which was attributed to missing spaces between tokens or tokenization errors. Among the age error cases, two age entities were above 89. The primary factors contributing to missed PHI were parsing and tokenization errors. As for false-positive cases, approximately 33% of them (n = 160) were non-PHI tokens mistagged because the model failed to recognize clinical domain terms or common word usage within this medical context. For instance, the term “G2P1002” in the sentence “the patient G2P1002 (2 pregnancy, 1 full term birth, 2 kids at home = twins) with recent admission for possible ectopic pregnancy” was mistagged as an ID. However, this clinical term refers to the pregnancy and birth history and the number of current children. The model also showed a strong tendency to misclassify PHI tokens into incorrect categories (n = 325, 60.0% of all errors), especially labeling phone/fax numbers as IDs (n = 212, 39.1%).

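Table 7

Error analysis of the optimal model in inpatient narrative nursing notes

| Errors | Category | Potential reason | Number of cases |
|---|---|---|---|
| False negative (n = 57) | Age | Missing space/tokenization errors | 33 |
| | Hospital | Hospital names' abbreviations not recognized | 4 |
| | Hospital | Undetermined[a] | 2 |
| | Phone/fax numbers | Parsing errors | 4 |
| | Phone/fax numbers | Undetermined[a] | 2 |
| | Staff | Undetermined[a] | 4 |
| | Staff | Punctuations/tokenization errors | 1 |
| | Other PHI (location, date, organization, ID) | | 7 |
| False positive (n = 485) | Tagging non-PHI tokens as PHI | Clinical terms/acronyms not recognized | 136 |
| | Tagging non-PHI tokens as PHI | Undetermined[a] | 24 |
| | Classify PHI tokens to incorrect PHI category | Phone/fax numbers and IDs confusion | 212 |
| | Classify PHI tokens to incorrect PHI category | Misclassifying care partners of patients to staff | 35 |
| | Classify PHI tokens to incorrect PHI category | Misclassifying organizations into hospital PHI group | 32 |
| | Classify PHI tokens to incorrect PHI category | Staff names and hospital names confusion | 12 |
| | Classify PHI tokens to incorrect PHI category | Other scenarios | 34 |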

Abbreviation: PHI, protected health information.


a No explicit reason why the algorithm misclassifies entities.
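
As a quick consistency check, the percentages quoted in the error analysis can be recomputed from the counts in Table 7. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# Counts copied from Table 7.
false_negatives = {"age": 33, "hospital": 4 + 2, "phone_fax": 4 + 2, "staff": 4 + 1, "other": 7}
non_phi_tagged = {"clinical_terms": 136, "undetermined": 24}
wrong_category = {
    "phone_fax_vs_id": 212,
    "care_partner_as_staff": 35,
    "organization_as_hospital": 32,
    "staff_vs_hospital": 12,
    "other": 34,
}

fn_total = sum(false_negatives.values())            # 57
non_phi_total = sum(non_phi_tagged.values())        # 160
wrong_total = sum(wrong_category.values())          # 325
fp_total = non_phi_total + wrong_total              # 485
all_errors = fn_total + fp_total                    # 542


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_false_positive = pct(fp_total, all_errors)                   # 89.5
share_age_of_fn = pct(false_negatives["age"], fn_total)            # 57.9
share_non_phi_of_fp = pct(non_phi_total, fp_total)                 # 33.0
share_phone_id = pct(wrong_category["phone_fax_vs_id"], all_errors)  # 39.1
share_wrong_category = pct(wrong_total, all_errors)                # 60.0
```

The false-negative rows sum to 57 and the false-positive rows to 485, matching the totals reported in the table.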




Discussion

This study leveraged two pretrained transformer models fine-tuned on i2b2 discharge summaries to evaluate their generalizability to narrative inpatient nursing notes and to examine the disparity of PHI distribution between discharge summaries and nursing notes. As shown, the optimal model achieved an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. The error analysis showed that the algorithm failed to recognize 57 PHI instances. Of these 57 false-negative cases, age PHI accounted for 57.9% (n = 33). Improving spelling and token spacing correction could potentially enhance PHI detection, particularly for patient ages. However, it is important to note that under HIPAA regulation only ages above 89 are considered PHI, and within the age error cases (n = 33) only two instances met this criterion, so the evaluation outcomes may differ if only ages above 89 are considered PHI.

Additionally, we observed a large percentage of errors in labeling PHI in the wrong category, such as predicting phone numbers as IDs, or vice versa. Despite these errors, in such cases the overall message or main idea expressed by the data may still be preserved and remain accurate. For instance, the original text could be “the patient 123-123-1234 was diagnosed with diabetes” and the de-identified text “the patient <ID> was diagnosed with diabetes,” which does not alter the central semantic meaning of the sentence even though the phone number was mislabeled as an ID. Another major issue of the model is its failure to comprehend clinical terms and/or acronyms, suggesting that relevant medical resources such as the Unified Medical Language System could be integrated into the process. In a practical application, while striking a balance between precision and recall, we should prioritize minimizing the risk of exposure to PHI.

Another objective of this study was to examine the PHI distribution variability between discharge summaries and narrative nursing notes. Few studies[38] to our knowledge have investigated the generalizability of a trained de-identification algorithm on inpatient narrative nursing notes. We grouped PHI into 10 categories as outlined in the “Methods” section. We demonstrated that while nursing notes and discharge summaries contain similar PHI types (patient names, addresses, contact information, etc.), discharge summaries contain a significantly higher number of PHI instances per note than narrative nursing notes. Inpatient nursing notes offer more concise, daily patient updates, facilitating professional communication, whereas discharge summaries tend to follow a standardized format, ensuring clarity for nonprofessionals.[39] Thus, there might be some standards regarding which PHI data must be included in discharge summaries and where we may expect to see them within the note, such as discharge/visit dates at the top and doctors' names signed at the end to confirm the discharge. In contrast, PHI in narrative nursing notes was found to have greater variability and to relate to a broader set of factors about the patient's personal life. Additionally, compared with the reported results from MIMIC II (Multiparameter Intelligent Monitoring in Intensive Care II) nursing progress notes,[13] we also observed differences in some PHI distributions (e.g., location, IDs), suggesting that nursing notes vary across clinical units and institutions and that more representative samples are required when training de-identification algorithms. The variability observed in PHI distribution among different data sources and sites has important implications for designing and evaluating automated de-identification systems for nursing notes, a topic that currently remains understudied.

One state-of-the-art de-identification algorithm, Philter (Protected Health Information filter), leveraged overlapping pipelines of multiple methods, including pattern matching, statistical modeling, and blocked lists, among others.[34] This approach achieved an impressive recall of 99.92% on i2b2 2014 discharge summaries but at some cost to precision (78.58%). Another study utilized XLNet (transformer-XL model) fine-tuned on i2b2 discharge summaries and reported an F1 score of 0.96. To publish clinical notes, it is mandated by law that the dataset be certified as free from any information that could reveal an individual's identity, which essentially requires perfect performance from a fully automated de-identification system. Yet even architecturally optimized models, such as the transformer models used in de-identification, rarely achieve perfect performance when transferred to different data sources.

On March 14, 2023, OpenAI released GPT-4, a significant advancement in large language models.[40] Although models like RoBERTa and ClinicalBERT can attain accuracy above 90%, they typically demand intensive coding effort and significant time. In contrast, GPT provides a rapid and more accessible approach.[41] One research team achieved 99% accuracy using GPT on the i2b2 dataset with optimized prompt engineering.[41] However, there are some real-world constraints on the use of GPT for de-identification of clinical notes. Currently, only synthetic medical data can be passed into GPT due to privacy concerns, and the lack of sufficient synthetic nursing notes also impedes the evaluation of GPT on the de-identification task. It is critical to prioritize security and privacy in handling medical data for GPT, underlining the need for the development of relevant mechanisms and regulations.

Alternatively, a more applicable and efficient approach could involve adding simple processes (pattern searching) to ensure complete de-identification before sharing a dataset. Another approach that could be layered on top of an automated de-identification system is the Hiding In Plain Sight (HIPS)[42] approach, which addresses the residual identifier problem. Conventional de-identification algorithms replace PHI with entity placeholders, while HIPS replaces the detected identifiers with realistic but synthetic surrogates, making any residual identifiers difficult to distinguish from the surrogates.[42] [43] The HIPS technique could be leveraged as a safety net to effectively reduce the exposure of PHI. Fully automated algorithms for de-identification require substantial resources, but semi-automated methods (de-identification ML models followed by minor human inspection) combined with the HIPS technique offer an effective and feasible alternative.
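The two safety nets described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study or in the HIPS papers: the regular expressions, the surrogate formats, and the names `RESIDUAL_PATTERNS`, `find_residual_phi`, `surrogate_for`, and `hide_in_plain_sight` are all hypothetical. In a real pipeline the spans to replace would come mainly from the upstream de-identification model; here the pattern hits stand in for them.

```python
import random
import re

# Hypothetical patterns for illustration; a production pattern search would
# need a broader, validated set covering every HIPAA identifier type.
RESIDUAL_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def find_residual_phi(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in RESIDUAL_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def surrogate_for(label, rng):
    """Generate a realistic-looking but synthetic value for a PHI label."""
    if label == "PHONE":
        return f"555-{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"
    if label == "EMAIL":
        return f"user{rng.randint(1000, 9999)}@example.org"
    if label == "DATE":
        return f"{rng.randint(1, 12)}/{rng.randint(1, 28)}/{rng.randint(2000, 2020)}"
    return f"[{label}]"  # fall back to a conventional placeholder


def hide_in_plain_sight(text, seed=0):
    """Replace each detected span with a surrogate instead of a placeholder."""
    rng = random.Random(seed)
    pieces, cursor = [], 0
    for label, start, end, _ in find_residual_phi(text):
        if start < cursor:  # skip a span that overlaps one already replaced
            continue
        pieces.append(text[cursor:start])
        pieces.append(surrogate_for(label, rng))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Synthetic example sentence; it does not come from any real note.
note = "Daughter can be reached at 212-555-0147 or jane.doe@mail.com; f/u on 3/14/2023."
print(find_residual_phi(note))
print(hide_in_plain_sight(note))
```

Because the output contains plausible surrogates rather than bracketed placeholders, an identifier that the detector misses is harder for a reader to single out as real, which is the intuition behind HIPS.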

Our study has several limitations. The data included in our study are from academic medical centers, possibly limiting the generalizability of our findings to nonacademic settings. Further exploration of PHI distribution across diverse nursing notes and clinical units is warranted. Despite rigorous de-identification following the HIPAA regulations, there remains a 0.04% risk of reidentification using basic demographics (gender, race).[44] Thus, in future studies, it is imperative to reassess privacy protection standards to safeguard individuals' identities.



Conclusion

To conclude, we evaluated two de-identification transformers (RoBERTa, ClinicalBERT) on nursing notes and compared the PHI distribution in nursing notes with that in discharge summaries. Discharge summaries contained more PHI instances and types, while narrative nursing notes exhibited high variability in the types of PHI present. Openly sharing datasets is an important part of open science, yet techniques for effective removal of all PHI in nursing notes need further exploration. Understanding the PHI distribution will help to select the appropriate algorithm(s) and develop a customized evaluation schema for PHI.



Clinical Relevance Statement

The study focused on evaluating the generalizability of state-of-the-art models on a different clinical data source and examining the PHI distribution patterns in discharge summaries and inpatient nursing notes. The knowledge gained from this study can enhance both the design and selection of algorithms and offer insights toward developing an appropriate evaluation schema for PHI.



Multiple Choice Questions

  1. According to HIPAA regulation, which of the following data elements can be retained in the raw text during the de-identification process?

    a. Patient's residential address including specific street name, zip code, and country

    b. Patient's first and last name

    c. Patient's contact email

    d. Patient's description of a common medical condition

    Correct answer: The correct answer is option d. Only data elements that do not entail direct identifiers or any information that could identify specific individuals can be retained in their original form. Choices a, b, and c are directly linked to an individual's identity. As part of the regulation, such entities (residential addresses, names, and contact information) are either removed from the text or replaced by the corresponding surrogates. Choice d referring to the description of a common medical condition does not disclose any specific individual or small group of people.

  2. What is the primary purpose of clinical data de-identification?

    a. To optimize the clinical workflows

    b. To improve the accuracy of disease diagnoses

    c. To protect patient privacy

    d. To reduce data size for storage

    Correct answer: The correct answer is option c. The primary goal of clinical data de-identification is the protection of patients' health care data ensuring data security and privacy. It is achieved by removing or anonymizing all protected health information elements. Only choice c is consistent with the primary purpose of clinical data de-identification.

  3. Which of the following statements is true about clinical data that has been de-identified automatically by a system with the previously reported accuracy of 99%?

    a. It can only be shared directly with some affiliated organizations without any privacy consideration.

    b. It requires further human validation and examination of any re-identification risks.

    c. It is exempt from data protection regulations.

    d. It can be released publicly with no restrictions.

    Correct answer: The correct answer is option b. Even if the data have been de-identified by a robust system, we still need to perform validation to ensure the removal of all PHI in the data before it can be shared with any organizations. Additionally, we need to check for pieces of information left that can be combined and potentially be used to re-identify individuals.



Conflict of Interest

None declared.

Protection of Human and Animal Subjects

The study was approved by institutional review boards.


References

  • 1 National Institutes of Health. Final NIH policy for data management and sharing. Accessed at https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
  • 2 Adler-Milstein J, Holmgren AJ, Kralovec P, Worzala C, Searcy T, Patel V. Electronic health record adoption in US hospitals: the emergence of a digital “advanced use” divide. J Am Med Inform Assoc 2017; 24 (06) 1142-1148
  • 3 Kong HJ. Managing unstructured big data in healthcare system. Healthc Inform Res 2019; 25 (01) 1-2
  • 4 HIT Consultant. Why unstructured data holds the key to intelligent healthcare systems [Internet]. Atlanta, GA: HIT Consultant. 2015. Accessed March 7, 2024 at: https://hitconsultant.net/2015/03/31/tapping-unstructured-data-healthcares-biggest-hurdle-realized/
  • 5 Tayefi M, Ngo P, Chomutare T. et al. Challenges and opportunities beyond structured data in analysis of electronic health records. Wiley Interdiscip Rev Comput Stat 2021; 13 (06) e1549
  • 6 Schwalbe N, Wahl B, Song J, Lehtimaki S. Data sharing and global public health: defining what we mean by data. Front Digit Health 2020; 2: 612339
  • 7 Kang MJ, Dykes PC, Korach TZ. et al. Identifying nurses' concern concepts about patient deterioration using a standard nursing terminology. Int J Med Inform 2020; 133: 104016
  • 8 Korach ZT, Yang J, Rossetti SC. et al. Mining clinical phrases from nursing notes to discover risk factors of patient deterioration. Int J Med Inform 2020; 135: 104053
  • 9 Rossetti SC, Knaplund C, Albers D. et al. Healthcare process modeling to phenotype clinician behaviors for exploiting the signal gain of clinical expertise (HPM-ExpertSignals): development and evaluation of a conceptual framework. J Am Med Inform Assoc 2021; 28 (06) 1242-1251
  • 10 Standards for privacy of individually identifiable health information final rule. 67. Federal Register 2002: 53181-53273
  • 11 Act A. Health insurance portability and accountability act of 1996. Public Law 1996; 104: 191
  • 12 Friedlin FJ, McDonald CJ. A software tool for removing patient identifying information from clinical documents. J Am Med Inform Assoc 2008; 15 (05) 601-610
  • 13 Neamatullah I, Douglass MM, Lehman LW. et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak 2008; 8 (01) 1-7
  • 14 Beckwith BA, Mahaadevan R, Balis UJ, Kuo F. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med Inform Decis Mak 2006; 6: 12
  • 15 Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol 2004; 121 (02) 176-186
  • 16 Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. Proc AMIA Symp 2000; 729-733
  • 17 Norgeot B, Muenzen K, Peterson TA. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. NPJ Digit Med 2020; 3 (01) 57
  • 18 Szarvas G, Farkas R, Kocsor A. A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms. In: Discovery Science: 9th International Conference, DS 2006, Barcelona, Spain, October 7–10, 2006. Berlin Heidelberg: Springer; 2006: 267-278
  • 19 Aramaki E, Imai T, Miyo K, Ohe K. Automatic deidentification by using sentence features and label consistency. In: Proceedings i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data; November 10, 2006; Washington, DC: i2b2: 10-11
  • 20 Wellner B, Huyck M, Mardis S. et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc 2007; 14 (05) 564-573
  • 21 Guo Y, Gaizauskas R, Roberts I, Demetriou G, Hepple M. Identifying personal health information using support vector machines. In: Proceedings i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, 2006. November 10:10–11
  • 22 Gardner J, Xiong L. An integrated framework for de-identifying unstructured medical data. Data Knowl Eng 2009; 68 (12) 1441-1451
  • 23 Hartman T, Howell MD, Dean J. et al. Customization scenarios for de-identification of clinical notes. BMC Med Inform Decis Mak 2020; 20 (01) 1-9
  • 24 Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc 2007; 14 (05) 550-563
  • 25 Johnson AE, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. In: Proceedings of the ACM Conference on Health, Inference, and Learning; April 2, 2020: 214–221
  • 26 Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics; August 2018: 1638-1649
  • 27 Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. J Am Med Inform Assoc 2013; 20 (01) 77-83
  • 28 Yang H, Garibaldi JM. Automatic detection of protected health information from clinic narratives. J Biomed Inform 2015; 58: S30-S38
  • 29 Liu Z, Chen Y, Tang B. et al. Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. J Biomed Inform 2015; 58: S47-S52
  • 30 Khin K, Burckhardt P, Padman R. A deep learning architecture for de-identification of patient notes: implementation and evaluation. arXiv preprint arXiv:1810.01570. 2018 October 3. Accessed March 7, 2024 at: https://doi.org/10.48550/arXiv.1810.01570
  • 31 Rizvi RF, Harder KA, Hultman GM. et al. A comparative observational study of inpatient clinical note-entry and reading/retrieval styles adopted by physicians. Int J Med Inform 2016; 90: 1-1
  • 32 Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. J Biomed Inform 2015; 58: S20-S29
  • 33 Casola S, Lauriola I, Lavelli A. Pre-trained transformers: an empirical comparison. Mach Learn Appl 2022; 9: 100334
  • 34 Liu Y, Ott M, Goyal N. et al. Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. 2019 July 26. Accessed March 7, 2024 at: https://doi.org/10.48550/arXiv.1907.11692
  • 35 Alsentzer E, Murphy JR, Boag W. et al. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323. 2019 April 6. Accessed March 7, 2024 at: https://doi.org/10.48550/arXiv.1904.03323
  • 36 Syed MAB, Krstovski K, Withall J. et al. Heuristic-based extraction and unigram analysis of nursing free text data residing in large EHR clinical notes. In: Proceedings of the 17th EAI International Conference on Pervasive Computing Technologies for Healthcare, held on November 27–29, 2023, in Malmö, Sweden
  • 37 Kailas P, Goto S, Homilius M, MacRae CA, Deo RC. obi-ml-public/ehr_deidentification (0.1.0b). Zenodo 2022. Accessed March 7, 2024 at: https://doi.org/10.5281/zenodo.6617957
  • 38 Trienes J, Trieschnigg D, Seifert C, Hiemstra D. Comparing rule-based, feature-based and deep neural methods for de-identification of dutch medical records. arXiv preprint arXiv:2001.05714. 2020 Jan 16. Accessed March 7, 2024 at: https://doi.org/10.48550/arXiv.2001.05714
  • 39 Adnan M, Warren J, Orr M. Assessing text characteristics of electronic discharge summaries and their implications for patient readability. In: Proceedings of the Fourth Australasian Workshop on Health Informatics and Knowledge Management,. January 1, 2010; Vol. 108, pp. 77–84
  • 40 Dai H, Liu Z, Liao W. et al. Chataug: Leveraging chatgpt for text data augmentation. arXiv preprint arXiv:2302.13007. 2023 February 25. Accessed March 7, 2024 at: https://doi.org/10.48550/arXiv.2302.13007
  • 41 Liu Z, Yu X, Zhang L. et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032. 2023 Mar 20. Accessed March, 2024 at: https://doi.org/10.48550/arXiv.2303.11032
  • 42 Carrell D, Malin B, Aberdeen J. et al. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text. J Am Med Inform Assoc 2013; 20 (02) 342-348
  • 43 Chambon PJ, Wu C, Steinkamp JM, Adleberg J, Cook TS, Langlotz CP. Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods. J Am Med Inform Assoc 2023; 30 (02) 318-328
  • 44 Rothstein MA. Is deidentification sufficient to protect health privacy in research?. Am J Bioeth 2010; 10 (09) 3-11

Address for correspondence

Fangyi Chen, MS
Department of Biomedical Informatics, Columbia University
New York, NY 10027-6902
United States   

Publication History

Received: 01 December 2023

Accepted: 15 February 2024

Accepted Manuscript online:
06 March 2024

Article published online:
08 May 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany


Fig. 1 Overview of the study.
Fig. 2 Illustration of the de-identification process. The raw nursing note will be split into sentences and tokenized, as described in the “Data Preprocessing” section. The input sequence contains a sentence with a maximum token length of 128 and 32 tokens on both sides of the sentence for contextual enrichment, which will be passed into the pretrained transformers (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries.[32] RoBERTa and ClinicalBERT are both variants of the BERT (Bidirectional Encoder Representations from Transformers) model.[33] [34] [35] RoBERTa (Robustly Optimized BERT Pretraining Approach)[34] was trained on much larger datasets (over 160 GB of uncompressed text found on Web sites) than the BERT base model, and ClinicalBERT[35] was trained on MIMIC-III clinical texts. The two models were fine-tuned on i2b2 discharge summaries for the de-identification task, which is treated as named entity recognition. The input sequence will be passed into the pretrained de-identification models, which output a PHI label for each token, where the label “O” represents non-PHI tokens. PHI, protected health information.
Fig. 3 Correlation frequency pair plot for PHI types. PHI, protected health information.
Fig. 4 Pattern comparison of PHI distribution in discharge summaries and inpatient nursing notes. Visualization of PHI distribution for discharge summaries (A) and inpatient nursing notes (B). Plots C and D display the median (shown as a circle) and the range (min, max) of PHI counts for discharge summaries (C) and inpatient nursing notes (D). PHI, protected health information.
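The input construction described in the Fig. 2 caption (a sentence of at most 128 tokens, padded with 32 tokens of context on each side) can be sketched as follows. This is one possible reading of the caption, not the authors' code: the name `build_windows`, the splitting of long sentences into 128-token chunks, and the use of pre-tokenized word lists in place of a subword tokenizer are assumptions made for illustration.

```python
from typing import List, Tuple

MAX_SENTENCE_TOKENS = 128  # maximum sentence length given in the Fig. 2 caption
CONTEXT_TOKENS = 32        # context added on each side, per the caption


def build_windows(sentences: List[List[str]]) -> List[Tuple[List[str], int, int]]:
    """Build one model input per sentence chunk.

    Returns (window_tokens, start, end) tuples, where window_tokens[start:end]
    is the chunk to be labeled and the remaining tokens are context only.
    """
    flat = [token for sentence in sentences for token in sentence]
    windows = []
    offset = 0  # position of the current sentence within the whole note
    for sentence in sentences:
        # Assumption: sentences longer than the limit are split into chunks.
        for i in range(0, len(sentence), MAX_SENTENCE_TOKENS):
            chunk_start = offset + i
            chunk_end = min(offset + len(sentence), chunk_start + MAX_SENTENCE_TOKENS)
            left = max(0, chunk_start - CONTEXT_TOKENS)
            right = min(len(flat), chunk_end + CONTEXT_TOKENS)
            windows.append((flat[left:right], chunk_start - left, chunk_end - left))
        offset += len(sentence)
    return windows


# Synthetic three-sentence note with placeholder tokens.
example_note = [["a"] * 40, ["b"] * 10, ["c"] * 50]
for tokens, start, end in build_windows(example_note):
    print(len(tokens), start, end)
```

Only the labels predicted for `window_tokens[start:end]` would be kept for each window, so every token in the note is labeled once while still being seen with its neighbors.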