DOI: 10.1055/a-2291-1391
Manual Evaluation of Record Linkage Algorithm Performance in Four Real-World Datasets
Funding This work was supported by the Patient-Centered Outcomes Research Institute grant number ME-2017C1-6425.
Abstract
Objectives Patient data are fragmented across multiple repositories, yielding suboptimal and costly care. Record linkage algorithms are widely accepted solutions for improving completeness of patient records. However, studies often fail to fully describe their linkage techniques. Further, while many frameworks evaluate record linkage methods, few focus on producing gold standard datasets. This highlights a need to assess these frameworks and their real-world performance. We use real-world datasets and expand upon previous frameworks to evaluate a consistent approach to the manual review of gold standard datasets and measure its impact on algorithm performance.
Methods We applied the framework, which includes elements for data description, reviewer training and adjudication, and software and reviewer descriptions, to four datasets. Record pairs were formed, and between 15,000 and 16,500 pairs were randomly sampled per dataset. After training, two reviewers determined match status for each record pair. If reviewers disagreed, a third reviewer was used for final adjudication.
Results Across the four datasets, the percent discordance rate ranged from 1.8 to 13.6%. While reviewers' discordance rates typically ranged between 1 and 5%, one reviewer exhibited a 59% discordance rate, underscoring the importance of the third reviewer. The original analysis was compared with three sensitivity analyses. The original analysis most often exhibited the highest predictive values compared with the sensitivity analyses.
Conclusion Reviewers vary in their assessment of a gold standard, which can lead to variances in estimates for matching performance. Our analysis demonstrates how a multireviewer process can be applied to create gold standards, identify reviewer discrepancies, and evaluate algorithm performance.
Background and Significance
Fragmented patient care patterns combined with poor data interoperability and limited information exchange means patient data are siloed across multiple clinical repositories.[1] [2] This fragmentation causes discrepancies in patient reports and negatively affects health care systems' ability to deliver accurate, fast, and adequate patient care.[3] [4] [5] [6] To compile a single longitudinal patient record across disparate information repositories for clinical decision-making and population health management activities, health care organizations must use identifiers to link patient records.[7] However, the United States is the last industrialized nation without a unique patient identifier (UPI) and Congress has barred funding for developing a patient identifier system for over two decades, primarily citing privacy concerns.[8] [9] [10] Without a national UPI, health care systems rely on potentially unique information, such as social security numbers (SSNs) and patient names, to link records via probabilistic and heuristic matching algorithms.[2] [11] While many current patient matching systems have match rates above 99%, they often fail to fully describe their methodology for future evaluation and lack definitive information on which records should be linked, leading to undetectable false positive and false negative matches.[2] [12] [13]
There must be standardized reporting of methods to develop matching algorithms to enable reproducible and consistent processes. Peer-reviewed methodologies for record linkage, such as GUILD (GUidance for Information about Linking Datasets), describe each step in the linkage process and recommend methods to improve data quality and assess linkage error and its impact on results.[14] [15] [16] For example, GUILD step 2d says “Details of how error rates were estimated, for example, by comparing linked records with a reference dataset.”[14] While they mention the importance of a reference dataset, they offer no guidelines for building a gold standard. Gold standard datasets identify record linkage errors, where records should have been linked or where they have been incorrectly linked together.[17] Currently, few organizations manually create a reference dataset to review algorithm predictive performance and linkage error.[18] [19] [20] Further, there is inconsistent reporting of gold standard curation, meaning the rigor of data quality and the elements of manual review are largely unknown.[21] [22]
Previous publications have offered frameworks for the reproducible evaluation of record linkage, although few focus on the detailed aspects of the process for producing gold standard datasets needed to assess matching algorithms.[15] [23] This study aims to use real-world datasets to apply and expand upon these frameworks, emphasizing the manual review of potentially matched patient records. We describe the information, tools, and preparation techniques required to fully describe the datasets using a systematic approach. The findings can inform decision-makers regarding the value of consistent approaches to the manual review of gold standard datasets and the impact on algorithm performance.
Methods
Real-world data were gathered from the Indiana Health Information Exchange, one of the nation's largest health information exchange networks. The manual review framework was applied to four linkage applications: deduplication of the Indiana Network for Patient Care (INPC), deduplication of the Newborn Screening (NBS) records, linkage of the Marian County Public Health Department (MCHD) to the INPC, and linkage of the Social Security Number Death Master Registry (SSDMR) to the INPC. In the Methods, we describe the four datasets involved in the linkage applications, using the INPC deduplication as an exemplar for reporting the manual review process.
Dataset Descriptions
The INPC research database is one of the largest statewide health information exchanges as it covers two thirds of the Indiana population with over 18 million patients, 93 hospitals, and 950 million encounter records.[24] [25] The 47,334,986 record dataset contains overlapping patient populations from the different health systems, with varying levels of data completeness.
A NBS dataset was extracted from Health-Level 7 (HL7) messages from providers within the INPC. Newborns of age 2 months or less at the time of the HL7 message within a 1-year time frame were included in the 765,813 record dataset. There is a varying level of completeness, with many missing names and few unique identifiers.
The MCHD covers a large metropolitan area and tracks population health trends to support many public health activities. The 471,298 record dataset is derived from multiple public health service areas and exhibits varying quality, including inaccurate and incomplete patient data with few unique identifiers.
The SSDMR contains 89,556,520 records of national death data linked to all participating health information exchange organizations within the INPC database. As a federal patient registry, it has a high level of data completeness; however, it has fewer matching fields relative to the other datasets.[26]
Preprocessing: Field Selection and Data Standardization
For the INPC dataset deduplication, reviewers evaluated record pairs using the following 11 matching fields: social security number (SSN), last name (LN), first name (FN), middle name (MN), sex, date of birth (DB), month of birth (MB), year of birth (YB), phone number (TEL), address (ADR), and zip code (ZIP). The raw datasets also included city and state fields, but these were excluded because ZIP was used. Data standardization steps included formatting date of birth fields as “MM/DD/YYYY” and removing hyphens, parentheses, and international codes from telephone numbers. The team reviewed the raw data to eliminate any bias from the data standardization process.
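As an illustration only, the following Python sketch shows the kind of standardization described above; the helper names and the accepted input formats are assumptions, not the study's actual pipeline.

```python
import re
from datetime import datetime

def standardize_dob(raw: str) -> str:
    """Normalize a date of birth to MM/DD/YYYY; the accepted input formats are assumed."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y", "%Y%m%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return raw.strip()  # leave unparseable values untouched rather than guessing

def standardize_phone(raw: str) -> str:
    """Strip hyphens, parentheses, spaces, and a leading international code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):  # assumed US country-code prefix
        digits = digits[1:]
    return digits

print(standardize_dob("1987-03-09"))            # 03/09/1987
print(standardize_phone("+1 (317) 555-0199"))   # 3175550199
```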
Record Pair Grouping (Blocking)
Blocking serves to subset a large dataset into smaller groups of potential record pairs by common attributes to efficiently reduce the computational complexity of record comparison.[27] As shown in [Supplementary Appendix 1] (available in the online version), data were blocked using five blocking schemes based on expert input. We then performed proportional sampling from the union of record pairs with strata defined by the blocking schemes.
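The study's five blocking schemes are listed in Supplementary Appendix 1; the sketch below uses three hypothetical schemes (last name plus birth year, SSN, and phone) only to illustrate how blocking partitions records into strata of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical blocking schemes; the study's five actual schemes are in Supplementary Appendix 1.
BLOCKING_SCHEMES = {
    "ln_yb": lambda r: (r["last_name"].lower(), r["year_of_birth"]),
    "ssn":   lambda r: (r["ssn"],),
    "tel":   lambda r: (r["phone"],),
}

def candidate_pairs(records):
    """Return candidate record pairs grouped by the blocking scheme that produced them.

    `records` maps a record identifier to a dict of field values (illustrative layout).
    """
    pairs_by_scheme = defaultdict(set)
    for name, key_fn in BLOCKING_SCHEMES.items():
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            key = key_fn(rec)
            if all(key):                      # skip blocks keyed on missing values
                blocks[key].append(rec_id)
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                pairs_by_scheme[name].add((a, b))
    return pairs_by_scheme
```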
Procedure for Sampling and Matching Pairs
A total of 15,000 record pairs were sampled using proportional sampling from the total number of record pairs, with strata defined by the blocking schemes. The sampling probability of a pair was proportional to the size of the stratum to which the pair belonged.
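A minimal sketch under one reading of the sampling description, in which the number of pairs drawn from each stratum is proportional to that stratum's size; the data structures and sample size handling are illustrative.

```python
import random

def proportional_sample(strata, total_n, seed=42):
    """Sample record pairs with allocation proportional to each stratum's size.

    `strata` maps a stratum label (e.g., a blocking scheme) to its list of candidate
    pairs. This is a sketch only; the study sampled roughly 15,000 pairs per dataset.
    """
    rng = random.Random(seed)
    grand_total = sum(len(pairs) for pairs in strata.values())
    sample = []
    for label, pairs in strata.items():
        n_h = round(total_n * len(pairs) / grand_total)   # proportional allocation
        sample.extend(rng.sample(list(pairs), min(n_h, len(pairs))))
    return sample
```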
Record pairs with fewer than two agreeing fields were automatically considered “certain nonmatches,” whereas record pairs with 10 or more agreeing fields were automatically considered “certain matches.” Both cases were excluded from human review. Record pairs with 3 to 9 agreeing fields underwent human review.
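The triage rule can be expressed as a simple function of the number of agreeing fields. In this sketch, pairs with exactly two agreeing fields are routed to human review, which is one reading of the protocol; the thresholds themselves come from the text above.

```python
def agreeing_field_count(rec_a, rec_b, fields):
    """Count fields that are present in both records and agree exactly."""
    return sum(
        1 for f in fields
        if rec_a.get(f) and rec_b.get(f) and rec_a[f] == rec_b[f]
    )

def triage(rec_a, rec_b, fields):
    """Route a pair to automatic labeling or human review based on agreement count."""
    n = agreeing_field_count(rec_a, rec_b, fields)
    if n < 2:
        return "certain nonmatch"   # excluded from human review
    if n >= 10:
        return "certain match"      # excluded from human review
    return "human review"           # 3-9 agreeing fields per the study protocol
```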
Process for Judging Record Matches and Nonmatches
Four of the team's experienced reviewers curated a 200-record dataset that contained data from the INPC and MCHD. The experts discussed discordant match/nonmatch determinations and used RecMatch, a proprietary probabilistic linkage program based on the Fellegi-Sunter model that scores and organizes potential matches into record pairs, to establish a gold standard.[28] The reviewers selected a variety of patient records where many fields agreed, few fields agreed, and a combination of discriminating and nondiscriminating fields agreed.
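RecMatch itself is proprietary, so the sketch below only illustrates the Fellegi-Sunter idea it is based on: summing per-field agreement and disagreement log-weights derived from m- and u-probabilities. The probabilities and field names shown are invented for illustration.

```python
import math

# Illustrative m- and u-probabilities; RecMatch's actual parameters are proprietary.
# m = P(field agrees | records are a true match); u = P(field agrees | nonmatch).
FS_PARAMS = {
    "ssn":        (0.95, 0.001),
    "last_name":  (0.90, 0.01),
    "first_name": (0.90, 0.02),
    "dob":        (0.97, 0.005),
}

def fellegi_sunter_score(rec_a, rec_b):
    """Sum log2 agreement/disagreement weights over the configured fields."""
    score = 0.0
    for field, (m, u) in FS_PARAMS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue                               # missing values contribute no weight
        if a == b:
            score += math.log2(m / u)              # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score
```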
Each reviewer was trained with this dataset. Research staff reviewed training mismatches between every reviewer and the gold standard to compare discordance and highlight any potential biases. If a reviewer showed significant disagreement with the gold standard, defined as a 20% discordance rate, they received additional training before proceeding to the larger record pair dataset.
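A small sketch of the training check, assuming reviewer and gold-standard labels are keyed by pair identifier; the 20% threshold comes from the text, and treating exactly 20% as triggering retraining is an assumption.

```python
def discordance_rate(reviewer_labels, gold_labels):
    """Fraction of training pairs on which a reviewer disagrees with the gold standard."""
    assert set(reviewer_labels) == set(gold_labels)
    disagreements = sum(
        1 for pair_id, label in reviewer_labels.items()
        if label != gold_labels[pair_id]
    )
    return disagreements / len(gold_labels)

def needs_retraining(reviewer_labels, gold_labels, threshold=0.20):
    """Flag reviewers whose discordance with the 200-pair training set reaches the threshold."""
    return discordance_rate(reviewer_labels, gold_labels) >= threshold
```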
Interrater Reliability Measures
During the manual review, at least two reviewers independently reviewed each record pair, and if the two reviewers disagreed, an additional independent annotator was used as a tiebreaker. In the analysis, we report each reviewer's individual discordance rate and each analysis's sensitivity, specificity, and F-scores.
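The adjudication logic reduces to a short rule: accept concordant labels and defer discordant pairs to the third reviewer. The sketch below is illustrative, with the third reviewer modeled as a callback.

```python
def adjudicate(label_1, label_2, tiebreak_fn):
    """Resolve a pair's match status from two independent reviews.

    `tiebreak_fn` stands in for the third reviewer and is consulted only
    when the first two reviewers disagree.
    """
    if label_1 == label_2:
        return label_1, False       # concordant pair
    return tiebreak_fn(), True      # discordant pair resolved by third reviewer

# Example usage with a stand-in third reviewer:
final_status, was_discordant = adjudicate("match", "nonmatch", tiebreak_fn=lambda: "match")
# final_status == "match", was_discordant == True
```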
Review Software or Tools Used
We developed a manual review tool to standardize the review process, facilitating pairwise comparisons across all 11 variables for each record pair ([Fig. 1]). The interface enabled reviewers to easily navigate between record pairs and assign match probability in a centralized, secure environment.[29] The software highlighted agreeing fields green, disagreeing fields red, and missing fields gray for reviewer ease. Fields that differed only in syntax (e.g., “court” vs. “ct”) were not treated as agreeing. In addition, the software displayed previously matched records beneath the records under review so reviewers could consult prior patient information.
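A sketch of the field-comparison logic behind the color coding, assuming exact string comparison on already-standardized values, which is consistent with syntax mismatches such as “court” vs. “ct” being displayed as disagreements.

```python
def field_status(value_a, value_b):
    """Classify a field comparison the way the review interface colors it."""
    if not value_a or not value_b:
        return "missing"        # shown gray
    if value_a == value_b:
        return "agree"          # shown green; exact comparison only, so
                                # "court" vs. "ct" counts as a disagreement
    return "disagree"           # shown red

def compare_pair(rec_a, rec_b, fields):
    """Return the display status for every matching field of a record pair."""
    return {f: field_status(rec_a.get(f), rec_b.get(f)) for f in fields}
```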
Reviewed pairs were scored on a scale of 1 to 4, with 1 indicating “Certain Nonmatch,” 2 “Probable Nonmatch,” 3 “Probable Match,” and 4 “Certain Match.” Scores were entered in the box next to the “Certainty (1–4)” legend ([Fig. 1]). Reviewers used the same manual review software for both training and final annotation.
Reviewer Characteristics
Fifteen annotators were recruited to review the INPC dataset. The highest degree and demographic background for each annotator are described in [Supplementary Appendix 2] (available in the online version). Since assessing matches among record pairs often includes comparing individual names from different demographic and cultural backgrounds, reporting reviewer characteristics may help understand potential biases in record linkage algorithms.
Results
The manual review methodology identified above was applied to four different applications: INPC deduplication, MCHD-INPC linkage, SSDMR-INPC linkage, and NBS deduplication.
Overview of Adjudication Results for the Four Datasets
[Table 1] presents the overall number of agreed and discordant pairs, along with the percent discordance rate for each dataset. Agreement is defined as both reviewers assigning the same match or nonmatch status, while discordant pairs are those on which the reviewers disagreed. The INPC and MCHD datasets had a lower discordance rate, partly because each contains higher quality data with more distinct patient populations. The NBS dataset had a higher rate at 5.7%, which was expected as many newborn records have generic data (e.g., John Doe) and no previously matched information. It was unexpected that the SSDMR dataset had a 13.6% discordance rate, as the data derive from a federal SSN master file with specific data quality requirements.[30] However, the SSDMR dataset has fewer types of matching fields than the other datasets, so reviewers had less information with which to adjudicate.[30]
Abbreviations: INPC, Indiana Network for Patient Care; MCHD, Marian County Public Health Department; NBS, Newborn Screening; SSDMR, Social Security Number Death Master Registry.
Note: Each dataset contains between 15,000 and 16,500 reviewed pairs.
Discordant Pairs Results by Dataset
[Table 2] further splits the discordant pairs by the match/nonmatch status determined by the third reviewer. The match rate for the INPC and MCHD datasets was 43%, whereas the NBS dataset was lower at 33%. This range of match rates is expected, as the first two reviewers had differing adjudications and reviewers tend to be slightly more conservative in record matching. The SSDMR dataset had a 16% match rate, meaning most discordant pairs were labeled as nonmatches.
Abbreviations: INPC, Indiana Network for Patient Care; MCHD, Marian County Public Health Department; NBS, Newborn Screening; SSDMR, Social Security Number Death Master Registry.
Discordance Rates per Dataset by Individual Reviewer
[Table 3] breaks down each dataset by individual reviewer and presents their overall agreed and discordant pair count with the discordance rate. The discordance rate for reviewers in the INPC and MCHD generally fluctuated between 1 and 3%, supporting the low overall discordance rate presented in [Table 1]. While Reviewer 5 had slightly higher discordance rates in the INPC and MCHD datasets, they were a significant outlier in the SSDMR dataset with a 59% discordance rate. It is possible this reviewer was clicking the wrong button or had a conservative matching bias; regardless, as shown in [Table 2], the third reviewer helped control such variability as a tie-breaking adjudicator.
Abbreviations: INPC, Indiana Network for Patient Care; MCHD, Marian County Public Health Department; NBS, Newborn Screening; SSDMR, Social Security Number Death Master Registry.
Notes: Reviewers who have blank values in some columns did not review records in that dataset. INPC and MCHD datasets had generally low discordance rates per reviewer. Reviewer 5 consistently had higher discordance rates compared with other reviewers, with a 59% discordance rate in the SSDMR dataset.
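The per-reviewer counts and discordance rates in [Table 3] (and, aggregated over pairs, the dataset-level rates in [Table 1]) can be derived from the double-annotated pairs; the sketch below assumes an illustrative tuple layout for the annotations.

```python
from collections import Counter

def reviewer_discordance(annotations):
    """Compute per-reviewer agreed/discordant counts from double-annotated pairs.

    `annotations` is a list of (reviewer_1, label_1, reviewer_2, label_2) tuples,
    one per reviewed pair; this layout is illustrative, not the study's data model.
    """
    agreed, discordant = Counter(), Counter()
    for r1, l1, r2, l2 in annotations:
        bucket = agreed if l1 == l2 else discordant
        bucket[r1] += 1
        bucket[r2] += 1
    return {
        reviewer: {
            "agreed": agreed[reviewer],
            "discordant": discordant[reviewer],
            "discordance_pct": 100 * discordant[reviewer]
                               / max(agreed[reviewer] + discordant[reviewer], 1),
        }
        for reviewer in set(agreed) | set(discordant)
    }
```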
Match Status Sensitivity Analyses by Dataset
Although we used a third reviewer for discordant pairs, discordance between the first two reviewers indicates the uncertainty in the data. The original analysis displays the match performance of our algorithm using the established gold standards for each dataset ([Table 4]).[31] We then conducted three sensitivity analyses on each gold standard dataset by changing the match status of discordant pairs and evaluated the performance ([Table 4]). Since the third reviewer can potentially agree with either of the original reviewers, the first sensitivity analysis explores the counterfactual in which the third reviewer had instead sided with the other reviewer, that is, the one they disagreed with during the actual adjudication. For the second sensitivity analysis, all discordant pairs were declared nonmatches, and for the third sensitivity analysis they were declared matches. While it is improbable that discordant pairs are either all matches or all nonmatches, the second and third sensitivity analyses represent extreme scenarios, investigated to ensure the comprehensiveness of our sensitivity evaluations of matching performance.
Abbreviations: CI, confidence interval; INPC, Indiana Network for Patient Care; MCHD, Marian County Public Health Department; NBS, Newborn Screening; NPV, negative predictive value; PPV, positive predictive value; SSDMR, Social Security Number Death Master Registry.
Notes: The original analysis shows the performance metrics of the algorithm compared against the manually reviewed datasets. For the first sensitivity analysis, we reversed the match status of a discordant pair to nonmatch if the adjudicated match status is match, and match if the adjudicated match status is nonmatch. For the second sensitivity analysis, we set match status of all discordant pairs as nonmatches and in the third sensitivity analysis, we declared all discordant pairs to be matches.
In the first sensitivity analysis, sensitivity (recall) decreased numerically in all four datasets, significantly for INPC and SSDMR. The F1-score decreased significantly for INPC and SSDMR, whereas both the F1-score and positive predictive value (PPV) increased significantly for NBS; NBS data naturally have more uncertainty because many record pairs have generic names (e.g., John Doe) and the same date of birth with limited address and telephone information. In the second sensitivity analysis, sensitivity increased in all four datasets. This was expected, as the denominator for sensitivity is smaller when all discordant pairs are treated as nonmatches. Likewise, PPV decreased significantly in all four use cases. In the third sensitivity analysis, sensitivity decreased significantly for INPC, SSDMR, and MCHD, as the denominator of sensitivity is larger when all discordant pairs are treated as matches.
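A sketch of how the original gold standard and the three counterfactual variants can be constructed from the two initial labels plus the adjudicated label, and how sensitivity, PPV, and F1 are then computed against the algorithm's predictions; the data layout is assumed for illustration.

```python
def build_gold_standards(pairs):
    """Build the original and three counterfactual gold standards.

    `pairs` is a list of dicts with keys 'label_1', 'label_2', and 'adjudicated'
    (the third reviewer's call for discordant pairs); this layout is illustrative.
    Labels are 'match' or 'nonmatch'.
    """
    flip = {"match": "nonmatch", "nonmatch": "match"}
    analyses = {"original": [], "reversed_adjudication": [],
                "all_nonmatches": [], "all_matches": []}
    for p in pairs:
        if p["label_1"] == p["label_2"]:                # concordant pair: same in all analyses
            for a in analyses.values():
                a.append(p["label_1"])
        else:                                           # discordant pair
            analyses["original"].append(p["adjudicated"])
            analyses["reversed_adjudication"].append(flip[p["adjudicated"]])
            analyses["all_nonmatches"].append("nonmatch")
            analyses["all_matches"].append("match")
    return analyses

def metrics(predicted, gold):
    """Sensitivity, PPV, and F1 with 'match' as the positive class."""
    tp = sum(p == g == "match" for p, g in zip(predicted, gold))
    fp = sum(p == "match" and g == "nonmatch" for p, g in zip(predicted, gold))
    fn = sum(p == "nonmatch" and g == "match" for p, g in zip(predicted, gold))
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    return {"sensitivity": sens, "PPV": ppv, "F1": f1}
```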
Discussion
To evaluate a patient matching algorithm, there must be a consistent process for creating the gold standard dataset against which it is evaluated.[17] There have been multiple studies in which authors conducted a manual review process on their datasets.[18] [19] [20] However, this process can differ by author group and institution, resulting in varying estimates for algorithm performance and linkage error. Based on previous record linkage manual review frameworks, we reported each step of our manual review process, from data cleaning and sampling to reviewer training and a description of the software used.[23] This level of reporting provides a detailed description of each step and allows others to evaluate, reproduce, and utilize our review methodology.
This process allowed us to accurately measure real-world match performance and identify potential linkage biases. First, we assembled a diverse team of reviewers with different degrees, demographics, and backgrounds. Second, we used a consistent process to train each reviewer regardless of prior knowledge, ensuring each reviewer was unbiased and comfortable with the software. Third, allowing reviewers to assign a confidence score from 1 to 4 provided a flexibility in label assignment that is absent from traditional binary match/nonmatch processes. Lastly, we had a third reviewer resolve any discordant pairs. This consistent procedure provided structure in our manual review process and resulted in the best algorithmic concordance: as [Table 4] shows, the original analysis almost always had the highest predictive values and F-scores compared with the various sensitivity analyses.
Further, using a third reviewer creates a more resilient process than having only one or two reviewers assess each record. In [Table 3], for the SSDMR dataset, reviewer 5 had a 59% discordance rate. Although such a case may be an outlier, it might not have been identified without a detailed, formal manual review process. Although all reviewers received the same training, each adopts a different strategy for manual review, producing varying results for match concordance. Moreover, without a third reviewer, the algorithm's measured match performance would have been lower, as [Table 4] depicts how significantly performance metrics change simply by altering the match status of discordant pairs. More importantly, worse algorithmic performance leads to inaccurate patient identification, delays in patient care, and higher health care costs.[2] [3] [32] [33] This underlines the need for consistent manual review to evaluate reviewer accuracy and matching performance.
Limitations
This framework has two primary limitations. First, the quality of reference data, including completeness and accuracy, can significantly influence matching performance. For instance, the NBS dataset, characterized by numerous incomplete records and poor field quality, was the only dataset that exhibited worse performance in the original analysis compared with the sensitivity analyses ([Table 4]). Depending on the dataset type and record quality, the framework and subsequent reviewer training could be refined in the future.[34] Second, many studies have deemed manual review too expensive for research and care delivery applications, as it demands substantial personnel and technical resources.[22] [35] Nevertheless, due to increased federal funding and the high costs associated with erroneous record matching, some institutions have recently prioritized manual review.[19] [36] [37]
Conclusion
Our findings show that individual reviewers produce different gold standards, resulting in varying matching performance estimates. We used a uniform manual review approach on four real-world datasets to assess the effects on matching algorithm performance in comparison to reference datasets. This approach outlined each manual review stage, enabling us to pinpoint reviewers with significant disagreement and examine how altering match status for discordant pairs substantially impacts performance metrics. As a result, health care organizations and policymakers should investigate additional methods to review datasets to ensure the accuracy of real-world matching performance.
Clinical Relevance Statement
Our findings show that individual reviewers produce different gold standards, resulting in varying matching performance estimates. We used a uniform manual review approach on four real-world datasets to assess the effects on matching algorithm performance in comparison to reference datasets. This approach outlined each manual review stage, enabling us to pinpoint reviewers with significant disagreement and examine how altering match status for discordant pairs substantially impacts performance metrics.
Multiple-Choice Questions
- What algorithm was the gold standard compared against?
a. Jaro Approach
b. Fellegi-Sunter
c. Naïve-Bayes
d. Splink-Spark
Correct Answer: The correct answer is option b. As described in the article, the matching algorithm evaluated against the gold standards was based on the Fellegi-Sunter model.
- Which country does not use a UPI or similar identification system?
a. United Kingdom
b. China
c. Thailand
d. United States
Correct Answer: The correct answer is option d. The United States is the last developed nation without a unique patient identifier for patient medical records. As described in the article, this significantly impacts data interoperability and results in worse patient care, as data are siloed and cannot be linked across systems.
Conflict of Interest
None declared.
Protection of Human and Animal Subjects
No animals were used, and all humans involved were employees of Indiana University.
Data Availability Statement
The participants of this study did not give written consent for their data to be shared publicly, so due to the sensitive nature of the research supporting data are not available.
Authors' Contributions
S.J.G. and J.R.V. contributed to the conception, design, acquisition, and analysis for the work. H.X. and X.L. performed analysis and contributed to design. A.K.G. drafted the initial manuscript and contributed to analysis.
References
- 1 Finnell JT, Overhage JM, Grannis S. All health care is not local: an evaluation of the distribution of emergency department care delivered in Indiana. AMIA Annu Symp Proc 2011; 2011: 409-416
- 2 Genevieve Morris GF, Scott A, Carol R. Patient identification and matching final report. Off Natl Coordinator Health Inform Technol Audacious Inquiry 2014 . Accessed May 31, 2024 at: https://www.healthit.gov/sites/default/files/resources/patient_identification_matching_final_report.pdf
- 3 Friedman CP, Wong AK, Blumenthal D. Achieving a nationwide learning health system. Sci Transl Med 2010; 2 (57) 57cm29
- 4 Just BH, Proffitt K. Do you know who's who in your EHR?. Healthc Financ Manage 2009; 63 (08) 68-73
- 5 College of Healthcare Information Management Executives. Summary of CHIME Survey on Patient Data-Matching. CHIME; 2012
- 6 Grinspan ZM, Abramson EL, Banerjee S, Kern LM, Kaushal R, Shapiro JS. Potential value of health information exchange for people with epilepsy: crossover patterns and missing clinical data. AMIA Annu Symp Proc 2013; 2013: 527-536
- 7 Kern LM, Grinspan Z, Shapiro JS, Kaushal R. Patients' use of multiple hospitals in a major US city: implications for population management. Popul Health Manag 2017; 20 (02) 99-102
- 8 HIMSS Applauds Senate in Removing Ban on Unique Patient Identifier from Labor-HHS Bill. 2021 https://www.himss.org/news/himss-applauds-senate-removing-ban-unique-patient-identifier-labor-hhs-bill
- 9 Hillestad R, Bigelow JH, Chaudhry B. et al. Identity crisis? Approaches to patient identification in a National Health Information Network. Santa Monica, CA: RAND Corporation; 2008
- 10 Bernstam EV, Applegate RJ, Yu A. et al. Real-world matching performance of deidentified record-linking tokens. Appl Clin Inform 2022; 13 (04) 865-873
- 11 Ross MK, Sanz J, Tep B, Follett R, Soohoo SL, Bell DS. Accuracy of an electronic health record patient linkage module evaluated between neighboring academic health care centers. Appl Clin Inform 2020; 11 (05) 725-732
- 12 The Sequoia Project BCBS. Person matching for greater interoperability: a case study for payers. The Sequoia Project; 2020
- 13 Healthcare Information and Management Systems Society. EPIC: care everywhere. 2008 https://www.himss.org/resource-environmental-scan/care-everywhere
- 14 Gilbert R, Lafferty R, Hagger-Johnson G. et al. GUILD: GUidance for Information about Linking Data sets. J Public Health (Oxf) 2018; 40 (01) 191-198
- 15 Pratt NL, Mack CD, Meyer AM. et al. Data linkage in pharmacoepidemiology: a call for rigorous evaluation and reporting. Pharmacoepidemiol Drug Saf 2020; 29 (01) 9-17
- 16 Privacy and Security Solutions for Interoperable Health Information Exchange Perspectives on Patient Matching: Approaches, Findings, and Challenges. Accessed May 31, 2024 at: https://digital.ahrq.gov/sites/default/files/docs/page/privacy-and-security-solutions-for-interoperable-hie-nationwide-summary.pdf
- 17 Harron KL, Doidge JC, Knight HE. et al. A guide to evaluating linkage quality for the analysis of linked data. Int J Epidemiol 2017; 46 (05) 1699-1710
- 18 Grannis SJ, Overhage JM, McDonald CJ. Analysis of identifier performance using a deterministic linkage algorithm. Proc AMIA Symp 2002; 305-309
- 19 Joffe E, Byrne MJ, Reeder P. et al. A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation. J Am Med Inform Assoc 2014; 21 (01) 97-104
- 20 Campbell KM, Deck D, Krupski A. Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a ‘basic’ deterministic algorithm. Health Informatics J 2008; 14 (01) 5-15
- 21 Beil H, Preisser JS, Rozier RG. Accuracy of record linkage software in merging dental administrative data sets. J Public Health Dent 2013; 73 (02) 89-93
- 22 Antonie L, Inwood K, Lizotte DJ, Andrew Ross J. Tracking people over time in 19th century Canada for longitudinal analysis. Mach Learn 2014; 95 (01) 129-146
- 23 Gupta AK, Kasthurirathne SN, Xu H. et al. A framework for a consistent and reproducible evaluation of manual review for patient matching algorithms. J Am Med Inform Assoc 2022; 29 (12) 2105-2109
- 24 About Us. 2021 https://www.ihie.org/about-us/
- 25 McDonald CJ, Overhage JM, Barnes M. et al; INPC Management Committee. The Indiana Network for Patient Care: a working local health information infrastructure. An example of a working infrastructure collaboration that links data from five health systems and hundreds of millions of entries. Health Aff (Millwood) 2005; 24 (05) 1214-1220
- 26 Conway RBN, Armistead MG, Denney MJ, Smith GS. Validating the matching of patients in the linkage of a large hospital system's EHR with state and national death databases. Appl Clin Inform 2021; 12 (01) 82-89
- 27 A Comparison of Blocking Methods for Record Linkage. Cham: Springer International Publishing; 2014
- 28 Ruppert LP, He J, Martin J. et al. Linkage of Indiana State Cancer Registry and Indiana Network for Patient Care data. J Registry Manag 2016; 43 (04) 174-178
- 29 University Information Technology Services. About Carbonate at Indiana University. 2021 https://kb.iu.edu/d/aolp
- 30 Requesting SSA's Death Information. 2022 https://www.ssa.gov/dataexchange/request_dmf.html
- 31 Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc 1969; 64 (328) 1183-1210
- 32 Studies in Success: Duplicate Records Compromise EHR Investment. Just Associates; 2015
- 33 Black Book Research. Improving Provider Interoperability Congruently Increasing Patient Record Error Rates, Black Book Survey. 2018 https://blackbookmarketresearch.newswire.com/news/improving-provider-interoperability-congruently-increasing-patient-20426295
- 34 Cummins MR, Ranade-Kharkar P, Johansen C. et al. Simple workflow changes enable effective patient identity matching in poison control. Appl Clin Inform 2018; 9 (03) 553-557
- 35 Guillet F, Hamilton HJ. Quality Measures in Data Mining. Springer; 2007
- 36 Bailey SR, Heintzman JD, Marino M. et al. Measuring preventive care delivery: comparing rates across three data sources. Am J Prev Med 2016; 51 (05) 752-761
- 37 Studdert DM, Mello MM, Gawande AA. et al. Claims, errors, and compensation payments in medical malpractice litigation. N Engl J Med 2006; 354 (19) 2024-2033
Publication History
Received: 26 September 2023
Accepted: 18 March 2024
Accepted Manuscript online: 20 March 2024
Article published online: 31 July 2024
© 2024. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany