Abstract
Background Social and behavioral determinants of health (SBDH) are environmental and behavioral
factors that often impede disease management and result in sexually transmitted infections.
Despite their importance, SBDH are inconsistently documented in electronic health
records (EHRs) and typically collected only in an unstructured format. Evidence suggests
that structured data elements present in EHRs can further help to identify SBDH
in the patient record.
Objective Explore the automated inference of both the presence of SBDH documentation and individual
SBDH risk factors in patient records. Compare the relative ability of clinical notes
and structured EHR data, such as laboratory measurements and diagnoses, to support
inference.
Methods We attempt to infer the presence of SBDH documentation in patient records, as well
as patient status of 11 SBDH, including alcohol abuse, homelessness, and sexual orientation.
We compare classification performance when considering clinical notes only, structured
data only, and notes and structured data together. We perform an error analysis across
several SBDH risk factors.
Results Classification models inferring the presence of SBDH documentation achieved good
performance (F1 score: 78.7–92.7; F1 was the primary evaluation metric).
Performance was variable for models inferring patient SBDH risk status; results ranged
from F1 = 82.7 for LGBT (lesbian, gay, bisexual, and transgender) status to F1 = 28.5
for intravenous drug use. Error analysis demonstrated that lexical diversity and documentation
of historical SBDH status challenge inference of patient SBDH status. Three of five
classifiers inferring topic-specific SBDH documentation and 10 of 11 patient SBDH
status classifiers achieved highest performance when trained using both clinical notes
and structured data.
Conclusion Our findings suggest that combining clinical free-text notes and structured data
provides the best approach to classifying patient SBDH status. Inferring patient SBDH
status is most challenging among SBDH with low prevalence and high lexical diversity.
Keywords
social determinants of health - electronic health records - machine learning - natural
language processing
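
For readers unfamiliar with the primary evaluation metric reported in the Results, the following is a minimal sketch of how an F1 score is conventionally computed: the harmonic mean of precision and recall. The function name, the count-based interface, and the 0 to 100 scale (chosen to match how the scores are reported here) are illustrative assumptions, not the authors' evaluation code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 on a 0-100 scale from true positive, false positive,
    and false negative counts (hypothetical helper for illustration)."""
    if tp == 0:
        # With no true positives, precision and recall are both zero
        # (or undefined), so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    # Harmonic mean of precision and recall, scaled to a percentage.
    return 100.0 * 2.0 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the lower of its two inputs, a classifier for a low-prevalence SBDH can score poorly on F1 when either precision or recall is low.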