Keywords: listening effort, psychophysiological measure, listening demand
Assessing physiological measures in listening effort research is common. Between 2019
and 2021, Clarivate's Web of Science database lists a total of 239 articles with the
term “listening effort” in the title, abstract, or keywords. Among these articles,
about 34% (81) employed at least one physiological measure to examine listening effort;
7% (16) employed more than one physiological measure. The variety of measures used
was large and included measures directly indexing brain activity, such as electroencephalogram
(EEG) alpha oscillations,[1][2] EEG evoked potential components,[3][4] and functional near-infrared spectroscopy (fNIRS),[5][6]
as well as peripheral measures, such as skin conductance,[7][8] pupil diameter,[9][10] heart rate variability,[11][12]
and cardiovascular preejection period.[12][13] The choice of measures seemed to be driven more by the researchers'
interest and the availability of measurement equipment than by a theoretical or conceptual
rationale.
Given the lack of a unifying rationale, the heterogeneity in the employed measures
constitutes a problem: It makes it difficult for listening effort researchers to decide
which measure to use, to compare findings across studies involving different measures,
and to draw straightforward conclusions from the existing literature.[14] Ultimately, a unifying rationale would boost theoretical progress and advance our
understanding of the determinants, consequences, and mechanisms associated with listening
effort. A more comprehensive approach that systematically integrates multiple physiological
measures could be particularly useful when studying listening effort. However, there
are a number of practical challenges to combining more than a single physiological
measure in a listening effort study. The purpose of this article is to highlight some
of these challenges and to provide recommendations on how to address them. We hope
that this will help listening effort researchers to develop a more integrative, unified
approach to using physiological measures and thereby accelerate the advancement of
our understanding of listening effort. Our discussion draws heavily on the experience
that we have gained in the context of the HEAR-ECO project (http://hear-eco.eu/), in which we employed several physiological measures to examine listening effort.[11][13][15][16][17][18]
The topics that we discuss here are (1) the selection of appropriate
physiological measures, (2) the simultaneous assessment of multiple physiological
signals, (3) the aggregation and combination of simultaneously assessed physiological
measures, and (4) the statistical analyses of multiple physiological measures.
Selection of Appropriate Physiological Measures
One of the most challenging aspects of a systematic, integrative approach that uses
multiple physiological measures to examine listening effort is to find a good rationale
for selecting the measures. For almost any common physiological measure, it is possible
to find at least one publication where the authors associate the measure with listening
effort or related constructs like effort, engagement, or resource allocation. Finding
a published rationale that justifies the use of multiple physiological measures is,
however, more difficult. Nonetheless, a unifying rationale seems to be desirable to
facilitate the integration of results from different studies. Moreover, the lack of
a unifying rationale increases the likelihood of a conflation of concept and measure,
which is illustrated by the current discussion about the multiple dimensions of listening
effort.[7][19][20][21] The lack of a unifying rationale linking the concept (listening effort) to physiological
measures makes it difficult to decide whether the discussion is about the dimensions
of listening effort or about the dimensions of the measures employed in listening
effort research.
The first step may thus be a clear and commonly accepted definition of the concept
of listening effort. Without a clear definition of the concept, we will struggle to
differentiate it from other phenomena[22] (for instance, to decide whether a listening situation is more effortful or more
arousing[11][16]), to find (psychophysiological) measures that appropriately match our concept,[22][23]
and to build a refined theory of listening effort.[24] Psychophysiological measures can be viewed as proxies for self-report measures of
subjectively perceived listening effort (a rating or other type of assessment of the
individual's perception of how effortful listening is), which in common language may
be viewed as the most meaningful definition of listening effort.[25] Whether or not this can be regarded as the “ground truth” depends on the experimental
setup and on the specific definition of listening effort. The same applies to objective
behavioral measures of listening effort such as dual-task measures or delayed recall.
There are at least two approaches to developing a clear definition of a concept, and
both have been used in the listening effort literature.[21] The first takes the empirical observation that a physiological measure responds to
variations in an independent criterion variable (for instance, a listening demand-related
variable like the signal-to-noise ratio of speech embedded in noise) as evidence that
the measure constitutes a correlate of listening effort.[26][27][28][29][30][31][32]
This implies a definition of listening effort as a state that changes in a predictable
way when the level of the criterion variable (e.g., listening demand) changes. For
example, it is usually assumed that a measure sensitive to listening effort should
indicate relatively high effort in moderately difficult listening demand conditions,
and less effort in low listening demand conditions. Using such a concept definition,
any physiological measure that has been demonstrated to respond to variations in the
criterion variable would constitute a valid outcome measure of listening effort[33] and could be included in listening effort studies that employ physiological measures.
Listening effort researchers favoring this approach should thus specify their criterion
variable(s) and then review the literature to find out which psychophysiological measures
respond to changes in it/them. These measures would then constitute the set of physiological
measures that could legitimately be used to examine listening effort.
The second approach to defining the concept of listening effort is to provide a verbal
description of it. For instance, McGarrigle and colleagues[20] defined listening effort as “the mental exertion required to attend to, and understand,
an auditory message,” Picou and colleagues[34] conceptualized it as “cognitive resources allocated for speech recognition,” and
Pichora-Fuller and colleagues[35] defined it as “the deliberate allocation of mental resources to overcome obstacles
in goal pursuit when carrying out a [listening] task.” The advantage of such a concept
definition is that it avoids the risk of circularity of the criterion-variable approach[21]: the observed empirical relationship between a physiological measure and a listening-effort
manipulation is considered to validate the measure as an indicator of listening effort
and, at the same time, hypotheses about whether the manipulation changes listening
effort are tested using the physiological measure. If the concept definition refers
to specific self-report or objective behavioral measures of listening effort, these
measures provide an efficient way to resolve the circularity problem. For instance,
a definition of listening effort as the subjective feeling of investing effort in
a listening task would point to a self-report measure of listening effort as criterion.
However, the descriptive approach often requires additional concept definitions to
allow the justification of the selection of physiological measures. For instance,
using Picou and colleagues' definition[34] to justify the use of pupil diameter in listening effort research
requires an additional operational definition of cognitive resource allocation as changes in pupil diameter. As far as we
know, none of the current theoretical accounts of listening effort offer such a justification
of specific physiological measures.
If these additional concept definitions refer to general physiological mechanisms
(instead of referring to a specific measure), they offer the justification of multiple
physiological measures that is needed for a unified approach to the use of physiological
measures in listening effort research. For instance, using the operational definition
of mental resource allocation as increased cardiac sympathetic activity in combination
with Pichora-Fuller and colleagues' general definition of listening effort[13] would imply that all physiological measures that reflect cardiac sympathetic activity
should be included in listening effort research. Two levels of concept definitions
(a general one of listening effort and an operational one linking listening effort to a
physiological mechanism) are not strictly required. One could directly use
an operational definition of listening effort that refers to physiological measures, for
instance, a definition of listening effort as cardiac sympathetic activity in listening
tasks.[36] However, including a broad, descriptive concept definition of listening effort probably
offers a better integration of the listening effort literature that has not used physiological
measures, such as those studies using only self-report or behavioral measures.
Recommendation 1
Use a clear definition of the concept of listening effort that creates an explicit
link to the employed physiological measures. Make this definition salient. Other researchers
will adopt your concept definition or present conflicting definitions, which will
foster a discussion about the listening effort concept. This will hopefully lead over
time to a commonly accepted definition of listening effort.
Simultaneous Collection of Multiple Physiological Biosignals
Once the physiological measures of interest have been selected, one needs to collect
the biosignals that are required to calculate these measures. One of the most obvious
challenges in the simultaneous collection of multiple biosignals is the parallel use
of different measurement devices and sensors, which may interfere with one another
and may result in discomfort and stress for study participants. For instance, EEG
electrodes and fNIRS optodes often need to be placed at similar locations on the participant's
head, which may be physically impossible if two separate sensor patches are necessary.
EEG and fNIRS sensors may also interfere with the appropriate placement of the electrodes
of impedance cardiograph systems (required for the determination of preejection period)
that use electrodes on the forehead or behind the ears.[37][38] Another example is the competition of eye tracking glasses and fNIRS and EEG sensors
for space on the forehead. In addition to the competition for space, devices may also
interfere with one another because of their electromagnetic properties. For instance,
the simultaneous use of EEG and fNIRS can induce noise in the EEG signal caused by
the electrical activity of the fNIRS system.[39][40] Another example is the interference due to the magnetic field of magnetic resonance
imaging (MRI) systems that can influence the electrocardiogram (ECG) signal.[41][42]
Many of these problems can be avoided by carefully selecting equipment. For instance,
there are custom-made hybrid EEG-fNIRS systems that enable the simultaneous assessment
of both signals.[43][44] Impedance cardiography and measures that require sensors mounted on the head can
be made compatible by using impedance cardiographs with an electrode configuration
that does not interfere with the other devices' sensors (e.g., systems that only require
electrodes on the thorax and neck[45]). Eye tracking is compatible with head-mounted sensors if a screen-based (remote)
eye tracker is used. The problem of MRI artifacts in the ECG signal can be mitigated
by using carbon fiber electrodes and leads as well as by employing statistical methods
to control for the induced artifacts.[41][42] However, careful planning, customization, and expertise are required for all involved
biosignals.
In field research, the simultaneous measurement of multiple biosignals
is highly limited by the need for equipment to be sufficiently unobtrusive and practical
so as not to interfere with daily life, while also remaining reliable, valid, and
sensitive. This is especially challenging when experiments take place
over many days and thus require the participants to manage the fitting and charging
of equipment at home.[46] However, the rapid development of commercially available mobile sensors might solve
some of the issues once these systems have proven to be sufficiently reliable, valid,
and sensitive.[47][48]
The second major problem related to the simultaneous assessment of multiple biosignals
is the synchronization of the data. The most frequently used approach is to label
the data during the data collection process with event markers and to use these recorded
markers to align the different signals offline. However, given that the signals are
digitized by separate devices with their own independent clocks, there will be some
delays and misalignment between the signals.[39][40] Moreover, if the signals were originally collected at different sampling rates,
down-sampling of the raw signals to one and the same sampling rate may considerably
distort the temporal aspects of the signals and introduce misalignment of the signals.
A more sophisticated approach to data synchronization is to have one device that controls
the sampling of all other devices. There are commercial solutions available, but the
device (or software) would probably need to be customized to suit the needs of the
specific, individual setup. Moreover, many standalone measurement devices are closed
systems that do not allow a second device to control their data sampling process.
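The marker-based offline alignment described above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: it assumes that two devices logged the same event markers on their own clocks and that the clocks differ only by a constant offset plus a linear drift. The function names (`fit_clock_map`, `to_reference_time`) and the example timestamps are hypothetical.

```python
def fit_clock_map(device_times, reference_times):
    """Least-squares fit of reference = slope * device + offset.

    Both lists hold timestamps (in seconds) of the SAME event markers,
    as logged by the two devices' own clocks.
    """
    n = len(device_times)
    if n < 2 or n != len(reference_times):
        raise ValueError("need at least two matched markers")
    mean_d = sum(device_times) / n
    mean_r = sum(reference_times) / n
    sxx = sum((d - mean_d) ** 2 for d in device_times)
    sxy = sum((d - mean_d) * (r - mean_r)
              for d, r in zip(device_times, reference_times))
    slope = sxy / sxx                 # relative clock rate (drift)
    offset = mean_r - slope * mean_d  # constant shift between the clocks
    return slope, offset


def to_reference_time(sample_times, slope, offset):
    """Map timestamps from the device clock onto the reference clock."""
    return [slope * t + offset for t in sample_times]


# Hypothetical markers: the second device lags the reference clock by 0.5 s
# and runs 0.01% fast (assumed values, for illustration only).
markers_ref = [10.0, 70.0, 130.0, 190.0]
markers_dev = [(t - 0.5) * 1.0001 for t in markers_ref]

slope, offset = fit_clock_map(markers_dev, markers_ref)
aligned = to_reference_time(markers_dev, slope, offset)
residuals = [abs(a - r) for a, r in zip(aligned, markers_ref)]
```

With real recordings the residuals will not vanish. Their size gives a rough indication of how much misalignment remains after the correction, which matters most for measures with fast time characteristics.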
A researcher aiming to assess multiple physiological measures to examine listening
effort thus needs to find a solution for the physical and electromagnetic interference
of the employed devices and sensors as well as solve the problem of data synchronization.
It may not always be possible to find an ideal solution, but awareness of these potential
obstacles will allow for study designs to be optimized.
Recommendation 2
Determine how and whether the selected biosignals will interfere with one another
and acquire appropriate specialized equipment accordingly to mitigate any problems
caused by the physical and electromagnetic interference of the measurement devices
and to attenuate the data synchronization issue. Consider these issues already at
the planning stage of projects to ensure that the required financial, logistical,
and knowledge-related resources (e.g., for the purchase of integrated measurement
systems or for the recruitment of individuals with the expertise to provide custom-made
solutions) are available.
Aggregation and Combination of Physiological Measures
Once one has managed to simultaneously sample the required biosignals and to synchronize
them, the derived physiological measures must be aggregated and compared. One of the
main challenges to this is caused by differences in the time characteristics of the
physiological measures used in listening effort research. Continuous measures have
a meaningful value at any given point in time and their time resolution is only limited
by the quality of the measurement device. For instance, pupil diameter[10] or skin conductance level[29] has one particular value at any given moment, and all such values provide meaningful
information. In contrast, noncontinuous measures either do not exist at some points
in time or they cannot be related to one specific point in time in a meaningful manner.
For instance, peak pupil diameter refers to a specific point in time when the pupil
diameter attains its maximum value in a certain time interval.[49] At all other points in time, peak pupil diameter does not exist. The same applies
to specific components of EEG event-related potentials, like the P400 amplitude,[50] and to systolic blood pressure,[12] the maximum blood pressure between two consecutive heart beats.
In addition to noncontinuous measures that exist only at specific points in time,
there are noncontinuous measures that refer to specific time periods and can therefore
not be associated with a specific point in time. For instance, preejection period[12] refers to the time interval between the onset of the electrical excitation of the
left heart ventricle and the opening of the aortic valve. Consequently, it does not
exist during other periods of the cardiac cycle[51] and is not associated with one single, specific point in time. Another example is
heart period,[52] which refers to the time interval between two consecutive heart beats. There are
also listening effort measures that are noncontinuous because of how they are calculated.
For instance, the determination of EEG alpha[53] and theta power[54] requires the use of epochs to extract the frequency components of interest (e.g.,
an epoch of 1,250 ms would be required for the quantification of theta power[55]). Another example is heart rate variability,[29] which can also be determined only by quantifying variability over a certain time
period (e.g., 1-minute intervals if a fast Fourier transform is used to quantify high-frequency
heart rate variability[56]). [Fig. 1] provides an illustration of the variability in the time characteristics of the discussed
measures.
Figure 1. Time characteristics of selected physiological measures. Dark gray lines indicate
continuous measures; dark gray dots indicate noncontinuous measures. Dark gray dots
with surrounding light gray boxes indicate noncontinuous measures that refer to time
periods and not to specific points in time. The light gray boxes indicate the measurement
epochs required to obtain the measure.
Associated with the various time scales is the difference in the baseline interval or
the nature of the baseline between measures. For example, pupillometry studies
often apply a trial-based baseline correction that is based on the mean pupil size
in a relatively short period (e.g., 1,000–200 ms) prior to stimulus onset.[57] In some studies, this baseline is corrected for the individual dynamic range in
the pupil size.[58] On the other hand, the reactivity of cardiovascular measures like preejection period
or heart rate variability is often compared to a baseline measured before the onset
of the task of interest (during rest).[12]
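A trial-based baseline correction of the kind used in pupillometry can be sketched as follows. The sketch assumes a subtractive correction (a divisive correction is also possible) and a 1,000-ms pre-stimulus window; the function name `baseline_correct`, the sampling rate, and the sample values are hypothetical.

```python
from statistics import mean


def baseline_correct(trace, sampling_rate_hz, onset_index, baseline_ms=1000):
    """Subtract the mean of the pre-stimulus window from a single trial.

    trace:        pupil diameter samples for one trial
    onset_index:  index of the first sample at stimulus onset
    baseline_ms:  length of the pre-onset window (assumed value)
    """
    n_baseline = int(round(baseline_ms / 1000 * sampling_rate_hz))
    start = onset_index - n_baseline
    if start < 0:
        raise ValueError("trial does not contain a full baseline window")
    baseline = mean(trace[start:onset_index])
    return [sample - baseline for sample in trace]


# Hypothetical trial sampled at 4 Hz: 1 s of baseline, then a dilation.
trial = [4.0, 4.1, 3.9, 4.0, 4.2, 4.5, 4.6, 4.4]
corrected = baseline_correct(trial, sampling_rate_hz=4, onset_index=4)
peak_dilation = max(corrected[4:])  # peak relative to the trial baseline
```

A resting baseline for cardiovascular measures would instead be computed once, over a rest period of several minutes before the task, and subtracted from the task-period values.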
It should be evident then that aggregating noncontinuous and continuous measures is
complex. While it is technically possible to treat the noncontinuous measure as a
continuous one by resampling to obtain one data point of the noncontinuous measure
for each data point of the continuous measure,[44][59] this introduces a bias because data points are created where the measure does not
exist or cannot be meaningfully assigned to a single point in time. The solution
that probably introduces the least artificial information is to use averages across
large time periods (e.g., over a block of stimulus response trials) for both continuous
and noncontinuous measures. One could still argue that the continuous measure is more
reliable because it depends on more measurement points and its values are not artificially
introduced. However, averaging across longer time periods comes with a cost: a potential
loss of sensitivity to shorter, phasic changes, so that the averages reflect only tonic changes in
the measures. Given the popularity of paradigms in listening effort research that
rely on the analysis of short stimulus-evoked phasic changes (e.g., changes in pupil
response evoked by auditory stimuli[49][60]), this constitutes a serious shortcoming.
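Block-wise averaging of a continuous and a noncontinuous measure can be sketched as follows. This is an illustration under simplifying assumptions: each observation carries a timestamp, each noncontinuous value is time-stamped at the heart beat it belongs to, and the block is shortened to 5 seconds to keep the example small. The function name `block_mean` and all data values are hypothetical.

```python
from statistics import mean


def block_mean(timestamps, values, block_start, block_end):
    """Average all values whose timestamp lies in [block_start, block_end)."""
    inside = [v for t, v in zip(timestamps, values)
              if block_start <= t < block_end]
    if not inside:
        raise ValueError("no observations in this block")
    return mean(inside)


# Continuous measure: pupil diameter (mm) sampled at 10 Hz for 5 s.
pupil_t = [i / 10 for i in range(50)]
pupil_mm = [4.0 + 0.01 * i for i in range(50)]

# Noncontinuous measure: preejection period (ms), one value per heart beat,
# so the observations are irregularly spaced in time.
pep_t = [0.4, 1.3, 2.1, 3.0, 3.8, 4.7]
pep_ms = [102.0, 100.0, 101.0, 99.0, 98.0, 100.0]

# One value per measure and block; no data points are interpolated.
block_summary = {
    "pupil_mm": block_mean(pupil_t, pupil_mm, 0, 5),
    "pep_ms": block_mean(pep_t, pep_ms, 0, 5),
}
```

Note that the two block means rest on very different numbers of observations (50 versus 6 here), which reflects the reliability argument made above.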
In addition to the obstacles to the integration and comparison of multiple physiological
measures created by the time characteristics of the measures themselves, differences
in the time characteristics of the underlying physiological mechanisms must also be
considered. Many of the physiological measures used in listening effort research are driven
by physiological mechanisms that operate on different time scales. For instance, it
can take up to 20 seconds from the onset of nervous system activity to the maximum
response of heart rate and blood pressure, and it can also take more than 10 seconds
from the end of nervous system activity to the return of heart rate and blood pressure
to their baseline values.[61][62] Pupil responses seem to be driven by faster physiological mechanisms given that
they appear sooner (a few seconds after stimulus onset) and also disappear within
seconds.[63] EEG evoked potentials rely on even faster mechanisms and can be observed after
a few milliseconds.[64]
Given the differences in the time characteristics of the underlying physiological
mechanisms, the paradigms used to optimize the assessment of the physiological measures
of listening effort vary considerably. For instance, paradigms using cardiovascular
measures normally present a single stimulus condition over a period of several minutes,[12][13][29]
whereas paradigms using pupil-related measures tend to present different stimulus
conditions in intervals of a few seconds.[30][65][66] Using multiple physiological measures that are driven by different physiological
mechanisms consequently requires researchers to develop paradigms that are appropriate
for the various time scales of their measures.
Recommendation 3
Take the individual time characteristics of the physiological measures and underlying
physiological mechanisms into account when planning a study with multiple physiological
measures. Develop paradigms that are appropriate for all involved measures.
Statistical Analysis of Multiple Physiological Measures
The final challenge to using multiple physiological measures in listening effort research
is the selection of an appropriate statistical approach. The main concern here is
the prevention of type-I error inflation due to the number of assessed physiological
measures. One approach that is frequently adopted in listening effort research is
to use an independent statistical test for each assessed measure. Unfortunately, this
quickly inflates the type-I error rate across the set of tests. It is thus necessary to employ a type-I error control
procedure. However, the big challenge is to find one that has a minimal impact on
statistical power.
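The size of this inflation can be illustrated with a short calculation. The sketch assumes independent tests, which gives an upper bound for measures that are positively correlated (as physiological measures often are). Seven measures and an alpha level of .05 are used purely as an example, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni procedure."""
    return alpha / n_tests


n_measures = 7  # example: seven physiological measures, each tested separately
uncorrected = familywise_error_rate(0.05, n_measures)
per_test = bonferroni_alpha(0.05, n_measures)
corrected = familywise_error_rate(per_test, n_measures)
```

Under these assumptions, the uncorrected familywise error rate is roughly .30, whereas the Bonferroni-adjusted per-test level of about .007 keeps it just below .05. The price is that each individual test must clear a much stricter threshold, which is the loss of statistical power mentioned above.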
One option is to analyze all physiological measures in a two-step procedure where
a first multivariate analysis of variance (MANOVA) is used as gatekeeper for follow-up
univariate tests.[67] For instance, Plain and colleagues[11] analyzed seven different physiological listening effort measures by first conducting
a MANOVA that included all measures and then using univariate tests for those measures
that were significant. If such a two-stage procedure is used with appropriately adapted
critical F- and t-values for the follow-up tests, it can successfully control the maximum type-I error
rate. However, in designs with more than two groups, simple single-stage multiple-comparison
procedures (like the Bonferroni procedure) perform as well as the more complex MANOVA-protected
procedure and may thus be preferred.[67] Avoiding multivariate procedures also mitigates the problem of multicollinearity
between the dependent variables, which can influence the interpretability of the results.[68] Multicollinearity (the correlation between the outcome variables in this case) is
common in psychophysiological research given that the measures are often driven by
the same or associated physiological mechanisms.[69] For instance, both pupil changes and heart rate changes are driven by sympathetic
and parasympathetic nervous system activity and will correlate highly with one another
if the autonomic outflow to the pupil and the heart does not differ.
An alternative approach is to aggregate the measures into a single index.[70][71] For instance, preejection period and pupil diameter in the dark (when the parasympathetic
contribution is minimal[72]) could be combined into a single index of sympathetic activity. A single aggregated
index could be analyzed with a single statistical test and would thus prevent the
problem of type-I error inflation discussed in the preceding paragraphs. Moreover,
it would have higher statistical power because no type-I error inflation control would
be needed and specific planned contrasts could be conducted.[73][74][75] Aggregating measures requires a decision on whether to standardize the individual
measures before the aggregation. Standardizing the measures controls for the impact
of the variability and magnitude of the responses of the individual measures. At first
sight, this might seem to be a good idea because one would like each measure to have
the same influence on the aggregated index. However, the standardization (for instance,
a z-standardization[76]) is often performed using the collected data, which introduces a bias. Consider
combining a z-standardized physiological measure where participants originally showed
almost no response variability, such as heart rate changes with a mean of 2 beats
per minute (bpm) and a standard deviation of 1 bpm, with a z-standardized measure where
participants showed strong response differences, such as systolic blood pressure
responses with a mean of 20 millimeters of mercury (mm Hg) and a standard deviation
of 10 mm Hg. This leads to a huge bias because it treats a blood pressure change of 30 mm Hg
as being equivalent to a heart rate change of 3 bpm. A blood pressure change of 30 mm
Hg constitutes a much stronger physiological response than a heart rate response of
3 bpm, but this is neglected by the resulting index. This problem can be prevented
by standardizing the individual physiological measures using their physiologically
possible range as criterion (instead of their sample mean and variability). For fNIRS
research, this approach has been taken recently by Zhang and colleagues who used a
breath-holding task to scale the fNIRS response differences between conditions by
the physiologically plausible range of the fNIRS response before performing the statistical
analysis.[77] Unfortunately, information about the absolute minimum and maximum response of many
of the physiological measures employed in listening effort research is often not available.
For instance, the physiological maximum of a skin
conductance response is not known.
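The contrast between sample-based and range-based standardization can be illustrated with the heart rate and blood pressure example from above. The sample means and standard deviations are those of the example; the range limits are placeholders chosen only to make the sketch runnable and are not empirical values. The function names are hypothetical.

```python
def z_score(value, sample_mean, sample_sd):
    """Standardize a response using the sample's own mean and SD."""
    return (value - sample_mean) / sample_sd


def range_scaled(value, min_possible, max_possible):
    """Express a response as a fraction of its physiologically possible range."""
    return (value - min_possible) / (max_possible - min_possible)


# Sample statistics taken from the example in the text.
z_hr = z_score(3.0, sample_mean=2.0, sample_sd=1.0)      # heart rate, bpm
z_sbp = z_score(30.0, sample_mean=20.0, sample_sd=10.0)  # systolic BP, mm Hg

# Placeholder ranges, NOT empirical values: substitute measured limits.
HR_RANGE = (0.0, 100.0)   # assumed possible heart rate change, bpm
SBP_RANGE = (0.0, 100.0)  # assumed possible blood pressure change, mm Hg
r_hr = range_scaled(3.0, *HR_RANGE)
r_sbp = range_scaled(30.0, *SBP_RANGE)
```

Both responses receive the same z-score, so a z-based index weights them equally. The range-scaled values differ, and how strongly they differ depends entirely on the limits that are plugged in, which is why the missing information about physiological limits is a practical obstacle.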
Recommendation 4
Plan your statistical analysis to account for the problems of assessing statistical
significance (p-values) when running multiple tests (i.e., increased type-I error when uncorrected
or reduced statistical power when corrected for multiple testing) and of analyzing
measures that are potentially highly correlated. If possible, use an aggregate index
that represents the physiological mechanism that you are interested in.
Summary
Moving from using single physiological measures in listening effort research to combining
multiple measures that are justified by a single, unifying rationale would help the
field to overcome the fragmented approach that currently exists. The explicit presentation
of researchers' concept definition of listening effort and its use to justify the
employed physiological measures would promote a discussion about the core concept
and hopefully lead to a commonly accepted definition of listening effort. Combining
multiple measures does, however, require awareness of the problems that are caused
by the simultaneous use of multiple measurement devices as well as sound knowledge
about the time characteristics of the measures and the underlying physiological mechanisms.
Moreover, awareness of the statistical issues associated with analyzing multiple measures
is also required. The solutions to many of the challenges that we have outlined are
still in their infancy or are yet to be developed. However, we are convinced that
we should not leave it to future generations of researchers to integrate the fragmented
field that we have created. Addressing these issues now is the only way forward to
a more integrated approach to the use of physiological measures in listening effort
research and to a comprehensive understanding of listening effort.