Keywords
central tendency - hypothesis testing - parametric - statistics
Introduction
Adopting a systematic approach to statistical analysis is essential for ensuring the
accurate interpretation of data and drawing valid conclusions from research studies.
In the field of radiology, statistics play a crucial role in enhancing diagnostic
precision, improving patient outcomes, and driving advancements in research. This
primer offers a thorough and condensed overview of key statistical concepts that are
pertinent to both radiologists and clinicians. The first part is dedicated to discussing
types of data, data distribution, descriptive and inferential statistics, hypothesis
testing, and sampling. The second part delves into advanced statistical concepts such
as correlation and causality, regression analysis, survival curves, and the analysis
of diagnostic tests, encompassing contingency tables and receiver operating characteristic
(ROC) curves. This primer not only serves as a foundational resource for grasping
basic statistical concepts but also aids in the interpretation of various methodologies
relevant to daily research endeavors.
Radiology has been at the forefront of technological innovations and various advancements,
focusing not only on disease diagnosis but also on therapeutic interventions. The
conduct of research assessing the utility of imaging techniques and their applications
is crucial for shaping clinical recommendations and establishing practice guidelines,
both now and in the future.[1] Understanding fundamental statistical principles will enable radiologists as well
as clinicians to critically assess existing literature and make well-informed clinical
decisions, which are the foundations of evidence-based medicine.[2] Similarly, the proper application and interpretation of statistical methods are
crucial for carrying out scientifically rigorous studies. Nonetheless, training in
research methodology, particularly in statistics, is generally limited throughout
postgraduate medical training.[3] Our objective is to provide an overview of the most frequently used data analysis
methods found in radiology literature.
Types of Data
Statistical data can be broadly classified into two types: quantitative and qualitative.
Understanding the type of data is crucial for selecting the appropriate statistical
method for analysis.[4] Quantitative data refers to numerical information that can be measured and counted.
It can be further subdivided into two types ([Fig. 1])[5]
[6]:
-
Continuous data can take any value within a specified range, allowing for the calculation of statistical
measures such as means and variances. For instance, in a study measuring the size
of tumors in breast cancer patients before and after treatment, the tumor sizes are
considered continuous data because they can assume any value within the range of possible
measurements, such as 1.2, 2.5, 3.7 cm, and so on.
-
On the other hand, discrete data consist of distinct and separate values, often arising from counting processes. For
example, the number of renal cysts present on ultrasound images of different patients
represents discrete data. If one patient has three cysts and another has five, these
values are discrete data.
Fig. 1 Flowchart demonstrating the classification of types of data.
Qualitative data describe characteristics or categories that cannot be quantified.
They are also known as categorical data and can be subdivided into two types[5]
[6]:
-
Nominal data: These represent categories that do not have an inherent order. This type of data
is often used to classify observations into distinct groups. For example, in a study
evaluating the choice of different imaging modalities for a particular suspected pathology
among various radiologists, the modalities (magnetic resonance imaging [MRI], computed
tomography [CT], ultrasound) are nominal data.
-
Ordinal data: This type of data represents categories with a meaningful order but no consistent
difference among them. It is useful for ranking observations but does not provide
information about the relative distance between ranks. For example, when evaluating
patient satisfaction with imaging services, responses might be categorized as “poor,”
“fair,” “good,” or “excellent.” These categories have a natural order, but the intervals
between them are not necessarily equal.
Consider a study that examines the efficiency of different radiology workflows. The
study can collect both quantitative and qualitative data. Quantitative data can be measured as the time taken (in minutes) to complete a set of imaging examinations,
while qualitative data can be recorded as the type of workflow (manual vs. automated). Statistical tests
for quantitative data are generally more powerful than those for qualitative data. By analyzing both
types of data, the researcher can determine not only which workflow is faster but
also how the type of workflow affects overall efficiency as well as user satisfaction.
When gathering data for research, it is advisable to collect the data as continuous
variables rather than nominal variables when there is flexibility in organizing the
data. For instance, when recording the hypertensive status of multiple patients, it
is more advantageous to gather individual blood pressure measurements rather than
categorizing patients as hypertensive or nonhypertensive. This approach offers benefits
such as greater statistical power, reduced information loss, and increased flexibility
in data transformation.
Distribution of Data
Understanding the distribution of data is essential for selecting appropriate statistical
methods. Distribution describes how the data values are spread across and thereby
provides insight into underlying patterns as well as trends within the dataset.[7]
Normal distribution (also known as Gaussian distribution) links the frequency
distribution of observed data to a probability distribution, indicating how closely
the distribution of the observed sample resembles the ideal distribution of the underlying population.
It is a symmetrical, bell-shaped curve where most of the data points cluster around
the mean. Many biological measurements, like blood pressure or body temperature, follow
a normal distribution. Mean in such data occupies the central position within the
distribution. Standard deviation (SD) indicates how data are dispersed around the
mean: the larger the SD, the wider and flatter the curve. One SD on either side of the
mean covers approximately 68% of the observations, two SDs cover about 95%, and three
SDs cover 99.7%. The properties of the normal distribution allow for the
application of various statistical techniques, including parametric tests.[7]
[8]
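These coverage properties can be verified numerically. The following is a minimal Python sketch (using SciPy); the mean and SD are hypothetical illustrative values, not study data:

```python
# A minimal sketch of the 68-95-99.7 rule for a normal distribution.
# The mean and SD below are hypothetical (e.g., liver attenuation in HU).
from scipy.stats import norm

mean, sd = 55, 5  # assumed population mean and SD
for k in (1, 2, 3):
    # Probability mass within k SDs of the mean
    coverage = norm.cdf(mean + k * sd, mean, sd) - norm.cdf(mean - k * sd, mean, sd)
    print(f"Within {k} SD of the mean: {coverage:.1%}")  # ~68.3%, ~95.4%, ~99.7%
```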
Skewness is a measure of asymmetry and deviation from a normal distribution. Data
can be skewed if they are not symmetrically distributed. Skewness can be positive
(right skewed) or negative (left skewed; [Fig. 2]).[9]
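In practice, skewness can be quantified directly. Here is a minimal Python sketch using SciPy's skew function on hypothetical hospital-stay durations:

```python
# A minimal sketch computing skewness; the stay durations (days) are
# hypothetical illustrative values, not study data.
from scipy.stats import skew

stays = [2, 2, 3, 3, 3, 4, 4, 5, 6, 14, 21]  # most short, a few very long
print(skew(stays))  # positive value -> right (positively) skewed
```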
Fig. 2 Bar charts demonstrating types of data distribution. Normal distribution of data
is represented by the typical symmetrical bell-shaped curve, e.g., in a typical healthy
population, liver attenuation values (in HU) usually center around a mean of 50 to
60 HU, with most people falling close to this value. There are few individuals with
extremely high or low attenuation values, leading to the characteristic bell-shaped,
symmetrical curve of a normal distribution. A positively skewed distribution has
its peak shifted toward the left, with a long tail extending to the right (the positive side), e.g., in a dataset measuring
duration of hospital stays for patients undergoing different interventional radiology
procedures, a right-skewed distribution might indicate that while most patients are
discharged within a few days, a smaller number of patients have significantly longer
stays due to complications. A negatively skewed distribution has its peak shifted
toward the right, with a long tail extending to the left (the negative side), e.g., if age at diagnosis for a particular disease shows
a left-skewed distribution, it might indicate that most diagnoses occur later in life,
with a few cases occurring at younger ages. A bimodal distribution shows two peaks,
for example, the distribution of heights in a mixed-gender sample.
Right-skewed distribution: Most data points are concentrated on the left with a long tail to the right. For
example, in a dataset measuring the duration of hospital stays for patients undergoing
different interventional radiology procedures, a right-skewed distribution might indicate
that while most patients are discharged within a few days, a smaller number of patients
have significantly longer stays due to complications.
Left-skewed distribution: Most data points are concentrated on the right with a long tail to the left, such
as in the case of age at diagnosis for a particular disease. For example, if age at
diagnosis for a particular disease shows a left-skewed distribution, it might indicate
that most diagnoses occur later in life, with a few cases occurring at younger ages.
A bimodal distribution has two peaks. This can occur when data are collected from
two different populations. For example, the distribution of heights in a mixed-gender
sample.
Presentation of Data
Data can be presented in three ways: as text, in tabular form, or in graphical form
([Fig. 3])[4]
[10]:
-
Text: This is the main method of conveying information to explain results and trends,
as well as to provide contextual information.
-
Table: It helps in the representation of larger amounts of data in an engaging, easy-to-read
and coordinated manner. The data are arranged in rows and columns.
-
Graphical form: It is a powerful tool to communicate research results and to gain information from
data. It may be in the form of a bar chart, pie chart, line diagram, scatter plot,
or histogram.
Fig. 3 Examples of different forms of data presentation. (A) Bar chart, which is used to compare the frequency or values of different categories,
for example, comparing the number of patients with different types of brain tumors
[gliomas, meningiomas, metastases] diagnosed over a year. (B) Pie chart, which is used to show proportions or percentages of a whole, for example,
showing the percentage distribution of different imaging modalities (magnetic resonance
imaging [MRI], computed tomography [CT], ultrasound, X-ray) used in a hospital's radiology
department. (C) Line diagram, which is used to track changes or trends over time, for example, tracking
the trend of average radiation dose per CT scan in a radiology department over time
(across months or years). (D) Scatter plot, which is used to explore relationships or correlations between two
continuous variables, for example, plotting the relationship between tumor size (in
cm) and patient survival time (in months) after diagnosis of a malignant tumor. (E) Histogram, which is used to display the distribution of a continuous variable by
grouping data into bins, for example, displaying the distribution of radiodensity
values (in Hounsfield units) for liver tissue on CT in a group of patients to assess
for fatty liver disease. (F) Box and whisker plot, which is used to show the spread, central tendency, and outliers
in a dataset, for example, comparing the distribution of radiologists' interpretation
times (in minutes) for reading brain MRI across different experience levels (junior,
senior, expert).
Descriptive and Inferential Statistics
Once you have gathered data and organized it according to its type and distribution,
the next step is to analyze the data. One important aspect of statistics involves
making assertions about a population. Since it is often impractical to obtain data
from an entire population, a sample is typically taken instead. Descriptive statistics
are then used to characterize this sample, including measures such as the mean value
and the degree of dispersion. However, characterizing the sample alone does not provide
insight into the population as a whole; this is the domain of inferential statistics.
In this case, a sample is drawn from the population with the aim of drawing broader
conclusions about the population based on this sample. Thus, inferential statistics
seek to deduce the unknown parameters of the population from the known parameters
of a sample, going beyond the immediate data in a way that descriptive statistics do not. To accomplish
this, inferential statistics utilize hypothesis tests such as the t-test or analysis of variance (ANOVA). Both are crucial for analyzing data and drawing
meaningful conclusions from them ([Fig. 4]).[11]
Fig. 4 Pictorial representation of descriptive versus inferential statistics. Sampling is
the process of selecting a subset of individuals or data points from a population
to make inferences about the entire population. Inferential statistics are used to
make predictions or generalizations about a population based on sample data, often
involving hypothesis testing and confidence intervals. Descriptive statistics are
used to summarize and describe the main features of a dataset, such as measures of
central tendency and variability.
Descriptive Statistics
Descriptive statistics summarize and describe features of a particular dataset using
statistical characteristics, graphics, charts, or tables. They provide simple summaries
about the sample and its measures, thereby offering critical insights into central
tendency, dispersion, and shape of data distribution. It is important to understand
that in descriptive statistics only properties of the sample are evaluated, and we
do not draw conclusion about other points in time or the population. Descriptive statistics
are further broadly divided into two subtypes: location parameters (i.e., measures
of central tendency) and dispersion parameters (i.e., measures of variability). Parameter
basically represents a measurable characteristic of the population.
Measures of Central Tendency
Measures of central tendency basically describe where the center of a sample is or
where most of the sample is.[12]
[13]
[14]
Mean: it represents the average of all data points, which is calculated by summing
all the values and dividing by the number of observations. The mean can be calculated
only for metric variables and is sensitive to outliers. For example, if a radiologist
measures the mean size of the liver in a sample of five patients with glycogen storage
disorders as 15, 16, 17, 18, and 19 cm, the mean liver size is (15 + 16 + 17 + 18 + 19)/5 = 17 cm.
Median: when data points are ordered from smallest to largest, the middle value is
termed as median. The variables must have an ordinal or metric scale level for calculating
median. The median is less affected by outliers and skewed data. For the aforementioned
example of liver size in a sample of five patients with glycogen storage disorders,
the median is 17. For an even number of observations, the median is the average of
the two middle values.
Mode: the most frequently occurring value in the dataset is defined as mode. There
can be more than one mode if multiple values have the same frequency. It can be used
for metric, nominal, or ordinal variables. For example, if the liver sizes are 15,
16, 17, 17, and 18 cm, the mode is 17 cm because it appears most frequently. The advantages
and disadvantages of measures of central tendency are given in [Table 1].
Table 1
Table demonstrating the advantages and disadvantages of measures of central tendency

Mean
- Advantages: takes all data points into account, providing a comprehensive summary; most commonly used and understood.
- Disadvantages: sensitive to outliers, which can skew the result; not suitable for skewed distributions.

Median
- Advantages: not affected by outliers or skewed data; represents the 50th percentile, providing a central location.
- Disadvantages: does not consider all data points, only the middle value; less informative in symmetric distributions with no outliers.

Mode
- Advantages: useful for categorical data where we wish to know the most common category; not affected by outliers.
- Disadvantages: may not be unique or may not exist in a continuous dataset; less informative when the distribution is fairly uniform.
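These measures are directly available in Python's built-in statistics module; the sketch below reproduces the liver-size examples from the text:

```python
# A minimal sketch reproducing the mean, median, and mode examples above.
import statistics

sizes = [15, 16, 17, 18, 19]        # cm, from the mean/median example
print(statistics.mean(sizes))       # 17
print(statistics.median(sizes))     # 17

sizes_mode = [15, 16, 17, 17, 18]   # cm, from the mode example
print(statistics.mode(sizes_mode))  # 17 (most frequent value)
```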
Measures of Variability
Measures of variability describe how much values of variables in a sample differ from
each other. In other words, they describe how much the values of the variable deviate
from the mean value ([Fig. 5]).[15]
[16]
[17]
[18]
Fig. 5 Graphical representation of measures of central tendency and measures of dispersion.
Measures of central tendency are statistical metrics (mean, median, mode) that represent
the central point or typical value in a dataset, for example, if a radiologist measures
the mean size of the liver in a sample of five patients with glycogen storage disorders
as 12, 15, 15, 16, and 14 cm, the mean liver size is (12 + 15 + 15 + 16 + 14)/5 = 14.4,
the median is 15, and the mode is 15. Measures of dispersion, on the other hand, are metrics
(range, variance, standard deviation) that quantify the spread or variability of data
around the central tendency, for example, in the previous example of mean liver size
measurement, if the values are 10, 13, 14, 16, and 19 cm (mean 14.4), the range will be 9,
the sample variance will be 11.3, and the standard deviation will be 3.36.
Range: it is the difference between the highest and lowest values in the dataset.
It gives a sense of the spread but is affected by outliers. Let us consider the previous
example of a radiologist measuring the mean size of the liver in a sample of five
patients with glycogen storage disorders as 15, 16, 17, 18, and 19 cm. Range is 19–15 = 4.
Variance: the average of the squared differences from the mean. Variance provides
a measure of how much the values in the dataset deviate from the mean.
For a population, the formula is
σ² = ∑(xᵢ − μ)²/N,
where N is the size of the population, xᵢ are the values in the population, and μ is the population mean.
For a sample, the formula is
s² = ∑(xᵢ − x̄)²/(n − 1),
where n is the size of the sample, xᵢ are the values in the sample, and x̄ is the sample mean.
For the example mentioned above (liver sizes of 15, 16, 17, 18, and 19 cm), the variance
is calculated as the following:
-
Calculate the mean: x̄ = (15 + 16 + 17 + 18 + 19)/5 = 17.
-
Calculate the squared differences from the mean, (xᵢ − x̄)²:
-
(15 − 17)² = (−2)² = 4.
-
(16 − 17)² = (−1)² = 1.
-
(17 − 17)² = 0² = 0.
-
(18 − 17)² = 1² = 1.
-
(19 − 17)² = 2² = 4.
-
Sum the squared differences: ∑(xᵢ − x̄)² = 4 + 1 + 0 + 1 + 4 = 10.
-
Calculate the variance: s² = 10/(5 − 1) = 10/4 = 2.5.
SD: it is the square root of variance and indicates the average distance of data points
from the mean. Thus, SD is the root-mean-square deviation of all measured values
from the mean. It is expressed in the same units as the data.
For a population, the formula is
σ = √(∑(xᵢ − μ)²/N),
where N is the size of the population, xᵢ are the values in the population, and μ is the population mean.
For a sample, the formula is
s = √(∑(xᵢ − x̄)²/(n − 1)),
where n is the size of the sample, xᵢ are the values in the sample, and x̄ is the sample mean.
For the example mentioned above (liver sizes of 15, 16, 17, 18, and 19 cm), the SD is
s = √2.5 ≈ 1.58 cm.
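The worked example can be checked with the statistics module; note that variance and stdev use the sample formulas (dividing by n − 1), matching the calculation above:

```python
# A minimal sketch reproducing the worked variance and SD example.
import statistics

sizes = [15, 16, 17, 18, 19]               # liver sizes in cm
print(statistics.variance(sizes))           # 2.5 (sample variance)
print(round(statistics.stdev(sizes), 2))    # 1.58 (sample SD, sqrt of 2.5)
```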
Quartiles: they divide the data into four parts that are as equal as possible. For this, the data
must be arranged from the smallest to the largest.
-
First quartile (Q1): middle value between the smallest value and the median.
-
Second quartile (Q2): median of the data, that is, 50% of the values are smaller and 50%
of the values are larger.
-
Third quartile (Q3): middle value between the median and the largest value.
Interquartile range (IQR): to find the range in which the middle 50% of all values lie,
one can use the dispersion parameter known as the interquartile range, IQR = Q3 − Q1.
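Quartiles and the IQR can be computed with NumPy on the running liver-size example:

```python
# A minimal sketch computing quartiles and the interquartile range.
import numpy as np

sizes = np.array([15, 16, 17, 18, 19])      # liver sizes in cm
q1, q2, q3 = np.percentile(sizes, [25, 50, 75])
print(q1, q2, q3)   # 16.0 17.0 18.0
print(q3 - q1)      # 2.0 -> interquartile range
```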
The advantages and disadvantages of measures of variability are given in [Table 2].
Table 2
Table demonstrating the advantages and disadvantages of measures of variability

Range
- Advantages: simple and easy to calculate.
- Disadvantages: highly sensitive to outliers; ignores the distribution of data points within the range.

Variance
- Advantages: takes into account all data points, providing a comprehensive measure; useful in statistical calculations and inferential statistics.
- Disadvantages: not in the same units as the original data (squared units); sensitive to outliers.

Standard deviation
- Advantages: provides a clear measure of spread in the same units as the original data; widely used and understood in statistical analysis.
- Disadvantages: sensitive to outliers; can be less intuitive to interpret compared with the range.

Interquartile range
- Advantages: not affected by outliers, as it focuses on the middle 50% of data; useful in skewed distributions.
- Disadvantages: ignores the data outside the 1st and 3rd quartiles; less informative for distributions that are not skewed or have outliers.
Inferential Statistics
Inferential statistics allow us to make predictions or inferences about a specific
population based on the sample data. This includes estimating population parameters
as well as testing hypotheses. It therein provides a way to generalize findings beyond
the observed data.[19]
Inferential statistics are broadly of four types:
-
Difference between two groups of variables.
-
Correlation between two groups of variables.
-
Predicting the outcome variable.
-
Relation of variables in time distribution.
In this section, we shall be dealing with the difference between two groups of variables.
The rest will be dealt with in part 2 of the series.
Estimation
Estimation refers to the use of sample data to estimate population parameters, such
as the mean or proportion. The accuracy of these estimates can be assessed using confidence
intervals.[20]
Confidence intervals: range of values within which the true population parameter is
expected to lie with a certain level of confidence (e.g., 95% confidence interval).
A wider interval indicates greater uncertainty about the parameter estimate. Let us
consider the example of a study measuring the average radiation dose patients receive
during a whole-body 18F-FDG positron emission tomography (PET)/CT scan, where a 95% confidence
interval might be 13 to 15 mSv. The confidence level of 95% means that if we were
to repeat this study multiple times, approximately 95% of the calculated confidence
intervals from those studies would contain the true population mean radiation dose.
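A confidence interval of this kind can be computed with SciPy; the dose values below are hypothetical stand-ins, not data from an actual study:

```python
# A minimal sketch of a 95% confidence interval for a mean dose.
# The PET/CT dose values (mSv) are hypothetical illustrative data.
import numpy as np
from scipy import stats

doses = np.array([13.2, 14.1, 15.0, 13.8, 14.6, 14.9, 13.5, 14.2])
ci = stats.t.interval(0.95, len(doses) - 1,     # t-based CI, df = n - 1
                      loc=doses.mean(),
                      scale=stats.sem(doses))   # standard error of the mean
print(ci)  # lower and upper bounds of the 95% CI for the mean dose
```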
Hypothesis Testing: Fundamentals
A hypothesis is defined as an assumption that is neither proved nor disproved. Hypothesis
testing is a research process that involves testing assumptions or claims about a population
parameter. Usually hypotheses are formulated starting from a literature review and
framing a research question based on this review. Hypothesis testing of the collected
data provides a formal framework for making decisions based on sample data. The final
target is to either reject or retain this hypothesis.[21]
[22]
Null and Alternative Hypothesis
Null hypothesis (H0): it is the default assumption that there is no statistically
significant difference between two or more groups with respect to a particular characteristic
(like no statistically significant difference between variables or no effect of an
intervention). In a study comparing two imaging techniques, the null hypothesis might
state that there is no statistically significant difference in the diagnostic accuracy
between these two techniques.
Alternative hypothesis (H1): alternate hypothesis assumes that there is a difference
between two or more groups. It represents the opposite of the null hypothesis. Alternative
hypothesis might state that there is a difference in diagnostic accuracy between the
two imaging techniques.
Difference and Correlation Hypothesis
Difference hypothesis: it tests whether there is a difference between two or more
groups. Difference hypothesis might state that there is a difference in diagnostic
accuracy between two imaging techniques.
Correlation hypothesis: it tests whether there is a correlation between two or more
variables. Correlation hypothesis might state that there is a correlation between
the size of a tumor measured by ultrasound and its volume measured by MRI.
Directional and nondirectional hypotheses: with a nondirectional hypothesis, the focus
of interest is whether there is a difference in a value between the groups under consideration.
On the other hand, a directional hypothesis focuses on whether one group has a higher
or lower value than the other.
The fundamental concept of hypothesis testing is that a hypothesis can only be
accepted or rejected with a certain probability of error. The reason for this
probability of error is that each time you take a sample, you get a different sample,
which means that the results are different every time.[23]
Type I error: it refers to rejecting the null hypothesis when it is true (false positive).
The significance level (α) represents the probability of making a type I error. Usually, a significance level
of 5% or 1% is set.
For example, if α is set at 0.05, there is a 5% chance of incorrectly rejecting the null hypothesis
when it is actually true.
p-Value: it is the probability of obtaining the observed results if the null hypothesis
is true. If the p-value is less than the significance level, the null hypothesis is to be rejected
(otherwise not). A p-value less than 0.05 is typically considered statistically significant, indicating
that the observed results are unlikely to have occurred by chance. For example, if
the p-value is 0.03 in a study comparing imaging techniques, it suggests that there is
a statistically significant difference in diagnostic accuracy.
Type II error: it is failing to reject the null hypothesis when it is false (false
negative). The probability of making a type II error is denoted by β, and power is defined as 1–β. For example, if a study has low power, there is a higher
chance of failing to detect a true difference between imaging techniques, resulting
in a type II error.
It is important to keep in mind that just because an effect is statistically significant
it does not mean that the effect is relevant. If a very large sample is taken and
it has a very small spread, even a minute difference between two groups may be significant,
but it may not be practically relevant.
Sample Size Determination
Determining the appropriate sample size is very crucial for ensuring the reliability
and validity of study results. Too small a sample size will not give valid results
or will not adequately represent the realities of the population being analyzed. On
the other hand, larger sample sizes give smaller margins of error and are more representative.
However, a sample size that is too large may significantly increase the cost and time
taken to conduct the research.[24]
[25]
[26]
[27]
[28] The factors that influence sample size include the following:
-
Population size: larger populations generally require larger samples.
-
Effect size: smaller effect sizes require larger samples to detect differences.
-
SD: the more dispersed the data, the greater the SD, and the larger the sample
needed to detect a given difference.
-
Significance level (α): lower significance levels require larger samples.
-
Power (1–β): higher power (typically 0.80) requires larger samples to reduce the risk of type
II errors.
Case Study: Sample Size in Radiological Research
A study aims to evaluate the diagnostic accuracy of a new MRI sequence in neuroimaging.
Researchers need to determine an appropriate sample size to ensure the study's findings
are statistically significant and reliable.
-
Population size: the population includes all patients eligible for brain MRI at the
hospital.
-
Effect size: based on preliminary data, the researchers estimate a moderate effect
size.
-
Significance level (α): they choose a significance level of 0.05.
-
Power (1–β): they aim for a power of 0.80, meaning they want an 80% chance of detecting a true
difference if one exists.
Using sample size calculation formulas, they determine that a sample size of 200 patients
is needed to achieve the desired power and significance level. This ensures that the
study results will be robust and reliable, providing valuable insights into the new
MRI technique's diagnostic accuracy.
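For a conventional two-sample comparison, such a calculation can be scripted with statsmodels. The sketch below uses the case study's alpha, power, and a moderate Cohen's d of 0.5 as assumed inputs; the resulting n depends on the exact design and will not necessarily match the 200 patients quoted above:

```python
# A hedged sketch of a power-based sample size calculation for a
# two-sample t-test; effect size, alpha, and power are assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # moderate Cohen's d
                                   alpha=0.05,
                                   power=0.80)
print(round(n_per_group))  # ~64 per group for this design
```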
But which formula should we use to calculate the sample size ([Fig. 6], [Table 3])?
Fig. 6 Formulae for sample size. In Eq. 1, n: required sample size for an unlimited population; z: Z-score corresponding to the desired confidence level (e.g., 1.96 for 95% confidence); p̂: estimated proportion of the population (i.e., the proportion expected to show a certain characteristic in the population); ϵ: margin of error (the maximum acceptable difference between the true population parameter and the sample estimate). In Eq. 2, n′: adjusted sample size for a finite population; n: sample size calculated for an unlimited population (from the first formula); N: size of the finite population. In Eq. 3, n: required sample size for a finite population; N: total population size; Z: Z-score corresponding to the desired confidence level (e.g., 1.96 for 95% confidence); p: estimated proportion of the population (the probability of the characteristic being studied); 1 − p: complementary proportion (the probability of not having the characteristic being studied); e: margin of error (acceptable level of precision in the results). In Eq. 4, N: required sample size; σ²: population variance (or an estimate of the variance of the outcome); Z₁₋α: Z-score corresponding to the desired level of statistical significance (e.g., 1.96 for a 95% confidence level), which accounts for type I error (false positives); Z₁₋β: Z-score corresponding to the desired statistical power, representing type II error (false negatives); typically, 1 − β is set at 0.80 or 0.90, and the corresponding Z-score is looked up (e.g., 0.842 for 80% power); d_min: minimum detectable difference or effect size, representing the smallest difference that is practically significant and that you wish to detect in your study.
Table 3
Table showing minimum sample size calculation of different statistical tests and examples with radiology literature citations

Unpaired t-test
- Variables needed: significance level (α), with Zα/2 the Z-value corresponding to the desired significance level; power (1 − β), with Z1−β the Z-value corresponding to the desired power; standard deviation (σ); effect size (difference in means, M1 − M2).
- Example in radiology: comparison of 320-detector volumetric and 64-detector helical computed tomography (CT) images of the pancreas for size measurement of various anatomical structures (Goshima et al[48]).

Paired t-test
- Variables needed: significance level (α); power (1 − β); effect size (mean difference d); standard deviation of differences (σd).
- Example in radiology: comparison of tumor size on microscopy, CT, and MRI assessments vs. pathologic gross specimen analysis of pancreatic neuroendocrine tumors (Bian et al[49]).

Chi-squared test
- Variables needed: significance level (α); proportion (p); difference in proportions (Δ).
- Example in radiology: comparison of enhancement patterns between benign and malignant solid renal lesions (Millet et al[50]).

ANOVA
- Variables needed: significance level (α); power (1 − β); effect size (η²); variance between groups (σ²).
- Example in radiology: population-stratified analysis of bone mineral density distribution in cervical and lumbar vertebrae of Chinese from quantitative computed tomography (Zhang et al[51]).

Z-scores for common confidence levels: 80%: 1.28; 85%: 1.44; 90%: 1.65; 95%: 1.96; 99%: 2.58.
Steps in using the formula for sample size calculation:
-
Determine the population size (if known).
-
Determine the confidence interval.
-
Determine the confidence level.
-
Determine the SD (basically representing the population proportion, which is assumed
to be 50% = 0.5).[29]
-
Convert the confidence level into a Z-score.
-
Put these figures into the sample size formula to get your sample size.
Necessary sample size = (Z-score)² × SD × (1 − SD)/(margin of error)².
Say you choose to work with a 95% confidence level, an SD of 0.5, and a confidence
interval (margin of error) of ±5%.
Necessary sample size = {(1.96)² × 0.5 × 0.5}/(0.05)² = (3.8416 × 0.25)/0.0025 = 384.16.
Hence, the sample size should be 385.
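These steps can also be scripted. Below is a minimal Python sketch of the proportion-based formula used above; the inputs mirror the worked example, with math.ceil performing the final rounding up:

```python
# A minimal sketch of n = Z^2 * p * (1 - p) / e^2 for a 95% confidence level.
import math

z = 1.96    # Z-score for 95% confidence
p = 0.5     # assumed population proportion
e = 0.05    # margin of error (+/- 5%)

n = (z ** 2) * p * (1 - p) / (e ** 2)
print(n, math.ceil(n))  # 384.16 -> round up to 385
```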
Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions about the population
based on sample data. It is used to assess whether a particular viewpoint is likely
to be true.[30] It involves several steps ([Fig. 7]):
-
Formulate hypotheses: define the null hypothesis (H0) and alternative hypothesis (H1).
-
Selection of study design and sample size: select ones that are appropriate to the
hypothesis being tested.
-
Select significance level (α): commonly set at 0.05.
-
Collect data: gather sample data relevant to the hypothesis.
-
Calculate test statistic: use an appropriate test (e.g., t-test, chi-squared test) to calculate the test statistic for each outcome variable
of interest.
-
Determine p-value: compare the p-value to the significance level.
-
Make a decision: reject H0 if p-value < α; otherwise, fail to reject H0.
Fig. 7 Pictorial representation of hypothesis testing process. Steps involved in the hypothesis
testing process are the following: (1) Formulation of a hypothesis (question mark
at the top center). This step involves defining a research question or hypothesis.
Typically, there are two hypotheses: (a) null hypothesis (H
0)—assumes no effect or no statistically significant difference and (b) alternative
hypothesis (H
1)—assumes there is an effect or difference. (2) Selecting a sample (right panel showing
population and sample). From the larger population, a sample is selected. (The sample
should be representative of the population to generalize the findings back to the
population.) (3) Hypothesis testing (bottom-right panel showing hypothesis testing).
Statistical tests are performed on the sample data to test the hypothesis. (The aim
is to determine whether the data provide enough evidence to reject the null hypothesis
in favor of the alternative hypothesis). (4) Significance and p-value (bottom left with p-value). The result of the hypothesis test is evaluated using the p-value. (If the p-value is less than 0.05 [commonly used significance level], it suggests that the
results are statistically significant, meaning there is sufficient evidence to reject
the null hypothesis.) (5) Conclusion (Arrow back to the top indicating significance). Based on the p-value and test results, conclusions are drawn about the hypothesis, indicating whether
the evidence supports rejecting the null hypothesis.
Hypothesis testing is just like the concept of “An accused is presumed to be innocent
until proved guilty.”
Common Hypothesis Tests in Radiology
It is broadly divided into two groups: hypothesis tests done on numerical data and
those done on categorical data. Basically, these tests are used to find the difference
between two groups of variables.
Datasets will have to be treated as paired if they are related. Thus, if we compare
the systolic blood pressure values of two independent sets of subjects, it is an example
of unpaired data. However, if a condition is included like all the individuals in
one dataset are siblings of the individuals represented in the other dataset, then
corresponding values in the two datasets may be related in some manner (due to genetic
or familial reasons) and the datasets are no longer independent.
Parametric data are numerical data that follow the parameters of a normal distribution
curve. If the distribution is skewed, does not follow any particular form, or is unknown,
the data should be treated as nonparametric. But practically, how do we determine
whether numeric data are normally distributed?
One gross method is to look at the measures of central tendency, mean, and median.
If the mean and median are the same or are very close to one another (as compared
with the total data spread), then we can assume that we are dealing with parametric
data. However, the proper method to test the fit of data to a normal distribution
is to use “goodness-of-fit” tests such as the Kolmogorov–Smirnov test and Shapiro–Wilk
test. The null hypothesis in these tests is that the frequency distribution of your
data is normally distributed. If any of these tests return a p-value less than 0.05, it implies that the normal distribution will have to be rejected
and the data would have to be taken as nonparametric.[31]
[32]
[33]
[34]
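For example, a minimal Python sketch of these goodness-of-fit tests (using SciPy, on simulated rather than real data) looks like this:

```python
# A minimal sketch of normality testing; the data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=40)  # simulated HU values

stat, p = stats.shapiro(data)                # Shapiro-Wilk test
print(p)  # p > 0.05 -> no evidence against normality

# Kolmogorov-Smirnov test against a normal distribution parameterized
# with the sample's own mean and SD (a common, though approximate, usage).
stat, p = stats.kstest(data, "norm", args=(data.mean(), data.std()))
print(p)
```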
The major disadvantage of these statistical tests for normal distribution is that the
calculated p-value is affected by the sample size. Therefore, if the sample size is very small,
the p-value may be much larger than 0.05, whereas if the sample from the same population
is very large, the p-value may be smaller than 0.05.
To overcome this disadvantage, graphical tests for normal distribution are used ([Fig. 8]):
-
Histogram data: Compare the histogram curve with the normal distribution curve.
-
Quantile–quantile plot: Compare the theoretical quantiles of normally distributed
data with quantiles of the measured values. If data were perfectly normally distributed,
all the points would be on a straight line. The further the points deviate from the
line, the less normally distributed the data are.
Fig. 8 Histogram curve and Q-Q plot for graphical representation of normality of distribution.
Histogram shows the data's shape, and the Q-Q plot compares the data's quantiles to
a theoretical normal distribution to identify deviations from normality.
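Both graphical checks can be produced in a few lines; the sketch below uses simulated data for illustration:

```python
# A minimal sketch producing a histogram and Q-Q plot, the two graphical
# normality checks described above; the data are simulated.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=55, scale=5, size=100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(data, bins=15)                       # compare shape to a bell curve
ax1.set_title("Histogram")
stats.probplot(data, dist="norm", plot=ax2)   # points near the line -> normal
ax2.set_title("Q-Q plot")
plt.show()
```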
Hypothesis Tests Done on Continuous Data
Parametric Data
Simple t-test: this is a test used to determine whether the mean calculated from sample data
collected from a single group differs from a known or hypothesized population mean ([Fig. 9]).[35]
[36]
Fig. 9 Approach to select appropriate parametric tests.
Let us consider a study where the researchers want to assess whether the hippocampal
volume on MRI in temporal lobe epilepsy patients is significantly lower as compared
with all epilepsy patients imaged during a specific time period. The t-test would then be used to show if the hippocampal volume is statistically lower
in temporal lobe epilepsy patients.
Unpaired sample t-test (for two independent samples): it compares the means of two independent groups.
There is no relationship between the subjects in one group and those in the other.[36] For example, an unpaired t-test could be used to compare the average radiation dose received by patients undergoing
neurointervention on a monoplane and biplane angio-suite, assuming patients are randomly
assigned to one of the techniques.
Student's paired t-test (for two dependent samples): it compares the means of two related groups or
conditions. Each subject or sample is measured twice, resulting in paired observations.[36] A t-test might be used to compare the average size of hepatocellular carcinoma nodules
in patients treated with a new intra-arterial chemotherapy drug. If the t-test shows a significant difference in mean sizes, it suggests that the drug is effective
in reducing tumor size.
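Both variants are available in SciPy; the sketch below uses hypothetical values for the dose and tumor-size examples above:

```python
# A minimal sketch of unpaired and paired t-tests; all values are
# hypothetical illustrative data.
from scipy import stats

# Unpaired: radiation dose (mGy) for two independent patient groups
monoplane = [120, 135, 128, 140, 132, 125]
biplane = [110, 118, 122, 115, 119, 121]
t, p = stats.ttest_ind(monoplane, biplane)
print(t, p)

# Paired: tumor size (cm) in the same patients before and after therapy
before = [3.2, 4.1, 2.8, 3.9, 4.4]
after = [2.9, 3.5, 2.7, 3.1, 3.8]
t, p = stats.ttest_rel(before, after)
print(t, p)
```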
A t-test may be one-tailed or two-tailed: a one-tailed test evaluates an effect in a
prespecified direction, whereas a two-tailed test assesses whether any significant
difference exists without specifying the direction.
One-tailed t-test: it tests for the possibility of an effect in one specific direction (e.g.,
greater than or less than). For example, when the research hypothesis predicts the
direction of the difference (e.g., drug A increases recovery rate more than drug B).
Basically, it tests if the mean is greater than a certain value.
Two-tailed t-test: it tests for the possibility of an effect in both directions (e.g., not equal
to). For example, when the research hypothesis does not predict the direction of the
difference (e.g., drug A has a different recovery rate than drug B, without specifying
higher or lower). Basically, it tests if the mean is different from a certain value,
either higher or lower.
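In SciPy, the choice of tail is expressed through the `alternative` argument (available in recent versions); the recovery scores below are hypothetical:

```python
# A minimal sketch of one- vs. two-tailed testing via SciPy's
# `alternative` argument; the scores are hypothetical data.
from scipy import stats

drug_a = [12, 14, 15, 13, 16, 14]
drug_b = [10, 11, 13, 12, 11, 12]

# Two-tailed: is there any difference between the means?
print(stats.ttest_ind(drug_a, drug_b, alternative="two-sided").pvalue)
# One-tailed: is drug A's mean greater than drug B's?
print(stats.ttest_ind(drug_a, drug_b, alternative="greater").pvalue)
```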
One factorial ANOVA (for more than two independent samples): it determines whether
there are any statistically significant differences between the means of three or
more independent groups (or levels) on a continuous dependent variable. It tests
the null hypothesis that all group means are equal[37]
[38]: A one-way factorial ANOVA could be used to compare the average reading times of
radiologists interpreting images from three different types of imaging modalities
(X-ray, MRI, and CT scan).
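A minimal sketch of this comparison with SciPy follows; the reading times (in minutes) are hypothetical illustrative data:

```python
# A minimal sketch of a one-way ANOVA across three modality groups.
from scipy import stats

xray = [5, 6, 5, 7, 6]
mri = [15, 17, 16, 18, 14]
ct = [10, 11, 9, 12, 10]

f, p = stats.f_oneway(xray, mri, ct)
print(f, p)  # p < 0.05 -> at least one modality's mean differs
```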
Repeated measures ANOVA (for more than two dependent samples): it determines whether
there are any statistically significant differences between the means of three or
more related groups (or levels) on a continuous dependent variable measured at multiple
time points or under different conditions. It accounts for the correlation between
measurements taken from the same subject across different conditions or at different
time points.[38]
[39] Repeated measures ANOVA could be used to assess the effectiveness of a new contrast
agent in enhancing detection of small cerebral metastatic lesions across multiple
time points during an MRI scan session (comparing the detection before contrast administration,
immediately after contrast administration and 30 minutes postcontrast administration).
Nonparametric Data
For One Sample
Wilcoxon's test (Wilcoxon signed-rank test): it compares the median of a single sample,
or of paired differences, against a specified value (typically zero, assuming no difference;
[Fig. 10] and [Table 4]). It is typically used when the data do not meet the assumptions required for a
parametric test like the t-test, such as when the data are not normally distributed or when the measurement
scale is ordinal.[40] Wilcoxon signed-rank test could be used to assess whether a new MRI sequence results
in significantly improved lesion detection as compared with an established sequence.
Fig. 10 Approach to select appropriate nonparametric tests.
Table 4
Table showing various nonparametric tests used depending on the type of variables

Continuous variables:
- Mann–Whitney U test: compares differences between two independent groups.
- Kruskal–Wallis test: extension of the Mann–Whitney U test for three or more groups.

Nominal variables:
- Chi-squared test: assesses whether there is a significant association between two categorical variables.
- Fisher's exact test: used for small sample sizes (<5 in 1 cell) to determine if there are nonrandom associations between two categorical variables.
- McNemar's test: used for paired nominal data to determine if there is a difference in proportions.

Ordinal variables:
- Wilcoxon's test: compares two dependent (paired) groups with ordinal data.
- Friedman's test: compares differences between three or more dependent groups (repeated measures) on an ordinal scale.
Between Two Groups
Mann–Whitney U test (for two independent samples; also known as Wilcoxon rank sum test): it assesses
whether two independent groups differ significantly in terms of their medians. It
does not assume that the data follow a normal distribution.[41] The Mann–Whitney U test could be used to compare the interpretation times between two groups of radiologists
interpreting the same set of MRI scans.
Wilcoxon's test (for two dependent samples): it compares the medians of two related
groups or conditions. It assesses whether there is a statistically significant difference
between paired observations from the same subjects under different conditions.[42] Wilcoxon signed-rank test for two dependent samples could be used to evaluate the
effectiveness of a new image enhancement AI algorithm compared with the current conventional
MRI images.
More than Two Groups
Kruskal–Wallis test (for more than two independent samples): it determines whether
there are statistically significant differences between three or more independent
groups in terms of their medians. It is an extension of the Mann–Whitney U test for more than two groups.[43] For example, the Kruskal–Wallis test could be used to compare the hepatic lesion
size (measured as a continuous variable) among three different types of imaging modalities
(ultrasound, MRI, and CT scan).
Friedman's test (for more than two dependent samples): it determines whether there
are statistically significant differences between three or more dependent groups (repeated
measures) in terms of their medians. It is analogous to the Kruskal–Wallis test but
is used for within-subject designs.[44] Friedman's test could be used to compare the ratings of definition of margins of
a cerebral lesion (ordinal scale) from the same set of radiologists across three different
MRI sequences.
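All four nonparametric tests discussed above are available in SciPy; the sketch below runs each on small hypothetical datasets mirroring the examples in the text:

```python
# A minimal sketch of the nonparametric tests above; all values are
# hypothetical illustrative data.
from scipy import stats

# Mann-Whitney U: interpretation times (min) of two independent groups
g1, g2 = [12, 15, 14, 18, 20], [22, 25, 19, 24, 28]
print(stats.mannwhitneyu(g1, g2).pvalue)

# Wilcoxon signed-rank: paired image-quality scores, conventional vs. AI
conventional = [6.1, 5.8, 7.2, 6.5, 5.9, 6.8, 7.0, 6.3]
ai_enhanced = [6.6, 6.1, 8.0, 6.7, 6.5, 7.7, 7.4, 7.0]
print(stats.wilcoxon(conventional, ai_enhanced).pvalue)

# Kruskal-Wallis: lesion size (cm) across three independent modalities
us, mri, ct = [1.2, 1.5, 1.1], [1.4, 1.6, 1.3], [1.3, 1.7, 1.5]
print(stats.kruskal(us, mri, ct).pvalue)

# Friedman: margin ratings across three sequences, same five raters
s1, s2, s3 = [2, 3, 2, 3, 2], [3, 4, 3, 4, 3], [4, 4, 4, 5, 4]
print(stats.friedmanchisquare(s1, s2, s3).pvalue)
```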
Hypothesis Tests Done on Categorical Data
Fig. 11 Approach to select appropriate statistical tests for categorical data.
The tests to be done based on the type of data are summarized in [Tables 4] and [5].
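As an illustration of the most common of these, the chi-squared test, here is a minimal Python sketch on a hypothetical 2 × 2 contingency table of lesion type versus enhancement pattern:

```python
# A minimal sketch of a chi-squared test on a 2x2 contingency table;
# the counts are hypothetical illustrative data.
from scipy.stats import chi2_contingency

#               enhancing  non-enhancing
table = [[30, 10],   # benign lesions
         [15, 25]]   # malignant lesions

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # p < 0.05 -> association between lesion type and pattern
```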
Table 5
Table showing various parametric and nonparametric tests used depending on the nature of the sample being analyzed

- One sample: simple t-test (parametric); Wilcoxon's test for one sample (nonparametric).
- Two dependent samples: paired sample t-test (parametric); Wilcoxon's test (nonparametric).
- Two independent samples: unpaired sample t-test (parametric); Mann–Whitney U test (nonparametric).
- More than two independent samples: one factorial ANOVA (parametric); Kruskal–Wallis test (nonparametric).
- More than two dependent samples: repeated measures ANOVA (parametric); Friedman's test (nonparametric).
- Correlation between two variables: Pearson's correlation (parametric); Spearman's correlation (nonparametric).

Abbreviation: ANOVA, analysis of variance.
Reporting Statistical Tests
Reporting statistical tests in radiology is important to clearly and concisely convey
the results of analyses performed to evaluate the significance of findings and robustness
of conclusions drawn. Key points to consider when reporting statistical tests are
the following:
-
Specify the statistical test used: clearly mention which statistical test was employed (e.g., t-test, ANOVA, chi-squared test, Mann–Whitney U test). Justification for the choice of test also has to be provided, including the
nature of the data (parametric vs. nonparametric, nominal vs. continuous).
-
Include relevant parameters: degrees of freedom (if applicable; e.g., for t-tests and ANOVA), effect size (include measures such as Cohen's d for t-tests or eta-squared for ANOVA) to indicate the magnitude of the difference, and
confidence intervals (present confidence intervals for mean differences or proportions
to give context to the results).
-
Present p-values: clearly state the p-value obtained from the statistical test (use the conventional threshold for significance,
e.g., p < 0.05; if the p-value is above this threshold, avoid stating it as “not significant”; instead, indicate
the p-value explicitly). For very small p-values, it is common to report them as p < 0.001.
-
Interpret results: provide a clear interpretation of what the statistical results mean in the context
of the study. Clinical significance of the findings should also be discussed, not
just statistical significance.
-
Contextualize with clinical implications: discuss how the statistical findings relate to clinical practice, patient outcomes,
or the diagnostic performance of imaging modalities. Consider including sensitivity,
specificity, positive predictive value, and negative predictive value if applicable.
-
Follow reporting guidelines: adhere to relevant reporting guidelines (e.g., Standards for Reporting Diagnostic
Accuracy [STARD] for diagnostic accuracy studies, Consolidated Standards of Reporting
Trials [CONSORT] for randomized controlled trials) to ensure clarity and transparency
in the reporting of statistical analyses.
Here is an example of how statistical results might be reported in a radiology study.
Let us consider a study to compare the average tumor volume measured by MRI in patients
with type A and B tumors. A total of 60 patients were included in the analysis, with
30 patients in the type A group and 30 patients in the type B group. The mean tumor
volume for patients with type A tumors was 15.2 cm3 (±3.1 cm3), while the mean tumor volume for patients with type B tumors was 22.8 cm3 (±4.5 cm3). An independent sample t-test was performed to assess whether the difference in mean tumor volumes between
the two groups was statistically significant (after testing the normality of distribution).
The results indicated a significant difference in tumor volume between the two groups
(t(58) = −5.46, p < 0.001; “t” signifies the result is derived from a t-test; the number in brackets is the degrees of freedom [N1 + N2 − 2 = 30 + 30 − 2 = 58];
–5.46 is the t statistic value, with negative indicating the mean of the first group is less than
that of the second group; p < 0.001 is the p-value that is statistically significant). Patients in the type B group exhibited
larger tumor volumes than those in the type A group. The effect size, calculated using
Cohen's d, was 1.41, indicating a large effect. Additionally, a 95% confidence interval for
the difference in means was calculated, resulting in an interval of (–9.11 cm3, –5.25 cm3). This interval suggests that the mean tumor volume for type B tumors is significantly
higher than that for type A tumors, with a clinically relevant difference. In conclusion,
these findings demonstrate that patients with type B tumors have significantly larger
tumor volumes compared with those with type A tumors, which may have implications
for treatment planning and prognosis.
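The quantities in such a report can be computed in a few lines. Below is a minimal Python sketch using simulated stand-in data (not the study's actual values) to produce the t statistic, degrees of freedom, p-value, and Cohen's d:

```python
# A minimal sketch of a reportable two-sample comparison; the tumor
# volumes are simulated stand-ins for the study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
type_a = rng.normal(15.2, 3.1, 30)   # simulated type A volumes (cm^3)
type_b = rng.normal(22.8, 4.5, 30)   # simulated type B volumes (cm^3)

res = stats.ttest_ind(type_a, type_b)          # equal-variance t-test
dof = len(type_a) + len(type_b) - 2            # N1 + N2 - 2

# Cohen's d from the pooled standard deviation
pooled_sd = np.sqrt((type_a.var(ddof=1) + type_b.var(ddof=1)) / 2)
d = (type_a.mean() - type_b.mean()) / pooled_sd

print(f"t({dof}) = {res.statistic:.2f}, p = {res.pvalue:.4g}, d = {d:.2f}")
```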
Conclusion
To conclude, statistics play a crucial role in radiology, aiding in accurate data
interpretation, improving diagnostic accuracy, and advancing research. Proper understanding
and application of statistical principles such as data types, their distribution,
descriptive and inferential statistics, hypothesis testing, correlation, and sampling
are essential for research in radiology. This primer provides the foundational knowledge
needed to leverage statistics effectively, ultimately enhancing clinical decision-making
and patient outcomes.