The Discovery of Oral Cancer Prognostic Factor Ranking Using Association Rule Mining

Abstract Objective  The 5-year survival rate is an indicator used to assess oral cancer prognosis. The purpose of this study is to analyze oral cancer data to discover and rank the prognostic factors associated with oral cancer 5-year survival using the association rule mining (ARM) technique. Materials and Methods  This study is a retrospective analysis of 897 oral cancer patients from a regional cancer center between 2011 and 2017. The 5-year survival rate was assessed. Multivariable Cox proportional hazards analysis was performed to determine prognostic factors. ARM was applied to clinicopathologic and treatment modalities data to identify and rank the prognostic factors associated with oral cancer 5-year survival. Results  The 5-year overall survival rate was 35.1%. Multivariable Cox proportional hazards analysis showed that tumor (T) stage, lymph node metastasis, surgical margin, extranodal extension, recurrence, and distant metastasis of tumor were significantly associated with overall survival rate (p < 0.05). The top-ranked rule associated with death within 5 years was positive extranodal extension, followed by positive perineural invasion and positive lymphovascular invasion, with confidence levels of 0.808, 0.808, and 0.804, respectively. Conclusion  This study has shown that extranodal extension, perineural invasion, and lymphovascular invasion were the top-ranking and major deadly prognostic factors affecting the 5-year survival of oral cancer.


Introduction
Oral cancer is the most common malignant tumor of the head and neck; it is highly malignant, has a relatively high mortality rate, and is a major health problem worldwide. The new cases and the number of deaths from oral cancer in 2020 were reported to be 1.8 million and 464,000 worldwide, respectively.1 Oral cancer can be classified according to its origin (carcinoma and sarcoma). These neoplasms are aggressive in their biological behavior, leading to significant destruction of the structure of the oral cavity, and can develop local and distant metastases.2 The gold standard for definitive diagnosis of oral cancer is confirmation by pathological examination.3,4 Oral cancer treatment modalities depend on the American Joint Committee on Cancer (AJCC) tumor-node-metastasis (TNM) staging system, including tumor size, cervical lymph nodes, and distant metastases.5 The main therapeutic approach for oral cancer has not changed over the past decade; it is surgical treatment followed by adjuvant radiation therapy with or without chemotherapy in cases with high-risk pathologic features or late-stage oral cancer.3,6,7 In addition, oral cancer has a critical influence on patients in terms of facial appearance after treatment, ability to perform daily activities, ability to work, and quality of life.8,9 Nevertheless, improvements in medical imaging, surgical and adjuvant chemoradiotherapy techniques, and advances in supportive care modalities may improve the quality of life but have not significantly improved the 5-year survival rate of oral cancer patients.10 Thus, analyzing oral cancer data with data mining techniques to extract patterns between clinicopathologic factors, treatment, and 5-year survival outcome could provide an opportunity to better understand the pattern of oral cancer prognosis.
Data mining, also known as knowledge discovery from data, is the process of extracting potentially useful information and identifying knowledge hidden in a large amount of data. Unlike traditional statistical research methods, data mining technologies mine information to discover knowledge based on unclear assumptions.11,12 In the medical field, data mining techniques have the potential to capture complex details and patterns in medical data to predict disease.13 For example, time-series analysis and an association rule mining (ARM) model have been used to predict the number of Coronavirus Disease-2019 cases.14 ARM is a pattern-extraction data mining technique that was first introduced by Agrawal et al. as a method of analyzing marketing data. ARM consists of two steps: the first is to identify the frequent itemsets from the data, and the second is to generate the association rules from the frequent itemsets.15 ARM differs in concept from conventional statistics: ARM is the process of deriving useful insights and extracting meaningful patterns from the data, whereas conventional statistics is the science of collection, analysis, and interpretation of data.16 The ARM technique is considered a useful tool in the medical field because it can support intelligent diagnosis, extract invaluable information, and automatically create important insights while identifying relationships within and between variables of interest.16 ARM has been used to mine cancer data from medical records to extract significant patterns and discover the most common factors related to cancer biology and clinical prognosis. For example, ARM was utilized to decode molecular mechanisms of renal cell carcinoma subtypes17 and to predict breast cancer recurrence.18 In addition, previous studies have applied ARM to history and clinical data of oral cancer to discover patterns for early detection and prevention of oral cancer.19,20 Therefore, utilizing the ARM technique to extract the remarkable pattern of relationship between clinicopathologic, treatment, and 5-year survival data of oral cancer could be beneficial to aid clinicians' decision-making in oral cancer treatment.
The aim of this study is to analyze oral cancer data, including clinicopathologic features, treatment, and 5-year overall survival data, to discover and rank the prognostic factors associated with oral cancer 5-year survival using the ARM technique. The main contribution of this work is to offer an alternative analytical methodology, used alongside conventional statistics, to define new, useful, and interesting relationships between various cancer factors and survival outcomes of oral cancer. This work is expected to provide supplementary information for aiding clinicians' decision-making in clinical practice.

Data Acquisition
This study was approved by the Human Research Ethics Committee of Thammasat University (COE 015/2565) and was performed in accordance with the tenets of the Declaration of Helsinki. Informed consent was waived by the Human Research Ethics Committee of Thammasat University because of the retrospective nature of the fully anonymized data.
Oral cancer data were collected from electronic medical records from a regional cancer center of Thailand between 2011 and 2017. All cases of oral cancer were diagnosed by pathological examination, the gold standard of oral cancer diagnosis, and were followed up for at least 5 years. In this study, oral cancer staging follows the TNM staging classification system proposed by the eighth edition of the AJCC cancer staging of head and neck cancer.21 In addition, patients with pathological results of carcinoma in situ and with cancer in areas other than the oral cavity were excluded. Selection was based on completeness of clinicopathologic data, treatment modalities, and 5-year overall survivability data. After deleting cases with incomplete data, a total of 897 oral cancer cases, including squamous cell carcinoma (SCC), undifferentiated carcinoma, nonkeratinizing carcinoma, adenoid cystic carcinoma, mucoepidermoid carcinoma, and other types of oral cancer, remained available for analysis. The workflow of this study is illustrated in ►Fig. 1.

Statistical Analysis
Three types of data are available for analysis: (1) clinicopathologic data (gender, age range, comorbidities, tumor location, T stage, pathologic N stage, TNM stage, tumor types, lymph node metastasis, surgical margin, extranodal extension, lymphovascular invasion, perineural invasion, recurrence, and distant metastasis); (2) treatment modalities data (surgery only, surgery with radiotherapy, surgery and concurrent chemoradiotherapy, induction chemotherapy, concurrent chemoradiotherapy, and palliative treatment); and (3) 5-year survivability data. Descriptive statistics were calculated for clinicopathological, treatment modalities, and survivability data. The 5-year survival rate was calculated overall and for each TNM stage. The multivariable Cox proportional hazards model was used to determine independent predictors of 5-year survival rate and was performed with binary logistic regression, which included clinicopathological, treatment modalities, and survivability data. In addition, the Kaplan-Meier method was used to determine the cumulative proportion surviving and to plot the tumor stage survival curves. The data were analyzed using IBM SPSS Statistics version 26.
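To illustrate how a cumulative proportion surviving is obtained with the Kaplan-Meier method, the following is a minimal sketch in plain Python. It is not the SPSS procedure used in the study; the function name `kaplan_meier` and the follow-up data are hypothetical and chosen only to show the product-limit calculation.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Product-limit estimate: returns [(time, survival)] at each death time.

    durations: follow-up time per patient (here, months).
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)      # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk   # conditional probability of surviving past t
            curve.append((t, survival))
        at_risk -= leaving[t]             # deaths and censored patients both drop out
    return curve

# Hypothetical cohort of eight patients; 60 months corresponds to 5 years.
months = [6, 12, 12, 30, 45, 60, 60, 60]
died = [1, 1, 0, 1, 1, 0, 0, 0]
curve = kaplan_meier(months, died)
five_year_survival = curve[-1][1]
```

In this toy cohort the last step of the curve gives the estimated proportion surviving to 5 years; censored patients reduce the number at risk without lowering the estimate.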

Association Rule Mining (ARM)
ARM discovers the pattern of frequent items or events in the dataset, including the association between items and events. The pattern exposes the combination of items or events that occur at the same time. In the medical field, it is helpful to know how one disease is associated with others. ARM can be used as a multivariate analysis of the correlation between factors. Given a dataset containing a collection of records or transactions, each record comprises a set of categorical attributes. An association rule can be denoted by A → B, where A (the antecedent or left-hand side [LHS]) and B (the consequent or right-hand side [RHS]) are disjoint sets of attribute-value pairs (also called itemsets).14,22,23 The rule represents the assumption that when the variables in A occur in the dataset, the variables in B also occur. Association mining generates a large number of rules from a given dataset. The goal of this approach is to find rules that have high practical significance. To eliminate false rules, the effectiveness of discovered rules is measured in terms of Support, Confidence, Lift, Cosine, and Correlation coefficient.24,25
Support refers to the number of records in which the attribute-value pairs of both A and B appear, relative to the total number of records (transactions or instances); it indicates how frequently the itemset appears in the dataset. The Support value is symmetric, so that Support (A → B) = Support (B → A), and it equals the ratio of the number of records containing both A and B to the total number of records in the dataset.
The Confidence of the rule A → B measures the conditional probability of B given A, which determines how frequently B appears in those who have A. Therefore, the Confidence measure for a given rule is asymmetric, that is, Confidence (A → B) ≠ Confidence (B → A). However, confidence alone limits the inferences that can be drawn, because it does not account for how frequently B occurs in the dataset overall.
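The symmetry of Support and the asymmetry of Confidence can be seen in a small worked example. The sketch below uses ten hypothetical records (not study data), where A stands for an antecedent such as a positive pathologic finding and B for a consequent such as death within 5 years.

```python
# Hypothetical records as (has_A, has_B) pairs; the counts are illustrative only.
records = (
    [(True, True)] * 4      # A and B together
    + [(True, False)] * 1   # A without B
    + [(False, True)] * 2   # B without A
    + [(False, False)] * 3  # neither
)

n = len(records)
n_a = sum(1 for a, _ in records if a)          # records containing A
n_b = sum(1 for _, b in records if b)          # records containing B
n_ab = sum(1 for a, b in records if a and b)   # records containing both

support = n_ab / n                 # same value for A -> B and B -> A
confidence_a_to_b = n_ab / n_a     # P(B | A)
confidence_b_to_a = n_ab / n_b     # P(A | B)
```

Here Support is 4/10 for both directions, while Confidence is 4/5 for A → B but 4/6 for B → A.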
The Lift measure is the ratio of the observed support to the support that would be expected if A and B were independent. The lift suggests how often B appears when A appears while controlling for the likely occurrence of B. The value of lift determines the correlation between A and B: lift = 1 indicates independence, lift more than 1 indicates a positive relationship, and lift less than 1 indicates a negative relationship. Lift is also a symmetric measure between the itemsets A and B, i.e., Lift (A → B) = Lift (B → A). The higher the value of lift, the greater the chance of the consequent occurring if the antecedent has already occurred. Lift can also show the importance of an infrequent item. If the lift is more than 1, the rules are potentially useful for predicting consequences in future datasets.26
Cosine measures organize and summarize correlations based on "similarity," which provides a consistent and accurate view of correlations. For the Cosine measure, the two rules can be organized into binary-valued vectors. It gives a value of 0 or 1 depending on whether the element common to the two rules is present on the RHS or LHS of the rule; generally, a value higher than 0.5 shows strong similarity.27 Cosine measures the similarity between two vectors of an inner product space, which refers to distance with dimensions representing features of the data object in a dataset.
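As a rough illustration of these two measures, the sketch below computes Lift and Cosine from record counts. The cosine form used here, |A ∩ B| / sqrt(|A| |B|), is the common interestingness measure for association rules and is an assumption about the variant intended; the function names and counts are hypothetical.

```python
import math

def lift(n_a, n_b, n_ab, n):
    """Observed support divided by the support expected under independence."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def cosine(n_a, n_b, n_ab):
    """Cosine of the two binary indicator vectors for A and B."""
    return n_ab / math.sqrt(n_a * n_b)

# Hypothetical counts: 10 records, 5 contain A, 6 contain B, 4 contain both.
lift_ab = lift(5, 6, 4, 10)
lift_ba = lift(6, 5, 4, 10)          # swapping A and B leaves lift unchanged
cosine_ab = cosine(5, 6, 4)

# If only 3 records contained both, A and B would look independent (lift of 1).
lift_independent = lift(5, 6, 3, 10)
```

With these counts the lift is above 1, indicating a positive relationship, and the cosine is above the 0.5 level mentioned in the text.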
Correlation coefficient measures the strength of the linear relationship between a pair of variables.24,25 A chi-squared test is used in the analysis of contingency tables when the sample sizes are large. It is primarily used to examine whether two categorical variables are independent in influencing the test statistic. The chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
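For a single rule, both quantities can be derived from the 2×2 contingency table of A against B. The sketch below assumes the correlation coefficient is the phi coefficient for binary variables and that the chi-squared statistic is Pearson's statistic without continuity correction; the text does not state which variants were used, and the function name and counts are hypothetical.

```python
import math

def phi_and_chi_square(n_a, n_b, n_ab, n):
    """Phi coefficient and Pearson chi-squared for the 2x2 table of A versus B."""
    observed = [
        [n_ab, n_a - n_ab],                   # A present: B present, B absent
        [n_b - n_ab, n - n_a - n_b + n_ab],   # A absent:  B present, B absent
    ]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    phi = (n * n_ab - n_a * n_b) / math.sqrt(
        n_a * n_b * (n - n_a) * (n - n_b)
    )
    return phi, chi2

# Hypothetical counts: 10 records, 5 contain A, 6 contain B, 4 contain both.
phi, chi2 = phi_and_chi_square(5, 6, 4, 10)
```

Under these assumptions the two measures are linked: for a 2×2 table the uncorrected chi-squared statistic equals N times the square of phi.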
Support, Confidence, Lift, Cosine, Correlation coefficient, and Chi-square were calculated from the following quantities: |A| and |B| are the numbers of records that include A and B, respectively; |A ∩ B| is the number of records that contain both A and B; and N is the total number of patients.
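In terms of these quantities, the standard textbook definitions of the measures can be written as follows. These forms are assumed here rather than taken from the authors' typeset equations; the symbols φ, O_ij, and E_ij (observed and expected counts in the 2×2 contingency table of A against B) are introduced only for this illustration.

```latex
\begin{align*}
\mathrm{Support}(A \rightarrow B) &= \frac{|A \cap B|}{N} \\
\mathrm{Confidence}(A \rightarrow B) &= \frac{|A \cap B|}{|A|} \\
\mathrm{Lift}(A \rightarrow B) &= \frac{N\,|A \cap B|}{|A|\,|B|} \\
\mathrm{Cosine}(A, B) &= \frac{|A \cap B|}{\sqrt{|A|\,|B|}} \\
\phi(A, B) &= \frac{N\,|A \cap B| - |A|\,|B|}{\sqrt{|A|\,|B|\,(N - |A|)\,(N - |B|)}} \\
\chi^2 &= \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\end{align*}
```

As a quick consistency check under these definitions, a rule reported with Support 0.381, Confidence 0.576, and Lift 1.390 implies |A|/N ≈ 0.381/0.576 ≈ 0.661 and |B|/N ≈ 0.576/1.390 ≈ 0.414, which gives a Cosine of about 0.73, in line with the reported value of 0.727.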
In this study, ARM was implemented by a Python script and applied to clinicopathologic and treatment modalities data to identify the survivability rules. The redundant rules were filtered, and significant rules were identified. The antecedent A corresponds to clinicopathologic factors (gender, age range, comorbidities, tumor location, T stage, pathologic N stage, TNM stage, tumor types, lymph node metastasis, surgical margin, extranodal extension, lymphovascular invasion, perineural invasion, recurrence, and distant metastasis) and treatment modalities (surgery only, surgery with radiotherapy, surgery and concurrent chemoradiotherapy, induction chemotherapy, concurrent chemoradiotherapy, and palliative treatment). Furthermore, the consequent B focuses on 5-year survivability, including (1) death within 5 years and (2) survival of more than 5 years. Since one assumption of ARM is that all attribute values are discrete, the numerical data used in the study were translated into discrete labels.
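The overall procedure can be sketched in a few lines of Python. This is a simplified illustration, not the authors' script: it assumes each rule has a single attribute-value pair as antecedent and a survivability label as consequent, and the function names, age cut-point, support threshold, and patient records are all hypothetical.

```python
from collections import Counter

def age_range(age):
    """Translate a numerical age into a discrete label (hypothetical cut-point)."""
    return "<60" if age < 60 else ">=60"

def mine_rules(patients, outcome_key, min_support=0.1):
    """Mine single-antecedent rules (attribute=value -> outcome), ranked by confidence."""
    n = len(patients)
    item_counts, outcome_counts, pair_counts = Counter(), Counter(), Counter()
    for patient in patients:
        outcome = patient[outcome_key]
        outcome_counts[outcome] += 1
        for key, value in patient.items():
            if key != outcome_key:
                item_counts[(key, value)] += 1
                pair_counts[((key, value), outcome)] += 1
    rules = []
    for (item, outcome), n_ab in pair_counts.items():
        support = n_ab / n
        if support < min_support:
            continue  # drop infrequent antecedent-consequent combinations
        confidence = n_ab / item_counts[item]
        rules.append({
            "antecedent": item,
            "consequent": outcome,
            "support": support,
            "confidence": confidence,
            "lift": confidence / (outcome_counts[outcome] / n),
        })
    return sorted(rules, key=lambda r: r["confidence"], reverse=True)

# Hypothetical records; real data would include all factors listed above.
raw = [
    {"age": 72, "extranodal_extension": "positive", "outcome": "death<5y"},
    {"age": 65, "extranodal_extension": "positive", "outcome": "death<5y"},
    {"age": 58, "extranodal_extension": "positive", "outcome": "death<5y"},
    {"age": 49, "extranodal_extension": "negative", "outcome": "survive>5y"},
    {"age": 61, "extranodal_extension": "negative", "outcome": "survive>5y"},
    {"age": 55, "extranodal_extension": "negative", "outcome": "death<5y"},
]
patients = [{**p, "age": age_range(p["age"])} for p in raw]
rules = mine_rules(patients, "outcome")
death_rules = [r for r in rules if r["consequent"] == "death<5y"]
```

In this toy data the top-ranked rule for death within 5 years is positive extranodal extension. A general ARM implementation, such as the Apriori algorithm, would additionally enumerate antecedents containing several attribute-value pairs and filter redundant rules.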

Five-Year Survival Rate
The 5-year overall survival rate in this study was 35.1%.

Multivariable Cox's Proportional Hazards Analysis
In the multivariate analysis, this study found that T stage, positive lymph node metastasis, surgical margin, extranodal extension, lymphovascular invasion, recurrence of tumor, and presence of distant metastasis were significantly correlated with the overall 5-year survival rate (p < 0.05) (►Table 2).
In the ARM analysis, negative lymphovascular invasion (Support of 0.381, Confidence of 0.576, Lift of 1.390, Cosine of 0.727, Correlation coefficient of 0.458, and Chi-square of 167.31) was the major rule of survival greater than 5 years, followed by negative perineural invasion and negative extranodal extension.

Discussion
This work examined the effectiveness of ARM in extracting a set of meaningful rules to determine remarkable prognostic factors of oral cancer using clinicopathologic, treatment modalities, and 5-year survivability data. The 5-year survival rate of oral cancer is a key indicator of prognosis and treatment success, and understanding the factors related to this survival rate is important to improve patient prognosis.29-31 The low 5-year survival rate could stem from the fact that most patients in this study were diagnosed at an advanced stage, which had a poor prognosis with a 5-year survival rate of 30%, 2.5%, and 0% for stages IVa, IVb, and IVc, respectively.33,34 Although the multivariable analysis could identify the significant prognostic factors related to the 5-year survival rate of oral cancer, it could not rank those factors. Therefore, applying a computational technique, ARM, to oral cancer data could extract new information by ranking the prognostic patterns of oral cancer, providing additional insights for clinicians' decision-making in clinical practice.
In the ARM analysis of oral cancer data, the top five rules for death within 5 years included positive extranodal extension, perineural invasion, lymphovascular invasion, surgical margin, and tumor type of moderately differentiated (MD) SCC, with a Support of 0.232 to 0.274, Confidence of 0.683 to 0.808, Lift of 1.045 to 1.688, Cosine of 0.492 to 0.67, and Correlation coefficient of 0.044 to 0.457. Furthermore, the top five survival rules consisted of patients with negative lymphovascular invasion, perineural invasion, and extranodal extension, the group with nonavailable surgical margin data, and female patients, with a Support of 0.357 to 0.381, Confidence of 0.574 to 0.576, Lift of 1.042 to 1.390, Cosine of 0.61 to 0.727, and Correlation coefficient of 0.006 to 0.458. These results mean that if a patient exhibited positive extranodal extension, perineural or lymphovascular invasion, positive surgical margin, and MD SCC, there was higher confidence of that patient dying within 5 years. The results of ARM corresponded with the multivariate analysis, which found that extranodal extension, lymphovascular invasion, and surgical margin significantly impacted the 5-year survival rate of oral cancer. In addition, ARM contributed a ranking of these factors, which showed that positive extranodal extension was the top-ranked factor related to the 5-year survival rate of oral cancer. These ARM results provide additional information emphasizing that the presence of these adverse pathologic features was the major deadly prognostic pattern for oral cancer, on which clinicians should focus.
To our knowledge, this is the first study to define rules for death within 5 years and survival of more than 5 years in oral cancer using ARM techniques, which prevents a direct comparison of these findings with those of other studies.34,35 However, previous studies that analyzed the data with traditional statistics could only identify the significant factors and could not rank the prognostic factors related to oral cancer survivability. Therefore, these results provide new insights for the exploration of prognostic factors and reveal invaluable information about the deadly pattern of oral cancer. This work has theoretical and practical importance: it can serve as a reference for relevant studies in the future and can aid clinicians as supplementary information to predict oral cancer prognosis and select the most appropriate treatment plan for oral cancer patients.
The limitations of this study need to be addressed. First, rule mining analysis is primarily for exploring associations and patterns in data. One of the main challenges of ARM is that it can generate an overwhelming number of rules from a large dataset, which can be costly and complex to analyze. Second, the depth of invasion, which is a pathologic feature and an important prognostic factor of oral cancer, was not included in the analysis because of missing data in the medical and pathologic records. Because the previous edition of the AJCC cancer staging of head and neck cancer did not include the depth of invasion in cancer staging,36 it was not recorded in this cancer center between 2011 and 2017. Therefore, future research should build on this methodology, linked with online data sources, to collect more oral cancer data, including medical, radiologic, pathologic, and genomic data, from various cancer and health centers to obtain complete oral cancer data so that a sufficient number of meaningful prognostic rules can be extracted. In addition, the combination of artificial intelligence technology and analysis with different data mining techniques, including causal inference, could extract other significant information about factors related to survival rate for predicting oral cancer prognosis in clinical practice. Furthermore, machine learning techniques, including decision tree and deep learning algorithms, could be combined with ARM to create a multivariable prognostic prediction model that establishes survival predictions for oral cancer cases in real clinical scenarios.

Conclusion
Introducing the ARM technique into oral cancer data as a powerful approach can extract and classify data to uncover an interesting relationship between prognostic factors and the 5-year survival rate of oral cancer. The ARM identified the major ranking of the deadly prognostic rules of oral cancer, which were extranodal extension, perineural invasion, and lymphovascular invasion. The results of this study will provide important insights into the pattern and ranking of prognostic factors that influence the 5-year survival rate of oral cancer and may aid clinicians in selecting the most appropriate treatment plan to increase the survival rate for oral cancer patients.
European Journal of Dentistry © 2024. The Author(s).
Discovery of Oral Cancer Prognostic Factor Ranking Using Association Rule Mining Chaowchuen et al.
The 5-year overall survival rate of 35.1% was divided by TNM stage into 75.8% for stage I, 60.7% for stage II, 43.0% for stage III, 21.4% for stage IVa, 4.2% for stage IVb, and 0% for stage IVc. The Kaplan-Meier curves of the 5-year survival rate by TNM stage are shown in ►Fig. 2.
ARM was applied to identify the rules associated with 5-year survivability of oral cancer, including death within 5 years and survival of more than 5 years. The top 10 rules for death within 5 years, ranked by highest confidence scores, are presented in ►Table 3. Among these, positive extranodal extension (Support of 0.274, Confidence of 0.808, Lift of 1.552, Cosine of 0.669, Correlation coefficient of 0.412, and Chi-square of 135.788) was the major rule of death within 5 years, followed by positive perineural invasion and positive lymphovascular invasion. The top 10 rules for survival of more than 5 years, ranked by highest confidence scores, are also presented in ►Table 3. Among these, negative lymphovascular invasion (Support of 0.381) ranked first.

Table 1
Summary of clinicopathologic, treatment modalities, and survival data of the oral cancer patients

Table 1 (Continued)
a N/A denotes nonavailable data in the nonsurgical group or uninterpreted data in the pathological record.

Table 3
Top 10 association rules of 5-year survivability ranked by confidence (death within 5 years and survival of more than 5 years)