Predicting Steam Processing Degree of Prepared Radix Rehmanniae (Shudihuang) Using Machine Learning

Qinghua Han; Keyu Zhang; Fangfang Yu; Ye Chen; Jiawen Song; Zhijia Xu; Yichen Zhang; Ruidan Su; Siyang Fan

doi:10.1055/a-2705-8654


CC BY 4.0 · Pharmaceutical Fronts

Original Article


Authors

Qinghua Han

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China
Keyu Zhang

²School of Computer, Shanghai Jiao Tong University, Shanghai, People's Republic of China
Fangfang Yu

³School of Pharmacy, Shanghai Jiao Tong University, Shanghai, People's Republic of China
Ye Chen

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China
Jiawen Song

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China
Zhijia Xu

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China
Yichen Zhang

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China
Ruidan Su

²School of Computer, Shanghai Jiao Tong University, Shanghai, People's Republic of China
Siyang Fan

¹National Key Laboratory of Lead Druggability Research, Shanghai Institute of Pharmaceutical Industry Co., Ltd., China State Institute of Pharmaceutical Industry Co., Ltd., Shanghai, People's Republic of China

³School of Pharmacy, Shanghai Jiao Tong University, Shanghai, People's Republic of China


Keywords

prepared Radix Rehmanniae - steaming and drying for 9 cycles - machine learning - random forest

Introduction

As a widely used Chinese medicinal herb, the roots of Rehmannia glutinosa (Radix Rehmanniae, RR, “Dihuang”) are commonly used in traditional Chinese medicine (TCM) prescriptions to treat anemia, irregular menstruation, renal failure, and other diseases.[1]

There are three processing forms of RR used as decoction pieces, namely, fresh Radix Rehmanniae, dried Radix Rehmanniae (DRR), and prepared Radix Rehmanniae (PRR, Shudihuang).[2] In ancient China, the preparation of PRR by water steaming was first mentioned in the Synopsis of the Golden Chamber (JinKuiYaoLue, Eastern Han Dynasty, A.D. 25–220) and further detailed in QianJinYiFang (QJYF, Tang Dynasty, A.D. 682). QJYF recorded that PRR prepared by steaming rice wine (RW)-immersed DRR 3 to 5 times could be analogous to PRR prepared by steaming 9 times (a more ancient method). Later, in the Ming and Qing Dynasties (A.D. 1368–1911), the process of “RW (with or without Fructus Amomi) immersion, together with steaming and drying for nine cycles (SD₉),” by which PRR of satisfactory appearance and taste (“black as lacquer, sweet as maltose”) can be obtained, was popularly applied and widely recorded in medicinal works, such as the Compendium of Materia Medica (BenCaoGangMu, A.D. 1552–1578), BenCaoPinHuiJingYao (A.D. 1505), BenCaoBeiYao (A.D. 1694), and others. From ancient times to the present, there has been a host of records about processing methods from DRR to PRR, which differ slightly.[1] To reduce costs, PRR manufacturing has recently trended toward fewer steaming cycles and shorter steaming times. According to the modern processing method described in the Chinese Pharmacopoeia (CP), PRR is prepared by steaming DRR mixed with water (or RW) or stewing DRR mixed with RW, but neither the number of processing cycles nor the steaming time is defined.
PRR by SD₉ is monographed in only a few local specifications, e.g., the Henan province specifications for processing of TCM (2022 edition).[3] The specific number of processing cycles is still controversial.[4] The steaming degree imparts unique characteristics to PRR and influences the quality and final clinical effectiveness of the PRR produced.[1] Thus, it is imperative to determine the optimal number of steaming cycles and the steaming duration so that the PRR approaches the quality of the traditionally made PRR-SD₉.

However, accurate identification of the steaming degree of PRR remains a problem for industries and regulatory agencies, adding challenges to the quality assurance of PRR products. Liquid chromatography coupled with mass spectrometry (LC-MS) has been used to generate an enormous amount of data about RR samples, and multivariate statistical analysis (MSA) has been successfully applied to classify the samples into DRR and PRR groups and to determine which compounds are correlated with the PRR property.[5] However, the MSA models could not identify the exact processing degree of a new, unknown PRR sample, which requires more powerful data analysis with a larger number of variables (more complex datasets). Within this context, machine learning (ML) is a promising alternative to address this issue.[6] ML techniques have been successfully used in conjunction with LC-MS for TCM quality control (QC)[7] but have been used only once to predict the steaming time (0–15 hours) of PRR,[8] with an error rate of 8% (2 misidentified samples out of 24 blind samples). However, that study did not cover intensive steaming degrees (more than 15 hours), and its dataset included only the visualized oligosaccharide distribution.[8] As the steaming time increases, the change in the oligosaccharide profile is initially significant and later slight.[9] [10] [11] This loss of relevant features makes it difficult to distinguish between PRR samples with a deeper steaming degree.

Therefore, the present study aims to (1) determine the optimal processing conditions under which PRR approaches the SD₉ quality, through an overall assessment of both appearance and chemical composition, and (2) establish validated models to identify the processing degree of new, uninvestigated PRRs by combining metabolomic studies, MSA, and ML algorithms. The best processing degree (or the most feasible endpoint) that matched the quality of traditionally made PRR-SD₉ is discussed. We also introduce random forest (RF) methods with LC-MS-based metabolomic datasets that were able to discriminate a broad range of PRR processing degrees. This study could therefore guide modern manufacturing processes for PRR preparation and provide a useful tool for assessing PRR quality.

Material and Methods

Chemicals and Reagents

The LC-MS grade acetonitrile and formic acid were purchased from Macklin (Shanghai, China), and the high-performance liquid chromatography (HPLC)-grade methanol and acetonitrile were acquired from Adamas (Shanghai, China). The additives (phosphoric acid and ammonia) for the mobile phase were obtained from Sinopharm Chemical Reagent (Shanghai, China). The ultra-pure water was prepared in-house with a Milli-Q Integral 5 system (Millipore, Massachusetts, United States). Rehmannioside D (112063–202103, purity 94.2%), catalpol (110808–202313, purity 99.6%), sucrose (111507–202105, purity 99.8%), raffinose (16042, purity 83.6%), stachyose (112031–202203, purity 94.9%), melibiose (13549, purity ≥ 98.0%), manninotriose (16187, purity ≥ 98.0%), fructose (100231–202008, purity 99.9%), mannitol (100533–202207, purity 99.3%), and glucose (110833–202109, purity 99.9%) were purchased from Nature Standard Co., Ltd. (Shanghai, China). The yellow RW (20231028348, 10.5% ABV, 40 mg/mL of glucose) was obtained from Anhui Matouqiang Wine Co., Ltd. (Anhui, China).

Prepared Radix Rehmanniae Samples Preparation and Collection

A total of 80 PRR samples ([Table 1], [Fig. 1]) with different processing excipients and different steaming degrees were prepared from three batches of DRR, which were collected in Mengzhou county (Henan, China). The preparation procedure is detailed as follows.

Table 1
Prepared Radix Rehmanniae samples

PRR (prepared in the laboratory)
Group	Plant material (DRR)	PRR samples	Number of batches
PRR (with RW)	DRR-1	PRR-RW-SD₁	4
PRR (with RW)	DRR-1	PRR-RW-SD₂	4
PRR (with RW)	DRR-1	PRR-RW-SD₃	4
PRR (with RW)	DRR-1	PRR-RW-SD₄	4
PRR (with RW)	DRR-1	PRR-RW-SD₅	4
PRR (with RW)	DRR-1	PRR-RW-SD₆	4
PRR (with RW)	DRR-1	PRR-RW-SD₇	4
PRR (with RW)	DRR-1	PRR-RW-SD₈	4
PRR (with RW)	DRR-1	PRR-RW-SD₉	4
PRR-RW-SD₅ (with RW)	DRR-2	PRR-RW-SD₅-1, 2, 3	3
PRR-RW-SD₅ (with RW)	DRR-3	PRR-RW-SD₅-4, 5	2
PRR (without RW)	DRR-1	PRR-SD₁	4
PRR (without RW)	DRR-1	PRR-SD₂	4
PRR (without RW)	DRR-1	PRR-SD₃	4
PRR (without RW)	DRR-1	PRR-SD₄	4
PRR (without RW)	DRR-1	PRR-SD₅	4
PRR (without RW)	DRR-1	PRR-SD₆	4
PRR (without RW)	DRR-1	PRR-SD₇	4
PRR (without RW)	DRR-1	PRR-SD₈	4
PRR (without RW)	DRR-1	PRR-SD₉	4
PRR (with FA)	DRR-1	PRR-FA	3

PRR (purchased from the market)
Lot number	Origin
240306	Anhui, China
2305001	Jiangxi, China
C22404065	Guangdong, China
240400399	Anhui, China
240100249	Anhui, China
240901	Guangdong, China
292240601	Hebei, China
202309011143	Hebei, China
220110	Shanghai, China
230307	Shanghai, China

Abbreviations: DRR, dried Radix Rehmanniae; PRR, prepared Radix Rehmanniae; RW, yellow rice wine; FA, Fructus Amomi; SD, steaming and drying.

Fig. 1 Samples. DRR, dried Radix Rehmanniae; PRR, prepared Radix Rehmanniae; RW, yellow rice wine; FA, Fructus Amomi; SD, steaming and drying; ML, machine learning.

Prepared Radix Rehmanniae (with Rice Wine) by SD for 1–9 Cycles

The DRR samples were cleaned, dried, and divided into four groups by size (#1, ∼13 roots per 100 g; #2, ∼18 roots per 100 g; #3, ∼28 roots per 100 g; #4, ∼40 roots per 100 g) to prepare PRRs under the same processing conditions. The PRR samples (36 batches) with different numbers of SD cycles were obtained as follows: DRR (100 g) was mixed with RW (35 mL), thoroughly moistened for 24 hours, placed in a glass dish, water-steamed for 12 hours, and dried at 50°C to 80% of dryness to obtain the prepared Radix Rehmanniae with SD for 1 cycle (PRR-RW-SD₁). Meanwhile, the oily juice was collected in the glass dish. PRR-RW-SD₁ was mixed with the juice, thoroughly moistened for 24 hours, water-steamed for 12 hours, and dried at 50°C to 80% of dryness to obtain PRR-RW-SD₂. PRR-RW-SD₂ was mixed with the collected juice, thoroughly moistened for 24 hours, water-steamed for 12 hours, and dried at 50°C to 80% of dryness to obtain PRR-RW-SD₃. PRR-RW-SD₄ to PRR-RW-SD₉ were further prepared in the same way, except that the steaming times for PRR-RW-SD₄ to PRR-RW-SD₆ and for PRR-RW-SD₇ to PRR-RW-SD₉ were 8 and 6 hours, respectively. After the last steaming, the oily and lustrous roots were sliced and dried at 50°C for 9 hours to obtain the PRR-RW-SD₁ to PRR-RW-SD₉ samples.
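The steaming schedule described above (12 hours per cycle for cycles 1 to 3, 8 hours for cycles 4 to 6, and 6 hours for cycles 7 to 9) can be tallied as a small worked example. The sketch below only adds up steaming hours; moistening and drying times are deliberately left out, and the names `STEAM_HOURS` and `cumulative_steaming_hours` are illustrative rather than taken from the study.

```python
# Steaming hours per SD cycle as stated in the text:
# cycles 1-3 -> 12 h, cycles 4-6 -> 8 h, cycles 7-9 -> 6 h.
STEAM_HOURS = {cycle: 12 for cycle in range(1, 4)}
STEAM_HOURS.update({cycle: 8 for cycle in range(4, 7)})
STEAM_HOURS.update({cycle: 6 for cycle in range(7, 10)})


def cumulative_steaming_hours(n_cycles: int) -> int:
    """Total steaming time (hours) accumulated after n_cycles SD cycles."""
    if not 1 <= n_cycles <= 9:
        raise ValueError("n_cycles must be between 1 and 9")
    return sum(STEAM_HOURS[c] for c in range(1, n_cycles + 1))


for n in range(1, 10):
    print(f"PRR-RW-SD{n}: {cumulative_steaming_hours(n)} h of steaming in total")
```

Under this schedule, three cycles correspond to 36 hours, five cycles to 52 hours, and nine cycles to 78 hours of steaming.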

Another five batches of PRR with RW (5 SD cycles) were prepared in the same way from two further batches of DRR (DRR-2 and DRR-3; [Table 1]).

Prepared Radix Rehmanniae (without Rice Wine) by SD for 1 to 9 Cycles

All the procedures for preparing PRR-SD₁ to PRR-SD₉ (without RW) were the same as those for PRR-RW-SD₁ to PRR-RW-SD₉, except that the RW (35 mL) was replaced with drinking water (35 mL) as the processing excipient. After the last steaming, the oily and lustrous roots were sliced and dried at 50°C for 9 hours to obtain the PRR-SD₁ to PRR-SD₉ samples (36 batches).

Prepared Radix Rehmanniae with Fructus Amomi by SD

According to the Henan province specifications for processing of TCM, the PRR-Fructus Amomi (FA) was prepared as follows: the DRR samples (100 g) were cleaned, dried at 55°C for 45 hours, mixed with RW (50 mL) and FA (0.9 g), and thoroughly moistened for 24 hours. The moistened roots were placed in a glass dish and steamed for 48 hours. The steamed roots were sliced and dried at 50°C for 9 hours to obtain the PRR-FA samples (n = 3).

In addition, 10 commercial samples with unknown steaming degrees were purchased from vendors; they are detailed in [Table 1].

The Determination of the Optimal Steaming Times by Apparent and Chemical Assessment

Color and Gloss Determination

The mean L*a*b* values (400–700 nm, 10 nm interval, n = 12) of PRR samples were acquired with an NS800 spectrophotometer (3nh Global, China), which uses a 45°/0° optical geometry complying with the CIE No. 15 and GB/T 3978 standards.[12] The DRR sample was used as a reference. The key parameters were set as follows: the light source was D65; the observer angle was 10°; the color spaces were CIE LAB and LCh; and the color index was CIE 1976. The reflection ratios (%) of PRRs at each wavelength were obtained from the L*a*b* values by the SQC8 color management control system (3nh Global, China).
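The text names the DRR sample as the reference and CIE 1976 as the color index, but it does not spell out a color-difference formula. One common way to compare a sample against a reference in CIE LAB space is the CIE 1976 color difference, ΔE*ab = sqrt(ΔL*² + Δa*² + Δb*²). The sketch below illustrates that formula only; whether the instrument software reported this exact quantity is an assumption, and the L*a*b* readings are hypothetical values, not measurements from the study.

```python
import math


def delta_e_cie76(lab_sample, lab_reference):
    """CIE 1976 color difference between two (L*, a*, b*) triples."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(lab_sample, lab_reference)))


# Hypothetical readings for illustration only (not measured values).
drr_reference = (45.0, 8.0, 12.0)   # assumed reading for the DRR reference
prr_sample = (25.0, 2.0, 3.0)       # assumed reading for a darker PRR sample

print(f"Delta E*ab vs. DRR: {delta_e_cie76(prr_sample, drr_reference):.2f}")
```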

The images of PRR samples were acquired with a Canon EOS M100 camera. The resolution, horizontal resolution, vertical resolution, aperture value, exposure time, ISO speed, and focal length were 6,000 × 4,000, 180 dpi, 180 dpi, f/6.3, 1/40 seconds, ISO-3200, and 45 mm, respectively. The object selection tool in Adobe Photoshop (Beta) was applied to select the analysis area in each image and to read the brightness values[13] of PRR samples. The mean brightness value of all the analysis areas was calculated for gloss assessment.

Quantification of Iridoids and Mono/Di/Oligosaccharides by High-Performance Liquid Chromatography

The sample pretreatment and HPLC analyses for the measurement of catalpol and rehmannioside D were conducted using the methods of the CP,[2] which are briefly described in the [Supporting Information] (available in online version). Sample solution preparation, mono/di/oligosaccharide (sucrose, raffinose, stachyose, melibiose, manninotriose, fructose, mannitol, and glucose) quantification, and method validation are also detailed in the [Supporting Information] (available in online version).

The Steaming-Induced Hydrolysis of Oligosaccharides

A parallel experiment was performed to understand the mechanism of steaming-induced transformation of saccharides. The chemical changes were monitored before (SD₀), during (SD₁, SD₂, SD₃, and SD₅), and after (SD₉) steaming the pure saccharide compounds for nine cycles. Briefly, the weighed-in quantities of sucrose, raffinose, and stachyose were calculated with reference to their actual quantities in the DRR. The solutions of stachyose (n = 3), raffinose (n = 3), and sucrose (n = 3) were prepared by dissolving approximately 15.0, 4.0, and 12.5 mg of the pure compounds, respectively, in 50.0 mL of water in glass vessels. These solutions were then steamed for nine cycles using the same process (e.g., steaming time) as for PRR preparation (section 2.2). Between steaming cycles, aliquots of the solutions were removed for HPLC analysis, and the same volume of water was added back. All solutions were weighed together with the vessel and sampled after cooling to room temperature.

Liquid Chromatography Coupled with Mass Spectroscopy Analysis

The preparation of the sample solution (Step 1 in [Fig. 2]) is detailed in the [Supporting Information] (available in online version). LC-MS analyses for small molecules were performed on a Waters ACQUITY Ultra Performance Liquid Chromatography (UPLC) system coupled to a Waters Xevo G2-XS quadrupole time-of-flight (QTOF) MS. The separation was achieved on a Waters ACQUITY UPLC HSS T3 column (50 mm × 2.1 mm, 1.8 μm) at 35°C with a mobile phase of water with 0.1% (v/v) formic acid (phase A) and acetonitrile (phase B) under the following conditions: 0 to 2 minutes, 1% B; 2 to 4 minutes, 1 to 9% B; 4 to 10 minutes, 9 to 29% B; 10 to 12 minutes, 29 to 48% B; 12 to 27 minutes, 48 to 100% B; 27 to 33 minutes, 100% B; 33 to 33.1 minutes, 100 to 1% B; and 33.1 to 34 minutes, 1% B. The flow rate was 0.3 mL/min, and the injection volume was 10 μL. The MS was operated using an electrospray ionization source in negative ion mode. The MS parameters in MS^E mode were set as follows: capillary voltage 2.0 kV, source temperature 100°C, desolvation temperature 250°C, cone gas flow 50 L/h, desolvation gas flow 600 L/h, and cone voltage 40 V. All data were collected in the MS^E continuum mode and acquired by MassLynx 4.1 software. Mass accuracy of the parent ions and major fragments was limited to within 5 ppm. Leucine enkephalin (1 ng/mL) was used for the lock mass ([M − H]⁻, m/z 554.2615) at a flow rate of 5 μL/min. The collision energy ranged from 30 to 50 eV for the high-energy function, and the scan time was 0.3 seconds. The mass range was 50 to 1,500 Da.
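The 5 ppm mass-accuracy limit mentioned above can be made concrete with the usual definition of mass error, (measured − theoretical) / theoretical × 10⁶. The following is a minimal sketch; the lock-mass value is taken from the text, while the two "measured" readings and the function names are hypothetical and serve only to illustrate the check.

```python
def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(measured_mz: float, theoretical_mz: float,
                     tol_ppm: float = 5.0) -> bool:
    """True if the absolute mass error does not exceed tol_ppm."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm


LOCK_MASS = 554.2615  # leucine enkephalin lock mass quoted in the text

# Hypothetical instrument readings, for illustration only.
for reading in (554.2630, 554.2650):
    status = "accepted" if within_tolerance(reading, LOCK_MASS) else "rejected"
    print(f"m/z {reading:.4f}: {ppm_error(reading, LOCK_MASS):+.2f} ppm ({status})")
```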

Fig. 2 A flow chart of sample preparation, data preprocessing, and machine learning. RT, retention time; LR, logistic regression; DT, decision tree; SVM, support vector machine; RF, random forest.

In line with the untargeted metabolomics experimental design, a QC sample was prepared by mixing equal volumes (50 μL each) of all samples intended for the metabolomics study. Before the injection sequence was started, the QC sample was run 10 times to condition the system. Subsequently, the study samples were injected in random order, with a QC sample inserted after every five samples to monitor system stability.
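The run order described above (10 conditioning QC injections, then randomized study samples with a QC after every fifth sample) can be sketched as follows. The sample identifiers, the fixed random seed, and the function name `build_injection_sequence` are assumptions for illustration; the text does not say whether a closing QC was injected when the last block was incomplete, so none is added here.

```python
import random


def build_injection_sequence(sample_ids, n_conditioning=10, qc_interval=5, seed=0):
    """Return a run order: conditioning QCs, then shuffled samples with periodic QCs."""
    rng = random.Random(seed)          # fixed seed so the order is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    sequence = ["QC"] * n_conditioning
    for position, sample in enumerate(shuffled, start=1):
        sequence.append(sample)
        if position % qc_interval == 0:  # QC after every qc_interval samples
            sequence.append("QC")
    return sequence


# Hypothetical sample identifiers, for illustration only.
samples = [f"S{i:02d}" for i in range(1, 13)]
run_order = build_injection_sequence(samples)
print(run_order)
```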

Data Preprocessing and Preparation

Progenesis QI (Waters) was used for LC-MS data preprocessing (Step 2 in [Fig. 2]), including retention time (RT) alignment, peak picking, and normalization. Peak alignment was performed by taking the pooled QCs as the reference. Isotope and adduct deconvolution were applied to reduce the overlap in data features. All data were normalized to the summed total ion intensity per chromatogram, and a table with peaks (each with m/z, RT, and normalized abundance values) was obtained for each experiment. The experiments were performed in two replicates for each of the 85 PRRs.
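The normalization step described above divides each peak intensity by the summed ion intensity of its chromatogram, so that samples injected at different overall signal levels become comparable. The sketch below illustrates that idea in plain Python; it is not the Progenesis QI implementation, and the peak table is made up.

```python
def normalize_to_total_ion(intensities):
    """Scale a list of peak intensities so that they sum to 1."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total ion intensity must be positive")
    return [value / total for value in intensities]


# Hypothetical peak table: sample_B has the same profile at twice the signal.
peak_table = {
    "sample_A": [1200.0, 300.0, 500.0],
    "sample_B": [2400.0, 600.0, 1000.0],
}

normalized = {name: normalize_to_total_ion(values)
              for name, values in peak_table.items()}
for name, values in normalized.items():
    print(name, [round(v, 3) for v in values])
```

After normalization the two hypothetical samples have identical profiles, which is the intended effect of removing differences in total signal.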

Then, the resultant data matrices were introduced to EZinfo 2.0 software for Principal Component Analysis (PCA, an unsupervised learning method), to preliminarily assess groupings among the samples according to the steaming intensity level. Next, a Partial Least Squares Discriminant Analysis (PLS-DA) was used to select feature peaks with the Variable Importance in Projection scores greater than 1 (VIP > 1).[14]

Two datasets, the full features (the unique m/z_RT pairs) versus corresponding “normalized abundance” (1, ALL) and the features of VIP > 1 versus corresponding “normalized abundance” (2, VIP), were then fed into the ML models for classification training.

Machine Learning

Classification model development was performed using supervised ML (Step 3 in [Fig. 2]), where the dataset has been explicitly labeled or classified, that is, each data point is known to belong to a given category. In supervised learning, the process involves learning from labeled data (training data) and creates a model that maps inputs (features) to outputs with high accuracy on previously unseen data (blind verification data) during the data validation phase.[15] Four algorithms, namely Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and RF, were used, and the results were compared to obtain the model with the highest accuracy. These algorithms were selected considering their respective advantages.

ML algorithms are better able to handle datasets in which the number of features exceeds the number of samples.[16] PCA, linear discriminant analysis (LDA), and regularization were used for dimensionality reduction, and SelectKBest was used for feature selection, to simplify the data and enhance model performance. For example, the SelectKBest class in scikit-learn, categorized as a filter-based feature selection method,[17] implements a two-stage procedure for identifying top-performing features from a dataset: (1) computation of a relevance metric by statistical testing (e.g., mutual information for capturing nonlinear dependencies, chi-square for testing feature-target independence) and (2) subset selection by thresholding on the computed scores. For classification problems, the mutual_info_classif variant is preferred, which estimates the mutual information ([Eq. 1]) between features X and discrete targets Y.[18] This approach effectively identifies predictive features while maintaining computational efficiency through empirical probability estimation from sample data and avoidance of the high-dimensional covariance matrix computations required by parametric methods. We also set up a custom feature selector using importance_threshold, where num_features_to_select was set to 80 to calculate the feature importance and select the top 80 features for model training. All modeling was performed using the Python programming language (version 3.12), the scikit-learn ML package (version 1.5.1), and the PyCharm IDE (version 2024.1.3).
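The two-stage SelectKBest procedure described above (score each feature, then keep the top k) can be illustrated with scikit-learn's `SelectKBest` and `mutual_info_classif`. The sketch below runs on synthetic data with more features than samples; k = 100 is the value the text reports for the ALL-RF model, whereas the data, the random seeds, and the variable names are assumptions for illustration only.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data: 60 samples, 300 features, 3 classes (not study data).
rng = np.random.default_rng(42)
y = np.repeat(np.arange(3), 20)
X = rng.normal(size=(y.size, 300))
X[:, :5] += 4.0 * y[:, None]        # only the first five features carry class signal

# Stage 1: score every feature by its estimated mutual information with y.
# Stage 2: keep the k highest-scoring features.
score_func = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
selector = SelectKBest(score_func=score_func, k=100)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print("reduced shape:", X_selected.shape)
print("informative features kept:", sorted(set(range(5)) & set(kept)))
```

Because the filter scores each feature independently of any classifier, it stays cheap even when the feature count is in the thousands, which matches the motivation given in the text.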

Except for the data in the blind verification dataset, the remaining data were randomly divided into a training set (80%) and a testing set (20%) for all classifiers. To evaluate the performance of the models for all classifiers, the k-fold cross-validation method was employed, with k set to either 3 or 5. Grid search was applied to explore the optimal parameters for each model. To further evaluate model performance, indicators such as the average cross-validation score, accuracy, precision, recall, F1 score, and confusion matrix were adopted for a comprehensive assessment.[19] We selected the optimal classification model based on its error rate in predicting the degree of PRR processing when the blind verification dataset was used as input.
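The evaluation workflow described above (80/20 split, k-fold cross-validation, grid search, and a set of classification metrics) might look roughly like the following in scikit-learn. This is a sketch on synthetic data: the parameter grid, the stratified splitting, the macro averaging, and the random seeds are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data: 90 samples, 40 features, 3 classes (not study data).
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(y.size, 40))
X[:, :4] += 3.0 * y[:, None]

# 80% training / 20% testing; stratification is an assumption made here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Grid search with 5-fold cross-validation (the text uses k = 3 or 5).
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
report = {
    "mean_cv_score": search.best_score_,
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro"),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "f1": f1_score(y_test, y_pred, average="macro"),
}
matrix = confusion_matrix(y_test, y_pred)
print(search.best_params_)
print(report)
print(matrix)
```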

Finally, the built models were applied to identify the processing degree of commercial PRR samples.

Results and Discussion

Determination of the Optimal SD Cycles Based on Color and Gloss Analyses

A traditional way to determine the endpoint of the PRR processing procedure was visual assessment by skilled professionals based on the “black as lacquer” appearance of PRR.[1] Visually, the color change from DRR to PRR was obvious, but the PRRs from different SD cycles were hardly distinguishable ([Fig. 3A]). A colorimeter and a digital camera were used in this study to provide a more objective description of PRRs. As shown in [Fig. 3B], [C], the reflection ratio of PRR-RW across wavelengths decreased significantly, by 50.7 to 56.9% (p = 0.007–0.017, PRR-RW-SD₄ vs. PRR-RW-SD₅), at the 5th SD cycle and fluctuated only slightly thereafter (percent decrease ranging from −7.4 to 29.1% for PRR-RW-SD₆ to PRR-RW-SD₉); the mean brightness of PRR with RW and PRR without RW increased by 12.4% (p < 0.05, PRR-RW-SD₄ vs. PRR-RW-SD₅) and 5.4% (p > 0.05, PRR-SD₄ vs. PRR-SD₅), respectively, at the 5th SD cycle and fluctuated only slightly thereafter (percent increase ranging from −5.2 to 2.9% for PRR-RW-SD₆ to PRR-RW-SD₉ and from −3.7 to 1.7% for PRR-SD₆ to PRR-SD₉). In addition, mixing the roots back with the oily juice was found to be necessary to enhance the glossiness of PRR. The PRRs prepared without RW showed a higher brightness than the PRRs prepared with RW at the same SD cycle ([Fig. 3C]). Consequently, the changes in color and brightness were obvious before but slight after the 5th SD cycle.

Fig. 3 Dynamic changes in color and gloss, and sugar and iridoid contents of PRRs throughout the 9 processing (SD) cycles. (A) DRR and PRR samples illustration. (B) Heat map of the reflection ratio of PRR during 1 to 9 SD cycles. (C) Changes in mean brightness (mean ± standard deviation) of PRR during 1 to 9 SD cycles. (D) Changes in relative percentages of sugars and contents of iridoids of PRR during 1 to 9 SD cycles. (E) The mechanism of steaming-induced transformation of saccharides. (F) HPLC-ELSD chromatogram of fructose, mannitol, and glucose. (G) HPLC-ELSD chromatogram of sucrose, melibiose, raffinose, manninotriose, and stachyose. (H) The chemical structures of rehmannioside D and catalpol. (I) HPLC-UV chromatogram of rehmannioside D (203 nm). (J) HPLC-UV chromatogram of catalpol (210 nm). PRRs, prepared Radix Rehmanniae; SD, steaming and drying; DRR, dried Radix Rehmanniae; PRR-RW-SD_n: PRR prepared with yellow rice wine after n SD cycles (n = 1–9); PRR-SD_n, PRR prepared without yellow rice wine after n SD cycles (n = 1–9).

Determination of the Optimal SD Cycles Based on Sugar and Iridoid Contents

For the analysis of DRR and PRRs, two validated HPLC-ELSD methods (detailed in the [Supporting Information], [Table S1] and [Fig. S1], available in online version) were used for the quantitative determination of fructose, mannitol, glucose, sucrose, melibiose, raffinose, manninotriose, and stachyose, and two HPLC-UV methods recorded in the CP were used for the quantitative determination of catalpol and rehmannioside D.

As shown in [Fig. 3D], sucrose, raffinose, and stachyose decreased drastically, by 76.4, 68.7, and 70.2%, respectively, after the 1st SD cycle and were then completely converted to glucose and fructose, melibiose and fructose, and manninotriose and fructose, respectively ([Fig. 3E]), by the 2nd or 3rd SD cycle. Glucose, fructose, and manninotriose contents increased markedly, by 3.8-, 5.0-, and 9.3-fold, respectively, from DRR to PRR-RW-SD₂ and changed only slightly during the later stages (from PRR-RW-SD₃ to PRR-RW-SD₉). As a by-product of the steaming process, melibiose also showed a pronounced increase during the first two SD cycles and then remained steady over the last seven SD cycles. The sugar conversion, with cleavage of only the fructosidic bond during the steaming process, was confirmed by a parallel experiment with pure di/oligosaccharide compounds. The hydrolysis of the galactosidic bond, which was speculated by Zhou et al., was not found in this study.[9] Only the fructose side units were removed from the di/oligosaccharides because of the high reactivity of the furanosidic bonds ([Fig. 3E]). The bond opposition or angle strain in the five-membered furanose ring makes the glycosidic bond in a furanoside easier to hydrolyze than that in a pyranoside.[20]
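The conversions described above each release one fructose unit per molecule: sucrose gives glucose and fructose, raffinose gives melibiose and fructose, and stachyose gives manninotriose and fructose. A short stoichiometric sketch shows the product masses expected if hydrolysis were complete. The molar masses are standard values; the starting masses are the approximate weighed-in amounts from the parallel experiment, and the assumption of 100% conversion is for illustration only.

```python
# Standard molar masses in g/mol.
MOLAR_MASS = {
    "sucrose": 342.30, "raffinose": 504.44, "stachyose": 666.58,
    "glucose": 180.16, "fructose": 180.16,
    "melibiose": 342.30, "manninotriose": 504.44,
}

# Products of fructosidic-bond hydrolysis (one water molecule consumed each).
HYDROLYSIS = {
    "sucrose": ("glucose", "fructose"),
    "raffinose": ("melibiose", "fructose"),
    "stachyose": ("manninotriose", "fructose"),
}


def theoretical_products_mg(substrate, mass_mg, conversion=1.0):
    """Product masses (mg) for a given substrate mass and fractional conversion."""
    millimoles = mass_mg / MOLAR_MASS[substrate] * conversion
    return {product: millimoles * MOLAR_MASS[product]
            for product in HYDROLYSIS[substrate]}


# Approximate weighed-in amounts from the parallel experiment (mg).
for substrate, mass_mg in (("stachyose", 15.0), ("raffinose", 4.0), ("sucrose", 12.5)):
    products = theoretical_products_mg(substrate, mass_mg)
    summary = ", ".join(f"{name} {mg:.2f} mg" for name, mg in products.items())
    print(f"{substrate} ({mass_mg} mg) -> {summary}")
```

The product masses sum to slightly more than the substrate mass because each hydrolysis incorporates one molecule of water.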

Catalpol decreased quickly and disappeared completely between the 1st and 2nd SD cycles, whereas rehmannioside D decreased gradually over the 9 SD cycles. Catalpol degraded through hydrolysis of the glycosidic bond, ring-opening rearrangement of the hemiacetal moiety, and dehydration of the 6-OH alcohol, in sequence,[21] [22] to form furans and pyrans. Unlike in catalpol, the glycosyl substituent at C-5 of rehmannioside D could inhibit dehydration of the 6-OH alcohol, slowing its degradation. Notably, the markers specified in the CP monograph “PRR” include rehmannioside D (specified at ≥0.050%, m/m),[2] which fell below 0.050% in several PRR-SD₅ to PRR-SD₉ samples (2 PRR samples prepared from 2 batches of RR, data not shown) in this study.

Consequently, the changes in sugar and iridoid contents were obvious before but slight after the 3rd SD cycle. PRRs prepared with 3 to 5 SD cycles would therefore be suggested, both to mimic the traditional processing method and to meet the specified criteria.

Determination of the Optimal SD Cycles Based on Untargeted Metabolomic Analyses

Sample solutions contain rich chemical information and can reflect the overall changes in small molecules during SD processing. The LC-QTOF-MS^E data were processed by Progenesis QI software.[23] An unsupervised PCA[24] model obtained from the LC-MS data of all PRR-RW-SD_n or all PRR-SD_n samples revealed the general structure of the complete dataset, in which the first two PCs cumulatively accounted for 63.1 or 64.8% of the total variation, with PC1 accounting for 39.4 or 48.3% of the variance, discriminating PRRs with different SD cycles ([Fig. 4]). [Fig. 4] revealed two trends in the metabolomic profile during PRR processing, both with and without yellow RW. Over the first four SD cycles, the small-molecule profiles of PRRs differed markedly between cycles. Over the last five SD cycles, multiple replicates of PRR-RW-SD₅ to PRR-RW-SD₉ and of PRR-SD₅ to PRR-SD₉ exhibited similar metabolomic profiles (red square in [Fig. 4]). Consequently, the changes in small-molecule profiles were marked before but slight after the 4th SD cycle.

Fig. 4 PCA of LC-MS^E metabolomic profiles derived from prepared Radix Rehmanniae (PRRs) throughout the 9 processing cycles. PCA, principal component analysis; PRR-RW-SD_n, PRR prepared with yellow rice wine after n SD cycles (n = 1–9); PRR-SD_n, PRR prepared without yellow rice wine after n SD cycles (n = 1–9); SD, steaming and drying.

Thus, this study demonstrated that PRR by 3 to 5 SD cycles could reach the quality of PRR-SD₉ based on the physical and chemical properties.

RR is a typical medicinal herb with the characteristic of “different clinical uses before and after processing.” Previously published data showed that polysaccharides from PRR-SD₉ (6 hours × 9) had a better proliferation effect on rat ovarian granulosa cells than those from PRR-SD₁ (12 hours × 1).[25] A study also reported that the changes in polysaccharides were obvious before but slight after the 5th SD cycle.[26] Thus, bioactivity equivalence between PRR-SD₉ and PRR-SD₃ to PRR-SD₅ could be speculated. However, further research, including a comparative study of pharmacological action (or even clinical efficacy) between PRR-SD₉ and PRR-SD₃ to PRR-SD₅, is needed to confirm the equivalence.

Prediction of Prepared Radix Rehmanniae Processing Degree by Machine Learning

Although known samples (e.g., PRR-SD₁ to PRR-SD₅) could be classified well from LC-MS^E data with PCA, no significant difference could be observed among intensively steamed PRRs (e.g., PRR-SD₆ to PRR-SD₉). A more powerful ML model that can distinguish samples with a deep steaming degree and even predict the processing degree of unknown samples is therefore needed.

Our model training process strictly followed the established algorithm framework. As a result, we successfully developed two RF models that can accurately predict the PRR processing degree based on input data and can provide a useful tool for the PRR processing optimization and QC.

Data Processing Summary

The LC-MS^E data must first be preprocessed before they can be incorporated into an ML approach. Two datasets preprocessed by QI[27] were used to create the training, testing, and blind verification sets. The two datasets were 1 (ALL), processed data including all MS peaks (a total of 15,847 peaks) with relative abundance, RT, and m/z; and 2 (VIP), processed data including the MS signals (a total of 2,463 peaks) responsible for feature differentiation (VIP > 1 from PLS-DA analysis)[28] with relative abundance, RT, and m/z.

Machine Learning Models Selection, Training Optimization, Blind Verification, and Application

First, the three preselected ML algorithms, namely, LR, DT, and SVM, were trained and tested to evaluate the prediction accuracy using both datasets 1 (ALL) and 2 (VIP) as input. PCA, LDA, and regularization were used for dimensionality reduction to avoid overfitting, which is a common problem in ML and deep learning.[19] [29] As shown in [Table 2], the models showed only preliminary performance across key metrics for identifying the PRR processing degree. The accuracies of ALL-SVM, VIP-SVM, ALL-LR, VIP-LR, ALL-DT, and VIP-DT were 70, 70, 67, 70, 79, and 74%, respectively. However, none of these models achieved good accuracy on the blind verification set ([Supporting Information], [Fig. S2], available in online version). Among the PRR-SD₁ to PRR-SD₅ samples, 83% were correctly classified, whereas among the PRR-SD₆, PRR-SD₈, and PRR-SD₉ samples, 72% were misclassified as PRR-SD₇ (or PRR-SD₈), PRR-SD₆ (or PRR-SD₇), and PRR-SD₈, respectively.

Table 2
Model performance evaluation result
Model	Precision	Mean cross-validation score	F1-score	Recall	Accuracy
ALL-SVM	0.73	0.61	0.70	0.70	0.70
VIP-SVM	0.79	0.69	0.70	0.70	0.70
ALL-LR	0.72	0.73	0.66	0.67	0.67
VIP-LR	0.77	0.69	0.71	0.70	0.70
ALL-DT	0.79	0.78	0.76	0.78	0.79
VIP-DT	0.82	0.80	0.75	0.74	0.74

Subsequently, the RandomizedSearchCV algorithm was used to optimize the parameters of the DT model and maximize its predictive power. The selected “criterion” of “entropy” indicated a more effective information gain metric for our dataset. The “max_depth” was set to 5, the “max_features” to 1966, the “min_samples_leaf” to 5, and the “min_samples_split” to 4. The model combining RandomizedSearchCV and DT demonstrated better performance, with an accuracy of 85% on the training set. However, the prediction accuracy for PRR-SD₆ to PRR-SD₉ did not exceed 90% on the blind verification set. The DT model was thus suitable for classifying PRRs with a lighter processing degree but not PRRs with intensive processing degrees.
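A randomized hyperparameter search of this kind can be sketched with scikit-learn's `RandomizedSearchCV` and `DecisionTreeClassifier`. The parameter names below are the ones quoted in the text; the candidate values, the number of iterations, the 3-fold cross-validation, and the synthetic data are assumptions, and `max_features` is kept below the synthetic feature count rather than at the reported value of 1966.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 90 samples, 50 features, 3 classes (not study data).
rng = np.random.default_rng(7)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(y.size, 50))
X[:, :5] += 3.0 * y[:, None]

# Candidate values are illustrative; only the parameter names come from the text.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 8, None],
    "max_features": [10, 25, 50],      # must not exceed the number of features
    "min_samples_leaf": [1, 3, 5],
    "min_samples_split": [2, 4, 8],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=20,          # number of random parameter combinations to try
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.2f}")
```

Unlike an exhaustive grid, the randomized search samples a fixed number of combinations, which keeps the cost bounded when several parameters are tuned at once.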

Given the limitations of DT, we turned to RF, a tree-based ensemble method in which many DTs are built from the original dataset and each predicts a classification.[30] Indeed, model development revealed the superiority of the RF models in estimating the degree of PRR processing in this study. First, SelectKBest feature selection was applied in the ALL-RF model, with mutual information for classification specified as the scoring function. Then, the k = 100 highest-scoring features of the original feature set were selected for model training. In the evaluation, the model trained on the dataset based on all features (ALL-RF) showed much higher values than the LR, DT, and SVM models, with an average cross-validation score, accuracy, precision, recall, and F1-score of 0.93, 0.96, 0.98, 0.96, and 0.96, respectively ([Fig. 5A]). [Fig. 5B] shows the confusion matrix obtained when ALL-RF classified the training set. A total of 93% of the reference samples were classified correctly into the groups of PRRs with different processing degrees. Only two PRR-SD₈ samples were misclassified into the PRR-SD₉ group.
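
The combination of SelectKBest and RF described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the study's pipeline: the number of trees, the 5-fold cross-validation, and the choice to place feature selection inside a `Pipeline` (so that it is refit on each training fold) are all assumptions; only the mutual-information scoring function and k = 100 come from the text.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the LC-MS feature table (assumption: 200 features, 3 classes).
X, y = make_classification(
    n_samples=150,
    n_features=200,
    n_informative=15,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

pipeline = Pipeline(
    [
        # Keep the k = 100 features with the highest mutual information scores.
        (
            "select",
            SelectKBest(
                score_func=partial(mutual_info_classif, random_state=0), k=100
            ),
        ),
        # n_estimators is a hypothetical setting, not taken from the source.
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
mean_cv_score = cv_scores.mean()

# Fit on all data to inspect how many features were retained.
pipeline.fit(X, y)
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(round(mean_cv_score, 3), n_selected)
```

Running the selection step inside the cross-validation loop keeps the held-out fold from influencing which features are chosen; whether the original analysis did this is not stated.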

Fig. 5 Performance results of RF algorithms using different inputs for classifying the steaming levels of PRRs and their application for identifying unknowns. (A) Evaluation report (recall, accuracy, f1 score, precision and average cross-validation score) for ALL-RF and VIP-RF models. (B) Confusion matrix of classification of PRR samples after ALL-RF model training. (C) Confusion matrix of classification of PRR samples after VIP-RF model training. (D) Identification results of the processing degree of 10 batches of commercial PRRs. ALL-RF, the model built by combining the RF algorithm and dataset with all features; VIP-RF, the model built by combining the RF algorithm and dataset with VIP > 1 selected features; PRR-SD_n: PRR prepared without yellow rice wine after n SD cycles (n = 1 to 9); RF, random forest.

Another RF model (VIP-RF) was built from the dataset of features with VIP > 1 (from PLS-DA), with the top 80 features selected for model training. It also identified the processing degree well, with an average cross-validation score of 0.93, an accuracy of 0.93, a precision of 0.96, a recall of 0.93, and an F1-score of 0.93 ([Fig. 5A]). [Fig. 5C] shows the confusion matrix for processing degree identification with VIP-RF. Again, 93% of the reference samples were classified correctly into their groups. Only one PRR-SD₆ sample and one PRR-SD₈ sample were misclassified into the PRR-SD₇ group.

In the blind verification procedure, a total of 15 PRR samples with different processing degrees (including PRR-SD_1–9, PRR-RW-SD_1–9, and PRR-FA) were blindly prepared to verify the ALL-RF and VIP-RF models. Both models achieved 100% accuracy (an error rate of 0), making them more effective and precise than a reported RF model[8] that could only distinguish PRR samples with a light steaming degree (<18 hours) and misclassified two samples in its verification procedure. That model relied on raffinose family oligosaccharides as features, which were not sufficient to discriminate PRR with a specific steaming degree (especially a deep one) from all other PRRs. This is supported by our HPLC results, which showed no or only slight changes in sugar contents after deeper processing ([Fig. 3D]), as well as by other published reports.[9] [10] [11]

Finally, we applied the ALL-RF and VIP-RF models to identify the processing degree of 10 commercial samples obtained from the market. As shown in [Fig. 5D], seven batches were identified as samples steamed for 0 to 12 hours (equivalent to PRR-SD₁), two batches as 12 to 24 hours (equivalent to PRR-SD₂), and one batch as 0 to 24 hours (equivalent to PRR-SD_1,2). The results reflect the fact[31] that most PRRs on the market are not steamed or processed intensively enough and cannot reach the quality of traditionally made PRR. PRR is a typical negative example, as it is usually manufactured by a simplified or unimplemented processing procedure.[31] In this sense, methodologies that give a more complete picture of the features of traditionally made PRRs may play a significant role in QC and standard establishment. When establishing standards for TCM decoction pieces, it is essential to study the experience, techniques, and traditions of processing, and then to identify the key processing-related factors that affect the quality of decoction pieces.[31] [32] State-of-the-art approaches, such as LC-MS analysis combined with ML, could be a good tool for bridging the gap between the tradition and the modernization of TCM.

Conclusion

Our results demonstrate dynamic changes in the color and gloss, sugar and iridoid contents, and metabolomic profile of PRRs throughout the nine processing (steaming and drying) cycles. All these physical and chemical characteristics tend toward a steady state after the 3rd to 5th SD cycles, which could therefore be the optimal number of SD cycles for approaching the traditional 9-SD-cycle processing procedure. Notably, our view that PRR made with 3 to 5 SD cycles is qualitatively equivalent to PRR made with 9 SD cycles is in good agreement with the ancient record in QJYF (Tang Dynasty, A.D. 682). Understanding these dynamics could lead to improved processing strategies, enhancing both the efficacy and quality of PRR.

Moreover, this study illustrates the potential of LC-MS^E data combined with RF algorithms to identify the processing degree of PRR unknowns. Chemical signatures of PRRs with different processing degrees, acquired by LC-MS^E analysis, can be subjected to MSA with predictors based on two RF models (ALL-RF and VIP-RF) to predict the processing degree of PRR unknowns at an error rate of 0, surpassing the accuracy of previously reported models. As an alternative to steaming degree determination by processing experts based on sensory characteristics (color and flavor), our models can be good tools for QC in PRR manufacture and supervision, owing to their high capacity and accuracy in identifying the processing degree of PRR unknowns over a much wider range of steaming time (0–78 hours).

Consequently, this work could be a step from traditional processing toward a controlled process, and may even open perspectives for industrialization.

Supporting Information

This section includes the experiment procedure for quantitative analysis of rehmannioside D and catalpol in DRR or PRR; validation of the HPLC-ELSD method for quantification of 8 sugars in DRR or PRR; and sample preparation for LC-QTOF-MS^E analysis.

Also included are the method validation for the quantitative determination of fructose, mannitol, glucose, sucrose, melibiose, raffinose, manninotriose, and stachyose ([Table S1], available in online version); chromatographic profiles of the 3 monosaccharides and the 5 di/oligosaccharides ([Fig. S1], available in online version only); and a box plot of the blind verification accuracy obtained by the ML algorithms for identifying the processing levels of PRR ([Fig. S2], available in online version only).

References
1 Li M, Jiang H, Hao Y. et al. A systematic review on botany, processing, application, phytochemistry and pharmacological action of Radix Rehmnniae. J Ethnopharmacol 2022; 285: 114820
2 National Pharmacopoeia Commission. Pharmacopoeia of the People's Republic of China. Beijing: China Medical Science Press; 2020: 129-130
3 Henan Provincial Drug Administration. Henan province specifications for processing of TCM (2022 Edition). Zhengzhou: Henan Science and Technology Press; 2022: 138-139
4 Xie Y, Zhong LY, Wang Z. et al. Historical evolution and modern research progress of Rehmanniae Radix. Zhongguo Shiyan Fangjixue Zazhi 2022; 24: 273-282
5 Li SL, Song JZ, Qiao CF. et al. A novel strategy to rapidly explore potential chemical markers for the discrimination between raw and processed Radix Rehmanniae by UHPLC-TOFMS with multivariate statistical analysis. J Pharm Biomed Anal 2010; 51 (04) 812-823
6 Boccard J, Kalousis A, Hilario M. et al. Standard machine learning algorithms applied to UPLC-TOF/MS metabolic fingerprinting for the discovery of wound biomarkers in Arabidopsis thaliana. Chemom Intell Lab Syst 2010; 104: 20-27
7 Li Y, Fan J, Cheng X. et al. New Revolution for quality control of TCM in Industry 4.0: focus on artificial intelligence and bioinformatics. TrAC Trends Anal Chem 2024; 181: 118023
8 Li H, Zhang S, Zhao Y, He J, Chen X. Identification of raffinose family oligosaccharides in processed Rehmannia glutinosa Libosch using matrix-assisted laser desorption/ionization mass spectrometry image combined with machine learning. Rapid Commun Mass Spectrom 2023; 37 (22) e9635
9 Zhou L, Xu JD, Zhou SS. et al. Integrating targeted glycomics and untargeted metabolomics to investigate the processing chemistry of herbal medicines, a case study on Rehmanniae Radix. J Chromatogr A 2016; 1472: 74-87
10 Zhou L. Holistic evaluation on quality and efficacy of Rehmanniae Radix Praeparata in “nine cycles of steaming and drying” processing [in Chinese]. [Master's thesis]. Nanjing: Nanjing University of Chinese Medicine; 2017
11 Li Y. Processing technology and mechanism of “repeated steaming and air-exposing” of Rehmanniae Radix Praeparata [in Chinese]. [Master's thesis]. Jinan: Shandong University; 2023
12 Su C, Chen D. A chromometer that can inspect and measure aperture sizes. CN Patent 222069915U. November, 2024
13 Li C, Guo Y, Dong S, Hu Y, Zhang F. Dynamic range adjustment method of the aerospace camera based on histogram distribution. Spacecr Recover Remote Sens 2017; 38: 36-43
14 Gao Q, Jiang H, Tang F. et al. Evaluation of the bitter components of bamboo shoots using a metabolomics approach. Food Funct 2019; 10 (01) 90-98
15 Morales EF, Escalante HJ. Chapter 6 - A brief introduction to supervised, unsupervised, and reinforcement learning. In: Torres-García AA, Reyes-García CA, Villaseñor-Pineda L, Mendoza-Montoya O. eds. Biosignal Processing and Classification Using Computational Learning and Intelligence. Academic Press; 2022: 111-129
16 Dalal N, Sáiz MJ, Caporale AG, Baldini F, Babayan SA, Adamo P. Fishy forensics: FT-NIR and machine learning based authentication of Mediterranean anchovies (Engraulis encrasicolus). J Food Compos Anal 2024; 136: 106847
17 Saeed MH, Hama JI. Cardiac disease prediction using AI algorithms with SelectKBest. Med Biol Eng Comput 2023; 61 (12) 3397-3408
18 Brownlee J. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery; 2020: 158-163
19 Sha Y, Jiang M, Luo G. et al. HerbMet: enhancing metabolomics data analysis for accurate identification of Chinese herbal medicines using deep learning. Phytochem Anal 2025; 36 (01) 261-272
20 Shafizadeh F. Acidic hydrolysis of glycosidic bonds. Tappi 1963; 46: 381-383
21 Xue S, Fu Y, Sun X, Chen S. Changes in the chemical components of processed rehmanniae radix distillate during different steaming times. Evid Based Complement Alternat Med 2022; 2022: 3382333
22 Yang J, Zhang L, Zhang M. et al. Exploration of the dynamic variations of the characteristic constituents and the degradation products of catalpol during the process of Radix Rehmanniae. Molecules 2024; 29 (03) 705
23 Liao J, Zhang Y, Zhang W. et al. Different software processing affects the peak picking and metabolic pathway recognition of metabolomics data. J Chromatogr A 2023; 1687: 463700
24 Ringnér M. What is principal component analysis? Nat Biotechnol 2008; 26 (03) 303-304
25 Lin H, Gui SH, Yu BB, Que XH, Zhu JQ. Analysis of polysaccharide monosaccharides of Radix Rehmanniae by different processing processes and their effects on ovarian granulosa cells. Zhongchengyao 2019; 12: 2958-2963
26 Jia H, Zhang WF, Lei JW, Li YY, Yang CJ, Fan KF. UV combined with MIR spectroscopy to discuss the dynamic changes of sugar during the processing of Rehmannia glutinosa. Lishizhen Med Materia Medica Res 2023; 2023: 96-99
27 Wang XC, Ma XL, Liu JN. et al. A comparison of feature extraction capabilities of advanced UHPLC-HRMS data analysis tools in plant metabolomics. Anal Chim Acta 2023; 1254: 341127
28 Tamrakar S, Huerta B, Chung-Davidson YW, Li W. Plasma metabolomic profiles reveal sex- and maturation-dependent metabolic strategies in sea lamprey (Petromyzon marinus). Metabolomics 2022; 18 (11) 90
29 Ponce de Leon-Sanchez ER, Dominguez-Ramirez OA, Herrera-Navarro AM, Rodriguez-Resendiz J, Paredes-Orta C, Mendiola-Santibañez JD. A deep learning approach for predicting multiple sclerosis. Micromachines (Basel) 2023; 14 (04) 749
30 Benes E, Bajusz D, Gere A, Fodor M, Rácz A. Comprehensive chemometric classification of snack products based on their near infrared spectra. Lebensm Wiss Technol 2020; 133: 110130
31 Wang Q, Zhao YX, Gu J. et al. Establishment of traditional Chinese medicine standards reflecting the quality characteristics of Chinese herbal pieces based on processing. Chung Kuo Yao Hsueh Tsa Chih 2025; 40: 114-120
32 Xue R, Zhang Q, Mei X. et al. Research on quality marker based on the processing from Aconiti lateralis radix praeparata to Heishunpian. Phytochem Anal 2024; 35 (06) 1443-1456

Figures

Fig. 1 Samples. DRR, dried Radix Rehmanniae; PRR, prepared Radix Rehmanniae; RW, yellow rice wine; FA, Fructus Amomi; SD, steaming and drying; ML, machine learning.

Fig. 2 A flow chart of sample preparation, data preprocessing, and machine learning. RT, retention time; LR, logistic regression; DT, decision tree; SVM, support vector machine; RF, random forest.

Fig. 3 Dynamic changes in color and gloss, and sugar and iridoid contents of PRRs throughout the 9 processing (SD) cycles. (A) DRR and PRR samples illustration. (B) Heat map of the reflection ratio of PRR during 1 to 9 SD cycles. (C) Changes in mean brightness (mean ± standard deviation) of PRR during 1 to 9 SD cycles. (D) Changes in relative percentages of sugars and contents of iridoids of PRR during 1 to 9 SD cycles. (E) The mechanism of steaming-induced transformation of saccharides. (F) HPLC-ELSD chromatogram of fructose, mannitol, and glucose. (G) HPLC-ELSD chromatogram of sucrose, melibiose, raffinose, manninotriose, and stachyose. (H) The chemical structures of rehmannioside D and catalpol. (I) HPLC-UV chromatogram of rehmannioside D (203 nm). (J) HPLC-UV chromatogram of catalpol (210 nm). PRRs, prepared Radix Rehmanniae; SD, steaming and drying; DRR, dried Radix Rehmanniae; PRR-RW-SD_n: PRR prepared with yellow rice wine after n SD cycles (n = 1–9); PRR-SD_n, PRR prepared without yellow rice wine after n SD cycles (n = 1–9).

Fig. 4 PCA of LC-MS^E metabolomic profiles derived from prepared Radix Rehmanniae (PRRs) throughout the 9 processing cycles. PCA, principal component analysis; PRR-RW-SD_n, PRR prepared with yellow rice wine after n SD cycles (n = 1–9); PRR-SD_n, PRR prepared without yellow rice wine after n SD cycles (n = 1–9); SD, steaming and drying.