dc.description.abstract | This study examines the optimization of the C parameter, defined as the inverse of the penalty parameter λ, in the
LASSO logistic regression model by comparing three cross-validation methods: KFold,
Stratified KFold (SKF), and Repeated Stratified KFold (RSKF). Three ranges of C values
(0.01–0.1, 0.0001–0.0251, and 0.1–316.2) were evaluated using log loss as the primary
metric and F1-score as a secondary measure, while also considering the number of
selected variables. The results show that the degree of variable selection at the optimal C value
differs across datasets. On the original dataset, C = 0.1 selected 8 out of 11 variables with
an F1-score of 0.911 and a log loss of 0.303. For simulated data I (n = 150), C = 0.1
retained all 11 variables (no variables were eliminated) with an F1-score of 0.837. On simulated data
II (n = 550), C = 0.0251 selected 11 of 15 variables (removing 4 noise variables) with
an F1-score of 0.845. For simulated data III (n = 1500), the same C value selected 16
of 20 variables, eliminating 4 noise variables, with an F1-score of 0.884. The findings
indicate that Stratified KFold provides the most stable results for imbalanced data. The
range with the smallest C values (0.0001–0.0251) was effective in filtering out noise variables, whereas
the ranges with larger C values tended to retain more variables. A small difference between training and
testing F1-scores (< 0.05) suggests stable models. | en_US |