A comprehensive analysis of stroke risk factors and development of a predictive model using machine learning approaches
Stroke is a leading cause of death and disability globally, particularly in China. Identifying stroke risk factors at an early stage is critical to improving patient outcomes and reducing the overall disease burden. However, the complexity of these risk factors calls for advanced approaches to achieve accurate prediction. This study aimed to identify key risk factors for stroke and to develop a predictive model using machine learning techniques to enhance early detection and improve clinical decision-making. Data from the China Health and Retirement Longitudinal Study (2011-2020) were analyzed, with participants classified according to baseline characteristics. We evaluated correlations among 12 chronic diseases and applied machine learning algorithms to identify stroke-associated parameters. The dose-response relationship between each of these parameters and stroke was assessed using restricted cubic splines within Cox proportional hazards models. A refined predictive model incorporating age, sex, and the key risk factors was then developed. Stroke patients were significantly older (mean age 69.03 years) and included a higher proportion of women (53%) than participants without stroke. Stroke patients were also more likely to live in rural areas, be unmarried, smoke, and have various comorbid diseases. Although the 12 chronic diseases were significantly correlated with one another (p < 0.05), the correlation coefficients were generally weak (r < 0.5). Machine learning identified nine parameters associated with stroke risk: TyG-WC, WHtR, TyG-BMI, TyG, TMO, CysC, CREA, SBP, and HDL-C. Of these, TyG-WC, WHtR, TyG-BMI, TyG, CysC, CREA, and SBP showed a positive dose-response relationship with stroke risk, whereas TMO and HDL-C were associated with reduced stroke risk.
In the fully adjusted model, elevated CysC (HR = 2.606, 95% CI 1.869-3.635), CREA (HR = 1.819, 95% CI 1.240-2.668), and SBP (HR = 1.008, 95% CI 1.003-1.012) were significantly associated with increased stroke risk, while higher HDL-C (HR = 0.989, 95% CI 0.984-0.995) and TMO (HR = 0.99995, 95% CI 0.99994-0.99997) were associated with lower risk. A nomogram model incorporating age, sex, and the identified parameters showed better predictive accuracy than any individual predictor, with a significantly higher Harrell's C-index. This study identifies several significant stroke risk factors: CREA, CysC, SBP, TyG-BMI, TyG, TyG-WC, and WHtR were positively associated with stroke risk, whereas TMO and HDL-C were inversely associated. It also presents a predictive model that can enhance early detection of high-risk individuals and serve as a decision-support resource for clinicians, facilitating more effective prevention and treatment strategies and ultimately improving patient outcomes.
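Because the abstract compares models by Harrell's C-index, a minimal sketch of how that statistic is computed may help readers interpret the comparison. It is an illustration under stated assumptions, not the authors' code: the function name `harrell_c_index` and the five-subject sample are hypothetical, the data are not drawn from the China Health and Retirement Longitudinal Study, and pairs with tied follow-up times are simply skipped.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times       -- follow-up time for each subject
    events      -- 1 if the event (e.g. stroke) was observed, 0 if censored
    risk_scores -- model output; a higher score means a higher predicted risk

    A pair is comparable when the subject with the shorter follow-up
    actually had the event. The pair is concordant when that subject
    also has the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            # Simplifying assumption of this sketch: skip tied times.
            continue
        # Let `a` be the subject with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            # The earlier subject was censored, so the order of the
            # two event times is unknown and the pair is not comparable.
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data for five subjects (not study data).
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.7, 0.3, 0.1]

c_index = harrell_c_index(times, events, risk_scores)
print(f"C-index: {c_index:.3f}")  # 7 of 8 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of subjects by risk, which is why a higher C-index for the combined nomogram than for any single parameter is read as better discrimination.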
Dose–response relationship; Feature parameters; Machine learning; Nomogram predictive modelling; Risk factors; Stroke.
© 2025. The Author(s).
Declarations. Conflict of interest: The authors declare no conflict of interest. Informed consent: Not applicable. Institutional review board: Not applicable.