Abstract: Objective: Based on lymphocyte subset counts, diagnostic models were constructed with several machine learning methods to distinguish non-tuberculous mycobacterial pulmonary disease (NTM-PD), pulmonary tuberculosis (PTB), and other common confounding pulmonary diseases, and thereby provide a scientific basis for the early identification of infectious pulmonary diseases. Methods: Patients diagnosed with active tuberculosis (ATB), NTM-PD, or other pulmonary diseases (including inflammatory and neoplastic conditions) who were admitted to the Department of Tuberculosis at Shanghai Pulmonary Hospital from January to December 2023 were included. Lymphocyte subset counts were measured by flow cytometry. Four machine learning algorithms (multinomial logistic regression, naive Bayes, random forest, and XGBoost) were used for model development. Hyperparameters were tuned with Bayesian optimization and cross-validation. Variables with P<0.1 in univariate analysis were selected and further refined by correlation analysis and LASSO to form the final model input. The models were evaluated on the test set using the area under the receiver operating characteristic curve (AU-ROC), the average precision of the precision-recall curve (AP-PR), and decision curve analysis (DCA). Results: A total of 1383 patients were included, with 836 cases in the ATB group, 254 in the NTM group, and 293 in the OTHER group. Using selected demographic data, comorbidities, and lymphocyte subset indices as input variables and disease category as the outcome variable, four machine learning models were successfully constructed.
Among them, the random forest model demonstrated the best predictive performance. The top contributing variables in the models were body mass index (BMI), CD3+ T cells, CD16+56+ NK cells, CD8+ T cells (cytotoxic T cells), age, %CD3+ T cells, CD19+ B cells, CD4+ T cells (helper T cells), gender, anemia, diabetes, leukopenia, hypoproteinemia, and autoimmune disease; of these, BMI, CD3+ T cells, CD16+56+ NK cells, and CD8+ T cells (cytotoxic T cells) contributed the most. Conclusion: The machine learning models developed in this study successfully differentiated ATB, NTM-PD, and other pulmonary diseases by integrating lymphocyte subset profiles with clinical features. These models provide novel approaches for the early diagnosis and personalized management of pulmonary diseases.
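
Two of the evaluation measures named in the Methods, AU-ROC and decision curve analysis, can be illustrated for a three-class problem by scoring one class against the rest. The following is a minimal sketch only, not the authors' code: the function names `auroc_one_vs_rest` and `net_benefit` are hypothetical, the toy labels and probabilities are invented for illustration and are not study data, and the one-vs-rest treatment is an assumption, since the abstract does not state how the multiclass metrics were computed. The net benefit uses the standard decision-curve definition, TP/n - (FP/n) * pt/(1 - pt), at threshold probability pt.

```python
from typing import Sequence


def auroc_one_vs_rest(labels: Sequence[str], scores: Sequence[float], positive: str) -> float:
    """AUROC for one class against all others.

    Computed as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def net_benefit(labels: Sequence[str], scores: Sequence[float], positive: str, threshold: float) -> float:
    """Net benefit of treating every case whose score is >= threshold as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == positive)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y != positive)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Invented toy data: true group and a model's predicted probability of NTM.
toy_labels = ["ATB", "ATB", "NTM", "OTHER", "NTM", "ATB"]
toy_ntm_prob = [0.1, 0.7, 0.8, 0.2, 0.6, 0.3]

ntm_auroc = auroc_one_vs_rest(toy_labels, toy_ntm_prob, "NTM")
ntm_net_benefit = net_benefit(toy_labels, toy_ntm_prob, "NTM", 0.5)
print(ntm_auroc, ntm_net_benefit)
```

Repeating the net benefit calculation over a range of thresholds, and comparing it with the "treat all" and "treat none" strategies, gives the decision curve for that class.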