Risk factors related to the severity of COVID-19 in Wuhan

Objective: To evaluate the characteristics at admission of patients with moderate COVID-19 in Wuhan and to explore risk factors associated with a severe prognosis, for use in prognostic prediction. Methods: In this retrospective study, moderate and severe disease were defined according to the report of the WHO-China Joint Mission on COVID-19. Clinical characteristics and laboratory findings of 172 patients with laboratory-confirmed moderate COVID-19 were collected when they were admitted to the Cancer Center of Wuhan Union Hospital between February 13, 2020 and February 25, 2020. This cohort was followed to March 14, 2020. Patients were categorized into two groups by outcome: discharged as mild cases or developed into severe cases. The data were compared and analyzed with univariate logistic regression to identify the features that differed significantly between the two groups. Based on machine learning algorithms, a further feature selection procedure was performed to identify the features that contribute the most to the prediction of disease severity. Results: Of the 172 patients, 112 were discharged as mild cases, and 60 developed into severe cases. Four clinical characteristics and 18 laboratory findings showed significant differences between the two groups in the statistical tests (P<0.01) and univariate logistic regression analysis (P<0.01). In the further feature selection procedure, six features were chosen to obtain the best performance in discriminating the two groups with a linear kernel support vector machine. The mean accuracy was 91.38%, with a sensitivity of 0.90 and a specificity of 0.94. The six features included interleukin-6, high-sensitivity cardiac troponin I, procalcitonin, high-sensitivity C-reactive protein, chest distress and calcium level.
Conclusions: With the data collected at admission, the combination of one clinical characteristic and five laboratory findings contributed the most to the discrimination between the two groups with a linear kernel support vector machine classifier. These factors may be risk factors that can be used to perform a prognostic prediction regarding the severity of the disease for patients with moderate COVID-19 in the early stage of the disease.


Introduction
COVID-19 was initially reported in Wuhan, China, in December 2019 and rapidly spread to all other provinces in China and throughout the world [1,2]. Without specific treatment or prevention options for COVID-19, such as targeted antiviral drugs and vaccines, China has focused on isolation, quarantine, social distancing, and community containment to contain the outbreak [3]. By May 18, 2020, there were 84,494 confirmed cases of COVID-19 and 4,645 deaths in China and 4,534,327 confirmed cases and 307,202 deaths outside of China [4]. The pandemic of COVID-19 has raised wide public concern and imposed a heavy burden on global health care systems because approximately 15-20% of patients develop severe interstitial pneumonia [5]. In addition, COVID-19 patients admitted to ICUs experienced higher mortality (38%) than non-ICU patients (4%) [6]. A mortality rate of 50-60% was reported in patients developing ARDS and requiring invasive mechanical ventilation therapy in the ICU [7].
Patients with COVID-19 were divided into mild, moderate, severe, and critical cases [8]. Because of the high mortality rate in severe or critical patients [6,7], early identification of patients' risk of developing into severe or critical cases is important so that patients with a poor prognosis can receive timely intervention and minimize the progression of the disease [9]. Therefore, prognostic tools and biomarkers are urgently needed [10,11]. However, most studies have focused mainly on identifying the factors related to death and recovery [12-14]. Although some prognostic information has been revealed by using univariate or multivariate analyses based on prior clinical knowledge or evidence [9,15-21], these studies have not paid enough attention to feature selection in multivariate prognostic prediction modeling. As a result, the published prognostic prediction tools may not make the most of patient data to perform prognostic prediction modeling.
This study built a multivariate prognostic prediction model to predict the risk of developing severe cases among patients with moderate COVID-19. With the patients' characteristics at admission and outcomes, a feature selection procedure based on machine learning algorithms was conducted to identify the features contributing the most to distinguishing between the two groups. These features were then chosen as the risk factors on which to build the prognostic prediction model. We believe that this multivariate prognostic prediction tool will be of considerable value for patients with moderate COVID-19 in isolation or self-quarantine so that they can receive timely intervention and active intensive care to minimize progression of the disease and so that health care agencies can prioritize their services, especially in resource-constrained areas.

Materials and methods
The ethics committees of Union Hospital, Tongji Medical College, Huazhong University of Science and Technology approved this retrospective study. The requirement for informed consent was waived. This study was conducted in accordance with the Declaration of Helsinki.

Patients
In this retrospective, single-center study, the electronic record system was searched for data on patients admitted to the Cancer Center of Wuhan Union Hospital between February 13, 2020 and February 25, 2020. According to the report of the WHO-China Joint Mission on COVID-19, patients with COVID-19 were divided into mild (laboratory confirmed, without pneumonia), moderate (laboratory confirmed and with pneumonia), severe (dyspnea, respiratory frequency ≥30 breaths per minute, oxygen saturation (SpO2) ≤93%, PaO2/FiO2 ratio <300, and/or lung infiltrates >50% of the lung field within 24-48 hours), and critical (respiratory failure requiring mechanical ventilation, shock, or other organ failure that requires intensive care) cases [8]. Our institution was a designated hospital capable of receiving patients with moderate, severe and critical cases of COVID-19. The inclusion criteria in this study were as follows: (1) patients with laboratory-confirmed COVID-19 according to viral nucleic acid detection using RT-PCR with samples from pharyngeal swabs; (2) patients who underwent complete laboratory tests (routine blood tests, biochemistry analysis, cytokine tests, immunology tests, and lymphocyte (L) subset tests) and clinical recording at admission; and (3) patients diagnosed with moderate COVID-19 at admission. The flow diagram of the exclusion criteria is shown in Figure 1.
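The four severity categories quoted above can be read as a simple decision rule. The following is a minimal sketch of that reading, not the hospital's actual triage logic: the `Presentation` record and its field names are hypothetical, and treating the "and/or" in the severe definition as "any single criterion suffices" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Presentation:
    """Hypothetical record of the findings used by the severity definitions."""
    lab_confirmed: bool
    pneumonia: bool
    dyspnea: bool = False
    resp_rate: Optional[float] = None      # breaths per minute
    spo2: Optional[float] = None           # oxygen saturation, percent
    pf_ratio: Optional[float] = None       # PaO2/FiO2 ratio
    infiltrates_over_50pct: bool = False   # >50% of lung field within 24-48 hours
    needs_ventilation: bool = False        # respiratory failure requiring ventilation
    shock: bool = False
    other_organ_failure: bool = False      # organ failure requiring intensive care


def classify_severity(p: Presentation) -> str:
    """Return 'mild', 'moderate', 'severe' or 'critical' for a confirmed case."""
    if not p.lab_confirmed:
        return "not confirmed"
    # Critical findings take precedence over everything else.
    if p.needs_ventilation or p.shock or p.other_organ_failure:
        return "critical"
    # Assumption: any one of the listed severe criteria is sufficient.
    severe = (
        p.dyspnea
        or (p.resp_rate is not None and p.resp_rate >= 30)
        or (p.spo2 is not None and p.spo2 <= 93)
        or (p.pf_ratio is not None and p.pf_ratio < 300)
        or p.infiltrates_over_50pct
    )
    if severe:
        return "severe"
    # Pneumonia separates moderate from mild among confirmed cases.
    return "moderate" if p.pneumonia else "mild"
```

Missing measurements are left as `None` and simply do not trigger a criterion, which keeps the rule usable when only part of the work-up is available.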

Treatments
The number of patients with COVID-19 is growing rapidly worldwide, and no specialized treatment was available in the early stage of the global outbreak. Patients in this study were moderate cases, and their treatments followed the therapeutic principles based on the 2019-nCoV guidelines (Trial Version 5) proposed by the China National Health Commission [22]. The basic treatment included antiviral treatment (arbidol 200 mg three times daily, orally), antibacterial treatment (moxifloxacin 400 mg once daily, orally), recombinant human interferon α2b (aerosol inhalation) and symptomatic treatment. Some of the moderate cases were treated with traditional Chinese medicine. However, they were not included in this study, as shown in Figure 1.

Data collection
This cohort was followed to March 14, 2020. Patient data were obtained at admission, including demographics, comorbidities, signs and symptoms, and laboratory findings. The assessed comorbidities included hypertension, cardiovascular disease, diabetes, malignancy, cerebrovascular disease, COPD, chronic kidney disease, chronic liver disease, HIV infection, rheumatic disease and hyperuricemia. The laboratory findings were obtained through the complete laboratory tests mentioned earlier. Finally, 172 patients with moderate COVID-19 at admission were included in this study. Of these, 112 were discharged as mild cases, whereas 60 developed into severe or critical cases.

Statistical analysis and prognostic prediction modeling
Descriptive statistics were used to describe the demographics, comorbidities, signs and symptoms, and laboratory findings of the 172 moderate cases. Between the two groups, categorical data were compared by using the chi-squared test (Fisher's exact test if the expected count was fewer than 5 for at least one cell). Continuous variables were compared using the independent-samples t-test (Mann-Whitney U test if the data were not normally distributed). Univariate logistic regression models were also built to identify the potential risk factors related to a severe prognosis of COVID-19. The features that differed significantly between the two groups in the above statistical tests (P<0.01) and were also significant in univariate logistic regression (P<0.01) were chosen as candidates for further processing as follows.
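The rule for choosing between the chi-squared test and Fisher's exact test depends on the expected cell counts of the contingency table. As an illustration only, the sketch below computes those expected counts and applies the stated threshold of 5. The function names are hypothetical, and the example counts in the usage note are invented for demonstration, not taken from the study tables.

```python
from typing import List, Sequence


def expected_counts(table: Sequence[Sequence[int]]) -> List[List[float]]:
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def choose_categorical_test(table: Sequence[Sequence[int]]) -> str:
    """Pick Fisher's exact test if any expected count is below 5."""
    smallest = min(min(row) for row in expected_counts(table))
    return "fisher_exact" if smallest < 5 else "chi_squared"
```

For a hypothetical symptom present in 30 of 60 severe cases and 40 of 112 mild cases, `choose_categorical_test([[30, 30], [40, 72]])` selects the chi-squared test, whereas a rare hypothetical finding such as `[[2, 58], [1, 111]]` falls back to Fisher's exact test because one expected count is close to 1.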
First, multivariate logistic regression analysis with L1 regularization was performed with feature standardization. The regression aimed to identify a subset of the aforementioned candidate features that could contribute the most to the discrimination between the two groups [23]. A parameter sweep over the regularization parameter C was performed. Second, an SVM classifier with a linear kernel was adopted to measure the prediction performance of the top k features from this subset. The top k features were selected according to their coefficients in the previous multivariate logistic regression (k ranged from 1 to the size of the subset). Finally, the top k features with the highest 5-fold SVM classification accuracy were chosen as the most predictive features to perform the clinical prognostic prediction modeling with a linear kernel SVM model. The predicted probability (p_i) of a moderate case developing into a severe one was calculated with the following equation. For the ith individual, x_ik was the kth indicator variable in the prognostic prediction model, and a_k was the weight for the kth feature. a_0 was the intercept.
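The second step of the procedure, choosing k by cross-validated accuracy over nested top-k subsets, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: features are ranked by the absolute value of their L1-regularized coefficients (the text says only "according to their coefficients"), features shrunk to zero are dropped, and `cv_accuracy` is a hypothetical caller-supplied function that would return, for example, the mean 5-fold accuracy of a linear kernel SVM trained on the given features.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_features(coefficients: Dict[str, float]) -> List[str]:
    """Order features with non-zero coefficients by descending |coefficient|."""
    kept = {name: c for name, c in coefficients.items() if c != 0.0}
    return sorted(kept, key=lambda name: abs(kept[name]), reverse=True)


def select_top_k(
    ranked: Sequence[str],
    cv_accuracy: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Evaluate the top-1, top-2, ... subsets and keep the most accurate one."""
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        score = cv_accuracy(subset)
        # Strict comparison: on ties the smaller subset is kept.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Passing the scorer in as a function keeps the selection loop independent of any particular classifier library, so the same loop can be reused with a different model or a different number of folds.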

Clinical characteristics
The basic information about the two groups is summarized in Table 1. The median age of the patients was 65 years (IQR 57-71 years). Patients in the severe group were significantly older than those in the mild group (P<0.001), with an average age of 70.6 years (SD 11.6) in the severe group and a median age of 64 years (IQR 50-67) in the mild group. Among the 172 cases, 52.3% of the patients were female. The proportion of females was higher in the mild group than in the severe group (P=0.01). Comorbidities were present in 55.2% of the patients, but the difference between the two groups was not significant. Hypertension and diabetes were the most common comorbidities. For each kind of comorbidity, there was no significant difference between the two groups.
As shown in Table 2, 15 signs or symptoms were recorded in these moderate cases at admission. Fever, dry cough, and fatigue were the most common initial symptoms (63.4%, 55.2%, and 58.1%, respectively). Fever and dry cough showed no significant difference between the two groups, whereas fatigue differed significantly. Chest distress and anorexia were significantly more common in the severe group than in the mild group (P<0.001). In the univariate logistic regression analysis, age, chest distress, fatigue, and anorexia were significantly associated with the outcomes (P<0.01), as shown in Table 4.

Laboratory findings
Laboratory findings on hospital admission are summarized in Table 3. Among the 172 patients, 19 laboratory findings showed significant differences between the two groups. Patients in the severe group demonstrated significantly increased WBC, N count, N percentage, AST, LDH, TNI, CK, CK-MB, CysC, ESR, CRP, PCT, IL-6, and IL-10 but significantly decreased L count, L percentage, A/G, ALB, and Ca (P<0.01). In the univariate logistic regression analysis, all of these laboratory findings except CK-MB were also significantly associated with the outcomes (P<0.01), as shown in Table 4.

Prediction model for severe prognosis
Based on the results of statistical analysis and univariate logistic regression, four clinical characteristics and 18 laboratory findings differed significantly between the two groups (P<0.01) and were significant in univariate logistic regression (P<0.01), as shown in Table 4. In the further feature selection procedure, these features were used to perform multivariate logistic regression with L1 regularization and feature standardization. With the regularization parameter C=3.999, 17 features were finally selected, as shown in Table 4. Eventually, the top six features ranked by regression coefficients were chosen as the most predictive features with which to build a prognostic prediction model for severity for patients with moderate COVID-19. The highest prediction accuracy of 91.38% was reached with the selected six features and a linear kernel SVM with 5-fold cross validation, as shown in Figure 2. The SVM model for prognostic prediction had a sensitivity of 0.90 and a specificity of 0.94, with a mean area under the ROC curve of 0.94, as shown in Figure 3. In the formula for the multivariate prognostic prediction model, the intercept a_0 was -0.14, and the feature weights for the six most predictive features are shown in Table 4.
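For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity and accuracy follow from the counts of a confusion matrix. It assumes that the severe group is the positive class, which the text does not state explicitly, and the function name is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from confusion-matrix counts.

    tp/fn: severe cases predicted as severe/mild (assumed positive class).
    tn/fp: mild cases predicted as mild/severe.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present to compute the metrics")
    return {
        "sensitivity": tp / (tp + fn),   # share of severe cases detected
        "specificity": tn / (tn + fp),   # share of mild cases correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

As a purely hypothetical example, a test fold with 9 of 10 severe cases and 17 of 18 mild cases classified correctly gives `classification_metrics(tp=9, fn=1, tn=17, fp=1)`, that is, a sensitivity of 0.90 and a specificity of about 0.94. Under 5-fold cross validation such values would be computed per fold and then averaged.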

Discussion
The number of patients infected with COVID-19 is still increasing rapidly worldwide. However, specialized and effective treatment is not yet available. Therefore, the early identification of a moderate case's risk of developing into a severe or critical one is of great importance because of the high mortality rate among severe and critical cases [6,7]. Timely intervention and active intensive care can help to minimize the progression of the disease. Prognostic prediction for moderate cases therefore deserves attention. We enrolled 172 patients hospitalized with laboratory-confirmed COVID-19 and diagnosed with moderate COVID-19 at admission. The patient data were systematically analyzed. As a result, six risk factors were identified as the most predictive ones with which to perform prognostic prediction of severity for patients with moderate COVID-19.
Consistent with previous research results, patients in the severe group were significantly older than those in the mild group. Older patients presented more comorbidities and were more likely to develop severe or critical COVID-19 [5,6,25,26]. Males accounted for 61.7% of the severe cases, a significantly higher proportion than females (38.3%), as reported in a recent study [9]. Several studies concerning comorbid disease with COVID-19 suggested that adequate attention should be paid to comorbidity [18,26,27]. In our study, the most common comorbidity was hypertension (36.6%, 63/172), followed by diabetes (15.7%, 27/172) and cardiovascular disease (9.9%, 17/172). There were no significant differences in comorbidities between the two groups, possibly because of the limited sample size. Among the moderate cases in this study, the frequencies of chest distress, fatigue, and anorexia differed significantly between the two groups, which was consistent with previous reports [28,29].
In addition, among the laboratory findings of the moderate cases at admission, significantly higher levels of WBC, N count, N percentage, AST, LDH, TNI, CK, CK-MB, CysC, ESR, CRP, PCT, IL-6, and IL-10 were found in the severe group compared with the mild group. L count, L percentage, A/G, ALB, and Ca, however, were found to be at lower levels in the severe group. Severe COVID-19 infection may activate neutrophils to produce an immune response to the virus and cause a cytokine storm. Furthermore, considering that aging is related to decreased immune competence [30], deaths among elderly patients may consequently be due to a weak immune response [16]. There were no significant differences in the L subsets between the two groups, possibly because of the limited sample size. However, notably different results were reported in previous research, indicating that CD3+ and CD4+ T cells might protect patients from developing ARDS [16].
In this study, a feature selection procedure based on machine learning algorithms was performed to identify the most predictive features for a multivariate prognostic prediction of severity. Feature selection is an essential step in multivariate modeling. In previous studies, this was mainly implemented by performing univariate statistical analysis or with prior clinical knowledge or evidence [9,15,17,19]. In our study, a two-step feature selection procedure was performed based completely on patient data. Finally, one clinical characteristic (chest distress) and five laboratory findings (IL-6, TNI, PCT, CRP, Ca) were identified as the most predictive risk factors for prognostic prediction of severity for patients with moderate COVID-19. While knowledge about COVID-19 remains inadequate, information extracted entirely from the data may be of great value for performing valid prognostic prediction.
The level of IL-6 was mildly elevated or within the normal range in the mild group but markedly elevated in the severe group. Elevated cytokines were likely produced by highly inflammatory macrophages that were implicated in a cytokine storm [31]. Myocardial damage with biomarker elevations was a prominent feature in COVID-19 and was related to a worse prognosis [32,33]. TNI, one of the most predictive factors in our prognostic prediction model, is used to evaluate patients with suspected acute coronary syndrome [34]. Plasma TNI and CRP levels were positively and significantly correlated [35]. This implies that the pathological processes of myocardial damage and inflammation might be closely related in the course of COVID-19. Chest distress was an important factor that could contribute to the prognostic prediction of severity in this study and was reported as one of the most common symptoms in COVID-19 [28]. An increased PCT level was reported to discriminate between mild and moderate cases [36]. In our study, PCT was identified as a risk factor for prognostic prediction of severity. A lower Ca level was recognized as a predictor of severity in our prognostic prediction model. To our knowledge, this result has not yet been reported.
There are also some limitations in our research. First, a larger sample size may result in a more convincing prognostic prediction model with the feature selection procedure based on machine learning algorithms. Possibly because of the limited sample size, the comorbidities and some signs and symptoms showed no significant difference between the two groups. Moreover, age was not one of the most predictive factors related to severity in this cohort of moderate cases, although age was an important prognostic risk factor in previous research [9]. Second, the temporal changes in patient characteristics from admission to outcome were not included in this study because most of the tests were performed only once at admission. The prognosis may benefit from information on temporal changes [19]. Third, the patients' stage of disease progression at admission, such as time since symptom onset and exposure history, was not considered as a candidate predictor. This information relied on patients' memory and might have been affected by recall bias. Fourth, the data in this study were collected during an outbreak, so the findings might differ in a nonoutbreak situation. However, since we included as many COVID-19 cases as possible in our hospital, we believe our study population is representative of cases diagnosed and treated in Wuhan.

Conclusions
With the data collected at admission, one clinical characteristic (chest distress) and five laboratory findings (IL-6, TNI, PCT, CRP, Ca) showed the best performance in discriminating between the two groups with a linear kernel SVM model. They may be risk factors that can be used to perform a prognostic prediction of severity for patients with moderate COVID-19 in the early stage of the disease and thus help minimize the progression of the disease.