Integrative analysis of DNA methylation-driven genes for the prognosis of lung squamous cell carcinoma using MethylMix

Background: DNA methylation acts as a key component in epigenetic modifications of genomic function and functions as disease-specific prognostic biomarkers for lung squamous cell carcinoma (LUSC). This present study aimed to identify methylation-driven genes as prognostic biomarkers for LUSC using bioinformatics analysis. Materials and Methods: Differentially expressed RNAs were obtained using the edge R package from 502 LUSC tissues and 49 adjacent non-LUSC tissues. Differentially methylated genes were obtained using the limma R package from 504 LUSC tissues and 69 adjacent non-LUSC tissues. The methylation-driven genes were obtained using the MethylMix R package from 500 LUSC tissues with matched DNA methylation data and gene expression data and 69 non-LUSC tissues with DNA methylation data. Gene ontology and ConsensusPathDB pathway analysis were performed to analyze the functional enrichment of methylation-driven genes. Univariate and multivariate Cox regression analyses were performed to identify the independent effect of differentially methylated genes for predicting the prognosis of LUSC. Results: A total of 44 methylation-driven genes were obtained. Univariate and multivariate Cox regression analyses showed that twelve aberrant methylated genes (ATP6V0CP3, AGGF1P3, RP11-264L1.4, HIST1H4K, LINC01158, CH17-140K24.1, CTC-523E23.14, ADCYAP1, COX11P1, TRIM58, FOXD4L6, CBLN1) were entered into a Cox predictive model associated with overall survival in LUSC patients. Methylation and gene expression combined survival analysis showed that the survival rate of hypermethylation and low-expression of DQX1 and WDR61 were low. The expression of DQX1 had a significantly negatively correlated with the methylation site cg02034222. Conclusion: Methylation-driven genes DQX1 and WDR61 might be potential biomarkers for predicting the prognosis of LUSC.


Introduction
Lung cancer is the leading cause of cancer related deaths worldwide [1]. Lung squamous cell carcinoma (LUSC) accounts for about 30% of all lung cancers with the highest mortality in the world [2]. LUSC accounts for approximately more than 400,000 deaths worldwide each year [3]. The five year survival rate of lung squamous cell carcinoma is less than 15% [4]. Due to the limitation of treatment and poor survival rate for lung squamous cell carcinoma [5], it is imperative that we explore specific diagnostic and Ivyspring International Publisher prognostic biomarkers for LUSC. In the present study, we aimed to identify novel specific diagnostic and prognostic biomarkers for predicting survival in LUSC.
Genetic aberrant expression is important for the etiology of human cancer and the combined effect of genetic and epigenetic changes contribute to the progress of human cancer [6][7][8][9][10]. DNA methylation is one of the most important elements in epigenetic modifications and participates in the regulation of cellular functions and carcinogenesis [11]. Epigenetic modification, especially DNA methylation, plays a significant role in predicting the prognosis of lung cancer [12][13][14][15]. For instance, the identification of eight DNA methylation biomarkers using high-throughput DNA methylation analysis can predict the prognosis of lung squamous cell carcinoma [16]. Aberrant ANK1 methylation contributes to miR-486-5p repression and discriminates lung tumors by histology and smoking status [17]. Pharmacologic inhibition of DNA methylation combined with gene expression reveals new specific diagnostic and prognostic biomarkers in lung squamous cell carcinoma [18].
MethylMix is an R package used for identifying disease-specific hyper-and hypo-methylated genes [19]. More precisely, MethylMix includes two major characteristics: automatic download of DNA methylation and gene expression data sets from TCGA and automated pre-processing of such data sets: value interpolation, batch correction, and CpG positions within each gene point clustering. Currently, few studies based on using MethylMix R package to identify specific methylation-driven genes have been reported [20][21][22]. Recently, a study based on MethylMix reveals potential prognostic methylationdriven genes for predicting the prognosis in lung adenocarcinoma (LUAD) has been reported [23]. In the present study, we extracted the DNA methylation and RNA-Seq data using bioinformatics methods from The Cancer Genome Atlas (TCGA) database, and then the MethylMix R package was performed to obtain methylation-driven genes. Furthermore, a Cox predictive model was established to predict the diagnosis and prognosis of LUSC. Eventually, the joint of methylation and gene expression combined survival analysis was performed to reveal potential specific methylation-driven genes for predicting the prognosis of LUSC.

Data extraction and analysis
RNA-Seq data and methylation data were downloaded from the Cancer Genome Atlas (TCGA) database. The methylation data consists of 504 LUSC samples and 69 normal samples from the Illumina Infinium Human Methylation 450k platform. The RNA-Seq data (level 3) incorporating lncRNA and mRNA expression was obtained from 502 LUSC samples and 49 normal samples from the IlluminaHiSeq_RNASeq platform. First, based on the limma R package, we retrieved the aberrant methylated genes with the screening criteria absolute fold change (log2) > 0 and adjusting the false discovery rate (FDR) to a P value < 0.05. Then, we obtained the differentially expressed lncRNA and mRNA using the edge R package in R software with the absolute fold change (log2) > 0 and adjusted the FDR to a P value < 0.05. Next, we used the MethylMix R package with the screening criteria (|logFC|>0, P<0.05, Cor<-0.3) to extract the methylation-driven genes. Analysis of average methylation differences at 3000 bp (base pair) sites upstream of genes to identify differential methylation levels in gene promoters [23][24][25]. The differential level of methylation in the promoter of genes was performed by using the limma R package [26]. Eventually, we identified methylation-driven genes and aberrant methylated genes to establish a β-mixture model. The data was directly from the TCGA database. No approval was required from the local ethics committee.

Enrichment analysis of methylation-driven genes in LUSC
We used the Database for Annotation, Visualization and Integrated Discovery (DAVID) (http://david.abcc.ncifcrf.gov/) database to analyze the biological function in methylation-driven genes by using Gene ontology and ConsensusPathDB pathway analysis. In the GO analysis, a P value was less than 0.05 was considered as statistically significance. We used the GOCircle and GOChord plotting to explore the relationship between the methylation-driven genes and their biological function. ConsensusPathDB (http://cpdb.molgen.mpg.de/) is an online software incorporating gene regulatory, drug-target interactions and binary complex signaling. In the ConsensusPathDB pathway analysis, P < 0.05 was considered as statistically significance.

Establishment of predictive model of differentially methylated genes based on DNA methylation in LUSC
First, we obtained the expression of differentially methylated genes and retrieved the survival time and survival status of 366 LUSC patients, then used the survival.pl script to extract the data that combined the expression of differentially methylated genes and survival data. Next, we used the univariate R package to obtain the result of univariate Cox regression. We selected 92 aberrant methylated genes to submit to multivariate Cox regression analysis with the screening criteria P < 0.05. Based on the median risk score, LUSC patients were divided into two groups, incorporating high-risk groups and low-risk groups. To test the influence on differentially methylated genes signature (high risk vs low risk) on overall survival, we performed the receiver operating characteristic (ROC) curves to calculate the area under the curve (AUC) to reveal prognostic specific biomarkers for predicting survival in LUSC.

Univariate and multivariate Cox regression independent prognosis analysis of patients of LUSC
In order to further explore the twelve gene signature that can be used as independent prognosis factor, we performed the univariate and multivariate Cox regression independent prognosis analysis of LUSC patients. We extracted 276 LUSC patients with complete clinical information and combined the twelve genes signature risk score of 276 LUSC patients and the expression data of twelve genes of 276 LUSC patients to perform the univariate and multivariate Cox regression independent prognosis analysis. P < 0.05 was considered as statistically significant.

Methylation and gene expression combined survival analysis in LUSC
We used the methylation and gene expression combined survival analysis to further explore the effect of methylation-driven genes on patient prognosis in terms of expression and methylation levels. We performed the joint of methylation and gene expression combined survival analysis to identify potential methylation-driven genes for predicting the prognosis of LUSC patients. Therefore, we performed the Kaplan-Meier curve analysis. P < 0.05 was considered as statistically significant.

Correlation analysis between site methylation and gene expression in LUSC
In order to further explore the relationship between the methylation-driven genes expression and their methylation sites, we used the R package and Perl Package to perform the correlation analysis. First, site methylation data for methylation-driven genes associated with overall survival extracted from the TCGA database using the Perl package, then, we merged site methylation with gene expression data. Finally, we used the R package to figure out the correlation between site methylation and methylation-driven genes expression.

Survival analysis of methylation site in LUSC patients
In order to further explore the survival rate of methylation site cg02034222 in DQX1 in LUSC patients, we performed the Kaplan-Meier curve analysis of methylation site cg02034222 in DQX1 in LUSC patients. P value less than 0.05 was considered as statistically significant.

Identification of methylation-driven genes in LUSC
A total of 44 methylation-driven genes were identified to be connected with DNA methylation using the MethylMix R package. The methylationdriven genes incorporating 42 methylation-driven mRNAs and two methylation-driven lncRNAs were shown in Table 1. Figure 1B and 1D shows methylation-driven gene ATL3 and DQX1 have significant negative correlation in methylation and gene expression level. The distribution of the methylation degree shows that ATL3 is hypermethylated in LUSC patients and hypo-methylated in normal patients ( Figure 1C). The distribution of the methylation degree shows that DQX1 is hypermethylated in non-LUSC patients and hypomethylated in LUSC patients ( Figure 1E). A flow diagram of methylation-driven genes is shown in Figure 1A. A heat-map of methylation driven genes mRNAs and lncRNAs is shown in Figure 2.

Functional enrichment analysis of methylation-driven genes in LUSC
Functional enrichment analysis shows that eight GO terms (transcription, DNA-templated; transcription factor activity, sequence-specific DNA binding; RNA polymerase II complex import to nucleus; RNA polymerase III complex localization to nucleus; regulation of transcription, DNA-templated; Smc5-Smc6 complex; ligase activity; histone H3-K4 tri-methylation) with statistical significance (P < 0.05). The highest GO term was biological process "GO0006351 transcription, DNA templated" ( Figure  3A and 3C). Figure 3B shows the all methylationdriven mRNAs with their related eight GO terms. Figure 4 shows that 13 pathways (BARD1 signaling events; Protein processing in endoplasmic reticulum -Homo sapiens (human); E3 ubiquitin ligases ubiquitinate target proteins; SUMOylation of DNA damage response and repair proteins; Protein ubiquitination; Apoptosis Modulation and Signaling; SUMO E3 ligases SUMOylate target proteins; SUMOylation; Keratinization; Spliceosome -Homo sapiens (human); Ubiquitin mediated proteolysis -Homo sapiens (human); NRF2 pathway; EMT transition in Colorectal Cancer) were considered statistically significant (P < 0.05). As we can see from the Figure 4, the methylation-driven genes were most enriched in BARD1 signaling events, Protein processing in endoplasmic reticulum -Homo sapiens (human) and E3 ubiquitin ligases ubiquitinate target proteins (P<0.01).
The pathway analysis is shown in Table 2.    Table 3.

ROC curve analysis and risk groupings
The heat-map shows that 12 differentially methylated genes (ATP6V0CP3, AGGF1P3, RP11-264L1.4, HIST1H4K, LINC01158, CH17-140K24.1, CTC-523E23.14, ADCYAP1, COX11P1, TRIM58, FOXD4L6, CBLN1) were divided into two groups based on the median risk scores ( Figure 5A). A total of 366 patients with complete survival information were divided into a high-risk group (n=183) and a low-risk group (n=183). The Kaplan-Meier curve with a Log-rank statistical examination was used to perform survival analysis ( Figure 5B). As shown in Figure 5B, patients in the high-risk group had significantly poor survival rate than in the low-risk group (P =1e-05). Receiver operating characteristic (ROC) curve was performed to identify the effect of 12 aberrant methylated genes signature associated with overall survival in LUSC ( Figure 5C). The graph of the risk score between high risk group and low risk group in the Cox model is shown in Figure 5D.

Univariate and multivariate Cox regression analysis for the various DNA methylation classifiers of patients with LUSC
In order to further verify the twelve gene signature is an independent prognosis factor, we used the univariate and multivariate Cox regression analysis. We extracted the complete LUSC clinical features with 276 samples and then combined the twelve gene expression data and the prognostic risk score with the LUSC clinical features, including LUSC age, gender, pathology stage, pathology T stage, pathology M stage, pathology N stage to perform the univariate and multivariate Cox regression analysis. The forest plot of univariate and multivariate Cox regression analysis was showed in Figure 6. Univariate Cox regression analysis showed that pathology stage, pathology T stage and twelve gene signature risk score can act as independent prognostic factor ( Figure 6A). Multivariate Cox regression analysis showed that twelve gene signature is an independent prognosis factor ( Figure 6B) (Table 4).

Methylation and gene expression combined survival analysis in LUSC
The combined survival analysis revealed that the joint of methylation and gene expression of the genes (DQX1, WDR61) had significant correlation with the prognosis of LUSC patients ( Figure 7A and 7B). As shown in Figure 6, the hypermethylation and lowexpression survival rate of DQX1 and WDR61 were low. The joint of methylation and gene expression survival analysis shows that the DNA methylation and gene expression of DQX1 and WDR61 were associated with overall survival in LUSC patients (P < 0.05). We used the median to define the DNA methylation cut-off to classify a sample as hypermethylated or hypomethylated about the survival rate based on DQX1 and WDR61 data.

Correlation analysis between gene expression and methylation sites
To figure out the relationship between the DNA methylation and gene expression level in DQX1 and WDR61, we used the Perl package to obtain the methylation sites in DQX1 and WDR61, the expression of DQX1 had 8 methylation sites in TCGA database, while only one methylation site had significant correlation between expression and DNA methylation level. The expression of WDR61 had two methylation sites in TCGA database, while no methylation site had significant correlation with the expression of WDR61. As is shown in Figure 7C, the expression of DQX1 had a significant correlation (Cor = -0.725) with cg02034222 methylation (P = 1.26e-74).

Survival analysis of methylation site cg02034222 in DQX1 in LUSC patients
In order to further figure out whether the one methylation site (cg02034222) in DQX1 is responsible of the survival rate of LUSC patients, we performed survival analysis of methylation site (cg02034222) in DQX1 in hypermethylation and hypomethylation LUSC patients ( Figure 7D). The one CpG site cg02034222 in DQX1 in hypermethylation LUSC patients had a poor survival rate than the one CpG site cg02034222 in DQX1 in hypomethylation LUSC patients (P = 0.021).

Discussion
In recent years, with the increasing numbers of advanced diagnosis and poor prognosis in lung squamous cell carcinoma, it is crucial to explore more effective diagnostic and prognostic biomarkers for predicting survival in LUSC. Genetic and epigenetic changes facilitate the progression of LUSC. DNA methylation and RNA-Seq data analysis provide a novel perspective to reveal disease specific diagnostic and prognostic biomarkers in human lung cancer [27,28]. The rapid development of RNA-Seq analysis technologies provides a novel perspective to explore the molecular characteristic and pathogenesis of LUSC and provides significant evidence for predicting the prognosis of LUSC. Emerging evidences shows that the studies on the molecular mechanism of LUSC and the prognostic biomarkers of the LUSC associated with methylation driven genes is still lacking. In our study, we used the MethylMix R package to identify methylation driven genes and a cox predictive model was established to predict the prognosis of LUSC and the joint of methylation and gene expression combined survival analysis reveals potential prognostic biomarkers for predicting the prognosis of LUSC.
Epigenetics modification, especially DNA methylation, participates in the pathogenesis of LUSC. Accumulating evidences have demonstrated that DNA methylation acts as the major molecular mechanism of epigenetic modification was associated with the human malignant cancer, incorporating lung cancer [29][30][31]. The joint of gene and DNA methylation using bioinformatics analysis revealing new diagnostic and prognostic biomarkers for predicting the prognosis of cancer [32][33][34]. Therefore, it is pivotal to identify disease-specific prognostics biomarkers to determine the exploration of the molecular mechanism of LUSC. The methylation of L1RE1, RARB, and RASSF1 acts as potential disease specific biomarkers for predicting the prognosis of lung cancer [35]. TRIM58/cg26157385 methylation associated with eight genes including A2ML1, CCNE1, COBL, ESCO2, GPR115, MMP10, OVOL1 and SCGB1A1 in lung squamous cell carcinoma [36]. The prognostic value of HOXA9 promoter methylation was associated with lung cancer and can become a new diagnostic and prognostic biomarker for predicting the prognosis of stage I lung adenocarcinoma [37]. Therefore, bioinformatics analysis of DNA methylation and gene expression provide a significant horizon for identifying disease specific diagnostic and prognostic biomarkers in lung squamous cell carcinoma.
Functional enrichment analysis of methylationdriven mRNA reveals the methylation-driven genes might be mostly associated with the biologic function of transcription, DNA template; transcription factor activity; RNA polymerase II complex import to nucleus (P < 0.05). The pathway analysis shows that the methylation-driven genes were associated with 13 pathways (P < 0.05); the most enriched pathway was the Protein processing in endoplasmic reticulum-Homo sapiens (human).
In our study, univariate and multivariate Cox regression analysis shows that ATP6V0CP3, AGGF1P3, RP11-264L1.4, HIST1H4K, LINC01158, CH17-140K24.1, CTC-523E23.14, ADCYAP1, COX11P1, TRIM58, FOXD4L6, CBLN1 were associated with overall survival and could establish a survival predictive model to predict the prognosis of LUSC. The 12 differentially methylated genes could also act as an independent prognostic factor for predicting the prognosis of LUSC. HIST1H4K can be used as prognostic factors for predicting the prognosis of cervical cancer patients [38]. HIST1H4K can be used as DNA methylation biomarkers for predicting the prognosis of prostate cancer [39]. ADCYAP1 can act as prognostic biomarkers to predict risk of endometrial cancer [40]. Recently a study based on TCGA database reveal ADCYAP1 can act as prognostic biomarkers for predicting the prognosis of LUSC [4]. ADCYAP1 can act as novel biomarkers for predicting the prognosis of cervical precancer and cancer [41]. TRIM58/cg26157385 methylation may play a significant role for predicting the prognosis of LUSC [36]. CBLN1 can be used as marker genes for predicting the prognosis of the Ventromedial Hypothalamic Nucleus [42].
In our study, the joint of methylation and gene expression combined survival analysis shows that the survival rate of hypermethylation and low-expression of DQX1 was significant lower than the survival rate of hypomethylation and high-expression of DQX1. Epigenome-wide association studies (EWASs) suggested a novel association between T2D and cg06721411(DQX1; P value=1.18×10(-9)) [43]. The survival rate of hypomethylation and high-expression of WDR61 was significant higher than the survival rate of hypermethylation and low-expression of WDR61. Furthermore, we performed the correlation analysis between DNA methylation sites and gene expression. The expression of DQX1 was significantly negatively associated with the methylation sites cg02034222 (P value= 1.26e-74), which might provide a significant horizon to explore prognostic biomarkers for predicting the diagnosis and prognosis of LUSC. Compared with previous studies [4], our study first obtained aberrant methylated genes using the limma R package, obtained differentially expressed genes using the edge R package and then we filtered the missing values and low expression genes and intersected gene expression data with DNA methylation data to input MethylMix R package to identify methylation driven genes, which may provide a novel perspective to reveal disease-specific prognostic biomarkers in LUSC and may play a significant role in predicting the diagnosis and prognosis of LUSC.

Conclusions
In our study, we aimed to identify methylationdriven genes using MethylMix between LUSC patients and normal samples from the TCGA database for predicting the prognosis of lung squamous cell carcinoma. Univariate and multivariate Cox regression analysis showed that the survival predictive model established by twelve differentially methylated genes (ATP6V0CP3, AGGF1P3, RP11-264L1.4, HIST1H4K, LINC01158, CH17-140K24.1, CTC-523E23.14, ADCYAP1, COX11P1, TRIM58, FOXD4L6, CBLN1) can act as independent prognostic factor for predicting the prognosis of LUSC. In addition, the DNA methylation and gene expression levels of DQX1 and WDR61 are significantly associated with overall survival and the expression of DQX1 has a significantly negatively correlated with the methylation site cg02034222. Our study may provide a novel perspective for predicting the prognosis of LUSC.

Abbreviations
AUC: area under the curve; DAVID: the database for annotation, visualization and integrated discovery; EWASs: epigenome-wide association studies; FC: fold change; FDR: false discovery rate; GO: gene ontology; LUSC: lung squamous cell carcinoma; ROC: receiver operating characteristic; TCGA: the cancer genome atlas. drafted the manuscript; YQQ and YEY revised the manuscript. All authors read and approved the final manuscript.

Ethics Committee Approval and Patient Consent
Consent for participation from all patients was obtained through The Cancer Genome Atlas Project.