P. Raman, S. Zimmerman and K.S. Rathi et al.
using a threshold, and standard workflows for dichotomizing data can be applied. Although these approaches are attractive because of their simplicity, specifying an appropriate threshold is not trivial. Fundamentally, the issue that complicates biomarker prediction for patient survival, and prevents its straightforward extension from discrete variables to continuous ones such as gene expression, is the degree of variability in the data. Simply put, the question that remains unaddressed is whether it is more effective to derive biomarkers with approaches that dichotomize a gene’s expression profile and, if so, how an optimal breakpoint can be identified to facilitate this dichotomization.
To our knowledge, a comprehensive investigation addressing this important question has not been performed, and it is therefore the focus of this study. The performance of different statistical methods that estimate the effect of gene expression from RNA-seq data on survival status was compared to determine the optimal strategy for identifying predictive markers of cancer patient survival. RNA-seq datasets from four TCGA studies, namely ovarian serous cystadenocarcinoma (OV), prostate adenocarcinoma (PRAD), kidney renal clear cell carcinoma (KIRC), and head and neck squamous cell carcinoma (HNSC), were used to test the reliability and accuracy of eight competing survival analysis methods. The methods selected were based on the Cox regression model, k-means, the concordance index (C-index), the D-index, and dichotomization using the median, distributional shapes, KaplanScan, and the 25th–75th percentile split. The four cancers were selected as a representative panel on which to evaluate the eight survival analysis methods. Given that over 30 tumor types are now represented in TCGA, a large number of combinations could have been sourced. However, the ovarian and prostate cancers were selected to represent two different kinds of sex-specific tumors. The head and neck and kidney cancers were also included because they represent less common tumor types (∼4% of all new cancer cases in the United States).
As with any applied statistical approach, each of the eight methods selected brings its own set of advantages and limitations for finding biomarkers of survival based on gene expression data (Fig. 1). Cox regression is a flexible, well-established method that allows multiple covariates to be included to adjust for explanatory variables. This provides a way to further improve the accuracy of the estimated association between patient survival and gene expression by accounting for other contributing factors, including batch effects and biometric or clinical variables. K-means is borrowed from exploratory data analysis: k-means clustering splits the gene expression data into two patient groups through an unsupervised, non-parametric approach that iterates until convergence. A log-rank test then assesses the difference in survival between these two patient groups to determine a gene’s utility as a prognostic biomarker. The KaplanScan method identifies the optimal breakpoint by considering multiple candidates and choosing the one that creates the most significant separation between the two patient groups. While the KaplanScan method is advantageous because it avoids relying on an arbitrary threshold for dichotomization, it suffers from an increased rate of false positives, and correction methods that adjust for multiple hypothesis testing must be used.
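The two dichotomization strategies described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names (`logrank_p`, `kmeans_two_groups`, `scan_cutpoints`), the minimum group size, the Bonferroni correction, and the toy data are all assumptions made for illustration. The sketch uses only the standard library, relying on the fact that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math

def logrank_p(times, events, in_high):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df)."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({x for x, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                           # at risk, both groups
        n1 = sum(1 for x, g in zip(times, in_high) if g and x >= t)   # at risk, high group
        d = sum(1 for x, e in zip(times, events) if e and x == t)     # events at time t
        d1 = sum(1 for x, e, g in zip(times, events, in_high)
                 if e and g and x == t)                               # events in high group
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    return math.erfc(math.sqrt(obs_minus_exp ** 2 / var / 2))

def kmeans_two_groups(expr, max_iter=100):
    """One-dimensional k-means with k=2; True marks the high-expression cluster."""
    lo, hi = min(expr), max(expr)          # assumed initialisation: the two extremes
    labels = [False] * len(expr)
    for _ in range(max_iter):
        new = [abs(x - hi) < abs(x - lo) for x in expr]
        high = [x for x, g in zip(expr, new) if g]
        low = [x for x, g in zip(expr, new) if not g]
        if new == labels or not high or not low:
            labels = new
            break                          # assignments stable: converged
        labels = new
        hi, lo = sum(high) / len(high), sum(low) / len(low)
    return labels

def scan_cutpoints(expr, times, events, min_group=3):
    """KaplanScan-style scan: try every cutpoint, keep the smallest log-rank p.

    Returns (cut, raw_p, adjusted_p). The Bonferroni adjustment is an
    assumption; the text only states that some correction is required.
    """
    results = []
    for cut in sorted(set(expr))[1:]:      # high group: expression >= cut
        in_high = [x >= cut for x in expr]
        k = sum(in_high)
        if k < min_group or len(expr) - k < min_group:
            continue
        results.append((logrank_p(times, events, in_high), cut))
    if not results:
        return None, 1.0, 1.0
    raw_p, cut = min(results)
    return cut, raw_p, min(1.0, raw_p * len(results))

# Toy data (hypothetical): high expression goes with short survival.
expr = [1, 2, 3, 4, 5, 6, 7, 8]
times = [10, 9, 8, 7, 2, 1.5, 1, 0.5]
events = [1, 1, 1, 1, 1, 1, 1, 1]

km_labels = kmeans_two_groups(expr)
km_p = logrank_p(times, events, km_labels)
cut, raw_p, adj_p = scan_cutpoints(expr, times, events)
print(km_labels, km_p, cut, raw_p, adj_p)
```

Because the scan reports the smallest of several p-values, its raw p-value is optimistic by construction, which is why the adjusted value is returned alongside it. The k-means split, by contrast, involves a single test per gene and needs no correction for the number of candidate cutpoints.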
In our study, two of the methods that were selected are those based on quantile dichotomization. The simplest
Comparing survival analysis methods for cancer RNA-seq data