Following a complete predictor run, you will want to evaluate the performance of the predictor and examine the nature of feature-selected variables.

Software Requirements

For this example to work, you will require the following components to be installed. You may have already installed these when you first setup netDx but if not, make sure these are set up now:

  • Cytoscape must be installed and running.
  • Cytoscape apps EnrichmentMap (v3.0.0) and AutoAnnotate (v1.2) must be installed.
  • R packages: httr, RJSONIO
  • R packages r2cytoscape and EasycyRest.

Note: If Cytoscape is not running, this example will not work!

Let us see if the required dependencies can be installed and/or loaded:

#httr
tryCatch(expr = { library(httr)}, 
          error = function(e) { install.packages("httr")}, finally = library(httr))

#RJSONIO
tryCatch(expr = { library(RJSONIO)}, 
          error = function(e) { install.packages("RJSONIO")}, finally = library(RJSONIO))

#r2cytoscape
tryCatch(expr = { library(r2cytoscape)}, 
          error = function(e) { devtools::install_github('cytoscape/cytoscape-automation/for-scripters/R/r2cytoscape')}, finally = library(r2cytoscape))
## Loading required package: XML
# EasycyRest
tryCatch(expr = { library(EasycyRest); detach(package:EascyRest,unload=TRUE)}, 
          error = function(e) { devtools::install_github('BaderLab/Easycyrest/EasycyRest@0.1')}, finally = {})
## 
## Attaching package: 'EasycyRest'
## The following objects are masked from 'package:r2cytoscape':
## 
##     createNetwork, createStyle
## Skipping install of 'EasycyRest' from a github remote, the SHA1 (3f474b6e) has not changed since last install.
##   Use `force = TRUE` to force installation

Now let us test if Cytoscape is open:

# Basic settings
version.url <-"http://localhost:1234/v1/version"
cytoscape.open <- TRUE

tryCatch(expr = { GET(version.url)}, 
         error = function(e) { return (cytoscape.open = FALSE)}, finally =function(r){ return(cytoscape.open = TRUE)})
## Response [http://localhost:1234/v1/version]
##   Date: 2017-09-13 15:53
##   Status: 200
##   Content-Type: application/json
##   Size: 46 B
if(!cytoscape.open){
  #try and launch cytoscape
 stop("Cytoscape is not open.  Please launch cytoscape.")
} else{
  cytoscape.version <-  GET(version.url)
  cy.version <- fromJSON(rawToChar(cytoscape.version$content))
  print(cy.version)
}
##       apiVersion cytoscapeVersion 
##             "v1"          "3.5.1"

Predictor Design

For this example, the predictor has been run using a nested cross-validation design. In nested cross-validation, cross-validation is performed repeatedly over multiple random splits of the data into train and blind test partitions. Feature selected networks are those that consistently score highly across the multiple splits. Conceptually, this is what the higher-level logic looks like for a nested cross-validation design with 10-fold CV in the inner loop, and 100 splits in the outer loop:

(Note: these aren’t real function calls; this block just serves to illustrate the concept of the nested CV design for our purposes)

outerLoop <- 100     # num times to split data into train/blind test samples
innerLoop <- 10      # num folds for cross-validation, also max score for a network
netScores <- list()  # collect <outerLoop> set of netScores
perf <- list()       # collect <outerLoop> set of test evaluations

for k in 1:outerLoop
 [train, test] <- splitData(80:20) # split data using RNG seed
 netScores[[k]] <- runCV(train)
 perf[[k]] <- collectPerformance(netScores[[k]], test)
end

Output directory structure

netDx expects a nested directory structure with the predictor results. The top level should contain one directory for each train/test split. Within each of these directories are the predictor results for the corresponding cross-validation. Here is the directory structure for a dataset with rootDirectory dataset_yymmdd.

dataset_yymmdd/
  + rng1/
    + tmp/       # directory created by netDx, containing input data for GeneMANIA database
    + networks/  # PSN created by calls to makePSN_NamedMatrix()
    +-- Class1
       + tmp/
       + networks/                               # networks for test classification for this split
       + GM_results/                             # results of inner loop (10-fold CV)
       + Class1_pathway_CV_score.txt             # network scores for inner CV fold
       +--- CV_1.query                           # query for CV fold
       +--- CV_1.query-results.report.txt.NRANK  # network weights for CV fold
       ...
       +--- CV_10.query                           
       +--- CV_10.query-results.report.txt.NRANK  
    +-- Class2
    + predictionResults.txt  # test predictions for this train/test split
  + rng2/
  + rng3/
  + rng4/
  ...
  + rng100/

Set up

suppressMessages(require(netDx))
suppressMessages(require(netDx.examples))

Load data for plotting

In this example, we use data from The Cancer Genome Atlas (http://cancergenome.nih.gov/), downloaded from the PanCancer Survival project (https://www.synapse.org/#!Synapse:syn1710282). We use gene expression profiles from renal clear cell carcinoma tumours to predict poor and good survival after Yuan et al. (2014) (Refs 1-2). The data consists of 150 tumours. Here we work only with the gene expression profiles generated.

phenoFile <- sprintf("%s/extdata/KIRC_pheno.rda",path.package("netDx.examples"))
lnames <- load(phenoFile)
head(pheno)
##             ID age grade     stage STATUS_INT     STATUS
## 1 TCGA-AK-3428  62    G2 Stage III          1 SURVIVEYES
## 2 TCGA-AK-3434  72    G2   Stage I          1 SURVIVEYES
## 3 TCGA-B0-4688  46    G4  Stage IV          0  SURVIVENO
## 4 TCGA-B0-4690  65    G3  Stage IV          0  SURVIVENO
## 5 TCGA-B0-4691  55    G3  Stage IV          0  SURVIVENO
## 6 TCGA-B0-4693  72    G4 Stage III          0  SURVIVENO

Create a directory to store output in:

outDir <- paste(getwd(),"plots",sep="/")
if (!file.exists(outDir)) dir.create(outDir)
setwd(outDir)

Now compile the paths to output files. First get the list of all directories corresponding to the outer loops:

inDir <- sprintf("%s/extdata/KIRC_output",
    path.package("netDx.examples"))
all_rngs <- list.dirs(inDir, recursive = FALSE)
print(head(basename(all_rngs)))
## [1] "rng1"   "rng10"  "rng100" "rng11"  "rng12"  "rng13"

Each rngX/ directory contains the results of a particular train/test split. The two classes in question are SURVIVEYES (good survival) and SURVIVENO (poor survival).

dir(all_rngs[1])
## [1] "predictionResults.txt" "SURVIVENO"             "SURVIVEYES"

Plotting overall predictor performance

First we evaluate the average performance of the predictor over the 100 train/blind test splits. Performance is defined as the Area under the Receiver Operator Characteristic curve (AUROC) or Area under the Precision-Recall curve (AUPR).

First compile prediction files for all the 100 splits in the example data. These paths are stored in predFiles. Then call the performance-plotting function, which is plotPerf().

predClasses <- c("SURVIVEYES","SURVIVENO")
predFiles <- unlist(lapply(all_rngs, function(x) 
        paste(x, "predictionResults.txt", sep = "/")))
predPerf <- plotPerf(inDir, predClasses=predClasses)
## Single directory provided, retrieving prediction files

Feature-level scores

Feature-level scores are one of the outputs of feature-selection. These indicate the predictive strength of a feature, so that networks with higher scores indicate those with greater predictive power for the class in question. Note that in netDx, feature selection occurs once per patient class (i.e. here, once for SURVIVEYES and once for SURVIVENO), so each feature gets a predictive score for each class.

This nested design has an outer loop of 100 splits with 10-fold cross-validation in the inner loop. Each network can therefore score between 0 and 10 (max number of cross-validation folds) for a given split. For the 100 splits, feature scores can be viewed as a matrix X which is N-by-100, where N is the number of networks. X[i,j] is the score for network i for split j.

Note:In practice, we do not include networks that never score >0, so the number of rows is < N.

featScores <- getFeatureScores(inDir,predClasses=c("SURVIVEYES","SURVIVENO"))
## SURVIVEYES
##  Single directory provided, retrieving CV score files
## Got 100 iterations
## * Computing consensus
## SURVIVENO
##  Single directory provided, retrieving CV score files
## Got 100 iterations
## * Computing consensus

The size of featScores should correspond to number of networks that score >0 at least once (rows) and the number of train/test splits (here, 100):

dim(featScores[[1]])
## [1]  25 101

The first column shows feature names:

head(featScores[[1]][,1:10])
##                                           PATHWAY_NAME rng1 rng10 rng100
## 30   ACTIVATION_OF_THE_PRE-REPLICATIVE_COMPLEX.profile    4    10     10
## 64                       ANDROGEN_BIOSYNTHESIS.profile    9     9      4
## 237                      BIOCARTA_STEM_PATHWAY.profile   10    10      8
## 264                CALNEXIN_CALRETICULIN_CYCLE.profile   10    10      6
## 355      DEFECTS_IN_COBALAMIN__B12__METABOLISM.profile   10     9     10
## 356 DEFECTS_IN_VITAMIN_AND_COFACTOR_METABOLISM.profile    9     9     10
##     rng11 rng12 rng13 rng14 rng15 rng16
## 30      8     9     6     6    10     8
## 64      8     5     6     4     9     9
## 237    10     9     7    10     9     4
## 264     7     7    10     9     8     7
## 355     9    10     8    10     8    10
## 356     9    10     9    10     8     9

Let us define feature-selected networks as those that score 10 out of 10 in at least 70% of the train/test splits.

featSelNet <- lapply(featScores, function(x) {
    callFeatSel(x, fsCutoff=10, fsPctPass=0.7)
})

Let’s take a look at the top networks for each class:

tmp <- lapply(featSelNet,print)
## [1] "REACTIONS_SPECIFIC_TO_THE_COMPLEX_N-GLYCAN_SYNTHESIS_PATHWAY.profile"
## [2] "THYROXINE_BIOSYNTHESIS.profile"                                      
## [1] "ABACAVIR_TRANSPORT_AND_METABOLISM.profile"                 
## [2] "BILE_SALT_AND_ORGANIC_ANION_SLC_TRANSPORTERS.profile"      
## [3] "METABOLISM_OF_WATER-SOLUBLE_VITAMINS_AND_COFACTORS.profile"
## [4] "PLATELET_ADHESION_TO_EXPOSED_COLLAGEN.profile"             
## [5] "REGULATION_OF_IFNA_SIGNALING.profile"                      
## [6] "THYROXINE_BIOSYNTHESIS.profile"

Plotting the Enrichment Map

An EnrichmentMap is a network-based visualization for a group of overlapping sets (Ref 3). Here, we use this visualization to examine which pathways and functional themes are feature-selected for each patient class. So, this is a network where nodes are features (here, pathways), and edges indicate shared members between the features (here, genes common to the pathway).

By clustering and annotating the EnrichmentMap using the AutoAnnotate app (Ref 4), we can identify functional themes that are common among high-scoring features.

Get the set of valid gene sets:

pathFile <- sprintf("%s/extdata/Human_160124_AllPathways.gmt",
           path.package("netDx.examples"))
pathwayList <- readPathways(pathFile)
## ---------------------------------------
## File: Human_160124_AllPathways.gmt
## 
## Read 2760 pathways in total, internal list has 2712 entries
##  FILTER: sets with num genes in [10, 500]
##    => 911 pathways excluded
##    => 1801 left

Filter for the genes measured in this dataset:

xpr_genes <- sprintf("%s/extdata/EMap_input/genenames.txt",
      path.package("netDx.examples"))
xpr_genes <- read.delim(xpr_genes,h=FALSE,as.is=TRUE)[,1]
head(xpr_genes)
## [1] "ZNF121"  "OR2J3"   "HMOX1"   "SYT4"    "GPR137C" "AKAP12"

Filter:

pathwayList <- lapply(pathwayList, function(x) x[which(x %in% xpr_genes)])

To create an enrichment map in netDx, you need two files:

  1. **A .gmt file:** A file with top-scoring genesets in the GMT format, similar to the example pathway file.
  2. Node attribute table: A table containing the names of features and the maximum score each achieves across cross-validation.

Here we generate the above two files for each patient class. We have two sets of files because each class has its own set of predictive features and therefore, its own enrichment map (EM).

In addition to objects we have seen before, this step requires a table indicating what type of data each network represents. This may be used to assign visual features (e.g. node shapes) to distinguish different data sources in the EM.

netInfoFile <- sprintf("%s/extdata/KIRC_output/inputNets.txt",
      path.package("netDx.examples"))
netInfo <- read.delim(netInfoFile,sep="\t",h=FALSE,as.is=TRUE)
head(netInfo)
##    V1
## 1 rna
## 2 rna
## 3 rna
## 4 rna
## 5 rna
## 6 rna
##                                                                       V2
## 1                      GUANOSINE_NUCLEOTIDES__I_DE_NOVO__I__BIOSYNTHESIS
## 2                                                   RETINOL_BIOSYNTHESIS
## 3                         MUCIN_CORE_1_AND_CORE_2__I_O__I_-GLYCOSYLATION
## 4 SUPERPATHWAY_OF_D-_I_MYO__I_-INOSITOL__1,4,5_-TRISPHOSPHATE_METABOLISM
## 5                D-_I_MYO__I_-INOSITOL__1,4,5_-TRISPHOSPHATE_DEGRADATION
## 6                                                           MRNA_CAPPING

Create the EnrichmentMap input:

EMap_input <- writeEMapInput_many(featScores,pathwayList,
      netInfo,outDir=outDir)

Finally, plot the EnrichmentMap:

pngFiles <- list()
for (curGroup in names(EMap_input)[1:2]) {
    pngFiles[[curGroup]] <- plotEmap(gmtFile=EMap_input[[curGroup]][1], 
                                nodeAttrFile=EMap_input[[curGroup]][2],
                                netName=curGroup,outDir=outDir)
}
##       apiVersion cytoscapeVersion 
##             "v1"          "3.5.1" 
## * Applying AutoAnnotate
## * Importing node attributes
## * Creating or applying style
## * Final cleanup
## [1] "http://localhost:1234/v1/commands/view/export?OutputFile=/Users/ahmadshah/netDx/examples/plots/EnrichmentMap_SURVIVEYES.png"
##       apiVersion cytoscapeVersion 
##             "v1"          "3.5.1" 
## * Applying AutoAnnotate
## * Importing node attributes
## * Creating or applying style
## * Final cleanup
## [1] "http://localhost:1234/v1/commands/view/export?OutputFile=/Users/ahmadshah/netDx/examples/plots/EnrichmentMap_SURVIVENO.png"

The enrichment map showing the top-scoring networks for SURVIVEYES should now be active in Cytoscape. It should look like this:

EnrichmentMap for the SURVIVEYES class

EnrichmentMap for the SURVIVEYES class

Nodes are coloured by the highest consistent score the particular feature achieved; consistency is defined as having achieved the score for >70% of the splits.

Note: Node labels for individual features have been turned off to remove visual clutter. These can be enabled interactively in Cytoscape.

There should also be a second EnrichmentMap for the SURVIVENO class:

EnrichmentMap for the SURVIVENO class

EnrichmentMap for the SURVIVENO class

Computing the integrated patient similarity network

An integrated patient network is constructed by aggregating feature-selected networks for all classes. It represents the overall view of patient similarity that results from identifying predictive networks through the netDx predictor.

For the visualization we actually prefer to magnify dissimilarity so that similar patients are closer (i.e. have smaller edge weight) and dissimilar patients are farther apart (i.e. have larger edge weight). So the network generated in this step is really a dissimilarity network. The steps for computing this network are:

  1. Concatenate all feature-selected networks
  2. For all patient pairs, collapse edges by taking the mean similarity.
  3. Convert to dissimilarity (1-mean_similarity).
  4. For the Cytoscape visualization, keep a certain fraction of the top edges (e.g. top 20% disimilarities).

The function plotIntegratedPSN() runs these computations. It therefore needs access to the folder that contains all the feature-selected networks. Here we provide rng1/ for that purpose using the baseDir argument.

netInfo <- plotIntegratedPSN(pheno=pheno,baseDir=sprintf("%s/rng1",inDir),
    netNames=featSelNet,outDir=outDir)
## 2 classes: { SURVIVEYES,SURVIVENO }
## * Style exists, not creating
## Group SURVIVEYES
## Group SURVIVENO
## * Computing aggregate net
## 
## Writing aggregate PSN
## Loading required package: reshape2
## 
##  12234 pairs have no edges (counts directed edges)
##  Sparsity = 10266/11175 (92 %)
## * Creating network in Cytoscape
## * Create network URL
## Network ID is : 65311 
## * Applying layout
## * Applying style
## * Exporting to PNG

The Cytoscape session should now show a network that looks like this: View of patient similarity in KIRC using feature-selected networks. Nodes are patients, with fill colour indicating patient class. Patients of type SURVIVENO are shown in orange, while those of type SURVIVEYES are shown in green. Edges are weighted by average dissimilarity. Only top 20% of the largest weights are used here.

The returned netInfo object contains several pieces of information related to the integrated network shown above, including the path to the full similarity network (aggPSN_FULL), the pruned dissimilarity network (aggPDN_pruned), and information about the resulting view in Cytoscape (network_suid and netView).

summary(netInfo)
##               Length Class  Mode     
## aggPSN_FULL   1      -none- character
## aggPDN_pruned 1      -none- character
## incNets       7      -none- character
## network_suid  1      -none- numeric  
## netView       1      -none- character

sessionInfo

sessionInfo()
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## locale:
## [1] C/C/C/C/C/en_CA.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.2       netDx.examples_0.1   netDx_0.94          
##  [4] RColorBrewer_1.1-2   pracma_2.0.7         ROCR_1.0-7          
##  [7] gplots_3.0.1         GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 
## [10] IRanges_2.8.2        S4Vectors_0.12.2     BiocGenerics_0.20.0 
## [13] combinat_0.0-8       doParallel_1.0.10    iterators_1.0.8     
## [16] foreach_1.4.3        bigmemory_4.5.19     bigmemory.sri_0.1.3 
## [19] EasycyRest_0.1       r2cytoscape_0.0.3    XML_3.98-1.9        
## [22] RJSONIO_1.3-0        httr_1.3.1          
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       colorspace_1.3-2   htmltools_0.3.6   
##  [4] yaml_2.1.14        rlang_0.1.1        withr_1.0.2       
##  [7] plyr_1.8.4         stringr_1.2.0      zlibbioc_1.20.0   
## [10] munsell_0.4.3      gtable_0.2.0       devtools_1.13.2   
## [13] caTools_1.17.1     codetools_0.2-15   memoise_1.1.0     
## [16] evaluate_0.10      knitr_1.16         curl_2.8.1        
## [19] Rcpp_0.12.11       KernSmooth_2.23-15 backports_1.1.0   
## [22] scales_0.4.1       gdata_2.18.0       XVector_0.14.1    
## [25] ggplot2_2.2.1      digest_0.6.12      stringi_1.1.5     
## [28] rprojroot_1.2      grid_3.3.3         quadprog_1.5-5    
## [31] tools_3.3.3        bitops_1.0-6       magrittr_1.5      
## [34] RCurl_1.95-4.8     lazyeval_0.2.0     tibble_1.3.3      
## [37] pkgconfig_2.0.1    rmarkdown_1.6      R6_2.2.2          
## [40] igraph_1.1.2       git2r_0.18.0

References

  1. Yuan, Y. et al. (2014) Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol 32, 644-52.
  2. The Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature499, 43-9.
  3. Merico, D., Isserlin, R. & Bader, G.D. (2011). Visualizing gene-set enrichment results using the Cytoscape plug-in enrichment map. Methods Mol Biol 781, 257-77.
  4. Kucera, M., Isserlin, R., Arkhangorodsky, A. & Bader, G.D. (2016). AutoAnnotate: A Cytoscape app for summarizing networks with semantic annotations. F1000Res 5, 1717.