Application 2: Construction of a systems-biology-based classifier for HCC differential diagnosis

  Step 1: Identification of HCC differentially expressed genes

  1. Collection of gene expression microarray data

    Two microarray datasets (GSE5364 and GSE2109) containing gene expression profiles of multiple cancer types, including hepatocellular carcinoma, breast carcinoma, lung cancer, colon carcinoma, thyroid gland carcinoma, and esophageal carcinoma, were selected from LiverAtlas and GEO. Detailed information about the datasets is given in Table 4.

  2. Selection of differentially expressed genes

    The data-mining strategy for selecting differentially expressed genes for this HCC diagnostic classifier was based on our previously published methodology[15]. From the multi-cancer dataset GSE5364, 126 up-regulated and 126 down-regulated genes were chosen as candidate markers for the HCC classifier (the gene lists are shown in Table 5).

  Step 2: Selection of marker genes for the HCC classifier

  1. Network construction of candidate marker genes

    As described above, the HCC differentially expressed genes in GSE5364 were selected as candidate marker genes for further network analysis. Protein-protein interaction (PPI) information for these genes was downloaded from the LiverAtlas FTP site. To create the network, the genes (nodes) and interactions (edges) were plotted using UCINET for Windows (Figure 3A). The network architecture is consistent with a scale-free network and represents interactions between individual nodes.

  2. Network analysis of candidate marker genes

    For each node i in the above PPI network, we examined six topological features to identify HCC markers: (1) 'Degree' is defined as the number of links to node i; (2) 'Closeness centrality' is defined as the normalized number of steps required to reach every other node in the network from node i; (3) 'Betweenness centrality' is defined as the number of shortest paths between all pairs of other nodes that pass through node i; (4) 'K value' measures the centrality of node i by K-core analysis[1]; (5) 'Authority score' is proportional to the sum of the hub scores of the nodes with ties pointing to node i (its in-coming ties); (6) 'Hub score' is proportional to the sum of the authority scores of the nodes that node i points to (its out-going ties). The authority and hub scores are commonly used as centrality measures.
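
    To make the definitions above concrete, the following minimal sketch computes three of the six features (degree, closeness centrality, and K value) on a toy undirected network. It is an illustration only and not the UCINET procedure used in this work: the node names, the edge list, and the closeness normalization (n - 1 divided by the sum of shortest-path distances, assuming a connected network) are assumptions made for the example.

```python
from collections import deque

# Toy undirected network; node names are placeholders, not genes from the study.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Degree: number of links attached to each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def closeness(adj):
    """Closeness: (reachable nodes) / (sum of shortest-path distances)."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


def core_number(adj):
    """K value: the largest k such that the node belongs to the k-core."""
    remaining = {node: set(nbrs) for node, nbrs in adj.items()}
    core = {}
    k = 0
    while remaining:
        # Peel off every node whose current degree is at most k.
        low = [n for n, nbrs in remaining.items() if len(nbrs) <= k]
        if not low:
            k += 1
            continue
        for n in low:
            core[n] = k
            for m in remaining.pop(n):
                if m in remaining:
                    remaining[m].discard(n)
    return core


adj = build_adjacency(EDGES)
features = {
    node: {"degree": degree(adj)[node],
           "closeness": closeness(adj)[node],
           "k_value": core_number(adj)[node]}
    for node in adj
}
```

    In this toy network, node C has the highest degree and closeness, and the triangle A-B-C forms the 2-core. Betweenness, authority, and hub scores are omitted here for brevity.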

  3. Selection of marker genes for the HCC diagnostic classifier

    The values of the above six features were calculated for each node with UCINET, and the average value of each feature was then calculated. Candidate marker genes whose six measures were all higher than the corresponding average values were selected as markers for the HCC diagnostic classifier. As a result, four genes (PBX1, TCF4, IQGAP1, and RTN4) were identified as HCC markers. Detailed information on these genes from LiverAtlas is summarized in Table 6.
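
    The selection rule described above (keep a gene only if every feature exceeds that feature's average) can be sketched as follows. The gene identifiers and feature values are invented for illustration, and only two features are shown; the same function applies unchanged to all six.

```python
def select_markers(features):
    """Return the genes whose value exceeds the mean on every feature.

    features: dict mapping gene -> dict mapping feature name -> value.
    """
    names = list(next(iter(features.values())))
    means = {
        name: sum(values[name] for values in features.values()) / len(features)
        for name in names
    }
    return sorted(
        gene
        for gene, values in features.items()
        if all(values[name] > means[name] for name in names)
    )


# Hypothetical feature table (not data from the study).
toy_features = {
    "G1": {"degree": 5, "betweenness": 10.0},
    "G2": {"degree": 1, "betweenness": 2.0},
    "G3": {"degree": 3, "betweenness": 1.0},
}
selected = select_markers(toy_features)
```

    Here only G1 is selected: G3 has a degree equal to the mean (3), which does not satisfy the strict "higher than average" criterion.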

  Step 3: Construction of the systems-biology-based classifier for HCC differential diagnosis

  1. Searching for the optimal parameters of the HCC classifier

    Following our previously published methodology[15], the Partial Least Squares (PLS) algorithm was used to compute the optimal weight coefficients for the markers' expression levels. This algorithm can be implemented in MATLAB. With these optimal weights, the ROC curve shows how sensitivity and specificity change as the classification threshold varies. The optimal threshold is the value that yields the highest accuracy.
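
    The threshold search described above can be illustrated with a short sketch. It assumes the weight coefficients have already been obtained (the PLS fitting itself is not shown), that a sample's score is the weighted sum of its marker expression levels, and that a sample is called HCC when its score is at or above the threshold. The weights, scores, and labels below are invented for the example.

```python
def score(expression, weights):
    """Weighted sum of marker expression levels for one sample."""
    return sum(w * x for w, x in zip(weights, expression))


def best_threshold(scores, labels):
    """Return (threshold, accuracy) for the threshold with the highest accuracy.

    A sample is predicted positive (label 1) when score >= threshold.
    Candidate thresholds are the observed scores; ties keep the lowest one.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical classifier scores and true labels (1 = HCC, 0 = other cancer).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
threshold, accuracy = best_threshold(toy_scores, toy_labels)
```

    With these toy values the search returns a threshold of 0.35 with an accuracy of 0.75; sweeping the same thresholds and recording sensitivity and specificity at each one traces out the ROC curve.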

  2. Performance evaluation of the classifier

    Five-fold cross-validation on the GSE5364 dataset was performed to evaluate the performance of this HCC differential classifier. The resulting ROC curves are shown in Figure 3B. Because the AUC indicates the discriminatory power of a classifier, it was also used to evaluate the predictive efficacy of our classifier. As Figure 3B shows, the classifier had an AUC approximating 1.0 in all five tests, suggesting that it is highly reliable and efficient at identifying true HCC tissues across the different test sets.
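
    As an illustration of the evaluation described above, the sketch below computes the AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count one half), and partitions sample indices into five folds. The interleaved fold assignment and the toy scores are assumptions for the example; the actual fold assignment used in this work is not specified here.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k interleaved, non-overlapping folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


# Hypothetical scores and labels (1 = HCC, 0 = other cancer).
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
folds = k_fold_indices(10, k=5)
```

    In each round of cross-validation, one fold serves as the test set and the remaining four as the training set, and the AUC is computed on the held-out fold.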

    In addition, the independent dataset GSE2109 was used to test this HCC differential classifier. It was randomly split into training and test sets 100 times. The overall predictive accuracy was 81.32±0.03% and the AUC was 0.93±0.02, which suggests that our classifier can effectively and specifically distinguish HCC samples from samples of other cancers.
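
    The repeated random-split procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: each sample already has a classifier score, only the decision threshold is re-estimated on each training split (refitting the weights is omitted), and the test fraction of 0.3, the seed, and the toy data are all hypothetical choices rather than settings from this work.

```python
import random
import statistics


def fit_threshold(scores, labels):
    """Lowest observed score that maximizes training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def accuracy_at(scores, labels, threshold):
    """Fraction of samples classified correctly at the given threshold."""
    return sum((s >= threshold) == bool(y) for s, y in zip(scores, labels)) / len(labels)


def repeated_random_splits(scores, labels, n_repeats=100, test_fraction=0.3, seed=0):
    """Return (mean, standard deviation) of test accuracy over random splits."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    indices = list(range(len(scores)))
    n_test = max(1, int(round(test_fraction * len(indices))))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        t = fit_threshold([scores[i] for i in train], [labels[i] for i in train])
        accuracies.append(
            accuracy_at([scores[i] for i in test], [labels[i] for i in test], t)
        )
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical scores and labels (1 = HCC, 0 = other cancer).
toy_scores = [0.05, 0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mean_acc, sd_acc = repeated_random_splits(toy_scores, toy_labels)
```

    The two returned values correspond to the "mean ± standard deviation" form in which the accuracy and AUC are reported above.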

  3. Conclusion

    This application of LiverAtlas shows that the data resources in this database may facilitate research aimed at improving HCC differential diagnosis.

    Detailed information on patients, tumor tissue specimens, immunohistochemical protocols, and statistical analysis is given in the Supplement.