In silico classification of the pathogenic status of somatic variants has shown promise for promoting the clinical use of genetic tests. The majority of available classification tools are designed around the characteristics of germline variants or a combination of germline and somatic variants. The significance of somatic variants in cancer initiation and progression calls for classifiers that are specialized for the pathogenic status of cancer somatic variants and trained on cancer somatic variants. We established a gold standard exclusively for cancer somatic single nucleotide variants (SNVs) collected from the Catalogue of Somatic Mutations in Cancer (COSMIC). We developed two support vector machine (SVM) classifiers based on genomic features of cancer somatic SNVs located in coding and non-coding regions of the genome, respectively. The SVM classifiers achieved areas under the ROC curve of 0.94 and 0.89 for classifying the pathogenic status of coding and non-coding
cancer somatic SNVs, respectively. Our models outperform two well-known classification tools, FATHMM-XF and CScape, in classifying both coding and non-coding cancer somatic variants. Furthermore, we applied our models to predict the pathogenic status of somatic variants identified in young
breast cancer patients from the METABRIC and TCGA-BRCA studies. At a classification threshold of 0.8, our "coding" model predicted 1,853 positive SNVs (out of 6,910) in the TCGA-BRCA dataset and 500 positive SNVs (out of 1,882) in the METABRIC dataset. Through comparative survival analysis of the positive predictions from our models, we identified a young-specific pathogenic somatic variant with prognostic potential for early-onset breast cancer in young women.
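
The two quantities reported above, the area under the ROC curve and the number of positive calls at a score threshold of 0.8, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `roc_auc` and `call_positives` are hypothetical, the toy labels and scores are made up for illustration, and treating a score greater than or equal to the threshold as positive is an assumption, since the abstract does not say whether the comparison is inclusive. Only the four cohort counts come from the abstract.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    Equals the probability that a randomly chosen positive example scores
    higher than a randomly chosen negative one; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def call_positives(scores: Sequence[float], threshold: float = 0.8) -> List[bool]:
    """Flag variants as predicted pathogenic.

    Assumption: a score >= threshold counts as positive.
    """
    return [s >= threshold for s in scores]


# Hypothetical toy data: 1 = pathogenic, 0 = neutral.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.7, 0.8, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_calls = call_positives(toy_scores)

# Fraction of SNVs called positive, using the counts reported in the abstract.
tcga_fraction = 1853 / 6910
metabric_fraction = 500 / 1882

print(f"toy AUC: {toy_auc:.2f}")
print(f"TCGA-BRCA positive fraction: {tcga_fraction:.3f}")
print(f"METABRIC positive fraction: {metabric_fraction:.3f}")
```

With the reported counts, the "coding" model calls roughly 27% of SNVs positive in each cohort (about 26.8% in TCGA-BRCA and 26.6% in METABRIC), so the positive rate at this threshold is similar across the two datasets.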