A Bioinformatics Toolkit: In Silico Tools and Online Resources for Investigating Genetic Variation

Abstract With the advent of large-scale next-generation sequencing initiatives, it is increasingly important to interpret and understand the potential phenotypic influence of identified genetic variation and its significance in the human genome. Bioinformatics analyses can provide useful information to assist with variant interpretation. This review provides an overview of the tools and resources currently available and how they can help predict the impact of genetic variation at the deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein level.


Introduction
Clinical, diagnostic and research groups working in the field of hemostasis and thrombosis generate considerable data concerning genetic variation. Traditionally, this information has been derived from targeted analysis of genes linked to a specific disease phenotype (e.g. investigating von Willebrand factor (VWF) in patients diagnosed with von Willebrand disease). 1 Additional data are also derived from genome-wide association studies (GWAS) aimed at identifying genetic loci that may influence plasma protein levels 2,3 or that are associated with a specific phenotype, e.g. coronary artery disease. 4,5 The advent of next-generation sequencing (NGS) has increased the amount of genetic information obtained from targeted analysis [6][7][8] and is also generating a wealth of information on genetic variation throughout the human genome. 9,10 Although this information represents an invaluable resource, identified genetic variants must be properly interpreted in order to determine whether they have a potential functional effect. Current guidelines from the American College of Medical Genetics and Genomics highlight that many lines of evidence are required to effectively classify genetic variants and assign pathogenicity, 11 one of which is information obtained from bioinformatics analyses. This review aims to provide an overview of the many free in silico tools and resources currently available online that can help clinicians / scientists predict the potential impact of genetic variants at the DNA, RNA and protein level, and therefore assist with variant classification.

Online resources for DNA level investigations
Most reported genetic variants are described at the DNA level, using either genomic coordinates (e.g. chr12:g.6044368T>C) or a specific location within a genetic locus (e.g. VWF:c.2365A>G). Usually, the first stage in evaluating genetic variants is to investigate the literature and databases for existing knowledge.
The residual variation intolerance score (RVIS; Table 1) uses data derived from both ExAC and gnomAD to rank genes based on whether they have more or less common functional genetic variation relative to the genome-wide expectation. 28 A negative RVIS score and low percentile highlight a gene with fewer common functional variants than expected (loss-of-function (LoF) intolerant), while a positive score and high percentile highlight a LoF tolerant gene.
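The two styles of DNA-level description used at the start of this section (chr12:g.6044368T>C and VWF:c.2365A>G) share a regular pattern, which makes them easy to split into their components before querying a database. The following is a minimal sketch only: the `Substitution` type and `parse_substitution` function are hypothetical names introduced for illustration, and the pattern assumes a simple single-base substitution rather than the full variant nomenclature.

```python
import re
from typing import NamedTuple, Optional


class Substitution(NamedTuple):
    """Components of a simple single-base substitution (illustrative type)."""
    reference: str   # chromosome or gene label, e.g. "chr12" or "VWF"
    coord_type: str  # "g" for genomic, "c" for coding DNA
    position: int
    ref_base: str
    alt_base: str


# Assumption: only descriptions of the form <reference>:<g|c>.<position><ref>><alt>
# are handled; deletions, insertions and intronic offsets are not.
_PATTERN = re.compile(
    r"^(?P<reference>[^:]+):(?P<coord_type>[gc])\."
    r"(?P<position>\d+)(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)


def parse_substitution(description: str) -> Optional[Substitution]:
    """Return the parsed components, or None if the pattern does not match."""
    match = _PATTERN.match(description.strip())
    if match is None:
        return None
    return Substitution(
        reference=match["reference"],
        coord_type=match["coord_type"],
        position=int(match["position"]),
        ref_base=match["ref_base"],
        alt_base=match["alt_base"],
    )


if __name__ == "__main__":
    for text in ("chr12:g.6044368T>C", "VWF:c.2365A>G"):
        print(parse_substitution(text))
```

Returning `None` for anything outside the simple pattern keeps the sketch honest about its limits; a dedicated nomenclature parser would be needed for other variant types.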

Online resources for RNA level investigations
Analysis of genetic variation at an RNA level primarily concerns tools that predict the effect of variants on RNA splicing. However, genetic variants can influence RNA in other ways, so additional tools / resources can also be of use.

RNA splicing prediction tools
Genetic variants that occur within consensus motifs for 5' splice donors, 3' splice acceptors or intronic branch points can interfere with the interaction of the spliceosome complex, disrupting the removal of intronic sequence from the mature RNA and causing full / partial exon skipping [29][30][31][32] or intron retention. 30 In addition, deep intronic variants can activate cryptic splice acceptors or donors, causing intron retention 33 or the formation of a pseudo-exon. 34,35 There are several in silico tools available to help predict the effect of variants on RNA splicing (Table 2).
Genetic variants (e.g. c.2365A>G and c.2385T>C in VWF 53) can also influence the secondary structure of transcribed mRNA, thereby affecting overall RNA stability, which in turn can influence RNA production. 54 Rtools provides a useful suite of prediction programs designed to compare input wild-type and variant DNA sequences and to highlight any differences in RNA secondary structure (Table 2).
The abundance of tRNA molecules available for a given codon can affect the rate at which mature mRNA is translated into protein (Table 2).
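As a rough illustration of why variants in splice site motifs matter, the sketch below checks whether a single-base substitution removes the canonical GT dinucleotide at the 5' end of an intron or the AG dinucleotide at its 3' end. This is a deliberately simplified check and not how the prediction tools in this section work; they score wider consensus motifs. The function names and the example intron sequence are hypothetical.

```python
from typing import Dict, List


def canonical_sites_intact(intron: str) -> Dict[str, bool]:
    """Report whether an intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return {"donor": intron[:2] == "GT", "acceptor": intron[-2:] == "AG"}


def apply_substitution(sequence: str, index: int, alt_base: str) -> str:
    """Return the sequence with the base at a 0-based index replaced."""
    if not 0 <= index < len(sequence):
        raise ValueError("index is outside the sequence")
    return sequence[:index] + alt_base + sequence[index + 1:]


def disrupted_sites(intron: str, index: int, alt_base: str) -> List[str]:
    """List canonical sites that are intact in the wild type but lost in the variant."""
    before = canonical_sites_intact(intron)
    after = canonical_sites_intact(apply_substitution(intron, index, alt_base))
    return [site for site in before if before[site] and not after[site]]


if __name__ == "__main__":
    # Made-up intron sequence, used purely for illustration.
    toy_intron = "GTAAGTCCTTTCTAACTTTTTCCTCAG"
    print(disrupted_sites(toy_intron, 0, "A"))                    # first base of the donor
    print(disrupted_sites(toy_intron, len(toy_intron) - 1, "C"))  # last base of the acceptor
    print(disrupted_sites(toy_intron, 5, "A"))                    # outside both dinucleotides
```

A variant that leaves both dinucleotides intact can still affect splicing (for example by weakening the surrounding motif or activating a cryptic site), so an empty result here says nothing about pathogenicity.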

Online resources for protein level investigations
Analysis at the protein level utilizes those tools applicable to predicting the effect of non-synonymous amino acid variation and those resources that provide further information on the structure and function of proteins found to harbor potentially pathogenic variants.

Amino acid prediction tools
Non-synonymous amino acid substitutions can have profound effects on protein structure and function, leading to disease. It is therefore useful to predict the impact of these changes on a protein in order to differentiate disease-causing variants from those that are not. 56 Several studies have demonstrated that variants affecting protein function are more frequently found at positions conserved throughout evolution. 57 In addition, protein stability is crucial for molecular function, so variants that affect it are also more likely to be deleterious. 58,59 Based on these assumptions, multiple prediction tools have been developed that use sequence and/or structural information to predict the pathogenicity of a given variant (Table 3).
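The idea that functionally important positions tend to be conserved can be made concrete with a very simple per-column score over a multiple sequence alignment. The sketch below reports, for each column, the fraction of non-gap residues that match the most common residue. It is an assumption-laden illustration rather than the scoring scheme of any tool in Table 3; the function name and the toy alignment are hypothetical.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of non-gap residues equal to the most common residue, per column."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("aligned sequences must all have the same length")

    scores = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != "-"]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores


if __name__ == "__main__":
    # Made-up aligned sequences, used purely for illustration.
    toy_alignment = ["MKTAY", "MKSAY", "MRTAY", "MKTA-"]
    print(column_conservation(toy_alignment))
```

A substitution at a column scoring close to 1.0 would, under the assumption described above, deserve closer attention than one at a highly variable column.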
PolyPhen-2 uses both sequence and structural information to predict the effect of a given variant. 61 This is achieved by constructing a multiple sequence alignment (MSA), performing functional annotation of single nucleotide variants (SNVs), extracting protein sequence and structural information and building a conservation profile. Based on these properties, PolyPhen-2 then estimates the probability that the missense variant is damaging and classifies it as 'probably damaging', 'possibly damaging' or 'benign'. 62
It is important to remember that each tool and the algorithm it employs will provide varying levels of prediction accuracy. When tested against known deleterious variants, the impact predictions of most tools were found to be accurate in ~60-80% of cases. 63 A recent study assessing the use of in silico tools to predict the pathogenicity of known deleterious variants in antithrombin found that performance varied depending on the localization of the substitution within the secondary structure, with those in α-helices often misclassified as benign. 64 In addition, variants known to disrupt post-translational modifications were also misclassified. 64 As with RNA splicing predictions, it is therefore useful to utilize several prediction tools to achieve an accurate consensus (e.g. as in hemostasis / thrombosis studies).

Tools for assessing protein stability
The effect of an amino acid substitution on protein stability and function is an important consideration when trying to determine pathogenicity. Using a Protein Data Bank (PDB) file and a specified variant, tools such as Site Directed Mutator (SDM; Table 3) can calculate a stability difference score between the wild-type and the variant protein. 69 Where the tertiary structure of a protein of interest is unknown and no PDB structure file exists, machine learning programs such as MUpro (Table 3) predict protein stability changes using primary sequence data alone. 70 However, while these tools may be useful in a research context, providing an extra line of evidence, they do not make any predictions about whether a substitution is damaging or deleterious.
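The recommendation to combine several amino acid prediction tools into a consensus can be sketched as a simple majority vote. This is one possible way to combine calls, not a method prescribed by the studies cited here: the function name, the tool labels and the requirement that more than half of the tools agree are all assumptions made for illustration, and each tool's output is assumed to have already been mapped to a shared label such as "damaging" or "benign".

```python
from collections import Counter
from typing import Dict


def consensus_call(predictions: Dict[str, str], min_fraction: float = 0.5) -> str:
    """Return the label given by more than min_fraction of tools, else 'no consensus'."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, votes = Counter(predictions.values()).most_common(1)[0]
    if votes / len(predictions) > min_fraction:
        return label
    return "no consensus"


if __name__ == "__main__":
    # Hypothetical tool names and calls, used purely for illustration.
    calls = {"tool_a": "damaging", "tool_b": "damaging", "tool_c": "benign"}
    print(consensus_call(calls))
```

Reporting "no consensus" when the tools are evenly split avoids presenting a tie as a confident call, which fits the point that these predictions are one line of evidence among several.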

Other useful protein tools and resources
There are several resources that can be utilized in the analysis of proteins (e.g. to identify protein domains / motifs or to investigate protein-protein interactions). The Swiss Institute of Bioinformatics ExPASy resource (Table 3) contains a comprehensive list of protein analysis tools along with useful summary descriptions. 71 PDB provides 3D protein models that when imported into specialized molecular graphics programs such as Jmol and PyMOL (

Concluding remarks
In silico tools and online resources serve as useful sources of information for clinicians / scientists investigating genetic variation. However, this information is only a prediction and not a definitive answer; it can provide evidence to link a variant to disease pathogenicity or help confirm / direct further investigations, e.g. in vitro and in vivo studies. For the most accurate and informative analysis of a variant, users should consider its effect at the DNA, RNA and protein level (Figure 2), utilizing the tools / resources highlighted in this review as a bioinformatics toolkit (Figure 3).