Variants in NCVD are identified in the format Chromosome-Position-ReferenceAllele-AlternativeAllele, for example 13-48045719-C-T. The position, reference allele, and alternative allele of each variant are left-aligned and normalized (https://genome.sph.umich.edu/wiki/Variant_Normalization). Position coordinates in NCVD are based on the human genome assembly GRCh38/hg38.
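As an illustration of the identifier format described above, the following is a minimal sketch of splitting a variant ID into its four fields. The names `Variant` and `parse_variant_id` are hypothetical helpers, not part of NCVD, and the validation rules (positive integer position, non-empty alleles) are assumptions.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for the four fields of a variant ID."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split 'Chromosome-Position-ReferenceAllele-AlternativeAllele' into fields."""
    parts = variant_id.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(parts)}")
    chrom, pos, ref, alt = parts
    # Assumption: positions are 1-based positive integers.
    if not pos.isdigit() or int(pos) < 1:
        raise ValueError(f"position must be a positive integer: {pos!r}")
    # Assumption: both alleles must be present after normalization.
    if not chrom or not ref or not alt:
        raise ValueError("chromosome and alleles must be non-empty")
    return Variant(chrom, int(pos), ref.upper(), alt.upper())


if __name__ == "__main__":
    print(parse_variant_id("13-48045719-C-T"))
```

Keeping the chromosome as a string rather than an integer lets the same helper handle names such as X or Y.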
Searches can be run from the Home page and the Search page. Currently, the database supports the following 4 types of search keys to query a variant, gene, or region:
After a search key is submitted, a results table lists the variants that match the query. The table gives an overview of each variant: Variant ID, dbSNP ID, Ensembl gene annotation (Region, Gene ID, Exonic Function, Consequence), and summary statistics from our own 2,999 Chinese genomes sequenced at high depth (Allele Count, Allele Number, Allele Frequency).
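The three statistics in the results table are related by the standard definition of allele frequency, the allele count divided by the allele number. The sketch below shows that relationship; the function name `allele_frequency` and the example counts are made up for illustration, and it is an assumption that NCVD derives its Allele Frequency column in exactly this way.

```python
from typing import Optional


def allele_frequency(allele_count: int, allele_number: int) -> Optional[float]:
    """Return AC / AN, or None when no alleles were called at the site."""
    if allele_number == 0:
        # No called genotypes, so the frequency is undefined.
        return None
    if allele_number < 0 or not 0 <= allele_count <= allele_number:
        raise ValueError("require 0 <= allele_count <= allele_number")
    return allele_count / allele_number


if __name__ == "__main__":
    # Illustrative numbers only: 2,999 diploid genomes, all called at an
    # autosomal site, give an allele number of 2 * 2999 = 5998.
    print(allele_frequency(3, 2 * 2999))
```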
Every variant in the search results links to a detailed variant annotation page, which is organized into modules. The first module gives basic information about the variant: Allele Count, Allele Number, Allele Frequency, and Number of Homozygotes, consistent with the search results. If the variant is also present in the dbSNP or gnomAD database, links to the variant in those databases are provided. The Browser link opens the genome browser on the region from 100 bp upstream to 100 bp downstream of the variant.
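The 100 bp flanking window shown in the browser can be sketched as follows. This is an assumption-laden illustration: `browser_window` is a hypothetical helper, the window is measured from the variant position only (the page does not say whether the length of the reference allele is also taken into account), and the start is clamped at coordinate 1.

```python
from typing import Tuple


def browser_window(chrom: str, pos: int, flank: int = 100) -> Tuple[str, int, int]:
    """Return (chromosome, start, end) covering `flank` bp on each side of pos."""
    if pos < 1 or flank < 0:
        raise ValueError("pos must be >= 1 and flank must be >= 0")
    start = max(1, pos - flank)  # do not run off the start of the chromosome
    end = pos + flank
    return chrom, start, end


if __name__ == "__main__":
    # Window around the example variant 13-48045719-C-T.
    print(browser_window("13", 48045719))
```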
Next, the page presents the Genotype Quality Metrics and Site Quality Metrics of the variant. The distributions of these metrics are all computed from our own genome sequencing data. The genotype quality metrics are genotype quality (GQ), approximate read depth (DP), and allele balance for heterozygotes. The site quality metrics are SiteQuality, FS, MQRankSum, InbreedingCoeff, ReadPosRankSum, VQSLOD, QD, DP, BaseQRankSum, MQ, and ClippingRankSum, where:

- FS: Phred-scaled p-value from Fisher's exact test to detect strand bias.
- MQRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities.
- InbreedingCoeff: inbreeding coefficient estimated from the per-sample genotype likelihoods, compared against the Hardy-Weinberg expectation.
- ReadPosRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref read position bias.
- VQSLOD: log odds of being a true variant versus being false under the trained Gaussian mixture model; a low value is likely the reason a variant was filtered out.
- QD: variant confidence (quality) normalized by depth.
- DP: depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered.
- BaseQRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref base qualities.
- MQ: root mean square (RMS) mapping quality.
- ClippingRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref number of hard-clipped bases.
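Two of the metrics above reduce to simple arithmetic. A Phred-scaled p-value, as used by FS, is -10 times the base-10 logarithm of the p-value, so larger values indicate stronger evidence of strand bias. Allele balance for a heterozygote is taken here to be the fraction of reads supporting the alternative allele; that definition is an assumption, since the page does not spell it out. The function names are hypothetical.

```python
import math
from typing import Optional


def phred_scale(p_value: float) -> float:
    """Convert a p-value in (0, 1] to the Phred scale: -10 * log10(p)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return -10.0 * math.log10(p_value)


def phred_to_p(phred: float) -> float:
    """Invert the Phred scale back to a p-value."""
    return 10.0 ** (-phred / 10.0)


def allele_balance(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Assumed definition: alt reads / (ref reads + alt reads)."""
    depth = ref_reads + alt_reads
    if depth == 0:
        return None  # no informative reads at this genotype
    return alt_reads / depth


if __name__ == "__main__":
    print(phred_scale(0.001))     # a p-value of 0.001 on the Phred scale
    print(allele_balance(12, 8))  # 8 alt reads out of 20 total
```

An allele balance near 0.5 is what one would expect for a true heterozygous call; values far from 0.5 can point to a genotyping artifact.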
The database also reports the alternative allele frequency from the 1000 Genomes Project Phase 3 (1KGP3) dataset (2,504 genomes) [1] and the gnomAD version 3 genome dataset (71,702 genomes) (https://gnomad.broadinstitute.org). The samples in both datasets are divided into populations as follows.
Dataset | Population | Description | Genomes |
---|---|---|---|
1KGP3 | AFR | African | 661 |
1KGP3 | AMR | American | 347 |
1KGP3 | EAS | East Asian | 504 |
1KGP3 | EUR | European | 503 |
1KGP3 | SAS | South Asian | 489 |
1KGP3 | Total | | 2,504 |
gnomAD v3 Genomes | AFR | African/African-American | 21,042 |
gnomAD v3 Genomes | AMI | Amish | 450 |
gnomAD v3 Genomes | AMR | Latino/Admixed American | 6,835 |
gnomAD v3 Genomes | ASJ | Ashkenazi Jewish | 1,662 |
gnomAD v3 Genomes | EAS | East Asian | 1,567 |
gnomAD v3 Genomes | FIN | Finnish | 5,244 |
gnomAD v3 Genomes | NFE | Non-Finnish European | 32,299 |
gnomAD v3 Genomes | SAS | South Asian | 1,526 |
gnomAD v3 Genomes | OTH | Other (population not assigned) | 1,077 |
gnomAD v3 Genomes | Total | | 71,702 |
The region annotation is a gene-based annotation against Ensembl Gene and RefSeq Gene, produced with the software ANNOVAR [2]. The Region column contains one or two of the following values: exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_splicing, ncRNA_intronic, upstream, downstream, intergenic. The Gene ID and Gene Detail columns give the name of the gene and the relative position within the gene where the variant is located, respectively. The Exonic Function column gives the functional consequence of the variant (possible values in this field are: nonsynonymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, unknown). The Consequence column contains the gene name, the transcript identifier, and the sequence change in the corresponding transcript (e.g. NUDT15:NM_018283:exon3:c.C415T:p.R139C).
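The colon-separated Consequence string in the example above can be split into its parts as in the following sketch. The class and function names are hypothetical, the field order is inferred from the single example given, and treating the protein change as optional is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptConsequence:
    """Hypothetical record for one Consequence entry."""
    gene: str
    transcript: str
    exon: str
    cdna_change: str
    protein_change: Optional[str]


def parse_consequence(entry: str) -> TranscriptConsequence:
    """Parse 'GENE:TRANSCRIPT:EXON:c.CHANGE[:p.CHANGE]' into named fields."""
    fields = entry.strip().split(":")
    # Assumption: the protein change may be absent, so 4 or 5 fields are accepted.
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 colon-separated fields: {entry!r}")
    gene, transcript, exon, cdna_change = fields[:4]
    protein_change = fields[4] if len(fields) == 5 else None
    return TranscriptConsequence(gene, transcript, exon, cdna_change, protein_change)


if __name__ == "__main__":
    print(parse_consequence("NUDT15:NM_018283:exon3:c.C415T:p.R139C"))
```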
The nonsynonymous impact module presents the results of 5 prediction tools (SIFT, PolyPhen2_HDIV, PolyPhen2_HVAR, FATHMM, CADD) for nonsynonymous SNVs. This information is also taken from ANNOVAR [2].
Loss-of-function (LoF) variants are identified with LOFTEE, a package recently developed by the gnomAD group that assesses stop-gained, splice-site-disrupting, and frameshift variants as “low-confidence” (LC) or “high-confidence” (HC) LoF variants [3].
The disease annotation comes from the ClinVar disease database [4] and is also obtained through ANNOVAR [2].
Pharmacogenomic variants and related drug information were collected from 34 Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines (https://cpicpgx.org/). The pharmacogenomics annotation was then added to the corresponding variants in this database.
The Browser link opens the genome browser page, which shows the variant's coordinates on the human genome alongside other tracks such as Genes and Gene Predictions, Comparative Genomics, and Variation.