Variant ID in NyuWa Chinese Population Variant Database (NCVD)

A variant ID in NCVD has the format Chromosome-Position-ReferenceAllele-AlternativeAllele, for example 13-48045719-C-T. The position, reference allele and alternative allele of each variant are left-aligned and normalized (https://genome.sph.umich.edu/wiki/Variant_Normalization). Position coordinates in NCVD are based on the human genome assembly GRCh38/hg38.
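The ID format described above can be split into its four fields with a few lines of Python. This is a minimal sketch, not NCVD's own code: the names `VariantID` and `parse_variant_id` are hypothetical, and the restriction of alleles to the letters A, C, G, T and N is an assumption made for illustration.

```python
from typing import NamedTuple


class VariantID(NamedTuple):
    """The four fields of a Chromosome-Position-Ref-Alt variant ID."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> VariantID:
    """Split an ID such as '13-48045719-C-T' into typed fields."""
    parts = variant_id.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields: {variant_id!r}")
    chrom, pos, ref, alt = parts
    if not chrom:
        raise ValueError("chromosome field is empty")
    if not pos.isdigit() or int(pos) < 1:
        raise ValueError(f"position must be a positive integer: {pos!r}")
    ref, alt = ref.upper(), alt.upper()
    # Assumption: alleles are plain nucleotide strings.
    for allele in (ref, alt):
        if not allele or set(allele) - set("ACGTN"):
            raise ValueError(f"unexpected allele: {allele!r}")
    return VariantID(chrom, int(pos), ref, alt)


example = parse_variant_id("13-48045719-C-T")
print(example.chrom, example.pos, example.ref, example.alt)
```

Keeping the chromosome as a string lets the same function handle names such as X or Y.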

Search result

After a search key is submitted, a result table lists the variants that match the query. The table gives an overview of each variant: Variant ID, dbSNP ID, Ensembl gene annotation (Region, Gene ID, Exonic Function, Consequence), and statistics from our own 2,999 high-depth Chinese genome sequences (Allele Count, Allele Number, Allele Frequency).
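The three statistics in the result table are related in a simple way. The sketch below assumes the conventional definition, Allele Frequency = Allele Count / Allele Number, which the text above does not state explicitly; the function name `allele_frequency` is hypothetical. With 2,999 diploid genomes, the Allele Number of an autosomal site can be at most 2 × 2,999 = 5,998 and is lower when some genotypes are missing.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Return AC / AN, assuming the conventional definition of allele frequency."""
    if allele_number <= 0:
        raise ValueError("allele number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele count must lie between 0 and the allele number")
    return allele_count / allele_number


# Upper bound on Allele Number at an autosomal site for 2,999 diploid genomes.
MAX_AUTOSOMAL_AN = 2 * 2999

# Illustrative values only, not taken from the database.
print(allele_frequency(3, MAX_AUTOSOMAL_AN))
```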

Variant annotation page

Basic information

Each variant in the search result links to a detailed variant annotation page, which is organized into modules. The first module gives the basic information of the variant, including Allele Count, Allele Number, Allele Frequency and Number of Homozygotes, consistent with the search result. If the variant is also present in the dbSNP or gnomAD database, links to the variant in those databases are provided. The Browser link opens the genome browser on the region from 100 bp upstream to 100 bp downstream of the variant.
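The 100 bp window used for the Browser link can be computed directly from the variant position. This is a sketch under stated assumptions: the function name `browser_window` and the `chrN:start-end` region string are illustrative choices, not a documented NCVD or genome browser interface.

```python
def browser_window(chrom: str, pos: int, flank: int = 100) -> str:
    """Return a region string covering `flank` bp on each side of `pos`.

    Assumption: 1-based coordinates and a 'chrN:start-end' region format.
    """
    if pos < 1 or flank < 0:
        raise ValueError("pos must be >= 1 and flank must be >= 0")
    start = max(1, pos - flank)  # never run past the start of the chromosome
    end = pos + flank
    return f"chr{chrom}:{start}-{end}"


print(browser_window("13", 48045719))  # chr13:48045619-48045819
```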

Quality metrics

The next module presents the genotype quality metrics and site quality metrics of the variant. The distributions of these metrics are all computed from our own genome sequencing data. The genotype quality metrics are genotype quality (GQ), approximate read depth (DP) and allele balance for heterozygotes. The site quality metrics are SiteQuality, FS, MQRankSum, InbreedingCoeff, ReadPosRankSum, VQSLOD, QD, DP, BaseQRankSum, MQ and ClippingRankSum. FS is the Phred-scaled p-value of Fisher's exact test for strand bias. MQRankSum is the Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities. InbreedingCoeff is the inbreeding coefficient estimated from the per-sample genotype likelihoods compared against the Hardy-Weinberg expectation. ReadPosRankSum is the Z-score from the Wilcoxon rank sum test of Alt vs. Ref read position bias. VQSLOD is the log odds of being a true variant versus being false under the trained Gaussian mixture model; a low value is likely the reason a variant was filtered out. QD is the variant confidence (quality) normalized by depth. DP is the depth of informative coverage for each sample; reads with MQ=255 or with bad mates are filtered. BaseQRankSum is the Z-score from the Wilcoxon rank sum test of Alt vs. Ref base qualities. MQ is the root mean square (RMS) mapping quality. ClippingRankSum is the Z-score from the Wilcoxon rank sum test of Alt vs. Ref number of hard-clipped bases.
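Two of the quantities above are easy to illustrate numerically. A Phred-scaled value Q, such as FS, corresponds to a p-value of 10^(-Q/10). Allele balance for a heterozygote is taken here to be the fraction of reads supporting the alternative allele, which is a common convention but an assumption about how NCVD computes it. All function names are hypothetical.

```python
import math
from typing import Optional


def phred_to_p(phred: float) -> float:
    """Convert a Phred-scaled value Q to a probability: p = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


def p_to_phred(p: float) -> float:
    """Convert a probability in (0, 1] to the Phred scale: Q = -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def allele_balance(ref_reads: int, alt_reads: int) -> Optional[float]:
    """Fraction of reads carrying the alternative allele (assumed definition).

    Returns None when the genotype has no supporting reads.
    """
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total


# FS = 30 corresponds to a strand-bias p-value of about 0.001.
print(phred_to_p(30))
# A heterozygote with 12 reference and 8 alternative reads.
print(allele_balance(12, 8))
```

Under this definition, a well-behaved heterozygous call is expected to have an allele balance near 0.5.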

External data population frequency

The database also reports the alternative allele frequency in the 1000 Genomes Project Phase 3 (1KGP3) dataset (2,504 genomes) [1] and the gnomAD version 3 genome dataset (71,702 genomes) (https://gnomad.broadinstitute.org). The samples in both datasets are divided into populations as follows.

Dataset             Population   Description                       Genomes
1KGP3               AFR          African                               661
                    AMR          American                              347
                    EAS          East Asian                            504
                    EUR          European                              503
                    SAS          South Asian                           489
                    total                                            2,504
gnomAD v3 Genomes   AFR          African/African-American           21,042
                    AMI          Amish                                 450
                    AMR          Latino/Admixed American             6,835
                    ASJ          Ashkenazi Jewish                    1,662
                    EAS          East Asian                          1,567
                    FIN          Finnish                             5,244
                    NFE          Non-Finnish European               32,299
                    SAS          South Asian                         1,526
                    OTH          Other (population not assigned)     1,077
                    total                                           71,702

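
The population counts in the table above can be held in a small lookup structure, for example to confirm that the per-population genome counts add up to the stated totals. The names `POPULATION_GENOMES`, `STATED_TOTALS` and `dataset_totals` are hypothetical; the numbers are copied from the table.

```python
# Genome counts per population, copied from the table.
POPULATION_GENOMES = {
    "1KGP3": {
        "AFR": 661, "AMR": 347, "EAS": 504, "EUR": 503, "SAS": 489,
    },
    "gnomAD v3 Genomes": {
        "AFR": 21042, "AMI": 450, "AMR": 6835, "ASJ": 1662, "EAS": 1567,
        "FIN": 5244, "NFE": 32299, "SAS": 1526, "OTH": 1077,
    },
}

# Totals as stated in the table.
STATED_TOTALS = {"1KGP3": 2504, "gnomAD v3 Genomes": 71702}


def dataset_totals(populations: dict) -> dict:
    """Sum the genome counts of every population within each dataset."""
    return {name: sum(counts.values()) for name, counts in populations.items()}


for name, total in dataset_totals(POPULATION_GENOMES).items():
    status = "matches" if total == STATED_TOTALS[name] else "differs from"
    print(f"{name}: {total} genomes ({status} the stated total)")
```
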
Region annotation

The region annotation is a gene-based annotation against Ensembl Gene and RefSeq Gene, produced with the software ANNOVAR [2]. The Region column contains one or two of the following items: exonic, splicing, UTR5, UTR3, intronic, ncRNA_exonic, ncRNA_splicing, ncRNA_intronic, upstream, downstream, intergenic. The Gene ID column gives the name of the gene in which the variant is located, and the Gene Detail column gives the position of the variant relative to that gene. The Exonic Function column gives the functional consequence of the variant (possible values in this field are: nonsynonymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, unknown). The Consequence column contains the gene name, the transcript identifier and the sequence change in the corresponding transcript (e.g. NUDT15:NM_018283:exon3:c.C415T:p.R139C).
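A Consequence entry like the example above is a colon-separated string, so it can be split into named fields. This sketch handles only the five-field form shown in the example; entries with a different number of fields are rejected rather than guessed at. The names `Consequence` and `parse_consequence` are hypothetical.

```python
from typing import NamedTuple


class Consequence(NamedTuple):
    """Fields of an entry such as NUDT15:NM_018283:exon3:c.C415T:p.R139C."""
    gene: str
    transcript: str
    exon: str
    cdna_change: str
    protein_change: str


def parse_consequence(entry: str) -> Consequence:
    """Split a five-field, colon-separated Consequence entry."""
    fields = entry.strip().split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 colon-separated fields: {entry!r}")
    return Consequence(*fields)


parsed = parse_consequence("NUDT15:NM_018283:exon3:c.C415T:p.R139C")
print(parsed.gene, parsed.transcript, parsed.protein_change)
```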

Nonsynonymous impact

The nonsynonymous impact module presents the results of five prediction tools (SIFT, PolyPhen2_HDIV, PolyPhen2_HVAR, FATHMM, CADD) for nonsynonymous SNVs. This information is also taken from ANNOVAR [2].

Loss of function prediction

Loss of Function (LoF) variants are identified with LOFTEE, a package recently developed by the gnomAD group that classifies stop-gained, splice-site-disrupting and frameshift variants as “low-confidence” (LC) or “high-confidence” (HC) LoF variants [3].

Disease annotation

The disease annotation comes from the ClinVar database [4] and is also taken from ANNOVAR [2].

Pharmacogenomics

Pharmacogenomic variants and the related drug information were collected from 34 Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines (https://cpicpgx.org/). The resulting pharmacogenomics annotation was then added to the corresponding variants in this database.

Browser

The Browser link opens a genome browser page that shows the coordinates of the variant on the human genome along with other tracks, such as Genes and Gene Predictions, Comparative Genomics, and Variation.

References

[1] AUTON A, ABECASIS G R, ALTSHULER D M, et al. A global reference for human genetic variation [J]. Nature, 2015, 526(7571): 68-74.
[2] WANG K, LI M Y, HAKONARSON H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data [J]. Nucleic Acids Research, 2010, 38(16): e164.
[3] KARCZEWSKI K J, FRANCIOLI L C, TIAO G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans [J]. Nature, 2020, 581(7809): 434-43.
[4] LANDRUM M J, LEE J M, BENSON M, et al. ClinVar: improving access to variant interpretations and supporting evidence [J]. Nucleic Acids Research, 2018, 46(D1): D1062-D7.