This track contains information about a subset of the
single nucleotide polymorphisms
and small insertions and deletions (indels) — collectively Simple
Nucleotide Polymorphisms — from
dbSNP
build 150, available from
ftp.ncbi.nlm.nih.gov/snp.
Only SNPs that have a minor allele frequency (MAF) of at least 1% and
are mapped to a single location in the reference genome assembly are
included in this subset. Frequency data are not available for all SNPs,
so this subset is incomplete.
Allele counts from all submissions that include frequency data are combined
when determining MAF; for example, the allele counts from
the 1000 Genomes Project and an independent submitter may be combined for the
same variant.
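The pooling described above can be sketched as follows. This is a minimal illustration, not UCSC's pipeline: the input shape (one dict of allele counts per submission), the function names, and the choice to treat the second most frequent allele as the minor allele are all assumptions made for the example.

```python
from collections import Counter

def combined_maf(submissions):
    """Pool per-allele chromosome counts across submissions and return the MAF.

    `submissions` is a list of dicts mapping allele -> observed chromosome
    count (a hypothetical input shape, not a dbSNP file format).
    Returns None when no counts exist or fewer than two alleles were seen.
    """
    totals = Counter()
    for counts in submissions:
        totals.update(counts)  # adds counts allele by allele
    n = sum(totals.values())
    if n == 0 or len(totals) < 2:
        return None
    freqs = sorted((c / n for c in totals.values()), reverse=True)
    # Assumption: the minor allele is the second most frequent allele.
    return freqs[1]

def is_common(submissions, threshold=0.01):
    """True when the pooled MAF is at least `threshold` (1% by default)."""
    maf = combined_maf(submissions)
    return maf is not None and maf >= threshold
```

For instance, pooling `{"A": 990, "G": 10}` with `{"A": 95, "G": 5}` gives 15 minor-allele observations out of 1100 chromosomes, which passes the 1% threshold.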
dbSNP provides
download files
in the
Variant Call Format (VCF)
that include a "COMMON" flag in the INFO column. That flag is assigned by a different method,
and the flagged variants are generally a superset of the UCSC Common set.
dbSNP uses frequency data from the
1000 Genomes Project
only, and considers a variant COMMON if it has a MAF of at least 0.01 in any of the five
super-populations:
African (AFR)
Admixed American (AMR)
East Asian (EAS)
European (EUR)
South Asian (SAS)
In build 151 (which has replaced build 150 on the dbSNP web and download site),
dbSNP marks approximately 38M variants as COMMON; 23M of those have a
global MAF < 0.01. The remainder should be in agreement with UCSC's Common subset.
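The difference between the two criteria can be illustrated with a short sketch. It assumes a hypothetical input mapping each super-population to a pair (minor allele count, total chromosomes) for the same allele; the function names are invented for the example, and the pooled calculation stands in for a global MAF.

```python
SUPER_POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")

def population_mafs(counts):
    """counts: population -> (minor_allele_count, total_chromosomes)."""
    return {pop: minor / total
            for pop, (minor, total) in counts.items() if total > 0}

def global_maf(counts):
    """Pooled frequency of the minor allele across all populations."""
    minor = sum(m for m, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return minor / total if total else None

def common_in_any_population(counts, threshold=0.01):
    """COMMON-style test: MAF >= threshold in at least one super-population."""
    mafs = population_mafs(counts)
    return any(mafs.get(pop, 0.0) >= threshold for pop in SUPER_POPULATIONS)

def common_globally(counts, threshold=0.01):
    """Global test: pooled MAF >= threshold."""
    maf = global_maf(counts)
    return maf is not None and maf >= threshold
```

A variant seen on 30 of 1000 AFR chromosomes and on none of 1000 chromosomes in each of the other four super-populations passes the per-population test (3% in AFR) but fails the global one (30/5000 = 0.6%).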
The selection of SNPs with a minor allele frequency of 1% or greater
is an attempt to identify variants that appear to be reasonably common
in the general population. Taken as a set, common variants should be
less likely to be associated with severe genetic diseases due to the
effects of natural selection,
following the view that deleterious variants are not likely to become
common in the population.
However, the significance of any particular variant should be interpreted
only by a trained medical geneticist using all available information.
The remainder of this page is identical on the following tracks:
Common SNPs(150) - SNPs with >= 1% minor allele frequency (MAF), mapping
only once to reference assembly.
Flagged SNPs(150) - SNPs < 1% minor allele frequency (MAF) (or unknown),
mapping only once to reference assembly,
flagged in dbSNP as "clinically associated"
-- not necessarily a risk allele!
Mult. SNPs(150) - SNPs mapping in more than one place on reference assembly.
All SNPs(150) - all SNPs from dbSNP mapping to reference assembly.
Interpreting and Configuring the Graphical Display
Variants are shown as single tick marks at most zoom levels.
When viewing the track at or near base-level resolution, the displayed
width of the SNP corresponds to the width of the variant in the reference
sequence. Insertions are indicated by a single tick mark displayed between
two nucleotides, single nucleotide polymorphisms are displayed as the width
of a single base, and multiple nucleotide variants are represented by a
block that spans two or more bases.
On the track controls page, SNPs can be colored and/or filtered from the
display according to several attributes:
Class: Describes the observed alleles
Single - single nucleotide variation: all observed alleles are single nucleotides
(can have 2, 3 or 4 alleles)
Microsatellite - the observed allele from dbSNP is a variation in counts of short tandem repeats
Named - the observed allele from dbSNP is given as a text name instead of raw sequence, e.g., (Alu)/-
No Variation - the submission reports an invariant region in the surveyed sequence
Mixed - the cluster contains submissions from multiple classes
Multiple Nucleotide Polymorphism (MNP) - the alleles are all of the same length, and length > 1
Insertion - the polymorphism is an insertion relative to the reference assembly
Deletion - the polymorphism is a deletion relative to the reference assembly
Unknown - no classification provided by data contributor
Validation: Method used to validate
the variant (each variant may be validated by more than one method)
By Frequency - at least one submitted SNP in cluster has frequency data submitted
By Cluster - cluster has at least 2 submissions, with at least one submission assayed with a non-computational method
By Submitter - at least one submitter SNP in cluster was validated by independent assay
By 2 Hit/2 Allele - all alleles have been observed in at least 2 chromosomes
By HapMap (human only) - submitted by HapMap project
By 1000Genomes (human only) - submitted by the 1000 Genomes Project
Unknown - no validation has been reported for this variant
Function: dbSNP's predicted functional effect of the variant on RefSeq transcripts,
both curated (NM_* and NR_*), as in the RefSeq Genes track, and predicted (XM_* and XR_*),
which are not shown in the UCSC Genome Browser.
A variant may have more than one functional role if it overlaps
multiple transcripts.
These terms and definitions are from the Sequence Ontology (SO); click on a term to view it in the
MISO Sequence Ontology Browser.
Unknown - no functional classification provided (possibly intergenic)
synonymous_variant -
A sequence variant where there is no resulting change to the encoded amino acid
(dbSNP term: coding-synon)
intron_variant -
A transcript variant occurring within an intron
(dbSNP term: intron)
upstream_gene_variant -
A sequence variant located 5' of a gene
(dbSNP term: near-gene-5)
nc_transcript_variant -
A transcript variant of a non coding RNA gene
(dbSNP term: ncRNA)
stop_gained -
A sequence variant whereby at least one base of a codon is changed, resulting in
a premature stop codon, leading to a shortened transcript
(dbSNP term: nonsense)
missense_variant -
A sequence variant, where the change may be longer than 3 bases, and at least
one base of a codon is changed resulting in a codon that encodes for a
different amino acid
(dbSNP term: missense)
stop_lost -
A sequence variant where at least one base of the terminator codon (stop)
is changed, resulting in an elongated transcript
(dbSNP term: stop-loss)
frameshift_variant -
A sequence variant which causes a disruption of the translational reading frame,
because the number of nucleotides inserted or deleted is not a multiple of three
(dbSNP term: frameshift)
inframe_indel -
A coding sequence variant where the change does not alter the frame
of the transcript
(dbSNP term: cds-indel)
3_prime_UTR_variant -
A UTR variant of the 3' UTR
(dbSNP term: untranslated-3)
5_prime_UTR_variant -
A UTR variant of the 5' UTR
(dbSNP term: untranslated-5)
splice_acceptor_variant -
A splice variant that changes the 2 base region at the 3' end of an intron
(dbSNP term: splice-3)
splice_donor_variant -
A splice variant that changes the 2 base region at the 5' end of an intron
(dbSNP term: splice-5)
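The term pairs listed above amount to a lookup table. The sketch below collects only the pairs given in this list; the function name and the fallback of returning an unlisted term unchanged are assumptions for illustration.

```python
# dbSNP function term -> Sequence Ontology term, as listed above.
DBSNP_TO_SO = {
    "coding-synon": "synonymous_variant",
    "intron": "intron_variant",
    "near-gene-5": "upstream_gene_variant",
    "ncRNA": "nc_transcript_variant",
    "nonsense": "stop_gained",
    "missense": "missense_variant",
    "stop-loss": "stop_lost",
    "frameshift": "frameshift_variant",
    "cds-indel": "inframe_indel",
    "untranslated-3": "3_prime_UTR_variant",
    "untranslated-5": "5_prime_UTR_variant",
    "splice-3": "splice_acceptor_variant",
    "splice-5": "splice_donor_variant",
}

def to_so_terms(dbsnp_terms):
    """Translate dbSNP function terms to SO terms.

    Terms missing from the table are passed through unchanged
    (an assumption made for this sketch).
    """
    return [DBSNP_TO_SO.get(term, term) for term in dbsnp_terms]
```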
In the Coloring Options section of the track controls page,
function terms are grouped into several categories, each with a default color.
If a SNP has more than one of these attributes, the stronger color will override
the weaker color. The order of colors, from strongest to weakest, is red, green,
blue, gray, and black.
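The override rule can be expressed as picking the earliest color in the strength order. This is a small sketch of that rule only; the function name and the black fallback for a SNP with no colored attributes are assumptions.

```python
# Strongest to weakest, as stated above.
COLOR_STRENGTH = ("red", "green", "blue", "gray", "black")

def display_color(colors):
    """Return the strongest color among those assigned to a SNP.

    Falls back to "black" for an empty input (an assumption of this sketch).
    """
    colors = list(colors)
    if not colors:
        return "black"
    return min(colors, key=COLOR_STRENGTH.index)
```

For example, a SNP whose attributes map to blue, red and black is drawn in red.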
Molecule Type: Sample used to find this variant
Genomic - variant discovered using a genomic template
cDNA - variant discovered using a cDNA template
Unknown - sample type not known
Unusual Conditions (UCSC): UCSC checks for several anomalies
that may indicate a problem with the mapping, and reports them in the
Annotations section of the SNP details page if found:
AlleleFreqSumNot1 - Allele frequencies do not sum
to 1.0 (±0.01). This SNP's allele frequency data are
probably incomplete.
DuplicateObserved,
MixedObserved - Multiple distinct insertion SNPs have
been mapped to this location, with either the same inserted
sequence (Duplicate) or different inserted sequence (Mixed).
FlankMismatchGenomeEqual,
FlankMismatchGenomeLonger,
FlankMismatchGenomeShorter - NCBI's alignment of
the flanking sequences had at least one mismatch or gap
near the mapped SNP position.
(UCSC's re-alignment of flanking sequences to the genome may
be informative.)
MultipleAlignments - This SNP's flanking sequences
align to more than one location in the reference assembly.
NamedDeletionZeroSpan - A deletion (from the
genome) was observed but the annotation spans 0 bases.
(UCSC's re-alignment of flanking sequences to the genome may
be informative.)
NamedInsertionNonzeroSpan - An insertion (into the
genome) was observed but the annotation spans more than 0
bases. (UCSC's re-alignment of flanking sequences to the
genome may be informative.)
NonIntegerChromCount - At least one allele
frequency corresponds to a non-integer (±0.01) count of
chromosomes on which the allele was observed. The reported
total sample count for this SNP is probably incorrect.
ObservedContainsIupac - At least one observed allele
from dbSNP contains an IUPAC ambiguous base (e.g., R, Y, N).
ObservedMismatch - UCSC reference allele does not
match any observed allele from dbSNP. This is tested only
for SNPs whose class is single, in-del, insertion, deletion,
mnp or mixed.
ObservedTooLong - Observed allele not given (length
too long).
ObservedWrongFormat - Observed allele(s) from dbSNP
have unexpected format for the given class.
RefAlleleMismatch - The reference allele from dbSNP
does not match the UCSC reference allele, i.e., the bases in
the mapped position range.
RefAlleleRevComp - The reference allele from dbSNP
matches the reverse complement of the UCSC reference
allele.
SingleClassLongerSpan - All observed alleles are
single-base, but the annotation spans more than 1 base.
(UCSC's re-alignment of flanking sequences to the genome may
be informative.)
SingleClassZeroSpan - All observed alleles are
single-base, but the annotation spans 0 bases. (UCSC's
re-alignment of flanking sequences to the genome may be
informative.)
Another condition, which does not necessarily imply any problem,
is noted:
SingleClassTriAllelic, SingleClassQuadAllelic -
Class is single and three or four different bases have been
observed (usually there are only two).
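A few of the checks above are simple enough to sketch directly. The code below is an illustration under stated assumptions, not UCSC's implementation: the function names are invented, the 0.01 tolerance is taken from the descriptions above, and the reference allele check assumes plain A/C/G/T sequences.

```python
def allele_freq_sum_not_1(freqs, tol=0.01):
    """AlleleFreqSumNot1: frequencies do not sum to 1.0 within the tolerance."""
    return abs(sum(freqs) - 1.0) > tol

def non_integer_chrom_count(freqs, chrom_count, tol=0.01):
    """NonIntegerChromCount: some frequency * sample size is not near an integer."""
    for f in freqs:
        count = f * chrom_count
        if abs(count - round(count)) > tol:
            return True
    return False

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def ref_allele_condition(dbsnp_ref, ucsc_ref):
    """Return None, "RefAlleleRevComp" or "RefAlleleMismatch"."""
    a, b = dbsnp_ref.upper(), ucsc_ref.upper()
    if a == b:
        return None
    if reverse_complement(a) == b:
        return "RefAlleleRevComp"
    return "RefAlleleMismatch"
```

For example, frequencies of 0.33 and 0.67 reported for 10 chromosomes imply 3.3 and 6.7 observations, so the reported sample count is flagged as suspect.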
Miscellaneous Attributes (dbSNP): several properties extracted
from dbSNP's SNP_bitfield table
(see dbSNP_BitField_v5.pdf for details)
Clinically Associated (human only) - SNP is in OMIM and/or at
least one submitter is a Locus-Specific Database. This does
not necessarily imply that the variant causes any disease,
only that it has been observed in clinical studies.
Has Microattribution/Third-Party Annotation - At least
one of the SNP's submitters studied this SNP in a biomedical
setting, but is not a Locus-Specific Database or OMIM/OMIA.
Submitted by Locus-Specific Database - At least one of
the SNP's submitters is associated with a database of variants
associated with a particular gene. These variants may or may
not be known to be causative.
MAF >= 5% in Some Population - Minor Allele Frequency is
at least 5% in at least one population assayed.
MAF >= 5% in All Populations - Minor Allele Frequency is
at least 5% in all populations assayed.
Genotype Conflict - Quality check: different genotypes
have been submitted for the same individual.
Ref SNP Cluster has Non-overlapping Alleles - Quality
check: this reference SNP was clustered from submitted SNPs
with non-overlapping sets of observed alleles.
Some Assembly's Allele Does Not Match Observed -
Quality check: at least one assembly mapped by dbSNP has an allele
at the mapped position that is not present in this SNP's observed
alleles.
Several other properties do not have coloring options, but do have
some filtering options:
Average heterozygosity should not exceed 0.5 for bi-allelic
single-base substitutions.
Weight: Alignment quality assigned by dbSNP. Before dbSNP build
147, weight had values 1, 2 or 3, with 1 being the highest quality
(mapped to a single genomic location). As of build 147, dbSNP
releases only variants with weight 1.
Submitter handles: These are short, single-word identifiers of
labs or consortia that submitted SNPs that were clustered into this
reference SNP by dbSNP (e.g., 1000GENOMES, ENSEMBL, KWOK). Some SNPs
have been observed by many different submitters, and some by only a
single submitter (although that single submitter may have tested a
large number of samples).
AlleleFrequencies: Some submissions to dbSNP include
allele frequencies and the study's sample size
(i.e., the number of distinct chromosomes, which is two times the
number of individuals assayed, a.k.a. 2N). dbSNP combines all
available frequencies and counts from submitted SNPs that are
clustered together into a reference SNP.
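The 0.5 ceiling on heterozygosity and the 2N sample size mentioned above can be checked with a short calculation. The sketch uses the textbook expected heterozygosity under random mating (one minus the sum of squared allele frequencies) as a simplification; it is not claimed to be dbSNP's estimator, and the function names are invented for the example.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity under random mating: 1 - sum(p_i^2).

    A textbook simplification, not necessarily dbSNP's own estimator.
    """
    return 1.0 - sum(f * f for f in freqs)

def chromosomes_sampled(individuals):
    """Sample size in chromosomes (2N), assuming diploid individuals."""
    return 2 * individuals
```

With two alleles at frequencies p and 1 - p, the value is 2p(1 - p), which peaks at 0.5 when p = 0.5. A value above 0.5 for a bi-allelic single-base substitution therefore suggests a problem with the submitted data.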
You can configure this track so that the details page displays
the function and coding differences relative to
particular gene sets. Choose the gene sets from the list on the SNP
configuration page beneath the heading "On details page,
show function and coding differences relative to".
When one or more gene tracks are selected, the SNP details page
lists all genes that the SNP hits (or is close to), with the same keywords
used in the function category. The function usually
agrees with NCBI's function, except when NCBI's functional annotation is
relative to an XM_* predicted RefSeq (not included in the UCSC Genome
Browser's RefSeq Genes track) and/or UCSC's functional annotation is
relative to a transcript that is not in RefSeq.
Insertions/Deletions
dbSNP uses a class called 'in-del'. We compare the length of the
reference allele to the length(s) of observed alleles; if the
reference allele is shorter than all other observed alleles, we change
'in-del' to 'insertion'. Likewise, if the reference allele is longer
than all other observed alleles, we change 'in-del' to 'deletion'.
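The reclassification rule above can be sketched as a small function. It assumes alleles are given as strings and that a deleted allele is written as "-" with length zero; the function names are hypothetical.

```python
def allele_length(allele):
    """Length of an allele string; "-" (no sequence) counts as length 0."""
    return 0 if allele == "-" else len(allele)

def refine_indel_class(dbsnp_class, ref_allele, observed_alleles):
    """Turn 'in-del' into 'insertion' or 'deletion' when lengths allow it."""
    if dbsnp_class != "in-del":
        return dbsnp_class
    ref_len = allele_length(ref_allele)
    others = [allele_length(a) for a in observed_alleles if a != ref_allele]
    if not others:
        return dbsnp_class
    if all(ref_len < n for n in others):
        return "insertion"  # reference shorter than every other allele
    if all(ref_len > n for n in others):
        return "deletion"   # reference longer than every other allele
    return dbsnp_class      # mixed lengths: leave as 'in-del'
```

For example, a reference allele of "-" with observed alleles "-" and "A" becomes an insertion, while a reference allele of "A" with observed alleles "-", "A" and "AT" stays 'in-del'.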
UCSC Re-alignment of flanking sequences
dbSNP determines the genomic locations of SNPs by aligning their flanking
sequences to the genome.
UCSC displays SNPs in the locations determined by dbSNP, but does not
have access to the alignments on which dbSNP based its mappings.
Instead, UCSC re-aligns the flanking sequences
to the neighboring genomic sequence for display on SNP details pages.
While the recomputed alignments may differ from dbSNP's alignments,
they often are informative when UCSC has annotated an unusual condition.
Non-repetitive genomic sequence is shown in upper case like the flanking
sequence, and a "|" indicates each match between genomic and flanking bases.
Repetitive genomic sequence (annotated by RepeatMasker and/or the
Tandem Repeats Finder with period >= 12) is shown in lower case, and matching
bases are indicated by a "+".
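The match symbols described above can be reproduced with a few lines of code. This sketch assumes the two sequences are already aligned and of equal length, with repeat-masked genomic bases in lower case; the function name is invented, and mismatches or gaps are shown as a space.

```python
def match_line(genomic, flanking):
    """Build the row of match symbols shown between two aligned sequences.

    "|" marks a match on non-repetitive (upper case) genomic sequence,
    "+" marks a match on repetitive (lower case) genomic sequence,
    and a space marks a mismatch or gap.
    """
    marks = []
    for g, f in zip(genomic, flanking):
        if g.upper() != f.upper():
            marks.append(" ")
        elif g.islower():
            marks.append("+")
        else:
            marks.append("|")
    return "".join(marks)
```

For the genomic sequence `ACgtA` and the flanking sequence `ACGTC`, the symbol row is `||++ `: two matches in unique sequence, two in repetitive sequence, and one mismatch.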
Coordinates, orientation, location type and dbSNP reference allele data
were obtained from b150_SNPContigLoc_N.bcp.gz and
b150_ContigInfo_N.bcp.gz. (N = 105 for hg19, 107 for hg38)
b150_SNPMapInfo_N.bcp.gz provided the alignment weights.
Functional classification was obtained from
b150_SNPContigLocusId_N.bcp.gz. The internal database representation
uses dbSNP's function terms, but for display in SNP details pages,
these are translated into
Sequence Ontology terms.
Validation status and heterozygosity were obtained from SNP.bcp.gz.
SNPAlleleFreq.bcp.gz and ../shared/Allele.bcp.gz provided allele frequencies.
For the human assembly, allele frequencies were also taken from
SNPAlleleFreq_TGP.bcp.gz.
Submitter handles were extracted from Batch.bcp.gz, SubSNP.bcp.gz and
SNPSubSNPLink.bcp.gz.
SNP_bitfield.bcp.gz provided miscellaneous properties annotated by dbSNP,
such as clinically-associated. See the document
dbSNP_BitField_v5.pdf for details.
The header lines in the rs_fasta files were used for molecule type,
class and observed polymorphism.
For the human assembly, we provide a related table that contains
orthologous alleles in the chimpanzee, orangutan and rhesus macaque
reference genome assemblies.
We use our liftOver utility to identify the orthologous alleles.
The candidate human SNPs are a filtered list that meets the following criteria:
class = 'single'
mapped position in the human reference genome is one base long
aligned to only one location in the human reference genome
not aligned to a chrN_random chromosome
biallelic (not tri- or quad-allelic)
In some cases the orthologous allele is unknown; these are set to 'N'.
If a lift was not possible, we set the orthologous allele to '?' and the
orthologous start and end position to 0 (zero).
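The candidate filter and the placeholder values above can be sketched as follows. The record layout, field names and 0-based half-open coordinates are assumptions made for the example and do not reflect the actual table schema.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    # Hypothetical record layout for illustration only.
    snp_class: str
    chrom: str
    start: int          # 0-based start (assumed)
    end: int            # exclusive end (assumed)
    n_alignments: int   # number of genomic locations the SNP aligns to
    alleles: tuple      # observed alleles

def is_ortho_candidate(snp):
    """Apply the five criteria listed above."""
    return (
        snp.snp_class == "single"
        and snp.end - snp.start == 1
        and snp.n_alignments == 1
        and not snp.chrom.endswith("_random")
        and len(set(snp.alleles)) == 2
    )

def failed_lift_record():
    """Placeholder values used when the position could not be lifted."""
    return {"ortho_allele": "?", "ortho_start": 0, "ortho_end": 0}
```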
Masked FASTA Files (human assemblies only)
FASTA files that have been modified to use
IUPAC
ambiguous nucleotide characters at
each base covered by a single-base substitution are available for download:
GRCh37/hg19,
GRCh38/hg38.
Note that only single-base substitutions (no insertions or deletions) were used
to mask the sequence, and these were filtered to exclude problematic SNPs.
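The masking idea can be illustrated with the standard IUPAC nucleotide codes. The sketch below is not the procedure used to build the download files: the input shape (0-based position mapped to observed alleles), the inclusion of the reference base in each code, and the upper case output are assumptions.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
_IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUPAC = {frozenset(bases): code for code, bases in _IUPAC_SETS.items()}

def mask_sequence(seq, substitutions):
    """Replace bases at substitution sites with IUPAC ambiguity codes.

    `substitutions` maps a 0-based position to the observed single-base
    alleles (hypothetical input shape). The reference base is added to
    each allele set before the code is looked up (an assumption).
    """
    bases = list(seq)
    for pos, alleles in substitutions.items():
        observed = {a.upper() for a in alleles} | {bases[pos].upper()}
        bases[pos] = IUPAC[frozenset(observed)]
    return "".join(bases)
```

For example, masking `ACGT` with a C/T substitution at position 1 and an A/G/T substitution at position 3 gives `AYGD`.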