Description
The NCBI RefSeq Genes composite track shows human protein-coding and non-protein-coding
genes taken from the NCBI RNA reference sequences collection (RefSeq). All subtracks use
coordinates provided by RefSeq, except for the UCSC RefSeq track, which UCSC produces by
realigning the RefSeq RNAs to the genome. This realignment may result in occasional differences
between the annotation coordinates provided by UCSC and NCBI. See the
Methods section for more details about how the different tracks were
created.
Please visit NCBI's Feedback for Gene and Reference Sequences (RefSeq) page to make suggestions,
submit additions and corrections, or ask for help concerning RefSeq records.
For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration
This track is a multi-view composite track that contains differing data set views.
Instructions for configuring multi-view tracks are
here.
To show only a selected set of subtracks, uncheck the boxes next to the tracks that you wish to
hide.
The views available for this track include:
- RefSeq annotations and alignments
  - RefSeq All – all curated and predicted annotations provided by RefSeq.
  - RefSeq Curated – subset of RefSeq All that includes only those annotations whose
    accessions begin with NM, NR, or YP.
  - RefSeq Predicted – subset of RefSeq All that includes those annotations whose
    accessions begin with XM or XR.
  - RefSeq Other – all other annotations produced by the RefSeq group that do not fit the
    requirements for inclusion in the RefSeq Curated or the RefSeq Predicted tracks.
  - RefSeq Alignments – alignments of RefSeq RNAs to the human genome provided by the
    RefSeq group.
  - RefSeq Diffs – alignment differences between the human reference genome(s) and RefSeq
    transcripts. Note: this track is not currently available for every assembly.
  - RefSeq HGMD – shows only the RefSeq Curated transcripts mentioned in the Human Gene
    Mutation Database. This track is only available on the human genomes hg19 and hg38.
- UCSC annotations
  - UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR
    accessions to the human genome. This track was previously known as the "RefSeq Genes" track.
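The accession-prefix rules for the RefSeq Curated and RefSeq Predicted subsets can be sketched as a small shell function. This is only an illustration: the function name is hypothetical, the underscore after the prefix is an assumption about the accession format, and anything that matches neither rule is reported as "unclassified" rather than assigned to a track.

```shell
# Classify a RefSeq accession by its prefix (hypothetical helper).
# NM, NR, YP -> curated; XM, XR -> predicted; anything else -> unclassified.
classify_refseq_accession() {
    case "$1" in
        NM_*|NR_*|YP_*) echo "curated" ;;
        XM_*|XR_*)      echo "predicted" ;;
        *)              echo "unclassified" ;;
    esac
}

# NM_012309.4 is the accession used as an example elsewhere on this page.
classify_refseq_accession NM_012309.4
```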
The RefSeq All, RefSeq Curated, RefSeq Predicted, RefSeq HGMD,
and UCSC RefSeq tracks follow the display conventions for
gene prediction tracks.
The color shading indicates the level of review the RefSeq record has undergone:
predicted (light), provisional (medium), or reviewed (dark), as defined by RefSeq.
Color | Level of review
dark | Reviewed: the RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information.
medium | Provisional: the RefSeq record has not yet been subject to individual review. The initial sequence-to-gene association has been established by outside collaborators or NCBI staff.
light | Predicted: the RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted.
The RefSeq Alignments track follows the display conventions for
PSL tracks.
The item labels and codon display properties for features within this track can be configured
through the controls at the top of the track description page. Click the view name
(NCBI RefSeq or UCSC RefSeq) to globally modify the settings for all subtracks in
the view. To adjust the settings for an individual subtrack, click the wrench icon next to the
track name in the subtrack list (available only for views containing more than one track).
- Label: By default, items are labeled by gene name. Click the appropriate Label
  option to display the accession name or OMIM identifier instead of the gene name, show all or a
  subset of these labels including the gene name, OMIM identifier and accession names, or turn off
  the label completely.
- Codon coloring: This track has an optional codon coloring feature that
  allows users to quickly validate and compare gene predictions. To display codon colors, select the
  genomic codons option from the Color track by codons pull-down menu. For more
  information about this feature, go to the Coloring Gene Predictions and Annotations by Codon page.
The RefSeq Diffs track contains five different types of inconsistency between the
reference genome sequence and the RefSeq transcript sequences. The five types of differences are
as follows:
- mismatch – aligned but mismatching bases, plus HGVS g. to show the genomic change required
  to match the transcript and HGVS c./n. to show the transcript change required to match the
  genome.
- short gap – genomic gaps that are too small to be introns (arbitrary cutoff of < 45 bp),
  most likely insertion/deletion variants or errors, with HGVS g. and c./n. showing differences.
- shift gap – shortGap items whose placement could be shifted left and/or right on the genome
  due to repetitive sequence, with the HGVS c./n. position range of the ambiguous region in the
  transcript. Thin and thick lines are used here: the thin line shows the span of the
  repetitive sequence, and the thick line shows the rightmost shifted gap.
- double gap – genomic gaps that are long enough to be introns but that skip over
  transcript sequence (invisible in default setting), with HGVS c./n. deletion.
- skipped – sequence at the beginning or end of a transcript that is not aligned to
  the genome (invisible in default setting), with HGVS c./n. deletion.
HGVS Terminology (Human Genome Variation Society):
g. = genomic sequence; c. = coding DNA sequence; n. = non-coding RNA reference sequence.
When reporting HGVS with RefSeq sequences, to make sure that results from
research articles can be mapped to the genome unambiguously,
please specify the RefSeq annotation release displayed on the transcript's
Genome Browser details page and also the RefSeq transcript ID with version
(e.g. NM_012309.4, not NM_012309).
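As a small illustration of the versioning advice above, a script could check that a transcript ID carries a version suffix before it is reported. The helper name is hypothetical, and the pattern (two uppercase letters, an underscore, digits, a dot, then the version number) is an assumption based on the accessions shown on this page, not an official RefSeq grammar.

```shell
# Succeed only if the accession ends in ".<digits>", e.g. NM_012309.4
# (hypothetical helper; the accession pattern is an assumption).
has_refseq_version() {
    printf '%s\n' "$1" | grep -Eq '^[A-Z]{2}_[0-9]+\.[0-9]+$'
}

for acc in NM_012309.4 NM_012309; do
    if has_refseq_version "$acc"; then
        echo "$acc: versioned"
    else
        echo "$acc: missing version suffix"
    fi
done
```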
Methods
Tracks contained in the RefSeq annotation and RefSeq RNA alignment views were created at UCSC using
data from the NCBI RefSeq project. Data files were downloaded from RefSeq in GFF file format and
converted to the genePred and PSL table formats for display in the Genome Browser. Information about
the NCBI annotation pipeline can be found
here.
The RefSeq Diffs track is generated by UCSC using NCBI's RefSeq RNA alignments.
The UCSC RefSeq Genes track is constructed using the same methods as previous RefSeq Genes tracks.
RefSeq RNAs were aligned against the human genome using BLAT. Those with an alignment of
less than 15% were discarded. When a single RNA aligned in multiple places, the alignment
having the highest base identity was identified. Only alignments having a base identity
level within 0.1% of the best and at least 96% base identity with the genomic sequence were
kept.
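The selection rule described above can be sketched roughly as follows. This is not UCSC's pipeline code. The function name and the input layout (tab-separated RNA ID, alignment ID, percent base identity) are hypothetical, and "within 0.1% of the best" is interpreted here as 0.1 percentage points. The 15% discard step is left out because the text does not say which measure it applies to.

```shell
# Keep, for each RNA, only alignments with >= 96% base identity whose
# identity is within 0.1 percentage points of that RNA's best alignment.
# Input (stdin): rnaId <TAB> alignmentId <TAB> percentIdentity
filter_best_alignments() {
    awk -F '\t' '
        {
            id = $3 + 0                      # force numeric comparison
            line[NR] = $0; rna[NR] = $1; ident[NR] = id
            if (!($1 in best) || id > best[$1]) best[$1] = id
        }
        END {
            for (i = 1; i <= NR; i++)
                if (ident[i] >= 96 && best[rna[i]] - ident[i] <= 0.1)
                    print line[i]
        }
    '
}

# Made-up sample input: two near-best hits and one weaker hit for rnaA,
# and a single low-identity hit for rnaB.
printf 'rnaA\ta1\t99.8\nrnaA\ta2\t99.75\nrnaA\ta3\t98.0\nrnaB\tb1\t95.0\n' |
    filter_best_alignments
```

The whole input is held in memory so that the best identity per RNA is known before any line is printed, which is adequate for a sketch but would need rethinking for very large alignment sets.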
Data Access
The raw data for these tracks can be accessed in multiple ways. It can be explored interactively
using the Table Browser or
Data Integrator. The tables can also be accessed programmatically through our
public MySQL server or downloaded from our
downloads server for local processing.
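As a hedged sketch of programmatic access, the function below only builds an SQL query for transcripts overlapping a region; it does not contact any server. The function name is hypothetical, and the column names are assumed from the usual genePred table layout, so they should be checked against the actual table schema. The host and user shown in the comment are assumptions as well; the MySQL access page has the current connection details.

```shell
# Build a query for transcripts overlapping [start, end) on a chromosome.
# Arguments: table name, chromosome, region start, region end.
# Column names are assumed from the genePred layout (hypothetical helper).
refseq_region_query() {
    printf "SELECT name, name2, chrom, strand, txStart, txEnd FROM %s WHERE chrom = '%s' AND txEnd > %d AND txStart < %d" \
        "$1" "$2" "$3" "$4"
}

# Print the query. To run it, pass it to a MySQL client, for example:
#   mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A \
#       -e "$(refseq_region_query ncbiRefSeqCurated chr16 34990190 36727467)" hg38
# (host and user are assumptions; confirm them on the MySQL access page)
refseq_region_query ncbiRefSeqCurated chr16 34990190 36727467
echo
```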
The data in the RefSeq Other and RefSeq Diffs tracks are stored in
bigBed file format; more
information about accessing these bigBed files can be found
below. The other subtracks are associated with database tables as follows:
- genePred format:
  - RefSeq All - ncbiRefSeq
  - RefSeq Curated - ncbiRefSeqCurated
  - RefSeq Predicted - ncbiRefSeqPredicted
  - RefSeq HGMD - ncbiRefSeqHgmd
  - UCSC RefSeq - refGene
- PSL format:
  - RefSeq Alignments - ncbiRefSeqPsl
The first column of each of these tables is "bin". This column is designed
to speed up access for display in the Genome Browser, but can be safely ignored in downstream
analysis. You can read more about the bin indexing system
here.
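For downstream analysis of a tab-separated table dump, the leading "bin" column can simply be dropped. A minimal sketch, assuming the dump is tab-separated with "bin" as its first field; the helper name and the sample row are made up for illustration.

```shell
# Remove the first (bin) column from tab-separated input on stdin
# (hypothetical helper; cut uses tab as its default delimiter).
drop_bin_column() {
    cut -f2-
}

# Made-up sample row: bin, name, chrom, strand
printf '585\tNM_000000.0\tchr1\t+\n' | drop_bin_column
```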
The annotations in the RefSeq Other and RefSeq Diffs tracks are stored in bigBed
files, which can be obtained from our downloads server as
ncbiRefSeqOther.bb and
ncbiRefSeqDiffs.bb.
Individual regions or the whole set of genome-wide annotations can be obtained using our tool
bigBedToBed, which can be compiled from the source code or downloaded as a precompiled
binary for your system from the utilities directory linked below. For example, to extract only
annotations in a given region, you could use the following command:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/ncbiRefSeq/ncbiRefSeqOther.bb -chrom=chr16 -start=34990190 -end=36727467 stdout
The genePred format tracks can also be downloaded in GTF format using the
genePredToGtf utility, available from the
utilities directory on the UCSC downloads
server. The utility can be run from the command line like so:
genePredToGtf hg38 ncbiRefSeqPredicted ncbiRefSeqPredicted.gtf
Note that using genePredToGtf in this manner accesses our public MySQL server, and you therefore
must set up your hg.conf as described on the MySQL page linked near the beginning of the Data Access
section.
A file containing the RNA sequences in FASTA format for all items in the RefSeq All, RefSeq Curated,
and RefSeq Predicted tracks can be found on our downloads server
here.
Please refer to our mailing list archives for questions.
Credits
This track was produced at UCSC from data generated by scientists worldwide and curated by the
NCBI RefSeq project.
References
Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518
Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014 Jan;42(Database issue):D756-63. PMID: 24259432; PMC: PMC3965018
Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D501-4. PMID: 15608248; PMC: PMC539979