Description
The GENCODE Genes track (version 28, Apr 2018) shows high-quality manual
annotations merged with evidence-based automated annotations across the entire
human genome generated by the
GENCODE project.
The GENCODE gene set presents a full merge
between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline.
Priority is given to the manually curated HAVANA annotation, using predicted
Ensembl annotations when there are no corresponding manual annotations.
The version 28 annotation was carried out on genome assembly GRCh38 (hg38).
The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the
corresponding release.
Display Conventions and Configuration
This track is a multi-view composite track that contains differing data sets
(views). Instructions for configuring multi-view tracks are
here.
To show only selected subtracks, uncheck the boxes next to the tracks that
you wish to hide.
Views available on this track are:
- Genes
- The gene annotations in this view are divided into three subtracks:
- GENCODE Basic set is a subset of the Comprehensive set.
The selection criteria are described in the methods section.
- GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations,
including polymorphic pseudogenes. This includes both manual and
automatic annotations. This is a super-set of the Basic set.
- GENCODE Pseudogenes includes all pseudogene annotations except polymorphic pseudogenes.
- 2-way
- GENCODE 2-way Pseudogenes contains pseudogenes predicted by both the Yale
Pseudopipe and UCSC Retrofinder pipelines.
The set was derived by looking for 50 base pairs
of overlap between pseudogenes derived from both sets based on their
chromosomal coordinates. When multiple Pseudopipe
predictions map to a single Retrofinder prediction, only one match is kept
for the 2-way consensus set.
- PolyA
- GENCODE PolyA contains polyA signals and sites manually annotated on
the genome based on transcribed evidence (ESTs and cDNAs) of 3' end of
transcripts containing at least 3 A's not matching the genome.
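The 50-base-pair overlap rule used for the 2-way consensus set can be sketched as follows. This is an illustration only, not the pipeline's code: the interval representation (0-based, half-open coordinates), the function names, and the choice to keep the Pseudopipe match with the largest overlap are assumptions. The description above states only that one match is kept.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, half-open coordinates (assumed convention)
    end: int
    name: str

def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))

def two_way_consensus(pseudopipe, retrofinder, min_overlap=50):
    """Pair each Retrofinder prediction with at most one Pseudopipe prediction.

    Keeping the match with the largest overlap is an assumption; the track
    description only says that a single match is kept.
    """
    pairs = []
    for r in retrofinder:
        candidates = [(overlap_bp(p, r), p) for p in pseudopipe]
        candidates = [(bp, p) for bp, p in candidates if bp >= min_overlap]
        if candidates:
            best_bp, best = max(candidates, key=lambda c: c[0])
            pairs.append((best.name, r.name, best_bp))
    return pairs

# Hypothetical predictions, for illustration only.
pp = [
    Interval("chr1", 100, 400, "pp1"),
    Interval("chr1", 350, 600, "pp2"),
    Interval("chr2", 100, 400, "pp3"),
]
rf = [
    Interval("chr1", 300, 500, "rf1"),  # overlaps pp1 by 100 bp and pp2 by 150 bp
    Interval("chr1", 590, 700, "rf2"),  # overlaps pp2 by only 10 bp
]
consensus = two_way_consensus(pp, rf)
print(consensus)  # [('pp2', 'rf1', 150)]
```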
Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks
using the following criteria:
- Transcript class: filter by the basic biological function of a transcript
annotation
- All - don't filter by transcript class
- coding - display protein coding transcripts, including polymorphic pseudogenes
- nonCoding - display non-protein coding transcripts
- pseudo - display pseudogene transcript annotations
- problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain)
- Transcript Annotation Method: filter by the method used to create the annotation
- All - don't filter by annotation method
- manual - display manually created annotations, including those that are
also created automatically
- automatic - display automatically created annotations, including those that are
also created manually
- manual_only - display manually created annotations that were
not annotated by the automatic method
- automatic_only - display automatically created annotations that were
not annotated by the manual method
- Transcript Biotype: filter transcripts by
biotype
- Support Level: filter transcripts by transcription support level
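The difference between the manual, automatic, manual_only and automatic_only choices can be illustrated with a small sketch. Representing each transcript's annotation methods as a set of strings is an assumption made for this example; it is not how the browser stores or applies the filter.

```python
def passes_method_filter(methods, choice):
    """Return True if a transcript annotated by `methods` passes filter `choice`.

    `methods` is a collection drawn from {"manual", "automatic"} (assumed
    representation); `choice` is one of the filter names listed above.
    """
    methods = set(methods)
    if choice == "All":
        return True
    if choice == "manual":
        return "manual" in methods
    if choice == "automatic":
        return "automatic" in methods
    if choice == "manual_only":
        return methods == {"manual"}
    if choice == "automatic_only":
        return methods == {"automatic"}
    raise ValueError(f"unknown filter choice: {choice}")

# A transcript created by both methods passes "manual" and "automatic",
# but neither of the "_only" filters.
both = {"manual", "automatic"}
print([c for c in ("manual", "automatic", "manual_only", "automatic_only")
       if passes_method_filter(both, c)])  # ['manual', 'automatic']
```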
Coloring for the gene annotations is based on the annotation type:
- coding
- non-coding
- pseudogene
- problem
- all 2-way pseudogenes
- all polyA annotations
Methods
The GENCODE project aims to annotate all evidence-based gene features on the
human and mouse reference sequence with high accuracy by integrating
computational approaches (including comparative methods), manual
annotation and targeted experimental verification. This goal includes identifying
all protein-coding loci with associated alternative variants, non-coding
loci which have transcript evidence, and pseudogenes.
For a detailed description of the methods and references used, see
Harrow et al. (2006).
GENCODE Basic Set selection:
The GENCODE Basic Set is intended to provide a simplified subset of
the GENCODE transcript annotations that will be useful to the majority of
users. The goal was to have a high-quality basic set that also covered all loci.
Selection of GENCODE annotations for inclusion in the basic set
was determined independently for the coding and non-coding transcripts at each
gene locus.
- Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given
locus:
- All full-length coding transcripts (except problem transcripts or transcripts that are
nonsense-mediated decay) were included in the basic set.
- If there were no transcripts meeting the above criteria, then the partial coding
transcript with the largest CDS was included in the basic set (excluding problem transcripts).
- Criteria for selection of non-coding transcripts at a given locus:
- All full-length non-coding transcripts (except problem transcripts)
with a well characterized biotype (see below) were included in the
basic set.
- If there were no transcripts meeting the above criteria, then the largest non-coding
transcript was included in the basic set (excluding problem transcripts).
- If no transcripts were included by either of the above criteria, the longest
problem transcript was included.
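The selection rules above can be sketched for a single locus as follows. The `Transcript` fields and the `select_basic` function are hypothetical names introduced for this illustration, and "partial" is read as "not full-length"; the actual GENCODE implementation may differ in these details.

```python
from dataclasses import dataclass

# Biotypes listed as "well characterized" in this description.
WELL_CHARACTERIZED = {"antisense", "Mt_rRNA", "Mt_tRNA", "miRNA", "rRNA", "snRNA", "snoRNA"}

@dataclass(frozen=True)
class Transcript:
    tid: str
    coding: bool
    full_length: bool
    problem: bool
    biotype: str
    length: int
    cds_length: int = 0
    nmd: bool = False  # nonsense-mediated decay

def select_basic(transcripts):
    """Return the IDs of the transcripts chosen for the basic set at one locus."""
    ok = [t for t in transcripts if not t.problem]
    coding = [t for t in ok if t.coding]
    noncoding = [t for t in ok if not t.coding]

    # Coding: all full-length, non-NMD transcripts, else the partial one with the largest CDS.
    picked_coding = [t for t in coding if t.full_length and not t.nmd]
    if not picked_coding:
        partial = [t for t in coding if not t.full_length]
        if partial:
            picked_coding = [max(partial, key=lambda t: t.cds_length)]

    # Non-coding: all full-length, well-characterized transcripts, else the largest one.
    picked_nc = [t for t in noncoding if t.full_length and t.biotype in WELL_CHARACTERIZED]
    if not picked_nc and noncoding:
        picked_nc = [max(noncoding, key=lambda t: t.length)]

    picked = picked_coding + picked_nc
    if not picked:
        # Last resort: the longest problem transcript.
        problems = [t for t in transcripts if t.problem]
        if problems:
            picked = [max(problems, key=lambda t: t.length)]
    return [t.tid for t in picked]

# Hypothetical locus, for illustration only.
locus = [
    Transcript("t1", True, True, False, "protein_coding", 2000, cds_length=900),
    Transcript("t2", True, True, False, "nonsense_mediated_decay", 1800, cds_length=300, nmd=True),
    Transcript("t3", True, False, False, "protein_coding", 700, cds_length=400),
    Transcript("t4", False, True, False, "lincRNA", 900),
    Transcript("t5", False, True, False, "processed_transcript", 1200),
    Transcript("t6", False, True, True, "retained_intron", 3000),
]
print(select_basic(locus))  # ['t1', 't5']
```

In this example no non-coding transcript has a well characterized biotype, so the largest non-problem non-coding transcript (t5) is kept alongside the full-length coding transcript (t1).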
Non-coding transcript categorization:
Non-coding transcripts are categorized using
their biotype
and the following criteria:
- well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA
- poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping
Transcription Support Level (TSL):
It is important that users understand how to assess transcript annotations
that they see in GENCODE. While some transcript models have a high level of
support through the full length of their exon structure, there are also
transcripts that are poorly supported and that should be considered
speculative. The Transcription Support Level (TSL) is a method to highlight the
well-supported and poorly-supported transcript models for users. The method
relies on the primary data that can support full-length transcript
structure: mRNA and EST alignments supplied by UCSC and Ensembl.
The mRNA and EST alignments are compared to the GENCODE transcripts and the
transcripts are scored according to how well the alignment matches over its
full length.
The GENCODE TSL provides a consistent method of evaluating the
level of evidence that a GENCODE transcript annotation is
actually expressed in humans. Human transcript sequences from the
International Nucleotide
Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as
the evidence for this analysis.
Exonerate RNA alignments from Ensembl and
BLAT RNA and EST alignments from the UCSC Genome Browser Database are used in
the analysis. Erroneous transcripts and libraries identified in lists
maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as
suspect. GENCODE annotations for protein-coding and non-protein-coding
transcripts are compared with the evidence alignments.
Annotations in the MHC region and other immunological genes are not
evaluated, as automatic alignments tend to be very problematic.
Methods for evaluating single-exon genes are still being developed and
they are not included
in the current analysis. Multi-exon GENCODE annotations are evaluated using
the criteria that all introns are supported by an evidence alignment and the
evidence alignment does not indicate that there are unannotated exons. Small
insertions and deletions in evidence alignments are assumed to be due to
polymorphisms and not considered as differing from the annotations. All
intron boundaries must match exactly. The transcript start and end locations
are allowed to differ.
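The multi-exon comparison described above amounts to comparing intron chains. A minimal sketch, assuming exons are given as (start, end) blocks and that one alignment must cover the whole annotation; the function names are hypothetical, and the tolerance for small insertions and deletions is not modeled:

```python
def introns(exons):
    """Introns implied by a list of (start, end) exon blocks."""
    exons = sorted(exons)
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]

def supports(annotation_exons, evidence_exons):
    """True if the evidence alignment has exactly the annotation's intron chain.

    Transcript start and end may differ, since only introns are compared.
    An extra intron in the evidence implies an unannotated exon and fails.
    Single-exon annotations are not evaluated and return False.
    """
    annotated = introns(annotation_exons)
    if not annotated:
        return False
    return annotated == introns(evidence_exons)

annotation = [(100, 200), (300, 400), (500, 650)]
print(supports(annotation, [(150, 200), (300, 400), (500, 600)]))  # True: only the ends differ
print(supports(annotation, [(100, 200), (302, 400), (500, 650)]))  # False: shifted intron boundary
print(supports(annotation, [(100, 200), (300, 400), (500, 650), (700, 800)]))  # False: extra exon
```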
The following categories are assigned to each of the evaluated annotations:
- tsl1 - all splice junctions of the transcript are supported by
at least one non-suspect mRNA
- tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
- tsl3 - the only support is from a single EST
- tsl4 - the best supporting EST is flagged as suspect
- tsl5 - no single transcript supports the model structure
- tslNA - the transcript was not analyzed for one of the following reasons:
- pseudogene annotation, including transcribed pseudogenes
- immunoglobulin gene transcript
- T-cell receptor transcript
- single-exon transcript (will be included in a future version)
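One way to read the category definitions above is as a decision cascade over the evidence that supports a transcript's full structure. The sketch below is an interpretation, not the GENCODE implementation: the `Evidence` record and `assign_tsl` function are hypothetical, and "multiple ESTs" is taken to mean two or more non-suspect ESTs.

```python
from typing import NamedTuple

class Evidence(NamedTuple):
    kind: str      # "mRNA" or "EST"
    suspect: bool  # True if the sequence or its library is on a suspect list

def assign_tsl(supporting, analyzed=True):
    """Assign a TSL label from the alignments that support the full structure."""
    if not analyzed:
        return "tslNA"  # pseudogene, immunoglobulin, T-cell receptor, or single-exon
    mrnas = [e for e in supporting if e.kind == "mRNA"]
    ests = [e for e in supporting if e.kind == "EST"]
    good_ests = [e for e in ests if not e.suspect]
    if any(not e.suspect for e in mrnas):
        return "tsl1"
    if mrnas or len(good_ests) > 1:
        return "tsl2"
    if len(good_ests) == 1:
        return "tsl3"
    if ests:
        return "tsl4"
    return "tsl5"

print(assign_tsl([Evidence("mRNA", False), Evidence("EST", True)]))  # tsl1
print(assign_tsl([Evidence("EST", False), Evidence("EST", False)]))  # tsl2
print(assign_tsl([]))                                                # tsl5
```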
APPRIS
is a system to annotate alternatively spliced transcripts based on a range of computational
methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes.
APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal
isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable.
- PRINCIPAL:1 - Transcript(s) expected to code for the main functional
isoform based solely on the APPRIS core modules.
- PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear
principal variant (approximately 25% of human protein coding genes), the
database chooses two or more of the CDS variants as "candidates" to be the
principal variant.
- PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear
principal variant and more than one of the variants have distinct
CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier
as the principal variant. The lower the CCDS identifier, the earlier it
was annotated.
- PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear
principal CDS and there is more than one variant with distinct (but
consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as
the principal variant.
- PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear
principal variant and none of the candidate variants are annotated by CCDS,
APPRIS selects the longest of the candidate isoforms as the principal variant.
For genes in which the APPRIS core modules are unable to choose a clear
principal variant (approximately 25% of human protein coding genes), the
"candidate" variants not chosen as principal are labeled in the following way:
- ALTERNATIVE:1 - Candidate transcript(s) models that are conserved in at
least three tested species.
- ALTERNATIVE:2 - Candidate transcript(s) models that appear to be
conserved in fewer than three tested species.
Non-candidate transcripts are not tagged and are considered as "Minor" transcripts.
Further information and additional web services can be found at the APPRIS website.
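When choosing one representative transcript per gene, the APPRIS labels can be treated as an ordering. The sketch below is an assumption-laden illustration: it uses the label strings exactly as written in this description, ranks every PRINCIPAL label above every ALTERNATIVE label, and places untagged ("Minor") transcripts last. The function name and input format are hypothetical.

```python
def appris_rank(label):
    """Sort key for an APPRIS label; smaller sorts first.

    `label` is a string such as "PRINCIPAL:1" or "ALTERNATIVE:2",
    or None for untagged ("Minor") transcripts.
    """
    if label is None:
        return (2, 0)
    kind, _, level = label.partition(":")
    order = {"PRINCIPAL": 0, "ALTERNATIVE": 1}
    return (order[kind], int(level))

# Hypothetical transcripts of one gene: (transcript name, APPRIS label).
gene = [
    ("tx_minor", None),
    ("tx_alt", "ALTERNATIVE:1"),
    ("tx_p3", "PRINCIPAL:3"),
    ("tx_alt2", "ALTERNATIVE:2"),
]
ranked = sorted(gene, key=lambda item: appris_rank(item[1]))
print([name for name, _ in ranked])  # ['tx_p3', 'tx_alt', 'tx_alt2', 'tx_minor']
```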
Downloads
GENCODE GFF3 and GTF files are available from the
GENCODE release 28 site.
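A downloaded GTF file can be filtered with a few lines of code. The sketch below assumes the usual nine tab-separated GTF columns and that the attribute column carries repeated `tag` entries (including `basic`) and a `transcript_support_level` value, as in GENCODE GTF files. The sample record is synthetic and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches `key "value";` as well as unquoted values such as `level 2;`.
ATTR_RE = re.compile(r'(\S+) "?([^";]+)"?;')

def parse_attributes(field):
    """Parse the ninth GTF column into a dict of lists (keys such as `tag` repeat)."""
    attrs = defaultdict(list)
    for key, value in ATTR_RE.findall(field):
        attrs[key].append(value)
    return attrs

def keep_transcript(line, max_tsl=1):
    """True for transcript records tagged `basic` with a numeric TSL <= max_tsl."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "transcript":
        return False
    attrs = parse_attributes(cols[8])
    tsl = attrs.get("transcript_support_level", ["NA"])[0]
    return "basic" in attrs.get("tag", []) and tsl.isdigit() and int(tsl) <= max_tsl

# Synthetic record for illustration; the identifiers are not real GENCODE IDs.
SAMPLE = (
    "chr1\tHAVANA\ttranscript\t1000\t5000\t.\t+\t.\t"
    'gene_id "GENE_EXAMPLE.1"; transcript_id "TX_EXAMPLE.1"; '
    'gene_type "protein_coding"; transcript_type "protein_coding"; level 2; '
    'transcript_support_level "1"; tag "basic"; tag "appris_principal_1";'
)
print(keep_transcript(SAMPLE))  # True
```

The same function can be applied line by line to an uncompressed GTF file, skipping header lines that start with `#`.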
Verification
Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing.
Those experiments can be found at GEO:
- GSE30619:[E-MTAB-612] - Batch I is based on annotation from July 2008 (without pseudogenes).
- GSE25711:[E-MTAB-407] - Batch II is based on annotation from April 2009.
- GSE30612:[E-MTAB-533] - Batch III is verifying RGASP models for C. elegans and human.
- GSE34797:[E-MTAB-684] - Batch IV is based on chromosome 3, 4 and 5 annotations from GENCODE 4 (January 2010).
- GSE34820:[E-MTAB-737] - Batch V is based on annotations from GENCODE 6 (November 2010).
- GSE34821:[E-MTAB-831] - Batch VI is based on annotations from GENCODE 6 (November 2010) as well as transcript models predicted by the Ensembl Genebuild group based on the Illumina Human BodyMap 2.0 data.
See Harrow et al. (2006) for information on verification
techniques.
Release Notes
GENCODE version 28 corresponds to Ensembl 92.
See also: The GENCODE Project
Credits
This GENCODE release is the result of a collaborative effort among
the following laboratories: (contact:
GENCODE at the Sanger Institute)
Lab/Institution | Contributors
GENCODE Principal Investigator, EMBL European Bioinformatics Institute, Cambridge, UK | Paul Flicek
GENCODE Co-Principal Investigator, EMBL European Bioinformatics Institute, Cambridge, UK | Adam Frankish
GENCODE Co-Principal Investigator, Wellcome Trust Sanger Institute (WTSI), Cambridge, UK | Bronwen Aken
Kings College, London, UK | Tim Hubbard
HAVANA manual annotation group, EMBL European Bioinformatics Institute, Cambridge, UK | Timothy Cutts, Jyoti Choudhary, Ed Griffiths, Ewan Birney, Jose Manuel Gonzalez, Stephen Fitzgerald, Andrew Berry, Alexandra Bignell, Claire Davidson, Gloria Despacio-Reyes, Mike Kay, Deepa Manthravadi, Gaurab Mukherjee, Gemma Barson, Matt Hardy, Angela Macharia
Ensembl, EMBL European Bioinformatics Institute, Cambridge, UK | Carlos Garcia, Fergal Martin, Osagie Izuogu
Centre de Regulació Genòmica (CRG), Barcelona, Spain | Roderic Guigó, Julien Lagarde, Barbara Uszczyńska
UC Santa Cruz Genomics Institute, University of California Santa Cruz (UCSC), USA | David Haussler, Mark Diekhans, Benedict Paten, Joel Armstrong, Ian Fiddes
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA | Manolis Kellis, Irwin Jungreis
Computational Biology and Bioinformatics, Yale University (Yale), USA | Mark Gerstein, Ekta Khurana, Cristina Sisu, Baikang Pei, Yan Zhang, Mihali Felipe
Center for Integrative Genomics, University of Lausanne, Switzerland | Alexandre Reymond, Cedric Howald, Anne-Maud Ferreira, Jacqueline Chrast
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain | Alfonso Valencia, Michael Tress, José Manuel Rodríguez, Victor de la Torre
Former members of the GENCODE project | Jennifer Harrow, James Gilbert, Electra Tapanari, Stephen Searle, Rachel Harte, Daniel Barrell, Felix Kokocinski, Veronika Boychenko, Toby Hunt, Catherine Snow, Gary Saunders, Sarah Grubb, Thomas Derrien, Andrea Tanzer, Gang Fang, Mihali Felipe, Joanne Howes, Reena Halai, Pablo Roman-Garcia, Michael Brent, Randall Brown, Jeltje van Baren
References
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa
A, Searle S et al.
GENCODE: the reference human genome annotation for The ENCODE Project.
Genome Res. 2012 Sep;22(9):1760-74.
PMID: 22955987; PMC: PMC3431492
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R,
Swarbreck D et al.
GENCODE: producing a reference annotation for ENCODE.
Genome Biol. 2006;7 Suppl 1:S4.1-9.
PMID: 16925838; PMC: PMC1810553
A full list of GENCODE publications is available
at The GENCODE Project
web site.
Data Release Policy
GENCODE data are available for use without restrictions.