UCSC Genome Browser: Data Organization

Data Organization and Format

The data for the working draft are organized hierarchically by chromosome and by the sequenced-clone contigs within each chromosome. At the top level there are 25 folders; 22 of these are for the numbered chromosomes (autosomes), folders X and Y are for the sex chromosomes, and Un is for clone contigs that cannot be placed confidently on a chromosome. Each of the 25 chromosomal folders contains a separate clone contig folder for each of the clone contigs for that chromosome.

There are two primary files in each clone contig folder, with the suffixes .fa and .agp respectively. The .fa file gives the working draft sequence for the clone contig. The format is FASTA, e.g.

>NT_077768
GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGA
ATGTGTAATAATTTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAA
ATTTATTTTTATTTTTTCAGGTTGAGACTGAGCTAAAGTTAATCTGTGGC
...
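
As a minimal sketch of reading such a file, the Python below collects each FASTA record into a dictionary keyed by the name on its ">" line. The function name read_fasta is an assumption made for illustration and is not part of any UCSC tool. The sample data is the first two sequence lines shown above.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record name: sequence} from FASTA-formatted lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first word after ">" as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>NT_077768
GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGA
ATGTGTAATAATTTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAA
""".splitlines()

contigs = read_fasta(example)
print(len(contigs["NT_077768"]))
```

For a real file, the same function can be given an open file handle in place of the list of lines.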

The .agp file is a kind of index that tells how the .fa file is built. It looks like

17/NT_077768    1       6538    1       D       AC021317.18     122280  128817  -
17/NT_077768    6539    56206   2       D       AC021317.18     128918  178585  -
17/NT_077768    56207   56306   3       N       100     fragment        yes
17/NT_077768    56307   117971  4       D       AC021317.18     47188   108852  -
17/NT_077768    117972  170563  5       F       AC115992.13     23659   76250   +
17/NT_077768    170564  274979  6       D       AC124789.11     1       104416  -
...

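Each line represents either an actual sequence record or a gap, unless it begins with "#", in which case it is a comment. If the line represents an actual sequence record then it has the form

<contig> <start-in-ctg> <end-in-ctg> <number> <type> <accession>.<version> <start> <end> <orientation>

and if it represents a gap it has the form

<contig> <start-in-ctg> <end-in-ctg> <number> <type> <number-of-Ns> <kind> <bridged?>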

The positions <start-in-ctg> and <end-in-ctg> are the start and end positions for where the sequence is to be put in the .fa file. For a sequence record, the positions <start> and <end> are the start and end positions of where the sequence came from in the GenBank record <accession>.<version>. All positions are 1-based and inclusive. The field <orientation> tells whether the sequence must be reverse complemented before it is inserted into its place in the .fa file: "-" means it must be, and "+" means it is used as is. For example, the records above mean that to build the .fa file for clone contig NT_077768 from chromosome 17 you take

AC021317 version 18, residues 122280 to 128817, reverse complemented, followed by 
AC021317 version 18, residues 128918 to 178585, reverse complemented, followed by 
a gap of 100 Ns, followed by 
AC021317 version 18, residues 47188 to 108852, reverse complemented, followed by 
AC115992 version 13, residues 23659 to 76250, followed by 
AC124789 version 11, residues 1 to 104416, reverse complemented, followed by 
...
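
The assembly steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used to build the draft: fields are assumed to be whitespace-separated in the order seen in the example, positions are treated as 1-based and inclusive, and sources is a hypothetical dictionary mapping "<accession>.<version>" to the full sequence of that GenBank record. The accession XX000001.1 and contig 1/ctgA in the toy data are made up.

```python
from typing import Dict, Iterable

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(agp_lines: Iterable[str], sources: Dict[str, str]) -> str:
    """Assemble a clone contig sequence from its .agp lines.

    `sources` is a hypothetical mapping from "<accession>.<version>"
    to the full sequence of that GenBank record.
    """
    pieces = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        f = line.split()
        if f[4] == "N":
            # Gap record: field 6 is the number of Ns to insert.
            pieces.append("N" * int(f[5]))
        else:
            # Sequence record: cut out residues start..end (1-based, inclusive).
            start, end = int(f[6]), int(f[7])
            piece = sources[f[5]][start - 1:end]
            if f[8] == "-":
                piece = reverse_complement(piece)
            pieces.append(piece)
    return "".join(pieces)


# Toy data with made-up names, small enough to check by hand.
toy_sources = {"XX000001.1": "AACCGTTTAC"}
toy_agp = [
    "1/ctgA 1 4 1 D XX000001.1 3 6 -",
    "1/ctgA 5 7 2 N 3 fragment yes",
    "1/ctgA 8 10 3 F XX000001.1 1 3 +",
]
toy_contig = build_contig(toy_agp, toy_sources)
print(toy_contig)
```

In the toy data, residues 3 to 6 of the source are CCGT, whose reverse complement is ACGG; three Ns and then residues 1 to 3 (AAC) follow.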

The joins perfectly abut. In a sequence record, <type> can be

F - Finished, A - in Active finishing, D - Draft, P - PreDraft, O - Other sequence

and in a gap record it is always N. The <number> field just sequentially numbers the records.
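
The two properties just described (joins that abut and sequentially numbered records) lend themselves to a simple consistency check. The sketch below is one possible way to write it; the function name check_agp is an assumption, and the fields are assumed to be whitespace-separated as in the example. It also checks that each record fills exactly the span it claims in the contig.

```python
from typing import Iterable, List

EXAMPLE = """\
17/NT_077768    1       6538    1       D       AC021317.18     122280  128817  -
17/NT_077768    6539    56206   2       D       AC021317.18     128918  178585  -
17/NT_077768    56207   56306   3       N       100     fragment        yes
17/NT_077768    56307   117971  4       D       AC021317.18     47188   108852  -
17/NT_077768    117972  170563  5       F       AC115992.13     23659   76250   +
17/NT_077768    170564  274979  6       D       AC124789.11     1       104416  -
""".splitlines()


def check_agp(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means consistent."""
    problems = []
    prev_contig = prev_end = prev_number = None
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        f = line.split()
        contig, start, end, number = f[0], int(f[1]), int(f[2]), int(f[3])
        if contig == prev_contig:
            # Within one contig, each record must start right after the last.
            if start != prev_end + 1:
                problems.append(
                    f"{contig} record {number}: starts at {start}, "
                    f"expected {prev_end + 1}")
            if number != prev_number + 1:
                problems.append(
                    f"{contig} record {number}: expected number "
                    f"{prev_number + 1}")
        # Size of what is inserted: gap length, or source span length.
        size = int(f[5]) if f[4] == "N" else int(f[7]) - int(f[6]) + 1
        if size != end - start + 1:
            problems.append(
                f"{contig} record {number}: size {size} does not fill "
                f"{start}-{end}")
        prev_contig, prev_end, prev_number = contig, end, number
    return problems


print(check_agp(EXAMPLE))
```

Dropping the second line from the example makes the check report both a position mismatch and a numbering mismatch for record 3.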

In a gap record, <number-of-Ns> is the size of the gap and <kind> is

fragment - a gap between two sequence contigs (also called a "sequence gap")
split_finished - a special sized gap between two finished sequence contigs
clone - a gap between two clones that do not overlap
contig - a gap between clone contigs in the genome layout (also called a "layout gap")
centromere - a gap inserted for the centromere
short_arm - a gap inserted at the start of an acrocentric chromosome
heterochromatin - a gap inserted for an especially large region of heterochromatin (may include the centromere as well)
telomere - a gap inserted for a telomere

The <bridged?> field is "yes" if a cDNA, a BAC end pair, or a plasmid end pair spans the gap; otherwise it is "no".
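
To illustrate how the gap fields might be used, the sketch below totals the number of Ns by gap kind and bridging status. The function name summarize_gaps is an assumption, and the input lines other than the fragment gap from the example above are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# The gap kinds listed above.
GAP_KINDS = {
    "fragment", "split_finished", "clone", "contig",
    "centromere", "short_arm", "heterochromatin", "telomere",
}


def summarize_gaps(lines: Iterable[str]) -> Counter:
    """Total gap size (number of Ns) per (kind, bridged) pair."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        f = line.split()
        if f[4] != "N":
            continue  # sequence record, not a gap
        size, kind, bridged = int(f[5]), f[6], f[7]
        if kind not in GAP_KINDS:
            raise ValueError(f"unknown gap kind: {kind}")
        totals[(kind, bridged)] += size
    return totals


sample = [
    "17/NT_077768    56207   56306   3       N       100     fragment        yes",
    # The two records below are hypothetical, not taken from real data.
    "17/NT_000000    1       200     1       N       200     fragment        yes",
    "17/NT_000000    201     50200   2       N       50000   clone   no",
]
gap_totals = summarize_gaps(sample)
for (kind, bridged), size in sorted(gap_totals.items()):
    print(kind, bridged, size)
```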

We provide three ways you can download these .fa and .agp files:

full data set: the entire hierarchy in a zipped format.
by chromosome: one zipped file for each chromosome containing all the sequence ordered along that chromosome.
by individual clone contig: separate files, not zipped, for each clone contig.