SmProt

1. Why create the SmProt database?

Identifying coding elements in the genome is a fundamental step toward understanding the building blocks of living systems. Previous genome annotation pipelines focused mainly on proteins longer than 100 amino acids. However, recent work has shown that many proteins shorter than 100 amino acids (small proteins) also play important roles in diverse processes such as development, muscle contraction, and DNA repair. Previously neglected small proteins may contribute in important ways to cellular and organismal biology, which emphasizes the need for an unbiased and comprehensive strategy to evaluate translation empirically. In recent years, comparative genomics, proteomics, and the combination of evolutionary conservation with ribosome profiling data have shown that small proteins are probably far more numerous than previously suspected.

Deep transcriptome sequencing has revealed many transcripts that lack long or conserved open reading frames; these have been termed long non-coding RNAs (lncRNAs). Although several lncRNAs have regulatory functions, the vast majority have no known function. While their existence is undisputed, their coding potential and functionality remain controversial. Ribosome profiling, a technique that measures ribosome occupancy and translation genome-wide, has indicated that translation is far more pervasive than anticipated and takes place on many transcripts previously assumed to be non-coding RNAs (ncRNAs). In addition, several small proteins encoded by ncRNAs have been shown to be functional, with diverse regulatory roles. A small protein database will offer new avenues of research into lncRNA regulatory mechanisms.

2. SmProt versions

The current version is v2.0. The previous version is still available at the SmProt v1.0 website.

We also provide a tool for converting IDs from v2.0 to v1.0.

3. The data sources of the SmProt database

Data sources	Description
Ribosome profiling	Ribosome profiling data sets were collected from the GEO database, and our new pipeline based on RiboTISH was used to identify small proteins. Ribosome profiling includes regular Ribo-seq and TI-seq. Regular Ribo-seq uses cycloheximide (CHX), a drug that binds the ribosome E-site, as a translation elongation inhibitor to freeze translating ribosomes. TI-seq uses different translation inhibitors, usually lactimidomycin (LTM), to stall ribosomes at translation initiation (TI) sites.
Literature mining	Literature was obtained from PubMed and includes both low-throughput and high-throughput studies. We used a set of keywords to retrieve literature from PubMed, and then manually extracted detailed information from papers that focused on a specific small protein.
Databases	We also collected small proteins from other databases. We kept only reliable small proteins (such as those with a manual test) and reprocessed them according to the flow chart.
MS data	MS data sets were collected from the ENCODE project and analyzed to obtain small proteins encoded by ncRNAs.

4. How to use the SmProt database?

SmProt2_tutorial.pdf

Search	ID Search: search by SmProt ID, NONCODE ID, or ENSEMBL ID. Location Search: search a chromosomal location of interest in a specific species. Small proteins are reported as hits if their locations overlap the input location.
Browse	On the Browse page, users can choose species (human, mouse, etc.), start codon (ATG, non-ATG), data source (ribosome profiling, mass spectrometry, etc.), and predicted function (yes/no, indicating whether a functional domain is predicted). Click the Browse button and the filtered results are listed below with brief information. Click a SmProt_ID to jump to the page with detailed information.
Variants	The Variants page provides variants related to small ORFs in the 5'UTR, called from WGS data of multiple projects, along with their effects on downstream gene expression and on translated uORFs in SmProt. Users can choose data source (WGS project) and variant type (uAUG_gained, uSTOP_lost, etc., describing the effect of the variant). Click a variant to jump to the page with detailed information.
Diseases	The Diseases page provides disease-specific translation events and variants in small proteins predicted from ribosome profiling data (confidence: predicted specific), as well as disease-related small proteins reported in the literature (confidence: reported related). After users choose a species, the list of diseases available for that species is shown. Users can further filter by confidence and by the start codon of small proteins.
Human Microbio	On the HumanMicroBio page, users can choose a body site (skin, gut, etc.) to see small proteins identified from microorganism samples from that site. The brief results show the total number, length, and representative sequence of each family. Click a Family ID to jump to the page with the corresponding detailed information.
Inner BLAST	On the Blast page, users can assess the sequence similarity of small proteins across multiple species. All small proteins in SmProt v2.0 have been added to the BLAST database. The blastp program compares a protein query against proteins; blastx compares a translated nucleotide query against proteins. Users can enter sequences in FASTA format directly or load FASTA files from disk. Results can be generated with default or user-specified parameters.
Genome Browser	Users can click the Genome button on the navigation bar, the location link in the General Information table on any small protein page, or the genome browser link in the Dataset table on any small protein page to jump to the Genome Browser page and inspect small proteins in a genomic region. Users can manually choose which tracks are shown or hidden.
Terminology Explanation	PhyloCSF: conservation of a genomic region, which reflects its coding potential. RiboPvalue: one-tailed rank-sum test p-value for regular Ribo-seq frame bias inside the ORF (frame test). TISPvalue: one-tailed negative binomial test p-value for TISCount (TIS test). MS evidence: translation evidence from mass spectrometry experiments. TISCount: number of reads with the P-site at the TIS. Kozak sequence: (GCC)GCC(A/G)CCATGG, the consensus sequence for translation initiation in vertebrates. Kozak Strength: the likelihood of an AUG initiating translation. oORF: overlapping open reading frame (overlapping the downstream gene). methyl_bin: for C>T changes at CpG sites, the mutability-adjusted proportion of singletons is calculated separately for three distinct bins of methylation. AC: allele count. AF: allele frequency.
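
The overlap rule used by Location Search can be illustrated with a short sketch. This is not SmProt's own code: the Region type, the record IDs, and the assumption of 1-based inclusive coordinates are all hypothetical and serve only to show what "overlapped with the input location" means for two intervals on the same chromosome.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """A genomic interval; coordinates are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """Two regions overlap if they share a chromosome and at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def location_search(query: Region, records: List[Tuple[str, Region]]) -> List[str]:
    """Return the IDs of all records whose region overlaps the query."""
    return [record_id for record_id, region in records if overlaps(query, region)]


# Hypothetical records; the IDs are placeholders, not real SmProt IDs.
records = [
    ("sp_A", Region("chr1", 100, 250)),
    ("sp_B", Region("chr1", 300, 420)),
    ("sp_C", Region("chr2", 120, 200)),
]

hits = location_search(Region("chr1", 200, 310), records)
print(hits)
```

With these placeholder records, the query chr1:200-310 overlaps sp_A and sp_B but not sp_C, which lies on a different chromosome.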

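To make the difference between blastp and blastx concrete, the sketch below shows what "translated nucleotide" means: a nucleotide query is translated in all six reading frames before it is compared with protein sequences. It is only an illustration and not the BLAST implementation; it assumes the standard genetic code, and the function names and the example sequence are made up for this page.

```python
from itertools import product

# Standard genetic code (assumed), with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> dict:
    """Translate a DNA sequence in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Made-up example query: ATG GCC TAA encodes Met-Ala-stop in frame +1.
frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

A protein query submitted to blastp skips this step entirely, since it is already an amino acid sequence.
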
SmProt: a reliable repository with comprehensive annotation of small proteins derived from ribosome profiling.