Oral Presentation 26th Annual Lorne Proteomics Symposium 2021

Identification of novel proteins encoded by the human genome (#5)

Hitesh Kore 1,2, Keshava Datta 1, Shivashankar H Nagaraj 2, Harsha Gowda 1,2,3
  1. Cancer Precision Medicine Group, QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
  2. Faculty of Health, Queensland University of Technology, Brisbane, QLD, Australia
  3. Faculty of Medicine, The University of Queensland, Brisbane, QLD, Australia

The estimated number of protein-coding genes has remained largely unchanged in the two decades since the completion of the Human Genome Project. According to current estimates, there are ~20,500 protein-coding genes. This catalog serves as the basis for most biomedical research, so any gene missing from it is unlikely to draw the attention of most researchers. Recently, various ribosome profiling and mass spectrometry studies have reported several novel proteins encoded by lncRNAs and by UTRs of protein-coding genes. Some of these novel proteins have been shown to play important roles in biological processes including development, muscle performance, and DNA repair. This suggests that genome annotation pipelines may have missed some protein-coding regions and annotated them as non-coding. Most of these transcripts lack obvious open reading frames (ORFs > 300 nt) and display poor evolutionary conservation across vertebrate lineages. We developed a workflow to identify potential novel proteins encoded by lncRNAs and by UTRs of protein-coding genes, and generated an ORFome database by computationally translating unique lncRNAs from GENCODE, NONCODE, and LNCipedia, together with the UTRs of known protein-coding genes. ORFs that did not meet the cross-species conservation and nonsense-mediated decay (NMD) criteria were filtered out. Our ORFome database contains 44,844 candidate ORFs. We searched the proteomic data available for 30 human tissues against this database and identified hundreds of lncRNA and UTR ORFs with protein-level evidence. Moreover, a subset of them showed tissue-specific expression patterns. Novel proteins encoded by lncRNAs and UTRs of protein-coding genes should enable researchers to elucidate their roles in various diseases.
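
The computational translation step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state the start-codon rules, length cutoff, or software used, so the function name `find_orfs`, the ATG-only start rule, the forward-strand-only scan, and the `min_codons` threshold are all assumptions made for illustration. The conservation and NMD filters and the mass spectrometry search are not shown.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_codons=10):
    """Return (start, end, protein) for ATG-to-stop ORFs in the three forward frames.

    Assumptions (hypothetical, not taken from the abstract): ORFs start at the
    first ATG after the previous stop in each frame, end at the first in-frame
    stop codon, and must contain at least `min_codons` codons (stop excluded).
    `start` and `end` are 0-based, end-exclusive, and `end` includes the stop.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    # Unknown codons (e.g. containing N) are translated as X.
                    protein = "".join(
                        CODON_TABLE.get(seq[j:j + 3], "X")
                        for j in range(start, i, 3)
                    )
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy transcript: one short ORF (ATG GCT GCT GCT TAA) in the third frame.
    toy = "CCATGGCTGCTGCTTAAG"
    print(find_orfs(toy, min_codons=2))
```

In a workflow like the one described, the translated sequences would be collected into a FASTA protein database and then passed through the conservation and NMD filters before the proteomic search; those steps depend on details the abstract does not give.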