Open Regulatory Annotation Database
The Open Regulatory Annotation Database (also known as ORegAnno) is designed to promote community-based curation of regulatory information. Specifically, the database contains information about regulatory regions, transcription factor binding sites, regulatory variants, and haplotypes.
Overview
Data management
For each entry, cross-references are maintained to EnsEMBL, dbSNP, Entrez Gene, the NCBI Taxonomy database and PubMed. The information within ORegAnno is regularly mapped and provided as a UCSC Genome Browser track. Furthermore, each entry is associated with its experimental evidence, embedded as an Evidence Ontology within ORegAnno. This allows researchers to analyze regulatory data using their own criteria for which kinds of supporting evidence are suitable.
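The idea of selecting records by their supporting evidence can be illustrated with a minimal Python sketch. It does not reflect the actual ORegAnno schema or software: the `RegulatoryRecord` class, its field names, the record identifiers and the evidence labels are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegulatoryRecord:
    """Hypothetical stand-in for one curated entry (not the real schema)."""
    record_id: str             # placeholder identifier
    record_type: str           # e.g. "regulatory region"
    species: str
    evidence_terms: frozenset  # placeholder evidence labels attached to the entry


def filter_by_evidence(records, accepted_terms):
    """Keep records supported by at least one evidence term the user accepts."""
    accepted = frozenset(accepted_terms)
    return [r for r in records if r.evidence_terms & accepted]


# Illustrative data only; none of these values come from the database.
records = [
    RegulatoryRecord("rec-1", "transcription factor binding site",
                     "Homo sapiens", frozenset({"evidence-A"})),
    RegulatoryRecord("rec-2", "regulatory region",
                     "Mus musculus", frozenset({"evidence-B"})),
    RegulatoryRecord("rec-3", "regulatory polymorphism",
                     "Homo sapiens", frozenset({"evidence-A", "evidence-C"})),
]

# A researcher who trusts only "evidence-A" keeps the first and third records.
selected = filter_by_evidence(records, {"evidence-A"})
print([r.record_id for r in selected])
```

Under these assumptions, the filter is a set intersection between each record's evidence labels and the labels a researcher is willing to accept, so different users can apply different thresholds to the same underlying data.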
Software and data access
The project is open source: all data and software produced by the project can be freely accessed and used.
Database contents
As of December 20, 2006, ORegAnno contained 4220 regulatory sequences (excluding deprecated records), comprising 2190 transcription factor binding sites, 1853 regulatory regions (enhancers, promoters, etc.), 170 regulatory polymorphisms, and 7 regulatory haplotypes, for 17 different organisms (predominantly Drosophila melanogaster, Homo sapiens, Mus musculus, Caenorhabditis elegans, and Rattus norvegicus, in that order). These records were obtained by manual curation of 828 publications by 45 ORegAnno users from the gene regulation community. The ORegAnno publication queue contained 4215 publications, of which 858 were closed, 34 were in progress (open status), and 3321 were awaiting annotation (pending status). ORegAnno is continually updated, so current database contents should be obtained from www.oreganno.org.
RegCreative Jamboree 2006
The RegCreative jamboree was stimulated by a community initiative to curate in perpetuity the genomic sequences that have been experimentally determined to control gene expression. This objective is of fundamental importance to evolutionary analysis and translational research, as regulatory mechanisms are widely implicated in species-specific adaptation and the etiology of disease. The initiative culminated in the formation of an international consortium of like-minded scientists dedicated to accomplishing this task. The RegCreative jamboree was the first opportunity for these groups to meet, to assess the current state of knowledge in gene regulation, and to begin developing standards by which to curate regulatory information.
In total, 44 researchers from 9 different countries and 23 institutions attended the workshop. Funding was obtained from ENFIN, the BioSapiens Network, FWO Research Foundation, Genome Canada and Genome British Columbia.
The specific outcomes of the RegCreative meeting to date are:
- Prior to the RegCreative Jamboree, attendees were asked to participate in an interannotator agreement assessment. Two ORegAnno mirrors were established with identical sets of publications to be annotated in their queue. In total, 33 redundant annotations from 18 publications were collected. (79 annotations for 31 papers and 60 annotations for 21 papers were collected on servers 1 and 2, respectively.) This effort was used as a baseline from which to establish annotator efficiency.
- Hands-on annotation activities occurred during the first 2 days of the 3-day workshop. In total, 39 researchers contributed 184 TFBS and 317 Regulatory Regions from 96 papers. Many of these researchers were also trained on the ORegAnno system, significantly increasing its experienced-user community. By species, these annotations comprised 339 annotations in Homo sapiens, 42 in Mus musculus, 72 in Drosophila melanogaster, 24 in Ciona intestinalis, 14 in Rattus norvegicus, 6 in Halocynthia roretzi, 2 in Ciona savignyi and 2 in HIV. Within these annotations, one new dataset was added to ORegAnno: 274 human enhancers were programmatically annotated by Maximilian Haeussler, Institute Alfred Fessard, from Visel et al., Nucleic Acids Research, 2006. In total, 130 scientific studies were examined in depth. The annotated papers were pre-selected from expert-curated publications in the ORegAnno queue that had full text available through HighWire Press.
- There exists an immediate need for improved data standardization and development of associated ontologies. Specifically, this should include the open access development and integration of transcription factor naming conventions and sequence, cell type, cell line, tissue, and evidence ontologies. The groundwork for addressing and prioritizing these needs was accomplished in several ways during the meeting:
  - Transcription factor naming issues were addressed through discussion of how to integrate transcription factor prediction pipelines that have been supplemented with manual curation, such as DBD or flyTF, versus solely manually curated implementations such as TFcat.
  - Marc Halfon, University at Buffalo, led a breakout session to improve the Sequence Ontology from existing ORegAnno and REDfly database conventions within the framework being developed as part of the Open Biomedical Ontologies. A preliminary version of these improvements can be found on the ORegAnno wiki.
  - Learning-based ontology development was widely regarded as an essential feature of the annotation process, so that annotators are not prevented from annotating by the limitations of the controlled vocabulary and the resulting exceptions can be used to further develop the backbone ontologies.
  - Ontology development should be decentralized from the ORegAnno annotation framework. Specifically, it is planned that the ORegAnno evidence ontology will be removed and made available for broader community development.
  - There is a renewed focus on integrating species-specific resources with the annotation framework.
- A specific focus of the workshop was addressing the role of text mining in facilitating regulatory annotation. Sessions were led by Dr. Lynette Hirschman, MITRE, and Dr. Martin Krallinger, CNIO, to formulate where text mining can help. A short-term objective of text-mining-based analyses was formulated around both populating the ORegAnno queue and using the expert-curated portion of the ORegAnno queue to validate text-mining-based publication acquisition. The latter objectives are being led by Dr. Stein Aerts, University of Leuven.
References
- Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJ (2006). "ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation". Bioinformatics. 22 (5): 637–40. doi:10.1093/bioinformatics/btk027. PMID 16397004.
- Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, Aerts S, Mahony S, Sleumer MC, Bilenky M, Haeussler M, Griffith M, Gallo SM, Giardine B, Hooghe B, Van Loo P, Blanco E, Ticoll A, Lithwick S, Portales-Casamar E, Donaldson IJ, Robertson G, Wadelius C, De Bleser P, Vlieghe D, Halfon MS, Wasserman W, Hardison R, Bergman CM, Jones SJ, Open Regulatory Annotation Consortium (2008). "ORegAnno: an open-access community-driven resource for regulatory annotation". Nucleic Acids Research. 36 (Database issue): D107–13. doi:10.1093/nar/gkm967. PMC 2239002. PMID 18006570.
- Lesurf R, Cotto KC, Wang G, Griffith M, Kasaian K, Jones SJ, Montgomery SB, Griffith OL, Open Regulatory Annotation Consortium (2016). "ORegAnno 3.0: a community-driven resource for curated regulatory annotation". Nucleic Acids Research. 44 (D1): D126–32. doi:10.1093/nar/gkv1203. PMC 4702855. PMID 26578589.