Information

How to annotate the SwitchGear gene models?


According to this page (http://switchgeargenomics.com/wp-content/uploads/2009/02/switchdb_technote1.pdf), the SwitchGear gene models can be annotated using the NCBI annotation associated with RefSeq accession numbers, but I couldn't find that information on the page the PDF points to (https://www.ncbi.nlm.nih.gov/refseq/)…

How can I annotate the SwitchGear gene models?


Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.)

The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology is rather limited, hindered by a lack of high-quality basic data from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (assembled expressed sequence tags, or ESTs, mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) at high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads (about one million sequencing reads per library), and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6 % (47,930), 69.4 % (47,089), 68.4 % (46,441), and 69.3 % (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also classified our gene models as house-keeping or tissue-specific genes based on their tissue specificity.

Electronic supplementary material

The online version of this article (doi:10.1007/s11103-012-9924-z) contains supplementary material, which is available to authorized users.


Introduction

Phages, viruses that infect bacteria, provide unique challenges for bioinformatics. There is a limit to how much DNA can be packaged in a capsid, and therefore phage genomes are generally short, typically in the range 20–100 kb. By necessity, their genomes are compact: phage genes are shorter than their bacterial homologs, are frequently co-transcribed, and adjacent open reading frames (ORFs) often overlap (Kang et al., 2017). In a few cases, phage genes are encoded within each other (Cahill et al., 2017; Summer et al., 2007). In contrast, bacterial genes generally are longer, separated by intergenic spacers and frequently switch strands (Kang et al., 2017). There are no bioinformatics tools specifically designed to identify genes in phage genomes, so algorithms designed to identify bacterial genes are typically used (McNair et al., 2018). For example, from 31 phage genomes published between October 14, 2016 and August 1, 2018, the genes in 10 phage genomes were identified by GeneMark software (GeneMark/GeneMarkS/GeneMark.hmm), the genes in 10 phage genomes were identified by RAST, the genes in 7 phage genomes by Glimmer, 3 phage genomes each by Geneious, the NCBI ORF Finder, PHAST (which uses Glimmer as a gene caller; Arndt et al., 2016), PROKKA (which uses Prodigal as a default gene caller; Seemann, 2014), 2 phage genomes by Prodigal and 1 phage genome each by MetaVir, RASTtk, SerialCloner or SnapGene (Supplementary Table S1; note that in many publications several different tools were used to identify genes in phage genomes). Each of these algorithms relies on information that is not available and calculations that are not possible with short genomes.
For example, there are no conserved genes in phage genomes that can be used to build universal training sets (Rohwer and Edwards, 2002), fewer genes means the statistics used to identify start codons are less accurate (Wu et al., 2003), and because many phage genes or the proteins they encode have no homolog in the databases, similarity searches are unreliable (Roux et al., 2015). There are alternate gene calling approaches, such as using positional nucleotide frequency (Besemer and Borodovsky, 1999), or the multivariate entropy of amino acid usage used by Glimmer (Ouyang et al., 2004), but these are designed for complete bacterial genomes and have not been optimized for use with phage genomes.

Here, we introduce a novel method for gene identification that is specifically designed for phage genomes. We make several presumptions based on studying hundreds of phage genomes. First, we noted that since phages have physical limits on their genome sizes they contain minimal non-coding DNA. Second, we showed that phage genes are usually on the same strand of the DNA, presumably because they are co-transcribed (Akhter, 2012; Kang et al., 2017). Based on these observations, we designed a completely novel approach to phage gene identification, tiling open reading frames to minimize non-coding DNA bases and strand switching. We treat a phage genome as a network of paths in which ORFs are more favorable, and overlaps and gaps are less favorable. We solved this weighted graph problem using the Bellman-Ford algorithm (Bellman, 1958; Ford, 1956), and by optimizing the parameters for phage genomes we are able to enhance phage gene prediction algorithms. In the absence of supporting data to confirm our new predictions, we turned to high-volume sequence similarity searches to explore the predicted proteins. Regions of the genome that encode proteins are more likely to be conserved at the amino acid level than regions that encode regulatory regions, replication regions, sites of integration and other DNA-based information components of the phage genome (Badger and Olsen, 1999). These searches showed that the predicted phage genes might encode novel proteins that have been missed by existing gene callers designed to annotate bacterial genomes.
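The tiling idea described above can be sketched as a shortest-path problem. The following Python sketch is an illustration only, not the authors' implementation: the ORF coordinates, the function names, and the weights (gap_cost, overlap_cost, switch_cost, and a reward of one unit per coding base) are all assumptions chosen for the example. Each ORF is a node, edges connect an ORF to any ORF that starts and ends further downstream, and Bellman-Ford is used because the edge weights can be negative.

```python
from typing import List, Optional, Tuple

ORF = Tuple[int, int, str]  # (start, end, strand), 0-based half-open coordinates


def build_edges(orfs: List[ORF], genome_len: int, gap_cost: float = 0.1,
                overlap_cost: float = 2.0, switch_cost: float = 20.0):
    """Build a weighted graph. Node 0 is the source, node len(orfs) + 1 is the
    sink, and node i + 1 stands for orfs[i]. All weights are hypothetical."""
    n = len(orfs)
    source, sink = 0, n + 1
    edges = []

    def connect(u, u_end, u_strand, v, v_start, v_strand, v_reward):
        dist = v_start - u_end
        # Gaps (non-coding bases) and overlaps are both penalised.
        cost = gap_cost * dist if dist >= 0 else overlap_cost * (-dist)
        # Switching strand between consecutive ORFs is penalised.
        if u_strand and v_strand and u_strand != v_strand:
            cost += switch_cost
        # Entering an ORF is rewarded by its length (negative weight).
        edges.append((u, v, cost - v_reward))

    for i, (s, e, strand) in enumerate(orfs):
        connect(source, 0, None, i + 1, s, strand, e - s)
        connect(i + 1, e, strand, sink, genome_len, None, 0)
        for j, (s2, e2, strand2) in enumerate(orfs):
            if s2 > s and e2 > e:  # strictly downstream, so the graph is acyclic
                connect(i + 1, e, strand, j + 1, s2, strand2, e2 - s2)
    return n + 2, edges


def bellman_ford(n_nodes: int, edges, source: int):
    """Standard Bellman-Ford relaxation; returns distances and predecessors."""
    dist = [float("inf")] * n_nodes
    prev: List[Optional[int]] = [None] * n_nodes
    dist[source] = 0.0
    for _ in range(n_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                prev[v] = u
                changed = True
        if not changed:
            break
    return dist, prev


def call_genes(orfs: List[ORF], genome_len: int) -> List[ORF]:
    """Return the ORFs on the lowest-cost path from genome start to genome end."""
    n_nodes, edges = build_edges(orfs, genome_len)
    _, prev = bellman_ford(n_nodes, edges, 0)
    path, node = [], prev[n_nodes - 1]
    while node not in (None, 0):
        path.append(orfs[node - 1])
        node = prev[node]
    return path[::-1]


candidates = [(0, 300, "+"), (310, 900, "+"), (250, 400, "-")]
print(call_genes(candidates, 1000))
```

With these example weights the short opposite-strand ORF is left out, because including it would cost more in overlap and strand-switch penalties than it gains in coding bases.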


Results

This section briefly explains Web Apollo's core operations for importing data, editing, and exporting protein-coding gene models. We also describe features supporting the annotation of corrections to lower quality genome assemblies, import and visualization of transcriptome data, and real-time collaboration.

Protein-coding gene annotation

To annotate a gene, curators commonly proceed by: (1) locating the region of interest; (2) inspecting all available gene predictions and biological evidence aligned to the region; (3) creating a gene model; (4) if necessary, modifying this gene model using the editing functions; (5) corroborating the accuracy of the annotation by comparing the resulting annotation with available homologs; and (6) ensuring that correct naming conventions and relevant comments have been added, utilizing available literature as needed.

Importing genomic data: Using server-side middleware, the system can load data tracks from a variety of sources, including the UCSC genome database [23], Chado databases [24], Ensembl DAS [25], and GenBank XML [26]. In our recent experience, however, the most common sources of genomic information are the laboratories of individual researchers themselves, and therefore we focused our attention on direct loading of genomic data files. The system accepts results of computational genomic analyses in the standard, widely used file formats GFF3 (Generic Feature Format, a de facto standard for sharing analysis results), SAM (Sequence Alignment/Map, accepted standard for efficient representation of high throughput sequencing alignments [27]), BAM (binary version of SAM), and BigWig (a binary index of 'wiggle' formatted files for the storage of dense, continuous data [28]). The initial server for an organism is typically primed with data using the combined output from a full genome analysis pipeline, such as MAKER [29]. Working with the MAKER developers, we have implemented a feature that dynamically instantiates a Web Apollo server as the final step in a MAKER run. In addition, users may augment pipeline results with other data, either during the initial installation and configuration process (in which case it is stored on the server), or by loading them dynamically from a local file or URL during a session. The URL alternative makes it possible for a group of users to share their data without having to add it to the central server, for example to share and display the output from a Galaxy process [30].
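For readers unfamiliar with GFF3, the format is a tab-separated text file with nine columns per feature. The Python sketch below is a minimal reader for a single line, included only to show what such a file contains. The function name and the example record are hypothetical, and this is not how Web Apollo itself loads data.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated, URL-encoded key=value pairs.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


example = "scaffold1\tMAKER\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=example%20gene"
print(parse_gff3_line(example))
```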

Locating the region of interest: Due to the highly fragmented nature of low-coverage genome assemblies with hundreds or thousands of scaffolds, selecting a chromosomal region of interest is not always a straightforward task. To assist in locating a region of interest, users may deploy the 'Search Sequence' tool, which queries the assembled genome with a gene or chromosomal region of interest using a BLAT search (BLAST-like Alignment Tool [31]). This feature was implemented using a plug-in architecture, allowing support for search tools other than BLAT with minor additions to the source code. BLAT may point to multiple potential regions containing the query sequence when paralogs are present, and/or when the gene of interest is split across two or more genomic fragments. The search returns a list of regions from which a user can choose by simply clicking on a region's row to display that region in the browser.

As an example, Figure 2 displays a small region of a scaffold from the honeybee (Apis mellifera) genome assembly. Each horizontal track presents a particular type of data, variously shown as graphs, 'heat maps', or as discrete features depending on the type of data and on user preferences. The data tracks retrieved from the server or uploaded by the user are read-only and are used as the evidence to support or refute individual gene models.

Example of the Web Apollo interface. Moving from top to bottom these example tracks from the honeybee (Apis mellifera) genome display: (A) In-progress gene models interactively being edited by the user. (B) The honeybee consortium's official gene set. (C) Transcripts from the NCBI RefSeq database. (D) Output from MAKER. (E) Output from various different gene prediction programs. (F, I, J) Contigs generated from RNA-seq data for respectively: nurse bees, testes, and ovaries. (G) Coverage map from the nurse bee RNA-seq data. (H) RNA-Seq data from forager bees displayed as a 'heat map'. Note that none of the gene predictions are in agreement regarding intron-exon boundaries in (E), which illustrates why manual review is needed. Web Apollo gives biologists the ability to manually resolve disagreements and create a more accurate set of gene predictions to improve upstream analysis pipelines in subsequent runs, as well as provide a more reliable substrate for downstream analyses.

Creating a gene model: Curators begin the manual annotation process by selecting and dragging the most appropriate computational results into the 'User-created Annotations' area, a writable 'white board' track where they can modify transcripts and individual exons. Alternately there is also the option to automatically promote one of the computational prediction sets. Due to the redundancy of available evidence for highly expressed transcripts, and the fluid growth of the available evidence, we expressly decided not to include any meta-data listing the evidence tracks used to create an annotation. The former would cause the meta-data captured to balloon, and the latter would make it extremely difficult to maintain data integrity. In our experience it is more effective to keep track of dates. If the annotation itself is dated (both for creation and for modification) as well as the evidence, then it is a straightforward operation to compare these and flag discrepancies. It is also important to use the available screen area optimally, particularly as the volume of information increases. Towards this end we added the capacity to restrict the view to a single strand, and to lock the editable white-board track into position so it is visible regardless of whether the user scrolls vertically.

Modifying a gene model: Basic editing operations such as deleting, merging, splitting, or duplicating a transcript or part of one, can be accessed from a pop-up menu available for each feature using a right-click of the mouse. To modify exon boundaries, users click to select the subject exon and drag either one of the edges. Apollo facilitates correct determination of exon boundaries by highlighting matching edges across the annotation and evidence tracks, by coloring the CDS annotation and evidence features according to their reading frame (that is, the frame of each exon is indicated by its color, and thus any features with conflicting frames display in a different color), and by flagging non-canonical splice-sites in the user's annotations. The resulting protein sequence can be used to determine the biological credibility of a gene model by querying highly curated protein databases. Editing requests from different users arrive at the server one at a time (because of the network) and are handled in their order of arrival. The unit of operation includes all the additional edits that are intrinsic to the original operation, that is, if an exon is deleted or shortened then the parent transcript and parent gene are modified as well. The second edit request will either overwrite the first edit, which the first user will be able to see immediately, or in very rare cases of a contradictory edit (for example, an exon being deleted by the first user and then a request to change its boundary by the second user) the second user will receive an error warning, and the annotation will remain as edited by the first user. All operations performed in the 'User-created Annotations' track are recorded in the history and can be reversed or repeated with the 'Undo' and 'Redo' options.
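To make the splice-site flagging concrete, the Python sketch below checks whether each intron of a forward-strand transcript begins with 'GT' and ends with 'AG', the canonical pattern. It is a simplified illustration under stated assumptions: the function name, the 0-based half-open exon coordinates, and the restriction to the forward strand are choices made for the example, not a description of Apollo's code.

```python
def noncanonical_introns(sequence, exons):
    """Return introns that do not follow the canonical GT...AG pattern.

    sequence: genomic sequence of the forward strand.
    exons: list of (start, end) pairs, 0-based half-open, sorted by start.
    Each result is (intron_start, intron_end, donor, acceptor).
    """
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = sequence[left_end:right_start].upper()
        donor, acceptor = intron[:2], intron[-2:]
        if donor != "GT" or acceptor != "AG":
            flagged.append((left_end, right_start, donor, acceptor))
    return flagged


# Two exons separated by an 11-base intron.
canonical = "ATGGCC" + "GTAAGTTTCAG" + "GGATAA"
miscalled = "ATGGCC" + "GCAAGTTTCAG" + "GGATAA"
exons = [(0, 6), (17, 23)]
print(noncanonical_introns(canonical, exons))
print(noncanonical_introns(miscalled, exons))
```

The second call reports the intron because its donor site reads 'GC' rather than 'GT', which is the kind of junction a curator would be prompted to inspect.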

Exporting data: To conduct further analyses, users may export their annotations as FASTA-formatted sequences, GFF3 files, or record them in a Chado database.

Sequence alterations

During the development of Web Apollo, we encountered a scenario among the newer genome projects that was radically different from our previous experience with large sequencing centers and MODs. The centers and MODs historically focused on assembling reference genomes with deep coverage from Sanger sequencing resulting in full chromosomal assemblies. In contrast, more recent projects are often assembled from Next Generation Sequencing (NGS) technologies which generate shorter reads with higher error rates, resulting in assemblies that are not only more fragmented but also contain a relatively higher number of errors in the genomic sequence [32]. For example, some errors introduced indels in coding sequence, disrupting the reading frame. Biologists needed to annotate the features on the genome, but in order to create the correct transcript annotation, correcting these suspected sequencing and assembly errors was also necessary, and it became a highly requested feature. Curators may now correct suspected assembly errors using Apollo's ability to perform genomic sequence insertions, deletions, and substitutions (Figure 3). These sequence changes do not alter the underlying reference assembly stored on the server, but are maintained as annotations so they can potentially be incorporated into subsequent assemblies for incremental improvements. Within the context of Apollo, these genomic sequence annotations create an underlying virtual sequence that is incorporated when calculating mRNA and protein sequences for these annotations. The resulting sequences can be exported as described below in the Methods section.
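The idea of a virtual sequence can be illustrated with a short Python sketch. This is only an illustration: the tuple layout, the function name, and the assumption that alterations do not overlap are ours, not details of Apollo's implementation. Alterations are stored separately from the reference and applied on demand, so the reference string is never modified.

```python
def apply_alterations(reference, alterations):
    """Return a virtual sequence with the alterations applied.

    alterations: list of (kind, position, value) with 0-based positions on the
    reference, assumed not to overlap:
      ("substitution", pos, bases)  replaces len(bases) bases starting at pos
      ("insertion", pos, bases)     inserts bases before pos
      ("deletion", pos, length)     removes length bases starting at pos
    """
    seq = reference
    # Apply from right to left so earlier coordinates stay valid.
    for kind, pos, value in sorted(alterations, key=lambda a: a[1], reverse=True):
        if kind == "substitution":
            seq = seq[:pos] + value + seq[pos + len(value):]
        elif kind == "insertion":
            seq = seq[:pos] + value + seq[pos:]
        elif kind == "deletion":
            seq = seq[:pos] + seq[pos + value:]
        else:
            raise ValueError(f"unknown alteration type: {kind}")
    return seq


reference = "ATGGCCGCAAGT"
# A suspected mis-called base: substitute 'T' for the 'C' at position 7.
print(apply_alterations(reference, [("substitution", 7, "T")]))
print(reference)  # the stored reference is unchanged
```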

Example of sequence alteration editing operations. The top panel shows a transcript annotation (in blue) flagged with an orange exclamation icon indicating that the curated intron-exon junction does not follow a canonical splice site pattern, that is, having a 'GT' immediately 3' of the junction. In the second panel a curator has examined this issue and determined that a base was mis-called in the assembly, and has therefore added a substitution annotation (shown in yellow), substituting a 'T' for a 'C'. This change immediately triggers removal of the non-canonical warning icon, because with the substitution the splice junction now has the canonical 'GT'. In the third panel a curator has created a sequence insertion annotation (shown in green) upstream of the splice, and this leads to a stop codon that truncates the CDS. In the last panel a sequence deletion annotation has been created (shown in red), which causes a frame shift for the annotation transcript, and results in the reversal of the CDS truncation.

Visualizing stage and cell-type specific transcription

Using new sequencing technologies, researchers are able to capture snapshots of the entire RNA content of samples from particular cell types, particular tissues, at particular developmental stages, or under any number of other specific environmental conditions. These techniques measure expression levels more precisely and offer better opportunities to identify alternate transcripts than the previous methods [33], providing essential information for thorough gene structure annotation. To gain an understanding of expression levels, Web Apollo offers multiple modes for transcriptome data visualization: as coverage plots, as 'heat maps', and as alignments. Graphs of expression levels across the genome may be driven from data loaded in BigWig format; alternatively, the number of reads per base can be calculated using either the raw sequence data (FASTQ, SFF, and so on) or using alignment data from BAM files. Expression data may also be shown as 'heat map' plots (Figure 2, track H) in which regions with scores above a given threshold acquire a progressively brighter shade of blue, and scores below that threshold progressively become more intensely red. The display of aligned reads (BAM) includes base-by-base alignments for each read, if the MD or CIGAR fields for the read are provided. As shown in Figure 4, Web Apollo can display high-throughput RNA sequencing data from files in any of these formats, either from the server or from user-uploaded data files through a web browser.
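As a rough illustration of how the number of reads per base can be derived from alignments, the Python sketch below computes a coverage track from alignment intervals. It is a simplification under stated assumptions: alignments are given as plain (start, end) intervals in 0-based half-open coordinates, spliced reads and other CIGAR details are ignored, and the function name is hypothetical.

```python
def per_base_coverage(length, alignments):
    """Count how many alignments cover each base of a region.

    length: number of bases in the region.
    alignments: iterable of (start, end) intervals, 0-based half-open.
    """
    # A difference array: +1 where an alignment starts, -1 where it ends.
    diff = [0] * (length + 1)
    for start, end in alignments:
        start, end = max(start, 0), min(end, length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum turns the difference array into per-base depth.
    coverage, depth = [], 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


print(per_base_coverage(6, [(0, 3), (2, 5)]))
```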

RNA-Seq evidence provides support for alternative isoforms. In this example from the bovine genome (Bos taurus) the RNA-Seq data was stored as a BAM file and dynamically uploaded. Individual aligned reads are shown in teal. The example highlights the importance of utilizing deep RNA sequencing for curation. Two different splice variants are visible: one variant is visible in the Dog Ensembl track and a different one is visible in the Mouse Ensembl track. The RNA-Seq data track clearly shows evidence that both variants are present in the bovine. Edge-matching (in red) highlights the concordance in exon boundaries between the different tracks.

Real-time collaboration

In addition to supporting an individual's work, Web Apollo allows groups of researchers to share their annotations and to collaboratively add, delete, and revise annotations. Collaboration is enabled through the server's management of user login, authentication, and editing authorization permissions. The application is flexible enough to support members of a group working concurrently or at different times. Multiple users may work simultaneously on the same region while discussing their work in chat windows or using Voice-over IP services (for example, Skype, Google Hangout, Vidyo, and so on). All changes made in one client are instantly pushed and visible to all other clients. Alternatively, users may work asynchronously, monitoring the changes that occur in their absence. This is possible because the mechanism that supports 'Undo' and 'Redo' functions also supports graphical browsing of an annotation's edit history (Figure 5). Each revision is tracked, dated, and signed so collaborators can visually review the changes and identify the user(s) who made them. Users may add as many details as necessary in support of each annotation in the form of comments. Comments can be chosen from a predefined set, be added as free-text, and/or as cross-references to related resources (for example, gene ontology (GO) functional terms).

History tracking and edit operation. Two History windows show how the transcript changed between edit operations. Each History entry shows the edit operation, the user who made the edit, and the date. The top window shows the transcript after merging of two exons and the one below shows the transcript after an exon has been deleted. Users can click on different History entries, which will display how the transcript looked at that point in time.

Community adoption: In the three months since its public release in December of 2012, 18 servers (Table 1) for eight different annotation groups have been set up, some with our group's assistance and others independently.



The International Sweetpotato Genome Initiative (ISPGI) is pleased to make available I. trifida and I. triloba genome sequences for use by public and private research communities as a resource to facilitate plant biology discoveries and plant breeding programs.

The paper "Genome sequences of two diploid wild relatives of cultivated sweetpotato reveal targets for genetic improvement" describing the sequencing and analysis of the I. trifida and I. triloba genome sequences was published in Nature Communications. ( https://doi.org/10.1038/s41467-018-06983-8 ). The data below is also archived at the Dryad Digital Repository ( https://doi.org/10.5061/dryad.b9m61cg ).

The I. trifida and I. triloba genome sequencing is primarily supported by Bill & Melinda Gates Foundation through the GT4SP (Genomic Tools for Sweetpotato Improvement) project.

Zhangjun Fei (Boyce Thompson Institute/Cornell) has sequenced and assembled two diploid Ipomoea species: Ipomoea trifida (NSP306) and Ipomoea triloba (NSP323). The genome assemblies have been annotated by Robin Buell and her group at Michigan State University.

The version 3 pseudomolecules, annotation, and RNA-seq gene expression data are available to download below. The genome annotation can also be viewed in the JBrowse genome browser. The browser also provides tracks displaying evidence alignments, RNA-seq coverage, putative repeats, and SNP calls. A BLAST server is also available for searching your sequences against the version 3 annotation and assemblies.

The annotation files and JBrowse links below are the v3 annotations based on the pseudomolecules (assembly v3).

Release Date: April 17, 2017

JBrowse Genome Browser

BLAST Server

Genome Assemblies

    - FASTA format
    - FASTA format
    The genome assemblies hard masked using RepeatMasker and a species-specific repeat library:

      - FASTA format
      - FASTA format

    Genome Annotation

    The genome annotation is based on the version 2 Ipomoea trifida (NSP306) and Ipomoea triloba (NSP323) assemblies. The annotation was transferred to the v3 pseudomolecules and final locus names assigned.

      Ipomoea trifida (NSP306) Genome Annotation - High Confidence Gene Model Set

        - Nucleotide sequences of the high confidence gene model transcript sequences (cDNA).
        - Nucleotide sequences of the high confidence gene model coding sequences (CDS).
        - Amino acid sequences corresponding to the high confidence gene model coding sequences
        - High confidence gene model annotation in GFF3 format
        - Putative functional annotation of high confidence gene models
        - List of representative high confidence gene models (longest CDS)

        - Nucleotide sequences of the high confidence gene model transcript sequences (cDNA).
        - Nucleotide sequences of the high confidence gene model coding sequences (CDS).
        - Amino acid sequences corresponding to the high confidence gene model coding sequences
        - High confidence gene model annotation in GFF3 format
        - Putative functional annotation of high confidence gene models
        - List of representative high confidence gene models (longest CDS)

        - Nucleotide sequences of the working gene model transcript sequences (cDNA).
        - Nucleotide sequences of the working gene model coding sequences (CDS).
        - Amino acid sequences corresponding to the working gene model coding sequences
        - Working gene model annotation in GFF3 format

        - Nucleotide sequences of the working gene model transcript sequences (cDNA).
        - Nucleotide sequences of the working gene model coding sequences (CDS).
        - Amino acid sequences corresponding to the working gene model coding sequences
        - Working gene model annotation in GFF3 format

      RNA-seq Gene Expression Data

      I. trifida and I. triloba RNA-seq libraries were mapped in paired-end read and single-stranded library mode to their respective version 3 genome assemblies using TopHat (v2.1.0). Gene expression for the version 3 high confidence gene models was calculated as FPKM (fragments per kilobase of transcript per million mapped fragments) using Cufflinks (v2.2.1).

        - FPKM values of v3 high confidence gene models for 8 I. trifida RNA-seq libraries (callus_flower, callus_stem, flower, flowerbud, leaf, root1, root2, stem)
        - FPKM values of v3 high confidence gene models for 6 I. triloba RNA-seq libraries (flower, flowerbud, leaf, root1, root2, stem)
        - FPKM values of v3 high confidence gene models for 15 I. trifida abiotic and biotic stress RNA-seq libraries. The libraries are described in the 'Library Key' worksheet.
        - FPKM values of v3 high confidence gene models for 15 I. triloba abiotic and biotic stress RNA-seq libraries. The libraries are described in the 'Library Key' worksheet.
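      For readers who have not worked with FPKM values such as those in the files listed above, the basic definition can be sketched in a few lines of Python. This is the textbook formula only, with a hypothetical function name and made-up example numbers. Cufflinks estimates abundances with a statistical model, so its reported values will not generally match this simple calculation exactly.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments: fragments assigned to the gene or transcript.
    transcript_length_bp: transcript length in base pairs.
    total_mapped_fragments: all mapped fragments in the library.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript in a
# library of 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))
```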

      Contact:

      Dr. C. Robin Buell, Michigan State University - [email protected]


      Funding has been provided by awards from the Bill & Melinda Gates Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Bill & Melinda Gates Foundation.



      We are pleased to announce release 7 of the Rice Pseudomolecules and Genome Annotation. The official release date for this version was October 31, 2011.

      Release 7 is a major update from release 6.1. The rice pseudomolecules have been reconstructed using an optimal BAC tiling path that involved use of a BAC-optical map and error correction of the underlying BAC sequence using next generation sequencing reads from Nipponbare rice. This effort, in cooperation with researchers at the Agrogenomics Research Center at the National Institute of Agrobiological Sciences, Tsukuba, Japan and the Rice Annotation Project Database (RAP-DB), represents a final and unified set of pseudomolecules (Os-Nipponbare-Reference-IRGSP-1.0). There are the 12 chromosomes, one pseudomolecule representing the unanchored BAC clones, one pseudomolecule representing unmapped Syngenta sequences plus the two organellar genomes. Note that while the MSU Rice Genome Annotation Project and the International Rice Annotation Project Database (RAP-DB) have different annotation efforts, these parallel annotation efforts utilize the same underlying pseudomolecule sequence.

      In release 7, there were 373,245,519 bp of non-overlapping rice genome sequence from the 12 rice chromosomes. The genes that had been identified in release 6.1 were remapped and transferred to release 7. This process resulted in 55,986 genes (loci), of which 6,457 had 10,352 additional alternative splicing isoforms, giving a total of 66,338 transcripts (or gene models) in the rice genome. Note that small gene models (<50 amino acids) have been excluded from our annotated gene set.

      Transposable element-related (TE-related) gene models were identified using two approaches: BLASTN searches against the MSU Oryza Repeat Database and by identifying gene models containing TE-related Pfam domains. These loci (16,941) and their models (17,272) were annotated based on the Pfam domain or the nomenclature in the MSU Oryza Repeat Database. Pack-MULEs were identified on all 12 chromosomes. They were annotated as described in Hanada et al. 2009. Transduplicate MULEs identified by Juretic et al. 2005 were aligned to the current pseudomolecules. Note that the Jiang Pack-MULEs and the transduplicate MULEs had only been identified on the Genome Browser and not in our functional annotation. Also note that although loci and gene models on ChrUn and ChrSy are now included in our official gene set, they are not assigned LOC_OsXXgXXXXX identifiers. These two pseudomolecules contain 185 loci and gene models.

      Please note that these pseudomolecules are constructed from finished and unfinished sequence and a majority of the gene models have not been manually curated.


      Table of Rice Pseudomolecule, Loci, and Gene Models in Release 7

      Chr        BAC/PAC No.  Pseudomolecule Length (bp)  Gaps  Loci TE (b)  Loci Non-TE (c)  Loci Total (d)  Models TE (b)  Models Non-TE (c)  Models Total (d)
      1          392          43,270,923                  8     1,464        5,078            6,542           1,518          6,518              8,036
      2          359          35,937,250                  5     1,244        4,143            5,387           1,274          5,392              6,666
      3          331          36,413,819                  8     1,185        4,388            5,573           1,224          5,803              7,027
      4          296          35,502,694                  9     1,903        3,419            5,322           1,919          4,265              6,184
      5          286          29,958,434                  5     1,461        3,118            4,579           1,483          4,009              5,492
      6          281          31,248,787                  4     1,488        3,236            4,724           1,517          3,965              5,482
      7          289          29,697,621                  3     1,397        3,065            4,462           1,430          3,767              5,197
      8          278          28,443,022                  3     1,432        2,762            4,194           1,446          3,426              4,872
      9          223          23,012,720                  7     1,148        2,260            3,408           1,161          2,768              3,929
      10         208          23,207,287                  10    1,219        2,298            3,517           1,244          2,830              4,074
      11         261          29,021,106                  6     1,459        2,707            4,166           1,493          3,208              4,701
      12         269          27,531,856                  5     1,579        2,443            4,022           1,605          2,983              4,588
      Total (e)  3,184        373,245,519                 73    16,979       39,102           56,081          17,314         49,119             66,433

      Loci and Models columns exclude small gene models (a).

      a Excluding small gene models (< 50 amino acids).
      b TE: Transposable elements related genes and gene models. The rice proteome was searched against the MSU Oryza Repeat Database with TBLASTN and against the TE-related Pfam domains with hmmpfam. Genes and gene models with matches above cut-offs were annotated as TE-related gene models. However, genes that have been identified as TE-related based on Pfam similarity but that were annotated by Community Annotators (CA) as non-TE functional genes are classified as non-TE-related and are given the CA-provided functional annotation.
      c Non-TE: Non-TE related gene models.
      d There are 89 loci and 89 models on ChrSy. There are 96 loci and 96 models on ChrUn. These loci and models are not included in the totals for the main pseudomolecules.
      e Note that these pseudomolecules are now identical to the IRGSP/RAP pseudomolecules.


      Official genome assembly for Papilio glaucus v1.1 can be downloaded here.
      Official gene annotation for Papilio glaucus v1.1 can be downloaded here.

      Official genome assembly for Lerema accius v1.1 can be downloaded here.
      Official gene annotation for Lerema accius v1.1 can be downloaded here.

      Official genome assembly for Phoebis sennae v1.1 can be downloaded here.
      Official gene annotation for Phoebis sennae v1.1 can be downloaded here.

      Official genome assembly for Calycopis cecrops v1.1 can be downloaded here.
      Official gene annotation for Calycopis cecrops v1.1 can be downloaded here.

      Official genome assembly for Pieris rapae v2 can be downloaded here.
      Official gene annotation for Pieris rapae v2 can be downloaded here.

      Official genome assembly for Achalarus lyciades v1 can be downloaded here.
      Official gene annotation for Achalarus lyciades v1 can be downloaded here.


      Abstract

      The pig is one of the earliest domesticated animals in the history of human civilization and represents one of the most important livestock animals. The recent sequencing of the Sus scrofa genome was a major step toward the comprehensive understanding of porcine biology, evolution, and its utility as a promising large animal model for biomedical and xenotransplantation research. However, the functional and structural annotation of the Sus scrofa genome is far from complete. Here, we present mass spectrometry-based quantitative proteomics data of nine juvenile organs and six embryonic stages between 18 and 39 days after gestation. We found that the data provide evidence for and improve the annotation of 8176 protein-coding genes including 588 novel and 321 refined gene models. The analysis of tissue-specific proteins and the temporal expression profiles of embryonic proteins provides an initial functional characterization of expressed protein interaction networks and modules including as yet uncharacterized proteins. Comparative transcript and protein expression analysis to human organs reveals a moderate conservation of protein translation across species. We anticipate that this resource will facilitate basic and applied research on Sus scrofa as well as its porcine relatives.


      We have been funded by the National Science Foundation to annotate the rice genome. A summary of the project and its goals is given below.

      Rice is a model species for the monocotyledonous plants and the cereals, which are the greatest source of food for the world's population. While the rice genome sequence is available through multiple sequencing projects, high-quality, uniform annotation is required for the genome sequence data to be fully utilized by researchers. The existence of a common gene set and uniform annotation allows researchers within the rice community to work from a common resource so that their results can be more easily interpreted by other scientists.

      The objective of this project has always been to provide high-quality annotation for the rice genome. We generated, refined, and updated gene models for the estimated 40,000-60,000 total rice genes, provided standardized annotation for each model, and linked each model to functional annotation including expression data, gene ontologies, and tagged lines.

      We have extended the annotation of the rice genome to other plant species by providing comparative alignments to those species. We have provided training in bioinformatics to over 100 plant scientists, sharing our informatics expertise with a broader range of scientists. We have developed agricultural genomics lecture and teaching modules for educating high school students and teachers on the significance of agricultural genomics.

      This project is funded by the National Science Foundation Plant Genome Research Program # DBI-0321538 and DBI-0834043.


      Methods

      The annotation system as described here is a platform-independent specification.

      The OMP wiki [14] implementation of the annotation system is based on the open source MediaWiki software platform [28]. The OMP wiki is currently running on MediaWiki 1.31 using PHP 7.2 and MySQL 5.7 with customized extensions to support biological wikis and ontology projects [29] and additional software extensions developed specifically to support OMP projects. The OMP wiki is currently a virtual host on a single Linux server at Texas A&M shared with other projects. Extension code is open source and available at our GitHub repository [30].

      The OMP and ECO ontologies are downloaded from our central repositories daily and parsed into a local MySQL database, obo_archive, with a custom schema that incorporates version history for every ontology term.
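As a rough illustration of the idea of archiving ontology terms with version history, the sketch below parses the `[Term]` stanzas of an OBO-format file and appends a new history row only when a term's name has changed since the last load. It is not the obo_archive schema, which the text does not describe: the table `term_history`, its columns, the function names, and the use of SQLite instead of MySQL are all assumptions made to keep the example self-contained.

```python
import sqlite3


def parse_obo_terms(text):
    """Yield (id, name) for each [Term] stanza in OBO-format text."""
    term = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts: emit the previous term, if any.
            if term and "id" in term:
                yield term["id"], term.get("name", "")
            term = {} if line == "[Term]" else None
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name") and key not in term:
                term[key] = value
    if term and "id" in term:
        yield term["id"], term.get("name", "")


def store_snapshot(conn, text, loaded_on):
    """Append a history row for every term whose name is new or changed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS term_history "
        "(term_id TEXT, name TEXT, loaded_on TEXT)"  # hypothetical schema
    )
    added = 0
    for term_id, name in parse_obo_terms(text):
        if current_name(conn, term_id) != name:
            conn.execute(
                "INSERT INTO term_history VALUES (?, ?, ?)",
                (term_id, name, loaded_on),
            )
            added += 1
    conn.commit()
    return added


def current_name(conn, term_id):
    """Return the most recently stored name for a term id, or None."""
    row = conn.execute(
        "SELECT name FROM term_history WHERE term_id = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (term_id,),
    ).fetchone()
    return row[0] if row else None
```

Calling `store_snapshot` once per daily download keeps older names available in `term_history`, while `current_name` gives the kind of lookup an editing form would need when a curator types a term id.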

      The annotation system within the wiki is controlled by a custom extension for the OMP project, which in turn builds on TableEdit [31], an extension for managing structured tabular data in MediaWiki, and TableEdit-based code modules developed for ontology wiki projects [29]. The template for the annotation form is defined by a page in the wiki, Template:OMP_annotation_table, which controls formatting and callbacks for the displays in Fig. 2a (viewing mode) and b (editing mode). The annotation editing form (Fig. 2b) uses obo_archive to look up current term names when a curator enters OMP or ECO ids.

      Each phenotype annotation is stored as a TableEdit row associated with a specific TableEdit table on a genotype page. Each genotype page also contains a TableEdit table with genotype information defined by a different TableEdit template, Template:Strain_info_table. To calculate possibly relevant differences in genotype and conditions, the extension uses the unique annotation id in the "Relative to" field to find the content of the conditions field in the reference annotation, and the genotype on the page where the reference annotation is stored. The genotype and conditions fields for the reference and dependent annotation are then tokenized with a regular expression, and the differences are calculated by comparing arrays of unique tokens for each field.
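The comparison described above can be sketched in a few lines. This is a minimal illustration, not the extension's code (which is written in PHP): the token pattern, the function names, and the dictionary layout are assumptions, and the actual regular expression used by the OMP extension is not given in the text.

```python
import re

# Assumed token pattern: runs of characters other than whitespace, commas
# and semicolons. The real extension's regular expression may differ.
TOKEN_RE = re.compile(r"[^\s,;]+")


def unique_tokens(field):
    """Tokenize a field and return its unique tokens in first-seen order."""
    seen = []
    for token in TOKEN_RE.findall(field):
        if token not in seen:
            seen.append(token)
    return seen


def field_diff(reference, dependent):
    """Compare the unique tokens of one field in two annotations."""
    ref = unique_tokens(reference)
    dep = unique_tokens(dependent)
    return {
        "only_in_reference": [t for t in ref if t not in dep],
        "only_in_dependent": [t for t in dep if t not in ref],
    }


def annotation_diff(reference, dependent):
    """Diff the genotype and conditions fields of two annotations."""
    return {
        name: field_diff(reference[name], dependent[name])
        for name in ("genotype", "conditions")
    }
```

For example, with a reference annotation `{"genotype": "F- lambda- rph-1", "conditions": "LB, 37C"}` and a dependent annotation that adds `lacZ` to the genotype and changes `37C` to `42C`, `annotation_diff` reports `lacZ` as present only in the dependent genotype and `37C`/`42C` as the differing condition tokens.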

