Raw cDNA transcriptome sequence reads:
The raw sequence data are deposited in the NCBI Sequence Read Archive (SRA) under accession numbers SRR3990241-SRR3990248, associated with BioProject PRJNA330848 and BioSamples SAMN05427525-SAMN05427532.
Species: Atlantic silverside (Menidia menidia)
Sample type: mRNA from mix of tissues
RNA extraction method: Qiagen RNeasy Plus Universal Tissue Mini Kit
Library preparation: Illumina’s TruSeq RNA sample prep kit v2
Sequencing instrument: Illumina HiSeq 2000
Assembled Atlantic silverside transcriptome:
An assembled Atlantic silverside transcriptome is deposited in the NCBI GenBank Transcriptome Shotgun Assembly Sequence Database (TSA). This version of the project (01) has the accession number GEVY01000000 and consists of sequences GEVY01000001-GEVY01020998. The cleaned RNA-seq reads from all samples were de novo assembled with two different programs: CLC Genomics Workbench v6.0.2 (once with an automatically optimized word size of 25 and once with a longer word size of 40) and Trinity v. r20131110 (with default settings, but retaining only the isoform with the highest mapped read depth within each subcomponent). Each of the three assemblies contained a substantial set of unique transcripts not present in the others, so we merged all three to maximize the gene space coverage in our final contig set. To reduce redundancy, we used cd-hit-est v4.5.4 to collapse the contig set into the longest representative for each unique sequence, and CAP3 v12/21/07 to meta-assemble partial assemblies of the same transcript. Following these procedures, we broke up likely chimeric contigs with the method of Yang and Smith (2013, BMC Genomics 14:328). Because we wanted to reduce our contig set to a single representative transcript for each silverside gene, we used a reciprocal best hit BLAST approach to extract non-redundant putative orthologs to the gene sets of three related species: platyfish (Xiphophorus maculatus), medaka (Oryzias latipes), and Nile tilapia (Oreochromis niloticus). We compared our contig set against the full peptide set for each reference species (downloaded from Ensembl release 75) with blastx, and then compared the peptide sequences for each species to our contig set with tblastn, in both cases using soft masking and an e-value cut-off of 10e-4. For each reference species, we recorded a reciprocal best hit (RBH) when a contig and a protein each had their best match to the other. We used a sequential approach to select putative orthologs.
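The reciprocal best hit criterion described above can be illustrated with a minimal Python sketch. It is not the script used for this dataset. It assumes that the blastx and tblastn results are available as standard 12-column tabular BLAST output, that "best" means the highest bit score, and that the cut-off is applied exactly as written in the text (10e-4); the function names and toy records are hypothetical.

```python
import csv


def best_hits(lines, max_evalue=10e-4):
    """Map each query to its best subject in 12-column tabular BLAST output.

    Assumption: "best" means the highest bit score (column 12) among hits
    whose e-value (column 11) passes the cut-off. 10e-4 is the value as
    written in the text and equals 0.001.
    """
    best = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(contig_to_protein, protein_to_contig):
    """Return (contig, protein) pairs that are each other's best hit."""
    return sorted(
        (contig, protein)
        for contig, protein in contig_to_protein.items()
        if protein_to_contig.get(protein) == contig
    )


def toy_row(query, subject, evalue, bitscore):
    """Build one hypothetical tabular BLAST record (filler alignment columns)."""
    filler = ["90.0", "100", "10", "0", "1", "300", "1", "100"]
    return "\t".join([query, subject] + filler + [str(evalue), str(bitscore)])


# Toy blastx results (contigs as queries, proteins as subjects).
blastx = [
    toy_row("contig1", "protA", 1e-50, 200),
    toy_row("contig1", "protB", 1e-20, 90),
    toy_row("contig2", "protA", 1e-30, 120),
    toy_row("contig3", "protC", 1e-2, 30),  # fails the e-value cut-off
]
# Toy tblastn results (proteins as queries, contigs as subjects).
tblastn = [
    toy_row("protA", "contig1", 1e-50, 200),
    toy_row("protA", "contig2", 1e-30, 120),
    toy_row("protB", "contig1", 1e-20, 90),
]

rbh = reciprocal_best_hits(best_hits(blastx), best_hits(tblastn))
print(rbh)  # [('contig1', 'protA')]
```

In this toy example contig2 has protA as its best hit, but protA's best hit is contig1, so only the contig1/protA pair qualifies as an RBH.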
We first extracted the contigs that were RBHs to platyfish proteins (since this species yielded the highest number of RBHs). We also added contigs that had a best hit to a portion of an RBH protein not covered by the RBH contig (secondary hits; a maximum overlap of 10 amino acids was allowed), under the assumption that these contigs represented transcript fragments. We then added contigs that were RBHs (and the associated secondary non-overlapping hits to the same proteins) to medaka proteins that were non-redundant to the platyfish proteins. Medaka proteins were considered non-redundant if they neither had an RBH to the previously extracted RBH platyfish protein set (in a direct blastp comparison of the two protein sets) nor were annotated to the same zebrafish gene (ZFIN ID) as an RBH platyfish protein. We similarly added contigs that were RBHs or associated secondary hits to tilapia proteins that were non-redundant to the proteins included from the other two species. To recover additional high-quality non-redundant transcripts, we used TransDecoder to predict coding regions in our redundancy-reduced contig set on the basis of nucleotide composition, open reading frame (ORF) length, and Pfam domain content. Of the contigs predicted to contain a complete ORF, we retained the subset that did not have a significant (e-value < 10e-2) blastn hit to the RBH contig set and were therefore considered non-redundant.
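The secondary-hit rule above (a maximum overlap of 10 amino acids with the region covered by the RBH contig) can be sketched in Python as follows. This is only an illustration under stated assumptions: alignment positions on the protein are taken as 1-based inclusive coordinates, each candidate is compared only against the RBH contig's interval (whether secondary hits were also compared against each other is not stated here), and all names and coordinates are hypothetical.

```python
def overlap_length(a, b):
    """Number of residues shared by two 1-based inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def select_secondary_hits(rbh_interval, candidates, max_overlap=10):
    """Keep candidate contigs whose aligned region on the protein overlaps
    the RBH contig's aligned region by at most max_overlap amino acids.

    Assumption: candidates are checked against the RBH interval only.
    """
    return [
        name
        for name, interval in candidates.items()
        if overlap_length(rbh_interval, interval) <= max_overlap
    ]


# Hypothetical protein coordinates covered by the RBH contig.
rbh_interval = (1, 150)
# Hypothetical contigs whose best hit is the same protein.
candidates = {
    "contigA": (145, 300),  # overlaps by 6 residues: kept
    "contigB": (100, 250),  # overlaps by 51 residues: rejected
    "contigC": (151, 200),  # no overlap: kept
}

secondary = select_secondary_hits(rbh_interval, candidates)
print(secondary)  # ['contigA', 'contigC']
```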
Methods are also published in:
Therkildsen, N. O., and S. R. Palumbi. 2016. Practical low-coverage genomewide sequencing of hundreds of individually barcoded samples for population and evolutionary genomics in nonmodel species. Molecular Ecology Resources. doi: 10.1111/1755-0998.12593