From Whelan et al (2014):
Specimen Sampling and Sequencing
Malacobdella grossa was collected off Rhode Island by a commercial vessel harvesting the bivalve Arctica islandica as part of the study by Dahlgren et al. (2000). Paranemertes peregrina and T. polymorphus were collected in False Bay, San Juan Island, Washington. RNA extraction, complimentary DNA (cDNA) library preparation, and Illumina sequencing generally followed the methods of Weigert et al. (2014). In brief, we extracted RNA from M. grossa and Para. peregrina using whole animals and from T. polymorphus using the anterior three quarters of the specimen with TRIzol (Invitrogen). RNA was purified with the Qiagen RNeasy kit (Valencia, CA) using on-column DNAse digestion. cDNA libraries were constructed with the SMART cDNA library construction kit (Clontech Laboratories, Mountain View, CA) following the manufacturer’s protocol except that the provided 3’ oligo was replaced with the Cap-Trsa-CV oligo as per Meyer et al. (2009). Fulllength cDNA was then amplified using the Advantage 2 PCR system (Clontech) with a minimum number of PCR cycles (i.e., 17-21) and sent to HudsonAlpha Institute for Biotechnology (Huntsville, AL) for library preparation and sequencing on an Illumina HiSeq 2000 using 2x100 bp paired-end (PE) chemistry. Data for five other nemerteans were retrieved from NCBI (table 1; Riesgo et al. 2012).
Transcriptome Assembly
Transcriptome read quality was assessed with the FASTX toolkit (Gordon 2011). Given overall high read quality, sequences were not filtered prior to assembly. All raw data underwent digital normalization using the python script normalize- by-median.py (Brown et al. 2012) with a k-mer size of 20, a desired coverage (i.e., cutoff) of 30, and four hash tables with a lower bound of 2.5x10^9. Trinity version November 2013 (Grabherr et al. 2011; Haas et al. 2013) was utilized for transcriptome assembly of normalized reads for each species with a k-mer size of 25. Raw reads were assembled as PE data except for the two Lineus species and R. lacteus, for which only single-end reads were available.
Assessment of Assembly Quality
A rarefaction curve (Sanders 1968) of assembly statistics was used to evaluate quality and completeness of assemblies from the newly sequenced species because a nemertean reference genome was not available for comparison. Specifically, we removed 10-70% of PE sequences from the end of raw read files to produce datasets with reduced numbers of reads. These subsampled datasets were assembled as above, and N50 and total number of contigs for each assembly was plotted. A plateau on the rarefaction curve would indicate that adding more sequence data would not considerably change the characteristics of the transcriptome assembly. We further measured completeness of each transcriptome assembly with CEGMA 2.4 (Parra et al. 2007), which determines how many of 248 core eukaryotic genes were present in each transcriptome. Core genes annotated by CEGMA are ones that are highly conserved and chosen from the eukaryotic orthologous groups database (Tatusov et al. 2003). An advantage to an approach like CEGMA is that potential differences in which genes are present among assemblies, as a result of expression differences in various tissue types used for cDNA library preparation, is minimized because housekeeping genes should be expressed in virtually all cells.
Transcriptome Annotation
Annotation of each assembled transcriptome was done with the Trinotate annotation suite (http://trinityrnaseq.source forge.net/annotation/Trinotate.html, last accessed April 13, 2014). In brief, TransDecoder (Haas et al. 2013) was first used to predict open reading frames (ORFs) of at least 300 bp. If multiple, overlapping ORFs were present in the same contig, only the longest ORF was retained. In contrast, if multiple but nonoverlapping 300 bp ORFs were identified, all were retained. Thus, two or more ORFs could originate from the same transcript (i.e., ORFs on both forward and reverse strands and/or multiple ORFs on the same strand for long contigs). Untranslated transcripts and translated ORFs were then queried against the Swiss-Prot database (UniProt Consortium 2014) using Basic Local Alignment Search Tool x (BLASTx) and BLASTp, respectively (Altschul et al. 1997), with annotation coming from the best BLAST hit and associated Gene Ontology (GO) terms (Ashburner et al. 2000). Trinotate then used HMMER 3.1 tool hmmscan (Eddy 2001; Finn et al. 2011) and the Pfam-A database (Punta et al. 2014) to annotate protein domains for each predicted protein sequence. Trinotate results were populated into a SQLite database and placed into a tab delimitated file with scripts provided in the Trinotate package and a custom wrapper (available from http://github.com/halocaridina/bioinformatic-scripts, last accessed May 15, 2014). To roughly characterize the protein composition of each nemertean transcriptome, a custom python script (available from http://github.com/ NathanWhelan, last accessed May 15, 2014) was used to place GO terms for each UniProt annotated transcript from Trinotate into Web Gene Ontology Annotation Plotting (WEGO) format, and annotated GO terms were visualized using the WEGO web service (Ye et al. 2006).
Toxin genes were identified based on sequence similarity to previously characterized animal toxins genes under the assumption that sequence similarity is generally indicative of function (Gabaldon and Huynen 2004). Putative toxin genes were initially distinguished if top BLASTx and/or BLASTp hits in the Trinotate output were a previously characterized eukaryotic toxin gene (as defined by Swiss-Prot or presence of a Pfam domain with 'toxin' in the description). Genes of putative viral or bacterial origin were discarded, which eliminated putative toxin genes that may have been from bacterial endosymbionts or from horizontal gene transfer events. Amino acid sequences of remaining transcripts were then manually searched against the NCBI nonredundant GenBank database (nr) and the Pfam protein domain database using the HMMER 3.1 tool phmmer (Eddy 2001; Finn et al. 2011). Annotated sequences initially identified as a toxin gene by Swiss-Prot were further considered a toxin gene if either 1) cross validation produced a significant hit for a toxin domain family in Pfam or 2) if the best annotated hit from the nr database was labeled a toxin gene. In some instances, a toxin as identified by Trinotate and Swiss-Prot did not have a toxin gene as highest hit against the nr database using phmmer, nor did these transcripts possess a toxin protein domain according to Pfam-A; such transcripts were not further considered as they were potential false positives. Toxin genes passing the above filters were then reciprocally queried against the other transcriptomes to identify putative orthologs via BLAST searches. We also queried previously identified nemertean peptide toxins (i.e., Cytotoxin A-III, Neurotoxin B-II, and Neurotoxin B-IV; Kem 1976; Blumenthal et al. 1981) against all transcriptomes with a tBLASTn search.
Gene Tree Reconstruction
Gene trees were inferred for stonefish toxin (SNTX)-like and Plancitoxin-1-like genes because they were found in all nine nemerteans. Putative SNTX and Plancitoxin genes were translated with TransDecoder (Haas et al. 2013) using default settings. Redundant nemertean protein sequences were then removed from the dataset. The SNTX dataset of von Reumont (2014) was added to the nemertean SNTX-like genes. Putative Plancitoxin-1 genes and nontoxic DNase II genes were retrieved from UniProt and GenBank (supplementary figs. S1 and S2, Supplementary Material online). Alignments were done inMAFFT 3 with the E-INS-i algorithm (Katoh and Standley 2013). The appropriate model of protein evolution was selected for each gene using ProtTest 3.4 (LG+ F +Gamma for both genes; Darriba et al. 2011). RAxML 8 (Stamatakis 2014) was used to infer maximum likelihood gene trees, and 1,000 nonparametric bootstrap replicates were performed to assess nodal support. Trees were rooted with nontoxic homologs.