Contributors | Affiliation | Role |
---|---|---|
Orcutt, Beth N. | Bigelow Laboratory for Ocean Sciences | Principal Investigator |
D'Angelo, Timothy | Bigelow Laboratory for Ocean Sciences | Scientist |
Merchant, Lynne M. | Woods Hole Oceanographic Institution (WHOI BCO-DMO) | BCO-DMO Data Manager |
Genomic dataset collection, curation and quality control: This study used publicly available genome assemblies. Existing publicly available genome assemblies were downloaded from the National Center for Biotechnology Investigation (NCBI) and the Integrated Microbial Genomes (IMG) database of the U.S. Department of Energy’s Joint Genome Institute in June 2021. The Genome Taxonomy Database (GTDB) website (release 202) was used to access lists of NCBI assembly accession numbers for the following GTDB-assigned phyla: Nitrospinota, Nitrospinota_A (now called Tectomicrobia), Nitrospinota_B, Nitrospirota, Nitrospirota_A (Leptospirilla). The IMG assemblies were found using the same GTDB taxonomy classifier using the search function on the IMG website. IMG metagenome assemblies that were designated as “public” and “published” were also downloaded for these phyla. Duplicate entries between IMG and NCBI were manually removed.
Quality control of the assemblies was performed using the CheckM qa workflow (v 1.07) to remove genomes with <50% genome completion and >10% sequence contamination, leaving genomes that fall within the MIMAG categories “medium” (>50% completion, <10% contamination) and “high” (>90% completion, <5% contamination). These resulting genomes were dereplicated with dRep, using default parameters, to remove nearly-identical assemblies. All genomes were then classified using the GTDB-tk classifier tool (v1.5.0, r202). Polyphyletic groups that were once considered a part of Nitrospirota and Nitrospinota (i.e., Nitrospirota_A (Leptospirilla), Nitrospinota_A (Tectomicrobia) and Nitrospinota_B) were included only in the phylogenomic trees. These groups were not included in the gene cluster based functional analyses. All code to recreate these processes are available at https://github.com/ts-dangelo/bioinformatic_scripts_python.
Summary of processing on the submitted dataset by the dataset's BCO-DMO data manager.
1. Modified submitted accessions file
Fixed two shifted cell values in the submitted file "Supplemental Data 1.csv" and saved it as "Supplemental Data 1 fixed.csv" by using Excel.
The original file has a row number 333 with some values shifted into the wrong columns. Part of an "IsolationPlot" value was shifted to the right into the "Sample" column. And the “Sample” value was shifted to the right into the “Location” column. This was fixed by cutting the value in the “Sample” column and copying it into the "IsolationPlot" column for a final value of “saline water (ENVO:00002010) including plankton (ENVO:xxxxxxxx) “ and then cutting the value SAMN06450621 from the “Location” column and copying it into the "Sample" column.
2. Updated this fixed accessions file with IMG and NCBI metadata
Python code in Jupyter notebooks was created by the data manager and run to add metadata to the submitted accessions file. The scripts were run using the fixed file "Supplemental Data 1 fixed.csv" as input.
Data manager GitHub code repo (last commit = 7e8e2462a on 12/30/2024 updating docs)
https://github.com/BCODMO/DM_scripts_by_dataset/tree/master/Orcutt_acces...
Code run on 2024-12-29 using the following two Jupyter notebooks
https://github.com/BCODMO/DM_scripts_by_dataset/blob/master/Orcutt_acces...
https://github.com/BCODMO/DM_scripts_by_dataset/blob/master/Orcutt_acces...
The code and and processing procedure are documented in markdown files in the GitHub repository
https://github.com/BCODMO/DM_scripts_by_dataset/blob/master/Orcutt_acces...
https://github.com/BCODMO/DM_scripts_by_dataset/blob/master/Orcutt_acces...
Eight new columns were added using python Jupyter notebooks run on 12/29/2024 with the output saved to the file named “updated_pi_supplemental_table.csv”.
columns added to the fixed file "Supplemental Data 1 fixed.csv": IMG_genome_id, BioProject, Corrected_BioSample, GenBank_assembly, release_date, last_updated_date, publication_date, ncbi_taxon_id
column definitions
IMG_genome_id: IMG genome assembly id (extracted from values in “ID” column)
BioProject: NCBI BioProject accession
Corrected_BioSample: correct NCBI BioSample accession
GenBank_assembly: NCBI GenBank genome assembly accession
release_date: Date genome assembly was released to the public
last_updated_date: Last date BioSample was updated on NCBI
publication_date: Date genome assembly submitted to either IMG or NCBI
ncbi_taxon_id: NCBI taxon id of the sampled organism
Steps to retrieve extra metadata
Used a search on the IMG website and python Jupyter notebooks to add NCBI accessions, dates, and NCBI taxon id metadata columns to provide more context to the submitted file.
There are two sources for the new metadata. One source is a search query on the IMG website (Integrated Microbial Genomes and Microbiomes https://img.jgi.doe.gov/ https://img.jgi.doe.gov/) and the other source is NCBI API calls retrieving JSON reports.
The goal was to add corresponding BioProject accessions for each row in the ‘ID’ column of a submitted accessions file.
In the process of gathering supporting metadata by the dataset's BCO-DMO data manager, 15 BioSample values retrieved with the NCBI API using NCBI genome assembly accessions were found to be different than those in the submitted file’s “Sample” column, which contains BioSample values. The 15 new BioSample values in the column "Corrected_BioSample" in the final BCO-DMO dataset 933610_v1_nitrospirota_and_nitrospinota_genomes.csv were manually checked on their corresponding NCBI web pages. It was found that the BioSample metadata and the corresponding NCBI genome assembly value confirmed the BioSample values retrieved via NCBI API calls were correct. The original 15 BioSample values were manually checked on their corresponding NCBI web pages, and the BioSample metadata and NCBI genome assembly accession for that BioSample did not match those listed in the submitted file.
The incorrect BioSample values in the "Sample" column are the following:
SAMN03222684, SAMN03222684, SAMN02862043, SAMN03222686, SAMN03222686, SAMEA3140934, SAMEA3140928, SAMN01922995, SAMN03203005, SAMN03203000, SAMN10411419, SAMN12518157, SAMN04315473, SAMN06659639, SAMN07631077
The correct BioSample values are found in the column "Corrected_BioSample" in the primary dataset file 933610_v1_nitrospirota_and_nitrospinota_genomes.csv.
The new columns were found with results from a query on the IMG website (Integrated Microbial Genomes and Microbiomes at [https://img.jgi.doe.gov/](https://img.jgi.doe.gov/)) and python Jupyter notebooks using NCBI API calls and a series of joins with the fixed submitted accessions dataset.
To query the IMG website, a list of IMG genome ids was created using IMG id values in the “ID” column of the file "Supplemental Data 1 fixed.csv". The IMG genome ids in the “ID” column are prefixed with “IMG_” in the “ID” column and some are suffixed with an underscore followed by a number.
IMG query URL: https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=FindGenomes&page=geno...
Downloaded the query results table from IMG (Integrated Microbial Genomes and Microbiomes https://img.jgi.doe.gov/) where the query used the list of IMG genome ids from the “ID” column. The results table contained NCBI accessions and other metadata associated with IMG genome ids.
Created a python Jupyter notebook to perform NCBI API calls using BioSample accessions, NCBI assembly accessions, and taxon names to retrieve data reports in JSON format. The reports JSON contained values for BioSample, BioProject, NCBI assembly accession, release date, last updated date, publication date, and NCBI taxon id.
The NCBI API calls used the NCBI version 2 REST API which is explained here https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/rest-api/. The following three API endpoints were used:
/genome/accession/{accession}/dataset_report
/biosample/accession/{accession}/biosample_report
/genome/taxon/{taxon_name}/dataset_report
The API endpoint /genome/accession/{accession}/dataset_report was called with NCBI genome assembly accessions from the “ID” column of the accessions dataset table.
The API endpoint /biosample/accession/{accession}/biosample_report was called with NCBI BioSample accessions from the “Sample” column of the accessions dataset table.
The API endpoint /genome/taxon/{taxon_name}/dataset_report was called using the phylum taxon names of Nitrospinota and Nitrospirota.
All the metadata results from the NCBI API endpoints were combined into one file with duplicates removed.
Created a python Jupyter notebook to run after attaining NCBI accessions and IMG genome ids and this retrieves and adds 8 extra metadata columns to the fixed submitted accessions dataset.
This notebook performed a series of joins on IMG genome ids, NCBI BioSamples and NCBI genome assembly accessions.
The first step was joining the IMG genome ids and its associated NCBI metadata on to the starting accessions dataset to create a new table.
The next steps were to perform a series of joins onto this new table.
Joined this new table with a table of NCBI accession values, which came from 3 NCBI API calls as discussed above, on the NCBI genome assembly values to create a new table.
A column containing NCBI BioSample values was created from this join. This column contains 15 BioSample values that are different than those in the original submitted accessions table.
A new column named ‘Fixed_BioSample’ was created by merging NCBI BioSample values found with the IMG query and NCBI BioSample values found with NCBI API calls.
Using the BioSample values in the ‘Fixed_BioSample’ column created by the Jupyter notebook, the NCBI API was called again and the new metadata values found for these corrected BioSample values was used in the final processed dataset. This retrieved metadata was used over any previously attained metadata. Any metadata values not retrieved using the ‘Fixed_BioSample’ column values are filled in with the original metadata values.
Renamed columns in the table with new metadata, 'Fixed_BioSample' was renamed 'Corrected_BioSample' and this table was saved to a file named “updated_pi_supplemental_table.csv”.
3. Processed the file “updated_pi_supplemental_table.csv” with the BCO-DMO dataset processing tool named laminar
Renamed columns to conform with the BCO-DMO parameter naming convention of replacing spaces with underscores, and starting parameter names with an alphabetical character. Renamed the parameter ““# predicted genes” to “num_predicted_genes”.
Renamed the parameter ncbi_taxon_id to NCBI_organism_taxid.
Converted UTC date parameters with the format %Y-%m-%dT%H:%M:%S.%f to the ISO UTC datetime format of %Y-%m-%dT%H:%M:%SZ.
The organism taxon id is an integer, so removed the trailing decimal 0 to convert it into an integer.
Reordered the parameters so that genome metadata is towards the end of the dataset table and accessions metadata is towards the start.
Saved this updated table as the final dataset named 933610_v1_nitrospirota_and_nitrospinota_genomes.csv
Parameter | Description | Units |
ID | Name of sequence data | unitless |
IMG_genome_id | IMG genome assembly ID (Integrated Microbial Genomes and Metagenomes) | unitless |
GenBank_assembly | NCBI genome assembly accession (National Center for Biotechnology Informatio) | unitless |
Sample | Submitter's original NCBI BioSample accession (National Center for Biotechnology Information) | unitless |
Corrected_BioSample | NCBI BioSample accession retrieved with NCBI API call using NCBI Genome assembly accession (National Center for Biotechnology Information) | unitless |
BioProject | NCBI BioProject accession (National Center for Biotechnology Information) | unitless |
release_date | Release date for genome to be publicly accessible | unitless |
last_updated_date | Last date BioSample metadata was updated | unitless |
publication_date | Date that genome assembly was submitted to either IMG (Integrated Microbial Genomes and Metagenomes) or NCBI (National Center for Biotechnology Information) | unitless |
Domain | Taxonomic phylogenetic domain of the sequence data | unitless |
Phylum | Taxonomic phylogenetic phylum of the sequence data | unitless |
Class | Taxonomic phylogenetic class of the sequence data | unitless |
Order | Taxonomic phylogenetic order of the sequence data | unitless |
Family | Taxonomic phylogenetic family of the sequence data | unitless |
Genus | Taxonomic phylogenetic genus of the sequence data. Blank if could not be identified. | unitless |
Species | Taxonomic phylogenetic species of the sequence data. Blank if could not be identified. | unitless |
NCBI_organism_taxid | NCBI taxon id of organism | unitless |
Isolation_Source | descriptive category of environment where data originated from | unitless |
IsolationPlot | Grouping of Isolation Source into predefined categories for plotting | unitless |
Location | Descriptive location where sequence data originated | unitless |
Coordinates | Downloaded latitude and longitude data from NCBI | degrees |
Latitude | latitude of sample, south is negative | decimal degrees |
Longitude | longitude of sample, west is negative | decimal degrees |
Completeness | Estimated genome completeness | unitless |
Contamination | Estimated contamination of genome data | unitless |
Genome_Size_bp | Length of genome in basepairs | basepairs |
Estimated_Genome_Length | basepairs length of contig | unitless |
num_predicted_genes | The number of predicted genes in the contig | unitless |
GC | percentage of GC basepairs out of all basepairs | unitless |
Coding_density | percent of contig with genomic information in a known gene | unitless |
NSF Award Abstract:
The marine deep biosphere is the habitat for life existing under the sea floor. The zone has remarkably low energy sources creating a paradox of how life can persist there. Resolving this energy paradox is a grand challenge in deep biosphere research. The Juan de Fuca Ridge flank off the coast of Washington, USA, is an accessible, low energy environment making it an attractive location for addressing this challenge. A series of experiments will be conducted on the seafloor at the Juan de Fuca Ridge flank, using established subseafloor observatories that access the crustal deep biosphere, to provide the first direct in situ measurement of microbial activity in the crustal subsurface. This project will provide essential information about the ability of life to survive under conditions that we are not able to replicate in the laboratory, but that are increasingly important for understanding microbial community interaction in the environment. This information can then be used in models of global microbial activity for estimating the impact of this biosphere on elemental cycling, transforming our understanding of microbial processes within this vast subseafloor habitat. To communicate these discoveries to the public, the project will include a ship-to-shore outreach program during the cruise. In addition public lectures will be presented, and an interactive display of deep-sea video footage will be set up for the annual public Open House at the Bigelow Laboratory for Ocean Sciences in Maine. Diverse undergraduate students and a postdoctoral researcher will be recruited to participate in the research and public outreach activities.
This project proposes to leverage existing subsurface infrastructure on the eastern flank of the Juan de Fuca Ridge with advances in single-cell based molecular and geochemical approaches to make fundamental new discoveries about the activity of life in the deep crustal biosphere. During a two-week research cruise, the research team will incubate crustal fluids in situ and in the laboratory with labeled substrates for tracking single-cell activity, coupled with radioisotope tracer activity and potentiostat measurements, with the objective of determining in situ and potential rates of activity and cellular physiology. The research will also identify which metabolisms active microorganisms utilize under in situ and laboratory conditions, the rates of these processes, and the microorganisms involved. The results are expected to provide explicit hypothesis testing of microbial activity and in situ microbial growth rates from the crustal deep biosphere to transform understanding of microbial activity in the crustal deep biosphere and generate critical information about the ability of life to survive under low energy conditions.
Funding Source | Award |
---|---|
NSF Division of Ocean Sciences (NSF OCE) |