Using Bioinformatics Tools (BLAST, BWA, etc.)

Audience: Faculty, Researchers, and Staff
This KB Article References: High Performance Computing
Last Updated: December 13, 2023

A variety of widely used tools for bioinformatics analysis are available on SeaWulf.

Some tools have their own module that can be loaded (e.g., blast+/2.10.0).  However, because many bioinformatics tools are intended to be used together, we have created several Anaconda environments that each provide multiple tools for a particular type of analysis.  We have also created modules to activate each of these environments for convenience.
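For example, you can search for and load the standalone BLAST module as follows (the version shown matches the example above, but run module avail to confirm what is currently installed):

module avail blast
module load blast+/2.10.0
blastn -version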

The following table shows a list and description for each module:

Module name              Module description                                     Example software
-----------              ------------------                                     ----------------
bioconvert/1.0.0         Format conversion tool for life science data           bioconvert, biopython, sambamba
diffexp/1.0              RNA-Seq & differential gene expression                 Salmon, DESeq2, Trinity
genome_assembly/1.0      Genome assembly & evaluation                           Flye, SPAdes, BUSCO
genome_annotation/1.0    Functional annotation of genome assemblies             RepeatMasker, Augustus, Maker
GWAS/1.0                 Genome-wide association software                       Plink, GEMMA
hts/1.0                  Standard high-throughput sequencing software           Samtools, bwa, bowtie2
metagenomics/1.0         Metagenomic classification, assembly, and analysis     Kraken2, Megahit
popgen/1.0               Variant calling & population genomics                  GATK, Plink, admixture
phylo/1.0                Multiple sequence alignment & phylogenetic inference   MAFFT, IQ-TREE2, raxml-ng
singlecell/1.0           Single-cell sequencing analysis                        Seurat, scanpy, scvi-tools
structural-variant/1.0   Structural variant calling and genotyping              TIDDIT, lumpy, manta


Once one of the above modules is loaded, the executables for the installed programs will be available on your PATH.

For example, here is how to access the samtools program provided by the hts/1.0 module:

module load hts/1.0
samtools --help

 
To see the full list of software installed in a given conda module, run the following after loading it:

conda list
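To check whether a particular program is included in the environment, you can filter that list, for example:

conda list | grep -i samtools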

 

For further instructions on how to run individual programs, please consult the relevant online documentation.

 

When running a program, you will need to execute it as part of a SLURM batch script.  An example SLURM script for running a Trinity RNA-seq assembly can be found at:

/gpfs/projects/samples/Trinity
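In addition to that sample, the following minimal sketch shows the general shape of such a script; the partition name, resource requests, and input file names are placeholders to adapt to your data:

#!/usr/bin/env bash
#SBATCH --job-name=trinity_assembly
#SBATCH --partition=long-40core    # placeholder; pick an appropriate SeaWulf partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --time=24:00:00

# Load the conda module that provides Trinity
module load diffexp/1.0

# Assemble paired-end reads (reads_1.fq and reads_2.fq are placeholder inputs)
Trinity --seqType fq \
        --left reads_1.fq --right reads_2.fq \
        --CPU 40 --max_memory 100G \
        --output trinity_out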

If there are missing or outdated packages in any of the above modules, or if you have suggestions for a new conda bioinformatics module, please submit a ticket.

 

Bioinformatics Workflow Managers

Users may find it convenient (or even necessary) to use a workflow management system to handle analyses involving anything more than a small number of inputs (e.g., multiple samples to be processed in parallel).  SeaWulf currently has two different bioinformatics workflow systems available: Snakemake and Nextflow.

Snakemake

Snakemake is a Python-based workflow manager with flexible scripting options.  Typically, users create an input file called a "Snakefile" that defines the input files the bioinformatics pipeline will run on, along with one or more "rules" that define the steps the workflow will take.  Snakemake also comes with a large number of wrappers that simplify management of standard bioinformatics software and make it easy to call these programs in your Snakemake rules.
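As a minimal illustration (the file layout and sorting rule here are hypothetical), a Snakefile combining an input definition with a single rule might look like this:

# Target rule: request sorted BAM files for two hypothetical samples
rule all:
    input:
        expand("sorted/{sample}.bam", sample=["sampleA", "sampleB"])

# Sort each mapped BAM file with samtools
rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    shell:
        "samtools sort -o {output} {input}"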

Once the pipeline has been defined in the Snakefile, it can be run with the following:

snakemake -s Snakefile <... additional flags ...>
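For example, a simple local run using eight CPU cores (recent Snakemake versions require the --cores flag):

snakemake -s Snakefile --cores 8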

Please see the Snakemake documentation for more information on setting up and executing the pipeline.

A recent version of Snakemake is installed in each of the above bioinformatics conda environments, so no additional modules need to be loaded to use Snakemake.
 

Nextflow

Nextflow is a workflow manager written in a language called Groovy.  It requires a relatively recent version of Java, so please load the following modules to access Nextflow:

module load openjdk/latest
module load nextflow/latest

These modules will be updated periodically to ensure that the Java and Nextflow versions available are current.
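To confirm that Nextflow is available after loading the modules, you can print its version:

nextflow -version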

 

While you are welcome to build your own Nextflow pipelines, you may wish to instead use one of the many community-curated pipelines available via the nf-core project.  nf-core provides reproducible, best-practice pipelines with detailed reporting for a variety of typical bioinformatics use cases. To see a list of available nf-core pipelines, please visit the nf-core website.

Nextflow and nf-core handle software dependencies using Conda, Docker (not available on SeaWulf), or Singularity.  We recommend Singularity, as it is available without loading any modules and greatly simplifies dependency installation. When using Singularity for software management, nf-core will download several container images, which can consume a large amount of storage. To avoid running out of space in your home directory, we recommend setting the following environment variables before running nf-core pipelines:
 

export SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
export NXF_SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity

 

This will force Singularity to save container images in your scratch directory, where you should have plenty of space.
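If the cache directory does not exist yet, you can create it yourself before launching a pipeline:

mkdir -p /gpfs/scratch/$USER/singularity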

The SeaWulf HPC team has created a custom nf-core configuration file for SeaWulf that allows Nextflow to automatically submit jobs to the 40-core and 96-core partitions and uses Singularity for software management. Because of this, nf-core pipeline jobs can be launched on the login node, and Nextflow will handle submitting jobs for each step in the analysis.  However, the Nextflow process needs to keep running for the duration of the pipeline for this to work successfully.

The following are a set of recommended steps for running an nf-core pipeline with Nextflow:

1. SSH to a Milan login node:

ssh <netid>@milan.seawulf.stonybrook.edu

2. Start a tmux session so the Nextflow process keeps running even after you log out of SeaWulf (see the tmux documentation for details):

tmux

3. Load the openjdk and nextflow modules:

module load openjdk/latest
module load nextflow/latest

4. Set the Singularity cache directory environment variables:

export SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
export NXF_SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity

5. Run the nf-core/rnaseq workflow using the seawulf and test profiles (the test profile just runs a short job with a small amount of test data):
 

nextflow run nf-core/rnaseq -profile seawulf,test --outdir test_out

 

Nextflow will submit a series of jobs to the 40-core or 96-core partitions, and results will be saved in the "test_out" directory that was specified in the command above.
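Two follow-up tips: you can safely detach from the tmux session (Ctrl-b, then d) while the pipeline runs and reattach after logging back in, and if a pipeline is interrupted you can relaunch it with Nextflow's -resume flag to reuse cached results rather than starting over:

tmux attach

nextflow run nf-core/rnaseq -profile seawulf,test --outdir test_out -resume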

For more information, please see the Nextflow and nf-core documentation.

 
