SRA-toolkit Tutorial

What is SRA-toolkit?

The SRA-toolkit is a set of utilities to download and process sequencing data from the NCBI Sequence Read Archive (SRA) at scale. The SRA is a primary repository for high-throughput sequencing data hosted by NIH and is part of the International Nucleotide Sequence Database Collaboration (INSDC).

Downloading the SRA-toolkit

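Download the SRA-toolkit using the following command. This tutorial uses version 3.0.0 for 64-bit Ubuntu; newer releases may be available, in which case adjust the version number in the URL and in the commands that follow: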

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz

Extracting the Toolkit

Extract the downloaded tar.gz file:

tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz

Configuring the SRA-toolkit

Add the SRA-toolkit binaries to your PATH, replacing the example directory below with the location where you extracted the archive:

export PATH=$PATH:/home/Abbas/tools/sratoolkit.3.0.0-ubuntu64/bin
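To confirm that the PATH change took effect, a small helper along these lines can report which programs the shell cannot find. This is only a sketch: check_tools is a hypothetical function name, and the list of programs is simply the set of tools used in this tutorial.

```shell
# Report which of the given commands cannot be found on PATH.
# Prints one "missing: NAME" line per absent command and returns
# non-zero if any command is absent.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the SRA-toolkit programs used in this tutorial.
check_tools prefetch fastq-dump fasterq-dump vdb-config \
    || echo "Check that the toolkit's bin directory is on your PATH."
```

Note that export only affects the current shell session; to make the change permanent, the export line is typically added to a shell startup file such as ~/.bashrc.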

Configure the SRA-toolkit to access public cloud data:

vdb-config -i

Follow the on-screen instructions to configure the toolkit. For more details, visit the SRA-toolkit configuration guide.

Downloading Sequencing Data

To download a single SRA file:

prefetch SRR19850882

To download multiple SRA files:

prefetch SRR19850882 SRR19850883 SRR19850884

You can also pass prefetch a text file that lists one SRR accession per line, using its --option-file option.
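The list-file approach could look roughly like the sketch below. The file name accessions.txt is an arbitrary choice for illustration, and the guard around prefetch is only there so the snippet does not fail on a machine where the toolkit is not yet installed.

```shell
# Write one accession per line to a list file.
# (accessions.txt is a hypothetical name; any path works.)
cat > accessions.txt <<'EOF'
SRR19850882
SRR19850883
SRR19850884
EOF

# Hand the list to prefetch via --option-file.
if command -v prefetch >/dev/null 2>&1; then
    prefetch --option-file accessions.txt
else
    echo "prefetch not found on PATH; $(grep -c . accessions.txt) accessions listed"
fi
```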

Converting SRA Files to FASTQ Format

Use fastq-dump or fasterq-dump to convert SRA files to FASTQ format:

fastq-dump SRR19850882 SRR19850883 SRR19850884

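For paired-end sequencing data, use the --split-files option, which writes the two mates of each run to separate files (for example SRR19850882_1.fastq and SRR19850882_2.fastq):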

fastq-dump --split-files SRR19850882 SRR19850883 SRR19850884
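After a paired-end conversion it can be worth checking that both mate files exist and contain the same number of reads. The helper below is a rough sketch, not part of the toolkit: check_pair is a hypothetical name, and it assumes uncompressed, four-lines-per-record FASTQ files named ACCESSION_1.fastq and ACCESSION_2.fastq in the current directory.

```shell
# Check that the two mate files written by --split-files exist and hold
# the same number of reads (assumes four lines per FASTQ record).
# Usage: check_pair SRR19850882
check_pair() {
    acc=$1
    r1="${acc}_1.fastq"
    r2="${acc}_2.fastq"
    if [ ! -f "$r1" ] || [ ! -f "$r2" ]; then
        echo "$acc: mate file missing"
        return 1
    fi
    n1=$(( $(wc -l < "$r1") / 4 ))
    n2=$(( $(wc -l < "$r2") / 4 ))
    if [ "$n1" -eq "$n2" ]; then
        echo "$acc: $n1 read pairs"
    else
        echo "$acc: mismatch ($n1 vs $n2 reads)"
        return 1
    fi
}
```

A mismatch does not always indicate a failed download; some runs legitimately contain unpaired reads, so treat the result as a prompt to inspect the files rather than as a definitive error.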

Alternatively, use fasterq-dump, which is multi-threaded and usually faster:

fasterq-dump SRR19850882 SRR19850883
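When converting many runs, it can help to wrap the call in a loop so that the thread count and output directory are set in one place. The following is a minimal sketch under stated assumptions: convert_all, DRY_RUN, OUTDIR and THREADS are hypothetical names introduced here, and the defaults (4 threads, a directory called fastq) are illustrative rather than recommended values. It uses the --threads and --outdir options of fasterq-dump.

```shell
# Convert each accession to FASTQ with one fasterq-dump call per accession.
# Set DRY_RUN=1 to print the commands instead of running them.
# OUTDIR and THREADS are illustrative settings, not toolkit requirements.
convert_all() {
    outdir=${OUTDIR:-fastq}
    threads=${THREADS:-4}
    for acc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fasterq-dump --threads $threads --outdir $outdir $acc"
        else
            fasterq-dump --threads "$threads" --outdir "$outdir" "$acc" || return 1
        fi
    done
}

# Preview the commands without running them.
DRY_RUN=1
convert_all SRR19850882 SRR19850883
```

Setting DRY_RUN=0 (or leaving it unset) runs the conversions for real and stops at the first accession that fails.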

Additional Resources

For more advanced usage, check out parallel-fastq-dump, a third-party wrapper that runs several fastq-dump processes in parallel.