SRA-toolkit Tutorial
What is SRA-toolkit?
The SRA-toolkit is a set of utilities to download and process sequencing data from the NCBI Sequence Read Archive (SRA) at scale. The SRA is a primary repository for high-throughput sequencing data hosted by NIH and is part of the International Nucleotide Sequence Database Collaboration (INSDC).
Downloading the SRA-toolkit
Download the SRA-toolkit (version 3.0.0 is used throughout this tutorial; newer releases may be available) using the following command:
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
Extracting the Toolkit
Extract the downloaded tar.gz file:
tar -zxvf sratoolkit.3.0.0-ubuntu64.tar.gz
Configuring the SRA-toolkit
Add the SRA-toolkit binaries to your PATH, replacing the directory below with the location where you extracted the archive:
export PATH=$PATH:/home/Abbas/tools/sratoolkit.3.0.0-ubuntu64/bin
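A slightly more defensive version of this step might look like the sketch below. It assumes the archive was extracted under $HOME/tools, and SRA_HOME is a hypothetical variable name introduced only for this example; adjust both to your setup. The snippet appends the bin directory only if it is not already on PATH, then checks whether the shell can find prefetch.

```shell
# Assumption: the toolkit was extracted under $HOME/tools; change SRA_HOME if not.
SRA_HOME="${SRA_HOME:-$HOME/tools/sratoolkit.3.0.0-ubuntu64}"

# Append the bin directory to PATH only if it is not already present.
case ":$PATH:" in
  *":$SRA_HOME/bin:"*) ;;
  *) export PATH="$PATH:$SRA_HOME/bin" ;;
esac

# Check that the shell can now locate one of the toolkit binaries.
if command -v prefetch >/dev/null 2>&1; then
  echo "prefetch found at: $(command -v prefetch)"
else
  echo "prefetch not found; check SRA_HOME=$SRA_HOME" >&2
fi
```

An export typed at the prompt lasts only for the current shell session. To keep it across sessions, the same export line can be added to a shell startup file such as ~/.bashrc.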
Configure the SRA-toolkit to access public cloud data:
vdb-config -i
Follow the on-screen instructions to configure the toolkit. For more details, visit the SRA-toolkit configuration guide.
Downloading Sequencing Data
To download a single SRA file:
prefetch SRR19850882
To download multiple SRA files:
prefetch SRR19850882 SRR19850883 SRR19850884
You can also provide a text file containing SRR numbers to download multiple files.
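As a rough sketch of the text-file approach: prefetch accepts an --option-file argument that points to a file with one accession per line. The file name SraAccList.txt is only an example, and it is worth confirming the option with prefetch --help for your installed version.

```shell
# Hypothetical file name; one SRR accession per line.
ACC_LIST="SraAccList.txt"
printf '%s\n' SRR19850882 SRR19850883 SRR19850884 > "$ACC_LIST"

# Download every run listed in the file (skipped if prefetch is not installed).
if command -v prefetch >/dev/null 2>&1; then
  prefetch --option-file "$ACC_LIST"
else
  echo "prefetch not found on PATH; nothing downloaded" >&2
fi
```

Keeping the accessions in a file makes the download easy to repeat and to share alongside an analysis.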
Converting SRA Files to FASTQ Format
Use fastq-dump or fasterq-dump to convert SRA files to FASTQ format:
fastq-dump SRR19850882 SRR19850883 SRR19850884
For paired-end sequencing data, use the --split-files option:
fastq-dump --split-files SRR19850882 SRR19850883 SRR19850884
Alternatively, use fasterq-dump for faster processing:
fasterq-dump SRR19850882 SRR19850883
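A batch conversion could be sketched as follows. The output directory name (fastq) and thread count (4) are arbitrary choices for illustration, not recommendations from this tutorial. The loop converts one run at a time with --split-files, writes to a separate directory with --outdir, and compresses the result with gzip, since fasterq-dump writes uncompressed FASTQ. Check fasterq-dump --help to confirm these options for your version.

```shell
# Assumed values for illustration; adjust to your machine and project layout.
OUT_DIR="fastq"
THREADS=4
mkdir -p "$OUT_DIR"

if command -v fasterq-dump >/dev/null 2>&1; then
  for acc in SRR19850882 SRR19850883; do
    # Convert one run, then compress its FASTQ files only if conversion succeeded.
    fasterq-dump --split-files --threads "$THREADS" --outdir "$OUT_DIR" "$acc" \
      && gzip "$OUT_DIR/$acc"*.fastq
  done
else
  echo "fasterq-dump not found on PATH; nothing converted" >&2
fi
```

Processing runs one at a time keeps temporary disk usage bounded, which matters because fasterq-dump needs scratch space in addition to the final output.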
Additional Resources
For more advanced usage, check out the parallel-fastq-dump tool.