Using BWA to Align Sequenced Reads to a Reference Genome
BWA maps DNA sequences against a large reference genome, such as a human or plant genome. It provides three algorithms: BWA-backtrack, BWA-SW, and BWA-MEM. This guide uses BWA-MEM.
You can download BWA and Picard tools from the following links:
Create three directories to organize the necessary files:
mkdir Ref_genome # for reference genome
mkdir FastqFiles # for raw fastq files
mkdir BamFiles # for BWA output files
To prepare the reference genome for alignment, you need to index it. Run the following command, passing the reference FASTA file (here the same file that the alignment step below uses):
bwa index Ref_genome/genome.Garb.CRI.fa
After running this command, several index files (.amb, .ann, .bwt, .pac, and .sa) will be created in the Ref_genome directory.
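Before starting a long alignment run, it can help to confirm that the index is complete. The following is a minimal sketch, assuming the classic BWA index layout of five files that share the FASTA path as their prefix; the function name bwa_index_complete is a hypothetical helper, not part of BWA.

```shell
# Hypothetical helper: succeed only if all five BWA index files
# exist and are non-empty for the given prefix.
# By default the prefix is the path of the FASTA file that was indexed.
bwa_index_complete() {
    local prefix=$1 ext
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing index file: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    return 0
}

# Example use (commented out so the block has no side effects):
# bwa_index_complete Ref_genome/genome.Garb.CRI.fa || bwa index Ref_genome/genome.Garb.CRI.fa
```

The check only looks at file presence and size, so it catches an interrupted or skipped indexing step but does not validate the index contents.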
If you have multiple paired-end read files named like:
SRR1.1.fastq.gz SRR1.2.fastq.gz SRR2.1.fastq.gz SRR2.2.fastq.gz ...
Use the following loop to align them using BWA-MEM:
for INDEX in 1 2 3 4;
do
bwa mem -M -t 8 -R "@RG\tID:COL_${INDEX}\tSM:COL_${INDEX}" Ref_genome/genome.Garb.CRI.fa \
FastqFiles/SRR${INDEX}.1.fastq.gz \
FastqFiles/SRR${INDEX}.2.fastq.gz \
> BamFiles/SRR${INDEX}.sam
done
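The loop above hard-codes the sample numbers. As an alternative sketch, the sample names can be discovered from the files themselves, assuming the naming pattern shown earlier (sample.1.fastq.gz and sample.2.fastq.gz in one directory). The function list_paired_samples is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: print one sample name per line for every
# complete pair <dir>/<sample>.1.fastq.gz + <dir>/<sample>.2.fastq.gz.
list_paired_samples() {
    local dir=$1 r1 sample
    for r1 in "$dir"/*.1.fastq.gz; do
        [ -e "$r1" ] || continue                  # the glob matched nothing
        sample=$(basename "$r1" .1.fastq.gz)
        if [ -e "${dir}/${sample}.2.fastq.gz" ]; then
            echo "$sample"
        else
            # Report unpaired files on stderr and leave them out.
            echo "skipping ${sample}: mate file not found" >&2
        fi
    done
}

# Example use (commented out; requires bwa and an indexed reference):
# list_paired_samples FastqFiles | while read -r s; do
#     bwa mem -M -t 8 Ref_genome/genome.Garb.CRI.fa \
#         "FastqFiles/${s}.1.fastq.gz" "FastqFiles/${s}.2.fastq.gz" \
#         > "BamFiles/${s}.sam"
# done
```

Skipping samples whose mate file is missing avoids passing a single file of a pair to the aligner by accident.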
The @RG field, set with the -R option, defines the read group: a collection of reads from a single sequencing run. Its ID tag identifies the read group and its SM tag names the sample, which lets downstream tools tell samples and sequencing runs apart across experiments. Tools like GATK require it to account for variability across sequencing runs.
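The read group string can also be built in one place so that every sample gets the same set of tags. This is a minimal sketch: the function read_group is hypothetical, and the LB (library) and PL (platform) tags are an assumption added here, since the loop above sets only ID and SM. Both tags are defined in the SAM specification, but whether your downstream tools need them, and which values are right for your data, should be checked for your own pipeline.

```shell
# Hypothetical helper: print a read group string for bwa mem -R.
# The output contains literal backslash-t sequences, which is the
# form used in the alignment loop above.
# Arguments: sample name, optional platform (placeholder default: ILLUMINA).
read_group() {
    local sample=$1 platform=${2:-ILLUMINA}
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:%s' \
        "$sample" "$sample" "$sample" "$platform"
}

# Example use (commented out; requires bwa):
# bwa mem -M -t 8 -R "$(read_group COL_1)" Ref_genome/genome.Garb.CRI.fa ...
```

Reusing the sample name for ID, SM, and LB is only a placeholder choice that suits one library and one run per sample.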
After BWA alignment, you will get SAM files. To sort them by coordinate and convert them into BAM files, which are more compact and efficient for downstream processing, run the following command:
for INDEX in {1..4};
do
picard SortSam \
I=BamFiles/SRR${INDEX}.sam \
O=BamFiles/SRR${INDEX}.sorted.bam \
SORT_ORDER=coordinate \
CREATE_INDEX=true
done
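Next, create an index for the BAM files so that downstream programs can quickly access their contents. Because CREATE_INDEX=true in the previous step already writes an index next to each sorted BAM, this step is only needed if you omitted that option: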
for INDEX in {1..4}
do
picard BuildBamIndex \
I=BamFiles/SRR${INDEX}.sorted.bam
done
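To confirm that every sorted BAM has an index before moving on, a small check like the following can be used. It is a sketch under one assumption about file naming: Picard names the index by replacing the .bam extension with .bai, while samtools index appends .bai to the full file name, so both forms are accepted. The function bam_has_index is a hypothetical helper.

```shell
# Hypothetical helper: succeed if a non-empty index exists for the BAM,
# in either the Picard style (x.bai) or the samtools style (x.bam.bai).
bam_has_index() {
    local bam=$1
    [ -s "${bam%.bam}.bai" ] || [ -s "${bam}.bai" ]
}

# Example use (commented out so the block has no side effects):
# for bam in BamFiles/*.sorted.bam; do
#     bam_has_index "$bam" || echo "no index for $bam" >&2
# done
```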
For more information on BWA, visit the official manual:
Primary reference:
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi: 10.1093/bioinformatics/btp324. Epub 2009 May 18. PMID: 19451168; PMCID: PMC2705234.