file = "/path/to/yourfile.txt"
while IFS= read -r line
do
echo "$line"
echo "${line:8:5}" # grab chars starting at 8th position(0 index) for 5 chars
done <"$file"
1 Gluing Together Pipelines
This chapter is named as such because it details the processing pipelines for our NGS data and the command line, i.e. the glue, that connects the different parts together.
As of October 2023, we are starting to move ahead with switching to Nextflow and nf-core's processing pipelines, which will take care of most basic processing steps such as mapping, so I will not describe those steps in much detail.
1.1 Working with High-Performance Computing (HPC)
Northwestern’s Quest User Guide is a great place to start with learning how HPCs work. However, with our own server at Emory, things are much simpler.
Cancel a lot of pending jobs:
scancel --state PENDING --user vkp2256
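To first see which of your jobs are pending or running (the username is the same example as above):
squeue -u vkp2256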
1.2 Environment Reproducibility
Many packages are now distributed via conda, so those will be self-contained. I highly recommend using mamba to speed up installations.
For R, while one can use a conda environment, I've found it hard to set up, so try to use only one version of R in a project to avoid dependency issues, and run sessionInfo() to record the package versions that were installed.
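A minimal sketch of the conda/mamba workflow (the environment name and packages are just examples):
# create a self-contained environment with the tools a project needs
mamba create -n align-env -c bioconda -c conda-forge samtools bowtie2
mamba activate align-env
# export the exact versions so the environment can be rebuilt later
mamba env export > align-env.yml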
1.3 Nextflow
Nextflow is a pipeline construction language. nf-core is a community-contributed collection of best-practice bioinformatics pipelines. If you want to do basic, routine analysis, using nf-core's ready-made pipelines is best, but because of their large code base and complexity, they might be hard to debug. You may want to start your own Nextflow pipeline if you find yourself doing similar analyses many times.
Use nextflow pull nf-core/rnaseq
to update pipelines. Without pulling, nextflow run will keep using the locally cached version if one is available, even if the pipeline has been updated upstream since.
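A typical invocation after pulling might look like this (the release tag, profile, and file names are illustrative; check the pipeline's documentation for the parameters you need):
nextflow run nf-core/rnaseq -r 3.14.0 -profile singularity --input samplesheet.csv --outdir results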
1.4 The Command Line
It is essential to master the basics of the command line. (Buffalo 2015, chap. 8) is a great place to get started. I found learning about path expansion, awk, and vim extremely useful. The Missing Semester is also a great resource.
Give your executable bash scripts a help page by creating a Help facility, as sketched below.
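A minimal sketch of such a Help facility (the script name and usage text are placeholders):
#!/bin/bash
Help() {
   echo "Usage: myscript.sh [-h] input_file"
   echo "options:"
   echo "h     Print this help and exit."
}
while getopts "h" option; do
   case $option in
      h) Help; exit 0 ;;
      \?) Help; exit 1 ;;
   esac
done
shift $((OPTIND-1))
echo "Running on ${1:-no input given}"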
While I think going through these well-documented resources is better than anything I could share, I’ll document the things I found really useful here.
Some of these might not be accurate; test carefully before using them (as you should do with any random piece of code you find).
1.4.1 One-liners
There are many great bioinformatics one-liner compilations such as this one. I suggest you keep a list of your own that you find yourself re-using often.
1.4.2 File manipulation
Read in files line by line and perform an action on each.
file="/path/to/yourfile.txt"
while IFS= read -r line
do
echo "$line"
echo "${line:8:5}" # grab 5 chars starting at position 8 (0-indexed)
done < "$file"
If you want to refer to columns in your file, let read split each line into named variables:
# Loop through each line in the mapping file and rename the files
while IFS=' ' read -r old_name new_name; do
# Rename the files
mv "$old_name" "$new_name"
done < "$mapping_file"
Checking if a directory already exists:
if [ ! -d "$dir" ]; then
mkdir "$dir"
fi
# equivalently, mkdir -p "$dir" creates it only if missing (and any missing parents)
For loops: there are many ways of specifying what to loop through. Each of the for lines below is an alternative header for the same loop:
#!/bin/bash
for i in m{1..3} m{13..15} # brace expansion
for i in 1 2 3 4 5
for i in file1 file2
do
echo "Welcome $i times"
j="$(basename "${i}")" # get the file name
k="$(basename "${i}" | sed 's/_[^_]*$//')" # file name without the last _-separated chunk (and whatever follows it)
l="$(dirname "${i}")" # the directory part, without the file name
echo raw/JYu-${i}_*_L001* # filename expansion
done
For more complex naming schemes, you may use an array.
# Specify the last two digits of the numbers
numbers=("59" "82" "23" "45" "77")
# Iterate over the array
for num in "${numbers[@]}"; do
echo "SRR13105$num"
done
# Iterate over indexes instead
for ((index=0; index<${#numbers[@]}; index++))
do
num="SRR13105${numbers[$index]}"
echo "$num"
done
See this for more examples and use cases.
1.4.3 Unix Text Processing: Awk and Sed
Awk
Awk statements are written as pattern { action }. If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern.
awk '{if ($3 == "sth") print $0;}' file.tsv # print rows whose 3rd column equals "sth"
awk -F "\t" '{print NF; exit}' file.tsv # prints number of fields/columns
awk -F "\t" 'BEGIN{OFS="\t";} {print $1,$2,$3,"+"}' input.bed > output.bed # indicate that the output should be tab-delimited
color="$(awk 'NR==val{print; exit}' val=$line color_list_3_chars.txt)" # pick the line whose number is $line and assign it to a variable
Sed
Sed statements are most often written as the substitute command: s/pattern/replacement/. You can give several rules with repeated -e flags:
sed -e 'some rule' -e 'another rule'
The trailing /g means global replace, i.e. find all occurrences of foo on each line and replace them with bar; if you remove the /g, only the first occurrence on each line is changed. Chaining rules is also possible: sed -e 's/foo/bar/g; 5q' substitutes and prints only the first 5 lines (q quits after line 5).
1.4.4 Miscellaneous
- Inside a bash script, you can use $0 for the name of the script, $1 for the first argument, and so on. $? shows the exit code of the last command (0 if everything went well), and $# is the number of arguments.
- shuf -n 4 example_data.csv prints 4 random lines.
- nohup command & runs command in the background (and keeps it running after you log out); fg brings it back to the foreground.
- https://www.shellcheck.net/ is a website that checks your shell scripts for errors and bugs.
- paste -sd, concatenates lines into a single line (-s), delimited by a comma.
1.4.4.1 Working with lines
count number of occurrences of unique instances
sort | uniq -c
output the number of duplicated lines, if there are any (uniq only detects adjacent duplicates, so sort first if the file is unsorted)
uniq -d mm_gene_names.txt | wc -l
count the number of reads in a fastq file (assuming well-formatted 4-line records, which fastq files usually are)
cat file.fq | echo $((`wc -l`/4))
print only the number of lines in a file and no filenames
wc -l m*/m*.narrowPeak | cut -f 4 -d' '
to get rid of x lines at the beginning of a file
tail -n +2 file.txt # starts from the 2nd line, i.e. removing the first line
to see the first 2 lines and the last 2 lines of a file
(head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed
1.4.4.2 Working with columns
cut -f 2 Mus_musculus.GRCm38.75_chr1.bed # only the second column
cut -f 3-8 Mus_musculus.GRCm38.75_chr1.bed # range of columns
cut -f 3,6,7,8 Mus_musculus.GRCm38.75_chr1.bed # sets of columns, cannot be reordered
cut -d, -f2 some_file.csv # use comma as delimiter
grep -v "^#" some_file.gtf | cut -f 1-8 | column -t | head -n 3 #prettify output
remove header lines and count the number of columns
grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'
1.4.4.3 Working with strings
to remove everything after (and including) the first dot; useful for getting the sample name from filename.fastq.gz
${i%%.*}
to remove everything after (and including) the last dot; useful for stripping the final extension from filename.param.fastq
${i%.*}
to remove everything up to and including the last /
${i##*/}
to uppercase the first character (${var^^} uppercases the whole string)
${var^}
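A quick worked example of these expansions (the path here is made up):
i="raw/sample1.param.fastq.gz"
echo "${i%%.*}" # raw/sample1
echo "${i%.*}" # raw/sample1.param.fastq
echo "${i##*/}" # sample1.param.fastq.gz
var="abc"; echo "${var^^}" # ABC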
1.4.4.4 Directories
to get human readable size of folders under the parent folder
du -h --max-depth=1 parent_folder_name
-s means summary (print only one total per argument)
du -sh /full/path/directory
print each file in new line
ls -1
check which directory takes the most space (-d 1 means one level down)
du -d 1 | sort -n
1.4.4.5 Alignment files
Removing EBV chromosomes for viewing on the UCSC browser (the BAM must be indexed, since idxstats and region queries use the index):
samtools idxstats m${i}_sorted.bam | cut -f 1 | grep -v 'chrEBV' | xargs samtools view -h m${i}_sorted.bam | grep -v 'chrEBV' | samtools view -b -o m${i}_filtered.bam -
Remove reads not mapping to chr1-22, X, or Y; note this does not remove them from the header. The first sed expression removes the leading whitespace that echo leaves between the brace expansions, and the second adds "chr" at the beginning of each name.
samtools view -o mr392_filtered1.bam mr392_sorted.bam `echo -n {{1..22},X,Y}$'\n' | sed -e 's/^[[:space:]]//' -e 's/^/chr/'`
or you can just chain together reverse grep to remove any chromosomes you want
chr=$(samtools view -H m1076_rgid_reheader_st.bam | grep chr | cut -f2 | sed 's/SN://g' | grep -v chrM | grep -v chrY | awk '{if(length($0)<6)print}')
the awk statement is to remove the unknown chromosomes and random contigs, since their names will be 6 characters or longer
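You can then pass that list back to samtools view as regions (a sketch; this assumes the BAM is indexed, and $chr is deliberately left unquoted so each chromosome becomes its own argument):
samtools view -b -o m1076_filtered.bam m1076_rgid_reheader_st.bam $chr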
1.5 Bioinformatics Bits
These are mostly still command line tools, but more bioinformatics related.
1.5.1 Getting files from GEO
Using SRAtools: https://bioinformaticsworkbook.org/dataAcquisition/fileTransfer/sra.html#gsc.tab=0. To get the SRR run numbers, use the SRA Run Selector. Select the ones you want to download (or all of them), download the Accession List, and upload it to Quest. It is better to run a job array with one task per SRR; a sketch follows the commands below. fasterq-dump is now the preferred and faster option. To download a BAM, you can pipe sam-dump straight into samtools and skip the intermediate SAM file.
module load sratoolkit
fastq-dump --gzip SRR # use --split-files for paired-end
fasterq-dump SRR -O output/dir # for 10x, need to use --split-files and --include-technical
sam-dump SRR | samtools view -bS - > SRR.bam
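A minimal sketch of such a job array (the array range should match the number of lines in the accession list; any scheduler options beyond these are omitted):
#!/bin/bash
#SBATCH --job-name=sra_download
#SBATCH --array=1-24
# pick this task's accession from the list
SRR=$(sed -n "${SLURM_ARRAY_TASK_ID}p" SRR_Acc_List.txt)
module load sratoolkit
fasterq-dump "$SRR" -O fastq/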
Using ftp
This can be faster than SRAtoolkit but only works if those files have been uploaded to EBI.
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916
pattern: fastq/first 6 chars/00(last number)/full accession
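For example, to fetch the fastq files for the accession above with wget (the _1/_2 suffixes assume paired-end data was deposited):
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/SRR1016916_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/SRR1016916_2.fastq.gz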
1.5.2 Making UCSC tracks
module load homer
makeTagDirectory tag1045 1045.bed
makeUCSCfile tagm57 -o m57 -norm 2e7
With RNA-Seq UCSC tracks, use -fragLength given, otherwise HOMER messes up the auto-correlation by assuming the data is ChIP-Seq, leading to screwed-up tracks.
You can upload the tracks and save them as a session that can be shared.
1.5.3 samtools
samtools view -h test.bam # print bam with headers
samtools view -H test.bam # headers only
samtools flagstat test.bam # get mapping info
samtools idxstats test.bam
samtools view -f 0x2 test.bam # keep only reads mapped in proper pair
samtools view -H test.bam | grep @HD # Check if BAM files are sorted
samtools view -F 4 test.bam | wc -l # check how many mapped reads
1.6 R Quirks
When saving anything produced with ggplot that you will be editing in Illustrator, set useDingbats = FALSE (an argument of the pdf device, which ggsave passes through); otherwise points are drawn as Dingbats font glyphs that Illustrator handles poorly.