1  Gluing Together Pipelines

This chapter is named as such because I will be detailing the processing pipelines for our NGS data and the command line, i.e. the glue that connects the different parts together.

As of October 2023, we are starting to move ahead with switching to Nextflow and nf-core’s processing pipelines, which will take care of most basic processing steps such as mapping, so I will not describe those steps in much detail.

1.1 Working with High-Performance Computers (HPCs)

Northwestern’s Quest User Guide is a great place to start learning how HPCs work. However, with our own server at Emory, things are much simpler.

To cancel many pending jobs at once: scancel --state PENDING --user vkp2256.
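
A few related SLURM commands (assuming the scheduler is SLURM, as scancel implies; the job ID and script name are placeholders):

squeue -u vkp2256  # list your queued and running jobs
sbatch myscript.sh # submit a batch script
scancel 123456     # cancel a single job by its ID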

1.2 Environment Reproducibility

Many bioinformatics packages now use conda, so many of them will be self-contained. I highly recommend using mamba to speed up installations.

For R, while one can use a conda environment, I’ve found it hard to set up, so try to use only one version of R per project to avoid dependency issues, and run sessionInfo() to record the package versions that were installed.

1.3 Nextflow

Nextflow is a pipeline construction language. Nf-core is a community-contributed collection of best-practice bioinformatics pipelines. If you want to do basic, routine analysis, using nf-core’s ready-made pipelines is best, but because of their large code base and complexity, they can be hard to debug. You may want to write your own Nextflow pipeline if you find yourself doing similar analyses many times.

Use nextflow pull nf-core/rnaseq to update a pipeline. Without this, nextflow run will always use the locally cached version if one is available, even if the pipeline has been updated since.
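
A minimal sketch (nf-core/rnaseq is the example from above; -r and -latest are standard Nextflow options, and the revision number is hypothetical):

nextflow pull nf-core/rnaseq          # refresh the cached copy
nextflow run nf-core/rnaseq -r 3.12.0 # pin an exact revision for reproducibility
nextflow run nf-core/rnaseq -latest   # or pull the latest revision before running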

1.4 The Command Line

It is essential to master the basics of the command line. Buffalo (2015, chap. 8) is a great place to get started. I found learning about path expansion, awk, and vim extremely useful. The Missing Semester is also a great resource.

Give your executable bash scripts a help page by adding a Help function.
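
A minimal sketch of such a Help function, using getopts (the script and option names are just examples):

#!/bin/bash

# print usage information
Help() {
   echo "Usage: myscript.sh [-h] input_file"
   echo "  -h    print this help and exit"
}

# parse options; -h prints the help page
while getopts ":h" option; do
   case $option in
      h) Help; exit 0 ;;
      \?) echo "Invalid option"; Help; exit 1 ;;
   esac
done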

While I think going through these well-documented resources is better than anything I could share, I’ll document the things I found really useful here.

Warning

Some of these might not be accurate; test carefully before using them (as you should with any random piece of code you find).

1.4.1 One-liners

There are many great bioinformatics one-liner compilations such as this one. I suggest you keep a list of your own that you find yourself re-using often.

1.4.2 File manipulation

Read a file line by line and perform an action on each line.

file="/path/to/yourfile.txt"
while IFS= read -r line
do
    echo "$line"
    echo "${line:8:5}" # grab 5 chars starting at the 8th position (0-indexed)
done < "$file"

If you want to refer to columns in your file, read each field into its own variable:

# Loop through each line in the mapping file and rename the files
while IFS=' ' read -r old_name new_name; do
    # Rename the files
    mv "$old_name" "$new_name"
done < "$mapping_file"
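
Here mapping_file is assumed to point to a two-column, space-delimited file along these lines:

old_name_1.fastq new_name_1.fastq
old_name_2.fastq new_name_2.fastq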

Checking if a directory already exists

if [ ! -d "$dir" ]; then
    mkdir "$dir"
fi
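
Equivalently, mkdir -p only creates the directory if it does not already exist (and creates any missing parent directories):

mkdir -p "$dir"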

For loops: there are many ways of specifying what to loop through

#!/bin/bash
# any one of these loop headers works:
#   for i in m{1..3} m{13..15}
#   for i in 1 2 3 4 5
for i in file1 file2
do
    echo "Welcome $i times"
    j="$(basename "${i}")"                     # get the file name
    k="$(basename "${i}" | sed 's/_[^_]*$//')" # file name without the last _-separated chunk (e.g. the extension)
    l="$(dirname "${i}")"                      # the directory name without the file name
    echo raw/JYu-${i}_*_L001*                  # filename expansion
done

For more complex naming schemes, you may use an array.

# Specify the last two digits of the accession numbers
numbers=("59" "82" "23" "45" "77")

# Iterate over the array elements
for num in "${numbers[@]}"; do
  echo "SRR13105$num"
done

# Or iterate over indexes
for ((index=0; index<${#numbers[@]}; index++))
do
  num="SRR13105${numbers[$index]}"
  echo "$num"
done

See this for more examples and use cases.

1.4.3 Unix Text Processing: Awk and Sed

Awk

Awk statements are written as pattern { action }. If we omit the pattern, Awk will run the action on all records. If we omit the action but specify a pattern, Awk will print all records that match the pattern.

awk '{if ($3 == "sth") print $0;}' file.tsv # print records where the 3rd field equals "sth"
awk -F "\t" '{print NF; exit}' file.tsv # prints number of fields/columns
awk -F "\t" 'BEGIN{OFS="\t";} {print $1,$2,$3,"+"}' input.bed > output.bed # indicate that the output should be tab-delimited
color="$(awk -v val="$line" 'NR==val{print; exit}' color_list_3_chars.txt)" # pick a line based on a counter and assign it to a variable

Sed
Sed substitution statements are written as 's/pattern/replacement/'.

You can chain rules with multiple -e flags: sed -e 'some rule' -e 'another rule'. The trailing /g means global replace, i.e. find all occurrences of foo and replace them with bar; if you remove the /g, only the first occurrence on each line is changed. Rules can also be chained with a semicolon, and 5q quits after line 5, so sed -e 's/foo/bar/g ; 5q' substitutes and prints only the first 5 lines.
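
A concrete sketch of these (input.txt and the foo/bar pattern are placeholders):

sed 's/foo/bar/g' input.txt         # replace every occurrence on each line
sed 's/foo/bar/' input.txt          # replace only the first occurrence per line
sed -e 's/foo/bar/g ; 5q' input.txt # substitute, then quit after printing the first 5 lines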

1.4.4 Miscellaneous

  • Inside a bash script, you can use $0 for the name of the script, $1 for the first argument, and so on (see the sketch after this list).
  • $? shows the exit code of the last command (0 if everything went well); $# is the number of arguments.
  • shuf -n 4 example_data.csv prints 4 random lines.
  • nohup command & makes command run in the background; fg brings it back to the foreground.
  • https://www.shellcheck.net/ is a website that checks your shell scripts for errors and bugs.
  • paste -sd, concatenates lines into a single line (-s), delimited by a comma.
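
A minimal sketch tying the first two bullets together (myscript.sh and input.txt are placeholder names):

#!/bin/bash
# save as myscript.sh and run: bash myscript.sh input.txt
echo "script name: $0"
echo "first argument: $1"
echo "number of arguments: $#"
wc -l "$1"
echo "exit code of wc: $?"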

1.4.4.1 Working with lines

count the number of occurrences of each unique line

sort | uniq -c 

output the number of duplicated lines, if any (uniq expects sorted input)

uniq -d mm_gene_names.txt | wc -l

count the number of reads in a fastq file (assuming a well-formatted file, which it usually is)

cat file.fq | echo $((`wc -l`/4))
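
For gzipped fastq files, a similar sketch using zcat:

echo $(( $(zcat file.fq.gz | wc -l) / 4 ))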

print only the number of lines in a file and no filenames

wc -l m*/m*.narrowPeak | cut -f 4 -d' '

to get rid of x lines at the beginning of a file

tail -n +2 file.txt # starts from the 2nd line, i.e. removing the first line

to see the first 2 lines and the last 2 lines of a file

(head -n 2; tail -n 2) < Mus_musculus.GRCm38.75_chr1.bed

1.4.4.2 Working with columns

cut -f 2 Mus_musculus.GRCm38.75_chr1.bed # only the second column
cut -f 3-8 Mus_musculus.GRCm38.75_chr1.bed # range of columns  
cut -f 3,6,7,8 Mus_musculus.GRCm38.75_chr1.bed # sets of columns, cannot be reordered  
cut -d, -f2 some_file.csv # use comma as delimiter

grep -v "^#" some_file.gtf | cut -f 1-8 | column -t | head -n 3 #prettify output

remove header lines and count the number of columns

grep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf | awk -F "\t" '{print NF; exit}'

1.4.4.3 Working with strings

to remove everything after the first dot; useful for getting the sample name from filename.fastq.gz

${i%%.*}

to remove everything after the last dot; useful for getting the sample name from filename.param.fastq

${i%.*}

to remove everything up to and including the last /

${i##*/}

to capitalize the first character (use ${var^^} to uppercase the whole string)

${var^}
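
A quick sketch of these expansions on a made-up file path:

i="data/sample1.param.fastq"
echo "${i%%.*}" # data/sample1
echo "${i%.*}"  # data/sample1.param
echo "${i##*/}" # sample1.param.fastq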

1.4.4.4 Directories

to get human-readable sizes of the folders under a parent folder

du -h --max-depth=1 parent_folder_name

-s means summary (one total per directory)

du -sh /full/path/directory

print each file on its own line

ls -1

check which files take the most space (-d 1 means go one level down)

du -d 1 | sort -n

1.4.4.5 Alignment files

Removing EBV chromosomes for viewing on the UCSC browser (the BAM must be indexed for idxstats to work).

samtools idxstats m${i}_sorted.bam | cut -f 1 | grep -v 'chrEBV' | xargs samtools view -h m${i}_sorted.bam | grep -v 'chrEBV' | samtools view -b -o m${i}_filtered.bam

Remove reads not mapping to chr1-22, X, Y (note this does not remove them from the header). The brace expansion in echo -n prints each chromosome name followed by a newline; the first sed expression removes the leading whitespace that echo inserts between words, and the second adds “chr” at the beginning of each name.

samtools view -o mr392_filtered1.bam mr392_sorted.bam `echo -n {{1..22},X,Y}$'\n' | sed -e 's/^[[:space:]]//' -e 's/^/chr/'`

or you can just chain together inverted greps to remove whichever chromosomes you want

chr=$(samtools view -H m1076_rgid_reheader_st.bam | grep chr | cut -f2 | sed 's/SN://g' | grep -v chrM | grep -v chrY | awk '{if(length($0)<6)print}')

the awk statement removes the unknown chromosomes and random contigs, since their names will be longer than 6 characters

1.5 Bioinformatics Bits

These are mostly still command line tools, but more bioinformatics related.

1.5.1 Getting files from GEO

Using SRAtools: https://bioinformaticsworkbook.org/dataAcquisition/fileTransfer/sra.html#gsc.tab=0. To get the SRR run numbers, use the SRA Run Selector. Select the ones you want to download (or all of them), download the Accession List, and upload it to Quest. It is better to submit a job array with one job per SRR. fasterq-dump is now the preferred and faster option. To download a BAM, you can pipe sam-dump straight into samtools and skip the intermediate SAM file (see below).

module load sratoolkit
fastq-dump --gzip SRR # use --split-files for paired-end
fasterq-dump SRR -O output/dir # for 10x data, use --split-files and --include-technical
sam-dump SRR | samtools view -bS - > SRR.bam

Using ftp
This can be faster than SRAtoolkit, but it only works if the files have been uploaded to EBI.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916

pattern: fastq/<first 6 characters of the accession>/00<last digit of the accession>/<full accession>
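
For example, to fetch the run above with wget (the exact file name inside the directory is an assumption; list the directory first if unsure):

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/SRR1016916.fastq.gz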

1.5.2 Making UCSC tracks

module load homer
makeTagDirectory tag1045 1045.bed
makeUCSCfile tagm57 -o m57 -norm 2e7

With RNA-Seq UCSC tracks, use -fragLength given; otherwise HOMER messes up the auto-correlation by assuming the data is ChIP-Seq, which leads to screwed-up tracks.

You can upload the tracks and save them as a session that can be shared.

1.5.3 samtools

Learn the Bam Format.

samtools view -h test.bam # print bam with headers
samtools view -H test.bam # headers only
samtools flagstat test.bam # get mapping info
samtools idxstats test.bam # reads per chromosome (requires an index)
samtools view -f 0x2 test.bam # reads mapped in a proper pair
samtools view -H test.bam | grep @HD # check if the BAM file is sorted
samtools view -F 4 test.bam | wc -l # count how many reads are mapped

1.6 R Quirks

When saving anything produced with ggplot that you will later edit in Illustrator, set useDingbats = FALSE so that points are not encoded as font glyphs.