library(tidyverse)
<- list.files(path = "./working_dir", pattern = "*genes.results$", full.names = TRUE)
files <- lapply(files, read_delim)
rsem_results <- lapply(rsem_results, function(x) { x$expected_count })
expected_counts_list <- do.call(cbind, expected_counts_list) %>% as.data.frame() expected_counts
2 Bulk Analysis
This chapter is for the downstream analysis of the most common bulk NGS experiments we perform. For functional analysis that are common to all these methods, see next chapter.
2.1 RNA-Seq
While RPKM has some issues and is not the most correct quantification, for basic comparison, this is good enough to share with biologists. The nf-core pipeline will provide this value by default.
2.1.1 Differential Gene Expression
One of the most basic and common analysis on RNA-Seq. We use DESeq2 and their well documented vignette is worth reading from start to end for the beginner.
RSEM produces non-integer counts, and we can by-pass that by using round()
. Alternatively, you can use tximport to read the files in.
<- round(expected_counts)
expected_counts rownames(expected_counts) <- rsem_results[[1]]$gene_id
colnames(expected_counts) <- stringr::str_extract(files, "mr\\d+") # our RNA-seq samples usually start with MR
<- data.frame(condition = factor(rep(c("control", "knockdown"), each = 3)),
sampleTable replicate = factor(rep(seq(1,3))))
rownames(sampleTable) <- colnames(expected_counts)
DESeq2
<- DESeqDataSetFromMatrix(expected_counts, sampleTable, design = ~condition)
dds <- rowSums(counts(dds)) > 10
keep <- dds[keep,]
dds <- DESeq(dds)
dds <- results(dds, alpha = 0.01)
res summary(res)
Difference between rlog, vst and lfcShrink https://support.bioconductor.org/p/104615/.
Plotting PCs
DESeq2’s plotPCA()
function will plot the top 500 most variable genes. The chunk below will plot all genes.
<- res %>% subset(padj < 0.01)
degenes <- rlog(dds)
dds_rlog <- t(assay(dds_rlog)) %>% prcomp()
pca_data autoplot(pca_data, data = sampleTable, colour = "condition") +
geom_text_repel(label = rownames(sampleTable))
<- assay(dds_rlog) %>% subset(rownames(dds_rlog) %in% rownames(degenes))
rlog_de <- t(scale(t(rlog_de))) rlog_de_scaled
After getting DEGs, you’d want to group the genes into biological functions. See Section 3.2 for Over representation analysis (ORA) with GO and KEGG terms as well as Gene Set Enrichment Analysis (GSEA) with ranked genes.
2.1.2 Deciding Groups and Plotting
While you can use k-means manually to get seperate groups, ComplexHeatmap allows you to do so with more flexiblity and get visualizations as well.
<- list(
mat_colors replicate = c(brewer.pal(3, "Accent")),
condition = c(brewer.pal(6, "Set1")))
names(mat_colors$replicate) <- unique(sampleTable$replicate)
names(mat_colors$condition) <- sampleTable$condition
<- HeatmapAnnotation(df = sampleTable,
col_anno which = 'col',
col = mat_colors
)
<- Heatmap(rlog_de_scaled,
hmap name = "scaled",
# Row Params
show_row_names = FALSE,
row_title_rot=0,
cluster_row_slices = FALSE,
border = TRUE,
row_km = 2, # split rows into 2 groups
# Column Params
cluster_columns = FALSE,
column_title = "Rlog Transformed Expression for all DE genes",
top_annotation = col_anno)
<- draw(hmap) # assigning so that k-means is only called once
hmap row_order(hmap) # grab the different groups in rows
See here for why we assign draw()
.
2.2 ChIP-Seq and ATAC-Seq
2.2.1 Spike-in normalization
Guidelines on spike-in normalization from ActiveMotif.
- Perform ChIP combining the Spike-in Chromatin, Spike-in Antibody, test chromatin and test antibody into the same tube for immunoprecipitation. We suggest using the guidelines provided for chromatin and antibody quantities based on the antibody target.
- Follow ChIP with Next-Generation Sequencing.
- Map ChIP-seq data to the test reference genome (e.g. human, mouse or other).
- Map ChIP-seq data to the Drosophila reference genome.
- Count uniquely aligning Drosophila sequence tags and identify the sample containing the least number of tags.
- Compare Drosophila tag counts from other samples to the sample containing the least tags and generate a normalization factor for each comparison. (Sample 1 with lowest tag count / Sample 2) = Normalization factor
- Downsample the tag counts of data sets proportional to the normalization factor determined
2.2.2 Peak calling
Depending on if you use the nfcore’s pipeline or your own, you will have to call peaks. I use MACS2 and here are some details I’ve gathered.
https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1.3_MACS2_peak_calling_details.md
Why I skip model building https://github.com/macs3-project/MACS/issues/391
macs2 callpeak -t mxx_sorted.bam --outdir macs/mxx -n mxx -g hs -q 0.01 --nomodel --shift 0 --extsize 250
2.2.3 Overlapping peaks
After MACS2 peak-calling, we may want to see how many peaks overlap in different conditions. Even though it’s named mergePeaks, you will be able to get overlapping statistics from this. Be mindful of long path names as mergePeaks will produce errors.
mergePeaks -d 100 pu1.peaks cebp.peaks -prefix mmm -venn venn.txt
The -d flag changes the unique peaks to 100 bp each and keep the shared peaks same size.
To get literal 1bp overlap, just omit -d argument altogether
The easiest way to plot this result is with Vennerable.
<- tot[1]
a <- tot[2]
b <- tot[3]
ab
<- createVennObj(nSets = 2, sNames = c("m1043", "m1044"), # names in order of first to last
venn_obj sSizes = c(0, a, b ,ab))
<- plotVenn(nVennObj = venn_obj) vp
This gets tedious with more than 2 sets. See code below. You should not make Venn Diagrams with more than 4 sets. An upset plot is better in that scenario.
Show the code
# 3 sets -----
<- tot[1]
a <- tot[2]
b <- tot[3]
ab <- tot[4]
c <- tot[5]
ac <- tot[6]
bc <- tot[7]
abc
# 4 sets -------
<- tot[1]
a <- tot[2]
b <- tot[3]Ve
ab <- tot[4]
c <- tot[5]
ac <- tot[6]
bc <- tot[7]
abc <- tot[8]
d <- tot[9]
ad <- tot[10]
bd <- tot[11]
abd <- tot[12]
cd <- tot[13]
acd <- tot[14]
bcd <- tot[15] abcd
This is a more automated way of reading in the data, here using 4 sets as an example.
<- read.table("results/output_nepc.bedpe", header = TRUE, sep = "\t")
venn $alpha <- apply(venn, 1, function(x) {
venn<- c("A", "B", "C", "D")
sets <- sets[which(x == "X")]
selected_sets paste(selected_sets, collapse = "")
})
<- c("0", "A", "B", "AB", "C", "AC", "BC", "ABC", "D", "AD", "BD", "ABD", "CD", "ACD", "BCD", "ABCD")
order_vector
$alpha <- factor(venn$alpha, levels = order_vector)
venn
# Sort the data frame based on the factor levels
<- venn[order(venn$alpha), ]
venn <- Venn(SetNames = c("93", "145.1", "145.2", "nci"),
venn_plot Weight = c(0, venn$Features))
plot(venn_plot, type = "ellipses")
Creating upset plots https://github.com/hms-dbmi/UpSetR
library(UpSetR)
$Sets <- apply(venn[, -1], 1, function(x) {
venn<- colnames(venn)[2:17]
sets <- sets[which(x == "X")]
selected_sets paste(selected_sets, collapse = "&")
})
<- c(venn$Features)
upset_input names(upset_input) <- venn$Sets
upset(fromExpression(upset_input), nsets = 16, nintersects = 100, number.angles = 45)
A sample script to automate this process. I haven’t incorporated the above part into this script.
mergePeaksandplot.sh
module load homer/4.10
module load R/4.0.3
mergePeaks control_peaks.narrowPeak purturbed_peaks.narrowPeak -prefix ov -venn venn.txt
# n_way script, venn.txt, last sample label, middle samples label, first sample label, txt file name (no extension)
Rscript --vanilla venn2way.R venn.txt purturbed control venn
The accompanying R file:
venn2way.R
library(Vennerable)
<- commandArgs(trailingOnly = TRUE)
args
<- read.table(args[1], header = TRUE, sep = "\t")
venn <- venn[[Total]]
tot
<- tot[1]
a <- tot[2]
b <- tot[3]
ab
<- Venn(SetNames = c(args[3], args[2]), # opposite labelling to nVennR, going from last sample to first
venner Weight = c(0, b, a, ab))
png(paste0(args[4], ".png"))
plot(venner)
dev.off()
To make Venn Diagrams that are more accurately weighted, use nVennR. Sadly, it looks like it’s been removed from CRAN since the last time I’ve installed it.
If you have replicates, overlapping peaks can be obtained via packages like MAnorm2. Read more about dealing with replicates here.
2.2.4 Motif Enrichment
Homer mergePeaks
is able to take the narrowPeaks format as input, but if you’re doing homer mergePeaks
first and then findMotifs
, you need to change the Homer mergePeaks
output to exclude the header line first. This also applies to deepTools heatmaps.
findMotifsGenome.pl peak_or_bed /projects/p20023/Viriya/software/Homer4.10/data/genomes/hg38/ output_dir -size 200
-size 200 is the default, and is calculated from the center of the peaks. If your peaks are bigger and you wish to use the entire region, use -size given
. However, when the regions are too large, motifs will not be significantly enriched.
2.2.5 Plotting Heatmaps
Here I’m showing a sample deepTools script to plot heatmaps. First you need bigwig files. bamCoverage
offers normalization by scaling factor, Reads Per Kilobase per Million mapped reads (RPKM), counts per million (CPM), bins per million mapped reads (BPM) and 1x depth (reads per genome coverage, RPGC).
bamCoverage --bam m${i}_sorted.bam --outFileName m${i}.bw --normalizeUsing CPM \
--numberOfProcessors 6 --binSize 20 --extendReads 200
I find having paths defined at the top of the script makes it less likely to have mistakes.
day3="m1015_1055_hg38/foxa2/CPM"
nci="m1015_1055_hg38/CPM"
chip="../Viriya/analysis/foxa2/chip-seq"
computeMatrix reference-point --referencePoint center -S \
$day3/m674.bw $day3/m676.bw $tf/m914.bw $tf/m921.bw $nci/m1043.bw $nci/m1044.bw \
$chip/clustering/output/1.bed \
-R $chip/clustering/output/2.bed \
$chip/clustering/output/3.bed \
$chip/clustering/output/4.bed \
-b 3000 -o $chip/heatmaps/scaled_FOXA2_sites.npz \
-a 3000 \
--samplesLabel m674 m676 m914 m921 m1043 m1044 --blackListFileName $chip/ENCFF356LFX.bed.gz
-p max
plotHeatmap -m $chip/FOXA2_6_samples/heatmaps/scaled_FOXA2_sites.npz \
$chip/FOXA2_6_samples/heatmaps/scaled_FOXA2_sites.pdf \
-o --colorMap Blues
--sortUsingSamples
is not 0 indexed. 1st sample is 1. --sortUsingSamples
is also usable in plotHeatmap so we can save that to there for more flexible plotting.
--regionsLabel
: “xxx binding sites n=2390” put it in quotes if you don’t want to key in escape too many times
--sampleLabels
: if you’re sure you’re not going to change any labels, put it in makeheatmap step because fewer things to type when redoing the heatmap multiple times due to scaling etc, therefore fewer mistakes. otherwise put it in plotHeatmap.
In legends: “Pol II” spaces and parantheses need to be escaped, but not + signs.
Use BED6 format annd not the BED3 if you care about strandedness https://github.com/deeptools/deepTools/issues/886.
To get sorted output bed files, use the --sortedOutRegions
at the plotHeatmap step, not computeMatrix.
Use --clusterUsingSamples
for a more robust delineation.
2.2.6 Genomic Annotation
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
<- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
promoter
<- readPeakFile("/path_to_narrowPeak/mxxx_peaks.narrowPeak")
mxxx <- getTagMatrix(mxxx, windows = promoter) # usually needs an interactive job
tagMatrix <- annotatePeak(mxxx, tssRegion = c(-3000,3000), TxDb = txdb, annoDb = "org.Hs.eg.db")
peakAnno plotAnnoPie(peakAnno, main = "mxxx", line = -6)
vennpie(peakAnno, r = 0.1)
upsetplot(peakAnno) + ggtitle("mxxx")
plotAnnoBar(peakAnno)
annotatePeak
will default to chosing one gene per region. This is fine for small binding sites but for larger regions use seq2gene
as it will map genomic regions in a many-to-many manner.