library(tidyverse)
library(GenomicAlignments)
library(kableExtra)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)
source("../tools/plotting_tools.R")
theme_set(theme_notebook())
The raw sequencing data is available on the SRA and is associated with this BioProject.
Library Name | Run ID and Link | Size |
---|---|---|
oPool_J2 | SRR25076304 | 15 Mb |
short_TF_J2 | SRR25076302 | 43 Mb |
long_TF_J2 | SRR25076300 | 39 Mb |
small_RNA_J2 | SRR25076298 | 30 Mb |
This notebook is nearly identical to
left_side_TO_libs_processing.Rmd
, it just uses the right
side sequencing data instead.
Four different ORBIT mutant libraries were constructed using targeting oligo pools. These different deletion libraries were as follows:
Library #0: galK, hisA, metA, leuD (gold standards) + 23 TF genes (27 targets total)
Library #1: All TF genes < 575 bp (76 targets total)
Library #2: All TF genes > 575 bp (226 targets total)
Library #3: All small RNA genes (90 targets total)
Within each library the targeting oligo deletes a target gene, and pInt_kanR should integrate into the newly formed attB site on the genome. This should leave us with a library with the target genes deleted, replaced by the pInt_kanR backbone flanked by the attR and attL sites. To test the accuracy with which these libraries were constructed sequencing libraries were prepared to amplify these genome-pInt junctions. This was done by shearing genomic DNA, adding C tails to free DNA ends and then amplifying junctions with a polyG and pInt specific primer. This notebook processes the raw fastq sequencing files (direct from MiSeq) into a more usable form of reads that are filtered and trimmed to map to the E. coli genome. Then by looking at where these junction reads start (following the pInt sequence) we can determine if the ORBIT mutations were created accurately.
The structure of this data processing is shown below. This processing
script lives in the code
folder under
seq_data_processing
and it processes raw data files that
live in data/seq_data/right_side_TO_libs
. The processing
starts from 0_raw_fastq
(which houses truly raw data) and
proceeds to filter for quality using fastp
(1_fastp_output
), then the pInt_kanR sequence that is
common to all mutants is found and trimmed off along with the polyC tail
using cutadapt (2_cutadapt_output
). At this stage filtered
and trimmed reads should be the flanking E. coli genomic
sequences for each mutant, so we map these reads to the E. coli genome
with bowtie2 (3_bowtie_ecoli_output
). Finally, we examine
some of the reads that don’t map by seeing if they still map to the
pInt_kanR plasmid even after the previous steps
(4_bowtie_pInt_output
) and also check if they map to the
phiX loading control that we use on the Miseq
(bowtie_phiX_output
).
After all this, we will convert these various sets of reads into bam
files and then into a more usable form for analysis - the final
dataframe: df_to_libs_right_all_reads.csv
├── code
│ ├── seq_data_processing
│ │ ├── right_side_TO_libs_processing.Rmd
├── data
│ ├── seq_data
│ │ ├── right_side_TO_libs
│ │ │ ├── 0_raw_fastq
│ │ │ ├── 1_fastp_output
│ │ │ ├── 2_cutadapt_output
│ │ │ ├── 3_bowtie_ecoli_output
│ │ │ ├── 4_bowtie_pInt_output
│ │ │ ├── 5_bowtie_phiX_output
│ │ │ ├── 6_bam_files
│ │ │ ├── df_to_libs_right_all_reads.csv
├── figures
This complicated setup is required the first time for m1 macs only, which must run certain programs in an intel like environment. Note that it gets a little funky running bash from rmd, so it may be more straightforward for the user to run bash commands in the terminal or windows equivalent. However, this setup works on multiple machines for the author at the current time.
This setup assumes you already have bash and conda in addition to R/rmd.
source ~/.bash_profile
#**first time running scripts, create conda env**
# use code below when running on *m1 mac* to setup intel like env
# see stack overflow: https://stackoverflow.com/questions/71515117/how-to-set-up-a-conda-osx-64-environment-on-arm-mac
conda create -n bioinfo_env -y
conda activate bioinfo_env
conda config --env --set subdir osx-64
conda install python=3.7 -y
conda install -n bioinfo_env fastp cutadapt bowtie2 samtools starcode --channel bioconda -y
For normal, non-m1 mac conda environment setup:
source ~/.bash_profile
#**for normal conda install using this code first time**
conda create -n bioinfo_env fastp cutadapt samtools bowtie2 starcode -y --channel bioconda
With our bioinfo_env
setup we are ready to process the
data. Note, when running bash scripts in rmd, we have found that we must
activate the conda env in each code chunk and also source our bash
profile to get things to work correctly.
Looking at the raw data, (0_raw_fastq
), you can see that
these were actually paired end sequencing runs. However, from looking at
the data, read2 quality is very low, so we will focus on the higher
quality read1 data, which starts from the pInt_kanR backbone and
sequences out into the genome (instead of read2 sequencing in from the
polyC tail).
Here, we will use the program fastp, with defaults to filter out very low quality reads, and also to get a nice summary of the seq libraries.
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/
fastp -i "0_raw_fastq/right-IDT-lib-0_S1_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_0_right_fastp.fastq.gz" -h "1_fastp_output/lib_0_right_report.html" --json /dev/null
fastp -i "0_raw_fastq/right-twist-lib-1_S2_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_1_right_fastp.fastq.gz" -h "1_fastp_output/lib_1_right_report.html" --json /dev/null
fastp -i "0_raw_fastq/right-twist-lib-2_S3_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_2_right_fastp.fastq.gz" -h "1_fastp_output/lib_2_right_report.html" --json /dev/null
fastp -i "0_raw_fastq/right-twist-lib-3_S4_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_3_right_fastp.fastq.gz" -h "1_fastp_output/lib_3_right_report.html" --json /dev/null
Please see the nice html reports in 1_fastp_output
. The
main takeaway is the total number of reads we are working with and their
quality:
Lib #0: 82k reads
Lib #1: 237k reads
Lib #2: 212k reads
Lib #3: 159k reads
For all 4 libraries, >99% of reads pass the quality filter. ~78% of remaining reads are Q30 or above. These total read counts are higher than the left side libraries - please note that this was a MiSeq nano kit.
The next step in the processing pipeline is to trim off the ORBIT
plasmid sequence and the polyC tail used in library prep. To do this we
will use the ‘linked adapter’ option from cutadapt. I experimented with
a few options and this works way better than unlinked. This linked
adapter is called with the option -a 5'ADAPTER...3'ADAPTER
. Note the polyC tail is specified as 8x C’s C{8}
. In
looking at the fastq files there are many reads that have way way longer
polyC tails, but this cutadapt scheme should trim everything 3’ of the
first instance of 8 Cs following the 5’ adapter. Note that the PCR1
reverse primer that is polyG contains 1N, because of ordering issues
from sigma. Therefore many of the polyC tails are not perfect. For this
reason and because the ORBIT adapter needs to be robustly found, I
increased the error tolerance from the standard 10% to 20%
(e 0.2
). This means that for the polyC tail only 1-2 nt
cannot be C.
After trimming adapter the remaining genomic regions will be mapped
to the E. coli genome, so I’m also passing a length cutoff of at least
20bp for the trimmed read -m 20
. A read shorter than 20 bp
will likely not be unique and probably shouldn’t be trusted. This
minimum length could be increased later to be more stringent.
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/
cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_0_right_fastp.fastq.gz -o 2_cutadapt_output/lib_0_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_0_right_cutadpt_report.tsv
cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_1_right_fastp.fastq.gz -o 2_cutadapt_output/lib_1_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_1_right_cutadpt_report.tsv
cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_2_right_fastp.fastq.gz -o 2_cutadapt_output/lib_2_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_2_right_cutadpt_report.tsv
cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_3_right_fastp.fastq.gz -o 2_cutadapt_output/lib_3_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_3_right_cutadpt_report.tsv
Now let’s look at the output reports and make sure the trimming worked as expected.
cutadapt_path <- '../../data/seq_data/right_side_TO_libs/2_cutadapt_output/'
cut_lib_0 <- paste0(cutadapt_path,'lib_0_right_cutadpt_report.tsv')
cut_lib_1 <- paste0(cutadapt_path,'lib_1_right_cutadpt_report.tsv')
cut_lib_2 <- paste0(cutadapt_path,'lib_2_right_cutadpt_report.tsv')
cut_lib_3 <- paste0(cutadapt_path,'lib_3_right_cutadpt_report.tsv')
df_cutadapt <- bind_rows(read_delim(cut_lib_0, delim = '\t', skip = length(readLines(cut_lib_0))-2 ) %>% mutate(lib=0),
read_delim(cut_lib_1, delim = '\t', skip = length(readLines(cut_lib_1))-2 )%>% mutate(lib=1),
read_delim(cut_lib_2, delim = '\t', skip = length(readLines(cut_lib_2))-2 )%>% mutate(lib=2),
read_delim(cut_lib_3, delim = '\t', skip = length(readLines(cut_lib_3))-2 )%>% mutate(lib=3)) %>%
mutate(frac_in_out = out_reads / in_reads, frac_w_adapt = `w/adapters` / in_reads, frac_too_short = too_short/in_reads)
df_cutadapt %>% kable() %>% kable_styling()
status | in_reads | in_bp | too_short | too_long | too_many_n | out_reads | w/adapters | qualtrim_bp | out_bp | lib | frac_in_out | frac_w_adapt | frac_too_short |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OK | 82409 | 12409587 | 25179 | 0 | 0 | 48106 | 73285 | 0 | 2986944 | 0 | 0.5837469 | 0.8892839 | 0.3055370 |
OK | 237630 | 35786928 | 73529 | 0 | 0 | 144645 | 218174 | 0 | 8785203 | 1 | 0.6086984 | 0.9181248 | 0.3094264 |
OK | 212310 | 31950558 | 72768 | 0 | 0 | 117327 | 190095 | 0 | 7097312 | 2 | 0.5526212 | 0.8953653 | 0.3427441 |
OK | 159308 | 23978667 | 55452 | 0 | 0 | 90405 | 145857 | 0 | 5572493 | 3 | 0.5674856 | 0.9155661 | 0.3480804 |
Ok, so overall 55-60% of the input reads are being trimmed and output. About 90% have the “adapter” sequences, but a significant fraction of reads are below the minimum 20bp length after trimming and are therefore considered too short and not output (30-35%). This is likely just a function of the size distribution of the library. Let’s look at this graphically:
df_cutadapt_summary <- df_cutadapt %>%
mutate(w_o_adapters = in_reads-`w/adapters`) %>%
pivot_longer(c(too_short, out_reads, w_o_adapters), values_to = 'reads', names_to = 'fraction')
#ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="stack", stat="identity") + scale_fill_viridis_d()
ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()
Here is the largest bottleneck in the entire pipeline, where we lose
a significant portion of the reads due to them being too short following
adapter trimming (too_short
. This is likely because of the
library prep strategy, which caused some fragments to have long c tails
and shorter genomic fragments. However, the remaining reads now seem to
be ready to go to answer our original question: how accurately are ORBIT
mutants constructed. As a reminder these remaining reads
(out_reads
above) passed sequencing quality filters, they
have the pInt_kanR backbone sequence, and they have at least 20bp of
putative genomic sequence that should allow us to map back to the
original ORBIT target genes.
The next step is to align the trimmed reads to the E. coli
genome. This should be an unbiased way to see how well the sequencing
reads from our mutant library actually match the intended ORBIT
deletions. I’m using the --very-sensitive-local
mapping
option. Earlier tests indicated that this gave the most reads mapping
and manual inspection indicated that it was mapping correctly. This
local option means only part of the read needs to align, not the whole
thing. This may not be optimal, but seems necessary because some reads
have remnants of the c-tail still present in read 1, which seemed to
limit the standard alignment. The E. coli ref genome is built
with the genome fasta file MG1655_U000963.fasta
by simply
calling bowtie build MG1655_U000963.fasta ecoli_ref_genome
. Since we are only using the higher quality / more informative read1, I
am using the -U
option, signifying that all the input reads
(from cutadapt) are unpaired. Bowtie will try to map the reads to the
genome and save them as a sam file (-S
option). I am also
capturing the unmapped reads with the --un
option, which
will go to a different sam file. Finally, with the
2>file.txt
option, I’m taking the standard metrics
bowtie outputs to the console and writing them to a text file, so we can
keep track of the fate of each libraries reads with real numbers.
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/3_bowtie_ecoli_output/
bowtie2-build --quiet MG1655_U000963.fasta ecoli_ref_genome
bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_0_right_cutadpt.fastq.gz -S lib_0_right_mapped.sam --un lib_0_right_unmapped.sam 2> lib_0_right_bowtie.txt
bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_1_right_cutadpt.fastq.gz -S lib_1_right_mapped.sam --un lib_1_right_unmapped.sam 2> lib_1_right_bowtie.txt
bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_2_right_cutadpt.fastq.gz -S lib_2_right_mapped.sam --un lib_2_right_unmapped.sam 2> lib_2_right_bowtie.txt
bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_3_right_cutadpt.fastq.gz -S lib_3_right_mapped.sam --un lib_3_right_unmapped.sam 2> lib_3_right_bowtie.txt
Now let’s look at the bowtie metrics text files to see how the alignments went.
ecoli_bowtie_path <- '../../data/seq_data/right_side_TO_libs/3_bowtie_ecoli_output/'
cut_lib_0 <- paste0(cutadapt_path,'lib_0_right_cutadpt_report.tsv')
vec0 <- readLines(paste0(ecoli_bowtie_path,"lib_0_right_bowtie.txt"))
vec1 <- readLines(paste0(ecoli_bowtie_path,"lib_1_right_bowtie.txt"))
vec2 <- readLines(paste0(ecoli_bowtie_path,"lib_2_right_bowtie.txt"))
vec3 <- readLines(paste0(ecoli_bowtie_path,"lib_3_right_bowtie.txt"))
names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')
bowtie_0 <- parse_number(vec0) %>% setNames(names)
bowtie_1 <- parse_number(vec1) %>% setNames(names)
bowtie_2 <- parse_number(vec2) %>% setNames(names)
bowtie_3 <- parse_number(vec3) %>% setNames(names)
df_bowtie_map_1 <- bind_rows(bowtie_0, bowtie_1,bowtie_2,bowtie_3)
df_bowtie_map_1 %>% kable() %>% kable_styling()
input_reads | unpaired | zero_align | one_align | oneplus_align | overall_align_percent |
---|---|---|---|---|---|
48106 | 48106 | 10109 | 37866 | 131 | 78.99 |
144645 | 144645 | 23344 | 109243 | 12058 | 83.86 |
117327 | 117327 | 20424 | 96711 | 192 | 82.59 |
90405 | 90405 | 14753 | 65988 | 9664 | 83.68 |
For each of the libraries, the overall alignment percentage is 79-83% giving us 10s of thousands of mapped reads for each library. This also means that 17-21% of reads did not align, so we will briefly explore two explanations for what those reads could be.
This cell will look complicated, but basically we just have to take the unmapped reads as a sam file (default output) and convert them back into a fastq file, so that we can use them as input for this next bowtie mapping step.
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/4_bowtie_pInt_output/
bowtie2-build --quiet pInt_kanR_half.fasta pInt_kanR_half
bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_0_right_unmapped.sam -S lib_0_right_pIntKan_mapped.sam --un lib_0_right_pIntKan_unmapped.sam 2>lib_0_right_pInt_bowtie.txt
bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_1_right_unmapped.sam -S lib_1_right_pIntKan_mapped.sam --un lib_1_right_pIntKan_unmapped.sam 2>lib_1_right_pInt_bowtie.txt
bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_2_right_unmapped.sam -S lib_2_right_pIntKan_mapped.sam --un lib_2_right_pIntKan_unmapped.sam 2>lib_2_right_pInt_bowtie.txt
bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_3_right_unmapped.sam -S lib_3_right_pIntKan_mapped.sam --un lib_3_right_pIntKan_unmapped.sam 2>lib_3_right_pInt_bowtie.txt
Again, let’s check the mapping statistics.
pInt_bowtie_path <- '../../data/seq_data/right_side_TO_libs/4_bowtie_pInt_output/'
vec_p0 <- readLines(paste0(pInt_bowtie_path,"lib_0_right_pInt_bowtie.txt"))
vec_p1 <- readLines(paste0(pInt_bowtie_path,"lib_1_right_pInt_bowtie.txt"))
vec_p2 <- readLines(paste0(pInt_bowtie_path,"lib_2_right_pInt_bowtie.txt"))
vec_p3 <- readLines(paste0(pInt_bowtie_path,"lib_3_right_pInt_bowtie.txt"))
names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')
bowtie_p0 <- parse_number(vec_p0) %>% setNames(names)
bowtie_p1 <- parse_number(vec_p1) %>% setNames(names)
bowtie_p2 <- parse_number(vec_p2) %>% setNames(names)
bowtie_p3 <- parse_number(vec_p3) %>% setNames(names)
df_bowtie_map_p1 <- bind_rows(bowtie_p0, bowtie_p1,bowtie_p2,bowtie_p3)
df_bowtie_map_p1 %>% kable() %>% kable_styling()
input_reads | unpaired | zero_align | one_align | oneplus_align | overall_align_percent |
---|---|---|---|---|---|
10109 | 10109 | 5862 | 4247 | 0 | 42.01 |
23344 | 23344 | 15413 | 7931 | 0 | 33.97 |
20424 | 20424 | 14635 | 5789 | 0 | 28.34 |
14753 | 14753 | 10695 | 4058 | 0 | 27.51 |
So, some reads do map to the backbone of pInt_kanR suggesting they could be plasmid concatemers or other adapter / backbone sequences.
Finally, we will check to see if any of our reads map to the phiX control library that was run alongside the sequencing libraries (10% final concentration)
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/5_bowtie_phiX_output/
bowtie2-build --quiet phiX174.fasta phiX174
bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_0_right_pIntKan_unmapped.sam -S lib_0_right_phiX_mapped.sam --un lib_0_right_phiX_unmapped.sam 2>lib_0_right_phiX_bowtie.txt
bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_1_right_pIntKan_unmapped.sam -S lib_1_right_phiX_mapped.sam --un lib_1_right_phiX_unmapped.sam 2>lib_1_right_phiX_bowtie.txt
bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_2_right_pIntKan_unmapped.sam -S lib_2_right_phiX_mapped.sam --un lib_2_right_phiX_unmapped.sam 2>lib_2_right_phiX_bowtie.txt
bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_3_right_pIntKan_unmapped.sam -S lib_3_right_phiX_mapped.sam --un lib_3_right_phiX_unmapped.sam 2>lib_3_right_phiX_bowtie.txt
Let’s look at metrics:
phiX_bowtie_path <- '../../data/seq_data/right_side_TO_libs/5_bowtie_phiX_output/'
vec_ph0 <- readLines(paste0(phiX_bowtie_path,"lib_0_right_phiX_bowtie.txt"))
vec_ph1 <- readLines(paste0(phiX_bowtie_path,"lib_1_right_phiX_bowtie.txt"))
vec_ph2 <- readLines(paste0(phiX_bowtie_path,"lib_2_right_phiX_bowtie.txt"))
vec_ph3 <- readLines(paste0(phiX_bowtie_path,"lib_3_right_phiX_bowtie.txt"))
names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')
bowtie_ph0 <- parse_number(vec_ph0) %>% setNames(names)
bowtie_ph1 <- parse_number(vec_ph1) %>% setNames(names)
bowtie_ph2 <- parse_number(vec_ph2) %>% setNames(names)
bowtie_ph3 <- parse_number(vec_ph3) %>% setNames(names)
df_bowtie_map_ph1 <- bind_rows(bowtie_ph0, bowtie_ph1,bowtie_ph2,bowtie_ph3)
df_bowtie_map_ph1 %>% kable() %>% kable_styling()
input_reads | unpaired | zero_align | one_align | oneplus_align | overall_align_percent |
---|---|---|---|---|---|
5862 | 5862 | 5789 | 72 | 1 | 1.25 |
15413 | 15413 | 15246 | 166 | 1 | 1.08 |
14635 | 14635 | 14451 | 181 | 3 | 1.26 |
10695 | 10695 | 10596 | 99 | 0 | 0.93 |
Very few reads align to phiX.
Let’s combine all of the mapped reads into a csv and look at a quick overview. We will convert all the files to bam for reading into R.
source ~/.bash_profile
conda activate bioinfo_env
cd ../../data/seq_data/right_side_TO_libs/6_bam_files/
samtools view -bh ../3_bowtie_ecoli_output/lib_0_right_mapped.sam > lib_0_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_1_right_mapped.sam > lib_1_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_2_right_mapped.sam > lib_2_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_3_right_mapped.sam > lib_3_right_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_0_right_pIntKan_mapped.sam > lib_0_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_1_right_pIntKan_mapped.sam > lib_1_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_2_right_pIntKan_mapped.sam > lib_2_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_3_right_pIntKan_mapped.sam > lib_3_right_pIntKan_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_0_right_phiX_mapped.sam > lib_0_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_right_phiX_mapped.sam > lib_1_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_right_phiX_mapped.sam > lib_2_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_right_phiX_mapped.sam > lib_3_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_0_right_phiX_unmapped.sam > lib_0_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_right_phiX_unmapped.sam > lib_1_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_right_phiX_unmapped.sam > lib_2_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_right_phiX_unmapped.sam > lib_3_right_phiX_unmapped.bam
Now we can import the bam files.
bam_path <- '../../data/seq_data/right_side_TO_libs/6_bam_files/'
df_bam_0_1 <- readGAlignments(file = paste0(bam_path,"lib_0_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 1)
df_bam_1_1 <- readGAlignments(file = paste0(bam_path,"lib_1_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 1)
df_bam_2_1 <- readGAlignments(file = paste0(bam_path,"lib_2_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 1)
df_bam_3_1 <- readGAlignments(file = paste0(bam_path,"lib_3_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 1)
df_bam_0_2 <- readGAlignments(file = paste0(bam_path,"lib_0_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 2)
df_bam_1_2 <- readGAlignments(file = paste0(bam_path,"lib_1_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 2)
df_bam_2_2 <- readGAlignments(file = paste0(bam_path,"lib_2_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 2)
df_bam_3_2 <- readGAlignments(file = paste0(bam_path,"lib_3_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 2)
df_bam_0_3 <- readGAlignments(file = paste0(bam_path,"lib_0_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 3)
df_bam_1_3 <- readGAlignments(file = paste0(bam_path,"lib_1_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 3)
df_bam_2_3 <- readGAlignments(file = paste0(bam_path,"lib_2_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 3)
df_bam_3_3 <- readGAlignments(file = paste0(bam_path,"lib_3_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 3)
We can combine all these reads from the different steps into one dataframe and save a csv.
df_all_reads <- bind_rows(df_bam_0_1,df_bam_1_1,df_bam_2_1,df_bam_3_1,
df_bam_0_2,df_bam_1_2,df_bam_2_2,df_bam_3_2,
df_bam_0_3,df_bam_1_3,df_bam_2_3,df_bam_3_3)
write_csv(df_all_reads, "../../data/seq_data/right_side_TO_libs/df_to_libs_right_all_reads.csv")
df_all_reads %>% head(10) %>% kable() %>% kable_styling()
seqnames | strand | cigar | qwidth | start | end | width | njunc | lib | step |
---|---|---|---|---|---|---|---|---|---|
U00096.3 |
|
79M2S | 81 | 2978946 | 2979024 | 79 | 0 | 0 | 1 |
U00096.3 |
|
6S47M | 53 | 2095079 | 2095125 | 47 | 0 | 0 | 1 |
U00096.3 |
|
81M | 81 | 4277913 | 4277993 | 81 | 0 | 0 | 1 |
U00096.3 |
|
76M | 76 | 3535810 | 3535885 | 76 | 0 | 0 | 1 |
U00096.3 |
|
2S78M | 80 | 317133 | 317210 | 78 | 0 | 0 | 1 |
U00096.3 |
|
51M | 51 | 3059708 | 3059758 | 51 | 0 | 0 | 1 |
U00096.3 |
|
2S30M | 32 | 2095096 | 2095125 | 30 | 0 | 0 | 1 |
U00096.3 |
|
81M | 81 | 3535805 | 3535885 | 81 | 0 | 0 | 1 |
U00096.3 |
|
75M | 75 | 4099474 | 4099548 | 75 | 0 | 0 | 1 |
U00096.3 |
|
34M2S | 36 | 417811 | 417844 | 34 | 0 | 0 | 1 |
We will calculate some statistics about the read lengths at the different steps:
df_all_summary <- df_all_reads %>%
group_by(lib, step) %>%
summarise(n=n(), mean_len = mean(qwidth), median_len = median(qwidth), sd_len = sd(qwidth))
df_all_summary %>% kable() %>% kable_styling()
lib | step | n | mean_len | median_len | sd_len |
---|---|---|---|---|---|
0 | 1 | 37997 | 64.63331 | 76.0 | 19.745319 |
0 | 2 | 4247 | 72.57947 | 76.0 | 8.844884 |
0 | 3 | 73 | 141.80822 | 146.0 | 15.805816 |
1 | 1 | 121301 | 62.81156 | 73.0 | 20.263986 |
1 | 2 | 7931 | 72.45064 | 76.0 | 9.319747 |
1 | 3 | 167 | 139.76048 | 146.0 | 19.442907 |
2 | 1 | 96903 | 63.19703 | 73.0 | 19.821499 |
2 | 2 | 5789 | 72.12731 | 76.0 | 9.766072 |
2 | 3 | 184 | 139.50543 | 145.5 | 20.935497 |
3 | 1 | 75652 | 64.12823 | 75.0 | 19.357990 |
3 | 2 | 4058 | 72.18088 | 76.0 | 9.711455 |
3 | 3 | 99 | 138.00000 | 146.0 | 21.813682 |
And we will view the read lengths for all 4 different libraries at the 3 different mapping steps.
The only notable thing here is that the tiny number of reads aligning to phiX are long, probably because they didn’t contain our pInt_kanR “adaper” sequence.
Finally, let’s look at the fate (number) of our reads across the different mapping steps.
ggplot(df_all_summary, aes(x = lib,y=n, fill = factor(step))) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()
So, the vast majority of the mapped reads mapped to the E. coli genome in step 1.
These mapped reads will be further analyzed in the
right_side_TO_libs_analysis.Rmd
notebook.
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] viridis_0.6.2 viridisLite_0.4.1
## [3] kableExtra_1.3.4 GenomicAlignments_1.34.0
## [5] Rsamtools_2.14.0 Biostrings_2.66.0
## [7] XVector_0.38.0 SummarizedExperiment_1.28.0
## [9] Biobase_2.58.0 MatrixGenerics_1.10.0
## [11] matrixStats_0.63.0 GenomicRanges_1.50.1
## [13] GenomeInfoDb_1.34.9 IRanges_2.32.0
## [15] S4Vectors_0.36.0 BiocGenerics_0.44.0
## [17] forcats_0.5.2 stringr_1.5.0
## [19] dplyr_1.1.0 purrr_0.3.5
## [21] readr_2.1.3 tidyr_1.2.1
## [23] tibble_3.1.8 ggplot2_3.4.0
## [25] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.5.2 bit64_4.0.5
## [4] lubridate_1.9.0 webshot_0.5.4 httr_1.4.4
## [7] tools_4.2.0 backports_1.4.1 bslib_0.4.1
## [10] utf8_1.2.2 R6_2.5.1 DBI_1.1.3
## [13] colorspace_2.0-3 withr_2.5.0 gridExtra_2.3
## [16] tidyselect_1.2.0 bit_4.0.5 compiler_4.2.0
## [19] cli_3.4.1 rvest_1.0.3 xml2_1.3.3
## [22] DelayedArray_0.24.0 labeling_0.4.2 sass_0.4.4
## [25] scales_1.2.1 systemfonts_1.0.4 digest_0.6.30
## [28] svglite_2.1.0 rmarkdown_2.18 pkgconfig_2.0.3
## [31] htmltools_0.5.4 highr_0.9 dbplyr_2.2.1
## [34] fastmap_1.1.0 rlang_1.1.1 readxl_1.4.1
## [37] rstudioapi_0.14 farver_2.1.1 jquerylib_0.1.4
## [40] generics_0.1.3 jsonlite_1.8.3 vroom_1.6.0
## [43] BiocParallel_1.32.4 googlesheets4_1.0.1 RCurl_1.98-1.9
## [46] magrittr_2.0.3 GenomeInfoDbData_1.2.9 Matrix_1.5-3
## [49] munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.3
## [52] stringi_1.7.8 yaml_2.3.6 zlibbioc_1.44.0
## [55] grid_4.2.0 parallel_4.2.0 crayon_1.5.2
## [58] lattice_0.20-45 haven_2.5.1 hms_1.1.2
## [61] knitr_1.41 pillar_1.8.1 codetools_0.2-18
## [64] reprex_2.0.2 glue_1.6.2 evaluate_0.18
## [67] modelr_0.1.10 vctrs_0.5.2 tzdb_0.3.0
## [70] cellranger_1.1.0 gtable_0.3.1 assertthat_0.2.1
## [73] cachem_1.0.6 xfun_0.35 broom_1.0.1
## [76] googledrive_2.0.0 gargle_1.2.1 timechange_0.1.1
## [79] ellipsis_0.3.2