library(tidyverse)
library(GenomicAlignments)
library(kableExtra)

knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)

source("../tools/plotting_tools.R")
theme_set(theme_notebook())

The raw sequencing data is available on the SRA and is associated with this BioProject.

Library Name Run ID and Link Size
oPool_J2 SRR25076304 15 Mb
short_TF_J2 SRR25076302 43 Mb
long_TF_J2 SRR25076300 39 Mb
small_RNA_J2 SRR25076298 30 Mb

This notebook is nearly identical to left_side_TO_libs_processing.Rmd, it just uses the right side sequencing data instead.

Four different ORBIT mutant libraries were constructed using targeting oligo pools. These different deletion libraries were as follows:

Within each library the targeting oligo deletes a target gene, and pInt_kanR should integrate into the newly formed attB site on the genome. This should leave us with a library with the target genes deleted, replaced by the pInt_kanR backbone flanked by the attR and attL sites. To test the accuracy with which these libraries were constructed sequencing libraries were prepared to amplify these genome-pInt junctions. This was done by shearing genomic DNA, adding C tails to free DNA ends and then amplifying junctions with a polyG and pInt specific primer. This notebook processes the raw fastq sequencing files (direct from MiSeq) into a more usable form of reads that are filtered and trimmed to map to the E. coli genome. Then by looking at where these junction reads start (following the pInt sequence) we can determine if the ORBIT mutations were created accurately.

The structure of this data processing is shown below. This processing script lives in the code folder under seq_data_processing and it processes raw data files that live in data/seq_data/right_side_TO_libs. The processing starts from 0_raw_fastq (which houses truly raw data) and proceeds to filter for quality using fastp (1_fastp_output), then the pInt_kanR sequence that is common to all mutants is found and trimmed off along with the polyC tail using cutadapt (2_cutadapt_output). At this stage filtered and trimmed reads should be the flanking E. coli genomic sequences for each mutant, so we map these reads to the E. coli genome with bowtie2 (3_bowtie_ecoli_output). Finally, we examine some of the reads that don’t map by seeing if they still map to the pInt_kanR plasmid even after the previous steps (4_bowtie_pInt_output) and also check if they map to the phiX loading control that we use on the Miseq (bowtie_phiX_output).

After all this, we will convert these various sets of reads into bam files and then into a more usable form for analysis - the final dataframe: df_to_libs_right_all_reads.csv

├── code
│   ├── seq_data_processing
│   │   ├── right_side_TO_libs_processing.Rmd
├── data
│   ├── seq_data
│   │   ├── right_side_TO_libs
│   │   │   ├── 0_raw_fastq
│   │   │   ├── 1_fastp_output
│   │   │   ├── 2_cutadapt_output
│   │   │   ├── 3_bowtie_ecoli_output
│   │   │   ├── 4_bowtie_pInt_output
│   │   │   ├── 5_bowtie_phiX_output
│   │   │   ├── 6_bam_files
│   │   │   ├── df_to_libs_right_all_reads.csv
├── figures

Step 0: Bioinformatics environment setup

This complicated setup is required the first time for m1 macs only, which must run certain programs in an intel like environment. Note that it gets a little funky running bash from rmd, so it may be more straightforward for the user to run bash commands in the terminal or windows equivalent. However, this setup works on multiple machines for the author at the current time.

This setup assumes you already have bash and conda in addition to R/rmd.

source ~/.bash_profile

#**first time running scripts, create conda env**
# use code below when running on *m1 mac* to setup intel like env
# see stack overflow: https://stackoverflow.com/questions/71515117/how-to-set-up-a-conda-osx-64-environment-on-arm-mac
conda create -n bioinfo_env -y

conda activate bioinfo_env

conda config --env --set subdir osx-64

conda install python=3.7 -y

conda install -n bioinfo_env fastp cutadapt bowtie2 samtools starcode --channel bioconda -y

For normal, non-m1 mac conda environment setup:

source ~/.bash_profile
  
#**for normal conda install using this code first time**
conda create -n bioinfo_env fastp cutadapt samtools bowtie2 starcode -y --channel bioconda

With our bioinfo_env setup we are ready to process the data. Note, when running bash scripts in rmd, we have found that we must activate the conda env in each code chunk and also source our bash profile to get things to work correctly.

Step 1: Fastp quality filter

Looking at the raw data, (0_raw_fastq), you can see that these were actually paired end sequencing runs. However, from looking at the data, read2 quality is very low, so we will focus on the higher quality read1 data, which starts from the pInt_kanR backbone and sequences out into the genome (instead of read2 sequencing in from the polyC tail).

Here, we will use the program fastp, with defaults to filter out very low quality reads, and also to get a nice summary of the seq libraries.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/

fastp -i "0_raw_fastq/right-IDT-lib-0_S1_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_0_right_fastp.fastq.gz" -h "1_fastp_output/lib_0_right_report.html" --json /dev/null

fastp -i "0_raw_fastq/right-twist-lib-1_S2_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_1_right_fastp.fastq.gz" -h "1_fastp_output/lib_1_right_report.html" --json /dev/null

fastp -i "0_raw_fastq/right-twist-lib-2_S3_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_2_right_fastp.fastq.gz" -h "1_fastp_output/lib_2_right_report.html" --json /dev/null

fastp -i "0_raw_fastq/right-twist-lib-3_S4_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_3_right_fastp.fastq.gz" -h "1_fastp_output/lib_3_right_report.html" --json /dev/null

Please see the nice html reports in 1_fastp_output. The main takeaway is the total number of reads we are working with and their quality:

For all 4 libraries, >99% of reads pass the quality filter. ~78% of remaining reads are Q30 or above. These total read counts are higher than the left side libraries - please note that this was a MiSeq nano kit.

Step 2: Cutadapt trimming

The next step in the processing pipeline is to trim off the ORBIT plasmid sequence and the polyC tail used in library prep. To do this we will use the ‘linked adapter’ option from cutadapt. I experimented with a few options and this works way better than unlinked. This linked adapter is called with the option -a 5'ADAPTER...3'ADAPTER . Note the polyC tail is specified as 8x C’s C{8}. In looking at the fastq files there are many reads that have way way longer polyC tails, but this cutadapt scheme should trim everything 3’ of the first instance of 8 Cs following the 5’ adapter. Note that the PCR1 reverse primer that is polyG contains 1N, because of ordering issues from sigma. Therefore many of the polyC tails are not perfect. For this reason and because the ORBIT adapter needs to be robustly found, I increased the error tolerance from the standard 10% to 20% (e 0.2). This means that for the polyC tail only 1-2 nt cannot be C.

After trimming adapter the remaining genomic regions will be mapped to the E. coli genome, so I’m also passing a length cutoff of at least 20bp for the trimmed read -m 20. A read shorter than 20 bp will likely not be unique and probably shouldn’t be trusted. This minimum length could be increased later to be more stringent.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/

cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_0_right_fastp.fastq.gz -o 2_cutadapt_output/lib_0_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_0_right_cutadpt_report.tsv

cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_1_right_fastp.fastq.gz -o 2_cutadapt_output/lib_1_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_1_right_cutadpt_report.tsv

cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_2_right_fastp.fastq.gz -o 2_cutadapt_output/lib_2_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_2_right_cutadpt_report.tsv

cutadapt -a "gaagcagctccagcctacacggtttgtaccgtacaccactgagaccgccgtcgtcgacaagcc;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_3_right_fastp.fastq.gz -o 2_cutadapt_output/lib_3_right_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_3_right_cutadpt_report.tsv

Now let’s look at the output reports and make sure the trimming worked as expected.

cutadapt_path <- '../../data/seq_data/right_side_TO_libs/2_cutadapt_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_right_cutadpt_report.tsv')
cut_lib_1 <- paste0(cutadapt_path,'lib_1_right_cutadpt_report.tsv')
cut_lib_2 <- paste0(cutadapt_path,'lib_2_right_cutadpt_report.tsv')
cut_lib_3 <- paste0(cutadapt_path,'lib_3_right_cutadpt_report.tsv')

df_cutadapt <- bind_rows(read_delim(cut_lib_0, delim = '\t', skip = length(readLines(cut_lib_0))-2 ) %>% mutate(lib=0),
          read_delim(cut_lib_1, delim = '\t', skip = length(readLines(cut_lib_1))-2 )%>% mutate(lib=1),
          read_delim(cut_lib_2, delim = '\t', skip = length(readLines(cut_lib_2))-2 )%>% mutate(lib=2),
          read_delim(cut_lib_3, delim = '\t', skip = length(readLines(cut_lib_3))-2 )%>% mutate(lib=3)) %>% 
  mutate(frac_in_out = out_reads / in_reads, frac_w_adapt = `w/adapters` / in_reads, frac_too_short = too_short/in_reads)

df_cutadapt %>% kable() %>% kable_styling()
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp lib frac_in_out frac_w_adapt frac_too_short
OK 82409 12409587 25179 0 0 48106 73285 0 2986944 0 0.5837469 0.8892839 0.3055370
OK 237630 35786928 73529 0 0 144645 218174 0 8785203 1 0.6086984 0.9181248 0.3094264
OK 212310 31950558 72768 0 0 117327 190095 0 7097312 2 0.5526212 0.8953653 0.3427441
OK 159308 23978667 55452 0 0 90405 145857 0 5572493 3 0.5674856 0.9155661 0.3480804

Ok, so overall 55-60% of the input reads are being trimmed and output. About 90% have the “adapter” sequences, but a significant fraction of reads are below the minimum 20bp length after trimming and are therefore considered too short and not output (30-35%). This is likely just a function of the size distribution of the library. Let’s look at this graphically:

df_cutadapt_summary <- df_cutadapt %>% 
  mutate(w_o_adapters = in_reads-`w/adapters`) %>% 
  pivot_longer(c(too_short, out_reads, w_o_adapters), values_to = 'reads', names_to = 'fraction')
  

#ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="stack", stat="identity") + scale_fill_viridis_d()

ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

Here is the largest bottleneck in the entire pipeline, where we lose a significant portion of the reads due to them being too short following adapter trimming (too_short. This is likely because of the library prep strategy, which caused some fragments to have long c tails and shorter genomic fragments. However, the remaining reads now seem to be ready to go to answer our original question: how accurately are ORBIT mutants constructed. As a reminder these remaining reads (out_reads above) passed sequencing quality filters, they have the pInt_kanR backbone sequence, and they have at least 20bp of putative genomic sequence that should allow us to map back to the original ORBIT target genes.

Step 3: Bowtie alignment

Mapping to E. coli genome

The next step is to align the trimmed reads to the E. coli genome. This should be an unbiased way to see how well the sequencing reads from our mutant library actually match the intended ORBIT deletions. I’m using the --very-sensitive-local mapping option. Earlier tests indicated that this gave the most reads mapping and manual inspection indicated that it was mapping correctly. This local option means only part of the read needs to align, not the whole thing. This may not be optimal, but seems necessary because some reads have remnants of the c-tail still present in read 1, which seemed to limit the standard alignment. The E. coli ref genome is built with the genome fasta file MG1655_U000963.fasta by simply calling bowtie build MG1655_U000963.fasta ecoli_ref_genome . Since we are only using the higher quality / more informative read1, I am using the -U option, signifying that all the input reads (from cutadapt) are unpaired. Bowtie will try to map the reads to the genome and save them as a sam file (-S option). I am also capturing the unmapped reads with the --un option, which will go to a different sam file. Finally, with the 2>file.txt option, I’m taking the standard metrics bowtie outputs to the console and writing them to a text file, so we can keep track of the fate of each libraries reads with real numbers.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/3_bowtie_ecoli_output/

bowtie2-build --quiet MG1655_U000963.fasta ecoli_ref_genome 

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_0_right_cutadpt.fastq.gz -S lib_0_right_mapped.sam --un lib_0_right_unmapped.sam 2> lib_0_right_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_1_right_cutadpt.fastq.gz -S lib_1_right_mapped.sam --un lib_1_right_unmapped.sam 2> lib_1_right_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_2_right_cutadpt.fastq.gz -S lib_2_right_mapped.sam --un lib_2_right_unmapped.sam 2> lib_2_right_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_3_right_cutadpt.fastq.gz -S lib_3_right_mapped.sam --un lib_3_right_unmapped.sam 2> lib_3_right_bowtie.txt

Now let’s look at the bowtie metrics text files to see how the alignments went.

ecoli_bowtie_path <- '../../data/seq_data/right_side_TO_libs/3_bowtie_ecoli_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_right_cutadpt_report.tsv')

vec0 <- readLines(paste0(ecoli_bowtie_path,"lib_0_right_bowtie.txt"))
vec1 <- readLines(paste0(ecoli_bowtie_path,"lib_1_right_bowtie.txt"))
vec2 <- readLines(paste0(ecoli_bowtie_path,"lib_2_right_bowtie.txt"))
vec3 <- readLines(paste0(ecoli_bowtie_path,"lib_3_right_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_0 <- parse_number(vec0) %>% setNames(names)
bowtie_1 <- parse_number(vec1) %>% setNames(names)
bowtie_2 <- parse_number(vec2) %>% setNames(names)
bowtie_3 <- parse_number(vec3) %>% setNames(names)

df_bowtie_map_1 <- bind_rows(bowtie_0, bowtie_1,bowtie_2,bowtie_3)

df_bowtie_map_1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
48106 48106 10109 37866 131 78.99
144645 144645 23344 109243 12058 83.86
117327 117327 20424 96711 192 82.59
90405 90405 14753 65988 9664 83.68

For each of the libraries, the overall alignment percentage is 79-83% giving us 10s of thousands of mapped reads for each library. This also means that 17-21% of reads did not align, so we will briefly explore two explanations for what those reads could be.

Evaluate unmapped reads

Mapping to plasmid concatemers

This cell will look complicated, but basically we just have to take the unmapped reads as a sam file (default output) and convert them back into a fastq file, so that we can use them as input for this next bowtie mapping step.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/4_bowtie_pInt_output/

bowtie2-build --quiet pInt_kanR_half.fasta pInt_kanR_half

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_0_right_unmapped.sam -S lib_0_right_pIntKan_mapped.sam --un lib_0_right_pIntKan_unmapped.sam 2>lib_0_right_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_1_right_unmapped.sam -S lib_1_right_pIntKan_mapped.sam --un lib_1_right_pIntKan_unmapped.sam 2>lib_1_right_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_2_right_unmapped.sam -S lib_2_right_pIntKan_mapped.sam --un lib_2_right_pIntKan_unmapped.sam 2>lib_2_right_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_3_right_unmapped.sam -S lib_3_right_pIntKan_mapped.sam --un lib_3_right_pIntKan_unmapped.sam 2>lib_3_right_pInt_bowtie.txt

Again, let’s check the mapping statistics.

pInt_bowtie_path <- '../../data/seq_data/right_side_TO_libs/4_bowtie_pInt_output/'


vec_p0 <- readLines(paste0(pInt_bowtie_path,"lib_0_right_pInt_bowtie.txt"))
vec_p1 <- readLines(paste0(pInt_bowtie_path,"lib_1_right_pInt_bowtie.txt"))
vec_p2 <- readLines(paste0(pInt_bowtie_path,"lib_2_right_pInt_bowtie.txt"))
vec_p3 <- readLines(paste0(pInt_bowtie_path,"lib_3_right_pInt_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_p0 <- parse_number(vec_p0) %>% setNames(names)
bowtie_p1 <- parse_number(vec_p1) %>% setNames(names)
bowtie_p2 <- parse_number(vec_p2) %>% setNames(names)
bowtie_p3 <- parse_number(vec_p3) %>% setNames(names)

df_bowtie_map_p1 <- bind_rows(bowtie_p0, bowtie_p1,bowtie_p2,bowtie_p3)

df_bowtie_map_p1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
10109 10109 5862 4247 0 42.01
23344 23344 15413 7931 0 33.97
20424 20424 14635 5789 0 28.34
14753 14753 10695 4058 0 27.51

So, some reads do map to the backbone of pInt_kanR suggesting they could be plasmid concatemers or other adapter / backbone sequences.

mapping to phiX

Finally, we will check to see if any of our reads map to the phiX control library that was run alongside the sequencing libraries (10% final concentration)

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/5_bowtie_phiX_output/


bowtie2-build --quiet phiX174.fasta phiX174

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_0_right_pIntKan_unmapped.sam -S lib_0_right_phiX_mapped.sam --un lib_0_right_phiX_unmapped.sam 2>lib_0_right_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_1_right_pIntKan_unmapped.sam -S lib_1_right_phiX_mapped.sam --un lib_1_right_phiX_unmapped.sam 2>lib_1_right_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_2_right_pIntKan_unmapped.sam -S lib_2_right_phiX_mapped.sam --un lib_2_right_phiX_unmapped.sam 2>lib_2_right_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_3_right_pIntKan_unmapped.sam -S lib_3_right_phiX_mapped.sam --un lib_3_right_phiX_unmapped.sam 2>lib_3_right_phiX_bowtie.txt

Let’s look at metrics:

phiX_bowtie_path <- '../../data/seq_data/right_side_TO_libs/5_bowtie_phiX_output/'


vec_ph0 <- readLines(paste0(phiX_bowtie_path,"lib_0_right_phiX_bowtie.txt"))
vec_ph1 <- readLines(paste0(phiX_bowtie_path,"lib_1_right_phiX_bowtie.txt"))
vec_ph2 <- readLines(paste0(phiX_bowtie_path,"lib_2_right_phiX_bowtie.txt"))
vec_ph3 <- readLines(paste0(phiX_bowtie_path,"lib_3_right_phiX_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_ph0 <- parse_number(vec_ph0) %>% setNames(names)
bowtie_ph1 <- parse_number(vec_ph1) %>% setNames(names)
bowtie_ph2 <- parse_number(vec_ph2) %>% setNames(names)
bowtie_ph3 <- parse_number(vec_ph3) %>% setNames(names)

df_bowtie_map_ph1 <- bind_rows(bowtie_ph0, bowtie_ph1,bowtie_ph2,bowtie_ph3)

df_bowtie_map_ph1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
5862 5862 5789 72 1 1.25
15413 15413 15246 166 1 1.08
14635 14635 14451 181 3 1.26
10695 10695 10596 99 0 0.93

Very few reads align to phiX.

Output mapped reads

Let’s combine all of the mapped reads into a csv and look at a quick overview. We will convert all the files to bam for reading into R.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/right_side_TO_libs/6_bam_files/

samtools view -bh ../3_bowtie_ecoli_output/lib_0_right_mapped.sam > lib_0_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_1_right_mapped.sam > lib_1_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_2_right_mapped.sam > lib_2_right_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_3_right_mapped.sam > lib_3_right_mapped.bam

samtools view -bh ../4_bowtie_pInt_output/lib_0_right_pIntKan_mapped.sam > lib_0_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_1_right_pIntKan_mapped.sam > lib_1_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_2_right_pIntKan_mapped.sam > lib_2_right_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_3_right_pIntKan_mapped.sam > lib_3_right_pIntKan_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_right_phiX_mapped.sam > lib_0_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_right_phiX_mapped.sam > lib_1_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_right_phiX_mapped.sam > lib_2_right_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_right_phiX_mapped.sam > lib_3_right_phiX_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_right_phiX_unmapped.sam > lib_0_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_right_phiX_unmapped.sam > lib_1_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_right_phiX_unmapped.sam > lib_2_right_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_right_phiX_unmapped.sam > lib_3_right_phiX_unmapped.bam

Now we can import the bam files.

bam_path <- '../../data/seq_data/right_side_TO_libs/6_bam_files/'

df_bam_0_1 <- readGAlignments(file = paste0(bam_path,"lib_0_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 1)
df_bam_1_1 <- readGAlignments(file = paste0(bam_path,"lib_1_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 1)
df_bam_2_1 <- readGAlignments(file = paste0(bam_path,"lib_2_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 1)
df_bam_3_1 <- readGAlignments(file = paste0(bam_path,"lib_3_right_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 1)

df_bam_0_2 <- readGAlignments(file = paste0(bam_path,"lib_0_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 2)
df_bam_1_2 <- readGAlignments(file = paste0(bam_path,"lib_1_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 2)
df_bam_2_2 <- readGAlignments(file = paste0(bam_path,"lib_2_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 2)
df_bam_3_2 <- readGAlignments(file = paste0(bam_path,"lib_3_right_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 2)

df_bam_0_3 <- readGAlignments(file = paste0(bam_path,"lib_0_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 3)
df_bam_1_3 <- readGAlignments(file = paste0(bam_path,"lib_1_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 3)
df_bam_2_3 <- readGAlignments(file = paste0(bam_path,"lib_2_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 3)
df_bam_3_3 <- readGAlignments(file = paste0(bam_path,"lib_3_right_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 3)

We can combine all these reads from the different steps into one dataframe and save a csv.

df_all_reads <- bind_rows(df_bam_0_1,df_bam_1_1,df_bam_2_1,df_bam_3_1,
          df_bam_0_2,df_bam_1_2,df_bam_2_2,df_bam_3_2,
          df_bam_0_3,df_bam_1_3,df_bam_2_3,df_bam_3_3)

write_csv(df_all_reads, "../../data/seq_data/right_side_TO_libs/df_to_libs_right_all_reads.csv")

df_all_reads %>% head(10) %>% kable() %>% kable_styling()
seqnames strand cigar qwidth start end width njunc lib step
U00096.3
79M2S 81 2978946 2979024 79 0 0 1
U00096.3
6S47M 53 2095079 2095125 47 0 0 1
U00096.3
81M 81 4277913 4277993 81 0 0 1
U00096.3
76M 76 3535810 3535885 76 0 0 1
U00096.3
2S78M 80 317133 317210 78 0 0 1
U00096.3
51M 51 3059708 3059758 51 0 0 1
U00096.3
2S30M 32 2095096 2095125 30 0 0 1
U00096.3
81M 81 3535805 3535885 81 0 0 1
U00096.3
75M 75 4099474 4099548 75 0 0 1
U00096.3
34M2S 36 417811 417844 34 0 0 1

We will calculate some statistics about the read lengths at the different steps:

df_all_summary <- df_all_reads %>% 
  group_by(lib, step) %>% 
  summarise(n=n(), mean_len = mean(qwidth), median_len = median(qwidth), sd_len = sd(qwidth))

df_all_summary %>% kable() %>% kable_styling()
lib step n mean_len median_len sd_len
0 1 37997 64.63331 76.0 19.745319
0 2 4247 72.57947 76.0 8.844884
0 3 73 141.80822 146.0 15.805816
1 1 121301 62.81156 73.0 20.263986
1 2 7931 72.45064 76.0 9.319747
1 3 167 139.76048 146.0 19.442907
2 1 96903 63.19703 73.0 19.821499
2 2 5789 72.12731 76.0 9.766072
2 3 184 139.50543 145.5 20.935497
3 1 75652 64.12823 75.0 19.357990
3 2 4058 72.18088 76.0 9.711455
3 3 99 138.00000 146.0 21.813682

And we will view the read lengths for all 4 different libraries at the 3 different mapping steps.

ggplot(df_all_reads, aes(x = factor(step),y=qwidth)) + geom_boxplot() + facet_wrap(~lib)

The only notable thing here is that the tiny number of reads aligning to phiX are long, probably because they didn’t contain our pInt_kanR “adaper” sequence.

Finally, let’s look at the fate (number) of our reads across the different mapping steps.

ggplot(df_all_summary, aes(x = lib,y=n, fill = factor(step))) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

So, the vast majority of the mapped reads mapped to the E. coli genome in step 1.

These mapped reads will be further analyzed in the right_side_TO_libs_analysis.Rmd notebook.

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.6.2               viridisLite_0.4.1          
##  [3] kableExtra_1.3.4            GenomicAlignments_1.34.0   
##  [5] Rsamtools_2.14.0            Biostrings_2.66.0          
##  [7] XVector_0.38.0              SummarizedExperiment_1.28.0
##  [9] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [11] matrixStats_0.63.0          GenomicRanges_1.50.1       
## [13] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [15] S4Vectors_0.36.0            BiocGenerics_0.44.0        
## [17] forcats_0.5.2               stringr_1.5.0              
## [19] dplyr_1.1.0                 purrr_0.3.5                
## [21] readr_2.1.3                 tidyr_1.2.1                
## [23] tibble_3.1.8                ggplot2_3.4.0              
## [25] tidyverse_1.3.2            
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           fs_1.5.2               bit64_4.0.5           
##  [4] lubridate_1.9.0        webshot_0.5.4          httr_1.4.4            
##  [7] tools_4.2.0            backports_1.4.1        bslib_0.4.1           
## [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.3             
## [13] colorspace_2.0-3       withr_2.5.0            gridExtra_2.3         
## [16] tidyselect_1.2.0       bit_4.0.5              compiler_4.2.0        
## [19] cli_3.4.1              rvest_1.0.3            xml2_1.3.3            
## [22] DelayedArray_0.24.0    labeling_0.4.2         sass_0.4.4            
## [25] scales_1.2.1           systemfonts_1.0.4      digest_0.6.30         
## [28] svglite_2.1.0          rmarkdown_2.18         pkgconfig_2.0.3       
## [31] htmltools_0.5.4        highr_0.9              dbplyr_2.2.1          
## [34] fastmap_1.1.0          rlang_1.1.1            readxl_1.4.1          
## [37] rstudioapi_0.14        farver_2.1.1           jquerylib_0.1.4       
## [40] generics_0.1.3         jsonlite_1.8.3         vroom_1.6.0           
## [43] BiocParallel_1.32.4    googlesheets4_1.0.1    RCurl_1.98-1.9        
## [46] magrittr_2.0.3         GenomeInfoDbData_1.2.9 Matrix_1.5-3          
## [49] munsell_0.5.0          fansi_1.0.3            lifecycle_1.0.3       
## [52] stringi_1.7.8          yaml_2.3.6             zlibbioc_1.44.0       
## [55] grid_4.2.0             parallel_4.2.0         crayon_1.5.2          
## [58] lattice_0.20-45        haven_2.5.1            hms_1.1.2             
## [61] knitr_1.41             pillar_1.8.1           codetools_0.2-18      
## [64] reprex_2.0.2           glue_1.6.2             evaluate_0.18         
## [67] modelr_0.1.10          vctrs_0.5.2            tzdb_0.3.0            
## [70] cellranger_1.1.0       gtable_0.3.1           assertthat_0.2.1      
## [73] cachem_1.0.6           xfun_0.35              broom_1.0.1           
## [76] googledrive_2.0.0      gargle_1.2.1           timechange_0.1.1      
## [79] ellipsis_0.3.2