library(tidyverse)
library(GenomicAlignments)
library(kableExtra)

knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)

source("../tools/plotting_tools.R")
theme_set(theme_notebook())

The raw sequencing data is available on the SRA and is associated with this BioProject.

Library Name Run ID and Link Size
oPool_J1 SRR25076307 8.3 Mb
short_TF_J1 SRR25076303 25 Mb
long_TF_J1 SRR25076301 16 Mb
small_RNA_J1 SRR25076299 18 Mb

Four different ORBIT mutant libraries were constructed using targeting oligo pools. These different deletion libraries were as follows:

Within each library the targeting oligo deletes a target gene, and pInt_kanR should integrate into the newly formed attB site on the genome. This should leave us with a library with the target genes deleted, replaced by the pInt_kanR backbone flanked by the attR and attL sites. To test the accuracy with which these libraries were constructed sequencing libraries were prepared to amplify these genome-pInt junctions. This was done by shearing genomic DNA, adding C tails to free DNA ends and then amplifying junctions with a polyG and pInt specific primer. This notebook processes the raw fastq sequencing files (direct from MiSeq) into a more usable form of reads that are filtered and trimmed to map to the E. coli genome. Then by looking at where these junction reads start (following the pInt sequence) we can determine if the ORBIT mutations were created accurately.

The structure of this data processing is shown below. This processing script lives in the code folder under seq_data_processing and it processes raw data files that live in data/seq_data/left_side_TO_libs. The processing starts from 0_raw_fastq (which houses truly raw data) and proceeds to filter for quality using fastp (1_fastp_output), then the pInt_kanR sequence that is common to all mutants is found and trimmed off along with the polyC tail using cutadapt (2_cutadapt_output). At this stage filtered and trimmed reads should be the flanking E. coli genomic sequences for each mutant, so we map these reads to the E. coli genome with bowtie2 (3_bowtie_ecoli_output). Finally, we examine some of the reads that don’t map by seeing if they still map to the pInt_kanR plasmid even after the previous steps (4_bowtie_pInt_output) and also check if they map to the phiX loading control that we use on the Miseq (bowtie_phiX_output).

After all this, we will convert these various sets of reads into bam files and then into a more usable form for analysis - the final dataframe: df_to_libs_left_all_reads.csv

├── code
│   ├── seq_data_processing
│   │   ├── left_side_TO_libs_processing.Rmd
├── data
│   ├── seq_data
│   │   ├── left_side_TO_libs
│   │   │   ├── 0_raw_fastq
│   │   │   ├── 1_fastp_output
│   │   │   ├── 2_cutadapt_output
│   │   │   ├── 3_bowtie_ecoli_output
│   │   │   ├── 4_bowtie_pInt_output
│   │   │   ├── 5_bowtie_phiX_output
│   │   │   ├── 6_bam_files
│   │   │   ├── df_to_libs_left_all_reads.csv
├── figures

Step 0: Bioinformatics environment setup

This complicated setup is required the first time for m1 macs only, which must run certain programs in an intel like environment. Note that it gets a little funky running bash from rmd, so it may be more straightforward for the user to run bash commands in the terminal or windows equivalent. However, this setup works on multiple machines for the author at the current time.

This setup assumes you already have bash and conda in addition to R/rmd.

source ~/.bash_profile

#**first time running scripts, create conda env**
# use code below when running on *m1 mac* to setup intel like env
# see stack overflow: https://stackoverflow.com/questions/71515117/how-to-set-up-a-conda-osx-64-environment-on-arm-mac
conda create -n bioinfo_env -y

conda activate bioinfo_env

conda config --env --set subdir osx-64

conda install python=3.7 -y

conda install -n bioinfo_env fastp cutadapt bowtie2 samtools starcode --channel bioconda -y

For normal, non-m1 mac conda environment setup:

source ~/.bash_profile
  
#**for normal conda install using this code first time**
conda create -n bioinfo_env fastp cutadapt samtools bowtie2 starcode -y --channel bioconda

With our bioinfo_env setup we are ready to process the data. Note, when running bash scripts in rmd, we have found that we must activate the conda env in each code chunk and also source our bash profile to get things to work correctly.

Step 1: Fastp quality filter

Looking at the raw data, (0_raw_fastq), you can see that these were actually paired end sequencing runs. However, from looking at the data, read2 quality is very low, so we will focus on the higher quality read1 data, which starts from the pInt_kanR backbone and sequences out into the genome (instead of read2 sequencing in from the polyC tail).

Here, we will use the program fastp, with defaults to filter out very low quality reads, and also to get a nice summary of the seq libraries.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/

fastp -i "0_raw_fastq/lib-0-IDToPool_S1_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_0_left_fastp.fastq.gz" -h "1_fastp_output/lib_0_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-1-twist-TFavdovlp-short_S2_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_1_left_fastp.fastq.gz" -h "1_fastp_output/lib_1_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-2-twist-TFavdovlp-long_S3_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_2_left_fastp.fastq.gz" -h "1_fastp_output/lib_2_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-3-twist-sRNA_S4_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_3_left_fastp.fastq.gz" -h "1_fastp_output/lib_3_left_report.html" --json /dev/null

Please see the nice html reports in 1_fastp_output. The main takeaway is the total number of reads we are working with and their quality:

For all 4 libraries, >98% of reads pass the quality filter. ~80% of remaining reads are Q30 or above. These total read counts may seem low, please note that this was a MiSeq nano kit and the loading was not optimal. However, these read counts are perfectly sufficient to quantify 400 mutants.

Step 2: Cutadapt trimming

The next step in the processing pipeline is to trim off the ORBIT plasmid sequence and the polyC tail used in library prep. To do this we will use the ‘linked adapter’ option from cutadapt. I experimented with a few options and this works way better than unlinked. This linked adapter is called with the option -a 5'ADAPTER...3'ADAPTER . Note the polyC tail is specified as 8x C’s C{8}. In looking at the fastq files there are many reads that have way way longer polyC tails, but this cutadapt scheme should trim everything 3’ of the first instance of 8 Cs following the 5’ adapter. Note that the PCR1 reverse primer that is polyG contains 1N, because of ordering issues from sigma. Therefore many of the polyC tails are not perfect. For this reason and because the ORBIT adapter needs to be robustly found, I increased the error tolerance from the standard 10% to 20% (e 0.2). This means that for the polyC tail only 1-2 nt cannot be C.

After trimming adapter the remaining genomic regions will be mapped to the E. coli genome, so I’m also passing a length cutoff of at least 20bp for the trimmed read -m 20. A read shorter than 20 bp will likely not be unique and probably shouldn’t be trusted. This minimum length could be increased later to be more stringent.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_0_left_fastp.fastq.gz -o 2_cutadapt_output/lib_0_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_0_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_1_left_fastp.fastq.gz -o 2_cutadapt_output/lib_1_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_1_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_2_left_fastp.fastq.gz -o 2_cutadapt_output/lib_2_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_2_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_3_left_fastp.fastq.gz -o 2_cutadapt_output/lib_3_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_3_left_cutadpt_report.tsv

Now let’s look at the output reports and make sure the trimming worked as expected.

cutadapt_path <- '../../data/seq_data/left_side_TO_libs/2_cutadapt_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_left_cutadpt_report.tsv')
cut_lib_1 <- paste0(cutadapt_path,'lib_1_left_cutadpt_report.tsv')
cut_lib_2 <- paste0(cutadapt_path,'lib_2_left_cutadpt_report.tsv')
cut_lib_3 <- paste0(cutadapt_path,'lib_3_left_cutadpt_report.tsv')

df_cutadapt <- bind_rows(read_delim(cut_lib_0, delim = '\t', skip = length(readLines(cut_lib_0))-2 ) %>% mutate(lib=0),
          read_delim(cut_lib_1, delim = '\t', skip = length(readLines(cut_lib_1))-2 )%>% mutate(lib=1),
          read_delim(cut_lib_2, delim = '\t', skip = length(readLines(cut_lib_2))-2 )%>% mutate(lib=2),
          read_delim(cut_lib_3, delim = '\t', skip = length(readLines(cut_lib_3))-2 )%>% mutate(lib=3)) %>% 
  mutate(frac_in_out = out_reads / in_reads, frac_w_adapt = `w/adapters` / in_reads, frac_too_short = too_short/in_reads)

df_cutadapt %>% kable() %>% kable_styling()
status in_reads in_bp too_short too_long too_many_n out_reads w/adapters qualtrim_bp out_bp lib frac_in_out frac_w_adapt frac_too_short
OK 44316 6667590 9317 0 0 32442 41759 0 1868044 0 0.7320607 0.9423007 0.2102401
OK 137369 20646810 41055 0 0 88411 129466 0 4916920 1 0.6436023 0.9424688 0.2988666
OK 87525 13152982 22949 0 0 59369 82318 0 3490009 2 0.6783091 0.9405084 0.2621994
OK 94334 14179489 26121 0 0 62844 88965 0 3806581 3 0.6661861 0.9430852 0.2768991

Ok, so overall 64-73% of the input reads are being trimmed and output. About 94% have the “adapter” sequences, but a significant fraction of reads are below the minimum 20bp length after trimming and are therefore considered too short and not output (21-29%). This is likely just a function of the size distribution of the library. Let’s look at this graphically:

df_cutadapt_summary <- df_cutadapt %>% 
  mutate(w_o_adapters = in_reads-`w/adapters`) %>% 
  pivot_longer(c(too_short, out_reads, w_o_adapters), values_to = 'reads', names_to = 'fraction')
  

#ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="stack", stat="identity") + scale_fill_viridis_d()

ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

Here is the largest bottleneck in the entire pipeline, where we lose a significant portion of the reads due to them being too short following adapter trimming (too_short. This is likely because of the library prep strategy, which caused some fragments to have long c tails and shorter genomic fragments. However, the remaining reads now seem to be ready to go to answer our original question: how accurately are ORBIT mutants constructed. As a reminder these remaining reads (out_reads above) passed sequencing quality filters, they have the pInt_kanR backbone sequence, and they have at least 20bp of putative genomic sequence that should allow us to map back to the original ORBIT target genes.

Step 3: Bowtie alignment

Mapping to E. coli genome

The next step is to align the trimmed reads to the E. coli genome. This should be an unbiased way to see how well the sequencing reads from our mutant library actually match the intended ORBIT deletions. I’m using the --very-sensitive-local mapping option. Earlier tests indicated that this gave the most reads mapping and manual inspection indicated that it was mapping correctly. This local option means only part of the read needs to align, not the whole thing. This may not be optimal, but seems necessary because some reads have remnants of the c-tail still present in read 1, which seemed to limit the standard alignment. The E. coli ref genome is built with the genome fasta file MG1655_U000963.fasta by simply calling bowtie build MG1655_U000963.fasta ecoli_ref_genome . Since we are only using the higher quality / more informative read1, I am using the -U option, signifying that all the input reads (from cutadapt) are unpaired. Bowtie will try to map the reads to the genome and save them as a sam file (-S option). I am also capturing the unmapped reads with the --un option, which will go to a different sam file. Finally, with the 2>file.txt option, I’m taking the standard metrics bowtie outputs to the console and writing them to a text file, so we can keep track of the fate of each libraries reads with real numbers.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/3_bowtie_ecoli_output/

bowtie2-build --quiet MG1655_U000963.fasta ecoli_ref_genome 

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_0_left_cutadpt.fastq.gz -S lib_0_left_mapped.sam --un lib_0_left_unmapped.sam 2> lib_0_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_1_left_cutadpt.fastq.gz -S lib_1_left_mapped.sam --un lib_1_left_unmapped.sam 2> lib_1_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_2_left_cutadpt.fastq.gz -S lib_2_left_mapped.sam --un lib_2_left_unmapped.sam 2> lib_2_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_3_left_cutadpt.fastq.gz -S lib_3_left_mapped.sam --un lib_3_left_unmapped.sam 2> lib_3_left_bowtie.txt

Now let’s look at the bowtie metrics text files to see how the alignments went.

ecoli_bowtie_path <- '../../data/seq_data/left_side_TO_libs/3_bowtie_ecoli_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_left_cutadpt_report.tsv')

vec0 <- readLines(paste0(ecoli_bowtie_path,"lib_0_left_bowtie.txt"))
vec1 <- readLines(paste0(ecoli_bowtie_path,"lib_1_left_bowtie.txt"))
vec2 <- readLines(paste0(ecoli_bowtie_path,"lib_2_left_bowtie.txt"))
vec3 <- readLines(paste0(ecoli_bowtie_path,"lib_3_left_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_0 <- parse_number(vec0) %>% setNames(names)
bowtie_1 <- parse_number(vec1) %>% setNames(names)
bowtie_2 <- parse_number(vec2) %>% setNames(names)
bowtie_3 <- parse_number(vec3) %>% setNames(names)

df_bowtie_map_1 <- bind_rows(bowtie_0, bowtie_1,bowtie_2,bowtie_3)

df_bowtie_map_1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
32442 32442 5428 24854 2160 83.27
88411 88411 14870 73288 253 83.18
59369 59369 8664 49375 1330 85.41
62844 62844 9271 45850 7723 85.25

For each of the libraries, the overall alignment percentage is 83-85% giving us 10s of thousands of mapped reads for each library. This also means that 15-17% of reads did not align, so we will briefly explore two explanations for what those reads could be.

Evaluate unmapped reads

Mapping to plasmid concatemers

This cell will look complicated, but basically we just have to take the unmapped reads as a sam file (default output) and convert them back into a fastq file, so that we can use them as input for this next bowtie mapping step.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/4_bowtie_pInt_output/

#bowtie2-build --quiet pInt_kanR_half.fasta pInt_kanR_half

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_0_left_unmapped.sam -S lib_0_left_pIntKan_mapped.sam --un lib_0_left_pIntKan_unmapped.sam 2>lib_0_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_1_left_unmapped.sam -S lib_1_left_pIntKan_mapped.sam --un lib_1_left_pIntKan_unmapped.sam 2>lib_1_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_2_left_unmapped.sam -S lib_2_left_pIntKan_mapped.sam --un lib_2_left_pIntKan_unmapped.sam 2>lib_2_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_3_left_unmapped.sam -S lib_3_left_pIntKan_mapped.sam --un lib_3_left_pIntKan_unmapped.sam 2>lib_3_left_pInt_bowtie.txt

Again, let’s check the mapping statistics.

pInt_bowtie_path <- '../../data/seq_data/left_side_TO_libs/4_bowtie_pInt_output/'


vec_p0 <- readLines(paste0(pInt_bowtie_path,"lib_0_left_pInt_bowtie.txt"))
vec_p1 <- readLines(paste0(pInt_bowtie_path,"lib_1_left_pInt_bowtie.txt"))
vec_p2 <- readLines(paste0(pInt_bowtie_path,"lib_2_left_pInt_bowtie.txt"))
vec_p3 <- readLines(paste0(pInt_bowtie_path,"lib_3_left_pInt_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_p0 <- parse_number(vec_p0) %>% setNames(names)
bowtie_p1 <- parse_number(vec_p1) %>% setNames(names)
bowtie_p2 <- parse_number(vec_p2) %>% setNames(names)
bowtie_p3 <- parse_number(vec_p3) %>% setNames(names)

df_bowtie_map_p1 <- bind_rows(bowtie_p0, bowtie_p1,bowtie_p2,bowtie_p3)

df_bowtie_map_p1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
5428 5428 2880 2548 0 46.94
14870 14870 10186 4681 3 31.50
8664 8664 5930 2734 0 31.56
9271 9271 6731 2540 0 27.40

So, some reads do map to the backbone of pInt_kanR suggesting they could be plasmid concatemers or other adapter / backbone sequences.

mapping to phiX

Finally, we will check to see if any of our reads map to the phiX control library that was run alongside the sequencing libraries (10% final concentration)

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/5_bowtie_phiX_output/


bowtie2-build --quiet phiX174.fasta phiX174

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_0_left_pIntKan_unmapped.sam -S lib_0_left_phiX_mapped.sam --un lib_0_left_phiX_unmapped.sam 2>lib_0_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_1_left_pIntKan_unmapped.sam -S lib_1_left_phiX_mapped.sam --un lib_1_left_phiX_unmapped.sam 2>lib_1_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_2_left_pIntKan_unmapped.sam -S lib_2_left_phiX_mapped.sam --un lib_2_left_phiX_unmapped.sam 2>lib_2_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_3_left_pIntKan_unmapped.sam -S lib_3_left_phiX_mapped.sam --un lib_3_left_phiX_unmapped.sam 2>lib_3_left_phiX_bowtie.txt

Let’s look at metrics:

phiX_bowtie_path <- '../../data/seq_data/left_side_TO_libs/5_bowtie_phiX_output/'


vec_ph0 <- readLines(paste0(phiX_bowtie_path,"lib_0_left_phiX_bowtie.txt"))
vec_ph1 <- readLines(paste0(phiX_bowtie_path,"lib_1_left_phiX_bowtie.txt"))
vec_ph2 <- readLines(paste0(phiX_bowtie_path,"lib_2_left_phiX_bowtie.txt"))
vec_ph3 <- readLines(paste0(phiX_bowtie_path,"lib_3_left_phiX_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_ph0 <- parse_number(vec_ph0) %>% setNames(names)
bowtie_ph1 <- parse_number(vec_ph1) %>% setNames(names)
bowtie_ph2 <- parse_number(vec_ph2) %>% setNames(names)
bowtie_ph3 <- parse_number(vec_ph3) %>% setNames(names)

df_bowtie_map_ph1 <- bind_rows(bowtie_ph0, bowtie_ph1,bowtie_ph2,bowtie_ph3)

df_bowtie_map_ph1 %>% kable() %>% kable_styling()
input_reads unpaired zero_align one_align oneplus_align overall_align_percent
2880 2880 2810 68 2 2.43
10186 10186 9991 191 4 1.91
5930 5930 5817 112 1 1.91
6731 6731 6616 114 1 1.71

Very few reads align to phiX.

Output mapped reads

Let’s combine all of the mapped reads into a csv and look at a quick overview. We will convert all the files to bam for reading into R.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/6_bam_files/

samtools view -bh ../3_bowtie_ecoli_output/lib_0_left_mapped.sam > lib_0_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_1_left_mapped.sam > lib_1_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_2_left_mapped.sam > lib_2_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_3_left_mapped.sam > lib_3_left_mapped.bam

samtools view -bh ../4_bowtie_pInt_output/lib_0_left_pIntKan_mapped.sam > lib_0_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_1_left_pIntKan_mapped.sam > lib_1_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_2_left_pIntKan_mapped.sam > lib_2_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_3_left_pIntKan_mapped.sam > lib_3_left_pIntKan_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_left_phiX_mapped.sam > lib_0_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_left_phiX_mapped.sam > lib_1_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_left_phiX_mapped.sam > lib_2_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_left_phiX_mapped.sam > lib_3_left_phiX_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_left_phiX_unmapped.sam > lib_0_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_left_phiX_unmapped.sam > lib_1_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_left_phiX_unmapped.sam > lib_2_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_left_phiX_unmapped.sam > lib_3_left_phiX_unmapped.bam

Now we can import the bam files.

bam_path <- '../../data/seq_data/left_side_TO_libs/6_bam_files/'

df_bam_0_1 <- readGAlignments(file = paste0(bam_path,"lib_0_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 1)
df_bam_1_1 <- readGAlignments(file = paste0(bam_path,"lib_1_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 1)
df_bam_2_1 <- readGAlignments(file = paste0(bam_path,"lib_2_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 1)
df_bam_3_1 <- readGAlignments(file = paste0(bam_path,"lib_3_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 1)

df_bam_0_2 <- readGAlignments(file = paste0(bam_path,"lib_0_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 2)
df_bam_1_2 <- readGAlignments(file = paste0(bam_path,"lib_1_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 2)
df_bam_2_2 <- readGAlignments(file = paste0(bam_path,"lib_2_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 2)
df_bam_3_2 <- readGAlignments(file = paste0(bam_path,"lib_3_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 2)

df_bam_0_3 <- readGAlignments(file = paste0(bam_path,"lib_0_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 3)
df_bam_1_3 <- readGAlignments(file = paste0(bam_path,"lib_1_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 3)
df_bam_2_3 <- readGAlignments(file = paste0(bam_path,"lib_2_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 3)
df_bam_3_3 <- readGAlignments(file = paste0(bam_path,"lib_3_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 3)

We can combine all these reads from the different steps into one dataframe and save a csv.

df_all_reads <- bind_rows(df_bam_0_1,df_bam_1_1,df_bam_2_1,df_bam_3_1,
          df_bam_0_2,df_bam_1_2,df_bam_2_2,df_bam_3_2,
          df_bam_0_3,df_bam_1_3,df_bam_2_3,df_bam_3_3)

write_csv(df_all_reads, "../../data/seq_data/left_side_TO_libs/df_to_libs_left_all_reads.csv")

df_all_reads %>% head(10) %>% kable() %>% kable_styling()
seqnames strand cigar qwidth start end width njunc lib step
U00096.3
55M6S 61 2979936 2979990 55 0 0 1
U00096.3
30M 30 2979936 2979965 30 0 0 1
U00096.3
48M 48 2979936 2979983 48 0 0 1
U00096.3
52M5S 57 2979936 2979987 52 0 0 1
U00096.3
39M 39 2979936 2979974 39 0 0 1
U00096.3
32M1D42M 74 2977610 2977684 75 0 0 1
U00096.3
57M 57 2095844 2095900 57 0 0 1
U00096.3
75M 75 4098577 4098651 75 0 0 1
U00096.3
45M 45 417103 417147 45 0 0 1
U00096.3
61M 61 417087 417147 61 0 0 1

We will calculate some statistics about the read lengths at the different steps:

df_all_summary <- df_all_reads %>% 
  group_by(lib, step) %>% 
  summarise(n=n(), mean_len = mean(qwidth), median_len = median(qwidth), sd_len = sd(qwidth))

df_all_summary %>% kable() %>% kable_styling()
lib step n mean_len median_len sd_len
0 1 27014 60.08758 67 16.28880
0 2 2548 52.69192 60 15.79955
0 3 70 140.92857 146 18.32818
1 1 73541 58.01308 63 17.23517
1 2 4684 52.47758 60 16.13342
1 3 195 135.64615 146 26.25319
2 1 50705 61.15430 69 16.49508
2 2 2734 54.94660 67 15.15059
2 3 113 143.30973 146 14.12639
3 1 53573 63.13684 74 15.97718
3 2 2540 55.22717 67 15.34865
3 3 115 138.18261 146 22.24607

And we will view the read lengths for all 4 different libraries at the 3 different mapping steps.

ggplot(df_all_reads, aes(x = factor(step),y=qwidth)) + geom_boxplot() + facet_wrap(~lib)

The only notable thing here is that the tiny number of reads aligning to phiX are long, probably because they didn’t contain our pInt_kanR “adaper” sequence.

Finally, let’s look at the fate (number) of our reads across the different mapping steps.

ggplot(df_all_summary, aes(x = lib,y=n, fill = factor(step))) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

So, the vast majority of the mapped reads mapped to the E. coli genome in step 1.

These mapped reads will be further analyzed in the left_side_TO_libs_analysis.Rmd notebook.

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.6.2               viridisLite_0.4.1          
##  [3] kableExtra_1.3.4            GenomicAlignments_1.34.0   
##  [5] Rsamtools_2.14.0            Biostrings_2.66.0          
##  [7] XVector_0.38.0              SummarizedExperiment_1.28.0
##  [9] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [11] matrixStats_0.63.0          GenomicRanges_1.50.1       
## [13] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [15] S4Vectors_0.36.0            BiocGenerics_0.44.0        
## [17] forcats_0.5.2               stringr_1.5.0              
## [19] dplyr_1.1.0                 purrr_0.3.5                
## [21] readr_2.1.3                 tidyr_1.2.1                
## [23] tibble_3.1.8                ggplot2_3.4.0              
## [25] tidyverse_1.3.2            
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           fs_1.5.2               bit64_4.0.5           
##  [4] lubridate_1.9.0        webshot_0.5.4          httr_1.4.4            
##  [7] tools_4.2.0            backports_1.4.1        bslib_0.4.1           
## [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.3             
## [13] colorspace_2.0-3       withr_2.5.0            gridExtra_2.3         
## [16] tidyselect_1.2.0       bit_4.0.5              compiler_4.2.0        
## [19] cli_3.4.1              rvest_1.0.3            xml2_1.3.3            
## [22] DelayedArray_0.24.0    labeling_0.4.2         sass_0.4.4            
## [25] scales_1.2.1           systemfonts_1.0.4      digest_0.6.30         
## [28] svglite_2.1.0          rmarkdown_2.18         pkgconfig_2.0.3       
## [31] htmltools_0.5.4        highr_0.9              dbplyr_2.2.1          
## [34] fastmap_1.1.0          rlang_1.1.1            readxl_1.4.1          
## [37] rstudioapi_0.14        farver_2.1.1           jquerylib_0.1.4       
## [40] generics_0.1.3         jsonlite_1.8.3         vroom_1.6.0           
## [43] BiocParallel_1.32.4    googlesheets4_1.0.1    RCurl_1.98-1.9        
## [46] magrittr_2.0.3         GenomeInfoDbData_1.2.9 Matrix_1.5-3          
## [49] munsell_0.5.0          fansi_1.0.3            lifecycle_1.0.3       
## [52] stringi_1.7.8          yaml_2.3.6             zlibbioc_1.44.0       
## [55] grid_4.2.0             parallel_4.2.0         crayon_1.5.2          
## [58] lattice_0.20-45        haven_2.5.1            hms_1.1.2             
## [61] knitr_1.41             pillar_1.8.1           codetools_0.2-18      
## [64] reprex_2.0.2           glue_1.6.2             evaluate_0.18         
## [67] modelr_0.1.10          vctrs_0.5.2            tzdb_0.3.0            
## [70] cellranger_1.1.0       gtable_0.3.1           assertthat_0.2.1      
## [73] cachem_1.0.6           xfun_0.35              broom_1.0.1           
## [76] googledrive_2.0.0      gargle_1.2.1           timechange_0.1.1      
## [79] ellipsis_0.3.2