library(tidyverse)
library(GenomicAlignments)
library(kableExtra)

knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)

source("../tools/plotting_tools.R")
theme_set(theme_notebook())

The raw sequencing data is available on the SRA and is associated with this BioProject.

Library Name	Run ID and Link	Size
oPool_J1	SRR25076307	8.3 Mb
short_TF_J1	SRR25076303	25 Mb
long_TF_J1	SRR25076301	16 Mb
small_RNA_J1	SRR25076299	18 Mb

Four different ORBIT mutant libraries were constructed using targeting oligo pools. These different deletion libraries were as follows:

Library #0: galK, hisA, metA, leuD (gold standards) + 23 TF genes (27 targets total)
Library #1: All TF genes < 575 bp (76 targets total)
Library #2: All TF genes > 575 bp (226 targets total)
Library #3: All small RNA genes (90 targets total)

Within each library the targeting oligo deletes a target gene, and pInt_kanR should integrate into the newly formed attB site on the genome. This should leave us with a library with the target genes deleted, replaced by the pInt_kanR backbone flanked by the attR and attL sites. To test the accuracy with which these libraries were constructed sequencing libraries were prepared to amplify these genome-pInt junctions. This was done by shearing genomic DNA, adding C tails to free DNA ends and then amplifying junctions with a polyG and pInt specific primer. This notebook processes the raw fastq sequencing files (direct from MiSeq) into a more usable form of reads that are filtered and trimmed to map to the E. coli genome. Then by looking at where these junction reads start (following the pInt sequence) we can determine if the ORBIT mutations were created accurately.

The structure of this data processing is shown below. This processing script lives in the code folder under seq_data_processing and it processes raw data files that live in data/seq_data/left_side_TO_libs. The processing starts from 0_raw_fastq (which houses truly raw data) and proceeds to filter for quality using fastp (1_fastp_output), then the pInt_kanR sequence that is common to all mutants is found and trimmed off along with the polyC tail using cutadapt (2_cutadapt_output). At this stage filtered and trimmed reads should be the flanking E. coli genomic sequences for each mutant, so we map these reads to the E. coli genome with bowtie2 (3_bowtie_ecoli_output). Finally, we examine some of the reads that don’t map by seeing if they still map to the pInt_kanR plasmid even after the previous steps (4_bowtie_pInt_output) and also check if they map to the phiX loading control that we use on the Miseq (bowtie_phiX_output).

After all this, we will convert these various sets of reads into bam files and then into a more usable form for analysis - the final dataframe: df_to_libs_left_all_reads.csv

├── code
│   ├── seq_data_processing
│   │   ├── left_side_TO_libs_processing.Rmd
├── data
│   ├── seq_data
│   │   ├── left_side_TO_libs
│   │   │   ├── 0_raw_fastq
│   │   │   ├── 1_fastp_output
│   │   │   ├── 2_cutadapt_output
│   │   │   ├── 3_bowtie_ecoli_output
│   │   │   ├── 4_bowtie_pInt_output
│   │   │   ├── 5_bowtie_phiX_output
│   │   │   ├── 6_bam_files
│   │   │   ├── df_to_libs_left_all_reads.csv
├── figures

Step 0: Bioinformatics environment setup

This complicated setup is required the first time for m1 macs only, which must run certain programs in an intel like environment. Note that it gets a little funky running bash from rmd, so it may be more straightforward for the user to run bash commands in the terminal or windows equivalent. However, this setup works on multiple machines for the author at the current time.

This setup assumes you already have bash and conda in addition to R/rmd.

source ~/.bash_profile

#**first time running scripts, create conda env**
# use code below when running on *m1 mac* to setup intel like env
# see stack overflow: https://stackoverflow.com/questions/71515117/how-to-set-up-a-conda-osx-64-environment-on-arm-mac
conda create -n bioinfo_env -y

conda activate bioinfo_env

conda config --env --set subdir osx-64

conda install python=3.7 -y

conda install -n bioinfo_env fastp cutadapt bowtie2 samtools starcode --channel bioconda -y

For normal, non-m1 mac conda environment setup:

source ~/.bash_profile
  
#**for normal conda install using this code first time**
conda create -n bioinfo_env fastp cutadapt samtools bowtie2 starcode -y --channel bioconda

With our bioinfo_env setup we are ready to process the data. Note, when running bash scripts in rmd, we have found that we must activate the conda env in each code chunk and also source our bash profile to get things to work correctly.

Step 1: Fastp quality filter

Looking at the raw data, (0_raw_fastq), you can see that these were actually paired end sequencing runs. However, from looking at the data, read2 quality is very low, so we will focus on the higher quality read1 data, which starts from the pInt_kanR backbone and sequences out into the genome (instead of read2 sequencing in from the polyC tail).

Here, we will use the program fastp, with defaults to filter out very low quality reads, and also to get a nice summary of the seq libraries.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/

fastp -i "0_raw_fastq/lib-0-IDToPool_S1_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_0_left_fastp.fastq.gz" -h "1_fastp_output/lib_0_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-1-twist-TFavdovlp-short_S2_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_1_left_fastp.fastq.gz" -h "1_fastp_output/lib_1_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-2-twist-TFavdovlp-long_S3_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_2_left_fastp.fastq.gz" -h "1_fastp_output/lib_2_left_report.html" --json /dev/null

fastp -i "0_raw_fastq/lib-3-twist-sRNA_S4_L001_R1_001.fastq.gz" -o "1_fastp_output/lib_3_left_fastp.fastq.gz" -h "1_fastp_output/lib_3_left_report.html" --json /dev/null

Please see the nice html reports in 1_fastp_output. The main takeaway is the total number of reads we are working with and their quality:

Lib #0: 44k reads
Lib #1: 137k reads
Lib #2: 87k reads
Lib #3: 94k reads

For all 4 libraries, >98% of reads pass the quality filter. ~80% of remaining reads are Q30 or above. These total read counts may seem low, please note that this was a MiSeq nano kit and the loading was not optimal. However, these read counts are perfectly sufficient to quantify 400 mutants.

Step 2: Cutadapt trimming

The next step in the processing pipeline is to trim off the ORBIT plasmid sequence and the polyC tail used in library prep. To do this we will use the ‘linked adapter’ option from cutadapt. I experimented with a few options and this works way better than unlinked. This linked adapter is called with the option -a 5'ADAPTER...3'ADAPTER . Note the polyC tail is specified as 8x C’s C{8}. In looking at the fastq files there are many reads that have way way longer polyC tails, but this cutadapt scheme should trim everything 3’ of the first instance of 8 Cs following the 5’ adapter. Note that the PCR1 reverse primer that is polyG contains 1N, because of ordering issues from sigma. Therefore many of the polyC tails are not perfect. For this reason and because the ORBIT adapter needs to be robustly found, I increased the error tolerance from the standard 10% to 20% (e 0.2). This means that for the polyC tail only 1-2 nt cannot be C.

After trimming adapter the remaining genomic regions will be mapped to the E. coli genome, so I’m also passing a length cutoff of at least 20bp for the trimmed read -m 20. A read shorter than 20 bp will likely not be unique and probably shouldn’t be trusted. This minimum length could be increased later to be more stringent.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_0_left_fastp.fastq.gz -o 2_cutadapt_output/lib_0_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_0_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_1_left_fastp.fastq.gz -o 2_cutadapt_output/lib_1_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_1_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_2_left_fastp.fastq.gz -o 2_cutadapt_output/lib_2_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_2_left_cutadpt_report.tsv

cutadapt -a "ggaactattacacgtcttgagcgattggtttgtctggtcaaccaccgcggtctccgtcgtcaggatcat;e=0.2;required...C{8};e=0.2;optional" -n 1 -m 20 -e 0.2 --discard-untrimmed 1_fastp_output/lib_3_left_fastp.fastq.gz -o 2_cutadapt_output/lib_3_left_cutadpt.fastq.gz --report=minimal > 2_cutadapt_output/lib_3_left_cutadpt_report.tsv

Now let’s look at the output reports and make sure the trimming worked as expected.

cutadapt_path <- '../../data/seq_data/left_side_TO_libs/2_cutadapt_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_left_cutadpt_report.tsv')
cut_lib_1 <- paste0(cutadapt_path,'lib_1_left_cutadpt_report.tsv')
cut_lib_2 <- paste0(cutadapt_path,'lib_2_left_cutadpt_report.tsv')
cut_lib_3 <- paste0(cutadapt_path,'lib_3_left_cutadpt_report.tsv')

df_cutadapt <- bind_rows(read_delim(cut_lib_0, delim = '\t', skip = length(readLines(cut_lib_0))-2 ) %>% mutate(lib=0),
          read_delim(cut_lib_1, delim = '\t', skip = length(readLines(cut_lib_1))-2 )%>% mutate(lib=1),
          read_delim(cut_lib_2, delim = '\t', skip = length(readLines(cut_lib_2))-2 )%>% mutate(lib=2),
          read_delim(cut_lib_3, delim = '\t', skip = length(readLines(cut_lib_3))-2 )%>% mutate(lib=3)) %>% 
  mutate(frac_in_out = out_reads / in_reads, frac_w_adapt = `w/adapters` / in_reads, frac_too_short = too_short/in_reads)

df_cutadapt %>% kable() %>% kable_styling()

status	in_reads	in_bp	too_short	out_reads	w/adapters	out_bp	lib	frac_in_out	frac_w_adapt	frac_too_short
OK	44316	6667590	9317	32442	41759	1868044	0	0.7320607	0.9423007	0.2102401
OK	137369	20646810	41055	88411	129466	4916920	1	0.6436023	0.9424688	0.2988666
OK	87525	13152982	22949	59369	82318	3490009	2	0.6783091	0.9405084	0.2621994
OK	94334	14179489	26121	62844	88965	3806581	3	0.6661861	0.9430852	0.2768991

Ok, so overall 64-73% of the input reads are being trimmed and output. About 94% have the “adapter” sequences, but a significant fraction of reads are below the minimum 20bp length after trimming and are therefore considered too short and not output (21-29%). This is likely just a function of the size distribution of the library. Let’s look at this graphically:

df_cutadapt_summary <- df_cutadapt %>% 
  mutate(w_o_adapters = in_reads-`w/adapters`) %>% 
  pivot_longer(c(too_short, out_reads, w_o_adapters), values_to = 'reads', names_to = 'fraction')
  

#ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="stack", stat="identity") + scale_fill_viridis_d()

ggplot(df_cutadapt_summary, aes(x = lib,y=reads, fill = fraction)) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

Here is the largest bottleneck in the entire pipeline, where we lose a significant portion of the reads due to them being too short following adapter trimming (too_short. This is likely because of the library prep strategy, which caused some fragments to have long c tails and shorter genomic fragments. However, the remaining reads now seem to be ready to go to answer our original question: how accurately are ORBIT mutants constructed. As a reminder these remaining reads (out_reads above) passed sequencing quality filters, they have the pInt_kanR backbone sequence, and they have at least 20bp of putative genomic sequence that should allow us to map back to the original ORBIT target genes.

Step 3: Bowtie alignment

Mapping to E. coli genome

The next step is to align the trimmed reads to the E. coli genome. This should be an unbiased way to see how well the sequencing reads from our mutant library actually match the intended ORBIT deletions. I’m using the --very-sensitive-local mapping option. Earlier tests indicated that this gave the most reads mapping and manual inspection indicated that it was mapping correctly. This local option means only part of the read needs to align, not the whole thing. This may not be optimal, but seems necessary because some reads have remnants of the c-tail still present in read 1, which seemed to limit the standard alignment. The E. coli ref genome is built with the genome fasta file MG1655_U000963.fasta by simply calling bowtie build MG1655_U000963.fasta ecoli_ref_genome . Since we are only using the higher quality / more informative read1, I am using the -U option, signifying that all the input reads (from cutadapt) are unpaired. Bowtie will try to map the reads to the genome and save them as a sam file (-S option). I am also capturing the unmapped reads with the --un option, which will go to a different sam file. Finally, with the 2>file.txt option, I’m taking the standard metrics bowtie outputs to the console and writing them to a text file, so we can keep track of the fate of each libraries reads with real numbers.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/3_bowtie_ecoli_output/

bowtie2-build --quiet MG1655_U000963.fasta ecoli_ref_genome 

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_0_left_cutadpt.fastq.gz -S lib_0_left_mapped.sam --un lib_0_left_unmapped.sam 2> lib_0_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_1_left_cutadpt.fastq.gz -S lib_1_left_mapped.sam --un lib_1_left_unmapped.sam 2> lib_1_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_2_left_cutadpt.fastq.gz -S lib_2_left_mapped.sam --un lib_2_left_unmapped.sam 2> lib_2_left_bowtie.txt

bowtie2 --very-sensitive-local -x ecoli_ref_genome -U ../2_cutadapt_output/lib_3_left_cutadpt.fastq.gz -S lib_3_left_mapped.sam --un lib_3_left_unmapped.sam 2> lib_3_left_bowtie.txt

Now let’s look at the bowtie metrics text files to see how the alignments went.

ecoli_bowtie_path <- '../../data/seq_data/left_side_TO_libs/3_bowtie_ecoli_output/'

cut_lib_0 <- paste0(cutadapt_path,'lib_0_left_cutadpt_report.tsv')

vec0 <- readLines(paste0(ecoli_bowtie_path,"lib_0_left_bowtie.txt"))
vec1 <- readLines(paste0(ecoli_bowtie_path,"lib_1_left_bowtie.txt"))
vec2 <- readLines(paste0(ecoli_bowtie_path,"lib_2_left_bowtie.txt"))
vec3 <- readLines(paste0(ecoli_bowtie_path,"lib_3_left_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_0 <- parse_number(vec0) %>% setNames(names)
bowtie_1 <- parse_number(vec1) %>% setNames(names)
bowtie_2 <- parse_number(vec2) %>% setNames(names)
bowtie_3 <- parse_number(vec3) %>% setNames(names)

df_bowtie_map_1 <- bind_rows(bowtie_0, bowtie_1,bowtie_2,bowtie_3)

df_bowtie_map_1 %>% kable() %>% kable_styling()

input_reads	unpaired	zero_align	one_align	oneplus_align	overall_align_percent
32442	32442	5428	24854	2160	83.27
88411	88411	14870	73288	253	83.18
59369	59369	8664	49375	1330	85.41
62844	62844	9271	45850	7723	85.25

For each of the libraries, the overall alignment percentage is 83-85% giving us 10s of thousands of mapped reads for each library. This also means that 15-17% of reads did not align, so we will briefly explore two explanations for what those reads could be.

Evaluate unmapped reads

Mapping to plasmid concatemers

This cell will look complicated, but basically we just have to take the unmapped reads as a sam file (default output) and convert them back into a fastq file, so that we can use them as input for this next bowtie mapping step.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/4_bowtie_pInt_output/

#bowtie2-build --quiet pInt_kanR_half.fasta pInt_kanR_half

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_0_left_unmapped.sam -S lib_0_left_pIntKan_mapped.sam --un lib_0_left_pIntKan_unmapped.sam 2>lib_0_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_1_left_unmapped.sam -S lib_1_left_pIntKan_mapped.sam --un lib_1_left_pIntKan_unmapped.sam 2>lib_1_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_2_left_unmapped.sam -S lib_2_left_pIntKan_mapped.sam --un lib_2_left_pIntKan_unmapped.sam 2>lib_2_left_pInt_bowtie.txt

bowtie2 --very-sensitive-local -x pInt_kanR_half -U ../3_bowtie_ecoli_output/lib_3_left_unmapped.sam -S lib_3_left_pIntKan_mapped.sam --un lib_3_left_pIntKan_unmapped.sam 2>lib_3_left_pInt_bowtie.txt

Again, let’s check the mapping statistics.

pInt_bowtie_path <- '../../data/seq_data/left_side_TO_libs/4_bowtie_pInt_output/'


vec_p0 <- readLines(paste0(pInt_bowtie_path,"lib_0_left_pInt_bowtie.txt"))
vec_p1 <- readLines(paste0(pInt_bowtie_path,"lib_1_left_pInt_bowtie.txt"))
vec_p2 <- readLines(paste0(pInt_bowtie_path,"lib_2_left_pInt_bowtie.txt"))
vec_p3 <- readLines(paste0(pInt_bowtie_path,"lib_3_left_pInt_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_p0 <- parse_number(vec_p0) %>% setNames(names)
bowtie_p1 <- parse_number(vec_p1) %>% setNames(names)
bowtie_p2 <- parse_number(vec_p2) %>% setNames(names)
bowtie_p3 <- parse_number(vec_p3) %>% setNames(names)

df_bowtie_map_p1 <- bind_rows(bowtie_p0, bowtie_p1,bowtie_p2,bowtie_p3)

df_bowtie_map_p1 %>% kable() %>% kable_styling()

input_reads	unpaired	zero_align	one_align	oneplus_align	overall_align_percent
5428	5428	2880	2548	0	46.94
14870	14870	10186	4681	3	31.50
8664	8664	5930	2734	0	31.56
9271	9271	6731	2540	0	27.40

So, some reads do map to the backbone of pInt_kanR suggesting they could be plasmid concatemers or other adapter / backbone sequences.

mapping to phiX

Finally, we will check to see if any of our reads map to the phiX control library that was run alongside the sequencing libraries (10% final concentration)

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/5_bowtie_phiX_output/


bowtie2-build --quiet phiX174.fasta phiX174

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_0_left_pIntKan_unmapped.sam -S lib_0_left_phiX_mapped.sam --un lib_0_left_phiX_unmapped.sam 2>lib_0_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_1_left_pIntKan_unmapped.sam -S lib_1_left_phiX_mapped.sam --un lib_1_left_phiX_unmapped.sam 2>lib_1_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_2_left_pIntKan_unmapped.sam -S lib_2_left_phiX_mapped.sam --un lib_2_left_phiX_unmapped.sam 2>lib_2_left_phiX_bowtie.txt

bowtie2 --very-sensitive-local -x phiX174 -U ../4_bowtie_pInt_output/lib_3_left_pIntKan_unmapped.sam -S lib_3_left_phiX_mapped.sam --un lib_3_left_phiX_unmapped.sam 2>lib_3_left_phiX_bowtie.txt

Let’s look at metrics:

phiX_bowtie_path <- '../../data/seq_data/left_side_TO_libs/5_bowtie_phiX_output/'


vec_ph0 <- readLines(paste0(phiX_bowtie_path,"lib_0_left_phiX_bowtie.txt"))
vec_ph1 <- readLines(paste0(phiX_bowtie_path,"lib_1_left_phiX_bowtie.txt"))
vec_ph2 <- readLines(paste0(phiX_bowtie_path,"lib_2_left_phiX_bowtie.txt"))
vec_ph3 <- readLines(paste0(phiX_bowtie_path,"lib_3_left_phiX_bowtie.txt"))


names <- c('input_reads','unpaired', 'zero_align','one_align','oneplus_align','overall_align_percent')

bowtie_ph0 <- parse_number(vec_ph0) %>% setNames(names)
bowtie_ph1 <- parse_number(vec_ph1) %>% setNames(names)
bowtie_ph2 <- parse_number(vec_ph2) %>% setNames(names)
bowtie_ph3 <- parse_number(vec_ph3) %>% setNames(names)

df_bowtie_map_ph1 <- bind_rows(bowtie_ph0, bowtie_ph1,bowtie_ph2,bowtie_ph3)

df_bowtie_map_ph1 %>% kable() %>% kable_styling()

input_reads	unpaired	zero_align	one_align	oneplus_align	overall_align_percent
2880	2880	2810	68	2	2.43
10186	10186	9991	191	4	1.91
5930	5930	5817	112	1	1.91
6731	6731	6616	114	1	1.71

Very few reads align to phiX.

Output mapped reads

Let’s combine all of the mapped reads into a csv and look at a quick overview. We will convert all the files to bam for reading into R.

source ~/.bash_profile
conda activate bioinfo_env

cd ../../data/seq_data/left_side_TO_libs/6_bam_files/

samtools view -bh ../3_bowtie_ecoli_output/lib_0_left_mapped.sam > lib_0_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_1_left_mapped.sam > lib_1_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_2_left_mapped.sam > lib_2_left_mapped.bam
samtools view -bh ../3_bowtie_ecoli_output/lib_3_left_mapped.sam > lib_3_left_mapped.bam

samtools view -bh ../4_bowtie_pInt_output/lib_0_left_pIntKan_mapped.sam > lib_0_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_1_left_pIntKan_mapped.sam > lib_1_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_2_left_pIntKan_mapped.sam > lib_2_left_pIntKan_mapped.bam
samtools view -bh ../4_bowtie_pInt_output/lib_3_left_pIntKan_mapped.sam > lib_3_left_pIntKan_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_left_phiX_mapped.sam > lib_0_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_left_phiX_mapped.sam > lib_1_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_left_phiX_mapped.sam > lib_2_left_phiX_mapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_left_phiX_mapped.sam > lib_3_left_phiX_mapped.bam

samtools view -bh ../5_bowtie_phiX_output/lib_0_left_phiX_unmapped.sam > lib_0_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_1_left_phiX_unmapped.sam > lib_1_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_2_left_phiX_unmapped.sam > lib_2_left_phiX_unmapped.bam
samtools view -bh ../5_bowtie_phiX_output/lib_3_left_phiX_unmapped.sam > lib_3_left_phiX_unmapped.bam

Now we can import the bam files.

bam_path <- '../../data/seq_data/left_side_TO_libs/6_bam_files/'

df_bam_0_1 <- readGAlignments(file = paste0(bam_path,"lib_0_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 1)
df_bam_1_1 <- readGAlignments(file = paste0(bam_path,"lib_1_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 1)
df_bam_2_1 <- readGAlignments(file = paste0(bam_path,"lib_2_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 1)
df_bam_3_1 <- readGAlignments(file = paste0(bam_path,"lib_3_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 1)

df_bam_0_2 <- readGAlignments(file = paste0(bam_path,"lib_0_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 2)
df_bam_1_2 <- readGAlignments(file = paste0(bam_path,"lib_1_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 2)
df_bam_2_2 <- readGAlignments(file = paste0(bam_path,"lib_2_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 2)
df_bam_3_2 <- readGAlignments(file = paste0(bam_path,"lib_3_left_pIntKan_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 2)

df_bam_0_3 <- readGAlignments(file = paste0(bam_path,"lib_0_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=0, step = 3)
df_bam_1_3 <- readGAlignments(file = paste0(bam_path,"lib_1_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=1, step = 3)
df_bam_2_3 <- readGAlignments(file = paste0(bam_path,"lib_2_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=2, step = 3)
df_bam_3_3 <- readGAlignments(file = paste0(bam_path,"lib_3_left_phiX_mapped.bam")) %>% as_tibble() %>% mutate(lib=3, step = 3)

We can combine all these reads from the different steps into one dataframe and save a csv.

df_all_reads <- bind_rows(df_bam_0_1,df_bam_1_1,df_bam_2_1,df_bam_3_1,
          df_bam_0_2,df_bam_1_2,df_bam_2_2,df_bam_3_2,
          df_bam_0_3,df_bam_1_3,df_bam_2_3,df_bam_3_3)

write_csv(df_all_reads, "../../data/seq_data/left_side_TO_libs/df_to_libs_left_all_reads.csv")

df_all_reads %>% head(10) %>% kable() %>% kable_styling()

seqnames	cigar	qwidth	start	end	width	step
U00096.3	55M6S	61	2979936	2979990	55	1
U00096.3	30M	30	2979936	2979965	30	1
U00096.3	48M	48	2979936	2979983	48	1
U00096.3	52M5S	57	2979936	2979987	52	1
U00096.3	39M	39	2979936	2979974	39	1
U00096.3	32M1D42M	74	2977610	2977684	75	1
U00096.3	57M	57	2095844	2095900	57	1
U00096.3	75M	75	4098577	4098651	75	1
U00096.3	45M	45	417103	417147	45	1
U00096.3	61M	61	417087	417147	61	1

We will calculate some statistics about the read lengths at the different steps:

df_all_summary <- df_all_reads %>% 
  group_by(lib, step) %>% 
  summarise(n=n(), mean_len = mean(qwidth), median_len = median(qwidth), sd_len = sd(qwidth))

df_all_summary %>% kable() %>% kable_styling()

lib	step	n	mean_len	median_len	sd_len
0	1	27014	60.08758	67	16.28880
0	2	2548	52.69192	60	15.79955
0	3	70	140.92857	146	18.32818
1	1	73541	58.01308	63	17.23517
1	2	4684	52.47758	60	16.13342
1	3	195	135.64615	146	26.25319
2	1	50705	61.15430	69	16.49508
2	2	2734	54.94660	67	15.15059
2	3	113	143.30973	146	14.12639
3	1	53573	63.13684	74	15.97718
3	2	2540	55.22717	67	15.34865
3	3	115	138.18261	146	22.24607

And we will view the read lengths for all 4 different libraries at the 3 different mapping steps.

ggplot(df_all_reads, aes(x = factor(step),y=qwidth)) + geom_boxplot() + facet_wrap(~lib)

The only notable thing here is that the tiny number of reads aligning to phiX are long, probably because they didn’t contain our pInt_kanR “adaper” sequence.

Finally, let’s look at the fate (number) of our reads across the different mapping steps.

ggplot(df_all_summary, aes(x = lib,y=n, fill = factor(step))) + geom_bar(position="fill", stat="identity") + scale_fill_viridis_d()

So, the vast majority of the mapped reads mapped to the E. coli genome in step 1.

These mapped reads will be further analyzed in the left_side_TO_libs_analysis.Rmd notebook.

sessionInfo()

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.6.2               viridisLite_0.4.1          
##  [3] kableExtra_1.3.4            GenomicAlignments_1.34.0   
##  [5] Rsamtools_2.14.0            Biostrings_2.66.0          
##  [7] XVector_0.38.0              SummarizedExperiment_1.28.0
##  [9] Biobase_2.58.0              MatrixGenerics_1.10.0      
## [11] matrixStats_0.63.0          GenomicRanges_1.50.1       
## [13] GenomeInfoDb_1.34.9         IRanges_2.32.0             
## [15] S4Vectors_0.36.0            BiocGenerics_0.44.0        
## [17] forcats_0.5.2               stringr_1.5.0              
## [19] dplyr_1.1.0                 purrr_0.3.5                
## [21] readr_2.1.3                 tidyr_1.2.1                
## [23] tibble_3.1.8                ggplot2_3.4.0              
## [25] tidyverse_1.3.2            
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           fs_1.5.2               bit64_4.0.5           
##  [4] lubridate_1.9.0        webshot_0.5.4          httr_1.4.4            
##  [7] tools_4.2.0            backports_1.4.1        bslib_0.4.1           
## [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.3             
## [13] colorspace_2.0-3       withr_2.5.0            gridExtra_2.3         
## [16] tidyselect_1.2.0       bit_4.0.5              compiler_4.2.0        
## [19] cli_3.4.1              rvest_1.0.3            xml2_1.3.3            
## [22] DelayedArray_0.24.0    labeling_0.4.2         sass_0.4.4            
## [25] scales_1.2.1           systemfonts_1.0.4      digest_0.6.30         
## [28] svglite_2.1.0          rmarkdown_2.18         pkgconfig_2.0.3       
## [31] htmltools_0.5.4        highr_0.9              dbplyr_2.2.1          
## [34] fastmap_1.1.0          rlang_1.1.1            readxl_1.4.1          
## [37] rstudioapi_0.14        farver_2.1.1           jquerylib_0.1.4       
## [40] generics_0.1.3         jsonlite_1.8.3         vroom_1.6.0           
## [43] BiocParallel_1.32.4    googlesheets4_1.0.1    RCurl_1.98-1.9        
## [46] magrittr_2.0.3         GenomeInfoDbData_1.2.9 Matrix_1.5-3          
## [49] munsell_0.5.0          fansi_1.0.3            lifecycle_1.0.3       
## [52] stringi_1.7.8          yaml_2.3.6             zlibbioc_1.44.0       
## [55] grid_4.2.0             parallel_4.2.0         crayon_1.5.2          
## [58] lattice_0.20-45        haven_2.5.1            hms_1.1.2             
## [61] knitr_1.41             pillar_1.8.1           codetools_0.2-18      
## [64] reprex_2.0.2           glue_1.6.2             evaluate_0.18         
## [67] modelr_0.1.10          vctrs_0.5.2            tzdb_0.3.0            
## [70] cellranger_1.1.0       gtable_0.3.1           assertthat_0.2.1      
## [73] cachem_1.0.6           xfun_0.35              broom_1.0.1           
## [76] googledrive_2.0.0      gargle_1.2.1           timechange_0.1.1      
## [79] ellipsis_0.3.2

Seq data processing: Left side TO libs

E. coli ORBIT 2022

Scott H. Saunders