library(tidyverse)
library(GenomicAlignments)
library(kableExtra)
library(ggridges)

knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)

source("../tools/plotting_tools.R")
theme_set(theme_notebook())

This notebook performs the analysis on the mapped reads for the four ORBIT libraries that were constructed. As a reminder, these different deletion libraries were as follows:

Within each library the targeting oligo deletes a target gene, and pInt_kanR should integrate into the newly formed attB site on the genome. This should leave us with a library with the target genes deleted, replaced by the pInt_kanR backbone flanked by the attR and attL sites. To test the accuracy with which these libraries were constructed sequencing libraries were prepared to amplify these genome-pInt junctions. This was done by shearing genomic DNA, adding C tails to free DNA ends and then amplifying junctions with a polyG and pInt specific primer. This notebook takes the processed (filtered, trimmed, mapped) reads from the processing notebook and uses these mapped sequences to analyze how accurately the designed ORBIT mutants were created.

Read in data and figure out TO / read orientation

First, we need to read in the mapped reads for each left side library. These are stored as bam files. Here we combine all of the mapped reads into one dataframe that will go through analysis.

bam_path <- "../../data/seq_data/left_side_TO_libs/6_bam_files/"

df_bam_0_1 <- readGAlignments(file = paste0(bam_path,"lib_0_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=0)

df_bam_1_1 <- readGAlignments(file = paste0(bam_path,"lib_1_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=1)

df_bam_2_1 <- readGAlignments(file = paste0(bam_path,"lib_2_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=2)

df_bam_3_1 <- readGAlignments(file = paste0(bam_path,"lib_3_left_mapped.bam")) %>% as_tibble() %>% mutate(lib=3)

#df_bam_0 %>% group_by(start) %>% summarise(n=n()) %>% arrange(desc(n))


df_data_all <- bind_rows(df_bam_0_1,df_bam_1_1,df_bam_2_1,df_bam_3_1)

df_data_all %>% head(10) %>% kable() %>% kable_styling()
seqnames strand cigar qwidth start end width njunc lib
U00096.3
55M6S 61 2979936 2979990 55 0 0
U00096.3
30M 30 2979936 2979965 30 0 0
U00096.3
48M 48 2979936 2979983 48 0 0
U00096.3
52M5S 57 2979936 2979987 52 0 0
U00096.3
39M 39 2979936 2979974 39 0 0
U00096.3
32M1D42M 74 2977610 2977684 75 0 0
U00096.3
57M 57 2095844 2095900 57 0 0
U00096.3
75M 75 4098577 4098651 75 0 0
U00096.3
45M 45 417103 417147 45 0 0
U00096.3
61M 61 417087 417147 61 0 0

This full data frame is ~330k lines, meaning that we are working with that many reads total. Next we need to read in the information for the targeting oligos that we designed. Each oligo was ordered to delete a certain region of the genome, as specified by a left and right genomic position, corresponding roughly to the start and end of the target genes. Here we read in the information for the different TO libraries.

df_TOs <- read_csv("../../../ecoli_ORBIT_paper_final/data/seq_data/left_side_TO_libs/df_TOs.csv")

df_TOs
## # A tibble: 420 × 7
##      lib gene  left_oligo_pos right_oligo_pos oligo              product go_term
##    <dbl> <chr>          <dbl>           <dbl> <chr>              <chr>   <chr>  
##  1     1 alpA         2758649         2758836 CTCTCTGCAACCAAAGT… CP4-57… DNA bi…
##  2     1 argR         3384708         3385153 GCACAATAATGTTGTAT… DNA-bi… DNA bi…
##  3     1 ariR         1216374         1216615 AAACTTATACTTAATAA… putati… protei…
##  4     1 arsR         3648533         3648861 TCGAAGAGAGACACTAC… DNA-bi… transc…
##  5     1 asnC         3926565         3926998 TAAAATAGAATGAATCA… DNA-bi… protei…
##  6     1 bolA          454477          454769 GCGCCGACAGTTGCAAA… DNA-bi… protei…
##  7     1 caiF           34305           34675 ACGCCCGGTATCAACGA… DNA-bi… DNA-bi…
##  8     1 cedA         1813441         1813658 GCAAAACCGGGAGGTAT… cell d… DNA bi…
##  9     1 cspA         3720054         3720241 TCGCCGAAAGGCACACT… cold s… nuclei…
## 10     1 cspC         1907246         1907430 CACACTTCAGATCAGTG… stress… nuclei…
## # … with 410 more rows

Figure out starting position of read for each TO. Remember that only the fwd sequence of attB was used in the actual oligo to prevent PCR issues. Normally the attB sequence would be directed the same way as the gene, but here the only thing that affects the directionality is the replichore. So for replichore 1 (>3.9M & <1.6M), the oligo sequence comes from the reverse strand, meaning that the attB also points “reverse”. Therefore when sequencing the “right side library” the read maps to the forward + strand and the “left side library” maps to the reverse - strand. For replichore 2 it is the opposite. The TO sequence comes from the + strand, meaning attB points fwd. Therefore the upstream (aka right) library maps to the - strand and the downstream (aka left) maps to the + strand.

We can see the evidence that this is actually how the ORBIT library sequences are oriented by looking at which strand the reads map to vs. the genomic position (replichore).

ggplot(df_data_all %>% mutate(strand = factor(strand, levels = c('-','+'))) , 
       aes(x = start, y = strand)) + 
  geom_vline(xintercept = c(1590250.5,3923882.5), color = 'red')+
  geom_jitter(height=0.25, shape =21, alpha = 0.01)+
  facet_wrap(~lib, ncol = 1, scales = 'free')

It is clear that most reads are coming from the + strand on replichore 1 for all libraries, and the - strand for replichore 2 as expected. Interestingly, the small number of reads that do not conform to this, may indicate an off target integration that wasn’t specified in our designed TO library.

This orientation issue is important to understand, because we need to be able to match the mapped reads to where we expect a read to start if the designed TO worked perfectly. Now that we know our orientation intuition seems right, we can look at the list of TOs and assign what should be the theoretical “read_start_position” if the mutant were made perfectly. Remember these oligos were specified by the region between two genomic positions (left_oligo_pos & right_oligo_pos), so the first genomic sequence outside of these insertions should either be the left or right specified position. The upstream junction (aka right) should start at the right_oligo_pos for replichore 1 targets, and the left_oligo_pos for replichore 2 targets based on their orientations.

Therefore, we will just assign the correct oligo pos to the TOs.

ori <-  3923882.5
ter <-  1590250.5

##*** CHANGING FOR RIGHT SIDE

df_TOs <- df_TOs %>% 
  mutate(gen_rep = ifelse(left_oligo_pos < ori & left_oligo_pos > ter, 2,1)) %>% 
  mutate(read_start_pos = ifelse(gen_rep==1, right_oligo_pos,left_oligo_pos)) %>% 
  mutate(read_start_pos = ifelse(gene == 'metA', left_oligo_pos, read_start_pos)) #metA is the only special case, because it uses the - direction attB site in lib 0.

df_TOs 
## # A tibble: 420 × 9
##      lib gene  left_oligo_pos right_olig…¹ oligo product go_term gen_rep read_…²
##    <dbl> <chr>          <dbl>        <dbl> <chr> <chr>   <chr>     <dbl>   <dbl>
##  1     1 alpA         2758649      2758836 CTCT… CP4-57… DNA bi…       2 2758649
##  2     1 argR         3384708      3385153 GCAC… DNA-bi… DNA bi…       2 3384708
##  3     1 ariR         1216374      1216615 AAAC… putati… protei…       1 1216615
##  4     1 arsR         3648533      3648861 TCGA… DNA-bi… transc…       2 3648533
##  5     1 asnC         3926565      3926998 TAAA… DNA-bi… protei…       1 3926998
##  6     1 bolA          454477       454769 GCGC… DNA-bi… protei…       1  454769
##  7     1 caiF           34305        34675 ACGC… DNA-bi… DNA-bi…       1   34675
##  8     1 cedA         1813441      1813658 GCAA… cell d… DNA bi…       2 1813441
##  9     1 cspA         3720054      3720241 TCGC… cold s… nuclei…       2 3720054
## 10     1 cspC         1907246      1907430 CACA… stress… nuclei…       2 1907246
## # … with 410 more rows, and abbreviated variable names ¹​right_oligo_pos,
## #   ²​read_start_pos

There is one exception here, which is the metA oligo. This is because this oligo was only in the library 0 from IDT, which did not require PCR amplification, so the same oligo sequence was used as from the rest of the paper, which has attB facing the opposite orientation as the scheme described above (attB faces the same way as the gene).

Match reads to putative TOs

With all that sorted we can now iterate through every read in the dataset and try to find the closest TO match. Then we can evaluate how close the reads are to the expected start sites for their inferred TO.

This for loop achieves this very simply, by iterating through each read, then measuring the distance between the read start position and the start positions of every TO designed for that library. The nearest TO is assigned to the read as a ‘gene_id’ and that minimum distance is recorded with the read. This loop takes ~30 min to run on a desktop, but the csv output can be read in without running.

# Takes ~30 min to run

#Initialize dataframe to store reads
df_read_nearest <- tibble()

#Based on alignment, if read maps to + strand, then the illumina start of the read is the start of the alignment.
#if read maps to - strand then illumina read start is actually the end of the mapped read
df_test <- df_data_all %>% 
  mutate(read_start_pos = ifelse(strand =='+', start, end))

#Pre split the different TOs into dataframes by library to make loop run faster.
df_TO_0 <- df_TOs %>% filter(lib==0)
df_TO_1 <- df_TOs %>% filter(lib==1)
df_TO_2 <- df_TOs %>% filter(lib==2)
df_TO_3 <- df_TOs %>% filter(lib==3)

#Iterates through each mapped read
for( row in 1:nrow(df_test)){
  
  #picks the set of TOs that matches the library number
  if(df_test$lib[row]==0){
    df_to <- df_TO_0
    }else if(df_test$lib[row]==1){
    df_to <- df_TO_1
    }else if(df_test$lib[row]==2){
    df_to <- df_TO_2
    }else if(df_test$lib[row]==3){
    df_to <- df_TO_3
    }
  
  #Compares the `read start pos` of each designed TO in the library, to the `read start pos` of the read (vector).
  #Then finds the minimum difference min(abs()) of the distances between this read and all TOs
  min_val <- min(abs(df_to$read_start_pos - df_test$read_start_pos[row]))
  
  #Same as above, but returns the TO index of the min val, 
  #which is then used to get the actual TO that has the closest start pos to the read. 
  #The gene this TO deletes is assigned to the read as `gene_id` and the min_val is also stored
  min_index <- which.min(abs(df_to$read_start_pos - df_test$read_start_pos[row]))
  gene_id <- df_to$gene[min_index]
  df_reads <- tibble(df_test[row,],gene_id = gene_id, min_val = min_val) 
  
  #add this read to the growing dataframe of processed reads
  df_read_nearest <- bind_rows(df_read_nearest, df_reads)
  
} 

df_read_nearest

write_csv(df_read_nearest, '../../data/seq_data/left_side_TO_libs/df_read_nearest_left.csv')

Now that we have gone through and categorized each read and calculated the minimum distance to a TO target, we can further classify reads as perfect (min_val=0) or as off target if it falls beyond 500bp away (min_val>500). Then we can get summary statistics. What fraction of reads are off target?

df_read_nearest <- read_csv('../../data/seq_data/left_side_TO_libs/df_read_nearest_left.csv')

df_read_nearest <- df_read_nearest %>% 
  mutate(gene_id = ifelse(min_val >500, 'off_target',gene_id)) %>% 
  mutate(off_target = ifelse(gene_id=='off_target',T,F)) %>% 
  mutate(perfect = ifelse(min_val==0, T, F))

df_read_nearest %>% group_by(lib, off_target) %>% summarise(n=n()) %>% 
  pivot_wider(names_from = off_target, names_prefix = 'off_target_', values_from = n) %>% 
  mutate(total =off_target_TRUE + off_target_FALSE) %>% 
  mutate(frac_off_target = off_target_TRUE / (off_target_TRUE + off_target_FALSE))
## # A tibble: 4 × 5
## # Groups:   lib [4]
##     lib off_target_FALSE off_target_TRUE total frac_off_target
##   <dbl>            <int>           <int> <int>           <dbl>
## 1     0            24497            2517 27014          0.0932
## 2     1            72541            1000 73541          0.0136
## 3     2            48883            1822 50705          0.0359
## 4     3            52633             940 53573          0.0175

Less than 4% of reads are off target across all the twist libraries! About 9.3% of reads were off target for the IDT lib #0. And we can also look at how many reads were perfect:

df_read_nearest %>% group_by(lib, perfect) %>% summarise(n=n()) %>% 
  pivot_wider(names_from = perfect, names_prefix = 'perfect_', values_from = n) %>% 
  mutate(total = perfect_TRUE + perfect_FALSE) %>% 
  mutate(frac_perfect = perfect_TRUE / (perfect_TRUE + perfect_FALSE))
## # A tibble: 4 × 5
## # Groups:   lib [4]
##     lib perfect_FALSE perfect_TRUE total frac_perfect
##   <dbl>         <int>        <int> <int>        <dbl>
## 1     0          2859        24155 27014        0.894
## 2     1          1814        71727 73541        0.975
## 3     2          2429        48276 50705        0.952
## 4     3          2287        51286 53573        0.957

More than 95% of reads from all the twist libraries were perfect (>89% of the IDT lib0)! We can visualize the perfect and off target score graphically, which is shown as a summary metric in the main fig:

df_read_nearest %>% 
  mutate(status = paste0('off_',off_target, '_p_',perfect)) %>% 
  mutate(status = fct_relevel(status, 'off_TRUE_p_FALSE','off_FALSE_p_FALSE','off_FALSE_p_TRUE')) %>% 
  ggplot(aes(x = lib, fill = status)) + 
    geom_bar(color = 'black',position = 'fill') +
    scale_y_continuous(labels = scales::label_percent())+
    scale_fill_viridis_d(labels = c('Off target / Not Perfect','On Target / Not Perfect','On Target / Perfect'), option = 'A')  + 
    labs(y = 'Reads', x = 'Library', fill = 'Category')

How many of the intended TO mutations were recovered?

Instead of taking a “read-centric” view, we can instead take a “target-centric” view and ask how many of the TOs we designed actually yielded mutants in our libraries. First we will count up all the reads for each gene_id - this will also be useful later.

#Group by lib and gene and count up reads assigned to each gene
df_read_nearest_summary <- df_read_nearest %>% 
  group_by(lib, gene_id) %>% 
  summarise( n=n(), min_val = mean(min_val))

df_read_nearest_summary %>% head(10) %>% kable() %>% kable_styling()
lib gene_id n min_val
0 acrR 1079 0.0787766
0 alaS 1 0.0000000
0 argP 468 0.0064103
0 arsR 117 0.1025641
0 envR 51 0.1176471
0 fnr 121 0.0247934
0 fur 19 0.0000000
0 galK 2040 0.0470588
0 galR 1622 0.0197287
0 hisA 2476 0.0076737

And then we will see if there are any TOs that have no reads. From there we can look at the fraction of each TO library for which a mutant was not recovered (within the 500bp threshold).

#Merge the read counts 
df_TOs_found <- left_join(df_TOs, df_read_nearest_summary, by = c('gene'='gene_id', 'lib')) %>% 
  mutate(found = ifelse(is.na(n) & is.na(min_val), F, T))

df_TOs_found %>% 
  group_by(lib, found) %>% 
  summarise(n=n()) %>% 
  pivot_wider(names_from = found, names_prefix = 'found_', values_from = n, values_fill = 0) %>% 
  mutate(frac_found = found_TRUE / (found_TRUE + found_FALSE))
## # A tibble: 4 × 4
## # Groups:   lib [4]
##     lib found_TRUE found_FALSE frac_found
##   <dbl>      <int>       <int>      <dbl>
## 1     0         27           0      1    
## 2     1         70           4      0.946
## 3     2        206          20      0.912
## 4     3         88           5      0.946

So for all 4 libraries, at least 90% of the TOs were found with mutants represented. These are the mutants that were not found:

df_TO_not_found <- df_TOs_found %>% filter(found == F)

df_TO_not_found %>% select(-go_term,-oligo) %>% kable() %>% kable_styling()
lib gene left_oligo_pos right_oligo_pos product gen_rep read_start_pos n min_val found
1 dicA 1647939 1648321 DNA-binding transcriptional dual regulator DicA 2 1647939 NA NA FALSE
1 hns 1292529 1292917 DNA-binding transcriptional dual regulator H-NS 1 1292917 NA NA FALSE
1 mqsA 3167871 3168241 antitoxin of the MqsRA toxin-antitoxin system / DNA-binding transcriptional repressor MqsA 2 3167871 NA NA FALSE
1 racR 1419785 1420236 Rac prophage; DNA-binding transcriptional repressor RacR 1 1420236 NA NA FALSE
2 ada 2309361 2310400 DNA-binding transcriptional dual regulator / DNA repair protein Ada 2 2309361 NA NA FALSE
2 alaS 2819401 2822006 alanine&mdash;tRNA ligase/DNA-binding transcriptional repressor 2 2819401 NA NA FALSE
2 arcA 4639610 4640301 ArcA-P<sup>asp54</sup>-phosphate // Phosphorylated DNA-binding transcriptional dual regulator ArcA // DNA-binding transcriptional dual regulator ArcA 1 4640301 NA NA FALSE
2 birA 4173087 4174027 DNA-binding transcriptional repressor/biotin-[acetyl-CoA-carboxylase] ligase BirA 1 4174027 NA NA FALSE
2 dcuR 4349335 4350029 DcuR // DNA-binding transcriptional activator DcuR 1 4350029 NA NA FALSE
2 dnaA 3882346 3883724 chromosomal replication initiator protein DnaA 2 3882346 NA NA FALSE
2 fnr 1398794 1399521 DNA-binding transcriptional dual regulator FNR 1 1399521 NA NA FALSE
2 gntR 3577751 3578721 DNA-binding transcriptional repressor GntR 2 3577751 NA NA FALSE
2 lexA 4257120 4257703 DNA-binding transcriptional repressor LexA 1 4257703 NA NA FALSE
2 metR 4011883 4012811 DNA-binding transcriptional dual regulator MetR 1 4012811 NA NA FALSE
2 nusA 3316059 3317521 transcription termination/antitermination protein NusA 2 3316059 NA NA FALSE
2 obgE 3330602 3331749 GTPase ObgE 2 3330602 NA NA FALSE
2 rpoD 3213052 3214868 RNA polymerase, sigma 70 (sigma D) factor 2 3213052 NA NA FALSE
2 rpoE 2709457 2710007 RNA polymerase sigma E factor 2 2709457 NA NA FALSE
2 rpoH 3599949 3600778 RNA polymerase, sigma 32 (sigma H) factor 2 3599949 NA NA FALSE
2 sgrR 75664 77294 DNA-binding transcriptional dual regulator SgrR 1 77294 NA NA FALSE
2 uvrY 1994723 1995354 Phosphorylated DNA-binding transcriptional activator UvrY // DNA-binding transcriptional activator UvrY 2 1994723 NA NA FALSE
2 yfiE 2714459 2715315 putative LysR-type DNA-binding transcriptional regulator YfiE 2 2714459 NA NA FALSE
2 yheO 3475738 3476435 DNA-binding transcriptional regulator YheO 2 3475738 NA NA FALSE
2 yidP 3863904 3864595 putative DNA-binding transcriptional regulator YidP 2 3863904 NA NA FALSE
3 ffs 476447 476562 signal recognition particle 4.5S RNA 1 476562 NA NA FALSE
3 ohsC 2700519 2700599 small regulatory RNA OhsC 2 2700519 NA NA FALSE
3 orzP 2204401 2204485 small RNA OrzP 2 2204401 NA NA FALSE
3 sroE 2640594 2640687 small RNA SroE 2 2640594 NA NA FALSE
3 tff 189711 189848 putative small RNA T44 1 189848 NA NA FALSE

There are 29 spread across the 3 twist libraries. We will investigate in further detail in combination with the left side library.

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.6.2               viridisLite_0.4.1          
##  [3] ggridges_0.5.4              kableExtra_1.3.4           
##  [5] GenomicAlignments_1.34.0    Rsamtools_2.14.0           
##  [7] Biostrings_2.66.0           XVector_0.38.0             
##  [9] SummarizedExperiment_1.28.0 Biobase_2.58.0             
## [11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
## [13] GenomicRanges_1.50.1        GenomeInfoDb_1.34.4        
## [15] IRanges_2.32.0              S4Vectors_0.36.0           
## [17] BiocGenerics_0.44.0         forcats_0.5.2              
## [19] stringr_1.5.0               dplyr_1.0.10               
## [21] purrr_0.3.5                 readr_2.1.3                
## [23] tidyr_1.2.1                 tibble_3.1.8               
## [25] ggplot2_3.4.0               tidyverse_1.3.2            
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           fs_1.5.2               bit64_4.0.5           
##  [4] lubridate_1.9.0        webshot_0.5.4          httr_1.4.4            
##  [7] tools_4.2.0            backports_1.4.1        bslib_0.4.1           
## [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.3             
## [13] colorspace_2.0-3       withr_2.5.0            gridExtra_2.3         
## [16] tidyselect_1.2.0       bit_4.0.5              compiler_4.2.0        
## [19] cli_3.4.1              rvest_1.0.3            xml2_1.3.3            
## [22] DelayedArray_0.24.0    labeling_0.4.2         sass_0.4.4            
## [25] scales_1.2.1           systemfonts_1.0.4      digest_0.6.30         
## [28] svglite_2.1.0          rmarkdown_2.18         pkgconfig_2.0.3       
## [31] htmltools_0.5.3        highr_0.9              dbplyr_2.2.1          
## [34] fastmap_1.1.0          rlang_1.0.6            readxl_1.4.1          
## [37] rstudioapi_0.14        farver_2.1.1           jquerylib_0.1.4       
## [40] generics_0.1.3         jsonlite_1.8.3         vroom_1.6.0           
## [43] BiocParallel_1.32.4    googlesheets4_1.0.1    RCurl_1.98-1.9        
## [46] magrittr_2.0.3         GenomeInfoDbData_1.2.9 Matrix_1.5-3          
## [49] munsell_0.5.0          fansi_1.0.3            lifecycle_1.0.3       
## [52] stringi_1.7.8          yaml_2.3.6             zlibbioc_1.44.0       
## [55] grid_4.2.0             parallel_4.2.0         crayon_1.5.2          
## [58] lattice_0.20-45        haven_2.5.1            hms_1.1.2             
## [61] knitr_1.41             pillar_1.8.1           codetools_0.2-18      
## [64] reprex_2.0.2           glue_1.6.2             evaluate_0.18         
## [67] modelr_0.1.10          vctrs_0.5.1            tzdb_0.3.0            
## [70] cellranger_1.1.0       gtable_0.3.1           assertthat_0.2.1      
## [73] cachem_1.0.6           xfun_0.35              broom_1.0.1           
## [76] googledrive_2.0.0      gargle_1.2.1           timechange_0.1.1      
## [79] ellipsis_0.3.2