Notes

This notebook includes code to generate figures S12 and S13. The purpose of this notebook is to examine what factors might explain the differences in mutant abundance (read counts) within the ORBIT libraries.

The read counts for each gene target comes from the combined analysis fo junctions 1 and 2 in this code notebook: code/seq_data_processing/upstream_downstream_combined_analysis.Rmd


Setup packages and plotting for the notebook:

# Check packages
source("../tools/package_setup.R")
# Load packages
library(tidyverse)
library(cowplot)
library(patchwork)
library(kableExtra)
library(ggrastr)
library(ggdist)
library(ggridges)
library(modelr)
library(XNAString)
library(Biostrings)
# Code display options
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)
# Load plotting tools
source("../tools/plotting_tools.R")
#Modify the plot theme
theme_set(theme_notebook())

Upstream vs downstream junctions

First, we will start by reading in the junction 1 and junction 2 read count data from the combined analysis.

df_TOs_found_up_down <- read_csv("../seq_data_processing/df_TOs_found_up_down.csv") %>% mutate(n_down = ifelse(is.na(n_down), 0, n_down), n_up = ifelse(is.na(n_up), 0, n_up))

df_TOs_found_up_down %>% select(lib, gene, n_down, n_up, found_score)
## # A tibble: 420 × 5
##      lib gene  n_down  n_up found_score
##    <dbl> <chr>  <dbl> <dbl>       <dbl>
##  1     1 alpA    1622  1906           2
##  2     1 argR     170   283           2
##  3     1 ariR     115    34           2
##  4     1 arsR     131   422           2
##  5     1 asnC     550   429           2
##  6     1 bolA     969  1745           2
##  7     1 caiF     789  3308           2
##  8     1 cedA     195   407           2
##  9     1 cspA      80   187           2
## 10     1 cspC     339  1137           2
## # … with 410 more rows

Let’s start by looking at the distribution of read counts for targets within each library.

plot_n_up <- ggplot(df_TOs_found_up_down, aes(x = factor(lib), y = n_up, shape = factor(lib))) + 
  geom_boxplot() + 
  geom_jitter( width = 0.2, height =0, alpha = 0.1)+ 
  scale_y_log10(limits = c(NA, 16500)) + 
  scale_x_discrete(labels = c('oPool', 'short TFs','long TFs','small RNAs'))+
  scale_shape_manual(values = c(4, 21, 22, 24))+
  labs(x = "Library", y = 'Perfect reads per target', title = 'Upstream junction')+
  guides(shape = 'none')

#plot_n_up

plot_n_down <- ggplot(df_TOs_found_up_down, aes(x = factor(lib), y = n_down, shape = factor(lib))) + 
  geom_boxplot() + 
  geom_jitter(width = 0.2, height =0, alpha = 0.1) + 
  scale_y_log10(limits = c(NA, 16500))+
  scale_x_discrete(labels = c('oPool', 'short TFs','long TFs','small RNAs')) +
  scale_shape_manual(values = c(4, 21, 22, 24))+
  labs(x = "Library", y = 'Perfect reads per target', title = 'Downstream junction') + 
  guides(shape = 'none')

#plot_n_down

plot_n_down | plot_n_up

We can see that the putative abundance varies greatly from ~1 read to 10,000 in several cases. What explains these differences in read abundance (and possibly ORBIT efficiency).

Let’s look at the most obvious correlation that we expect to find. Junction 1 and Junction 2 read abundances should correlate, assuming that they are each present in the same mutant cells. To compare fairly accross libraries we will normalize the read counts to the total counts from the library.

df_counts_norm <- df_TOs_found_up_down %>% 
  group_by(lib) %>% 
  mutate(down_total = sum(n_down, na.rm = T), up_total = sum(n_up,na.rm = T)) %>% 
  mutate(down_norm = n_down / down_total, up_norm = n_up / up_total)

df_counts_norm %>% select(lib, gene, n_down, n_up, found_score, down_norm, up_norm)
## # A tibble: 420 × 7
## # Groups:   lib [4]
##      lib gene  n_down  n_up found_score down_norm  up_norm
##    <dbl> <chr>  <dbl> <dbl>       <dbl>     <dbl>    <dbl>
##  1     1 alpA    1622  1906           2   0.0226  0.0160  
##  2     1 argR     170   283           2   0.00237 0.00238 
##  3     1 ariR     115    34           2   0.00160 0.000286
##  4     1 arsR     131   422           2   0.00183 0.00355 
##  5     1 asnC     550   429           2   0.00767 0.00361 
##  6     1 bolA     969  1745           2   0.0135  0.0147  
##  7     1 caiF     789  3308           2   0.0110  0.0278  
##  8     1 cedA     195   407           2   0.00272 0.00342 
##  9     1 cspA      80   187           2   0.00112 0.00157 
## 10     1 cspC     339  1137           2   0.00473 0.00956 
## # … with 410 more rows

Now we can fit a linear model to the log transformed normalized data and assess graphically and statistically.

#Fit linear model on log transformed data, excluding zero counts.
summary(lm(up_norm ~ down_norm, data = df_counts_norm %>% filter(down_norm>0 & up_norm >0) %>% mutate(left_norm = log(down_norm, 10), right_norm = log(up_norm, 10))))
## 
## Call:
## lm(formula = up_norm ~ down_norm, data = df_counts_norm %>% filter(down_norm > 
##     0 & up_norm > 0) %>% mutate(left_norm = log(down_norm, 10), 
##     right_norm = log(up_norm, 10)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.061247 -0.002441 -0.001217  0.001288  0.081796 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0015348  0.0006239    2.46   0.0143 *  
## down_norm   0.8527150  0.0276068   30.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01085 on 382 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.7133 
## F-statistic: 954.1 on 1 and 382 DF,  p-value: < 2.2e-16
plot_up_down_cor <- ggplot(df_counts_norm, aes(x = down_norm, y = up_norm, shape = factor(lib), group = 'all')) + 
  geom_smooth(method = 'lm', se = F, alpha = 0.75)+
  #geom_abline(slope = 0.8478, intercept = 0.564, color = 'red')+
  geom_point(color = 'black', alpha = 0.25) + 
  scale_x_log10() + scale_y_log10() + 
  scale_shape_manual(values = c(4, 21, 22, 24), labels = c('oPool', 'short TFs','long TFs','small RNAs'))+
  labs(x = 'Downstream junction\nNormalized perfect reads per target', y = 'Upstream junction\nNormalized perfect reads per target', shape = "Library", caption = 'Adj R2 = 0.71, p = 2.2e-16')


plot_up_down_cor

Indeed Junction 1 and 2 read counts correlate quite well. Here is the combined figure that shows up in Fig. S12.

plot_grid(plot_n_down, plot_n_up, plot_up_down_cor + guides(shape ='none'), ncol = 3, align = 'hv', axis = 'lr')

Prepare oligo variables

With that first quality control question out of the way, now we will examine other oligo parameters and how well they correlate with putative mutant abundance. First we need to calculate and prepare all the variables we need.

TO abundances

First we will add the TO abundance data (twist libraries only) to our df_counts_norm dataset. Let’s read in the TO abundance data from code/seq_data_processing/TO_abundance

df_abundance <- read_csv("../seq_data_processing/TO_abundance/df_TO_abundances.csv")

df_abundance %>% select(lib, gene, TO_reads)
## # A tibble: 393 × 3
##      lib gene  TO_reads
##    <dbl> <chr>    <dbl>
##  1     1 alpA      4403
##  2     1 argR      7478
##  3     1 ariR      3283
##  4     1 arsR      2577
##  5     1 asnC      2621
##  6     1 bolA      4313
##  7     1 caiF      3715
##  8     1 cedA      2421
##  9     1 cspA      2916
## 10     1 cspC      5415
## # … with 383 more rows

Let’s normalize the TO counts to each library total and add this to the df_counts_norm dataframe.

df_abundance_norm <- df_abundance %>% 
  select(gene, lib, to_abun = TO_reads) %>% #**focus on perfect matches, rename to to_abun
  group_by(lib) %>% 
  mutate(lib_total = sum(to_abun, na.rm = T)) %>% 
  mutate(to_abun_norm = to_abun / lib_total)

df_counts_norm_TO <- left_join(df_counts_norm,df_abundance_norm, 
                               by = c('gene','lib'))

df_counts_norm_TO %>% select(gene, lib, to_abun, to_abun_norm)
## # A tibble: 420 × 4
## # Groups:   lib [4]
##    gene    lib to_abun to_abun_norm
##    <chr> <dbl>   <dbl>        <dbl>
##  1 alpA      1    4403      0.0148 
##  2 argR      1    7478      0.0252 
##  3 ariR      1    3283      0.0110 
##  4 arsR      1    2577      0.00867
##  5 asnC      1    2621      0.00882
##  6 bolA      1    4313      0.0145 
##  7 caiF      1    3715      0.0125 
##  8 cedA      1    2421      0.00814
##  9 cspA      1    2916      0.00981
## 10 cspC      1    5415      0.0182 
## # … with 410 more rows

Oligo free energy

The major variable we need to calculate are the free energies for the targeting oligos. The following code cell iterates through each targeting oligo and uses the XNAstring package to calculate the oligo monomer free energy, oligo homodimer free energy and the free energy of each homology arm duplex (complexed with homology on genome).

#takes a min or two to run

df_oligo_energy <- tibble()

for(row in 1:nrow(df_TOs_found_up_down)){
  
  oligo <- df_TOs_found_up_down$oligo[row]
  
  #print(row)
  
  oligo_len <- nchar(oligo)
  
  arm_len <- (oligo_len-38)/2
  
  arm_1 <- str_sub(oligo, start = 1, end = arm_len)
  
  arm_2 <- str_sub(oligo, oligo_len - arm_len + 1, oligo_len)
  
  oligo_xna <- XNAString(base = toupper(oligo))
  
  oligo_homodimer_xna <- XNAString(base = c(toupper(oligo),toupper(oligo)))
  
  arm_1_xna <- XNAString(base = c(arm_1, as.character(reverseComplement(DNAString(arm_1)))))
  
  arm_2_xna <- XNAString(base = c(arm_2, as.character(reverseComplement(DNAString(arm_2)))))
  
  oligo_monomer <- predictMfeStructure(oligo_xna)
  
  oligo_homodimer <- predictDuplexStructure(oligo_homodimer_xna)
  
  arm_1_duplex <- predictDuplexStructure(arm_1_xna)
  
  arm_2_duplex <- predictDuplexStructure(arm_2_xna)
  
  oligo_energy <- bind_cols(df_TOs_found_up_down[row,],
                            tibble('oligo_monomer' = oligo_monomer$mfe, 
                                   'oligo_homodimer'=oligo_homodimer$mfe, 
                                   'arm_1_duplex'=arm_1_duplex$mfe,
                                   'arm_2_duplex'=arm_2_duplex$mfe))
  
  df_oligo_energy <- bind_rows(df_oligo_energy, oligo_energy)
  
  
}

#write_csv(df_oligo_energy, "df_oligo_energy.csv")

#df_oligo_energy

We can read in df_oligo energy here without running the above cell.

df_oligo_energy <- read_csv("df_oligo_energy.csv")

df_oligo_energy %>% select(lib, gene, oligo_monomer, oligo_homodimer, arm_1_duplex, arm_2_duplex)
## # A tibble: 420 × 6
##      lib gene  oligo_monomer oligo_homodimer arm_1_duplex arm_2_duplex
##    <dbl> <chr>         <dbl>           <dbl>        <dbl>        <dbl>
##  1     1 alpA          -30             -67          -84.3        -75.7
##  2     1 argR          -38.3           -82.7        -82.5        -93.2
##  3     1 ariR          -27             -64.1        -66.9        -70.2
##  4     1 arsR          -28             -63          -87.7        -65.2
##  5     1 asnC          -30.1           -68.1        -70.6        -82.3
##  6     1 bolA          -39.2           -86.8        -91.7        -87.2
##  7     1 caiF          -25.1           -59.3        -80.6        -77.9
##  8     1 cedA          -27             -60.7        -89.1        -70  
##  9     1 cspA          -33.3           -78.3        -77.7        -82.6
## 10     1 cspC          -36.8           -81.1        -82.1        -81.3
## # … with 410 more rows

Other

Once we have calculated these values, we need to do some slight rearranging, by adding the homology arm duplex free energies together to get a combined homology free energy. We will also calculate the distance from the target to the origin of replication.

#genomic positions for origin, terminus and genome length.
ori <- 3923882.5
ter <- 1590250
gen_len <- 4641652

df_vars <- left_join(df_counts_norm_TO, df_oligo_energy %>% select(lib, gene, oligo_monomer, oligo_homodimer, arm_1_duplex, arm_2_duplex), by =c('gene', 'lib')) %>%
  mutate(arm_total_duplex = arm_1_duplex + arm_2_duplex) %>% # get total homology arm free energy
  mutate(ori_dist = ifelse(left_oligo_pos > ter, abs(ori-left_oligo_pos), gen_len-ori+left_oligo_pos)) %>%  #Calculate distance to origin from vals above
  mutate(del_len = right_oligo_pos - left_oligo_pos) %>%  #calculate deletion length
  mutate(lib = factor(lib,levels = c("0","1","2","3"),labels = c('oPool','short TFs','long TFs','small RNAs'))) #organize lib names and levels

#write_csv(df_vars, "df_vars.csv")

df_vars %>% select(lib, gene, arm_total_duplex, ori_dist, to_abun, del_len)
## # A tibble: 420 × 6
## # Groups:   lib [4]
##    lib       gene  arm_total_duplex ori_dist to_abun del_len
##    <fct>     <chr>            <dbl>    <dbl>   <dbl>   <dbl>
##  1 short TFs alpA             -160  1165234.    4403     187
##  2 short TFs argR             -176.  539174.    7478     445
##  3 short TFs ariR             -137. 1934144.    3283     241
##  4 short TFs arsR             -153.  275350.    2577     328
##  5 short TFs asnC             -153.    2682.    2621     433
##  6 short TFs bolA             -179. 1172246.    4313     292
##  7 short TFs caiF             -158.  752074.    3715     370
##  8 short TFs cedA             -159. 2110442.    2421     217
##  9 short TFs cspA             -160.  203828.    2916     187
## 10 short TFs cspC             -163. 2016636.    5415     184
## # … with 410 more rows

Model Oligo parameters vs. mutant abundance

Now that we have prepared all of the oligo variables, we can fit linear models to see how well each variable correlates with the observed mutant library reads.

Deletion length

del_len_mod <- function(df) {
  lm(up_log~del_len_log , data = df)
}

df_del_len <- df_vars %>% 
  filter(n_up>0) %>% 
  mutate(up_log = log(n_up,10), del_len_log = log(del_len, 10)) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, del_len_mod)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_vars %>% group_by(lib) %>% summarise(min_del_len = min(del_len), max_del_len = max(del_len))
## # A tibble: 4 × 3
##   lib        min_del_len max_del_len
##   <fct>            <dbl>       <dbl>
## 1 oPool              292        2680
## 2 short TFs          175         547
## 3 long TFs           550        3937
## 4 small RNAs          54         370
df_del_len %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups:   lib [4]
##   lib   data     model r.squ…¹ adj.r.…² sigma stati…³ p.value    df logLik   AIC
##   <fct> <list>   <lis>   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 shor… <tibble> <lm>  1.71e-2  0.00242 0.753  1.16     0.284     1  -77.3 161. 
## 2 long… <tibble> <lm>  6.37e-3  0.00153 0.648  1.31     0.253     1 -203.  412. 
## 3 smal… <tibble> <lm>  6.50e-4 -0.0115  0.880  0.0533   0.818     1 -107.  221. 
## 4 oPool <tibble> <lm>  3.95e-2  0.00112 0.796  1.03     0.320     1  -31.1  68.2
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>, summary <list>, and abbreviated variable names ¹​r.squared,
## #   ²​adj.r.squared, ³​statistic
df_del_len %>% unnest(summary)
## # A tibble: 8 × 11
## # Groups:   lib [4]
##   lib      data     model glance   term  estim…¹ std.e…² stati…³ p.value conf.…⁴
##   <fct>    <list>   <lis> <list>   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 short T… <tibble> <lm>  <tibble> (Int…   0.973   1.63    0.597 5.53e-1  -2.28 
## 2 short T… <tibble> <lm>  <tibble> del_…   0.696   0.645   1.08  2.84e-1  -0.591
## 3 long TFs <tibble> <lm>  <tibble> (Int…   3.40    0.986   3.44  6.99e-4   1.45 
## 4 long TFs <tibble> <lm>  <tibble> del_…  -0.385   0.336  -1.15  2.53e-1  -1.05 
## 5 small R… <tibble> <lm>  <tibble> (Int…   2.65    0.965   2.74  7.52e-3   0.726
## 6 small R… <tibble> <lm>  <tibble> del_…  -0.108   0.469  -0.231 8.18e-1  -1.04 
## 7 oPool    <tibble> <lm>  <tibble> (Int…   4.77    2.02    2.36  2.62e-2   0.612
## 8 oPool    <tibble> <lm>  <tibble> del_…  -0.711   0.701  -1.01  3.20e-1  -2.15 
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## #   ¹​estimate, ²​std.error, ³​statistic, ⁴​conf.low
del_len_labs <- df_del_len %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

plot_del_len <- ggplot(df_vars, aes(x = del_len, y = n_up)) +
  geom_smooth(method = 'lm')+ 
  geom_point(shape = 21, alpha = 0.25) +
  #geom_text(data = del_len_labs, aes())
  scale_x_log10() + scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') + 
  labs(x = "Deletion length (bp)", y = 'Reads per target (upstream junction)', caption = paste(del_len_labs$labels, collapse = "\n"), title = 'Deletion length')

plot_del_len

Targeting oligo abundance

to_abun_mod <- function(df) {
  lm(up_log~to_abun , data = df)
}

df_to_abun <- df_vars %>% 
  filter(n_up>0) %>% 
  filter(lib != "oPool") %>% 
  mutate(up_log = log(n_up,10)) %>% 
 # mutate(lib = factor(lib,labels = c('short TFs','long TFs','small RNAs'))) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, to_abun_mod)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_to_abun %>% unnest(glance)
## # A tibble: 3 × 16
## # Groups:   lib [3]
##   lib    data     model r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC
##   <fct>  <list>   <lis>   <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 short… <tibble> <lm>   0.137   0.124  0.706   10.6  1.76e-3     1  -72.8  152.
## 2 long … <tibble> <lm>   0.0619  0.0573 0.630   13.5  3.00e-4     1 -197.   400.
## 3 small… <tibble> <lm>   0.0894  0.0783 0.840    8.05 5.74e-3     1 -104.   213.
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>, summary <list>, and abbreviated variable names ¹​r.squared,
## #   ²​adj.r.squared, ³​statistic
df_to_abun %>% unnest(summary)
## # A tibble: 6 × 11
## # Groups:   lib [3]
##   lib     data     model glance   term  estim…¹ std.e…² stati…³  p.value conf.…⁴
##   <fct>   <list>   <lis> <list>   <chr>   <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
## 1 short … <tibble> <lm>  <tibble> (Int… 1.98e+0 2.46e-1    8.03 2.16e-11 1.49e+0
## 2 short … <tibble> <lm>  <tibble> to_a… 1.87e-4 5.75e-5    3.26 1.76e- 3 7.26e-5
## 3 long T… <tibble> <lm>  <tibble> (Int… 1.81e+0 1.30e-1   13.9  1.97e-31 1.56e+0
## 4 long T… <tibble> <lm>  <tibble> to_a… 5.66e-4 1.54e-4    3.68 3.00e- 4 2.63e-4
## 5 small … <tibble> <lm>  <tibble> (Int… 1.83e+0 2.29e-1    7.98 7.80e-12 1.37e+0
## 6 small … <tibble> <lm>  <tibble> to_a… 1.97e-4 6.96e-5    2.84 5.74e- 3 5.90e-5
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## #   ¹​estimate, ²​std.error, ³​statistic, ⁴​conf.low
to_abun_labs <- df_to_abun %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

plot_to_abun <- ggplot(df_vars, aes(x = to_abun, y = n_up)) +
  geom_smooth(method = 'lm')+ 
  geom_point(shape = 21, alpha = 0.25) +
  scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') + 
  labs(x = "Oligo abundance (reads)", y = 'Reads per target (upstream junction)', caption = paste(to_abun_labs$labels, collapse = "\n"), title = 'Oligo abundance')

plot_to_abun

Homology arm duplex free energy

hom_energy_mod <- function(df) {
  lm(up_log~arm_total_duplex, data = df)
}

df_hom_energy <- df_vars %>% 
  filter(n_up >0) %>% 
  mutate(up_log = log(n_up,10)) %>% 
  #mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, hom_energy_mod)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_hom_energy %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups:   lib [4]
##   lib    data     model r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC
##   <fct>  <list>   <lis>   <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 short… <tibble> <lm>   0.126   0.113  0.710   9.64  2.79e-3     1  -73.3 153. 
## 2 long … <tibble> <lm>   0.124   0.120  0.608  29.0   1.98e-7     1 -190.  386. 
## 3 small… <tibble> <lm>   0.0434  0.0318 0.861   3.72  5.72e-2     1 -106.  217. 
## 4 oPool  <tibble> <lm>   0.0214 -0.0177 0.803   0.548 4.66e-1     1  -31.4  68.7
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>, summary <list>, and abbreviated variable names ¹​r.squared,
## #   ²​adj.r.squared, ³​statistic
df_hom_energy %>% unnest(summary)
## # A tibble: 8 × 11
## # Groups:   lib [4]
##   lib    data     model glance   term  estimate std.e…¹ statis…² p.value conf.…³
##   <fct>  <list>   <lis> <list>   <chr>    <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
## 1 short… <tibble> <lm>  <tibble> (Int… -1.26    1.29    -0.977   3.32e-1 -3.83  
## 2 short… <tibble> <lm>  <tibble> arm_… -0.0248  0.00798 -3.10    2.79e-3 -0.0407
## 3 long … <tibble> <lm>  <tibble> (Int… -0.801   0.571   -1.40    1.62e-1 -1.93  
## 4 long … <tibble> <lm>  <tibble> arm_… -0.0186  0.00346 -5.38    1.98e-7 -0.0255
## 5 small… <tibble> <lm>  <tibble> (Int… -0.00443 1.26    -0.00351 9.97e-1 -2.52  
## 6 small… <tibble> <lm>  <tibble> arm_… -0.0146  0.00758 -1.93    5.72e-2 -0.0297
## 7 oPool  <tibble> <lm>  <tibble> (Int…  0.865   2.52     0.343   7.34e-1 -4.33  
## 8 oPool  <tibble> <lm>  <tibble> arm_… -0.0113  0.0153  -0.740   4.66e-1 -0.0428
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## #   ¹​std.error, ²​statistic, ³​conf.low
hom_energy_labs <- df_hom_energy %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

plot_hom_energy <- ggplot(df_vars, aes(x = arm_total_duplex, y = n_up)) +
  geom_smooth(method = 'lm')+ 
  geom_point(shape = 21, alpha = 0.25) +
  scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') + 
  labs(x = "Homology arm free energy (kcal/mol)", y = 'Reads per target (upstream junction)', caption = paste(hom_energy_labs$labels, collapse = "\n"), title = 'Oligo homology arm duplex free energy')

plot_hom_energy

Monomer oligo folding free energy

mon_energy_mod <- function(df) {
  lm(up_log~oligo_monomer, data = df)
}

df_mon_energy <- df_vars %>% 
  filter(n_up >0) %>% 
  mutate(up_log = log(n_up,10)) %>% 
  #mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, mon_energy_mod)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_mon_energy %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups:   lib [4]
##   lib   data     model r.squ…¹ adj.r.…² sigma stati…³ p.value    df logLik   AIC
##   <fct> <list>   <lis>   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 shor… <tibble> <lm>  0.0759   0.0621  0.730   5.50  2.19e-2     1  -75.2 156. 
## 2 long… <tibble> <lm>  0.0604   0.0558  0.630  13.2   3.57e-4     1 -197.  400. 
## 3 smal… <tibble> <lm>  0.00225 -0.00992 0.879   0.185 6.69e-1     1 -107.  221. 
## 4 oPool <tibble> <lm>  0.0253  -0.0136  0.802   0.650 4.28e-1     1  -31.3  68.6
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>, summary <list>, and abbreviated variable names ¹​r.squared,
## #   ²​adj.r.squared, ³​statistic
df_mon_energy %>% unnest(summary)
## # A tibble: 8 × 11
## # Groups:   lib [4]
##   lib     data     model glance   term  estimate std.e…¹ stati…² p.value conf.…³
##   <fct>   <list>   <lis> <list>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 short … <tibble> <lm>  <tibble> (Int…  1.39    0.578     2.41  1.87e-2  0.239 
## 2 short … <tibble> <lm>  <tibble> olig… -0.0416  0.0177   -2.35  2.19e-2 -0.0769
## 3 long T… <tibble> <lm>  <tibble> (Int…  1.20    0.296     4.07  6.76e-5  0.620 
## 4 long T… <tibble> <lm>  <tibble> olig… -0.0307  0.00845  -3.63  3.57e-4 -0.0473
## 5 small … <tibble> <lm>  <tibble> (Int…  2.20    0.521     4.23  6.05e-5  1.17  
## 6 small … <tibble> <lm>  <tibble> olig… -0.00606 0.0141   -0.430 6.69e-1 -0.0341
## 7 oPool   <tibble> <lm>  <tibble> (Int…  1.92    1.01      1.90  6.86e-2 -0.158 
## 8 oPool   <tibble> <lm>  <tibble> olig… -0.0233  0.0289   -0.806 4.28e-1 -0.0828
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## #   ¹​std.error, ²​statistic, ³​conf.low
mon_energy_labs <- df_mon_energy %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

plot_mon_energy <- ggplot(df_vars, aes(x = oligo_monomer, y = n_up)) +
  geom_smooth(method = 'lm')+ 
  geom_point(shape = 21, alpha = 0.25) +
  scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') + 
  labs(x = "Oligo monomer free energy (kcal/mol)", y = 'Reads per target (upstream junction)', caption = paste(mon_energy_labs$labels, collapse = "\n"), title = 'Oligo monomer free energy')

plot_mon_energy

Distance to origin of replication

ori_dist_mod <- function(df) {
  lm(up_log~ori_dist, data = df)
}

df_ori_dist <- df_vars %>% 
  filter(n_up >0) %>% 
  mutate(up_log = log(n_up,10)) %>% 
  #mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, ori_dist_mod)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_ori_dist %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups:   lib [4]
##   lib    data     model r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC
##   <fct>  <list>   <lis>   <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1 short… <tibble> <lm>  1.67e-5 -0.0149 0.760 0.00112 0.973       1  -77.9 162. 
## 2 long … <tibble> <lm>  3.63e-2  0.0316 0.638 7.72    0.00596     1 -200.  405. 
## 3 small… <tibble> <lm>  2.98e-2  0.0179 0.867 2.52    0.117       1 -106.  218. 
## 4 oPool  <tibble> <lm>  2.20e-3 -0.0377 0.811 0.0551  0.816       1  -31.6  69.2
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   nobs <int>, summary <list>, and abbreviated variable names ¹​r.squared,
## #   ²​adj.r.squared, ³​statistic
df_ori_dist %>% unnest(summary)
## # A tibble: 8 × 11
## # Groups:   lib [4]
##   lib   data     model glance   term  estimate std.e…¹ stati…²  p.value conf.low
##   <fct> <list>   <lis> <list>   <chr>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1 shor… <tibble> <lm>  <tibble> (Int…  2.74e+0 1.75e-1 15.7    4.32e-24  2.39e+0
## 2 shor… <tibble> <lm>  <tibble> ori_… -4.39e-9 1.31e-7 -0.0334 9.73e- 1 -2.67e-7
## 3 long… <tibble> <lm>  <tibble> (Int…  2.47e+0 8.66e-2 28.5    2.50e-73  2.30e+0
## 4 long… <tibble> <lm>  <tibble> ori_… -1.83e-7 6.59e-8 -2.78   5.96e- 3 -3.13e-7
## 5 smal… <tibble> <lm>  <tibble> (Int…  2.66e+0 1.78e-1 15.0    3.45e-25  2.31e+0
## 6 smal… <tibble> <lm>  <tibble> ori_… -2.11e-7 1.33e-7 -1.59   1.17e- 1 -4.76e-7
## 7 oPool <tibble> <lm>  <tibble> (Int…  2.67e+0 2.79e-1  9.60   7.25e-10  2.10e+0
## 8 oPool <tibble> <lm>  <tibble> ori_…  5.61e-8 2.39e-7  0.235  8.16e- 1 -4.37e-7
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## #   ¹​std.error, ²​statistic
ori_dist_labs <- df_ori_dist %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

 Mb_label <- function(x){
 # from s to ns
 lab <- paste(x / 1000000)
 }

plot_ori_dist <- ggplot(df_vars, aes(x = ori_dist, y = n_up)) +
  geom_smooth(se = F, color = 'red')+
  geom_smooth(method = 'lm')+ 
  geom_point(shape = 21, alpha = 0.25) +
  scale_x_continuous(labels = Mb_label)+
  scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') + 
  labs(x = "Distance to ori (Mb)", y = 'Reads per target (upstream junction)', caption = paste(ori_dist_labs$labels, collapse = "\n"), title = 'Distance to ori')

plot_ori_dist

Assemble single variable plots

theme_set(theme_figure())

single_vars <- plot_grid(plot_del_len, plot_to_abun, plot_hom_energy, plot_mon_energy, plot_ori_dist, ncol = 2, align = 'hv', axis = 'tblr')

single_vars

#save_plot("../../figures/r_pdf_figs/supp_figs/mut_lib_correlations.pdf",single_vars, base_height = 10, base_width = 8)

3 variable model

theme_set(theme_notebook())

mod_3_var <- function(df) {
  #lm(right_norm_log~arm_total_duplex+ ori_dist + to_abun_norm , data = df)
  lm(up_log~arm_total_duplex+ ori_dist + to_abun_norm , data = df)
}


df_mod_3_var <- df_vars %>% 
  filter(lib != 'oPool') %>% 
  filter(n_up>0) %>% 
  mutate(up_norm_log = log(up_norm, 10), up_log = log(n_up,10)) %>% 
  #mutate(lib = factor(lib,labels = c('short TFs','long TFs','small RNAs'))) %>% 
  group_by(lib) %>% 
  nest() %>% 
  mutate(model = map(data, mod_3_var)) %>% 
  mutate(resids = map2(data, model, add_residuals)) %>% 
  mutate(preds = map2(data, model, add_predictions)) %>% 
  mutate(glance = map(model, broom::glance))%>% 
  mutate(summary = map(model, broom::tidy, conf.int = T))

df_mod_3_var %>% unnest(glance)
## # A tibble: 3 × 18
## # Groups:   lib [3]
##   lib     data     model resids   preds    r.squ…¹ adj.r…² sigma stati…³ p.value
##   <fct>   <list>   <lis> <list>   <list>     <dbl>   <dbl> <dbl>   <dbl>   <dbl>
## 1 short … <tibble> <lm>  <tibble> <tibble>   0.173  0.135  0.701    4.53 6.03e-3
## 2 long T… <tibble> <lm>  <tibble> <tibble>   0.144  0.132  0.604   11.4  5.99e-7
## 3 small … <tibble> <lm>  <tibble> <tibble>   0.102  0.0683 0.844    3.03 3.42e-2
## # … with 8 more variables: df <dbl>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>, nobs <int>, summary <list>, and
## #   abbreviated variable names ¹​r.squared, ²​adj.r.squared, ³​statistic
df_mod_3_var %>% unnest(summary)
## # A tibble: 12 × 13
## # Groups:   lib [3]
##    lib        data     model  resids   preds    glance   term   estimate std.e…¹
##    <fct>      <list>   <list> <list>   <list>   <list>   <chr>     <dbl>   <dbl>
##  1 short TFs  <tibble> <lm>   <tibble> <tibble> <tibble> (Inte… -2.80e-1 1.38e+0
##  2 short TFs  <tibble> <lm>   <tibble> <tibble> <tibble> arm_t… -1.55e-2 9.22e-3
##  3 short TFs  <tibble> <lm>   <tibble> <tibble> <tibble> ori_d… -6.80e-9 1.21e-7
##  4 short TFs  <tibble> <lm>   <tibble> <tibble> <tibble> to_ab…  3.83e+1 1.99e+1
##  5 long TFs   <tibble> <lm>   <tibble> <tibble> <tibble> (Inte… -2.09e-1 6.57e-1
##  6 long TFs   <tibble> <lm>   <tibble> <tibble> <tibble> arm_t… -1.52e-2 4.35e-3
##  7 long TFs   <tibble> <lm>   <tibble> <tibble> <tibble> ori_d… -1.32e-7 6.34e-8
##  8 long TFs   <tibble> <lm>   <tibble> <tibble> <tibble> to_ab…  2.87e+1 3.33e+1
##  9 small RNAs <tibble> <lm>   <tibble> <tibble> <tibble> (Inte…  1.94e+0 1.51e+0
## 10 small RNAs <tibble> <lm>   <tibble> <tibble> <tibble> arm_t… -6.14e-4 9.87e-3
## 11 small RNAs <tibble> <lm>   <tibble> <tibble> <tibble> ori_d… -1.40e-7 1.33e-7
## 12 small RNAs <tibble> <lm>   <tibble> <tibble> <tibble> to_ab…  4.97e+1 2.60e+1
## # … with 4 more variables: statistic <dbl>, p.value <dbl>, conf.low <dbl>,
## #   conf.high <dbl>, and abbreviated variable name ¹​std.error
mod_3_var_labs <- df_mod_3_var %>% unnest(glance) %>% 
  mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>% 
  mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>% 
  mutate(labels = paste0(lib,": ",label_1," ",label_2))

plot_3_var_preds <- ggplot(data=df_mod_3_var  %>% unnest( preds), aes(x = pred , y = up_log )) + 
  geom_abline(slope = 1, intercept = 0,linetype = 2, color = 'light gray')+
  geom_point(shape = 21, alpha = 0.25) + 
  facet_wrap(~lib, ncol = 3, scales = 'free')+
  #geom_text(data = model_labs %>% filter(lib=='long TFs'), aes(x =1.25, y = 2.5, label = labels))+
  scale_x_continuous(limits = c(1,3),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
  scale_y_continuous(limits = c(0,4),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
  labs(y = "Experimental reads per target",x = 'Predicted reads per target', subtitle = 'Reads per target ~ oligo abundance + homology arm free energy + distance to ori', title = '3 var model')

plot_3_var_preds

plot_3_var_resids <- ggplot(data=df_mod_3_var  %>% unnest(preds, resids), aes(x = pred , y = resid )) + 
  geom_hline(yintercept = 0, linetype = 2, color = 'light gray')+
  geom_smooth()+
  geom_point(shape = 21, alpha = 0.25) + 
  facet_wrap(~lib, ncol = 3, scales = 'free')+
  scale_x_continuous(limits = c(1,3),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
  scale_y_continuous(breaks = c(-2,-1,0,1), labels = c('-100x','-10x','0','+10x')) +
  labs(y = "Residual error",x = 'Predicted reads per target', caption = paste(mod_3_var_labs$labels, collapse = "\n"))

plot_3_var_resids

plot_3_var_preds / plot_3_var_resids

Assemble full figures

theme_set(theme_figure())

top_grid <- plot_grid(plot_n_down, plot_n_up, plot_up_down_cor + guides(shape ='none'), ncol = 3, align = 'hv', axis = 'tblr')

#middle_grid <- plot_grid(plot_del_len, plot_to_abun, plot_hom_energy, plot_mon_energy, plot_ori_dist, ncol = 3, align = 'hv', axis = 'tblr')

bottom_grid <- plot_grid(plot_3_var_preds, plot_3_var_resids,ncol = 1, align = 'hv', axis = 'lr')

full_grid <- plot_grid(top_grid, bottom_grid, ncol = 1, align = 'hv', axis = 'lr', rel_heights = c(1,2))

full_grid

#save_plot("../../figures/r_pdf_figs/supp_figs/mut_lib_up_down_mod.pdf",full_grid, base_height = 8, base_width = 8)
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Biostrings_2.66.0   GenomeInfoDb_1.34.9 XVector_0.38.0     
##  [4] IRanges_2.32.0      S4Vectors_0.36.0    BiocGenerics_0.44.0
##  [7] XNAString_1.6.0     modelr_0.1.10       ggridges_0.5.4     
## [10] ggdist_3.2.1        patchwork_1.1.2     ggrastr_1.0.1      
## [13] kableExtra_1.3.4    cowplot_1.1.1       viridis_0.6.2      
## [16] viridisLite_0.4.1   knitr_1.41          forcats_0.5.2      
## [19] stringr_1.5.0       dplyr_1.1.0         purrr_0.3.5        
## [22] readr_2.1.3         tidyr_1.2.1         tibble_3.1.8       
## [25] ggplot2_3.4.0       tidyverse_1.3.2    
## 
## loaded via a namespace (and not attached):
##   [1] googledrive_2.0.0           ggbeeswarm_0.7.1           
##   [3] colorspace_2.0-3            rjson_0.2.21               
##   [5] ellipsis_0.3.2              GenomicRanges_1.50.1       
##   [7] fs_1.5.2                    rstudioapi_0.14            
##   [9] listenv_0.9.0               farver_2.1.1               
##  [11] bit64_4.0.5                 fansi_1.0.3                
##  [13] lubridate_1.9.0             xml2_1.3.3                 
##  [15] splines_4.2.0               codetools_0.2-18           
##  [17] cachem_1.0.6                jsonlite_1.8.3             
##  [19] Rsamtools_2.14.0            broom_1.0.1                
##  [21] dbplyr_2.2.1                compiler_4.2.0             
##  [23] httr_1.4.4                  backports_1.4.1            
##  [25] assertthat_0.2.1            Matrix_1.5-3               
##  [27] fastmap_1.1.0               gargle_1.2.1               
##  [29] cli_3.4.1                   htmltools_0.5.4            
##  [31] tools_4.2.0                 gtable_0.3.1               
##  [33] glue_1.6.2                  GenomeInfoDbData_1.2.9     
##  [35] Rcpp_1.0.9                  Biobase_2.58.0             
##  [37] cellranger_1.1.0            jquerylib_0.1.4            
##  [39] vctrs_0.5.2                 nlme_3.1-160               
##  [41] svglite_2.1.0               rtracklayer_1.58.0         
##  [43] xfun_0.35                   globals_0.16.2             
##  [45] rvest_1.0.3                 timechange_0.1.1           
##  [47] lifecycle_1.0.3             restfulr_0.0.15            
##  [49] XML_3.99-0.14               googlesheets4_1.0.1        
##  [51] future_1.31.0               zlibbioc_1.44.0            
##  [53] scales_1.2.1                vroom_1.6.0                
##  [55] BSgenome_1.66.3             hms_1.1.2                  
##  [57] MatrixGenerics_1.10.0       parallel_4.2.0             
##  [59] SummarizedExperiment_1.28.0 yaml_2.3.6                 
##  [61] gridExtra_2.3               sass_0.4.4                 
##  [63] stringi_1.7.8               highr_0.9                  
##  [65] BiocIO_1.8.0                BiocParallel_1.32.4        
##  [67] rlang_1.1.1                 pkgconfig_2.0.3            
##  [69] systemfonts_1.0.4           matrixStats_0.63.0         
##  [71] bitops_1.0-7                distributional_0.3.1       
##  [73] evaluate_0.18               lattice_0.20-45            
##  [75] labeling_0.4.2              GenomicAlignments_1.34.0   
##  [77] htmlwidgets_1.6.1           bit_4.0.5                  
##  [79] tidyselect_1.2.0            parallelly_1.34.0          
##  [81] magrittr_2.0.3              R6_2.5.1                   
##  [83] generics_0.1.3              DelayedArray_0.24.0        
##  [85] DBI_1.1.3                   mgcv_1.8-41                
##  [87] pillar_1.8.1                haven_2.5.1                
##  [89] withr_2.5.0                 RCurl_1.98-1.9             
##  [91] future.apply_1.11.0         crayon_1.5.2               
##  [93] utf8_1.2.2                  formattable_0.2.1          
##  [95] tzdb_0.3.0                  rmarkdown_2.18             
##  [97] grid_4.2.0                  readxl_1.4.1               
##  [99] data.table_1.14.6           reprex_2.0.2               
## [101] digest_0.6.30               webshot_0.5.4              
## [103] munsell_0.5.0               beeswarm_0.4.0             
## [105] vipor_0.4.5                 bslib_0.4.1