This notebook includes code to generate figures S12 and S13. The purpose of this notebook is to examine what factors might explain the differences in mutant abundance (read counts) within the ORBIT libraries.
The read counts for each gene target comes from the combined analysis
fo junctions 1 and 2 in this code notebook:
code/seq_data_processing/upstream_downstream_combined_analysis.Rmd
Setup packages and plotting for the notebook:
# Check packages
source("../tools/package_setup.R")
# Load packages
library(tidyverse)
library(cowplot)
library(patchwork)
library(kableExtra)
library(ggrastr)
library(ggdist)
library(ggridges)
library(modelr)
library(XNAString)
library(Biostrings)
# Code display options
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=FALSE, echo = TRUE, message=FALSE, warning=FALSE, fig.align="center", fig.retina = 2)
# Load plotting tools
source("../tools/plotting_tools.R")
#Modify the plot theme
theme_set(theme_notebook())
First, we will start by reading in the junction 1 and junction 2 read count data from the combined analysis.
df_TOs_found_up_down <- read_csv("../seq_data_processing/df_TOs_found_up_down.csv") %>% mutate(n_down = ifelse(is.na(n_down), 0, n_down), n_up = ifelse(is.na(n_up), 0, n_up))
df_TOs_found_up_down %>% select(lib, gene, n_down, n_up, found_score)
## # A tibble: 420 × 5
## lib gene n_down n_up found_score
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 alpA 1622 1906 2
## 2 1 argR 170 283 2
## 3 1 ariR 115 34 2
## 4 1 arsR 131 422 2
## 5 1 asnC 550 429 2
## 6 1 bolA 969 1745 2
## 7 1 caiF 789 3308 2
## 8 1 cedA 195 407 2
## 9 1 cspA 80 187 2
## 10 1 cspC 339 1137 2
## # … with 410 more rows
Let’s start by looking at the distribution of read counts for targets within each library.
plot_n_up <- ggplot(df_TOs_found_up_down, aes(x = factor(lib), y = n_up, shape = factor(lib))) +
geom_boxplot() +
geom_jitter( width = 0.2, height =0, alpha = 0.1)+
scale_y_log10(limits = c(NA, 16500)) +
scale_x_discrete(labels = c('oPool', 'short TFs','long TFs','small RNAs'))+
scale_shape_manual(values = c(4, 21, 22, 24))+
labs(x = "Library", y = 'Perfect reads per target', title = 'Upstream junction')+
guides(shape = 'none')
#plot_n_up
plot_n_down <- ggplot(df_TOs_found_up_down, aes(x = factor(lib), y = n_down, shape = factor(lib))) +
geom_boxplot() +
geom_jitter(width = 0.2, height =0, alpha = 0.1) +
scale_y_log10(limits = c(NA, 16500))+
scale_x_discrete(labels = c('oPool', 'short TFs','long TFs','small RNAs')) +
scale_shape_manual(values = c(4, 21, 22, 24))+
labs(x = "Library", y = 'Perfect reads per target', title = 'Downstream junction') +
guides(shape = 'none')
#plot_n_down
plot_n_down | plot_n_up
We can see that the putative abundance varies greatly from ~1 read to 10,000 in several cases. What explains these differences in read abundance (and possibly ORBIT efficiency).
Let’s look at the most obvious correlation that we expect to find. Junction 1 and Junction 2 read abundances should correlate, assuming that they are each present in the same mutant cells. To compare fairly accross libraries we will normalize the read counts to the total counts from the library.
df_counts_norm <- df_TOs_found_up_down %>%
group_by(lib) %>%
mutate(down_total = sum(n_down, na.rm = T), up_total = sum(n_up,na.rm = T)) %>%
mutate(down_norm = n_down / down_total, up_norm = n_up / up_total)
df_counts_norm %>% select(lib, gene, n_down, n_up, found_score, down_norm, up_norm)
## # A tibble: 420 × 7
## # Groups: lib [4]
## lib gene n_down n_up found_score down_norm up_norm
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 alpA 1622 1906 2 0.0226 0.0160
## 2 1 argR 170 283 2 0.00237 0.00238
## 3 1 ariR 115 34 2 0.00160 0.000286
## 4 1 arsR 131 422 2 0.00183 0.00355
## 5 1 asnC 550 429 2 0.00767 0.00361
## 6 1 bolA 969 1745 2 0.0135 0.0147
## 7 1 caiF 789 3308 2 0.0110 0.0278
## 8 1 cedA 195 407 2 0.00272 0.00342
## 9 1 cspA 80 187 2 0.00112 0.00157
## 10 1 cspC 339 1137 2 0.00473 0.00956
## # … with 410 more rows
Now we can fit a linear model to the log transformed normalized data and assess graphically and statistically.
#Fit linear model on log transformed data, excluding zero counts.
summary(lm(up_norm ~ down_norm, data = df_counts_norm %>% filter(down_norm>0 & up_norm >0) %>% mutate(left_norm = log(down_norm, 10), right_norm = log(up_norm, 10))))
##
## Call:
## lm(formula = up_norm ~ down_norm, data = df_counts_norm %>% filter(down_norm >
## 0 & up_norm > 0) %>% mutate(left_norm = log(down_norm, 10),
## right_norm = log(up_norm, 10)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.061247 -0.002441 -0.001217 0.001288 0.081796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0015348 0.0006239 2.46 0.0143 *
## down_norm 0.8527150 0.0276068 30.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01085 on 382 degrees of freedom
## Multiple R-squared: 0.7141, Adjusted R-squared: 0.7133
## F-statistic: 954.1 on 1 and 382 DF, p-value: < 2.2e-16
plot_up_down_cor <- ggplot(df_counts_norm, aes(x = down_norm, y = up_norm, shape = factor(lib), group = 'all')) +
geom_smooth(method = 'lm', se = F, alpha = 0.75)+
#geom_abline(slope = 0.8478, intercept = 0.564, color = 'red')+
geom_point(color = 'black', alpha = 0.25) +
scale_x_log10() + scale_y_log10() +
scale_shape_manual(values = c(4, 21, 22, 24), labels = c('oPool', 'short TFs','long TFs','small RNAs'))+
labs(x = 'Downstream junction\nNormalized perfect reads per target', y = 'Upstream junction\nNormalized perfect reads per target', shape = "Library", caption = 'Adj R2 = 0.71, p = 2.2e-16')
plot_up_down_cor
Indeed Junction 1 and 2 read counts correlate quite well. Here is the combined figure that shows up in Fig. S12.
With that first quality control question out of the way, now we will examine other oligo parameters and how well they correlate with putative mutant abundance. First we need to calculate and prepare all the variables we need.
First we will add the TO abundance data (twist libraries only) to our
df_counts_norm
dataset. Let’s read in the TO abundance data
from code/seq_data_processing/TO_abundance
df_abundance <- read_csv("../seq_data_processing/TO_abundance/df_TO_abundances.csv")
df_abundance %>% select(lib, gene, TO_reads)
## # A tibble: 393 × 3
## lib gene TO_reads
## <dbl> <chr> <dbl>
## 1 1 alpA 4403
## 2 1 argR 7478
## 3 1 ariR 3283
## 4 1 arsR 2577
## 5 1 asnC 2621
## 6 1 bolA 4313
## 7 1 caiF 3715
## 8 1 cedA 2421
## 9 1 cspA 2916
## 10 1 cspC 5415
## # … with 383 more rows
Let’s normalize the TO counts to each library total and add this to the df_counts_norm dataframe.
df_abundance_norm <- df_abundance %>%
select(gene, lib, to_abun = TO_reads) %>% #**focus on perfect matches, rename to to_abun
group_by(lib) %>%
mutate(lib_total = sum(to_abun, na.rm = T)) %>%
mutate(to_abun_norm = to_abun / lib_total)
df_counts_norm_TO <- left_join(df_counts_norm,df_abundance_norm,
by = c('gene','lib'))
df_counts_norm_TO %>% select(gene, lib, to_abun, to_abun_norm)
## # A tibble: 420 × 4
## # Groups: lib [4]
## gene lib to_abun to_abun_norm
## <chr> <dbl> <dbl> <dbl>
## 1 alpA 1 4403 0.0148
## 2 argR 1 7478 0.0252
## 3 ariR 1 3283 0.0110
## 4 arsR 1 2577 0.00867
## 5 asnC 1 2621 0.00882
## 6 bolA 1 4313 0.0145
## 7 caiF 1 3715 0.0125
## 8 cedA 1 2421 0.00814
## 9 cspA 1 2916 0.00981
## 10 cspC 1 5415 0.0182
## # … with 410 more rows
The major variable we need to calculate are the free energies for the targeting oligos. The following code cell iterates through each targeting oligo and uses the XNAstring package to calculate the oligo monomer free energy, oligo homodimer free energy and the free energy of each homology arm duplex (complexed with homology on genome).
#takes a min or two to run
df_oligo_energy <- tibble()
for(row in 1:nrow(df_TOs_found_up_down)){
oligo <- df_TOs_found_up_down$oligo[row]
#print(row)
oligo_len <- nchar(oligo)
arm_len <- (oligo_len-38)/2
arm_1 <- str_sub(oligo, start = 1, end = arm_len)
arm_2 <- str_sub(oligo, oligo_len - arm_len + 1, oligo_len)
oligo_xna <- XNAString(base = toupper(oligo))
oligo_homodimer_xna <- XNAString(base = c(toupper(oligo),toupper(oligo)))
arm_1_xna <- XNAString(base = c(arm_1, as.character(reverseComplement(DNAString(arm_1)))))
arm_2_xna <- XNAString(base = c(arm_2, as.character(reverseComplement(DNAString(arm_2)))))
oligo_monomer <- predictMfeStructure(oligo_xna)
oligo_homodimer <- predictDuplexStructure(oligo_homodimer_xna)
arm_1_duplex <- predictDuplexStructure(arm_1_xna)
arm_2_duplex <- predictDuplexStructure(arm_2_xna)
oligo_energy <- bind_cols(df_TOs_found_up_down[row,],
tibble('oligo_monomer' = oligo_monomer$mfe,
'oligo_homodimer'=oligo_homodimer$mfe,
'arm_1_duplex'=arm_1_duplex$mfe,
'arm_2_duplex'=arm_2_duplex$mfe))
df_oligo_energy <- bind_rows(df_oligo_energy, oligo_energy)
}
#write_csv(df_oligo_energy, "df_oligo_energy.csv")
#df_oligo_energy
We can read in df_oligo energy here without running the above cell.
df_oligo_energy <- read_csv("df_oligo_energy.csv")
df_oligo_energy %>% select(lib, gene, oligo_monomer, oligo_homodimer, arm_1_duplex, arm_2_duplex)
## # A tibble: 420 × 6
## lib gene oligo_monomer oligo_homodimer arm_1_duplex arm_2_duplex
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 alpA -30 -67 -84.3 -75.7
## 2 1 argR -38.3 -82.7 -82.5 -93.2
## 3 1 ariR -27 -64.1 -66.9 -70.2
## 4 1 arsR -28 -63 -87.7 -65.2
## 5 1 asnC -30.1 -68.1 -70.6 -82.3
## 6 1 bolA -39.2 -86.8 -91.7 -87.2
## 7 1 caiF -25.1 -59.3 -80.6 -77.9
## 8 1 cedA -27 -60.7 -89.1 -70
## 9 1 cspA -33.3 -78.3 -77.7 -82.6
## 10 1 cspC -36.8 -81.1 -82.1 -81.3
## # … with 410 more rows
Once we have calculated these values, we need to do some slight rearranging, by adding the homology arm duplex free energies together to get a combined homology free energy. We will also calculate the distance from the target to the origin of replication.
#genomic positions for origin, terminus and genome length.
ori <- 3923882.5
ter <- 1590250
gen_len <- 4641652
df_vars <- left_join(df_counts_norm_TO, df_oligo_energy %>% select(lib, gene, oligo_monomer, oligo_homodimer, arm_1_duplex, arm_2_duplex), by =c('gene', 'lib')) %>%
mutate(arm_total_duplex = arm_1_duplex + arm_2_duplex) %>% # get total homology arm free energy
mutate(ori_dist = ifelse(left_oligo_pos > ter, abs(ori-left_oligo_pos), gen_len-ori+left_oligo_pos)) %>% #Calculate distance to origin from vals above
mutate(del_len = right_oligo_pos - left_oligo_pos) %>% #calculate deletion length
mutate(lib = factor(lib,levels = c("0","1","2","3"),labels = c('oPool','short TFs','long TFs','small RNAs'))) #organize lib names and levels
#write_csv(df_vars, "df_vars.csv")
df_vars %>% select(lib, gene, arm_total_duplex, ori_dist, to_abun, del_len)
## # A tibble: 420 × 6
## # Groups: lib [4]
## lib gene arm_total_duplex ori_dist to_abun del_len
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 short TFs alpA -160 1165234. 4403 187
## 2 short TFs argR -176. 539174. 7478 445
## 3 short TFs ariR -137. 1934144. 3283 241
## 4 short TFs arsR -153. 275350. 2577 328
## 5 short TFs asnC -153. 2682. 2621 433
## 6 short TFs bolA -179. 1172246. 4313 292
## 7 short TFs caiF -158. 752074. 3715 370
## 8 short TFs cedA -159. 2110442. 2421 217
## 9 short TFs cspA -160. 203828. 2916 187
## 10 short TFs cspC -163. 2016636. 5415 184
## # … with 410 more rows
Now that we have prepared all of the oligo variables, we can fit linear models to see how well each variable correlates with the observed mutant library reads.
del_len_mod <- function(df) {
lm(up_log~del_len_log , data = df)
}
df_del_len <- df_vars %>%
filter(n_up>0) %>%
mutate(up_log = log(n_up,10), del_len_log = log(del_len, 10)) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, del_len_mod)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_vars %>% group_by(lib) %>% summarise(min_del_len = min(del_len), max_del_len = max(del_len))
## # A tibble: 4 × 3
## lib min_del_len max_del_len
## <fct> <dbl> <dbl>
## 1 oPool 292 2680
## 2 short TFs 175 547
## 3 long TFs 550 3937
## 4 small RNAs 54 370
## # A tibble: 4 × 16
## # Groups: lib [4]
## lib data model r.squ…¹ adj.r.…² sigma stati…³ p.value df logLik AIC
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 shor… <tibble> <lm> 1.71e-2 0.00242 0.753 1.16 0.284 1 -77.3 161.
## 2 long… <tibble> <lm> 6.37e-3 0.00153 0.648 1.31 0.253 1 -203. 412.
## 3 smal… <tibble> <lm> 6.50e-4 -0.0115 0.880 0.0533 0.818 1 -107. 221.
## 4 oPool <tibble> <lm> 3.95e-2 0.00112 0.796 1.03 0.320 1 -31.1 68.2
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>, summary <list>, and abbreviated variable names ¹r.squared,
## # ²adj.r.squared, ³statistic
## # A tibble: 8 × 11
## # Groups: lib [4]
## lib data model glance term estim…¹ std.e…² stati…³ p.value conf.…⁴
## <fct> <list> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short T… <tibble> <lm> <tibble> (Int… 0.973 1.63 0.597 5.53e-1 -2.28
## 2 short T… <tibble> <lm> <tibble> del_… 0.696 0.645 1.08 2.84e-1 -0.591
## 3 long TFs <tibble> <lm> <tibble> (Int… 3.40 0.986 3.44 6.99e-4 1.45
## 4 long TFs <tibble> <lm> <tibble> del_… -0.385 0.336 -1.15 2.53e-1 -1.05
## 5 small R… <tibble> <lm> <tibble> (Int… 2.65 0.965 2.74 7.52e-3 0.726
## 6 small R… <tibble> <lm> <tibble> del_… -0.108 0.469 -0.231 8.18e-1 -1.04
## 7 oPool <tibble> <lm> <tibble> (Int… 4.77 2.02 2.36 2.62e-2 0.612
## 8 oPool <tibble> <lm> <tibble> del_… -0.711 0.701 -1.01 3.20e-1 -2.15
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## # ¹estimate, ²std.error, ³statistic, ⁴conf.low
del_len_labs <- df_del_len %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
plot_del_len <- ggplot(df_vars, aes(x = del_len, y = n_up)) +
geom_smooth(method = 'lm')+
geom_point(shape = 21, alpha = 0.25) +
#geom_text(data = del_len_labs, aes())
scale_x_log10() + scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') +
labs(x = "Deletion length (bp)", y = 'Reads per target (upstream junction)', caption = paste(del_len_labs$labels, collapse = "\n"), title = 'Deletion length')
plot_del_len
to_abun_mod <- function(df) {
lm(up_log~to_abun , data = df)
}
df_to_abun <- df_vars %>%
filter(n_up>0) %>%
filter(lib != "oPool") %>%
mutate(up_log = log(n_up,10)) %>%
# mutate(lib = factor(lib,labels = c('short TFs','long TFs','small RNAs'))) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, to_abun_mod)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_to_abun %>% unnest(glance)
## # A tibble: 3 × 16
## # Groups: lib [3]
## lib data model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short… <tibble> <lm> 0.137 0.124 0.706 10.6 1.76e-3 1 -72.8 152.
## 2 long … <tibble> <lm> 0.0619 0.0573 0.630 13.5 3.00e-4 1 -197. 400.
## 3 small… <tibble> <lm> 0.0894 0.0783 0.840 8.05 5.74e-3 1 -104. 213.
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>, summary <list>, and abbreviated variable names ¹r.squared,
## # ²adj.r.squared, ³statistic
## # A tibble: 6 × 11
## # Groups: lib [3]
## lib data model glance term estim…¹ std.e…² stati…³ p.value conf.…⁴
## <fct> <list> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short … <tibble> <lm> <tibble> (Int… 1.98e+0 2.46e-1 8.03 2.16e-11 1.49e+0
## 2 short … <tibble> <lm> <tibble> to_a… 1.87e-4 5.75e-5 3.26 1.76e- 3 7.26e-5
## 3 long T… <tibble> <lm> <tibble> (Int… 1.81e+0 1.30e-1 13.9 1.97e-31 1.56e+0
## 4 long T… <tibble> <lm> <tibble> to_a… 5.66e-4 1.54e-4 3.68 3.00e- 4 2.63e-4
## 5 small … <tibble> <lm> <tibble> (Int… 1.83e+0 2.29e-1 7.98 7.80e-12 1.37e+0
## 6 small … <tibble> <lm> <tibble> to_a… 1.97e-4 6.96e-5 2.84 5.74e- 3 5.90e-5
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## # ¹estimate, ²std.error, ³statistic, ⁴conf.low
to_abun_labs <- df_to_abun %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
plot_to_abun <- ggplot(df_vars, aes(x = to_abun, y = n_up)) +
geom_smooth(method = 'lm')+
geom_point(shape = 21, alpha = 0.25) +
scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') +
labs(x = "Oligo abundance (reads)", y = 'Reads per target (upstream junction)', caption = paste(to_abun_labs$labels, collapse = "\n"), title = 'Oligo abundance')
plot_to_abun
hom_energy_mod <- function(df) {
lm(up_log~arm_total_duplex, data = df)
}
df_hom_energy <- df_vars %>%
filter(n_up >0) %>%
mutate(up_log = log(n_up,10)) %>%
#mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, hom_energy_mod)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_hom_energy %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups: lib [4]
## lib data model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short… <tibble> <lm> 0.126 0.113 0.710 9.64 2.79e-3 1 -73.3 153.
## 2 long … <tibble> <lm> 0.124 0.120 0.608 29.0 1.98e-7 1 -190. 386.
## 3 small… <tibble> <lm> 0.0434 0.0318 0.861 3.72 5.72e-2 1 -106. 217.
## 4 oPool <tibble> <lm> 0.0214 -0.0177 0.803 0.548 4.66e-1 1 -31.4 68.7
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>, summary <list>, and abbreviated variable names ¹r.squared,
## # ²adj.r.squared, ³statistic
## # A tibble: 8 × 11
## # Groups: lib [4]
## lib data model glance term estimate std.e…¹ statis…² p.value conf.…³
## <fct> <list> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short… <tibble> <lm> <tibble> (Int… -1.26 1.29 -0.977 3.32e-1 -3.83
## 2 short… <tibble> <lm> <tibble> arm_… -0.0248 0.00798 -3.10 2.79e-3 -0.0407
## 3 long … <tibble> <lm> <tibble> (Int… -0.801 0.571 -1.40 1.62e-1 -1.93
## 4 long … <tibble> <lm> <tibble> arm_… -0.0186 0.00346 -5.38 1.98e-7 -0.0255
## 5 small… <tibble> <lm> <tibble> (Int… -0.00443 1.26 -0.00351 9.97e-1 -2.52
## 6 small… <tibble> <lm> <tibble> arm_… -0.0146 0.00758 -1.93 5.72e-2 -0.0297
## 7 oPool <tibble> <lm> <tibble> (Int… 0.865 2.52 0.343 7.34e-1 -4.33
## 8 oPool <tibble> <lm> <tibble> arm_… -0.0113 0.0153 -0.740 4.66e-1 -0.0428
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## # ¹std.error, ²statistic, ³conf.low
hom_energy_labs <- df_hom_energy %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
plot_hom_energy <- ggplot(df_vars, aes(x = arm_total_duplex, y = n_up)) +
geom_smooth(method = 'lm')+
geom_point(shape = 21, alpha = 0.25) +
scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') +
labs(x = "Homology arm free energy (kcal/mol)", y = 'Reads per target (upstream junction)', caption = paste(hom_energy_labs$labels, collapse = "\n"), title = 'Oligo homology arm duplex free energy')
plot_hom_energy
mon_energy_mod <- function(df) {
lm(up_log~oligo_monomer, data = df)
}
df_mon_energy <- df_vars %>%
filter(n_up >0) %>%
mutate(up_log = log(n_up,10)) %>%
#mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, mon_energy_mod)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_mon_energy %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups: lib [4]
## lib data model r.squ…¹ adj.r.…² sigma stati…³ p.value df logLik AIC
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 shor… <tibble> <lm> 0.0759 0.0621 0.730 5.50 2.19e-2 1 -75.2 156.
## 2 long… <tibble> <lm> 0.0604 0.0558 0.630 13.2 3.57e-4 1 -197. 400.
## 3 smal… <tibble> <lm> 0.00225 -0.00992 0.879 0.185 6.69e-1 1 -107. 221.
## 4 oPool <tibble> <lm> 0.0253 -0.0136 0.802 0.650 4.28e-1 1 -31.3 68.6
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>, summary <list>, and abbreviated variable names ¹r.squared,
## # ²adj.r.squared, ³statistic
## # A tibble: 8 × 11
## # Groups: lib [4]
## lib data model glance term estimate std.e…¹ stati…² p.value conf.…³
## <fct> <list> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short … <tibble> <lm> <tibble> (Int… 1.39 0.578 2.41 1.87e-2 0.239
## 2 short … <tibble> <lm> <tibble> olig… -0.0416 0.0177 -2.35 2.19e-2 -0.0769
## 3 long T… <tibble> <lm> <tibble> (Int… 1.20 0.296 4.07 6.76e-5 0.620
## 4 long T… <tibble> <lm> <tibble> olig… -0.0307 0.00845 -3.63 3.57e-4 -0.0473
## 5 small … <tibble> <lm> <tibble> (Int… 2.20 0.521 4.23 6.05e-5 1.17
## 6 small … <tibble> <lm> <tibble> olig… -0.00606 0.0141 -0.430 6.69e-1 -0.0341
## 7 oPool <tibble> <lm> <tibble> (Int… 1.92 1.01 1.90 6.86e-2 -0.158
## 8 oPool <tibble> <lm> <tibble> olig… -0.0233 0.0289 -0.806 4.28e-1 -0.0828
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## # ¹std.error, ²statistic, ³conf.low
mon_energy_labs <- df_mon_energy %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
plot_mon_energy <- ggplot(df_vars, aes(x = oligo_monomer, y = n_up)) +
geom_smooth(method = 'lm')+
geom_point(shape = 21, alpha = 0.25) +
scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') +
labs(x = "Oligo monomer free energy (kcal/mol)", y = 'Reads per target (upstream junction)', caption = paste(mon_energy_labs$labels, collapse = "\n"), title = 'Oligo monomer free energy')
plot_mon_energy
ori_dist_mod <- function(df) {
lm(up_log~ori_dist, data = df)
}
df_ori_dist <- df_vars %>%
filter(n_up >0) %>%
mutate(up_log = log(n_up,10)) %>%
#mutate(lib = factor(lib,labels = c('oPool','short TFs','long TFs','small RNAs'))) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, ori_dist_mod)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_ori_dist %>% unnest(glance)
## # A tibble: 4 × 16
## # Groups: lib [4]
## lib data model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short… <tibble> <lm> 1.67e-5 -0.0149 0.760 0.00112 0.973 1 -77.9 162.
## 2 long … <tibble> <lm> 3.63e-2 0.0316 0.638 7.72 0.00596 1 -200. 405.
## 3 small… <tibble> <lm> 2.98e-2 0.0179 0.867 2.52 0.117 1 -106. 218.
## 4 oPool <tibble> <lm> 2.20e-3 -0.0377 0.811 0.0551 0.816 1 -31.6 69.2
## # … with 5 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
## # nobs <int>, summary <list>, and abbreviated variable names ¹r.squared,
## # ²adj.r.squared, ³statistic
## # A tibble: 8 × 11
## # Groups: lib [4]
## lib data model glance term estimate std.e…¹ stati…² p.value conf.low
## <fct> <list> <lis> <list> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 shor… <tibble> <lm> <tibble> (Int… 2.74e+0 1.75e-1 15.7 4.32e-24 2.39e+0
## 2 shor… <tibble> <lm> <tibble> ori_… -4.39e-9 1.31e-7 -0.0334 9.73e- 1 -2.67e-7
## 3 long… <tibble> <lm> <tibble> (Int… 2.47e+0 8.66e-2 28.5 2.50e-73 2.30e+0
## 4 long… <tibble> <lm> <tibble> ori_… -1.83e-7 6.59e-8 -2.78 5.96e- 3 -3.13e-7
## 5 smal… <tibble> <lm> <tibble> (Int… 2.66e+0 1.78e-1 15.0 3.45e-25 2.31e+0
## 6 smal… <tibble> <lm> <tibble> ori_… -2.11e-7 1.33e-7 -1.59 1.17e- 1 -4.76e-7
## 7 oPool <tibble> <lm> <tibble> (Int… 2.67e+0 2.79e-1 9.60 7.25e-10 2.10e+0
## 8 oPool <tibble> <lm> <tibble> ori_… 5.61e-8 2.39e-7 0.235 8.16e- 1 -4.37e-7
## # … with 1 more variable: conf.high <dbl>, and abbreviated variable names
## # ¹std.error, ²statistic
ori_dist_labs <- df_ori_dist %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
Mb_label <- function(x){
# from s to ns
lab <- paste(x / 1000000)
}
plot_ori_dist <- ggplot(df_vars, aes(x = ori_dist, y = n_up)) +
geom_smooth(se = F, color = 'red')+
geom_smooth(method = 'lm')+
geom_point(shape = 21, alpha = 0.25) +
scale_x_continuous(labels = Mb_label)+
scale_y_log10()+facet_wrap(~lib, ncol = 2, scales = 'free') +
labs(x = "Distance to ori (Mb)", y = 'Reads per target (upstream junction)', caption = paste(ori_dist_labs$labels, collapse = "\n"), title = 'Distance to ori')
plot_ori_dist
theme_set(theme_notebook())
mod_3_var <- function(df) {
#lm(right_norm_log~arm_total_duplex+ ori_dist + to_abun_norm , data = df)
lm(up_log~arm_total_duplex+ ori_dist + to_abun_norm , data = df)
}
df_mod_3_var <- df_vars %>%
filter(lib != 'oPool') %>%
filter(n_up>0) %>%
mutate(up_norm_log = log(up_norm, 10), up_log = log(n_up,10)) %>%
#mutate(lib = factor(lib,labels = c('short TFs','long TFs','small RNAs'))) %>%
group_by(lib) %>%
nest() %>%
mutate(model = map(data, mod_3_var)) %>%
mutate(resids = map2(data, model, add_residuals)) %>%
mutate(preds = map2(data, model, add_predictions)) %>%
mutate(glance = map(model, broom::glance))%>%
mutate(summary = map(model, broom::tidy, conf.int = T))
df_mod_3_var %>% unnest(glance)
## # A tibble: 3 × 18
## # Groups: lib [3]
## lib data model resids preds r.squ…¹ adj.r…² sigma stati…³ p.value
## <fct> <list> <lis> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 short … <tibble> <lm> <tibble> <tibble> 0.173 0.135 0.701 4.53 6.03e-3
## 2 long T… <tibble> <lm> <tibble> <tibble> 0.144 0.132 0.604 11.4 5.99e-7
## 3 small … <tibble> <lm> <tibble> <tibble> 0.102 0.0683 0.844 3.03 3.42e-2
## # … with 8 more variables: df <dbl>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>, nobs <int>, summary <list>, and
## # abbreviated variable names ¹r.squared, ²adj.r.squared, ³statistic
## # A tibble: 12 × 13
## # Groups: lib [3]
## lib data model resids preds glance term estimate std.e…¹
## <fct> <list> <list> <list> <list> <list> <chr> <dbl> <dbl>
## 1 short TFs <tibble> <lm> <tibble> <tibble> <tibble> (Inte… -2.80e-1 1.38e+0
## 2 short TFs <tibble> <lm> <tibble> <tibble> <tibble> arm_t… -1.55e-2 9.22e-3
## 3 short TFs <tibble> <lm> <tibble> <tibble> <tibble> ori_d… -6.80e-9 1.21e-7
## 4 short TFs <tibble> <lm> <tibble> <tibble> <tibble> to_ab… 3.83e+1 1.99e+1
## 5 long TFs <tibble> <lm> <tibble> <tibble> <tibble> (Inte… -2.09e-1 6.57e-1
## 6 long TFs <tibble> <lm> <tibble> <tibble> <tibble> arm_t… -1.52e-2 4.35e-3
## 7 long TFs <tibble> <lm> <tibble> <tibble> <tibble> ori_d… -1.32e-7 6.34e-8
## 8 long TFs <tibble> <lm> <tibble> <tibble> <tibble> to_ab… 2.87e+1 3.33e+1
## 9 small RNAs <tibble> <lm> <tibble> <tibble> <tibble> (Inte… 1.94e+0 1.51e+0
## 10 small RNAs <tibble> <lm> <tibble> <tibble> <tibble> arm_t… -6.14e-4 9.87e-3
## 11 small RNAs <tibble> <lm> <tibble> <tibble> <tibble> ori_d… -1.40e-7 1.33e-7
## 12 small RNAs <tibble> <lm> <tibble> <tibble> <tibble> to_ab… 4.97e+1 2.60e+1
## # … with 4 more variables: statistic <dbl>, p.value <dbl>, conf.low <dbl>,
## # conf.high <dbl>, and abbreviated variable name ¹std.error
mod_3_var_labs <- df_mod_3_var %>% unnest(glance) %>%
mutate(label_1 = paste0("adj R2 = ", round(adj.r.squared, digits = 3))) %>%
mutate(label_2 = paste0("p = ", signif(p.value, digits = 2))) %>%
mutate(labels = paste0(lib,": ",label_1," ",label_2))
plot_3_var_preds <- ggplot(data=df_mod_3_var %>% unnest( preds), aes(x = pred , y = up_log )) +
geom_abline(slope = 1, intercept = 0,linetype = 2, color = 'light gray')+
geom_point(shape = 21, alpha = 0.25) +
facet_wrap(~lib, ncol = 3, scales = 'free')+
#geom_text(data = model_labs %>% filter(lib=='long TFs'), aes(x =1.25, y = 2.5, label = labels))+
scale_x_continuous(limits = c(1,3),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
scale_y_continuous(limits = c(0,4),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
labs(y = "Experimental reads per target",x = 'Predicted reads per target', subtitle = 'Reads per target ~ oligo abundance + homology arm free energy + distance to ori', title = '3 var model')
plot_3_var_preds
plot_3_var_resids <- ggplot(data=df_mod_3_var %>% unnest(preds, resids), aes(x = pred , y = resid )) +
geom_hline(yintercept = 0, linetype = 2, color = 'light gray')+
geom_smooth()+
geom_point(shape = 21, alpha = 0.25) +
facet_wrap(~lib, ncol = 3, scales = 'free')+
scale_x_continuous(limits = c(1,3),breaks = c(1,2,3), labels = c( "10", "100","1000"))+
scale_y_continuous(breaks = c(-2,-1,0,1), labels = c('-100x','-10x','0','+10x')) +
labs(y = "Residual error",x = 'Predicted reads per target', caption = paste(mod_3_var_labs$labels, collapse = "\n"))
plot_3_var_resids
theme_set(theme_figure())
top_grid <- plot_grid(plot_n_down, plot_n_up, plot_up_down_cor + guides(shape ='none'), ncol = 3, align = 'hv', axis = 'tblr')
#middle_grid <- plot_grid(plot_del_len, plot_to_abun, plot_hom_energy, plot_mon_energy, plot_ori_dist, ncol = 3, align = 'hv', axis = 'tblr')
bottom_grid <- plot_grid(plot_3_var_preds, plot_3_var_resids,ncol = 1, align = 'hv', axis = 'lr')
full_grid <- plot_grid(top_grid, bottom_grid, ncol = 1, align = 'hv', axis = 'lr', rel_heights = c(1,2))
full_grid
#save_plot("../../figures/r_pdf_figs/supp_figs/mut_lib_up_down_mod.pdf",full_grid, base_height = 8, base_width = 8)
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Biostrings_2.66.0 GenomeInfoDb_1.34.9 XVector_0.38.0
## [4] IRanges_2.32.0 S4Vectors_0.36.0 BiocGenerics_0.44.0
## [7] XNAString_1.6.0 modelr_0.1.10 ggridges_0.5.4
## [10] ggdist_3.2.1 patchwork_1.1.2 ggrastr_1.0.1
## [13] kableExtra_1.3.4 cowplot_1.1.1 viridis_0.6.2
## [16] viridisLite_0.4.1 knitr_1.41 forcats_0.5.2
## [19] stringr_1.5.0 dplyr_1.1.0 purrr_0.3.5
## [22] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
## [25] ggplot2_3.4.0 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] googledrive_2.0.0 ggbeeswarm_0.7.1
## [3] colorspace_2.0-3 rjson_0.2.21
## [5] ellipsis_0.3.2 GenomicRanges_1.50.1
## [7] fs_1.5.2 rstudioapi_0.14
## [9] listenv_0.9.0 farver_2.1.1
## [11] bit64_4.0.5 fansi_1.0.3
## [13] lubridate_1.9.0 xml2_1.3.3
## [15] splines_4.2.0 codetools_0.2-18
## [17] cachem_1.0.6 jsonlite_1.8.3
## [19] Rsamtools_2.14.0 broom_1.0.1
## [21] dbplyr_2.2.1 compiler_4.2.0
## [23] httr_1.4.4 backports_1.4.1
## [25] assertthat_0.2.1 Matrix_1.5-3
## [27] fastmap_1.1.0 gargle_1.2.1
## [29] cli_3.4.1 htmltools_0.5.4
## [31] tools_4.2.0 gtable_0.3.1
## [33] glue_1.6.2 GenomeInfoDbData_1.2.9
## [35] Rcpp_1.0.9 Biobase_2.58.0
## [37] cellranger_1.1.0 jquerylib_0.1.4
## [39] vctrs_0.5.2 nlme_3.1-160
## [41] svglite_2.1.0 rtracklayer_1.58.0
## [43] xfun_0.35 globals_0.16.2
## [45] rvest_1.0.3 timechange_0.1.1
## [47] lifecycle_1.0.3 restfulr_0.0.15
## [49] XML_3.99-0.14 googlesheets4_1.0.1
## [51] future_1.31.0 zlibbioc_1.44.0
## [53] scales_1.2.1 vroom_1.6.0
## [55] BSgenome_1.66.3 hms_1.1.2
## [57] MatrixGenerics_1.10.0 parallel_4.2.0
## [59] SummarizedExperiment_1.28.0 yaml_2.3.6
## [61] gridExtra_2.3 sass_0.4.4
## [63] stringi_1.7.8 highr_0.9
## [65] BiocIO_1.8.0 BiocParallel_1.32.4
## [67] rlang_1.1.1 pkgconfig_2.0.3
## [69] systemfonts_1.0.4 matrixStats_0.63.0
## [71] bitops_1.0-7 distributional_0.3.1
## [73] evaluate_0.18 lattice_0.20-45
## [75] labeling_0.4.2 GenomicAlignments_1.34.0
## [77] htmlwidgets_1.6.1 bit_4.0.5
## [79] tidyselect_1.2.0 parallelly_1.34.0
## [81] magrittr_2.0.3 R6_2.5.1
## [83] generics_0.1.3 DelayedArray_0.24.0
## [85] DBI_1.1.3 mgcv_1.8-41
## [87] pillar_1.8.1 haven_2.5.1
## [89] withr_2.5.0 RCurl_1.98-1.9
## [91] future.apply_1.11.0 crayon_1.5.2
## [93] utf8_1.2.2 formattable_0.2.1
## [95] tzdb_0.3.0 rmarkdown_2.18
## [97] grid_4.2.0 readxl_1.4.1
## [99] data.table_1.14.6 reprex_2.0.2
## [101] digest_0.6.30 webshot_0.5.4
## [103] munsell_0.5.0 beeswarm_0.4.0
## [105] vipor_0.4.5 bslib_0.4.1