ORBIT small RNA deletions

Just like the ∆TF library, let's make ORBIT oligos to delete all 93 small RNAs annotated in E. coli's genome.

biocyc link

In [2]:
import numpy as np
import random
import string
import pandas as pd
import holoviews as hv
from holoviews import opts,dim
import Bio.Seq as Seq
import Bio.SeqIO
#from plotnine import *
import inspect

import wgregseq
%load_ext autoreload
%autoreload 2

hv.extension('bokeh')

pd.options.display.max_colwidth = 200
In [3]:
for record in Bio.SeqIO.parse('sequencev3.fasta', "fasta"):
    genome = str(record.seq)
    
print("Length genome: {}".format(len(genome)))
print("First 100 bases: {}".format(genome[:100]))
Length genome: 4641652
First 100 bases: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
In [4]:
df = pd.read_csv("all_RNAs_except_rRNAs_and_tRNAs.txt", sep = '\t')
df = df.drop(['Sequence - coordinates of DNA region','Sequence - length of region'], axis = 1)

df
Out[4]:
Gene Name Product Name Left-End-Position Right-End-Position
0 3'ETS-<i>leuZ</i> small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> 1991748 1991814
1 agrA small RNA AgrA 3648063 3648144
2 agrB small regulatory RNA AgrB 3648294 3648375
3 arcZ small regulatory RNA ArcZ 3350577 3350697
4 arrS small regulatory RNA ArrS 3657985 3658054
... ... ... ... ...
88 sroH small RNA SroH 4190327 4190487
89 ssrA tmRNA 2755593 2755955
90 ssrS 6S RNA 3055983 3056165
91 symR small regulatory RNA antitoxin SymR 4579835 4579911
92 tff putative small RNA T44 189712 189847

93 rows × 4 columns

First, let's make sure that this table is complete.

In [5]:
print('Unique\n',df.nunique())

print('\nNulls\n',df.isnull().sum())
Unique
 Gene Name             93
Product Name          93
Left-End-Position     93
Right-End-Position    93
dtype: int64

Nulls
 Gene Name             0
Product Name          0
Left-End-Position     0
Right-End-Position    0
dtype: int64

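Ok, so we don't seem to have any missing or duplicated values for the essential parameters in this table, the left and right end positions. Note that the table has no direction column, so direction is assigned separately further down.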
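Uniqueness and null counts do not catch coordinates that are inconsistent with each other or with the genome. A minimal sketch of that extra check follows, in plain Python. The function name `find_bad_coordinates` and the toy rows are hypothetical, and the check assumes 1-based, inclusive end positions.

```python
def find_bad_coordinates(rows, genome_length):
    """Return (name, reason) pairs for rows whose coordinates look wrong.

    rows: iterable of (name, left, right) tuples.
    Assumes 1-based, inclusive positions (an assumption about the table).
    """
    problems = []
    for name, left, right in rows:
        if left < 1 or right > genome_length:
            problems.append((name, "outside genome"))
        elif left > right:
            problems.append((name, "left > right"))
    return problems


# Toy rows for illustration; only the first is taken from the table above.
toy_rows = [
    ("agrA", 3648063, 3648144),
    ("bad_order", 500, 400),
    ("off_end", 4641600, 4641700),
]
problems = find_bad_coordinates(toy_rows, genome_length=4641652)
print(problems)
```

With the real table, the rows could be produced with `df[['Gene Name', 'Left-End-Position', 'Right-End-Position']].itertuples(index=False, name=None)` and the genome length with `len(genome)`.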

Gene length considerations

Another thing we should examine before proceeding is the length distribution of these genes. Remember these oligos will need to delete almost the entire gene, and efficiencies diminish significantly as deletion length increases.

In [6]:
df['length'] = df['Right-End-Position'] - df['Left-End-Position']

df['length'].describe()
Out[6]:
count     93.000000
mean     124.688172
std       69.009763
min       52.000000
25%       76.000000
50%      100.000000
75%      157.000000
max      368.000000
Name: length, dtype: float64

Ok, so the lengths range from 52 bp to 368 bp, and the middle 50% falls between 76 and 157 bp. These are all short deletions, so length should be much less of a concern for efficiency here. Let's take a look at how the lengths are distributed along the genome:
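The `length` column is a plain difference of the two end positions. If the positions are inclusive (an assumption here), the number of bases removed is one more than that difference. The sketch below makes the arithmetic explicit and flags deletions above a cutoff. The names `deletion_size` and `flag_long` and the 300 bp cutoff are hypothetical choices for illustration, not values from the ORBIT protocol.

```python
def deletion_size(left, right):
    """Bases spanned by [left, right] when both ends are inclusive."""
    return right - left + 1


def flag_long(rows, cutoff):
    """Return names of (name, left, right) rows spanning more than cutoff bases."""
    return [name for name, left, right in rows if deletion_size(left, right) > cutoff]


# Three rows from the table above.
rows = [
    ("agrA", 3648063, 3648144),
    ("ssrA", 2755593, 2755955),
    ("ssrS", 3055983, 3056165),
]
sizes = [deletion_size(left, right) for _, left, right in rows]
long_ones = flag_long(rows, cutoff=300)  # cutoff chosen arbitrarily for the example
print(sizes)
print(long_ones)
```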

In [7]:
scatter = hv.Scatter(df, 'Left-End-Position', 'length').opts(width = 500) 

scatter 
Out[7]:
In [8]:
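# The table has no strand column, so assign '+' to every gene.
df['Direction'] = '+'

# Positions immediately flanking each gene (one base outside either end).
df['left_oligo_pos'] = df['Left-End-Position']-1
df['right_oligo_pos'] = df['Right-End-Position']+1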
df.head()
Out[8]:
Gene Name Product Name Left-End-Position Right-End-Position length Direction left_oligo_pos right_oligo_pos
0 3'ETS-<i>leuZ</i> small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> 1991748 1991814 66 + 1991747 1991815
1 agrA small RNA AgrA 3648063 3648144 81 + 3648062 3648145
2 agrB small regulatory RNA AgrB 3648294 3648375 81 + 3648293 3648376
3 arcZ small regulatory RNA ArcZ 3350577 3350697 120 + 3350576 3350698
4 arrS small regulatory RNA ArrS 3657985 3658054 69 + 3657984 3658055
In [9]:
df_oligos = wgregseq.get_target_oligo_df_2(df, 'left_oligo_pos', 'right_oligo_pos','Direction',genome, attB_lock = True)
df_oligos
/Users/scottsaunders/Reg-Seq2/software_module/wgregseq/orbit.py:406: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
Out[9]:
Gene Name Product Name Left-End-Position Right-End-Position length Direction left_oligo_pos right_oligo_pos oligo
0 3'ETS-<i>leuZ</i> small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> 1991748 1991814 66 + 1991747 1991815 TGGTGATTAAAAATTAAGGAGGGTGTAACGACAAGTTGCAGGCACggcttgtcgacgacggcggtctccgtcgtcaggatcatTGGTACCCGGAGCGGGACTTGAACCCGCACAGCGCGAACGCCGAG
1 agrA small RNA AgrA 3648063 3648144 81 + 3648062 3648145 AGCACGTCCTTGCAATAGTTTCAGTATGGTATTAGCATTGATGCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACATCCGGATTCGGACAAGGCTTAATATGACGATGACCCAGTGAA
2 agrB small regulatory RNA AgrB 3648294 3648375 81 + 3648293 3648376 CGCTAATTCTTGCAATGTTAGCCACTGGCTAATAGTATTGAGCTGggcttgtcgacgacggcggtctccgtcgtcaggatcatACGTCCTGATTCAGACCTCCTTTCAAATGAATAGCCAACTCAAAA
3 arcZ small regulatory RNA ArcZ 3350577 3350697 120 + 3350576 3350698 ACTGATTCATGTAACAAATCATTTAAGTTTTGCTATCTTAACTGCggcttgtcgacgacggcggtctccgtcgtcaggatcatAGTGGCTTTTGCCACCCACGCTTTCAGCACTTCTACGTCGTGACG
4 arrS small regulatory RNA ArrS 3657985 3658054 69 + 3657984 3658055 CTGAAGACATGAATGCGTTATTTACTCAGGTAATTTCAATGCGTTggcttgtcgacgacggcggtctccgtcgtcaggatcatATTTTAACTTTAGTAATATTCTTCAGAGATCACAAACTGGTTATT
... ... ... ... ... ... ... ... ... ...
88 sroH small RNA SroH 4190327 4190487 160 + 4190326 4190488 AGAGATCTGATTGTAAGAGAGTAAATACTCAACTATGATAGAGACggcttgtcgacgacggcggtctccgtcgtcaggatcatGTTATTTTGAGGGCTGAGGAAGCTGCTTATTTCTCAATAAGTTGT
89 ssrA tmRNA 2755593 2755955 362 + 2755592 2755956 CTGGTCATGGCGCTCATAAATCTGGTATACTTACCTTTACACATTggcttgtcgacgacggcggtctccgtcgtcaggatcatAAATTCTCCATCGGTGATTACCAGAGTCATCCGATGAAGTCCTAA
90 ssrS 6S RNA 3055983 3056165 182 + 3055982 3056166 ATGACACTTTTCGGTTTACTGTGGTAGAGTAACCGTGAAGACAAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCCTTCTTATCTGGCACCAGCCATGACGCAACTACCAGAACTCCCA
91 symR small regulatory RNA antitoxin SymR 4579835 4579911 76 + 4579834 4579912 TAGCTGGACTTTCCCCATATTTACTGATGATATATACAGGTATTTggcttgtcgacgacggcggtctccgtcgtcaggatcatGACACGCATTCTATTGCACAACCGTTCGAAGCAGAAGTCTCCCCG
92 tff putative small RNA T44 189712 189847 135 + 189711 189848 GCATGGAAACAGTTGCCATGATTAAAACCTCTATATAAAAGTTGGggcttgtcgacgacggcggtctccgtcgtcaggatcatGCGCGCTTTATACCACAAATACGTCGTGGACACCAATAATTGTTG

93 rows × 9 columns
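Each oligo in the output above reads as 45 uppercase bases, then a 38-base lowercase sequence shared by every row (presumably the attB site), then 45 more uppercase bases. The sketch below reproduces only that layout on a toy genome. It is not the `wgregseq` implementation: the function name `sketch_deletion_oligo` is hypothetical, the attB string is copied from the output above, coordinates are assumed to be 1-based and inclusive, and strand choice, attB orientation, and wraparound on the circular genome are all ignored.

```python
# Lowercase insert copied from the oligo column above.
ATTB = "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"


def sketch_deletion_oligo(genome, left, right, arm_len=45, attb=ATTB):
    """Join the arm upstream of `left`, the attB insert, and the arm downstream of `right`.

    left, right: 1-based, inclusive ends of the region to delete (assumed convention).
    """
    if not (1 <= left <= right <= len(genome)):
        raise ValueError("coordinates must satisfy 1 <= left <= right <= len(genome)")
    up_start = left - 1 - arm_len  # 0-based start of the upstream arm
    down_end = right + arm_len     # 0-based exclusive end of the downstream arm
    if up_start < 0 or down_end > len(genome):
        raise ValueError("homology arm runs off the sequence; wraparound is not handled")
    upstream = genome[up_start:left - 1]
    downstream = genome[right:down_end]
    return upstream.upper() + attb.lower() + downstream.upper()


# Toy genome: 50 A's, a 20-base "gene" of C's, then 50 G's.
toy_genome = "A" * 50 + "C" * 20 + "G" * 50
toy_oligo = sketch_deletion_oligo(toy_genome, left=51, right=70, arm_len=10)
print(toy_oligo)
```

In the toy example the arms contain only A's and G's, so none of the deleted C's end up in the oligo.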

In [16]:
df_oligos.to_csv("twist_orbit_small_RNA.csv")
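A quick structural check on the exported oligos can catch truncated or malformed sequences. The sketch below splits an oligo into its uppercase arms and lowercase insert with a regular expression. The helper name `split_oligo` is hypothetical, and the example sequence is row 0 of the table above.

```python
import re

# Uppercase arm, lowercase insert, uppercase arm.
OLIGO_PATTERN = re.compile(r"([ACGT]+)([acgt]+)([ACGT]+)")


def split_oligo(oligo):
    """Return (upstream_arm, insert, downstream_arm) or raise ValueError."""
    match = OLIGO_PATTERN.fullmatch(oligo)
    if match is None:
        raise ValueError("oligo does not look like ARM + insert + ARM")
    return match.groups()


example = (
    "TGGTGATTAAAAATTAAGGAGGGTGTAACGACAAGTTGCAGGCAC"
    "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"
    "TGGTACCCGGAGCGGGACTTGAACCCGCACAGCGCGAACGCCGAG"
)
up, insert, down = split_oligo(example)
print(len(up), len(insert), len(down))
```

Applied to the whole table, something like `df_oligos['oligo'].map(split_oligo)` would raise on the first oligo that does not fit the expected layout.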