Just like the ∆TF library, let's make ORBIT oligos to delete all 93 small RNAs annotated in E. coli's genome.
import numpy as np
import random
import string
import pandas as pd
import holoviews as hv
from holoviews import opts,dim
import Bio.Seq as Seq
import Bio.SeqIO
#from plotnine import *
import inspect
import wgregseq
%load_ext autoreload
%autoreload 2
hv.extension('bokeh')
pd.options.display.max_colwidth = 200
# Read the genome sequence; if the FASTA held several records, only the last would be kept.
for record in Bio.SeqIO.parse('sequencev3.fasta', "fasta"):
    genome = str(record.seq)
print("Length genome: {}".format(len(genome)))
print("First 100 bases: {}".format(genome[:100]))
Length genome: 4641652
First 100 bases: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
df = pd.read_csv("all_RNAs_except_rRNAs_and_tRNAs.txt", sep = '\t')
df = df.drop(['Sequence - coordinates of DNA region','Sequence - length of region'], axis = 1)
df
| | Gene Name | Product Name | Left-End-Position | Right-End-Position |
---|---|---|---|---|
0 | 3'ETS-<i>leuZ</i> | small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> | 1991748 | 1991814 |
1 | agrA | small RNA AgrA | 3648063 | 3648144 |
2 | agrB | small regulatory RNA AgrB | 3648294 | 3648375 |
3 | arcZ | small regulatory RNA ArcZ | 3350577 | 3350697 |
4 | arrS | small regulatory RNA ArrS | 3657985 | 3658054 |
... | ... | ... | ... | ... |
88 | sroH | small RNA SroH | 4190327 | 4190487 |
89 | ssrA | tmRNA | 2755593 | 2755955 |
90 | ssrS | 6S RNA | 3055983 | 3056165 |
91 | symR | small regulatory RNA antitoxin SymR | 4579835 | 4579911 |
92 | tff | putative small RNA T44 | 189712 | 189847 |
93 rows × 4 columns
First, let's make sure that this table is complete.
print('Unique\n',df.nunique())
print('\nNulls\n',df.isnull().sum())
Unique
 Gene Name             93
Product Name          93
Left-End-Position     93
Right-End-Position    93
dtype: int64

Nulls
 Gene Name             0
Product Name          0
Left-End-Position     0
Right-End-Position    0
dtype: int64
Ok, so we don't have any missing values for the parameters in this table that we need: the left and right end positions. The table has no strand column, so the direction is assigned separately below.
Another thing we should examine before proceeding is the length distribution of these genes. Remember these oligos will need to delete almost the entire gene, and efficiencies diminish significantly as deletion length increases.
df['length'] = df['Right-End-Position'] - df['Left-End-Position']
df['length'].describe()
count     93.000000
mean     124.688172
std       69.009763
min       52.000000
25%       76.000000
50%      100.000000
75%      157.000000
max      368.000000
Name: length, dtype: float64
Ok, so the lengths range from 52 bp to 368 bp, and the middle 50% falls between 76 and 157 bp. These are all short deletions, so deletion length should matter much less for efficiency here than it would for longer genes. Let's take a look at how the lengths are distributed along the genome:
scatter = hv.Scatter(df, 'Left-End-Position', 'length').opts(width = 500)
scatter
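One detail of the `length` column is worth a quick check. The sketch below assumes that `Left-End-Position` and `Right-End-Position` are both inclusive coordinates, which this notebook does not state. Under that assumption, `right - left` is one less than the number of bases in the annotated region. The helper `span_lengths` is hypothetical and is written in plain Python, with three rows copied from the table above, so that it runs without the input files.

```python
# (gene, left end, right end), copied from the table above.
targets = [
    ("agrA", 3648063, 3648144),
    ("ssrA", 2755593, 2755955),
    ("tff", 189712, 189847),
]

def span_lengths(left, right):
    """Return (right - left, right - left + 1) for one annotated region."""
    difference = right - left   # what the 'length' column stores
    inclusive = difference + 1  # base count IF both ends are inclusive (assumption)
    return difference, inclusive

for gene, left, right in targets:
    difference, inclusive = span_lengths(left, right)
    print(f"{gene}: difference = {difference}, inclusive length = {inclusive}")
```

A one-base offset does not change the picture for the length distribution, so the column is left as computed.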
df['Direction'] = '+'
df['left_oligo_pos'] = df['Left-End-Position']-1
df['right_oligo_pos'] = df['Right-End-Position']+1
df.head()
| | Gene Name | Product Name | Left-End-Position | Right-End-Position | length | Direction | left_oligo_pos | right_oligo_pos |
---|---|---|---|---|---|---|---|---|
0 | 3'ETS-<i>leuZ</i> | small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> | 1991748 | 1991814 | 66 | + | 1991747 | 1991815 |
1 | agrA | small RNA AgrA | 3648063 | 3648144 | 81 | + | 3648062 | 3648145 |
2 | agrB | small regulatory RNA AgrB | 3648294 | 3648375 | 81 | + | 3648293 | 3648376 |
3 | arcZ | small regulatory RNA ArcZ | 3350577 | 3350697 | 120 | + | 3350576 | 3350698 |
4 | arrS | small regulatory RNA ArrS | 3657985 | 3658054 | 69 | + | 3657984 | 3658055 |
df_oligos = wgregseq.get_target_oligo_df_2(df, 'left_oligo_pos', 'right_oligo_pos','Direction',genome, attB_lock = True)
df_oligos
/Users/scottsaunders/Reg-Seq2/software_module/wgregseq/orbit.py:406: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
| | Gene Name | Product Name | Left-End-Position | Right-End-Position | length | Direction | left_oligo_pos | right_oligo_pos | oligo |
---|---|---|---|---|---|---|---|---|---|
0 | 3'ETS-<i>leuZ</i> | small regulatory RNA 3'ETS<sup><i>leuZ</i></sup> | 1991748 | 1991814 | 66 | + | 1991747 | 1991815 | TGGTGATTAAAAATTAAGGAGGGTGTAACGACAAGTTGCAGGCACggcttgtcgacgacggcggtctccgtcgtcaggatcatTGGTACCCGGAGCGGGACTTGAACCCGCACAGCGCGAACGCCGAG |
1 | agrA | small RNA AgrA | 3648063 | 3648144 | 81 | + | 3648062 | 3648145 | AGCACGTCCTTGCAATAGTTTCAGTATGGTATTAGCATTGATGCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACATCCGGATTCGGACAAGGCTTAATATGACGATGACCCAGTGAA |
2 | agrB | small regulatory RNA AgrB | 3648294 | 3648375 | 81 | + | 3648293 | 3648376 | CGCTAATTCTTGCAATGTTAGCCACTGGCTAATAGTATTGAGCTGggcttgtcgacgacggcggtctccgtcgtcaggatcatACGTCCTGATTCAGACCTCCTTTCAAATGAATAGCCAACTCAAAA |
3 | arcZ | small regulatory RNA ArcZ | 3350577 | 3350697 | 120 | + | 3350576 | 3350698 | ACTGATTCATGTAACAAATCATTTAAGTTTTGCTATCTTAACTGCggcttgtcgacgacggcggtctccgtcgtcaggatcatAGTGGCTTTTGCCACCCACGCTTTCAGCACTTCTACGTCGTGACG |
4 | arrS | small regulatory RNA ArrS | 3657985 | 3658054 | 69 | + | 3657984 | 3658055 | CTGAAGACATGAATGCGTTATTTACTCAGGTAATTTCAATGCGTTggcttgtcgacgacggcggtctccgtcgtcaggatcatATTTTAACTTTAGTAATATTCTTCAGAGATCACAAACTGGTTATT |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | sroH | small RNA SroH | 4190327 | 4190487 | 160 | + | 4190326 | 4190488 | AGAGATCTGATTGTAAGAGAGTAAATACTCAACTATGATAGAGACggcttgtcgacgacggcggtctccgtcgtcaggatcatGTTATTTTGAGGGCTGAGGAAGCTGCTTATTTCTCAATAAGTTGT |
89 | ssrA | tmRNA | 2755593 | 2755955 | 362 | + | 2755592 | 2755956 | CTGGTCATGGCGCTCATAAATCTGGTATACTTACCTTTACACATTggcttgtcgacgacggcggtctccgtcgtcaggatcatAAATTCTCCATCGGTGATTACCAGAGTCATCCGATGAAGTCCTAA |
90 | ssrS | 6S RNA | 3055983 | 3056165 | 182 | + | 3055982 | 3056166 | ATGACACTTTTCGGTTTACTGTGGTAGAGTAACCGTGAAGACAAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCCTTCTTATCTGGCACCAGCCATGACGCAACTACCAGAACTCCCA |
91 | symR | small regulatory RNA antitoxin SymR | 4579835 | 4579911 | 76 | + | 4579834 | 4579912 | TAGCTGGACTTTCCCCATATTTACTGATGATATATACAGGTATTTggcttgtcgacgacggcggtctccgtcgtcaggatcatGACACGCATTCTATTGCACAACCGTTCGAAGCAGAAGTCTCCCCG |
92 | tff | putative small RNA T44 | 189712 | 189847 | 135 | + | 189711 | 189848 | GCATGGAAACAGTTGCCATGATTAAAACCTCTATATAAAAGTTGGggcttgtcgacgacggcggtctccgtcgtcaggatcatGCGCGCTTTATACCACAAATACGTCGTGGACACCAATAATTGTTG |
93 rows × 9 columns
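Before exporting, it can help to confirm that every oligo has the layout seen in the table: an uppercase arm, a lowercase core, and a second uppercase arm. The sketch below is a minimal check that relies only on that upper/lower-case pattern. The helper name `split_orbit_oligo` is hypothetical and not part of `wgregseq`, and reading the lowercase core as the attB site is an assumption based on the `attB_lock = True` argument.

```python
import re

def split_orbit_oligo(oligo):
    """Split an oligo into (upstream arm, lowercase core, downstream arm).

    Relies only on the case pattern seen in the table above:
    uppercase arm, lowercase core, uppercase arm.
    """
    match = re.fullmatch(r"([ACGT]+)([acgt]+)([ACGT]+)", oligo)
    if match is None:
        raise ValueError("oligo does not look like UPPER + lower + UPPER")
    return match.groups()

# First oligo from the table above (3'ETS-leuZ).
oligo = (
    "TGGTGATTAAAAATTAAGGAGGGTGTAACGACAAGTTGCAGGCAC"
    "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"
    "TGGTACCCGGAGCGGGACTTGAACCCGCACAGCGCGAACGCCGAG"
)

upstream, core, downstream = split_orbit_oligo(oligo)
print(len(upstream), len(core), len(downstream), len(oligo))
```

Applied to the whole table, for example with `df_oligos['oligo'].apply(split_orbit_oligo)`, the same check would raise on any oligo that breaks the pattern.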
df_oligos.to_csv("twist_orbit_small_RNA.csv")