TWIST: ORBIT TF Deletion

(c) 2020 Scott H Saunders. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.


In [1]:
import numpy as np
import random
import string
import pandas as pd
import holoviews as hv
from holoviews import opts,dim
import Bio.Seq as Seq
import Bio.SeqIO
#from plotnine import *
import inspect

import wgregseq
%load_ext autoreload
%autoreload 2

hv.extension('bokeh')

pd.options.display.max_colwidth = 200

The goal of this notebook is to design and use a new function that accurately designs ORBIT targeting oligos. The basic approach will be to reproduce the method of the MODEST server, which is described in a bit more detail in the accompanying paper. MODEST does several different fancy things that could be fun to implement one day, but for now I'm just trying to reproduce the most basic behavior. In theory I could use the webserver itself and then hack things together in text files, but the server is very slow and unreliable, and it's not open source, so I can't modify it.

The basic approach MODEST takes is to take in genomic coordinates, find the appropriate homology arms, and then determine whether the forward or reverse sequence should be used to get an oligo that targets the lagging strand. We are already capable of grabbing bits of the genome from genomic coordinates, so really the only tricky part is determining the lagging strand. As a reminder, during DNA replication both the + and - strands are copied simultaneously in opposite directions. The new DNA strand that is synthesized continuously in the same direction as the replication fork is called the leading strand, and the new strand that is synthesized discontinuously in small pieces that are later joined (Okazaki fragments) is called the lagging strand. That's about as far as high school biology goes, but when thinking about specific genes on E. coli's circular genome it gets slightly more complicated.

DNA replication of the circular chromosome is initiated at the origin (oriC) with two replication forks proceeding in opposite directions. Both forks continue until they reach the 'terminus' region, which is surprisingly ambiguously defined. Basically there's a protein called Tus that binds to sites called Ter and stalls these replication forks (so they don't crash into each other?). The terminus is a relatively large region that mostly encompasses these sites on the opposite side of the chromosome from oriC. Therefore it's not clear at exactly what position DNA replication terminates, but one more specific site is called dif. Apparently there's a native recombinase that binds to this site and separates the two replicated chromosomes (this is cool - should learn more), so perhaps that is the truest termination site. See this paper for more detail. Ultimately, it's complicated and not that important for our purpose. The important part is that these two replication forks, which travel in opposite directions from the origin to the terminus, divide the chromosome into two 'replichores'.

[Figure: replichores 1 and 2 mapped onto the circular chromosome and the linear genomic coordinates]

For each targeting oligo we need to figure out which replichore we are in, because that determines whether the '+' or '-' strand is lagging. The first confusing thing about this is that genomic position is linear from 0 to 4.6M, but the chromosome is circular. The above diagram explains how to map between the two and figure out which replichore belongs to which linear genomic positions:

Replichore_1 = (pos > ori) | (pos < terminus)

Replichore_2 = (pos < ori) & (pos > terminus)

The diagram below then shows the replication forks for each replichore and how to get from replichore to lagging strand. Essentially, for replichore 2 we can simply take the '+' strand sequence directly, but for replichore 1 we need to reverse complement it ('-' strand).
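
The two rules above can be sketched as a pair of small helpers. This is only an illustration, not the wgregseq implementation: `replichore` and `oligo_strand` are hypothetical names, and the default ori and ter values are the ones adopted later in this notebook.

```python
def replichore(pos, ori=3923882.5, ter=1590250.5):
    """Return 1 or 2 for a linear genomic coordinate (hypothetical helper)."""
    # Replichore 1 wraps around coordinate 0: everything after ori or before ter.
    if pos > ori or pos < ter:
        return 1
    # Replichore 2 is the contiguous stretch between ter and ori.
    if ter < pos < ori:
        return 2
    raise ValueError("pos coincides with ori or ter; replichore is undefined")


def oligo_strand(pos, ori=3923882.5, ter=1590250.5):
    """Strand the oligo sequence is read from: '-' for replichore 1, '+' for replichore 2."""
    return '-' if replichore(pos, ori, ter) == 1 else '+'


print(oligo_strand(0))        # -  (replichore 1, reverse complement)
print(oligo_strand(2000000))  # +  (replichore 2, '+' strand as is)
```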

get_target_oligo()

With that sorted out let's go ahead and write our first function get_replichore(). First we need to establish the positions of ori and terminus. Ori is taken as the mean of the oriC locus:

In [2]:
ori = np.mean([3923767,3923998])
ori
Out[2]:
3923882.5

And the terminus is taken to be very close to the dif site, which is also near Tus. I took a nearby intergenic coordinate, 1,590,250.5, and made it fractional so that an integer position can never fall exactly on it and slip past both replichore tests. Let's take a look at the source of get_replichore():

In [3]:
lines = inspect.getsource(wgregseq.get_replichore)
print(lines)
def get_replichore(pos, ori = 3923882.5, ter = 1590250.5 ):
    
    """
    Determine the replichore of a bacterial chromosome for a certain position. Requires origin and terminus positions. Assumes E. coli like organization.
    
    pos : int
        Genomic coordinate of interest.
    ori : float
        Genomic coordinate of the origin of replication. 
    ter : float
        Genomic coordinate of the replication terminus.
    """
    
    pos = int(pos)
    
    if((pos<0)| (pos>4641652)):
        raise TypeError("position must be within genome.")
    
    if((pos > ori) | (pos<ter)):
       rep = 1
    elif((pos<ori) & (pos>ter)):
       rep = 2
    
    return rep

Let's test it out with position 0 (should be replichore 1) and position 2 M (should be replichore 2)

In [4]:
print('pos 0 = ', wgregseq.get_replichore(pos = 0))
print('pos 2M = ', wgregseq.get_replichore(pos = 2000000))
pos 0 =  1
pos 2M =  2

Looks good. Let's now write our function get_target_oligo(). This function has been through a few rounds of development in this notebook, so it has grown a bit complicated. Let's break down the steps required:

  1. Calculate homology arm length (from total homology / 2), which assumes symmetric arms and an even homology length.
  2. Get the '+' strand sequence for the left and right arms. Note that this must be indexed properly, such that the left position is the last nt that is kept both in the genome and on the oligo (before attB), and the right position is the first nt after attB/in the genome.
  3. Determine the replichore and reverse complement homology arms accordingly.
  4. Determine the direction of the attB site

This last part is somewhat complex, because there are 4 possible scenarios

  • Replichore = 1, Direction = '+'
  • Replichore = 1, Direction = '-'
  • Replichore = 2, Direction = '+'
  • Replichore = 2, Direction = '-'

Let's start by explaining the simplest example first, Replichore = 2, Direction = '+'. In this case the homology arm sequences come directly from the '+' strand. If we are deleting a gene, then the 5' end of the oligo will have the left and upstream homology arm. Then the 3' end of the oligo will have the right and downstream homology arm. In this case we simply want to paste the fwd sequence of attB between these two arms to get an insertion of pInt in the expected direction (e.g. for pDel gro promoter facing downstream).

For Replichore = 2, Direction = '-', the gene is facing the opposite direction. Therefore the oligo still comes from the '+' strand, but now the 5' end of the oligo is the left and downstream homology arm and the 3' end is the right and upstream homology arm. Therefore we need to reverse complement the attB sequence, so that it is now facing downstream.

For Replichore = 1, Direction = '+', the gene is on the '+' strand, but the oligo sequence comes from the '-' strand. Therefore the 5' end of the oligo is right and downstream and the 3' end of the oligo is left and upstream. So, to get attB facing downstream, we need to reverse complement.

Finally, for Replichore = 1, Direction = '-' we have the actual orientation / position of galK, so just think about that. The gene is on the '-' strand, but the oligo sequence also comes from the '-' strand. Here the 5' end of the oligo is right and upstream and the 3' end is left and downstream. Therefore the forward attB sequence can be used to face downstream.

Here's a table to simplify things, and just remember that on the oligo, typically we want attB facing the same way as the gene: downstream.

Replichore   gene dir   5' abs-rel pos   3' abs-rel pos   attB
1            +          right-down       left-up          rev
1            -          right-up         left-down        fwd
2            +          left-up          right-down       fwd
2            -          left-down        right-up         rev

for + gene_dir locus should look like: (left) | upstream | attB_fwd | downstream (right)

for - gene_dir locus should look like: (left) | downstream | attB_rev | upstream (right)
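
The four cases in the table collapse to a single rule: attB goes in forward exactly when the strand the oligo is read from matches the gene direction. Below is a minimal sketch of that rule, checked against the table; `OLIGO_LAYOUT` and `attB_orientation` are illustrative names, not part of wgregseq.

```python
# (replichore, gene direction) -> (5' arm, 3' arm, attB orientation), copied from the table.
OLIGO_LAYOUT = {
    (1, '+'): ('right-down', 'left-up', 'rev'),
    (1, '-'): ('right-up', 'left-down', 'fwd'),
    (2, '+'): ('left-up', 'right-down', 'fwd'),
    (2, '-'): ('left-down', 'right-up', 'rev'),
}


def attB_orientation(rep, gene_dir):
    """Return 'fwd' or 'rev' for the attB sequence pasted into the oligo."""
    # Replichore 1 oligos are read from the '-' strand, replichore 2 from '+'.
    oligo_strand = '-' if rep == 1 else '+'
    return 'fwd' if oligo_strand == gene_dir else 'rev'


# The one-line rule reproduces every row of the table.
for (rep, gene_dir), (_, _, attB) in OLIGO_LAYOUT.items():
    assert attB_orientation(rep, gene_dir) == attB
```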

With all of that reasoned out, we can write our function with some simple if statements:

In [5]:
lines = inspect.getsource(wgregseq.get_target_oligo)
print(lines)
def get_target_oligo(left_pos, right_pos, genome, homology = 90, attB_dir = '+', attB_fwd_seq = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat',  verbose = False):
    """
    Given a set of parameters, get an ORBIT oligo that targets the lagging strand. 
    Left and right positions are absolute genomic coordinates that specify the final nucleotides to keep unmodified in the genome, 
    everything in between will be replaced by attB. In other words the left position nucleotide is the final nt before attB in the oligo.
    The right position nt is the first nt after attB in the oligo.
    
    This function determines the lagging strand by calling `get_replichore()` on the left_pos.
    Typically attB_dir should be set to the same direction as the gene of interest, such that the integrating plasmid will insert with payload facing downstream.
    attB_fwd_seq can be modified, and the total homology can be modified, but should be an even number since homology arms are symmetric. 
    
    Verbose prints helpful statements for testing functionality.
    
    Parameters
    -----------------
    left_pos : int
        Left genomic coordinate of desired attB insertion. attB is added immediately after this nt.
    right_pos : int
        Right genomic coordinate of desired attB insertion. attB is added immediately before this nt.
    genome : str
        Genome as a string.
    homology : int (even)
        Total homology length desired for oligo. Arm length = homology / 2.
    attB_dir : chr ('+' or '-')
        Desired direction of attB  based on genomic strand. Typically same direction as gene.
    attB_fwd_seq : str
        Sequence of attB to insert between homology arms.
    verbose : bool
        If true, prints details about genomic positions and replichore.
    Returns
    ---------------
    oligo : str
        Targeting oligo against lagging strand, including the attB sequence in the correct orientation.
    """
    
    left_pos = int(left_pos)
    
    right_pos = int(right_pos)
    
    # Arm length is 1/2 total homology. Arms are symmetric
    arm_len = int(homology / 2)
    
    # Arms from genome string. Note 0 indexing of string vs. 1 indexing of genomic coordinates.
    # As written, should be inclusive.
    left_arm = genome[(left_pos - arm_len):left_pos]
    
    right_arm = genome[(right_pos - 1):(right_pos - 1 + arm_len)]

    # Generate attB reverse sequence
    seq_attB = Seq(attB_fwd_seq)
    attB_rev_seq = str(seq_attB.reverse_complement())
    
    # Replichore 1
    if get_replichore(left_pos) == 1:
        
        rep = 1
        
        # Reverse complement replichore 1 sequences.
        left_arm_seq = Seq(left_arm)
        left_arm_prime = str(left_arm_seq.reverse_complement())
        
        right_arm_seq = Seq(right_arm)
        right_arm_prime = str(right_arm_seq.reverse_complement())
        
        # Determine attB direction and paste fwd/rev seq accordingly
        if attB_dir == '+':
            
            oligo = right_arm_prime + attB_rev_seq + left_arm_prime
            
        elif attB_dir == '-':
            
            oligo = right_arm_prime + attB_fwd_seq + left_arm_prime
    
    # Replichore 2
    elif get_replichore(left_pos) == 2:
        
        rep = 2
        
        # '+' arm sequence used. Determine attB direction and paste accordingly.
        if attB_dir == '+':
            
            oligo = left_arm + attB_fwd_seq + right_arm
        
        elif attB_dir == '-':
            
            oligo = left_arm + attB_rev_seq + right_arm    
            
    # Verbose print statements
    if verbose:
        
        print('left_arm_coord = ', left_pos - arm_len,' : ', left_pos)
        print('right_arm_coord = ', right_pos - 1, ' : ', right_pos -1 + arm_len)
        print('Replichore = ', rep)
    
    return oligo

Let's read in our genome. Remember this is a string indexed from 0.

In [6]:
for record in Bio.SeqIO.parse('sequencev3.fasta', "fasta"):
    genome = str(record.seq)
    
print("Length genome: {}".format(len(genome)))
print("First 100 bases: {}".format(genome[:100]))
Length genome: 4641652
First 100 bases: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT

Let's see if we can simply get an oligo that targets the first 8 bases of the genome. Remember this is replichore 1, so we should get the reverse complement. We will also replace the attB sequence with a space to make it simpler.

In [7]:
wgregseq.get_target_oligo(4, 5, genome, 8, '+',' ', True)
left_arm_coord =  0  :  4
right_arm_coord =  4  :  8
Replichore =  1
Out[7]:
'GAAA AGCT'

Looks good! As a reminder the left_pos, right_pos should read like "keep the 4th nucleotide as the last in the homology arm and therefore the last unmodified nt in the genome before inserting attB". Same thing for right pos, where that is the first nt of the right homology arm and first nt in the locus after attB.
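
The indexing can be checked by hand on the first 8 bases of the genome. This sketch repeats the slicing from get_target_oligo() in plain Python, with a small `revcomp` helper standing in for Biopython (an assumption made only to keep the example self-contained).

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans('ACGT', 'TGCA'))[::-1]


genome = 'AGCTTTTC'  # first 8 bases of the genome printed above
left_pos, right_pos, arm_len = 4, 5, 4

# 1-indexed, inclusive genomic coordinates -> 0-indexed, end-exclusive slices
left_arm = genome[left_pos - arm_len:left_pos]             # nt 1..4
right_arm = genome[right_pos - 1:right_pos - 1 + arm_len]  # nt 5..8

# Position 4 is in replichore 1, so both arms are reverse complemented and swapped.
oligo = revcomp(right_arm) + ' ' + revcomp(left_arm)
print(oligo)  # GAAA AGCT
```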

Now let's dive in and compare this function to the MODEST-generated oligos. First, I generated MODEST oligos for 5 different genes that represent a mix of replichore 1 and 2. Some are very close to the terminus, which was helpful. Note all the genes are on the + strand so that we can easily look for the ATG start codon. For each oligo let's compare MODEST to get_target_oligo() by directly comparing strings and using nothing ("") as the attB sequence.

*A note on translating to MODEST format. get_target_oligo() asks: what is the last nucleotide you want to keep on either side? MODEST asks: what is the position of the first nt to delete, and how many do you want to delete? So MODEST gene X pos = 45, del = 10 is the same as keeping the 44th nt, deleting nt 45 to nt 54, and keeping the 55th nt as the first nt of the right arm. Also remember that the gene's genomic coordinate is its first nucleotide, so to get to the 44th nt we add 43 and to get to the 55th nt we add 54.
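
That translation can be written as a small helper. `modest_to_keep_coords` is a hypothetical name, and the sketch assumes a '+' strand gene whose genomic coordinate is its first nucleotide, as in the examples that follow.

```python
def modest_to_keep_coords(gene_start, pos, n_del):
    """Convert a MODEST-style deletion (1-indexed position within the gene,
    number of nt to delete) into (left_pos, right_pos) for get_target_oligo()."""
    first_deleted = gene_start + pos - 1  # genomic coordinate of the first deleted nt
    left_pos = first_deleted - 1          # last nt kept on the left
    right_pos = first_deleted + n_del     # first nt kept on the right
    return left_pos, right_pos


# cbrC: MODEST pos = 45, del = 10 -> gene start + 43 and gene start + 54
print(modest_to_keep_coords(3898022, 45, 10))  # (3898065, 3898076)
```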

In [8]:
cbrC_start = 3898022

new_cbrC = wgregseq.get_target_oligo(3898022 + 43,3898022 + 44 + 10, genome, 90, "+", "" , True)
mod_cbrC = 'TATGACTCAAAATATCAGGCCGTTACCCCAATTCAAATATCATCCGAAACAGGCGCATTTGAACAGGATAAAACCGTAGAGTGCGATTGC'

print( 'custom cbrC: ', new_cbrC)
print( 'modest cbrC: ', mod_cbrC)
print( 'Equivalent? ', new_cbrC==mod_cbrC)
left_arm_coord =  3898020  :  3898065
right_arm_coord =  3898075  :  3898120
Replichore =  2
custom cbrC:  TATGACTCAAAATATCAGGCCGTTACCCCAATTCAAATATCATCCGAAACAGGCGCATTTGAACAGGATAAAACCGTAGAGTGCGATTGC
modest cbrC:  TATGACTCAAAATATCAGGCCGTTACCCCAATTCAAATATCATCCGAAACAGGCGCATTTGAACAGGATAAAACCGTAGAGTGCGATTGC
Equivalent?  True
In [9]:
asnA_start = 3927155

new_asnA = wgregseq.get_target_oligo(3927155 + 43,3927155 + 44 + 10, genome, 90, "+", "" , True)
mod_asnA = 'CTGGACTTCGATCAGCCCCAGACGTTCTTCCAGTTGACGAGAAAAACGAAGCTAATTTGACGTTGTTTGGCAATGTAAGCGGTTTTCATT'

print( 'custom asnA: ', new_asnA)
print( 'modest asnA: ', mod_asnA)
print( 'Equivalent? ', new_asnA==mod_asnA)
left_arm_coord =  3927153  :  3927198
right_arm_coord =  3927208  :  3927253
Replichore =  1
custom asnA:  CTGGACTTCGATCAGCCCCAGACGTTCTTCCAGTTGACGAGAAAAACGAAGCTAATTTGACGTTGTTTGGCAATGTAAGCGGTTTTCATT
modest asnA:  CTGGACTTCGATCAGCCCCAGACGTTCTTCCAGTTGACGAGAAAAACGAAGCTAATTTGACGTTGTTTGGCAATGTAAGCGGTTTTCATT
Equivalent?  True
In [10]:
cysB_start = 1333855

new_cysB = wgregseq.get_target_oligo(1333855 + 43,1333855 + 44 + 10, genome, 90, "+", "", True)
mod_cysB = 'GATCCCGGGTTGTGATGTGTAAAGTCCTTCCGCTGTTGATGAGACTGATTGACCACCTCAACAATATAGCGAAGTTGTTGTAATTTCATG'

print( 'custom cysB: ', new_cysB)
print( 'modest cysB: ', mod_cysB)
print( 'Equivalent? ', new_cysB==mod_cysB)
left_arm_coord =  1333853  :  1333898
right_arm_coord =  1333908  :  1333953
Replichore =  1
custom cysB:  GATCCCGGGTTGTGATGTGTAAAGTCCTTCCGCTGTTGATGAGACTGATTGACCACCTCAACAATATAGCGAAGTTGTTGTAATTTCATG
modest cysB:  GATCCCGGGTTGTGATGTGTAAAGTCCTTCCGCTGTTGATGAGACTGATTGACCACCTCAACAATATAGCGAAGTTGTTGTAATTTCATG
Equivalent?  True
In [11]:
manA_start = 1688576

new = wgregseq.get_target_oligo(1688576 + 43,1688576 + 44 + 10, genome, 90, "+", "" , True)
mod = 'CATGCAAAAACTCATTAACTCAGTGCAAAACTATGCCTGGGGCAGTTGACTGAACTTTATGGTATGGAAAATCCGTCCAGCCAGCCGATG'

print( 'custom manA: ', new)
print( 'modest manA: ', mod)
print( 'Equivalent? ', new==mod)
left_arm_coord =  1688574  :  1688619
right_arm_coord =  1688629  :  1688674
Replichore =  2
custom manA:  CATGCAAAAACTCATTAACTCAGTGCAAAACTATGCCTGGGGCAGTTGACTGAACTTTATGGTATGGAAAATCCGTCCAGCCAGCCGATG
modest manA:  CATGCAAAAACTCATTAACTCAGTGCAAAACTATGCCTGGGGCAGTTGACTGAACTTTATGGTATGGAAAATCCGTCCAGCCAGCCGATG
Equivalent?  True
In [12]:
rstB_start = 1682882

new = wgregseq.get_target_oligo(1682882 + 43,1682882 + 44 + 10, genome, 90, "+", "", True )
mod = 'GATGAAAAAACTGTTTATCCAGTTTTACCTGTTATTGTTTGTCTGATGTCTCTGCTGGTTGGGCTGGTGTACAAATTTACCGCCGAACGC'

print( 'custom rstB: ', new)
print( 'modest rstB: ', mod)
print( 'Equivalent? ', new==mod)
left_arm_coord =  1682880  :  1682925
right_arm_coord =  1682935  :  1682980
Replichore =  2
custom rstB:  GATGAAAAAACTGTTTATCCAGTTTTACCTGTTATTGTTTGTCTGATGTCTCTGCTGGTTGGGCTGGTGTACAAATTTACCGCCGAACGC
modest rstB:  GATGAAAAAACTGTTTATCCAGTTTTACCTGTTATTGTTTGTCTGATGTCTCTGCTGGTTGGGCTGGTGTACAAATTTACCGCCGAACGC
Equivalent?  True

Now let's try two genes on the minus strand.

lacI start = 366428, Rep = 1, strand = '-'

ligA start = 2528161, Rep = 2, strand = '-'

In [13]:
lacI_mod = 'TGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGAC'

lacI_oli = wgregseq.get_target_oligo(367510-11, 367510, genome , 90, "-", "", True)

print('lacI: ', lacI_oli)
print('mod : ', lacI_mod)
print('Equivalent? ', lacI_oli == lacI_mod)
left_arm_coord =  367454  :  367499
right_arm_coord =  367509  :  367554
Replichore =  1
lacI:  TGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGAC
mod :  TGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGAC
Equivalent?  True
In [14]:
ligA_mod = 'TCATGATGGCGAAGCGTCGTTCGCAGTTCTGTCAGTTGTTGTTCGTATCGCACCATCAATGCTAAAAACCCCCGACAAGCGGGGGTTCGA'
ligA_oli = wgregseq.get_target_oligo(2530176-11, 2530176, genome , 90, "-", "", True)

print('ligA: ',ligA_oli )
print('mod : ', ligA_mod)
print('Equivalent? ', ligA_oli == ligA_mod)
left_arm_coord =  2530120  :  2530165
right_arm_coord =  2530175  :  2530220
Replichore =  2
ligA:  TCATGATGGCGAAGCGTCGTTCGCAGTTCTGTCAGTTGTTGTTCGTATCGCACCATCAATGCTAAAAACCCCCGACAAGCGGGGGTTCGA
mod :  TCATGATGGCGAAGCGTCGTTCGCAGTTCTGTCAGTTGTTGTTCGTATCGCACCATCAATGCTAAAAACCCCCGACAAGCGGGGGTTCGA
Equivalent?  True

Now let's do a first test of the attB reversal functionality. cbrC is on replichore 2, + strand, so attB should be fwd. asnA is on replichore 1, + strand, so attB should be rev. Let's just use a simple attB seq 'tttt'.

In [15]:
print('cbrC: ', wgregseq.get_target_oligo(cbrC_start + 2, cbrC_start + 3, genome , 90, "+", "tttt"))
print('asnA: ', wgregseq.get_target_oligo(asnA_start + 2, asnA_start + 3, genome , 90, "+", "tttt"))
cbrC:  TACTTTATCTTTGGGCTACTCAAAAGCAGACAGGATGTTTCTATGttttACTCAAAATATCAGGCCGTTACCCCAATTCAAATATCATCCCAAG
asnA:  TTTCACGAAGCTAATTTGACGTTGTTTGGCAATGTAAGCGGTTTTaaaaCATTTTTTATACTCCTGCGTCCTGTTGCTTATGATTAAGCAACAA

Looks perfect! Let's try our two minus strand genes. Remember lacI is Rep = 1, strand = '-', so attB should be fwd, and ligA is Rep = 2, strand = '-', so attB should be rev.

In [16]:
print('lacI: ', wgregseq.get_target_oligo(367510-3, 367510-2, genome , 90, "-", "tttt"))
print('ligA: ', wgregseq.get_target_oligo(2530176-3, 2530176-2, genome , 90, "-", "tttt"))
lacI:  GCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGTGttttAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCT
ligA:  GCGAAGCGTCGTTCGCAGTTCTGTCAGTTGTTGTTCGATTGATTCaaaaCATATCGCACCATCAATGCTAAAAACCCCCGACAAGCGGGGGTTC

Complementarity issue

After all this, I realized in the pool of oligos it could be a problem to have complementary attB sequences that could bind to each other as single stranded oligos.

5' ggcttgtcgacgacggcggtctccgtcgtcaggatcat 3'

3' ccgaacagctgctgccgccagaggcagcagtcctagta 5'

These sequences are 38 nt long and GC-rich, so their melting temperatures are really high (> 72 °C). I worry that lots of these duplexes forming could inhibit PCR. I'm not sure, but I don't think it's worth the risk of making the entire library unusable.
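
As a quick sanity check of the length, complementarity and GC content (not of the melting temperature itself), here is a plain-Python sketch. The `revcomp` helper is an assumption introduced for illustration; the attB sequence is the default from get_target_oligo().

```python
attB_fwd = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'


def revcomp(seq):
    """Reverse complement of a lower-case DNA string."""
    return seq.translate(str.maketrans('acgt', 'tgca'))[::-1]


attB_rev = revcomp(attB_fwd)
gc_fraction = sum(base in 'gc' for base in attB_fwd) / len(attB_fwd)

print(len(attB_fwd))         # 38
print(attB_rev)              # the rev attB, written 5' -> 3'
print(round(gc_fraction, 2))
```

Reading `attB_rev` backwards gives the 3' to 5' strand drawn above, which is why an oligo carrying fwd attB can anneal to one carrying rev attB.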

To solve this I'm simply going to encode the fwd attB sequence in every oligo. To do this I'm going to remake the oligos using an alternative function, get_target_oligo_2(). This function has an optional parameter attB_lock = False. The default behavior is the same as the original above, but when attB_lock = True the attB sequence is simply pasted into the oligo in the requested orientation, without regard for the replichore etc.

This will result in insertions facing the 'wrong' direction in the final ORBIT library, but I think that's fine for now until I can confirm it won't be a PCR issue or find another way to deal with it.
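
The source of get_target_oligo_2() is not printed in this notebook, so the following is only a guess at the attB choice it makes, written to match the description above. `pick_attB` is a hypothetical name, and the locked behavior for attB_dir = '-' is an assumption.

```python
def pick_attB(attB_fwd_seq, rep, attB_dir, attB_lock=False):
    """Hypothetical sketch of the attB choice; not the wgregseq source."""
    complement = str.maketrans('acgtACGT', 'tgcaTGCA')
    attB_rev_seq = attB_fwd_seq.translate(complement)[::-1]

    if attB_lock:
        # Locked: follow attB_dir literally, whatever the replichore (assumed).
        return attB_fwd_seq if attB_dir == '+' else attB_rev_seq

    # Unlocked (original behavior): forward only when the strand the oligo is
    # read from ('-' for replichore 1, '+' for replichore 2) matches attB_dir.
    oligo_strand = '-' if rep == 1 else '+'
    return attB_fwd_seq if oligo_strand == attB_dir else attB_rev_seq


for rep in (1, 2):
    print(rep, pick_attB('tttt', rep, '+'), pick_attB('tttt', rep, '+', attB_lock=True))
# 1 aaaa tttt
# 2 tttt tttt
```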

In [17]:
print('lacI: ', wgregseq.get_target_oligo_2(367510-3, 367510-2, genome , 90, "+", "tttt", attB_lock = True))
print('ligA: ', wgregseq.get_target_oligo_2(2530176-3, 2530176-2, genome , 90, "+", "tttt", attB_lock = True))
lacI:  GCATGATAGCGCCCGGAAGAGAGTCAATTCAGGGTGGTGAATGTGttttAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCT
ligA:  GCGAAGCGTCGTTCGCAGTTCTGTCAGTTGTTGTTCGATTGATTCttttCATATCGCACCATCAATGCTAAAAACCCCCGACAAGCGGGGGTTC
In [18]:
print('cbrC: ', wgregseq.get_target_oligo_2(cbrC_start + 2, cbrC_start + 3, genome , 90, "+", "tttt", attB_lock = True))
print('asnA: ', wgregseq.get_target_oligo_2(asnA_start + 2, asnA_start + 3, genome , 90, "+", "tttt", attB_lock = True))
cbrC:  TACTTTATCTTTGGGCTACTCAAAAGCAGACAGGATGTTTCTATGttttACTCAAAATATCAGGCCGTTACCCCAATTCAAATATCATCCCAAG
asnA:  TTTCACGAAGCTAATTTGACGTTGTTTGGCAATGTAAGCGGTTTTttttCATTTTTTATACTCCTGCGTCCTGTTGCTTATGATTAAGCAACAA

Looks good. Now we can move on to actually designing oligos for lots of genes.

TF gene import

You can find the public biocyc table here https://biocyc.org/group?id=Curated_DNA_binding_TRs_public. These genes come from searching ecocyc using the multifunctional terms "regulation" > "type of regulation" > "transcriptional" (334) or "unknown" (32). I chose these annotations because this list at least included all the TFs that I quickly searched for from the RegSeq paper. In particular, yieP was in 'unknown' even though it's annotated as a DNA binding TF. This was true for several others as well. With this list of 368 I then added the GO term annotations to the table and manually checked that each TF had at least some sort of 'DNA-binding' annotation. This got rid of a lot of things like histidine sensor kinases (the sensor part of two-component systems). Naturally there were some weird edge cases, so I had to make some manual decisions. For almost everything, if it didn't have the DNA binding annotation I removed it. One specific exception I recall was a co-regulator that functions with CRP. It did not bind DNA itself, but helped to regulate a subset of CRP regulated genes. Obviously this is a rabbit hole and we could certainly use less stringent criteria, but I thought this was a reasonable way to proceed for this preliminary ORBIT test.

Ultimately, we ended up with exactly 300 genes, which are full of some classic TFs as well as totally uncharacterized putative genes whose function was inferred only by homology.

In [19]:
df = pd.read_csv("Curated_DNA_binding_transcriptional_regulators.txt", sep = '\t')

df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[19]:
Gene Name Left-End-Position Right-End-Position Direction
0 aaeR 3389520 3390449 +
1 abgR 1404741 1405649 +
2 acrR 485761 486408 +
3 ada 2309341 2310405 -
4 adiY 4337168 4337929 -
... ... ... ... ...
295 yqhC 3154262 3155218 -
296 ytfH 4434113 4434493 +
297 zntR 3438705 3439130 -
298 zraR 4203320 4204645 +
299 zur 4259488 4260003 -

300 rows × 4 columns

First, let's make sure that this table is complete.

In [20]:
print('Unique\n',df.nunique())

print('\nNulls\n',df.isnull().sum())
Unique
 Gene Name                        300
Product Name                     300
GO terms (molecular function)    169
Left-End-Position                300
Right-End-Position               300
Direction                          2
dtype: int64

Nulls
 Gene Name                         0
Product Name                      0
GO terms (molecular function)    10
Left-End-Position                 0
Right-End-Position                0
Direction                         0
dtype: int64

Ok, so we don't seem to have any missing values for our essential parameters, left, right and direction.
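
If this table gets regenerated, the same check could be made to fail loudly instead of being read off by eye. Below is a minimal sketch on two rows copied from the table above; `check_gene_table` and the two extra checks (valid direction, left end below right end) are my own additions, not something the notebook already does.

```python
import pandas as pd

REQUIRED = ['Left-End-Position', 'Right-End-Position', 'Direction']


def check_gene_table(df, required=REQUIRED):
    """Hypothetical sanity check for the columns the oligo design depends on."""
    n_missing = df[required].isnull().sum()
    if n_missing.any():
        raise ValueError(f"missing values:\n{n_missing[n_missing > 0]}")
    if not df['Direction'].isin(['+', '-']).all():
        raise ValueError("Direction must be '+' or '-'")
    if not (df['Left-End-Position'] < df['Right-End-Position']).all():
        raise ValueError("left end must be smaller than right end")
    return True


# Two rows copied from the table above
toy = pd.DataFrame({
    'Gene Name': ['aaeR', 'ada'],
    'Left-End-Position': [3389520, 2309341],
    'Right-End-Position': [3390449, 2310405],
    'Direction': ['+', '-'],
})
print(check_gene_table(toy))  # True
```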

Gene length considerations

Another thing we should examine before proceeding is the length distribution of these genes. Remember these oligos will need to delete almost the entire gene, and efficiencies diminish significantly as deletion length increases.

In [21]:
df['length'] = df['Right-End-Position'] - df['Left-End-Position']

df['length'].describe()
Out[21]:
count     300.000000
mean      804.446667
std       429.946112
min       200.000000
25%       575.000000
50%       758.000000
75%       932.000000
max      3962.000000
Name: length, dtype: float64

Ok, so there's quite a range in lengths from 200 bp to almost 4kb. That will yield some pretty big differences in efficiency. However, we can see that the middle 50% falls between 575 and 932bp, which isn't so bad. Let's take a look at the distributions:

In [22]:
scatter = hv.Scatter(df, 'Left-End-Position', 'length').opts(width = 500)*hv.HLine(575).opts(color = 'red') 
dist = hv.Distribution(df, 'length' ).opts(width = 500, bandwidth = 0.1) * hv.VLine(575).opts(color = 'red')

scatter + dist
Out[22]:

I think the biggest concern here is that the short deletions could be way overrepresented in the library. That's not a huge deal as long as we know what the composition is, but an extreme bias could be quite suboptimal. With this in mind, let's split our library into two sublibraries. Somewhat arbitrarily I have set the cutoff point at the bottom 25% line, 575 bp. This separates the pool pretty nicely by length, where the small subpool is 200-575 bp and the longer pool is mostly 575-1000 bp, but does include some multi-kb sequences as well.
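
One way to make such a split hard to get wrong is to build a single boolean mask and use its complement, so every gene lands in exactly one sublibrary. This is a sketch on a few rows copied from the tables in this notebook; the `toy` frame and variable names are illustrative.

```python
import pandas as pd

# A few rows copied from the tables in this notebook
toy = pd.DataFrame({
    'Gene Name': ['aaeR', 'acrR', 'alpA', 'argR', 'yjdC', 'zraR'],
    'length':    [929,    647,    212,    470,    575,    1325],
})

# 575 is the 25th percentile of the full 300-gene table;
# on the real data it could be computed as df['length'].quantile(0.25).
cutoff = 575

is_short = toy['length'] <= cutoff
toy_short, toy_long = toy[is_short], toy[~is_short]

# Every gene lands in exactly one sublibrary.
assert len(toy_short) + len(toy_long) == len(toy)
print(toy_short['Gene Name'].tolist())  # ['alpA', 'argR', 'yjdC']
print(toy_long['Gene Name'].tolist())   # ['aaeR', 'acrR', 'zraR']
```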

In [23]:
df_short = df.loc[df['length']<=575]

df_short.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[23]:
Gene Name Left-End-Position Right-End-Position Direction length
10 alpA 2758644 2758856 + 212
16 argR 3384703 3385173 + 470
17 ariR 1216369 1216635 + 266
18 arsR 3648528 3648881 + 353
20 asnC 3926545 3927003 - 458
... ... ... ... ... ...
283 yjdC 4362733 4363308 - 575
289 ylbG 529645 530016 - 371
296 ytfH 4434113 4434493 + 380
297 zntR 3438705 3439130 - 425
299 zur 4259488 4260003 - 515

76 rows × 5 columns

Ok, so 76 TFs fall into this 'short' category.

In [24]:
df_long = df.loc[df['length']>575]
df_long.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[24]:
Gene Name Left-End-Position Right-End-Position Direction length
0 aaeR 3389520 3390449 + 929
1 abgR 1404741 1405649 + 908
2 acrR 485761 486408 + 647
3 ada 2309341 2310405 - 1064
4 adiY 4337168 4337929 - 761
... ... ... ... ... ...
292 ypdC 2501130 2501987 + 857
293 yphH 2682863 2684056 + 1193
294 yqeI 2988502 2989311 + 809
295 yqhC 3154262 3155218 - 956
298 zraR 4203320 4204645 + 1325

224 rows × 5 columns

And 224 TFs fall into this 'long' category. That seems to work well for now - let's proceed.

Gene overlap considerations

The final thing to consider before designing all of the targeting oligos is exactly which portion of each gene to delete. Often people delete everything except the start and stop codon, which I think is a good option. But I wanted to consider how often that mutation would actually disrupt another unintended gene. Apparently a lot of E. coli genes overlap slightly or almost overlap. See this link and many more modern papers. It seems that the most common gene overlaps are 1-4 nt. Also note that the RBS (Shine-Dalgarno sequence) is typically 8 nt upstream of the start codon and is itself ~6 nt long. So for a gene that just touches the next gene (0 nt overlap) we should probably end the upstream gene deletion ~15 nt early to avoid deleting the RBS. Of course this is still imperfect if the promoter has been deleted or separated from the gene by the integrating plasmid. However, if we ultimately want markerless deletions with small scars, these situations are important.

Gene of interest (GOI) upstream of overlapping gene (G2):

  • < 0 overlap: |--GOI--> |--G2-->
  • = 0 overlap: |--GOI-->|--G2-->
  • > 0 overlap: |--GOI--|->--G2-->

Gene of interest downstream of overlapping gene:

  • < 0 overlap: |--G2--> |--GOI-->
  • = 0 overlap: |--G2-->|--GOI-->
  • > 0 overlap: |--G2--|->--GOI-->

Optimal parameters to accommodate a 4 nt overlap: the upstream homology arm should end at nt 6 instead of 3, and the downstream homology arm should end at -21 instead of -3.

From that link above I downloaded the plain text file showing all genes that either overlap or almost overlap.

In [25]:
df_ovlp = pd.read_csv("Ecoli_overlaps.txt", sep = '\t', skiprows = 12, skipfooter=12, engine = 'python')
df_ovlp
Out[25]:
Bnum Name Str Start Stop Bnum.1 Name.1 Str.1 Start.1 Stop.1 Overlap New start Unnamed: 12
0 b0002 thrA 1 337 2799 b0003 thrB 1 2801 3733 -1 NaN NaN
1 b0003 thrB 1 2801 3733 b0004 thrC 1 3734 5020 0 NaN NaN
2 b0013 yaaI 2 11382 11786 b0011 NaN 2 10643 11356 -25 11384.0 NaN
3 b0022 insA_1 2 20233 20508 b0021 insB_1 2 19811 20314 82 20342.0 NaN
4 b0024 NaN 1 21181 21399 b0025 yaaC 1 21407 22348 -7 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1499 b4380 yjjI 2 4613084 4614634 b4379 yjjW 2 4612249 4613112 29 4613140.0 NaN
1500 b4389 sms 1 4623481 4624863 b4390 nadR 1 4624863 4626116 1 NaN NaN
1501 b4397 creA 1 4633090 4633563 b4398 creB 1 4633576 4634265 -12 NaN NaN
1502 b4398 creB 1 4633576 4634265 b4399 creC 1 4634265 4635689 1 NaN NaN
1503 b4405 NaN 1 3975603 3976217 b3793 rffT 1 3976214 3977566 4 NaN NaN

1504 rows × 13 columns

We can then simply look at the distribution and summary statistics of overlaps for these 1504 pairs of nearby genes.

In [26]:
df_ovlp['gene'] = 'gene'

print(df_ovlp['Overlap'].describe())

hv.Distribution(df_ovlp, 'Overlap').opts(width = 800, height = 400)
count    1504.000000
mean        2.905585
std        27.991047
min       -26.000000
25%       -11.000000
50%         0.000000
75%         4.000000
max       263.000000
Name: Overlap, dtype: float64
Out[26]:

So, certainly we can't accommodate genes that overlap by 50-100 bp, since some genes are only 200-500 bp long, but we can see from the summary statistics that 75% of nearby gene pairs overlap by 4 nt or less. So at a minimum we would need to end the deletion in the preceding gene 14 + 4 nt upstream to avoid hitting the RBS. I settled on +6 nt for gene start and -21 nt for gene end.
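
The arithmetic behind the gene-end margin, written out. The 8 nt spacing and 6 nt Shine-Dalgarno length are the rough figures quoted earlier in this section, and the variable names are illustrative.

```python
sd_spacing = 8   # approx. nt between the Shine-Dalgarno sequence and the start codon
sd_length = 6    # approx. length of the Shine-Dalgarno sequence
max_overlap = 4  # 75th percentile of the overlap distribution

rbs_footprint = sd_spacing + sd_length    # nt upstream of the next gene's start to protect
min_margin = rbs_footprint + max_overlap  # nt to leave intact at the end of the gene of interest
chosen_margin = 21                        # value used in this notebook

print(rbs_footprint, min_margin, chosen_margin - min_margin)  # 14 18 3
```

So the chosen -21 covers the 18 nt minimum with 3 nt to spare.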

Let's first make our first and last codon positions:

In [27]:
df['left_codon'] = df['Left-End-Position'] + 2
df['right_codon'] = df['Right-End-Position'] - 2

df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[27]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon
0 aaeR 3389520 3390449 + 929 3389522 3390447
1 abgR 1404741 1405649 + 908 1404743 1405647
2 acrR 485761 486408 + 647 485763 486406
3 ada 2309341 2310405 - 1064 2309343 2310403
4 adiY 4337168 4337929 - 761 4337170 4337927
... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 - 956 3154264 3155216
296 ytfH 4434113 4434493 + 380 4434115 4434491
297 zntR 3438705 3439130 - 425 3438707 3439128
298 zraR 4203320 4204645 + 1325 4203322 4204643
299 zur 4259488 4260003 - 515 4259490 4260001

300 rows × 7 columns

And now we can make our "avoid_overlap" coordinates, taking into consideration the gene direction:

In [28]:
df.loc[df['Direction']=='+', 'left_avd_ovlp'] = df['Left-End-Position'] + 5
df.loc[df['Direction']=='-', 'left_avd_ovlp'] = df['Left-End-Position'] + 20

df.loc[df['Direction']=='+', 'right_avd_ovlp'] = df['Right-End-Position'] - 20
df.loc[df['Direction']=='-', 'right_avd_ovlp'] = df['Right-End-Position'] - 5

df['gene'] = df['Gene Name']

#df['right_avd_ovlp'] = df['Right-End-Position'] - 17

df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[28]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene
0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR
1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR
2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR
3 ada 2309341 2310405 - 1064 2309343 2310403 2309361.0 2310400.0 ada
4 adiY 4337168 4337929 - 761 4337170 4337927 4337188.0 4337924.0 adiY
... ... ... ... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 - 956 3154264 3155216 3154282.0 3155213.0 yqhC
296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH
297 zntR 3438705 3439130 - 425 3438707 3439128 3438725.0 3439125.0 zntR
298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR
299 zur 4259488 4260003 - 515 4259490 4260001 4259508.0 4259998.0 zur

300 rows × 10 columns
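Note that the new columns come out as floats, because assigning into only part of a new column with `.loc` leaves NaN in the remaining rows until the second assignment fills them. A possible alternative, sketched below under the assumption that the same +5 / -20 offsets are wanted, is to pick the offset per row with `np.where` so the columns stay integers. `add_avoid_overlap` is a hypothetical helper name, and the two rows are copied from the table above.

```python
import numpy as np
import pandas as pd

def add_avoid_overlap(df, start_offset=5, end_offset=20):
    """Return a copy of df with integer avoid-overlap coordinates.

    For '+' genes the gene start is the left end; for '-' genes it is the right end.
    """
    out = df.copy()
    fwd = out['Direction'] == '+'
    # Left end: gene start for '+' genes, gene end for '-' genes.
    out['left_avd_ovlp'] = out['Left-End-Position'] + np.where(fwd, start_offset, end_offset)
    # Right end: gene end for '+' genes, gene start for '-' genes.
    out['right_avd_ovlp'] = out['Right-End-Position'] - np.where(fwd, end_offset, start_offset)
    return out

# Two rows taken from the table above (aaeR is '+', ada is '-').
toy = pd.DataFrame({
    'Gene Name': ['aaeR', 'ada'],
    'Left-End-Position': [3389520, 2309341],
    'Right-End-Position': [3390449, 2310405],
    'Direction': ['+', '-'],
})

result = add_avoid_overlap(toy)
print(result[['Gene Name', 'left_avd_ovlp', 'right_avd_ovlp']])
```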

Let's take a quick look at the first 5 genes to make sure it worked as expected:

In [29]:
# plotnine is needed for this plot; its import is commented out in the first cell.
from plotnine import (ggplot, aes, geom_segment, geom_point,
                      position_nudge, facet_wrap, theme)

(
    ggplot(df.head()) + 
    geom_segment(aes(x = 'Left-End-Position', xend = 'Right-End-Position', y = 'Gene Name', yend = 'Gene Name')) + 
    geom_point(aes(x = 'Left-End-Position', y = 'Gene Name'), shape = '|', size = 5) + 
    geom_point(aes(x = 'Right-End-Position', y = 'Gene Name'), shape = '|', size = 5)+
    geom_point(aes(x = 'left_codon', y = 'Gene Name'), color = 'red', shape = '<', size = 3, position = position_nudge(y = 0.2))+
    geom_point(aes(x = 'right_codon', y = 'Gene Name'), color = 'red', shape = '>', size = 3, position = position_nudge(y = 0.2))+
    geom_point(aes(x = 'left_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '<', size = 3, position = position_nudge(y = -0.2))+
    geom_point(aes(x = 'right_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '>', size = 3, position = position_nudge(y = -0.2))+
    facet_wrap('~gene + Direction',nrow = 5, scales = 'free') +theme(dpi = 200, figure_size=(5,4))
)

Looks good. You can see that the more inward blue arrows (-21 nt at gene end) are as expected for the different gene directions. I find this to be quite a helpful overview, so let's go ahead and look at all 300 genes this way to make sure things look good:

In [ ]:
(
    ggplot(df) + 
    geom_segment(aes(x = 'Left-End-Position', xend = 'Right-End-Position', y = 'Gene Name', yend = 'Gene Name')) + 
    geom_point(aes(x = 'Left-End-Position', y = 'Gene Name'), shape = '|', size = 5) + 
    geom_point(aes(x = 'Right-End-Position', y = 'Gene Name'), shape = '|', size = 5)+
    geom_point(aes(x = 'left_codon', y = 'Gene Name'), color = 'red', shape = '<', size = 3, position = position_nudge(y = 0.2))+
    geom_point(aes(x = 'right_codon', y = 'Gene Name'), color = 'red', shape = '>', size = 3, position = position_nudge(y = 0.2))+
    geom_point(aes(x = 'left_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '<', size = 3, position = position_nudge(y = -0.2))+
    geom_point(aes(x = 'right_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '>', size = 3, position = position_nudge(y = -0.2))+
    facet_wrap('~gene + Direction',nrow = 30, scales = 'free') +
    theme(dpi = 300, figure_size=(30,30))
)

#p.make()

# Then you can alter its properties
#p.set_size_inches(15, 5, forward=True)
#p.set_dpi(100)
#p.fig

# And display the final figure

get_target_oligo_df()

With all of that out of the way, we can test out our function to get target oligos for our entire df of coordinates.

Let's look at the source code for the get_target_oligo_df() function.

In [30]:
lines = inspect.getsource(wgregseq.get_target_oligo_df)
print(lines)
def get_target_oligo_df(df, left_pos_col, right_pos_col, dir_col, genome, homology = 90, attB_fwd_seq = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'):
    
    """
    Apply get_target_oligo to a dataframe of genomic coordinates and directions. Iterates through df rows calling get_target_oligo given the parameters specified in each column.
    
    Given a set of parameters, get an ORBIT oligo that targets the lagging strand. 
    Left and right positions are absolute genomic coordinates that specify the final nucleotides to keep unmodified in the genome, 
    everything in between will be replaced by attB. In other words the left position nucleotide is the final nt before attB in the oligo.
    The right position nt is the first nt after attB in the oligo.
    
    This function determines the lagging strand by calling `get_replichore()` on the left_pos.
    Typically attB_dir should be set to the same direction as the gene of interest, such that the integrating plasmid will insert with payload facing downstream.
    attB_fwd_seq can be modified, and the total homology can be modified, but should be an even number since homology arms are symmetric. 
        
    Parameters
    -----------------
    df : pd.DataFrame
        Pandas dataframe containing the required genomic coordinates, and gene directions.
    left_pos_col : str
        Column name of left genomic coordinate of desired attB insertion. attB is added immediately after this nt. 
    right_pos_col : str
        Column name of right genomic coordinate of desired attB insertion. attB is added immediately before this nt. 
    dir_col : str
        Column name of desired direction of attB based on genomic strand. Typically same direction as gene.
    genome : str
        Genome as a string.
    homology : int (even)
        Total homology length desired for oligo. Arm length = homology / 2.   
    attB_fwd_seq : str
        Sequence of attB to insert between homology arms.
    verbose : bool
        If true, prints details about genomic positions and replichore.
    Returns
    ---------------
    df_results : pd.DataFrame
        Adds column 'oligo' to input df. 'oligo' contains a string of the targeting oligo sequence against lagging strand, including the attB sequence in the correct orientation.
    """
    
    df_tmp = pd.DataFrame()
    df_results = pd.DataFrame()
    
    for i,row in df.iterrows():
        
        left_pos = row[left_pos_col]
        right_pos = row[right_pos_col]
        attB_dir = row[dir_col]
        
        oligo = get_target_oligo(left_pos, right_pos, genome, homology, attB_dir, attB_fwd_seq)

        df_tmp = df.iloc[[i],:]
        
        df_tmp['oligo'] = oligo
        
        df_results = pd.concat([df_results,df_tmp])
    
    return df_results
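One thing worth noting about the loop above: `iterrows()` yields index labels, while `df.iloc[[i],:]` expects positions, so the pattern relies on the df having a default 0..n-1 index, and assigning into the `iloc` slice is what triggers the `SettingWithCopyWarning` seen in the outputs below. Here is a minimal sketch of an alternative that builds the whole column on an explicit copy. `toy_oligo` is a hypothetical stand-in that only glues the arms and attB together under an assumed 1-based coordinate convention; it does not do the lagging-strand logic and is not `wgregseq.get_target_oligo`. `add_oligo_column` is likewise a made-up name.

```python
import pandas as pd

ATTB = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'

def toy_oligo(left_pos, right_pos, genome, homology=90, attB_seq=ATTB):
    """Hypothetical stand-in: left arm + attB + right arm, no strand logic.

    Assumes 1-based coordinates: left_pos is the last nt kept on the left,
    right_pos is the first nt kept on the right.
    """
    arm = homology // 2
    left_arm = genome[left_pos - arm:left_pos]
    right_arm = genome[right_pos - 1:right_pos - 1 + arm]
    return left_arm + attB_seq + right_arm

def add_oligo_column(df, left_pos_col, right_pos_col, genome, homology=90):
    """Return a copy of df with an 'oligo' column, leaving df untouched."""
    out = df.copy()  # explicit copy, so the assignment below raises no warning
    out['oligo'] = [
        toy_oligo(int(left), int(right), genome, homology)
        for left, right in zip(out[left_pos_col], out[right_pos_col])
    ]
    return out

# Toy genome and a df with a deliberately non-default index.
genome = 'ACGT' * 100
toy = pd.DataFrame({'left': [100, 150], 'right': [120, 200]}, index=[7, 3])

result = add_oligo_column(toy, 'left', 'right', genome, homology=20)
print(result)
```

Because nothing here indexes by position, the index of the input df is carried through unchanged.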

Ok, let's test this out with our first set of oligos - the short group of genes with the first and last codon deletion:

In [31]:
df_first_last = wgregseq.get_target_oligo_df(df, 'left_codon', 'right_codon', 'Direction',genome)
df_first_last.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
/Users/tomroschinger/git/Reg-Seq2/software_module/wgregseq/orbit.py:347: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
Out[31]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG
1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG
2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG
3 ada 2309341 2310405 - 1064 2309343 2310403 2309361.0 2310400.0 ada GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC
4 adiY 4337168 4337929 - 761 4337170 4337927 4337188.0 4337924.0 adiY TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC
... ... ... ... ... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 - 956 3154264 3155216 3154282.0 3155213.0 yqhC TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA
296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT
297 zntR 3438705 3439130 - 425 3438707 3439128 3438725.0 3439125.0 zntR GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAatgatcctgacgacggagaccgccgtcgtcgacaagccCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT
298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG
299 zur 4259488 4260003 - 515 4259490 4260001 4259508.0 4259998.0 zur GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT

300 rows × 11 columns

Seems to work well. Now let's make the targeting oligos for the "avoid_overlap" coordinates.

In [32]:
df_avd_ovlp = wgregseq.get_target_oligo_df(df, 'left_avd_ovlp', 'right_avd_ovlp', 'Direction',genome)
df_avd_ovlp.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
/Users/tomroschinger/git/Reg-Seq2/software_module/wgregseq/orbit.py:347: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
Out[32]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA
1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAatgatcctgacgacggagaccgccgtcgtcgacaagccAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC
2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTatgatcctgacgacggagaccgccgtcgtcgacaagccTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT
3 ada 2309341 2310405 - 1064 2309343 2310403 2309361.0 2310400.0 ada CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCatgatcctgacgacggagaccgccgtcgtcgacaagccTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG
4 adiY 4337168 4337929 - 761 4337170 4337927 4337188.0 4337924.0 adiY GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT
... ... ... ... ... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 - 956 3154264 3155216 3154282.0 3155213.0 yqhC CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGatgatcctgacgacggagaccgccgtcgtcgacaagccTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG
296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGatgatcctgacgacggagaccgccgtcgtcgacaagccACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG
297 zntR 3438705 3439130 - 425 3438707 3439128 3438725.0 3439125.0 zntR GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCatgatcctgacgacggagaccgccgtcgtcgacaagccATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT
298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAatgatcctgacgacggagaccgccgtcgtcgacaagccCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG
299 zur 4259488 4260003 - 515 4259490 4260001 4259508.0 4259998.0 zur AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT

300 rows × 11 columns

Modify to avoid complementarity

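Again, I will use get_target_oligo_df_2(attB_lock = True) and pass '+' as the Direction for every gene, so that attB ends up in the same orientation in all oligos (compare the mixed attB orientations in the oligo columns above).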

In [33]:
df_2 = df.copy()
df_2['Direction'] = '+'
df_2.head()
Out[33]:
Gene Name Product Name GO terms (molecular function) Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene
0 aaeR LysR-type transcriptional regulator AaeR transcription regulatory region sequence-specific DNA binding // DNA-binding transcription factor activity // DNA binding 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR
1 abgR putative LysR-type DNA-binding transcriptional regulator AbgR transcription regulatory region sequence-specific DNA binding // DNA binding // DNA-binding transcription factor activity 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR
2 acrR DNA-binding transcriptional repressor AcrR transcription regulatory region sequence-specific DNA binding // protein binding // DNA binding // bacterial-type RNA polymerase transcription regulatory region sequence-specific DNA binding // to... 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR
3 ada DNA-binding transcriptional dual regulator / DNA repair protein Ada protein binding // transferase activity // methyltransferase activity // metal ion binding // sequence-specific DNA binding // zinc ion binding // catalytic activity // DNA binding // DNA-binding ... 2309341 2310405 + 1064 2309343 2310403 2309361.0 2310400.0 ada
4 adiY DNA-binding transcriptional activator AdiY sequence-specific DNA binding // DNA binding // DNA-binding transcription factor activity 4337168 4337929 + 761 4337170 4337927 4337188.0 4337924.0 adiY
In [34]:
df_first_last_2 = wgregseq.get_target_oligo_df_2(df_2, 'left_codon', 'right_codon','Direction',genome, attB_lock = True)
df_first_last_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
/Users/tomroschinger/git/Reg-Seq2/software_module/wgregseq/orbit.py:406: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
Out[34]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG
1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG
2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG
3 ada 2309341 2310405 + 1064 2309343 2310403 2309361.0 2310400.0 ada GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC
4 adiY 4337168 4337929 + 761 4337170 4337927 4337188.0 4337924.0 adiY TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC
... ... ... ... ... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 + 956 3154264 3155216 3154282.0 3155213.0 yqhC TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA
296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT
297 zntR 3438705 3439130 + 425 3438707 3439128 3438725.0 3439125.0 zntR GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT
298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG
299 zur 4259488 4260003 + 515 4259490 4260001 4259508.0 4259998.0 zur GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT

300 rows × 11 columns

In [35]:
df_avd_ovlp_2 = wgregseq.get_target_oligo_df_2(df_2, 'left_avd_ovlp', 'right_avd_ovlp', 'Direction',genome,attB_lock = True)
df_avd_ovlp_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
/Users/tomroschinger/git/Reg-Seq2/software_module/wgregseq/orbit.py:406: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp['oligo'] = oligo
Out[35]:
Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA
1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC
2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTggcttgtcgacgacggcggtctccgtcgtcaggatcatTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT
3 ada 2309341 2310405 + 1064 2309343 2310403 2309361.0 2310400.0 ada CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCggcttgtcgacgacggcggtctccgtcgtcaggatcatTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG
4 adiY 4337168 4337929 + 761 4337170 4337927 4337188.0 4337924.0 adiY GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT
... ... ... ... ... ... ... ... ... ... ... ...
295 yqhC 3154262 3155218 + 956 3154264 3155216 3154282.0 3155213.0 yqhC CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG
296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG
297 zntR 3438705 3439130 + 425 3438707 3439128 3438725.0 3439125.0 zntR GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCggcttgtcgacgacggcggtctccgtcgtcaggatcatATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT
298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG
299 zur 4259488 4260003 + 515 4259490 4260001 4259508.0 4259998.0 zur AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT

300 rows × 11 columns
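Rather than eyeballing the lowercase attB in each row, the orientation can be checked programmatically. This is a small sketch in plain Python, with the two example oligos copied from the aaeR and abgR rows of the first `get_target_oligo_df()` output above. `attB_orientation` is a hypothetical helper name, and the check is case-sensitive because the oligos carry attB in lowercase.

```python
ATTB_FWD = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'

# Translation table for complementing DNA, preserving case.
_COMPLEMENT = str.maketrans('acgtACGT', 'tgcaTGCA')

def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def attB_orientation(oligo, attB_fwd=ATTB_FWD):
    """Return 'fwd', 'rev', or 'missing' depending on how attB appears in oligo."""
    if attB_fwd in oligo:
        return 'fwd'
    if reverse_complement(attB_fwd) in oligo:
        return 'rev'
    return 'missing'

# aaeR and abgR oligos from the first/last codon table (before locking attB).
oligos = [
    'CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG',
    'GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG',
]

orientations = [attB_orientation(o) for o in oligos]
print(orientations)
```

Applied to an 'oligo' column (for example with `.map(attB_orientation)` followed by `.value_counts()`), this would give a one-line summary of how many oligos carry attB in each orientation.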

Looks good. Let's proceed to making the final subpools and outputting the oligo file.

Final QC and output

Now that we've constructed our targeting oligos with two different coordinate sets, let's split up the oligos into long and short deletions again. This should give us 4 different subpools to work with.
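The split itself is just a threshold on the 'length' column, so the four cells below could also be written with one small helper. This is only a sketch: `split_by_length` is a hypothetical name, the 575 cutoff is the one used in the cells below, and the toy lengths are taken from the alpA and aaeR rows plus one made-up value sitting exactly on the cutoff.

```python
import pandas as pd

def split_by_length(df, threshold=575, length_col='length'):
    """Split df into (short, long) pools; rows at the threshold go to long."""
    short = df.loc[df[length_col] < threshold].reset_index()
    long_ = df.loc[df[length_col] >= threshold].reset_index()
    return short, long_

toy = pd.DataFrame({
    'Gene Name': ['alpA', 'edge', 'aaeR'],  # 'edge' is a made-up row
    'length': [212, 575, 929],
})

short_pool, long_pool = split_by_length(toy)
print(len(short_pool), len(long_pool))
```

As in the cells below, `reset_index()` keeps the original row number in an 'index' column, which makes it easy to trace a subpool row back to the full table.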

In [36]:
df_first_last_short_2 = df_first_last_2.loc[df_first_last_2['length']<575].reset_index()
df_first_last_short_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[36]:
index Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 10 alpA 2758644 2758856 + 212 2758646 2758854 2758649.0 2758836.0 alpA AATCTCTCTGCAACCAAAGTGAACCAATGAGAGGCAACAAGAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGAGGGTGTTACATGAATTCATACTCAATTGCTGTCATCGGAGTG
1 16 argR 3384703 3385173 + 470 3384705 3385171 3384708.0 3385153.0 argR TATGCACAATAATGTTGTATCAACCACCATATCGGGTGACTTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATCTCTGCCCCGTCGTTTCTGACGGCGGGGAAAATGTTGCTTA
2 17 ariR 1216369 1216635 + 266 1216371 1216633 1216374.0 1216615.0 ariR GATGAATGAGTTTTCTATAAACTTATACTTAATAATTAGAAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGGTAACCTCTCATCTTACTTATGAAATTTTAATGTATTCTGT
3 18 arsR 3648528 3648881 + 353 3648530 3648879 3648533.0 3648861.0 arsR GCTTCGAAGAGAGACACTACCTGCAACAATCAGGAGCGCAATATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAAAATTTAGCTAAACACATATGAATTTTCAGATGTGTTTTATC
4 20 asnC 3926545 3927003 + 458 3926547 3927001 3926565.0 3926998.0 asnC GGCTAAAATAGAATGAATCATCAATCCGCATAAGAAAATCCTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATCGGCTTTTTTAATCCCATACTTTTCCACAGGTAGATCCCAA
... ... ... ... ... ... ... ... ... ... ... ... ...
69 281 yiiE 4079291 4079509 + 218 4079293 4079507 4079296.0 4079489.0 yiiE TAAGGGCATCTGTTTTTTATATTCAAGAATGAAAAATTTTTGTCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTACCAATACCTTACATATATTACTCATTAATGTATGTGCGAA
70 289 ylbG 529645 530016 + 371 529647 530014 529665.0 530011.0 ylbG ATATGAGTGTCGAATCCTTATCCAAAACAAGAGGTAACTCTCATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGAACAAATTTTATCAGGTGACGTTCCGTAAAAAGTTGTATGGAG
71 296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT
72 297 zntR 3438705 3439130 + 425 3438707 3439128 3438725.0 3439125.0 zntR GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT
73 299 zur 4259488 4260003 + 515 4259490 4260001 4259508.0 4259998.0 zur GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT

74 rows × 12 columns

In [37]:
df_first_last_long_2 = df_first_last_2.loc[df_first_last_2['length']>=575].reset_index()
df_first_last_long_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[37]:
index Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG
1 1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG
2 2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG
3 3 ada 2309341 2310405 + 1064 2309343 2310403 2309361.0 2310400.0 ada GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC
4 4 adiY 4337168 4337929 + 761 4337170 4337927 4337188.0 4337924.0 adiY TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC
... ... ... ... ... ... ... ... ... ... ... ... ...
221 292 ypdC 2501130 2501987 + 857 2501132 2501985 2501135.0 2501967.0 ypdC AAAGAATTTCGCCAGTTAATGCATCTTTAATCGGGAACTTTCATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAACGTCAGAAGGTTAATTCTGTTTCCAGCAGCGTCAGGATACTT
222 293 yphH 2682863 2684056 + 1193 2682865 2684054 2682868.0 2684036.0 yphH CGCGGAATAATCACGCAATTAACTAAACAAGGTTTAGTGAAGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATGGCGCGATAACGTAGAAAGGCTTCCCGAAGGAAGCCTTGAT
223 294 yqeI 2988502 2989311 + 809 2988504 2989309 2988507.0 2989291.0 yqeI CTATGTGATCTCCATTTCGATTGATTTAGTGTTTATTGACGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTATAAAAAAAACTTATTATTTATTTTAGTTTTTATCAGTGG
224 295 yqhC 3154262 3155218 + 956 3154264 3155216 3154282.0 3155213.0 yqhC TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA
225 298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG

226 rows × 12 columns

In [38]:
df_avd_ovlp_short_2 = df_avd_ovlp_2.loc[df_avd_ovlp_2['length']<575].reset_index()
df_avd_ovlp_short_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[38]:
index Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 10 alpA 2758644 2758856 + 212 2758646 2758854 2758649.0 2758836.0 alpA CTCTCTGCAACCAAAGTGAACCAATGAGAGGCAACAAGAATGAACggcttgtcgacgacggcggtctccgtcgtcaggatcatCAACGCTGTAAACTTATTTGAGGGTGTTACATGAATTCATACTCA
1 16 argR 3384703 3385173 + 470 3384705 3385171 3384708.0 3385153.0 argR GCACAATAATGTTGTATCAACCACCATATCGGGTGACTTATGCGAggcttgtcgacgacggcggtctccgtcgtcaggatcatCTGTTCGACCAGGAGCTTTAATCTCTGCCCCGTCGTTTCTGACGG
2 17 ariR 1216369 1216635 + 266 1216371 1216633 1216374.0 1216615.0 ariR AAACTTATACTTAATAATTAGAAGTTACATATCATCAGCTGTGTAggcttgtcgacgacggcggtctccgtcgtcaggatcatAAGCATGGTAACCTCTCATCTTACTTATGAAATTTTAATGTATTC
3 18 arsR 3648528 3648881 + 353 3648530 3648879 3648533.0 3648861.0 arsR TCGAAGAGAGACACTACCTGCAACAATCAGGAGCGCAATATGTCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGTAAGAACATTTGCAGTTAAAAATTTAGCTAAACACATATGAAT
4 20 asnC 3926545 3927003 + 458 3926547 3927001 3926565.0 3926998.0 asnC TAAAATAGAATGAATCATCAATCCGCATAAGAAAATCCTATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatATGCGTACCATCAAGCCCTGATCGGCTTTTTTAATCCCATACTTT
... ... ... ... ... ... ... ... ... ... ... ... ...
69 281 yiiE 4079291 4079509 + 218 4079293 4079507 4079296.0 4079489.0 yiiE ATATTCAAGAATGAAAAATTTTTGTCATTCCTTATGCTCCTTACAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGCCATTACCAATACCTTACATATATTACTCATTAATGTATGTGC
70 289 ylbG 529645 530016 + 371 529647 530014 529665.0 530011.0 ylbG TGAGTGTCGAATCCTTATCCAAAACAAGAGGTAACTCTCATGCTTggcttgtcgacgacggcggtctccgtcgtcaggatcatAATCTCAAAAGACGATACTGAACAAATTTTATCAGGTGACGTTCC
71 296 ytfH 4434113 4434493 + 380 4434115 4434491 4434118.0 4434473.0 ytfH AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG
72 297 zntR 3438705 3439130 + 425 3438707 3439128 3438725.0 3439125.0 zntR GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCggcttgtcgacgacggcggtctccgtcgtcaggatcatATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT
73 299 zur 4259488 4260003 + 515 4259490 4260001 4259508.0 4259998.0 zur AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT

74 rows × 12 columns

In [39]:
df_avd_ovlp_long_2 = df_avd_ovlp_2.loc[df_avd_ovlp_2['length']>=575].reset_index()
df_avd_ovlp_long_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Out[39]:
index Gene Name Left-End-Position Right-End-Position Direction length left_codon right_codon left_avd_ovlp right_avd_ovlp gene oligo
0 0 aaeR 3389520 3390449 + 929 3389522 3390447 3389525.0 3390429.0 aaeR TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA
1 1 abgR 1404741 1405649 + 908 1404743 1405647 1404746.0 1405629.0 abgR CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC
2 2 acrR 485761 486408 + 647 485763 486406 485766.0 486388.0 acrR AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTggcttgtcgacgacggcggtctccgtcgtcaggatcatTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT
3 3 ada 2309341 2310405 + 1064 2309343 2310403 2309361.0 2310400.0 ada CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCggcttgtcgacgacggcggtctccgtcgtcaggatcatTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG
4 4 adiY 4337168 4337929 + 761 4337170 4337927 4337188.0 4337924.0 adiY GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT
... ... ... ... ... ... ... ... ... ... ... ... ...
221 292 ypdC 2501130 2501987 + 857 2501132 2501985 2501135.0 2501967.0 ypdC GAATTTCGCCAGTTAATGCATCTTTAATCGGGAACTTTCATGAAAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCGCCCGTTTTCAGGGCTAACGTCAGAAGGTTAATTCTGTTTCC
222 293 yphH 2682863 2684056 + 1193 2682865 2684054 2682868.0 2684036.0 yphH GGAATAATCACGCAATTAACTAAACAAGGTTTAGTGAAGATGAGAggcttgtcgacgacggcggtctccgtcgtcaggatcatGCGCAGTTACGACAGATTTGATGGCGCGATAACGTAGAAAGGCTT
223 294 yqeI 2988502 2989311 + 809 2988504 2989309 2988507.0 2989291.0 yqeI TGTGATCTCCATTTCGATTGATTTAGTGTTTATTGACGTATGTACggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTGAGGTTAATCGTGATTGATTATAAAAAAAACTTATTATTTAT
224 295 yqhC 3154262 3155218 + 956 3154264 3155216 3154282.0 3155213.0 yqhC CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG
225 298 zraR 4203320 4204645 + 1325 4203322 4204643 4203325.0 4204625.0 zraR GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG

226 rows × 12 columns
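Before writing the files, a couple of cheap checks can catch obvious problems. With the default homology of 90 and the 38 nt attB, every oligo should be 128 nt long, and no two genes should end up with the same oligo. The sketch below uses a made-up toy column rather than the real subpools, and `qc_oligos` is a hypothetical helper name.

```python
import pandas as pd

ATTB_LEN = 38  # length of the default attB_fwd_seq

def qc_oligos(df, homology=90, attB_len=ATTB_LEN, oligo_col='oligo'):
    """Count oligos with an unexpected length and exact duplicate oligos."""
    expected = homology + attB_len
    lengths = df[oligo_col].str.len()
    return {
        'expected_length': expected,
        'n_wrong_length': int((lengths != expected).sum()),
        'n_duplicated': int(df[oligo_col].duplicated().sum()),
    }

# Toy column: two identical 128 nt oligos and one that is too short.
toy = pd.DataFrame({'oligo': ['A' * 128, 'A' * 128, 'C' * 100]})

report = qc_oligos(toy)
print(report)
```

For a clean subpool both counts should be zero.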

And finally we will output these oligos as .csv files:

In [40]:
df_first_last_short_2.to_csv("twist_orbit_tf_del_FL_short.csv")
df_first_last_long_2.to_csv("twist_orbit_tf_del_FL_long.csv")

df_avd_ovlp_short_2.to_csv("twist_orbit_tf_del_AO_short.csv")
df_avd_ovlp_long_2.to_csv("twist_orbit_tf_del_AO_long.csv")

Computational Environment

In [41]:
%load_ext watermark
%watermark -v -p wgregseq,numpy,pandas
CPython 3.8.5
IPython 7.19.0

wgregseq 0.0.1
numpy 1.18.1
pandas 1.2.0