The goal of this notebook is to write and use a new function that accurately designs ORBIT targeting oligos. The basic approach will be to reproduce the method of the MODEST server, which is described in a bit more detail in the accompanying paper. MODEST does several different fancy things that could be fun to implement one day, but for now I'm just trying to reproduce the most basic behavior. In theory I could use the webserver itself and then hack things together in text files, but the server is very slow and unreliable... and it's not open source, so I can't modify it.
The basic approach MODEST takes is to take in genomic coordinates, find the appropriate homology arms, and then determine whether the forward or reverse sequence should be used to get an oligo that targets the lagging strand. We are already capable of grabbing bits of the genome from genomic coordinates, so really the only tricky part is determining the lagging strand. Reminder: during DNA replication both + and - strands are copied simultaneously in opposite directions. The new DNA strand that is synthesized continuously in the same direction as the replication fork is called the leading strand, and the new strand that is synthesized discontinuously in small pieces that are later joined (Okazaki fragments) is called the lagging strand. That's about the extent of what you learn in high school biology, but when thinking about specific genes on E. coli's circular genome it gets slightly more complicated.
DNA replication of the circular chromosome is initiated at the origin (oriC) with two replication forks proceeding in opposite directions. Both forks continue until they reach the 'terminus' region, which is surprisingly ambiguously defined. Basically there's a protein called Tus that binds to sites called Ter and stalls these replication forks (so they don't crash into each other?). The terminus is a relatively large region that mostly encompasses these sites on the opposite side of the chromosome from oriC. Therefore it's not exactly clear at what position DNA replication terminates, but one more specific site is called dif. Apparently there's a native recombinase that binds to this site and separates the two replicated chromosomes (this is cool - should learn more), so perhaps that is the truest termination site. See this paper for more detail. Ultimately, it's complicated and not that important for our purpose. The important part is that these two replication forks, which travel in opposite directions from the origin to the terminus, divide the chromosome into two 'replichores'.
For each targeting oligo we need to figure out which replichore we are in, because that determines whether the '+' or '-' strand is lagging. The first confusing thing about this is that genomic position is linear from 0 to 4.6M, but the chromosome is circular. The above diagram explains how to map between the two and figure out which replichore belongs to which linear genomic positions:
Replichore_1 = (pos > ori) | (pos < terminus)
Replichore_2 = (pos < ori) & (pos > terminus)
The diagram below then shows the replication forks for each terminus and how to get from replichore to lagging strand. Essentially, for replichore 2 we can simply take the '+' strand sequence directly, but for replichore 1 we need to reverse complement ('-' strand).
With that sorted out let's go ahead and write our first function, get_replichore(). First we need to establish the positions of ori and terminus. Ori is taken as the mean of the oriC locus:
ori = np.mean([3923767,3923998])
And the terminus is taken to be very close to the dif site, which is also near Tus. I took a nearby intergenic coordinate 1,590,250.5 and made it a fraction so that we don't throw any unwanted errors. Let's take a look at the documentation for get_replichore()
lines = inspect.getsource(wgregseq.get_replichore)
def get_replichore(pos, ori = 3923882.5, ter = 1590250.5 ):
    """
    Determine the replichore of a bacterial chromosome for a certain position.
    Requires origin and terminus positions. Assumes E. coli like organization.

    pos : int
        Genomic coordinate of interest.
    ori : float
        Genomic coordinate of the origin of replication.
    ter : float
        Genomic coordinate of the replication terminus.
    """
    pos = int(pos)

    if((pos<0)| (pos>4641652)):
        raise TypeError("position must be within genome.")

    if((pos > ori) | (pos<ter)):
        rep = 1
    elif((pos<ori) & (pos>ter)):
        rep = 2

    return rep
Let's test it out with position 0 (should be replichore 1) and position 2 M (should be replichore 2)
print('pos 0 = ', wgregseq.get_replichore(pos = 0))
print('pos 2M = ', wgregseq.get_replichore(pos = 2000000))
pos 0 = 1 pos 2M = 2
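As a compact restatement of the reasoning so far, here is a minimal standalone sketch that maps a position to its replichore and then to the strand the oligo sequence is copied from. The names `replichore_of` and `oligo_source_strand` are hypothetical helpers written for illustration and are not part of wgregseq; the ori, ter and genome length values are the ones used above.

```python
# Hypothetical standalone sketch; not the wgregseq implementation.
ORI = 3923882.5       # midpoint of the oriC locus, as in the text
TER = 1590250.5       # intergenic coordinate near dif, as in the text
GENOME_LEN = 4641652  # genome length used in get_replichore()


def replichore_of(pos, ori=ORI, ter=TER, genome_len=GENOME_LEN):
    """Return 1 or 2 for an integer genomic coordinate."""
    pos = int(pos)
    if pos < 0 or pos > genome_len:
        raise ValueError("position must be within genome")
    # ori and ter are half-integers, so an integer position can never
    # equal either of them and exactly one branch always applies.
    return 1 if (pos > ori or pos < ter) else 2


def oligo_source_strand(pos):
    """Strand the oligo is read from: '-' for replichore 1, '+' for 2."""
    return '-' if replichore_of(pos) == 1 else '+'


if __name__ == "__main__":
    for pos in (0, 2000000):
        print(pos, replichore_of(pos), oligo_source_strand(pos))
```

Using half-integer values for ori and ter is what keeps the comparison free of ties, which is why no explicit "equal to ori" case is needed.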
Looks good. Let's now write our function get_target_oligo(). This function has been through a few rounds of development in this notebook, so it has grown a bit complicated. Let's break down the steps required:
The arm length is homology / 2, which assumes symmetric arms and an even homology length. The last part, choosing the attB orientation, is somewhat complex because there are 4 possible scenarios:
Replichore = 1, Direction = '+'
Replichore = 1, Direction = '-'
Replichore = 2, Direction = '+'
Replichore = 2, Direction = '-'
Let's start by explaining the simplest example first, Replichore = 2, Direction = '+'. In this case the homology arm sequences come directly from the '+' strand. If we are deleting a gene, then the 5' end of the oligo will have the left and upstream homology arm. Then the 3' end of the oligo will have the right and downstream homology arm. In this case we simply want to paste the fwd sequence of attB between these two arms to get an insertion of pInt in the expected direction (e.g. for pDel gro promoter facing downstream).
For Replichore = 2, Direction = '-', the gene is facing the opposite direction. Therefore the oligo still comes from the '+' strand, but now the 5' end of the oligo is the left and downstream homology arm and the 3' end is the right and upstream homology arm. Therefore we need to reverse complement the attB sequence, so that it is now facing downstream.
For Replichore = 1, Direction = '+', the gene is on the '+' strand, but the oligo sequence comes from the '-' strand. Therefore the 5' end of the oligo is right and downstream and the 3' end of the oligo is left and upstream. So, to get attB facing downstream, we need to reverse complement.
Finally, for Replichore = 1, Direction = '-' we have the actual orientation / position of galK, so just think about that. The gene is on the '-' strand, but the oligo sequence also comes from the '-' strand. Here the 5' end of the oligo is right and upstream and the 3' end is left and downstream. Therefore the forward attB sequence can be used to face downstream.
Here's a table to simplify things, and just remember that on the oligo, typically we want attB facing the same way as the gene: downstream.
Replichore | gene dir | 5' abs-rel pos | 3' abs-rel pos | attB |
1 | + | right-down | left-up | rev |
1 | - | right-up | left-down | fwd |
2 | + | left-up | right-down | fwd |
2 | - | left-down | right-up | rev |
for + gene_dir locus should look like: (left) | upstream | attB_fwd | downstream (right)
for - gene_dir locus should look like: (left) | downstream | attB_rev | upstream (right)
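The table above can be written down as a small lookup, and it also collapses to a single rule: attB is pasted forward when the oligo's source strand matches the gene direction, and reverse complemented otherwise. The sketch below is illustrative only; `ATTB_ORIENTATION`, `attb_orientation` and `attb_orientation_rule` are hypothetical names, not wgregseq functions.

```python
# Lookup transcribed from the table: (replichore, gene direction) -> attB
ATTB_ORIENTATION = {
    (1, '+'): 'rev',
    (1, '-'): 'fwd',
    (2, '+'): 'fwd',
    (2, '-'): 'rev',
}


def attb_orientation(replichore, gene_dir):
    """Table lookup of the attB orientation to paste into the oligo."""
    return ATTB_ORIENTATION[(replichore, gene_dir)]


def attb_orientation_rule(replichore, gene_dir):
    """Same answer from a rule instead of a table."""
    # Replichore 1 oligos are read from the '-' strand, replichore 2 from '+'.
    oligo_strand = '-' if replichore == 1 else '+'
    # attB stays forward only when the oligo strand matches the gene strand.
    return 'fwd' if oligo_strand == gene_dir else 'rev'


if __name__ == "__main__":
    for key in sorted(ATTB_ORIENTATION):
        print(key, attb_orientation(*key), attb_orientation_rule(*key))
```

Having both forms side by side makes it easy to check that the four if/elif branches in the function agree with the table.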
With all of that reasoned out, we can write our function with some simple if statements:
lines = inspect.getsource(wgregseq.get_target_oligo)
def get_target_oligo(left_pos, right_pos, genome, homology = 90, attB_dir = '+', attB_fwd_seq = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat', verbose = False):
    """
    Given a set of parameters, get an ORBIT oligo that targets the lagging strand.
    Left and right positions are absolute genomic coordinates that specify the final nucleotides to keep unmodified in the genome,
    everything in between will be replaced by attB. In other words the left position nucleotide is the final nt before attB in the oligo.
    The right position nt is the first nt after attB in the oligo.

    This function determines the lagging strand by calling `get_replichore()` on the left_pos.
    Typically attB_dir should be set to the same direction as the gene of interest, such that the integrating plasmid will insert with payload facing downstream.
    attB_fwd_seq can be modified, and the total homology can be modified, but should be an even number since homology arms are symmetric.

    Verbose prints helpful statements for testing functionality.

    Parameters
    -----------------
    left_pos : int
        Left genomic coordinate of desired attB insertion. attB is added immediately after this nt.
    right_pos : int
        Right genomic coordinate of desired attB insertion. attB is added immediately before this nt.
    genome : str
        Genome as a string.
    homology : int (even)
        Total homology length desired for oligo. Arm length = homology / 2.
    attB_dir : chr ('+' or '-')
        Desired direction of attB based on genomic strand. Typically same direction as gene.
    attB_fwd_seq : str
        Sequence of attB to insert between homology arms.
    verbose : bool
        If true, prints details about genomic positions and replichore.

    Returns
    ---------------
    oligo : str
        Targeting oligo against lagging strand, including the attB sequence in the correct orientation.
    """
    left_pos = int(left_pos)
    right_pos = int(right_pos)

    # Arm length is 1/2 total homology. Arms are symmetric
    arm_len = int(homology / 2)

    # Arms from genome string. Note 0 indexing of string vs. 1 indexing of genomic coordinates.
    # As written, should be inclusive.
    left_arm = genome[(left_pos - arm_len):left_pos]
    right_arm = genome[(right_pos - 1):(right_pos - 1 + arm_len)]

    # Generate attB reverse sequence
    seq_attB = Seq(attB_fwd_seq)
    attB_rev_seq = str(seq_attB.reverse_complement())

    # Replichore 1
    if get_replichore(left_pos) == 1:
        rep = 1

        # Reverse complement replichore 1 sequences.
        left_arm_seq = Seq(left_arm)
        left_arm_prime = str(left_arm_seq.reverse_complement())
        right_arm_seq = Seq(right_arm)
        right_arm_prime = str(right_arm_seq.reverse_complement())

        # Determine attB direction and paste fwd/rev seq accordingly
        if attB_dir == '+':
            oligo = right_arm_prime + attB_rev_seq + left_arm_prime
        elif attB_dir == '-':
            oligo = right_arm_prime + attB_fwd_seq + left_arm_prime

    # Replichore 2
    elif get_replichore(left_pos) == 2:
        rep = 2

        # '+' arm sequence used. Determine attB direction and paste accordingly.
        if attB_dir == '+':
            oligo = left_arm + attB_fwd_seq + right_arm
        elif attB_dir == '-':
            oligo = left_arm + attB_rev_seq + right_arm

    # Verbose print statements
    if verbose:
        print('left_arm_coord = ', left_pos - arm_len,' : ', left_pos)
        print('right_arm_coord = ', right_pos - 1, ' : ', right_pos -1 + arm_len)
        print('Replichore = ', rep)

    return oligo
Let's read in our genome. Remember this is a string indexed from 0.
for record in Bio.SeqIO.parse('sequencev3.fasta', "fasta"):
    genome = str(record.seq)
print("Length genome: {}".format(len(genome)))
print("First 100 bases: {}".format(genome[:100]))
Let's see if we can simply get an oligo that targets the first 8 bases of the genome. Remember this is replichore 1, so we should get the reverse complement. We will also replace the attB sequence with a space to make it simpler.
wgregseq.get_target_oligo(4, 5, genome, 8, '+',' ', True)
left_arm_coord =  0  :  4
right_arm_coord =  4  :  8
Replichore =  1
Looks good! As a reminder the left_pos, right_pos should read like "keep the 4th nucleotide as the last in the homology arm and therefore the last unmodified nt in the genome before inserting attB". Same thing for right pos, where that is the first nt of the right homology arm and first nt in the locus after attB.
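To make the coordinate convention concrete, here is a small sketch on a made-up 8 nt "genome" using the same slicing as get_target_oligo(). The toy sequence and the helpers `revcomp` and `toy_arms` are assumptions for illustration only; a `|` stands in for attB so the junction is visible.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string (stdlib only)."""
    comp = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return ''.join(comp[base] for base in reversed(seq.upper()))


def toy_arms(genome, left_pos, right_pos, arm_len):
    """Slice homology arms using 1-indexed genomic coordinates.

    left_pos is the last nt kept before attB, right_pos the first nt kept
    after attB. Python strings are 0-indexed, hence the -1 on the right arm.
    """
    left_arm = genome[left_pos - arm_len:left_pos]
    right_arm = genome[right_pos - 1:right_pos - 1 + arm_len]
    return left_arm, right_arm


toy_genome = "AAAGCCCT"  # made-up sequence, not E. coli
left_arm, right_arm = toy_arms(toy_genome, 4, 5, 4)

# Replichore 2: the '+' strand is used directly.
plus_oligo = left_arm + "|" + right_arm
# Replichore 1: both arms are reverse complemented and swapped.
minus_oligo = revcomp(right_arm) + "|" + revcomp(left_arm)

if __name__ == "__main__":
    print(plus_oligo)
    print(minus_oligo)
```

With left_pos = 4 and right_pos = 5 nothing is deleted, so the replichore 1 oligo without the separator is simply the reverse complement of the whole toy genome.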
Now let's dive in and try to compare this function to the MODEST generated oligos. First, I generated MODEST oligos for 5 different genes that represented a mix of replichore 1 and 2. Some are very close to the terminus, which was helpful. Note all the genes are on the + strand so that we can easily look for the ATG start codon. For each oligo let's compare MODEST to get_target_oligo() by directly comparing strings and using nothing ("") as the attB sequence.
*A note on translating to MODEST format: get_target_oligo asks for the last nucleotide you want to keep on either side. MODEST asks for the position of the first nt to delete and how many nt to delete. So MODEST gene X pos = 45, del = 10 is the same as: keep the 44th nt, delete nt 45 through nt 54, and keep the 55th nucleotide as the first nt of the right arm. Also remember that the gene's genomic coordinate is its first nucleotide, so to get to the 44th nt we add 43 and to get to the 55th nt we add 54.
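The translation described in the note can be sketched as a tiny helper. `modest_to_keep_coords` is a hypothetical name introduced here for illustration; it only encodes the off-by-one arithmetic from the note and is not part of MODEST or wgregseq.

```python
def modest_to_keep_coords(gene_start, first_deleted_nt, n_deleted):
    """Convert MODEST-style (position, deletion length) to keep coordinates.

    gene_start is the genomic coordinate of nt 1 of the gene, so nt k of the
    gene sits at gene_start + k - 1.
    """
    # Last nt kept on the left is the nt just before the first deleted one.
    left_pos = gene_start + (first_deleted_nt - 1) - 1
    # First nt kept on the right is the nt just after the deleted stretch.
    right_pos = gene_start + (first_deleted_nt + n_deleted) - 1
    return left_pos, right_pos


if __name__ == "__main__":
    # cbrC start coordinate from the comparison cells, pos = 45, del = 10
    print(modest_to_keep_coords(3898022, 45, 10))
```

For pos = 45 and del = 10 this gives offsets of +43 and +54 from the gene start, and the number of nucleotides strictly between the two kept positions is the deletion length.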
cbrC_start = 3898022
new_cbrC = wgregseq.get_target_oligo(3898022 + 43,3898022 + 44 + 10, genome, 90, "+", "" , True)
print( 'custom cbrC: ', new_cbrC)
print( 'modest cbrC: ', mod_cbrC)
print( 'Equivalent? ', new_cbrC==mod_cbrC)
asnA_start = 3927155
new_asnA = wgregseq.get_target_oligo(3927155 + 43,3927155 + 44 + 10, genome, 90, "+", "" , True)
print( 'custom asnA: ', new_asnA)
print( 'modest asnA: ', mod_asnA)
print( 'Equivalent? ', new_asnA==mod_asnA)
cysB_start = 1333855
new_cysB = wgregseq.get_target_oligo(1333855 + 43,1333855 + 44 + 10, genome, 90, "+", "", True)
print( 'custom cysB: ', new_cysB)
print( 'modest cysB: ', mod_cysB)
print( 'Equivalent? ', new_cysB==mod_cysB)
manA_start = 1688576
new = wgregseq.get_target_oligo(1688576 + 43,1688576 + 44 + 10, genome, 90, "+", "" , True)
print( 'custom manA: ', new)
print( 'modest manA: ', mod)
print( 'Equivalent? ', new==mod)
rstB_start = 1682882
new = wgregseq.get_target_oligo(1682882 + 43,1682882 + 44 + 10, genome, 90, "+", "", True )
print( 'custom rstB: ', new)
print( 'modest rstB: ', mod)
print( 'Equivalent? ', new==mod)
Now let's try two genes on the minus strand.
lacI start = 366428, Rep = 1, strand = '-'
ligA start = 2528161, Rep = 2, strand = '-'
lacI_oli = wgregseq.get_target_oligo(367510-11, 367510, genome , 90, "-", "", True)
print('lacI: ', lacI_oli)
print('mod : ', lacI_mod)
print('Equivalent? ', lacI_oli == lacI_mod)
ligA_oli = wgregseq.get_target_oligo(2530176-11, 2530176, genome , 90, "-", "", True)
print('ligA: ',ligA_oli )
print('mod : ', ligA_mod)
print('Equivalent? ', ligA_oli == ligA_mod)
Now let's do a first test of the attB reversal functionality. cbrC is on replichore 2, + strand, so attB should be fwd. asnA is on replichore 1, + strand, so attB should be rev. Let's just use a simple attB seq 'tttt'.
print('cbrC: ', wgregseq.get_target_oligo(cbrC_start + 2, cbrC_start + 3, genome , 90, "+", "tttt"))
print('asnA: ', wgregseq.get_target_oligo(asnA_start + 2, asnA_start + 3, genome , 90, "+", "tttt"))
Looks perfect! Let's try our two minus strand genes. Remember LacI is Rep = 1, strand = '-', so attB should be fwd and LigA is rep = 2, strand = '-', so attB should be rev.
print('lacI: ', wgregseq.get_target_oligo(367510-3, 367510-2, genome , 90, "-", "tttt"))
print('ligA: ', wgregseq.get_target_oligo(2530176-3, 2530176-2, genome , 90, "-", "tttt"))
After all this, I realized in the pool of oligos it could be a problem to have complementary attB sequences that could bind to each other as single stranded oligos.
5' ggcttgtcgacgacggcggtctccgtcgtcaggatcat 3'
3' ccgaacagctgctgccgccagaggcagcagtcctagta 5'
These sequences are 38 nt long and GC-rich, meaning their melting temperatures are really high (> 72 °C). I worry that lots of these duplexes forming could inhibit PCR. I'm not sure, but I don't think it's worth potentially making the entire library unusable.
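As a quick sanity check on the length and GC content of attB, and on the fact that the rev orientation is its exact reverse complement, here is a small stdlib-only sketch. The helpers `revcomp` and `gc_fraction` are hypothetical names; no melting temperature is computed here, since that would require a thermodynamic model not covered in this notebook.

```python
ATTB_FWD = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'


def revcomp(seq):
    """Reverse complement of a lowercase DNA string."""
    comp = {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
    return ''.join(comp[base] for base in reversed(seq.lower()))


def gc_fraction(seq):
    """Fraction of bases that are g or c."""
    seq = seq.lower()
    return (seq.count('g') + seq.count('c')) / len(seq)


attB_rev = revcomp(ATTB_FWD)

if __name__ == "__main__":
    print('length     :', len(ATTB_FWD))
    print('GC fraction:', round(gc_fraction(ATTB_FWD), 3))
    print('fwd        :', ATTB_FWD)
    print('rev        :', attB_rev)
```

Any oligo carrying the fwd sequence is perfectly complementary over the full attB length to any oligo carrying the rev sequence, which is the duplex concern raised above.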
To solve this I'm simply going to encode the fwd attB sequence in every oligo. To do this I'm going to remake the oligos using an alternative function, get_target_oligo_2(). This function has an optional parameter, attB_lock = False. The default behavior is the same as the original above, but when attB_lock = True the sequence will simply be pasted into the oligo in the desired orientation, without regard for the replichore etc.
This will result in insertions facing the 'wrong' direction in the final ORBIT library, but I think that's fine for now until I can confirm it won't be a PCR issue or find another way to deal with it.
print('lacI: ', wgregseq.get_target_oligo_2(367510-3, 367510-2, genome , 90, "+", "tttt", attB_lock = True))
print('ligA: ', wgregseq.get_target_oligo_2(2530176-3, 2530176-2, genome , 90, "+", "tttt", attB_lock = True))
print('cbrC: ', wgregseq.get_target_oligo_2(cbrC_start + 2, cbrC_start + 3, genome , 90, "+", "tttt", attB_lock = True))
print('asnA: ', wgregseq.get_target_oligo_2(asnA_start + 2, asnA_start + 3, genome , 90, "+", "tttt", attB_lock = True))
Looks good. Now we can move on to actually designing oligos for lots of genes.
You can find the public biocyc table here https://biocyc.org/group?id=Curated_DNA_binding_TRs_public. These genes come from searching ecocyc using the multifunctional terms "regulation" > "type of regulation" > "transcriptional" (334) or "unknown" (32). I chose these annotations because this list included all the TFs from the RegSeq paper that I quickly searched for. In particular yieP was in 'unknown' even though it's annotated as a DNA binding TF. This was true for several others as well. With this list of 368 I then added the GO term annotations to the table and manually checked that TFs had at least some sort of 'DNA-binding' annotation. This got rid of a lot of things like histidine sensor kinases (the sensor part of two component systems). Naturally there were some weird edge cases, so I had to make some manual decisions. For almost everything, if it didn't have the DNA binding annotation I removed it. One specific exception I recall was a co-regulator that functions with CRP. It did not bind DNA itself, but helped to regulate a subset of CRP regulated genes. Obviously this is a rabbit hole and we could certainly use less stringent criteria, but I thought this was a reasonable way to proceed for this preliminary ORBIT test.
Ultimately, we ended up with exactly 300 genes, which are full of some classic TFs as well as totally uncharacterized putative genes whose function was inferred only by homology.
df = pd.read_csv("Curated_DNA_binding_transcriptional_regulators.txt", sep = '\t')
df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Gene Name | Left-End-Position | Right-End-Position | Direction | |
0 | aaeR | 3389520 | 3390449 | + |
1 | abgR | 1404741 | 1405649 | + |
2 | acrR | 485761 | 486408 | + |
3 | ada | 2309341 | 2310405 | - |
4 | adiY | 4337168 | 4337929 | - |
... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | - |
296 | ytfH | 4434113 | 4434493 | + |
297 | zntR | 3438705 | 3439130 | - |
298 | zraR | 4203320 | 4204645 | + |
299 | zur | 4259488 | 4260003 | - |
300 rows × 4 columns
First, let's make sure that this table is complete.
Unique
Gene Name                        300
Product Name                     300
GO terms (molecular function)    169
Left-End-Position                300
Right-End-Position               300
Direction                          2
dtype: int64
Nulls
Gene Name                         0
Product Name                      0
GO terms (molecular function)    10
Left-End-Position                 0
Right-End-Position                0
Direction                         0
dtype: int64
Ok, so we don't seem to have any missing values for our essential parameters, left, right and direction.
Another thing we should examine before proceeding is the length distribution of these genes. Remember these oligos will need to delete almost the entire gene, and efficiencies diminish significantly as deletion length increases.
df['length'] = df['Right-End-Position'] - df['Left-End-Position']
count     300.000000
mean      804.446667
std       429.946112
min       200.000000
25%       575.000000
50%       758.000000
75%       932.000000
max      3962.000000
Name: length, dtype: float64
Ok, so there's quite a range in lengths from 200 bp to almost 4kb. That will yield some pretty big differences in efficiency. However, we can see that the middle 50% falls between 575 and 932bp, which isn't so bad. Let's take a look at the distributions:
scatter = hv.Scatter(df, 'Left-End-Position', 'length').opts(width = 500)*hv.HLine(575).opts(color = 'red')
dist = hv.Distribution(df, 'length' ).opts(width = 500, bandwidth = 0.1) * hv.VLine(575).opts(color = 'red')
scatter + dist
I think the biggest concern here is that the short deletions could be way overrepresented in the library. That's not a huge deal as long as we know what the composition is, but an extreme bias could be quite suboptimal. With this in mind, let's split our library into two sublibraries. Somewhat arbitrarily I have set the cutoff point at the bottom 25% line - 575 bp. This separates out the pool pretty nicely by length, where the small subpool is 200-575 bp and the longer length pool is mostly 575-1000 bp, but does include some multi-kb sequences as well.
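The split itself is just a threshold on gene length. Here is a stdlib-only sketch of that logic on a handful of rows copied from the gene table; `split_by_length` is a hypothetical helper, and like the notebook it takes length as right minus left without adding 1.

```python
CUTOFF = 575  # bottom 25% line from the length summary

# (gene, left end, right end) for a few genes from the table
genes = [
    ("alpA", 2758644, 2758856),
    ("argR", 3384703, 3385173),
    ("yjdC", 4362733, 4363308),
    ("aaeR", 3389520, 3390449),
    ("ada", 2309341, 2310405),
]


def split_by_length(rows, cutoff=CUTOFF):
    """Partition genes into short (<= cutoff) and long (> cutoff) pools."""
    short_pool, long_pool = [], []
    for name, left, right in rows:
        length = right - left
        target = short_pool if length <= cutoff else long_pool
        target.append((name, length))
    return short_pool, long_pool


short_pool, long_pool = split_by_length(genes)

if __name__ == "__main__":
    print("short:", short_pool)
    print("long :", long_pool)
```

Note that a gene sitting exactly on the cutoff, like yjdC at 575, lands in the short pool because the comparison is inclusive.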
df_short = df.loc[df['length']<=575]
df_short.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Gene Name | Left-End-Position | Right-End-Position | Direction | length | |
10 | alpA | 2758644 | 2758856 | + | 212 |
16 | argR | 3384703 | 3385173 | + | 470 |
17 | ariR | 1216369 | 1216635 | + | 266 |
18 | arsR | 3648528 | 3648881 | + | 353 |
20 | asnC | 3926545 | 3927003 | - | 458 |
... | ... | ... | ... | ... | ... |
283 | yjdC | 4362733 | 4363308 | - | 575 |
289 | ylbG | 529645 | 530016 | - | 371 |
296 | ytfH | 4434113 | 4434493 | + | 380 |
297 | zntR | 3438705 | 3439130 | - | 425 |
299 | zur | 4259488 | 4260003 | - | 515 |
76 rows × 5 columns
Ok, so 76 TFs fall into this 'short' category.
df_long = df.loc[df['length']>575]
df_long.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Gene Name | Left-End-Position | Right-End-Position | Direction | length | |
0 | aaeR | 3389520 | 3390449 | + | 929 |
1 | abgR | 1404741 | 1405649 | + | 908 |
2 | acrR | 485761 | 486408 | + | 647 |
3 | ada | 2309341 | 2310405 | - | 1064 |
4 | adiY | 4337168 | 4337929 | - | 761 |
... | ... | ... | ... | ... | ... |
292 | ypdC | 2501130 | 2501987 | + | 857 |
293 | yphH | 2682863 | 2684056 | + | 1193 |
294 | yqeI | 2988502 | 2989311 | + | 809 |
295 | yqhC | 3154262 | 3155218 | - | 956 |
298 | zraR | 4203320 | 4204645 | + | 1325 |
224 rows × 5 columns
And 224 TFs fall into this 'long' category. That seems to work well for now - let's proceed.
The final thing to consider before designing all of the targeting oligos is specifically what portion of each gene to delete. Often people delete everything except the start and stop codon, which I think is a good option. But I wanted to consider how often that mutation would actually disrupt another unintended gene. Apparently a lot of E. coli genes overlap slightly or almost overlap. See this link and many more modern papers. It seems that the most common gene overlaps are 1-4 nt. Also note that the RBS (Shine-Dalgarno sequence) is typically 8 nt upstream of the start codon and is itself ~6 nt long. So for a gene that just touches the next gene (0 nt overlap) we should probably end the upstream gene deletion ~15 nt early to avoid deleting the RBS. Of course this is still imperfect if the promoter has been deleted or separated from the gene by the integrating plasmid... however, if we ultimately want markerless deletions with small scars these situations are important.
Gene of interest (GOI) upstream of overlapping gene (G2):
Gene of interest downstream of overlapping gene:
Optimal parameters to accommodate a 4 nt overlap: the upstream homology arm should end at nt 6 instead of 3, and the downstream homology arm should end at -21 instead of -3.
From that link above I downloaded the plain text file showing all genes that either overlap or almost overlap.
df_ovlp = pd.read_csv("Ecoli_overlaps.txt", sep = '\t', skiprows = 12, skipfooter=12)
Bnum | Name | Str | Start | Stop | Bnum.1 | Name.1 | Str.1 | Start.1 | Stop.1 | Overlap | New start | Unnamed: 12 | |
0 | b0002 | thrA | 1 | 337 | 2799 | b0003 | thrB | 1 | 2801 | 3733 | -1 | NaN | NaN |
1 | b0003 | thrB | 1 | 2801 | 3733 | b0004 | thrC | 1 | 3734 | 5020 | 0 | NaN | NaN |
2 | b0013 | yaaI | 2 | 11382 | 11786 | b0011 | NaN | 2 | 10643 | 11356 | -25 | 11384.0 | NaN |
3 | b0022 | insA_1 | 2 | 20233 | 20508 | b0021 | insB_1 | 2 | 19811 | 20314 | 82 | 20342.0 | NaN |
4 | b0024 | NaN | 1 | 21181 | 21399 | b0025 | yaaC | 1 | 21407 | 22348 | -7 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1499 | b4380 | yjjI | 2 | 4613084 | 4614634 | b4379 | yjjW | 2 | 4612249 | 4613112 | 29 | 4613140.0 | NaN |
1500 | b4389 | sms | 1 | 4623481 | 4624863 | b4390 | nadR | 1 | 4624863 | 4626116 | 1 | NaN | NaN |
1501 | b4397 | creA | 1 | 4633090 | 4633563 | b4398 | creB | 1 | 4633576 | 4634265 | -12 | NaN | NaN |
1502 | b4398 | creB | 1 | 4633576 | 4634265 | b4399 | creC | 1 | 4634265 | 4635689 | 1 | NaN | NaN |
1503 | b4405 | NaN | 1 | 3975603 | 3976217 | b3793 | rffT | 1 | 3976214 | 3977566 | 4 | NaN | NaN |
1504 rows × 13 columns
We can then simply look at the distribution and summary statistics of overlaps from these 1500 nearby genes.
df_ovlp['gene'] = 'gene'
hv.Distribution(df_ovlp, 'Overlap').opts(width = 800, height = 400)
count    1504.000000
mean        2.905585
std        27.991047
min       -26.000000
25%       -11.000000
50%         0.000000
75%         4.000000
max       263.000000
Name: Overlap, dtype: float64
So, certainly we can't accommodate genes that overlap by 50-100 bp, since some genes are only 200-500 bp long... but we can see from the summary statistics that at least 75% of nearby gene pairs overlap by 4 nt or less. So at a minimum we would need to move the end of the preceding gene's deletion 14 + 4 nt upstream to avoid hitting the RBS. I settled on +6 nt for gene start and -21 nt for gene end.
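Because "gene start" and "gene end" depend on the gene's direction, the +6 / -21 choice turns into different offsets on the left and right genomic ends for '+' and '-' genes. Here is a minimal sketch of that arithmetic; `avoid_overlap_coords` and its parameter names are hypothetical, and the returned values follow the get_target_oligo() convention of last nt kept on the left and first nt kept on the right.

```python
def avoid_overlap_coords(left_end, right_end, direction,
                         keep_at_start=6, keep_at_end=21):
    """Return (left_pos, right_pos) that keep the first 6 and last 21 nt.

    left_end / right_end are the 1-indexed genomic ends of the gene.
    Keeping n nt from an end means the boundary nt is n - 1 away from it.
    """
    if direction == '+':
        # Gene start is on the left, gene end on the right.
        return left_end + (keep_at_start - 1), right_end - (keep_at_end - 1)
    if direction == '-':
        # Gene start is on the right, gene end on the left.
        return left_end + (keep_at_end - 1), right_end - (keep_at_start - 1)
    raise ValueError("direction must be '+' or '-'")


if __name__ == "__main__":
    # aaeR ('+') and ada ('-') coordinates from the gene table
    print(avoid_overlap_coords(3389520, 3390449, '+'))
    print(avoid_overlap_coords(2309341, 2310405, '-'))
```

The minimum margin of 14 + 4 = 18 nt at the gene end is covered by the 21 nt that are kept.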
Let's first make our first and last codon positions:
df['left_codon'] = df['Left-End-Position'] + 2
df['right_codon'] = df['Right-End-Position'] - 2
df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 |
3 | ada | 2309341 | 2310405 | - | 1064 | 2309343 | 2310403 |
4 | adiY | 4337168 | 4337929 | - | 761 | 4337170 | 4337927 |
... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | - | 956 | 3154264 | 3155216 |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 |
297 | zntR | 3438705 | 3439130 | - | 425 | 3438707 | 3439128 |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 |
299 | zur | 4259488 | 4260003 | - | 515 | 4259490 | 4260001 |
300 rows × 7 columns
And now we can make our "avoid_overlap" coordinates, taking into consideration the gene direction:
df.loc[df['Direction']=='+', 'left_avd_ovlp'] = df['Left-End-Position'] + 5
df.loc[df['Direction']=='-', 'left_avd_ovlp'] = df['Left-End-Position'] + 20
df.loc[df['Direction']=='+', 'right_avd_ovlp'] = df['Right-End-Position'] - 20
df.loc[df['Direction']=='-', 'right_avd_ovlp'] = df['Right-End-Position'] - 5
df['gene'] = df['Gene Name']
#df['right_avd_ovlp'] = df['Right-End-Position'] - 17
df.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR |
3 | ada | 2309341 | 2310405 | - | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada |
4 | adiY | 4337168 | 4337929 | - | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | - | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH |
297 | zntR | 3438705 | 3439130 | - | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR |
299 | zur | 4259488 | 4260003 | - | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur |
300 rows × 10 columns
Let's take a quick look at the first 5 genes to make sure it worked as expected:
# Wrap the chained plot expression in parentheses so it can span multiple lines
(
    ggplot(df.head()) +
    geom_segment(aes(x = 'Left-End-Position', xend = 'Right-End-Position', y = 'Gene Name', yend = 'Gene Name')) +
    geom_point(aes(x = 'Left-End-Position', y = 'Gene Name'), shape = '|', size = 5) +
    geom_point(aes(x = 'Right-End-Position', y = 'Gene Name'), shape = '|', size = 5) +
    geom_point(aes(x = 'left_codon', y = 'Gene Name'), color = 'red', shape = '<', size = 3, position = position_nudge(y = 0.2)) +
    geom_point(aes(x = 'right_codon', y = 'Gene Name'), color = 'red', shape = '>', size = 3, position = position_nudge(y = 0.2)) +
    geom_point(aes(x = 'left_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '<', size = 3, position = position_nudge(y = -0.2)) +
    geom_point(aes(x = 'right_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '>', size = 3, position = position_nudge(y = -0.2)) +
    facet_wrap('~gene + Direction', nrow = 5, scales = 'free') +
    theme(dpi = 200, figure_size=(5,4))
)
Looks good. You can see that the more inward blue arrows (-21nt at gene end) are as expected for the different gene directions. I find this to be a quite helpful overview, so let's go ahead and look at all 300 genes this way to make sure things look good:
# Same plot for all 300 genes, again wrapped in parentheses
(
    ggplot(df) +
    geom_segment(aes(x = 'Left-End-Position', xend = 'Right-End-Position', y = 'Gene Name', yend = 'Gene Name')) +
    geom_point(aes(x = 'Left-End-Position', y = 'Gene Name'), shape = '|', size = 5) +
    geom_point(aes(x = 'Right-End-Position', y = 'Gene Name'), shape = '|', size = 5) +
    geom_point(aes(x = 'left_codon', y = 'Gene Name'), color = 'red', shape = '<', size = 3, position = position_nudge(y = 0.2)) +
    geom_point(aes(x = 'right_codon', y = 'Gene Name'), color = 'red', shape = '>', size = 3, position = position_nudge(y = 0.2)) +
    geom_point(aes(x = 'left_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '<', size = 3, position = position_nudge(y = -0.2)) +
    geom_point(aes(x = 'right_avd_ovlp', y = 'Gene Name'), color = 'blue', shape = '>', size = 3, position = position_nudge(y = -0.2)) +
    facet_wrap('~gene + Direction', nrow = 30, scales = 'free') +
    theme(dpi = 300, figure_size=(30,30))
)
With all of that out of the way, we can test out our function to get target oligos for our entire df of coordinates.
Let's look at the source code for get_target_oligo_df():
lines = inspect.getsource(wgregseq.get_target_oligo_df)
def get_target_oligo_df(df, left_pos_col, right_pos_col, dir_col, genome, homology = 90, attB_fwd_seq = 'ggcttgtcgacgacggcggtctccgtcgtcaggatcat'):
    """
    Apply get_target_oligo to a dataframe of genomic coordinates and directions.
    Iterates through df rows calling get_target_oligo given the parameters specified in each column.

    Given a set of parameters, get an ORBIT oligo that targets the lagging strand.
    Left and right positions are absolute genomic coordinates that specify the final nucleotides to keep unmodified in the genome,
    everything in between will be replaced by attB. In other words the left position nucleotide is the final nt before attB in the oligo.
    The right position nt is the first nt after attB in the oligo.

    This function determines the lagging strand by calling `get_replichore()` on the left_pos.
    Typically attB_dir should be set to the same direction as the gene of interest, such that the integrating plasmid will insert with payload facing downstream.
    attB_fwd_seq can be modified, and the total homology can be modified, but should be an even number since homology arms are symmetric.

    Parameters
    -----------------
    df : pd.DataFrame
        Pandas dataframe containing the required genomic coordinates, and gene directions.
    left_pos_col : str
        Column name of left genomic coordinate of desired attB insertion. attB is added immediately after this nt.
    right_pos_col : str
        Column name of right genomic coordinate of desired attB insertion. attB is added immediately after this nt.
    dir_col : str
        Column name of desired direction of attB based on genomic strand. Typically same direction as gene.
    genome : str
        Genome as a string.
    homology : int (even)
        Total homology length desired for oligo. Arm length = homology / 2.
    attB_fwd_seq : str
        Sequence of attB to insert between homology arms.
    verbose : bool
        If true, prints details about genomic positions and replichore.

    Returns
    ---------------
    df_results : pd.DataFrame
        Adds column 'oligo' to input df. 'oligo' contains a string of the targeting oligo sequence against lagging strand,
        including the attB sequence in the correct orientation.
    """
    df_tmp = pd.DataFrame()
    df_results = pd.DataFrame()

    for i,row in df.iterrows():
        left_pos = row[left_pos_col]
        right_pos = row[right_pos_col]
        attB_dir = row[dir_col]

        oligo = get_target_oligo(left_pos, right_pos, genome, homology, attB_dir, attB_fwd_seq)

        df_tmp = df.iloc[[i],:]
        df_tmp['oligo'] = oligo
        df_results = pd.concat([df_results,df_tmp])

    return df_results
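To make the coordinate convention in the docstring concrete, here is a minimal sketch of how the two homology arms and attB fit together on the + strand. The helper name `forward_strand_oligo` and the toy sequence are made up for illustration, and the 1-indexed coordinates are my assumption. It deliberately skips the lagging-strand and attB-direction logic and does not handle arms that wrap around the circular genome.

```python
ATTB_FWD = "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"


def forward_strand_oligo(genome, left_pos, right_pos, homology=90, attB=ATTB_FWD):
    """Hypothetical helper: left arm + attB + right arm, read on the + strand.

    Assumes 1-indexed coordinates (an assumption for this sketch).
    """
    if homology % 2 != 0:
        raise ValueError("homology should be even so both arms are the same length")
    arm = homology // 2
    # Coordinates can arrive as floats (e.g. 3389525.0), so cast before slicing.
    left_pos, right_pos = int(left_pos), int(right_pos)
    if left_pos - arm < 0 or right_pos - 1 + arm > len(genome):
        raise ValueError("arm runs past the end of the sequence (no wrap-around here)")
    left_arm = genome[left_pos - arm:left_pos]             # last nt is left_pos
    right_arm = genome[right_pos - 1:right_pos - 1 + arm]  # first nt is right_pos
    return left_arm + attB + right_arm


# Toy 20 nt "genome": keep up to position 6, resume at position 13.
toy_genome = "AACCGGTTAACCGGTTAACC"
oligo = forward_strand_oligo(toy_genome, left_pos=6, right_pos=13, homology=6)
print(oligo)
```

With these toy inputs the arms are `CGG` and `GGT`, and positions 7 through 12 (`TTAACC`) are the part that attB replaces.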
Ok, let's test this out with our first set of oligos - the short group of genes with the first and last codon deletion:
df_first_last = wgregseq.get_target_oligo_df(df, 'left_codon', 'right_codon', 'Direction',genome)
df_first_last.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG |
3 | ada | 2309341 | 2310405 | - | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC |
4 | adiY | 4337168 | 4337929 | - | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | - | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT |
297 | zntR | 3438705 | 3439130 | - | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAatgatcctgacgacggagaccgccgtcgtcgacaagccCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAatgatcctgacgacggagaccgccgtcgtcgacaagccCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG |
299 | zur | 4259488 | 4260003 | - | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT |
300 rows × 11 columns
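A quick way to sanity-check oligos like the ones in the table above is to locate attB in either orientation and measure the arms on each side. This is only a sketch: `reverse_complement` and `describe_oligo` are hypothetical helpers, not functions from wgregseq, and the two oligos are copied from the aaeR and abgR rows.

```python
ATTB_FWD = "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string, preserving upper/lower case."""
    return seq.translate(_COMPLEMENT)[::-1]


def describe_oligo(oligo, attB_fwd=ATTB_FWD):
    """Return (left arm length, attB orientation, right arm length)."""
    sites = (("fwd", attB_fwd), ("rev", reverse_complement(attB_fwd)))
    for orientation, site in sites:
        start = oligo.find(site)
        if start != -1:
            right_len = len(oligo) - start - len(site)
            return start, orientation, right_len
    raise ValueError("attB not found in either orientation")


aaeR = ("CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATG"
        "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"
        "TAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG")
abgR = ("GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTA"
        "atgatcctgacgacggagaccgccgtcgtcgacaagcc"
        "CATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG")

print(describe_oligo(aaeR))  # (45, 'fwd', 45)
print(describe_oligo(abgR))  # (45, 'rev', 45)
```

Both oligos have 45 nt arms, which matches the default homology of 90, and the lowercase stretch in the abgR oligo is the reverse complement of `attB_fwd_seq`.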
Seems to work well. Now let's make the targeting oligos for the "avoid_overlap" coordinates.
df_avd_ovlp = wgregseq.get_target_oligo_df(df, 'left_avd_ovlp', 'right_avd_ovlp', 'Direction',genome)
df_avd_ovlp.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAatgatcctgacgacggagaccgccgtcgtcgacaagccAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTatgatcctgacgacggagaccgccgtcgtcgacaagccTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT |
3 | ada | 2309341 | 2310405 | - | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCatgatcctgacgacggagaccgccgtcgtcgacaagccTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG |
4 | adiY | 4337168 | 4337929 | - | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | - | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGatgatcctgacgacggagaccgccgtcgtcgacaagccTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGatgatcctgacgacggagaccgccgtcgtcgacaagccACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG |
297 | zntR | 3438705 | 3439130 | - | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCatgatcctgacgacggagaccgccgtcgtcgacaagccATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAatgatcctgacgacggagaccgccgtcgtcgacaagccCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG |
299 | zur | 4259488 | 4260003 | - | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT |
300 rows × 11 columns
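Again, I will use get_target_oligo_df_2 with attB_lock = True, and I will pass the Direction as '+' for every gene. In the rows shown in the output tables that follow, attB then appears as the forward sequence (beginning ggcttg) in every oligo, while the homology arms are the same as before.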
df_2 = df.copy()
df_2['Direction'] = '+'
| Gene Name | Product Name | GO terms (molecular function) | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene |
0 | aaeR | LysR-type transcriptional regulator AaeR | transcription regulatory region sequence-specific DNA binding // DNA-binding transcription factor activity // DNA binding | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR |
1 | abgR | putative LysR-type DNA-binding transcriptional regulator AbgR | transcription regulatory region sequence-specific DNA binding // DNA binding // DNA-binding transcription factor activity | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR |
2 | acrR | DNA-binding transcriptional repressor AcrR | transcription regulatory region sequence-specific DNA binding // protein binding // DNA binding // bacterial-type RNA polymerase transcription regulatory region sequence-specific DNA binding // to... | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR |
3 | ada | DNA-binding transcriptional dual regulator / DNA repair protein Ada | protein binding // transferase activity // methyltransferase activity // metal ion binding // sequence-specific DNA binding // zinc ion binding // catalytic activity // DNA binding // DNA-binding ... | 2309341 | 2310405 | + | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada |
4 | adiY | DNA-binding transcriptional activator AdiY | sequence-specific DNA binding // DNA binding // DNA-binding transcription factor activity | 4337168 | 4337929 | + | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY |
df_first_last_2 = wgregseq.get_target_oligo_df_2(df_2, 'left_codon', 'right_codon','Direction',genome, attB_lock = True)
df_first_last_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG |
3 | ada | 2309341 | 2310405 | + | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC |
4 | adiY | 4337168 | 4337929 | + | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | + | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT |
297 | zntR | 3438705 | 3439130 | + | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG |
299 | zur | 4259488 | 4260003 | + | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT |
300 rows × 11 columns
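Comparing this table with the earlier first/last codon table, the only visible difference is that any reverse-complemented attB has been swapped for the forward sequence. I'm not reproducing the internals of `get_target_oligo_df_2` here. The sketch below only mimics that visible relationship for a single oligo, using a hypothetical helper `lock_attB_forward` and the abgR oligo copied from both tables.

```python
ATTB_FWD = "ggcttgtcgacgacggcggtctccgtcgtcaggatcat"

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string, preserving upper/lower case."""
    return seq.translate(_COMPLEMENT)[::-1]


def lock_attB_forward(oligo, attB_fwd=ATTB_FWD):
    """Hypothetical helper: replace a reverse-complemented attB with the
    forward sequence and leave the homology arms untouched."""
    return oligo.replace(reverse_complement(attB_fwd), attB_fwd)


left_arm = "GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTA"
right_arm = "CATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG"

# abgR oligo from the first table (attB reverse-complemented) ...
abgR_before = left_arm + "atgatcctgacgacggagaccgccgtcgtcgacaagcc" + right_arm
# ... and from the attB_lock = True table (attB forward).
abgR_locked = left_arm + "ggcttgtcgacgacggcggtctccgtcgtcaggatcat" + right_arm

print(lock_attB_forward(abgR_before) == abgR_locked)  # True
```

An oligo that already carries forward attB passes through this helper unchanged.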
df_avd_ovlp_2 = wgregseq.get_target_oligo_df_2(df_2, 'left_avd_ovlp', 'right_avd_ovlp', 'Direction',genome,attB_lock = True)
df_avd_ovlp_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA |
1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC |
2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTggcttgtcgacgacggcggtctccgtcgtcaggatcatTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT |
3 | ada | 2309341 | 2310405 | + | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCggcttgtcgacgacggcggtctccgtcgtcaggatcatTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG |
4 | adiY | 4337168 | 4337929 | + | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | yqhC | 3154262 | 3155218 | + | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG |
296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG |
297 | zntR | 3438705 | 3439130 | + | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCggcttgtcgacgacggcggtctccgtcgtcaggatcatATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT |
298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG |
299 | zur | 4259488 | 4260003 | + | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT |
300 rows × 11 columns
Looks good. Let's proceed to making the final subpools and outputting the oligo file.
Now that we've constructed our targeting oligos with two different coordinate sets, let's split up the oligos into long and short deletions again. This should give us 4 different subpools to work with.
df_first_last_short_2 = df_first_last_2.loc[df_first_last_2['length']<575].reset_index()
df_first_last_short_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| index | Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | 10 | alpA | 2758644 | 2758856 | + | 212 | 2758646 | 2758854 | 2758649.0 | 2758836.0 | alpA | AATCTCTCTGCAACCAAAGTGAACCAATGAGAGGCAACAAGAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGAGGGTGTTACATGAATTCATACTCAATTGCTGTCATCGGAGTG |
1 | 16 | argR | 3384703 | 3385173 | + | 470 | 3384705 | 3385171 | 3384708.0 | 3385153.0 | argR | TATGCACAATAATGTTGTATCAACCACCATATCGGGTGACTTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATCTCTGCCCCGTCGTTTCTGACGGCGGGGAAAATGTTGCTTA |
2 | 17 | ariR | 1216369 | 1216635 | + | 266 | 1216371 | 1216633 | 1216374.0 | 1216615.0 | ariR | GATGAATGAGTTTTCTATAAACTTATACTTAATAATTAGAAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGGTAACCTCTCATCTTACTTATGAAATTTTAATGTATTCTGT |
3 | 18 | arsR | 3648528 | 3648881 | + | 353 | 3648530 | 3648879 | 3648533.0 | 3648861.0 | arsR | GCTTCGAAGAGAGACACTACCTGCAACAATCAGGAGCGCAATATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAAAATTTAGCTAAACACATATGAATTTTCAGATGTGTTTTATC |
4 | 20 | asnC | 3926545 | 3927003 | + | 458 | 3926547 | 3927001 | 3926565.0 | 3926998.0 | asnC | GGCTAAAATAGAATGAATCATCAATCCGCATAAGAAAATCCTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATCGGCTTTTTTAATCCCATACTTTTCCACAGGTAGATCCCAA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
69 | 281 | yiiE | 4079291 | 4079509 | + | 218 | 4079293 | 4079507 | 4079296.0 | 4079489.0 | yiiE | TAAGGGCATCTGTTTTTTATATTCAAGAATGAAAAATTTTTGTCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTACCAATACCTTACATATATTACTCATTAATGTATGTGCGAA |
70 | 289 | ylbG | 529645 | 530016 | + | 371 | 529647 | 530014 | 529665.0 | 530011.0 | ylbG | ATATGAGTGTCGAATCCTTATCCAAAACAAGAGGTAACTCTCATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGAACAAATTTTATCAGGTGACGTTCCGTAAAAAGTTGTATGGAG |
71 | 296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGCCATGCACCGTAGACCAGATAAGCTCAGCGCATCCGGCAGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAGTTT |
72 | 297 | zntR | 3438705 | 3439130 | + | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GGTTATTTAACGGCGCGAGTGTAATCCTGCCAGTGCAAAAAATCAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACTTGT |
73 | 299 | zur | 4259488 | 4260003 | + | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | GGTAAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAAGAGGGCGTACATCCTTGTACACGTCGGGCAGGAGGGATTAAT |
74 rows × 12 columns
df_first_last_long_2 = df_first_last_2.loc[df_first_last_2['length']>=575].reset_index()
df_first_last_long_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| index | Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | 0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | CTATATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAATTCATATTGTACTGTTACGTTGTACAAACCTGTGCCAACGGG |
1 | 1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | GAGTCTGGCGGATGTCGACAGACTCTATTTTTTTATGCAGTTTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAACTGG |
2 | 2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | CGACGAAAATGTCCAGGAAAAATCCTGGAGTCAGATTCAGGGTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTTGTG |
3 | 3 | ada | 2309341 | 2310405 | + | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | GTGGCTCTTGCCACGGTTCAGCATCGGCAAACAGATCCAACATTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCGGTC |
4 | 4 | adiY | 4337168 | 4337929 | + | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | TTAGCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTTTTAACCTTAACGAAGAGCTATATTAATAACGGCATCAGC |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
221 | 292 | ypdC | 2501130 | 2501987 | + | 857 | 2501132 | 2501985 | 2501135.0 | 2501967.0 | ypdC | AAAGAATTTCGCCAGTTAATGCATCTTTAATCGGGAACTTTCATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAACGTCAGAAGGTTAATTCTGTTTCCAGCAGCGTCAGGATACTT |
222 | 293 | yphH | 2682863 | 2684056 | + | 1193 | 2682865 | 2684054 | 2682868.0 | 2684036.0 | yphH | CGCGGAATAATCACGCAATTAACTAAACAAGGTTTAGTGAAGATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATGGCGCGATAACGTAGAAAGGCTTCCCGAAGGAAGCCTTGAT |
223 | 294 | yqeI | 2988502 | 2989311 | + | 809 | 2988504 | 2989309 | 2988507.0 | 2989291.0 | yqeI | CTATGTGATCTCCATTTCGATTGATTTAGTGTTTATTGACGTATGggcttgtcgacgacggcggtctccgtcgtcaggatcatTGATTATAAAAAAAACTTATTATTTATTTTAGTTTTTATCAGTGG |
224 | 295 | yqhC | 3154262 | 3155218 | + | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | TGACGATTTTCCCCGTTCCCGGTTGCTGTACCGGGAACGTATTTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAGAAA |
225 | 298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | CCGGAAAGATATCGGCTGGCGCGCTATCGAACGCGAGCAGAACTAggcttgtcgacgacggcggtctccgtcgtcaggatcatCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAGAGG |
226 rows × 12 columns
df_avd_ovlp_short_2 = df_avd_ovlp_2.loc[df_avd_ovlp_2['length']<575].reset_index()
df_avd_ovlp_short_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| index | Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | 10 | alpA | 2758644 | 2758856 | + | 212 | 2758646 | 2758854 | 2758649.0 | 2758836.0 | alpA | CTCTCTGCAACCAAAGTGAACCAATGAGAGGCAACAAGAATGAACggcttgtcgacgacggcggtctccgtcgtcaggatcatCAACGCTGTAAACTTATTTGAGGGTGTTACATGAATTCATACTCA |
1 | 16 | argR | 3384703 | 3385173 | + | 470 | 3384705 | 3385171 | 3384708.0 | 3385153.0 | argR | GCACAATAATGTTGTATCAACCACCATATCGGGTGACTTATGCGAggcttgtcgacgacggcggtctccgtcgtcaggatcatCTGTTCGACCAGGAGCTTTAATCTCTGCCCCGTCGTTTCTGACGG |
2 | 17 | ariR | 1216369 | 1216635 | + | 266 | 1216371 | 1216633 | 1216374.0 | 1216615.0 | ariR | AAACTTATACTTAATAATTAGAAGTTACATATCATCAGCTGTGTAggcttgtcgacgacggcggtctccgtcgtcaggatcatAAGCATGGTAACCTCTCATCTTACTTATGAAATTTTAATGTATTC |
3 | 18 | arsR | 3648528 | 3648881 | + | 353 | 3648530 | 3648879 | 3648533.0 | 3648861.0 | arsR | TCGAAGAGAGACACTACCTGCAACAATCAGGAGCGCAATATGTCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGTAAGAACATTTGCAGTTAAAAATTTAGCTAAACACATATGAAT |
4 | 20 | asnC | 3926545 | 3927003 | + | 458 | 3926547 | 3927001 | 3926565.0 | 3926998.0 | asnC | TAAAATAGAATGAATCATCAATCCGCATAAGAAAATCCTATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatATGCGTACCATCAAGCCCTGATCGGCTTTTTTAATCCCATACTTT |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
69 | 281 | yiiE | 4079291 | 4079509 | + | 218 | 4079293 | 4079507 | 4079296.0 | 4079489.0 | yiiE | ATATTCAAGAATGAAAAATTTTTGTCATTCCTTATGCTCCTTACAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGCCATTACCAATACCTTACATATATTACTCATTAATGTATGTGC |
70 | 289 | ylbG | 529645 | 530016 | + | 371 | 529647 | 530014 | 529665.0 | 530011.0 | ylbG | TGAGTGTCGAATCCTTATCCAAAACAAGAGGTAACTCTCATGCTTggcttgtcgacgacggcggtctccgtcgtcaggatcatAATCTCAAAAGACGATACTGAACAAATTTTATCAGGTGACGTTCC |
71 | 296 | ytfH | 4434113 | 4434493 | + | 380 | 4434115 | 4434491 | 4434118.0 | 4434473.0 | ytfH | AGATAAGCTCAGCGCATCCGGCAGTTATGCCGCACGTTCATCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatACTCATTTCATACTTACCTTTTTGTACGTACTTACTAAAAGTAAG |
72 | 297 | zntR | 3438705 | 3439130 | + | 425 | 3438707 | 3439128 | 3438725.0 | 3439125.0 | zntR | GTGTAATCCTGCCAGTGCAAAAAATCAACAACCACTCTTAACGCCggcttgtcgacgacggcggtctccgtcgtcaggatcatATACATACATACTCCACTAGTTATCGTTGATTTTGTCCAACAACT |
73 | 299 | zur | 4259488 | 4260003 | + | 515 | 4259490 | 4260001 | 4259508.0 | 4259998.0 | zur | AAAGTAAGGACATTCTTAACCCCCACTTTGAGGTGCCCGATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGTGAAAAAGAAACCGCGTTAAGAGGGCGTACATCCTTGTACACGT |
74 rows × 12 columns
df_avd_ovlp_long_2 = df_avd_ovlp_2.loc[df_avd_ovlp_2['length']>=575].reset_index()
df_avd_ovlp_long_2.drop(['Product Name', 'GO terms (molecular function)'], axis = 1)
| index | Gene Name | Left-End-Position | Right-End-Position | Direction | length | left_codon | right_codon | left_avd_ovlp | right_avd_ovlp | gene | oligo |
0 | 0 | aaeR | 3389520 | 3390449 | + | 929 | 3389522 | 3390447 | 3389525.0 | 3390429.0 | aaeR | TATTATGTGATCTAAATCACTTTTAAGTCAGAGTGAATAATGGAAggcttgtcgacgacggcggtctccgtcgtcaggatcatGGGCGCGGGAAAGAGAAGTAATTCATATTGTACTGTTACGTTGTA |
1 | 1 | abgR | 1404741 | 1405649 | + | 908 | 1404743 | 1405647 | 1404746.0 | 1405629.0 | abgR | CAGACTCTATTTTTTTATGCAGTTTTAACTTTGCAGATAGCCGCAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCCATGACGCCACCGATAACCGTTATTTATCAGACCAAAGAAAC |
2 | 2 | acrR | 485761 | 486408 | + | 647 | 485763 | 486406 | 485766.0 | 486388.0 | acrR | AAAATCCTGGAGTCAGATTCAGGGTTATTCGTTAGTGGCAGGATTggcttgtcgacgacggcggtctccgtcgtcaggatcatTGCCATATGTTCGTGAATTTACAGGCGTTAGATTTACATACATTT |
3 | 3 | ada | 2309341 | 2310405 | + | 1064 | 2309343 | 2310403 | 2309361.0 | 2310400.0 | ada | CAGCATCGGCAAACAGATCCAACATTACCTCTCCTCATTTTCAGCggcttgtcgacgacggcggtctccgtcgtcaggatcatTTTCATAATCAGCTCCCTGGTTAAGGATAGCCTTTAGGCTGCCCG |
4 | 4 | adiY | 4337168 | 4337929 | + | 761 | 4337170 | 4337927 | 4337188.0 | 4337924.0 | adiY | GCGAGAACTGGTCTTTTATTCGCACTCAGGAGTACATGTATGAGGggcttgtcgacgacggcggtctccgtcgtcaggatcatAGAGAACGCACTGTCGCCTGATTTTTAACCTTAACGAAGAGCTAT |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
221 | 292 | ypdC | 2501130 | 2501987 | + | 857 | 2501132 | 2501985 | 2501135.0 | 2501967.0 | ypdC | GAATTTCGCCAGTTAATGCATCTTTAATCGGGAACTTTCATGAAAggcttgtcgacgacggcggtctccgtcgtcaggatcatAGCGCCCGTTTTCAGGGCTAACGTCAGAAGGTTAATTCTGTTTCC |
222 | 293 | yphH | 2682863 | 2684056 | + | 1193 | 2682865 | 2684054 | 2682868.0 | 2684036.0 | yphH | GGAATAATCACGCAATTAACTAAACAAGGTTTAGTGAAGATGAGAggcttgtcgacgacggcggtctccgtcgtcaggatcatGCGCAGTTACGACAGATTTGATGGCGCGATAACGTAGAAAGGCTT |
223 | 294 | yqeI | 2988502 | 2989311 | + | 809 | 2988504 | 2989309 | 2988507.0 | 2989291.0 | yqeI | TGTGATCTCCATTTCGATTGATTTAGTGTTTATTGACGTATGTACggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTGAGGTTAATCGTGATTGATTATAAAAAAAACTTATTATTTAT |
224 | 295 | yqhC | 3154262 | 3155218 | + | 956 | 3154264 | 3155216 | 3154282.0 | 3155213.0 | yqhC | CCGGTTGCTGTACCGGGAACGTATTTAATTCCCCTGCATCGCCCGggcttgtcgacgacggcggtctccgtcgtcaggatcatTAGCATTTCTCCAGCACTCTGGAGAAATAGGCAAGACATTGGCAG |
225 | 298 | zraR | 4203320 | 4204645 | + | 1325 | 4203322 | 4204643 | 4203325.0 | 4204625.0 | zraR | GCGCGCTATCGAACGCGAGCAGAACTAACGCGACAGTTTTGCCAAggcttgtcgacgacggcggtctccgtcgtcaggatcatCGTCATCCTTGTGGGTCCTTACGCGTAATATTGACCGGAAGCCAG |
226 rows × 12 columns
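Because the short and long subpools use complementary conditions (`< 575` and `>= 575`), every gene should land in exactly one of them, and the row counts above agree (74 + 226 = 300). Here is a small plain-Python sketch of that check, using a handful of gene lengths copied from the tables. The dictionary and variable names are just for illustration.

```python
LENGTH_CUTOFF = 575  # same threshold as the .loc filters above

# Gene lengths copied from the 'length' column of the tables.
gene_lengths = {
    "aaeR": 929, "abgR": 908, "acrR": 647, "ada": 1064, "adiY": 761,
    "alpA": 212, "argR": 470, "ariR": 266, "ytfH": 380, "zntR": 425, "zur": 515,
}

short_genes = {g for g, n in gene_lengths.items() if n < LENGTH_CUTOFF}
long_genes = {g for g, n in gene_lengths.items() if n >= LENGTH_CUTOFF}

# The two subpools should not overlap and should cover every gene.
assert short_genes.isdisjoint(long_genes)
assert len(short_genes) + len(long_genes) == len(gene_lengths)

print(sorted(short_genes))
print(sorted(long_genes))
```

The same two assertions could be run on the real dataframes by comparing `len()` of each subpool with `len()` of its parent dataframe.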
And finally, we will output these oligos as .csv files.
