Software


The immune systems of L. fusiformis and other cyanobacteria are often poorly understood and vary by strain, making it difficult to predict how native systems will interfere with gene inserts. A common defense mechanism in L. fusiformis and UTEX 3154 is the Restriction Modification System (RMS), which cuts specific DNA sequences. This likely contributes to the degradation of genetic inserts, hindering transformation.

We needed a way to bypass the RMS

Our PI, David Bernick, wrote the Stealth algorithm [2], a program that identifies underrepresented K-mer sequences in a given genome. We used the Stealth algorithm as a starting point to help solve the problem.


As we cannot disable the immune systems of LiFT's target and model organisms, we had to work around the RMS of L. fusiformis and UTEX 3154. Our working hypothesis is that RMS cut sites in a genome accumulate random mutations through repeated cutting and repair until they are no longer recognized, so a sequence that is conspicuously rare in a genome may be a cut site. To aid LiFT in the integration of accelerators, we wanted to remove cut sites from our inserts and additionally optimize the inserts for UTEX 3154. To this end, we designed



BLACKBIRD [1]


a Stealth-based pipeline that optimizes inserts
for cyanobacterial transformations in non-model organisms.


The BLACKBIRD Software

To accelerate transformation and integration in cyanobacteria, LiFT used multiple methods to maximize integration efficiency. BLACKBIRD's contribution is to combine the Stealth algorithm with codon optimization of gene inserts. We consulted with our PI, Dr. David L. Bernick, to learn more about the mathematical theory behind Stealth, its inputs and outputs, and possible implementations. From him we learned that the algorithm uses Markov chains to predict the expected frequency of a given sequence when each nucleotide is equally likely to occur in a sequence [2]. Given a user-supplied cut-off value, the algorithm applies a Chi-Squared Independence test to find the sequences that are underrepresented relative to that expectation. The output is a list of underrepresented sequences, but changing a gene insert by hand to avoid these sites is tedious, because each correction can generate new sites. BLACKBIRD automates the process of altering gene inserts with regard to RMS cut sites, beginning with the processing of the host and target genomes.
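To make the idea of an "expected frequency" concrete, here is a minimal sketch of one common Markov-style estimator for the expected count of a K-mer, computed from the counts of its shorter sub-sequences. It is an illustration only: whether Stealth uses exactly this estimator is an assumption, Stealth's statistical test and cut-off are not reproduced, and the function names `count_kmers` and `expected_count` are hypothetical.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (single strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, kmer):
    """Estimate how often kmer should occur, given its sub-sequences.

    Assumed estimator (not taken from the Stealth source):
        E[w1..wk] = N(w1..wk-1) * N(w2..wk) / N(w2..wk-1)
    """
    k = len(kmer)
    if k < 3:
        raise ValueError("this estimator needs k-mers of length 3 or more")
    shorter = count_kmers(seq, k - 1)
    middle = count_kmers(seq, k - 2)[kmer[1:-1]]
    if middle == 0:
        return 0.0
    return shorter[kmer[:-1]] * shorter[kmer[1:]] / middle


if __name__ == "__main__":
    genome = "ACGACGACG"  # toy sequence, not real data
    observed = count_kmers(genome, 3)["CGA"]
    print(observed, expected_count(genome, "CGA"))
```

A K-mer whose observed count falls well below this expectation is a candidate underrepresented sequence; deciding how far below is "significant" is the job of the statistical test described above.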

BLACKBIRD uses the genomes of the host organism (the organism from which the gene insert is derived) and the target organism to optimize the gene insert. It calculates a codon usage table for each genome. These tables are used to generate a ranking table between the host and target organisms, so that during codon optimization each codon is replaced by one of matching relative abundance. This preserves slow-translating regions of proteins, which have been shown to increase folding efficiency, as the time taken for the rarer tRNA to bind promotes proper protein folding [3][4]. Each codon usage table is created with an Open Reading Frame finder built into the program, which generates the usage statistics from the frames it finds. Once the tables are complete, the codon tables of both organisms are compared, matching codons for each amino acid by usage rank between organisms, and the gene insert can be altered.
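The rank-matching idea can be sketched as follows. This is a simplified illustration rather than BLACKBIRD's code: it assumes the coding sequences have already been extracted (the ORF finder is not reproduced), uses the standard genetic code, breaks ties alphabetically, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(coding_seqs):
    """Count in-frame codons across an iterable of coding sequences."""
    counts = Counter()
    for cds in coding_seqs:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TO_AA:
                counts[codon] += 1
    return counts


def ranked_codons(counts):
    """Map each amino acid to its synonymous codons, most used first."""
    by_aa = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        by_aa[aa].append(codon)
    return {aa: sorted(codons, key=lambda c: (-counts[c], c))
            for aa, codons in by_aa.items()}


def rank_map(host_counts, target_counts):
    """Pair the n-th most used host codon with the n-th most used target codon."""
    host_rank = ranked_codons(host_counts)
    target_rank = ranked_codons(target_counts)
    return {h: t for aa in host_rank
            for h, t in zip(host_rank[aa], target_rank[aa])}


def recode(cds, mapping):
    """Rewrite a coding sequence (uppercase ACGT, length divisible by 3)."""
    return "".join(mapping[cds[i:i + 3]] for i in range(0, len(cds), 3))


if __name__ == "__main__":
    host = codon_usage(["GAAGAAGAG"])    # toy host: GAA preferred for Glu
    target = codon_usage(["GAGGAGGAA"])  # toy target: GAG preferred for Glu
    print(recode("ATGGAAGAG", rank_map(host, target)))
```

Because a rare host codon is mapped to a similarly rare target codon, the relative position of slow-translating codons is kept instead of every codon being replaced by the target's most frequent choice.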

Adapting the gene insert is by far the most time-intensive step, specifically because each edit can generate new RMS cut sites. Our solution is an 'editing window' that takes the leading and trailing sequence into consideration to identify whether a change generated new cut sites. Using the targeted organism’s genome, BLACKBIRD runs the Stealth program, which returns the underrepresented, theoretical RMS cut sites to be checked against the insert. An editing window is created at every RMS cut site found in the gene insert, and the adapting process begins. The window is based on the first codon containing the RMS cut site and extends 2 codons before and 2 after. This size was chosen because the K-mers are at least 4 nucleotides long, and a 5-codon window is best suited to catching any changes. With each change, BLACKBIRD rechecks the window for newly generated cut sites until none are found. If the RMS cut site cannot be removed from the current codon, the window shifts forward to the next codon and attempts changes there.
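The editing-window procedure described above might look roughly like the sketch below. It is not BLACKBIRD's implementation: it assumes an in-frame, uppercase ACGT insert, tries synonymous codons in a fixed order instead of using the codon ranking, accepts the first swap that lowers the number of hits inside the window, and uses hypothetical names throughout.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def find_hits(seq, kmers):
    """Return sorted (start, kmer) pairs for every occurrence of an avoided k-mer."""
    hits = []
    for kmer in kmers:
        start = seq.find(kmer)
        while start != -1:
            hits.append((start, kmer))
            start = seq.find(kmer, start + 1)
    return sorted(hits)


def window_bounds(codon_start, seq_len, flank=2):
    """Window of `flank` codons on each side of the codon at codon_start."""
    lo = max(0, codon_start - 3 * flank)
    hi = min(seq_len, codon_start + 3 * (flank + 1))
    return lo, hi


def remove_site(seq, hit_start, kmers, max_shift=2):
    """Try synonymous swaps near a hit; return the edited sequence or None."""
    first_codon = (hit_start // 3) * 3
    for shift in range(max_shift + 1):  # shift the window forward if needed
        c = first_codon + 3 * shift
        codon = seq[c:c + 3]
        if len(codon) < 3:
            break
        lo, hi = window_bounds(c, len(seq))
        before = len(find_hits(seq[lo:hi], kmers))
        for alt, aa in CODON_TO_AA.items():
            if alt == codon or aa != CODON_TO_AA[codon]:
                continue  # keep only synonymous alternatives
            candidate = seq[:c] + alt + seq[c + 3:]
            # Accept only if the window ends up with fewer hits than before.
            if len(find_hits(candidate[lo:hi], kmers)) < before:
                return candidate
    return None


if __name__ == "__main__":
    insert = "ATGGGCCAGTAA"  # toy insert containing GGCC
    print(remove_site(insert, 3, {"GGCC"}))
```

With two flanking codons, the window covers every K-mer of up to 7 nucleotides that overlaps the edited codon, which is why rechecking only the window is enough to catch newly created sites of that length.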

README

For Unix/macOS:

First, check that your system can run Python and the pip installer. Python packages that are not already on your system need to be retrieved by an installer such as pip. Use the following commands to check:

usr:~$ python --version
Python 3.x.x
usr:~$ python -m pip --version
pip X.Y.Z from /<path>/<to>/<your>/pip (python 3.x.x)

If you receive an error...

BLACKBIRD CLI

Once installed, the main function can be run with the blackbird command:

blackbird --insert (-n) <insert infile> --stealth (-s) <stealth infile> --hostT (-ht) <host codon usage table infile> --target (-t) <target genome infile> --outfile (-o) [outfile | default: stdout]

The blackbird command takes 4 required arguments:

  • --insert (-n): the insert sequence of interest in FASTA format (.fa/.fasta)
  • --stealth (-s): the list of K-mers output by Stealth, in a text file (.txt/.stealth)
  • --hostT (-ht): the host organism's codon usage table in TSV format (.tsv)
  • --target (-t): the target organism's complete genome in FASTA format (.fa/.fasta)
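For readers who want to script around the CLI, the argument list above could be mirrored with argparse as in the sketch below. This is a hypothetical reconstruction based only on the flags documented here, not BLACKBIRD's actual source; the `build_parser` name and the use of `None` to mean "write to stdout" are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented blackbird flags."""
    parser = argparse.ArgumentParser(prog="blackbird")
    parser.add_argument("--insert", "-n", required=True,
                        help="insert sequence in FASTA format")
    parser.add_argument("--stealth", "-s", required=True,
                        help="text file of K-mers output by Stealth")
    parser.add_argument("--hostT", "-ht", required=True,
                        help="host codon usage table in TSV format")
    parser.add_argument("--target", "-t", required=True,
                        help="target genome in FASTA format")
    # Assumption: None stands for "write to stdout".
    parser.add_argument("--outfile", "-o", default=None,
                        help="output FASTA file (default: stdout)")
    return parser


if __name__ == "__main__":
    demo = ["-n", "insert.fa", "-s", "kmers.txt",
            "-ht", "host.tsv", "-t", "target.fa"]
    print(build_parser().parse_args(demo))
```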

Input File Examples

Insert Sequence (Fasta format):

>pET28:EGFP CDS
              ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA

Stealth Input File:

N = 3081514
CGCG	[100]	RC Palindrome
GCGC	[98]	RC Palindrome
GGCC	[100]	RC Palindrome
AATAG	[92]
AATCG	[100]
...
GAAGAC
GTCTTC
GGTCTC
GAGACC
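A file in this layout can be reduced to a plain list of K-mers with a few lines of Python. The sketch below is not the StealthParser module: it simply keeps the first whitespace-separated field of each line when that field consists only of A, C, G, and T, which skips the `N = ...` header and the `...` placeholder. The function name is hypothetical.

```python
def parse_stealth_kmers(lines):
    """Collect the leading K-mer from each line of a Stealth-style file."""
    kmers = []
    for line in lines:
        fields = line.split()
        # Keep lines whose first field is a pure ACGT string.
        if fields and set(fields[0]) <= set("ACGT"):
            kmers.append(fields[0])
    return kmers


if __name__ == "__main__":
    example = [
        "N = 3081514",
        "CGCG\t[100]\tRC Palindrome",
        "AATAG\t[92]",
        "...",
        "GAAGAC",
    ]
    print(parse_stealth_kmers(example))
```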

Codon Usage Table (TSV format):

TTT	22.31	-2414783
TTC	16.54	-1789835
TTA	13.76	-1489606
TTG	13.65	-1477363
...
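A table in this layout can be loaded with the standard csv module, as in the sketch below. The meaning of the second and third columns is not stated here, so the sketch keeps only the second column as a raw float and ignores the third; that choice, and the function name, are assumptions rather than a description of the CodonUsage module.

```python
import csv
import io


def read_codon_table(handle):
    """Map each codon to the value in the second column of a TSV table."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skips blank lines and the "..." placeholder
        codon = row[0].strip()
        if len(codon) == 3 and set(codon) <= set("ACGT"):
            table[codon] = float(row[1])
    return table


if __name__ == "__main__":
    text = "TTT\t22.31\t-2414783\nTTC\t16.54\t-1789835\n...\n"
    print(read_codon_table(io.StringIO(text)))
```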

Output File Format (Fasta):

>pET28:EGFP CDS output [8]
ATGTCAATATATCAA...

The number in brackets is the number of Stealth hits remaining in the output insert sequence.

LiFT, the 2024 UCSC iGEM team, consents to receiving any and all contributions offered.
This software is published under the MIT license. Feel free to use any and all code provided by the project in any way and for any purpose.

Contributors

BLACKBIRD was written and contributed to by:

Special Thanks

  • David L. Bernick (email: dbernick@soe.ucsc.edu), our PI, for allowing the further application of Stealth and for all the support and contributions throughout.
  • Robin Rounthwaite (email: rrounthw@ucsc.edu) for consultation on software architecture and Git repository management.
  • TABI, the 2023 UCSC iGEM team (github), for support regarding Git repository management and project packaging.
  • Reto Stamm (email: rstamm@ucsc.edu | github) for guidance in developing and publishing a package to the Python Package Index.

The BLACKBIRD Modules Documentation

FastAreader

Information about FastAreader goes here.

StealthParser

Information about StealthParser goes here.

CodonChoice

Information about CodonChoice goes here.

CodonOp

Information about CodonOp goes here.

CodonUsage

Information about CodonUsage goes here.

References

[1] V. Nandakumar and A. Mahesh, "BLACKBIRD," GitLab, iGEM UCSC, 2024.

[2] S. Hu, "Altering under-represented DNA sequences elevates bacterial transformation efficiency," mBio, Oct. 31, 2023. https://doi.org/10.1128/mbio.02105-23 (accessed Sep. 24, 2024).

[3] G. Zhang, "Transient ribosomal attenuation coordinates protein synthesis and co-translational folding," Nat Struct Mol Biol, Jul. 13, 2008. https://www.nature.com/articles/nsmb.1554 (accessed Sep. 26, 2024).

[4] G. L. Rosano, "Rare codon content affects the solubility of recombinant proteins in a codon bias-adjusted Escherichia coli strain," Microb Cell Fact, Jul. 24, 2009. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723077/ (accessed Sep. 26, 2024).