Genetic Constructor

GSL Beginners Tutorial

This tutorial focuses on the basics of the Genotype Specification Language (GSL). It introduces the key syntax needed for writing valid GSL and describes how a scientist might use it to design their DNA constructs - aka how to think in GSL.

GSL is under active development

This tutorial also mentions features that are integral to GSL but not yet accessible in the public installation.

Syntax and basic principles

GSL is a language tool, much like a programming language. Technically DNA is also a language, but a biologist specifying complex designs in As, Cs, Ts and Gs would be like a coder writing a large computer program in only 1s and 0s. GSL is therefore a more abstract way of talking about DNA. Instead of dealing with individual nucleotides, GSL deals with functional parts, like a gene part or a promoter part. The basic syntax provided by GSL allows scientists to organize and manipulate these parts in various ways.

GSL parts are centered around genes in annotated reference genomes. Various types of genetic elements surrounding annotated genes are accessible using simple prefix operators. For example, the diagram below represents a section of the yeast genome where the gene ADH1 has been annotated. To access the open reading frame of ADH1 from start codon to stop codon, a user may simply type oADH1.

Visualization of the GSL operators' positions around the ADH1 gene annotation.

Similarly, pADH1 and tADH1 access the promoter and terminator regions (by default, roughly 500 bp upstream of the start codon and roughly 500 bp downstream of the stop codon, respectively). uADH1 and dADH1 access the same sequence regions as the promoter and terminator syntax, but the intent differs: use p and t when you want the function of the promoter or terminator region, and use u and d when you want flanking homology so that a construct recombines at a particular locus.

gADH1 can be used to indicate the general neighborhood of the gene and it is the operator required for making further modifications that we’ll discuss below! Without modifications, the g operator accesses the same region as o (start codon through stop codon).

When programming in GSL, you have access to any of these functional parts for any annotated gene by simply typing the gene name preceded by the operator prefix you want. When a GSL program is run, the underlying compiler knows how to fetch the precise sequence of each part from its coordinates in the annotated reference genome.

Each genetic element is accessible for every annotated gene across the reference genome
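
The prefix convention described above can be sketched in Python. This is only an illustration of how a part name splits into an operator and a gene name; the `split_part` helper and the `PREFIXES` table are hypothetical and are not part of GSL or its compiler.

```python
# Hypothetical helper, not part of GSL: split a part token such as "pADH1"
# into its one-letter operator prefix and the gene name that follows it.
PREFIXES = {
    "g": "gene neighborhood (start codon through stop codon by default)",
    "o": "open reading frame",
    "p": "promoter",
    "t": "terminator",
    "u": "upstream flanking homology",
    "d": "downstream flanking homology",
}


def split_part(token):
    """Return (operator, gene) for a token like 'pADH1'."""
    operator, gene = token[:1], token[1:]
    if operator not in PREFIXES or not gene:
        raise ValueError(f"not a prefixed gene part: {token!r}")
    return operator, gene


for token in ["oADH1", "pADH1", "tADH1", "uHO", "dHO"]:
    operator, gene = split_part(token)
    print(f"{token}: {PREFIXES[operator]} of {gene}")
```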

Note

At the time of writing, the only three genomes available in the open GSL release and this integration are the yeast strains S288C, BY4741, and BY4742.

Once you know which parts you want in your construct, you can arrange them in order separated by semicolons to quickly denote any new genotype. Below is a hypothetical construct that places the ERG10 gene under the control of the ADH1 promoter and targets the native HO locus for homologous recombination.

Example GSL construct and its representation in Genetic Constructor.
`uHO ; pADH1 ; gERG10; dHO`


The Power of GSL: Precision Editing

The designs above introduce the basic syntax of GSL, but so far they have involved little more than lining up standard genetic parts in a row. Further precision-editing syntax lets users manipulate the DNA inside these parts, while GSL does the heavy lifting of implementing the low-level changes.

-- Part Orientation

By default, genetic elements are oriented left to right.

`uHO ; pADH1 ; gERG10 ; dHO`

To reverse the direction of selected parts, users may simply prepend an exclamation mark (!). GSL does the work to reverse the actual sequence in the final construct.

`uHO ; !gERG10 ; !pADH1 ; pGAL1 ; gERG12 ; dHO`
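
As a rough illustration of what reversing a part involves at the sequence level, the Python sketch below assumes that a `!` prefix corresponds to taking the reverse complement of the part's forward sequence. That is the usual meaning of flipping double-stranded DNA, but it is an assumption here, not something the tutorial states. The `orient` helper and the toy sequences are made up.

```python
# Assumption: "!" means the part is placed as its reverse complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def orient(token, sequences):
    """Look up a part's sequence, flipping it when the token starts with '!'."""
    if token.startswith("!"):
        return reverse_complement(sequences[token[1:]])
    return sequences[token]


# Made-up toy sequences, far shorter than any real part.
toy_parts = {"pADH1": "AACCG", "gERG10": "ATGTAA"}
for token in ["pADH1", "!pADH1", "!gERG10"]:
    print(token, orient(token, toy_parts))
```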

-- Slicing and Dicing

Operator prefixes like o and p access fairly rigid pieces of DNA. What if we want only a subsequence of a gene region, or a region that spans the ORF boundary? For that, we can use slice notation to specify the exact section of DNA we want.

GSL uses a DNA coordinate system (much like an array in computer science) such that every nucleotide of a gene is enumerated in order: the first 'A' in the 'ATG' codon would be 1, the 'T' would be 2, etc, all the way down the length of the gene and even past the end of the gene. This number is that nucleotide’s index. Appending an S or an E to an index signifies a position relative to the Start or the End of a gene. By default, indices are relative to the start codon so S usually does not change the outcome. However E often comes in handy for working with the tails and/or downstream regions of genes without having to know their exact lengths.

Schematic of GSL's DNA coordinate system. In this hypothetical ORF, the 'A' of the ATG start codon is at index 1 while the 'G' of the TAG stop codon is at index 999. Appending an `E` to the index references a position relative to the End of the gene so the final 'G' of TAG is also at index `-1E`.

Heads up!

There is no 0 (zero) index! This means that the index of the base immediately preceding the start codon is -1. (This may seriously bother some computer scientists)

Similar to Python's array slice notation, to take a custom slice of a gene in GSL, we must use the g operator and append square brackets containing the precise start and stop indices, separated by a colon: gYFG[start:stop]

Some example reasons one might want to make a slice include:

#refgenome S288C
// 1.) Truncating a gene
gADH1[1:728]

// 2.) Spanning over an ORF boundary
gADH1[-100:100]    // 200 bp, centered on start codon
gADH1[-100E:100E]  // 200 bp, centered on stop codon

// 3.) Extracting a certain protein domain
gADH1[-300E:-1E]   // final 300 bp of the sequence
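
The coordinate rules above (no zero index, S and E suffixes) can be sketched in Python. This is an illustrative sketch rather than GSL's implementation: the function names `to_offset` and `slice_length` are hypothetical, and inclusive end points are an assumption inferred from the bp counts in the comments above. The 999 bp gene length is the hypothetical ORF from the schematic.

```python
import re

# A GSL-style index: optional minus sign, digits, optional S or E suffix.
INDEX = re.compile(r"(-?\d+)([SE]?)")


def to_offset(index, gene_length):
    """Convert an index such as '-100', '728' or '-1E' into a 0-based
    offset counted from the first base of the start codon."""
    match = INDEX.fullmatch(index.strip())
    if not match:
        raise ValueError(f"unrecognised index: {index!r}")
    value = int(match.group(1))
    if value == 0:
        raise ValueError("there is no 0 index")
    # Indices with E count from the end of the gene, all others from the start.
    base = gene_length if match.group(2) == "E" else 0
    # Skipping zero means positive indices shift down by one.
    return base + value - 1 if value > 0 else base + value


def slice_length(spec, gene_length):
    """Length in bp of 'start:stop', assuming both end points are included."""
    start, stop = spec.split(":")
    return to_offset(stop, gene_length) - to_offset(start, gene_length) + 1


GENE_LENGTH = 999  # hypothetical ORF from the schematic
for spec in ["1:728", "-100:100", "-100E:100E", "-300E:-1E"]:
    print(spec, slice_length(spec, GENE_LENGTH), "bp")
```

Under these assumptions index 999 and index -1E land on the same base, and the slice lengths agree with the comments in the GSL examples (200 bp, 200 bp and 300 bp).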

Amino Acid slicing

Users can also make slices using amino acid coordinates by appending a lowercase a to the indices:
gADH1[100a:-100aE]

Approximate slicing

Users may also make approximate slices by prepending a ~ to the indices:
gADH1[~-500S:-1S]
This is common practice when designing homology regions that do not need exact precision. It gives the GSL compiler some flexibility to choose an optimal slice location based on its primer design algorithm; the slice will still be made in the immediate vicinity of the index provided.

Sliced parts may be used in a construct along with other unmodified genetic elements.

`uHO ; pADH1 ; gERG10[1:728] ; dHO`

-- Selectable Markers `###`

In GSL, adjacent elements are designed with small sections of homology to allow stitching via homologous recombination. Three hash symbols in a row (###) denote a general marker sequence, the default being URA3. Upon assembly, the marker sequence is usually split into 2 parts, each of which contains about 2/3 of the marker sequence, so that the two parts overlap in the middle. The placement of the ### splits the entire assembly into two parts, which can be stitched separately and then connected in vivo via homologous recombination of the overlapping marker.

This assembly technique is helpful in 2 ways: first, the split marker will have a longer homology region that can pull together both halves of the construct. Stitching efficiency can decrease as the number of pieces to stitch increases, so splitting up the stitching process into two shorter assemblies is helpful for obtaining the fully correct construct. Second, the marker will only be functional if its two partial sequences have in fact recombined correctly. Correctly assembled parts that have been transformed in a strain can be selected for using standard resistance plating techniques.

Design computed by GSL. Each element is amplified separately but with linking homology regions allowing them to be stitched together. The marker sequence is split into 2 parts with a longer overlap, which can then recombine. This helps pull together each half of the construct and confers selectable resistance to strains with a correctly assembled construct.
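
The arithmetic behind the split marker can be sketched in Python. This is a toy model only: it assumes each half is exactly two thirds of the marker (the text says "about 2/3"), the `split_marker` function is hypothetical, and the 900 bp placeholder sequence is not URA3.

```python
def split_marker(marker):
    """Split a marker into two fragments that each cover two thirds of it.

    Returns (left, right, overlap), where overlap is the number of bases
    shared by the end of `left` and the start of `right`.
    """
    size = (2 * len(marker)) // 3          # length of each fragment
    left = marker[:size]                   # first two thirds
    right = marker[len(marker) - size:]    # last two thirds
    overlap = 2 * size - len(marker)       # shared middle region
    return left, right, overlap


marker = "ACGT" * 225  # 900 bp placeholder sequence
left, right, overlap = split_marker(marker)
print(len(left), len(right), overlap)

# Recombining across the shared middle region restores the full marker.
assert left + right[overlap:] == marker
```

In this model the two fragments share the middle third of the marker, which is the longer homology region that pulls the two halves of the construct together.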

-- Inserting Custom DNA Sequences

Though GSL is designed to move users away from interacting with raw DNA sequences, it has syntax to allow custom DNA insertions if needed. The following example shows a TDH3 promoter being inserted at the native ACS1 locus (thereby taking over transcriptional regulation from the native promoter). Notice that the downstream homology slice starts 106 bp into the ACS1 gene, effectively cutting out the first 105 nucleotides from the native sequence. To compensate, a short custom DNA sequence denoted by /ATGACCATC/ is inserted between the new TDH3 promoter and the truncated ACS1 gene. This sequence encodes three amino acids (Met-Thr-Ile) that are likely important for maintaining accurate translation of the ORF.

#refgenome S288C
// Promoter swap and truncation at the ACS1 gene
gACS1[-700:-1] ; ### ; pTDH3 ; /ATGACCATC/ ; gACS1[106:700]
Schematic showing a custom DNA insertion to maintain accurate translation of the ACS1 ORF after it's been truncated

Custom amino acid insertion

Alternatively, users may specify amino acid sequences using /$___/
gACS1[-700:-1] ; ### ; pTDH3 ; /$MTI/ ; gACS1[106:700]

GSL will codon optimize amino acid inserts based on the host organism's reference genome.
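
The link between the DNA insert /ATGACCATC/ and the amino acid insert /$MTI/ can be checked with a few lines of Python. The sketch below builds the standard genetic code table and translates the DNA insert; the `translate` helper is hypothetical and is unrelated to how the GSL compiler works internally.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate("ATGACCATC"))  # the custom insert from the example above
```

Because several codons can encode the same amino acid, going the other way (from MTI back to DNA) has more than one answer, which is why the amino acid form leaves the codon choice to the compiler.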

The above manipulations relied on Level 1 GSL syntax. It provides some abstraction over the As, Gs, Cs and Ts, but ultimately each element translates directly and unambiguously to Level 0: the sequence. While GSL does the heavy lifting of the conversion, it is somewhat restricted in its freedom to generate the best construct given various constraints. This is where Level 2 syntax starts.


Level 2 GSL: leave it to the compiler!

Level 2 GSL language elements provide simpler, higher level operations for common engineering steps such as introducing mutations, gene knockouts, and promoter replacements. Designs that use Level 2 syntax still translate into concrete sequences but more of the implementation details are abstracted away. This gives the GSL compiler more flexibility for constructing the final DNA and relieves human users from making many of the low level decisions.

-- Allele swaps

Our first example of Level 2 syntax is for making allele swaps (a.k.a. single amino-acid mutations).

The syntax to specify an allele swap is quite simple: a g operator + the gene name we're editing, followed by a $, the current amino acid we're swapping out, the index we're editing, and finally the amino acid we're swapping in.

#refgenome S288C
// In the GPR1 gene, mutate the Proline at 
// amino acid index 627 to a Methionine
gGPR1$P627M

The molecular biology strategy that is being implemented under the hood is to design a piece of donor DNA that is homologous to the GPR1 gene upstream and downstream of the mutation, contains a selectable marker that does not interfere with the ORF, and has a heterology block preceding the mutation index.

Heterology blocks

A heterology block is a segment of DNA that translates to the same amino acid sequence as the wild type gene but the DNA sequence contains codon variants such that a bubble in the homologous recombination alignment is maintained. This can help promote integration of the actual mutation into the gene. In Level 1 GSL, a heterology block is denoted by a ~ element placed between semicolons ( ; ~ ; ).

Visual representation of the donor DNA to implement a Proline to Methionine allele swap in the GPR1 gene. The donor DNA contains a heterology block preceding the Methionine to facilitate accurate swapping through homologous recombination.

In fact, inside the GSL compiler the gGPR1$P627M code is expanded into the following Level 1 syntax:

#refgenome S288C
gGPR1[~879:1878] ; ~ ; /ATG/ ; gGPR1[1882:200E] ; ### ; gGPR1[1E:~800E]

However once again, users do not have to see this complexity! They need only specify the index and amino acids for the desired allele swap and GSL does the heavy lifting to design the sequence level specification. When the resulting construct is built, it can be transformed into a strain to implement the allele swap.
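
The nucleotide coordinates in the Level 1 expansion above follow from the amino acid index by simple arithmetic, sketched here in Python. The `codon_span` helper is hypothetical; it only assumes that amino acid 1 is encoded by nucleotides 1 to 3 of the ORF, in line with the coordinate system described earlier.

```python
def codon_span(aa_index):
    """Return the (first, last) nucleotide indices of a 1-based amino acid."""
    first = 3 * (aa_index - 1) + 1
    return first, first + 2


first, last = codon_span(627)  # the Proline replaced in gGPR1$P627M
print("codon to replace:", first, "to", last)
print("upstream slice ends at:", first - 1)
print("downstream slice starts at:", last + 1)
```

For index 627 the codon covers nucleotides 1879 to 1881, so the upstream slice ends at 1878 and the downstream slice starts at 1882, matching gGPR1[~879:1878] and gGPR1[1882:200E] in the expansion, with /ATG/ (Methionine) taking the place of the original codon.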

-- Gene knockouts

Following common genetic notation, native gene knockouts can be specified by typing the gene name with the g prefix, followed by a ^. GSL will automatically convert this notation into upstream and downstream homology regions connected by a marker but from a user's perspective, you are simply encoding a function rather than a specific construct.

#refgenome S288C
// Knock out the native ADH1 gene
gADH1^

Simple Level 2 GSL to specify a knockout of the native ADH1 gene.

-- Promoter replacement

Native gene promoters can be easily replaced using another Level 2 GSL pattern: promoter element to insert, followed by a >, and the gene element for which to disrupt native regulation.

#refgenome S288C
// Replace the native ADH1 promoter with the promoter from TDH3
pTDH3>gADH1

This syntax means "use this new promoter to drive this native gene" but the GSL compiler knows to implement that logic by designing a construct that has homology upstream and downstream of the native gene, inserts a selectable marker, and places the new promoter immediately in front of the native gene's ORF.

Schematic of low level promoter replacement design.

Example Level 2 promoter replacement of TDH3 at the native ADH1 locus in Genetic Constructor.
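
The three Level 2 patterns introduced so far (allele swap, knockout, promoter replacement) have distinct shapes, which the Python sketch below recognises with regular expressions. It is a reading aid, not the GSL grammar: the patterns are inferred from the examples in this tutorial, gene names are assumed to be alphanumeric, and the `classify` function is hypothetical.

```python
import re

# Patterns inferred from the tutorial's examples; the real grammar may differ.
PATTERNS = [
    ("allele swap",
     re.compile(r"g(?P<gene>\w+)\$(?P<old>[A-Z])(?P<pos>\d+)(?P<new>[A-Z])")),
    ("knockout",
     re.compile(r"g(?P<gene>\w+)\^")),
    ("promoter replacement",
     re.compile(r"p(?P<promoter>\w+)>g(?P<gene>\w+)")),
]


def classify(expression):
    """Return (kind, fields) for a Level 2 expression, or raise ValueError."""
    text = expression.strip()
    for kind, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return kind, match.groupdict()
    raise ValueError(f"not a recognised Level 2 expression: {expression!r}")


for expression in ["gGPR1$P627M", "gADH1^", "pTDH3>gADH1"]:
    print(expression, "->", classify(expression))
```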


Pragmas: extra hints for the compiler

The GSL syntax examples above show various ways we can edit and organize DNA constructs. Rather than being direct sequence edits, GSL pragmas are extra instructions that provide metadata or hint to the compiler to build the construct in a certain way.

Pragmas exist outside of GSL too!

Pragma directives are a common feature used in many compilers. (See the Wikipedia article for more info about pragmas in general)

Pragmas are specified with a #, followed by the pragma name being invoked, and a pragma value if applicable. They can be used globally, for example to specify a certain reference genome for all subsequent designs, or locally, perhaps to add metadata to one particular element of a design. There are many pragmas available but two basic ones are introduced below.

-- #name

The #name pragma is a simple way to provide a custom name to your construct, or even just an element within your construct. Below, we can once again see the example of replacing the native ACS1 promoter with that of TDH3 while simultaneously truncating the beginning of the ACS1 gene. A name for the full DNA construct is given by the first pragma, #name tdh3_promoter_swap_at_acs1trunc, while the truncated ACS1 part is individually given a name by including #name acs1_ntermtrunc in curly braces immediately following the slice element.

Name pragmas can be used to describe an entire construct or an individual element within a construct.

#name tdh3_promoter_swap_at_acs1trunc
gACS1[-700:-1] ; ### ; pTDH3 ; /ATGACCATC/ ; gACS1[106:700] {#name acs1_ntermtrunc}

-- #refgenome

In the Genetic Constructor release of GSL, the default reference genome is the yeast strain S288C. As other genomes become available, it is possible to use the #refgenome pragma to specify the source of certain elements of your construct. For example, perhaps you want to use parts from another annotated yeast genome like BY4741. Here we can use a global #refgenome pragma to use all parts from this yeast genome instead of S288C.

Use the `#refgenome` pragma to specify that all parts should be derived from the yeast strain BY4741.

Alternatively, if you want your construct to mostly use parts from BY4741 but keep the ERG10 gene from S288C, you can use a local #refgenome pragma immediately after the gERG10 part.

Use a local `#refgenome` pragma to ensure the ERG10 gene is derived from S288C instead of BY4741.

// First construct derives all elements from BY4741 genome
#refgenome BY4741
#name BY4741_construct
uHO ; pADH1 ; gERG10 ; ### ; dHO 

// Second construct only derives ERG10 from S288C while the rest
// come from BY4741
#name BY4741_construct_with_288C_erg10
uHO ; pADH1 ; gERG10 {#refgenome S288C} ; ### ; dHO 
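
To make the difference between global and local pragmas concrete, the Python sketch below splits a construct line into parts and attaches any pragmas written in curly braces to the part they follow. It is a simplified reading of the syntax shown above, not the GSL parser; `parse_part` and `parse_construct` are hypothetical names.

```python
import re

INLINE_PRAGMAS = re.compile(r"\{([^}]*)\}")
# Split on semicolons, except those that sit inside a {...} pragma block.
PART_SEPARATOR = re.compile(r";(?![^{]*\})")


def parse_part(text):
    """Split 'gERG10 {#refgenome S288C}' into ('gERG10', {'refgenome': 'S288C'})."""
    pragmas = {}
    match = INLINE_PRAGMAS.search(text)
    if match:
        for item in match.group(1).split(";"):
            name, _, value = item.strip().lstrip("#").partition(" ")
            if name:
                pragmas[name] = value.strip()
        text = text[:match.start()] + text[match.end():]
    return text.strip(), pragmas


def parse_construct(line):
    """Return a list of (part, local_pragmas) pairs for one construct line."""
    return [parse_part(chunk) for chunk in PART_SEPARATOR.split(line)]


line = "uHO ; pADH1 ; gERG10 {#refgenome S288C} ; ### ; dHO"
for part, pragmas in parse_construct(line):
    print(part, pragmas)
```

In this reading only gERG10 carries a local refgenome entry; every other part would fall back to whatever global #refgenome line precedes the construct.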

Note

Currently the only reference genomes available are the yeast strains S288C, BY4741, and BY4742.

-- More pragmas

Several more pragmas and their functions are described in the code below.

// +-----------------------------------------------------------------+
// | Pragma: #name [name_value]                                      |
// | Function: Specify the name for a construct or a part.           |
// +-----------------------------------------------------------------+
// Example:
#name acs1_ntermtrunc

// +-----------------------------------------------------------------+
// | Pragma: #refgenome [refgenome_value]                            |
// | Function: Use a specific reference genome. Can be used for the  |
// |   entire construct (individual line) or for parts (inline using |
// |   {} ).                                                         |
// +-----------------------------------------------------------------+
// Example:
#refgenome BY4741

// +-----------------------------------------------------------------+
// | Pragma: #stitch                                                 |
// | Function: Indicates that the construct should be created in a   |
// |   single stitch without a multipart marker in between.          |
// +-----------------------------------------------------------------+
// Example:
#stitch
gHO^

// +-----------------------------------------------------------------+
// | Pragma: #linkers ___|___                                        |
// | Function: By default, GSL connects adjacent elements with       |
// |   short overlapping "linkers." This pragma lets users define    |
// |   the linkers and their ordering within the construct.          |
// |   Common Options: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E                 |
// |     *Letter linkers designed to go between promoters and genes  |
// |   Format: first_stitch_linkers | second_stitch_linkers          |
// +-----------------------------------------------------------------+
// Example
#linkers 0,2,A,3,9|0,2,9
uHO ; pTDH3 ; gADH1 ; tCYC1 ; ###; dHO

// +-----------------------------------------------------------------+
// | Pragma: #fuse                                                   |
// | Function: instructs GSL to create a scarless connection without |
// |   a linker between two parts.                                   |
// +-----------------------------------------------------------------+
// Example:
#linkers 0,2,3,9|0,2,9
uHO; pTDH3 {#fuse} ; gADH1 ; tCYC1 ; ###; dHO

// +-----------------------------------------------------------------+
// | Pragma: #seed [integer]                                         |
// | Function: When an inline amino acid sequence is used, the       |
// |   compiler uses a random number generator to determine the      |
// |   exact codon usage when it optimizes into the DNA sequence.    |
// |   Using different seeds for the same amino acid sequence will   |
// |   result in different DNA sequences. The exact same codon       |
// |   optimizations can be recreated by using the same seed.        |
// +-----------------------------------------------------------------+
// Example:
/$MGQYKLILNGKTLKGETT...FER*/ {#seed 2212; #name synthase_v1}
/$MGQYKLILNGKTLKGETT...FER*/ {#seed 4863; #name synthase_v2}
/$MGQYKLILNGKTLKGETT...FER*/ {#seed 5597; #name synthase_v3}
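
The idea behind #seed, that a seeded random number generator makes codon choices repeatable, can be illustrated in Python. This sketch is not GSL's codon optimization: it picks uniformly among synonymous codons, whereas a real optimizer would presumably weight codons by host usage. The `back_translate` function and the three-entry `SYNONYMS` table are assumptions made for the example.

```python
import random

# Synonymous codons for the three amino acids used in this toy example.
SYNONYMS = {
    "M": ["ATG"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "I": ["ATT", "ATC", "ATA"],
}


def back_translate(protein, seed):
    """Choose one codon per amino acid using a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so results are repeatable
    return "".join(rng.choice(SYNONYMS[amino_acid]) for amino_acid in protein)


first_run = back_translate("MTI", seed=2212)
second_run = back_translate("MTI", seed=2212)
print(first_run, second_run)

# The same seed always reproduces the same DNA sequence.
assert first_run == second_run
```

A different seed may or may not give a different sequence for an insert this short; with longer inserts, such as the synthase example above, different seeds become very likely to produce different codon choices.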

Beginners Tutorial Complete!

Hopefully this tour through the most common GSL syntax is enough to get you started designing DNA constructs! Once you are comfortable using the language patterns above, be on the lookout for Intermediate/Advanced tutorials which will introduce more useful syntax to help further abstract away DNA design into higher-level ideas.
