General Graph Syntax
Overview
The first resolution of the CGsmiles notation captures the coarsest representation of a molecule. The syntax is adapted from the SMILES notation and can represent arbitrary graphs; these graphs need not be molecules, although the syntax is geared towards them. The basic syntax features are sufficient to write a CGsmiles string for any connected graph. The advanced syntax features add three things: a multiplication operator that reduces verbosity, bond order annotations, which matter at the atomic resolution and when working with multiple resolutions, and a general annotation syntax for writing node labels.
Basic Syntax Features
A CGsmiles string describes a graph by listing its nodes and marking how they are connected. Nodes and their connections are written as follows:
Nodes
Each node is described as # followed by an alphanumeric
identifier, enclosed in square brackets. For example, a node named A
is represented as [#A].
Edges
Edges are connections between nodes. At the atomic resolution they are covalent or coordination bonds. At any other resolution they simply describe the connectivity between nodes.
Nodes that follow each other in the string are assumed to be connected by an
edge. For example, to denote that nodes A and B are connected, you would
write [#A][#B].
Example: [#A][#B] denotes nodes A and B connected directly.
Branches
Branches allow the representation of complex branching structures within
molecules. Branches are indicated by enclosing them in parentheses. For
instance, to connect node D to node B in a sequence from A to C, the notation
would be [#A][#B]([#D])[#C].
Example: [#A][#B]([#D])[#C] shows a branch with D connected to B.
Rings and Non-linear Edges
This feature allows the description of rings and other complex topologies. Rings
are indicated by integers following the node identifiers.
An edge will connect nodes marked with the same integer. For example, a triangle
connecting nodes A, B, and C would be written as [#A]1[#B][#C]1.
Example: [#A]1[#B][#C]1 forms a triangular ring structure.
String Encapsulation
For clarity and to define boundaries, CGsmiles strings are enclosed in curly braces.
Example: {[#A][#B]([#D])[#C]}
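The basic rules above (consecutive nodes are connected, parentheses open and close branches, matching integers close rings) can be illustrated with a small stand-alone reader. This is only a sketch for illustration and not the parser shipped with CGsmiles: the function name parse_basic is hypothetical, ring markers are assumed to be single digits, and bond orders, annotations and the multiplication operator are not handled.

```python
import re

# One token per match: a node such as [#A], a parenthesis, or a ring-marker digit.
TOKEN = re.compile(r"\[#(\w+)\]|([()])|(\d)")


def parse_basic(cgsmiles):
    """Return (node_names, edges) for a string that uses only the basic syntax."""
    if not (cgsmiles.startswith("{") and cgsmiles.endswith("}")):
        raise ValueError("expected a string enclosed in curly braces")
    body = cgsmiles[1:-1]

    nodes = []       # node names; the list index serves as the node id
    edges = []       # pairs of node ids
    prev = None      # id of the node that the next node attaches to
    anchors = []     # stack of anchor nodes for branches that are still open
    open_rings = {}  # ring marker -> id of the node that opened it

    pos = 0
    while pos < len(body):
        match = TOKEN.match(body, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        name, paren, ring = match.groups()
        if name is not None:
            # a new node is connected to the node written before it
            nodes.append(name)
            current = len(nodes) - 1
            if prev is not None:
                edges.append((prev, current))
            prev = current
        elif paren == "(":
            # remember where the branch is anchored
            anchors.append(prev)
        elif paren == ")":
            # continue the main chain from the anchor
            prev = anchors.pop()
        elif ring in open_rings:
            # second occurrence of a marker closes the ring
            edges.append((open_rings.pop(ring), prev))
        else:
            # first occurrence of a marker opens the ring
            open_rings[ring] = prev
        pos = match.end()
    return nodes, edges


branch_nodes, branch_edges = parse_basic("{[#A][#B]([#D])[#C]}")
ring_nodes, ring_edges = parse_basic("{[#A]1[#B][#C]1}")
```

For the branch example, D and C both end up attached to B; for the ring example, the last edge connects C back to A.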
Advanced Syntax Features
Bond orders
One can specify a bond order for edges between nodes. At the atomic resolution these bond orders describe the order of covalent bonds as in SMILES. There are five bond order symbols, which specify the bond orders 0 to 4 (‘.’, ‘-’, ‘=’, ‘#’, ‘$’). If the bond is implicit, the bond order symbol must be placed between the two nodes:
Zero bond order
Example: {[#A].[#B]}
If the symbol refers to a ring bond, it must be placed between the node and the ring marker:
Zero bond order only between A and C
Example: {[#A].1[#B][#C]1}
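For branches, the bond order symbol must be placed between the node and the opening parenthesis of the branch if it refers to the bond to the first node in the branch. If it refers to the bond to the node that follows the branch, it must be placed after the closing parenthesis.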
Zero bond order between A and B
Example: {[#A].([#B][#C])[#D]}
Zero bond order between A and D
Example: {[#A].([#B][#C]).[#D]}
The meaning of bond orders at the atomic resolution is well defined. At coarse resolutions, bond order 0 may be used to describe virtual edges, that is, edges with no corresponding connectivity between the nodes at the atomic resolution. Additionally, bond orders of 1-4 are used to indicate that rings at the finer resolution are mapped to linear graphs at the coarse level. See Linearized rings in the section Layering Resolutions.
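The mapping from symbols to bond orders can be written down as a small lookup table. The sketch below reads only linear strings (no branches or rings) and is not part of the CGsmiles package; the function name linear_bond_orders is hypothetical, and treating an unmarked edge as bond order 1, as in SMILES, is an assumption.

```python
import re

# The five bond order symbols and the orders they stand for.
BOND_ORDERS = {".": 0, "-": 1, "=": 2, "#": 3, "$": 4}

# A node such as [#A], or a single bond order symbol written between nodes.
TOKEN = re.compile(r"\[#(\w+)\]|([.\-=#$])")


def linear_bond_orders(cgsmiles, default_order=1):
    """Return (node_a, node_b, order) triples for a linear CGsmiles string.

    default_order is an assumption: it is used for edges without a symbol.
    """
    body = cgsmiles.strip("{}")
    edges = []
    prev = None     # name of the previous node
    pending = None  # bond order read since the previous node, if any
    pos = 0
    while pos < len(body):
        match = TOKEN.match(body, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        name, symbol = match.groups()
        if symbol is not None:
            # a symbol applies to the edge to the next node
            pending = BOND_ORDERS[symbol]
        else:
            if prev is not None:
                order = default_order if pending is None else pending
                edges.append((prev, name, order))
            prev = name
            pending = None
        pos = match.end()
    return edges


virtual = linear_bond_orders("{[#A].[#B]}")
mixed = linear_bond_orders("{[#A]=[#B][#C]}")
```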
Annotations
Some important information is not encoded by the graph representation
of a molecule, for example charges or chirality.
CGsmiles supports a general annotation syntax, which allows users to store
this kind of information in the form of symbol=value pairs. Any node
name may be followed by one or more of these symbol=value pairs, separated by
semicolons. For example, to specify that node A has a charge of 1 but node
B has none, one can write:
Example: {[#A;q=1][#B;q=0]}
We could also specify the mass in addition to the charge.
Example: {[#A;q=1;mass=72][#B;q=0;mass=36]}
The symbol is a string of arbitrary length, though one-letter strings are most convenient for the sake of brevity.
Some symbols are predefined and work like arguments to a
Python function: each has a default value, and the symbol may
be omitted if all preceding positions are filled. For example, charge q and
weight w are predefined symbols for any coarse resolution. One
can define a weight either by providing the symbol, as in [#A;w=0.5], or by
omitting it and filling the charge position first with its default value, as in
[#A;0;0.5]. Because the charge is the first predefined symbol, the
strings [#A;0] and [#A;q=0] are identical.
Additionally, these symbols are converted to longer keywords upon reading. For example, the symbol q is assigned the keyword charge. A set of such symbols is called a dialect and can be specified using the functionality in the dialect module. Note that dialects are currently not easily accessible for modification.
CGsmiles comes with two predefined dialects. One is used for coarse resolution fragments / graphs and the other for those of atomic resolution. The table below lists the specifications of their keywords. Note that it is always permissible to use the keyword explicitly.
Reserved Annotation Symbols
| Symbol | Resolution | Keyword | Type | Example | Default |
|---|---|---|---|---|---|
| q | coarse | charge | float | {[#A;q=1]} or {[#A;1]} | 0.0 |
| w | coarse | weight | float | {[#A;w=0.5]} or {[#A;0;0.5]} | 1.0 |
| w | atomic | weight | float | same as above | 1.0 |
| x | atomic | chirality | S or R | {#frag=Br[C;x=S]ClI} | None |
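The positional and keyword behaviour of the coarse-resolution symbols in the table above can be mimicked in a few lines. This is a sketch only and does not use the dialect module: the names COARSE_DIALECT and parse_node are hypothetical, rejecting a positional value after a symbol=value pair is an assumption borrowed from Python function calls, and user-defined values such as mass are kept as plain strings here.

```python
# Coarse-resolution symbols from the table above: symbol -> (keyword, default).
# The order of the entries is the positional order.
COARSE_DIALECT = {"q": ("charge", 0.0), "w": ("weight", 1.0)}


def parse_node(node, dialect=COARSE_DIALECT):
    """Split a node such as '[#A;0;0.5]' into its name and annotations."""
    name, *fields = node.strip("[]").lstrip("#").split(";")
    symbols = list(dialect)
    keywords = {keyword for keyword, _ in dialect.values()}
    # start from the default value of every predefined symbol
    annotations = {keyword: default for keyword, default in dialect.values()}
    seen_pair = False
    for position, field in enumerate(fields):
        if "=" in field:
            symbol, value = field.split("=", 1)
            seen_pair = True
        else:
            # assumption: positional values must come first, as in Python
            if seen_pair or position >= len(symbols):
                raise ValueError(f"cannot assign positional value {field!r}")
            symbol, value = symbols[position], field
        if symbol in dialect:
            # short symbol is converted to its longer keyword
            annotations[dialect[symbol][0]] = float(value)
        elif symbol in keywords:
            # the keyword itself may also be written explicitly
            annotations[symbol] = float(value)
        else:
            # user-defined annotation, kept as the raw string
            annotations[symbol] = value
    return name, annotations


positional = parse_node("[#A;0;0.5]")
by_symbol = parse_node("[#A;w=0.5]")
with_mass = parse_node("[#A;q=1;mass=72]")
```

Both [#A;0;0.5] and [#A;w=0.5] yield the same annotations in this sketch, which mirrors the equivalence described in the text.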
Multiplication Operator
To efficiently represent repeated units in large molecules, such as polymers,
CGsmiles syntax includes a multiplication operator |. This operator can be
applied after a node or a branch to repeat it a specified number of times.
Node Multiplication: The multiplication operator is placed after a node and followed by an integer indicating the number of repetitions. For example,
[#A]|5 represents five consecutive nodes of type A, which is equivalent to writing [#A][#A][#A][#A][#A].
Example: [#A]|5 simplifies the representation of five A nodes.
Branch Multiplication: When the multiplication operator is placed after a branch, the entire branch including the anchoring node is repeated as specified. This feature is particularly useful for describing structures like graft polymers. For instance,
[#A]([#B][#B])|5 represents a chain of five units where each unit starts with node A followed by two B nodes.
Example: [#A]([#B][#B])|5 describes repeated branches of [#B][#B] anchored to [#A].
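One way to picture the operator is as plain text expansion: the unit in front of | is written out the given number of times. The sketch below does exactly that and is not how the CGsmiles package itself resolves the operator; the function names are hypothetical, and it assumes that | directly follows either a node or the closing parenthesis of a branch.

```python
import re


def _unit_start(text):
    """Return the index at which the repeat unit ending at the end of text begins."""
    i = len(text) - 1
    # step back over every branch that ends here, matching parentheses
    while i >= 0 and text[i] == ")":
        depth = 0
        while i >= 0:
            if text[i] == ")":
                depth += 1
            elif text[i] == "(":
                depth -= 1
                if depth == 0:
                    break
            i -= 1
        i -= 1  # move to the character in front of the "("
    if i < 0 or text[i] != "]":
        raise ValueError("multiplication must follow a node or a branch")
    # the unit starts at the opening bracket of the (anchoring) node
    return text.rindex("[", 0, i)


def expand_multiplication(cgsmiles):
    """Write out every '|n' in the string as n copies of the preceding unit."""
    out = ""
    pos = 0
    for match in re.finditer(r"\|(\d+)", cgsmiles):
        out += cgsmiles[pos:match.start()]
        count = int(match.group(1))
        start = _unit_start(out)
        out = out[:start] + out[start:] * count
        pos = match.end()
    return out + cgsmiles[pos:]


five_nodes = expand_multiplication("{[#A]|5}")
two_grafts = expand_multiplication("{[#A]([#B][#B])|2}")
```

Because each match is expanded against the text already written out, a multiplication inside a branch is resolved before one applied to the branch as a whole.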
Syntax Features Lookup Table
The table below is a quick reference for the essential features of the CGsmiles syntax:
| Feature | Description | Example |
|---|---|---|
| Nodes | Represent nodes in the graph. | [#A] |
| Edges | Direct connections between nodes. | [#A][#B] |
| Branches | Indicate branching off the main chain. | [#A][#B]([#D])[#C] |
| Rings | Describe rings and non-linear connections. | [#A]1[#B][#C]1 |
| Encapsulation | Enclose CGsmiles strings for clarity. | {[#a][#b]([#d])[#c]} |
| Bond Orders | Specify the bond order (0-4) of an edge between nodes. 0 = ., 1 = -, 2 = =, 3 = #, 4 = $ | {[#a]=[#b]} (double bond); {[#a].[#b]} (zero order bond) |
| Annotations | Store node labels as key value pairs. | {[#A;q=1][#B;q=0]} (charges q) |
| Multiplication | Repeat a node or branch a specified number of times. | [#A]\|5, [#A]([#B][#B])\|5 |