General Graph Syntax

Overview

The first resolution of the CGsmiles notation captures the coarsest representation of a molecule. The syntax is adapted from the SMILES notation and can represent arbitrary graphs; these graphs need not be molecules, although the syntax is geared towards them. The basic syntax features are sufficient to write a CGsmiles string for any connected graph. The advanced syntax features add three things: a multiplication operator that reduces verbosity, bond order annotations, which are important at the atomic resolution and when resolving multiple resolutions, and a general annotation syntax for writing node labels.

Basic Syntax Features

A CGsmiles string describes each node of a graph together with the connections between nodes. Nodes and their connections are represented as follows:

Nodes

Each node is described as # followed by an alphanumeric identifier, enclosed in square brackets. For example, a node named A is represented as [#A].

Edges

Edges are connections between nodes. At the atomic resolution they are covalent or coordination bonds. At any other resolution they simply describe the connectivity between nodes.

Nodes that follow each other in the string are assumed to be connected by an edge. For example, to denote that nodes A and B are connected, you would write [#A][#B].

Example: [#A][#B] denotes nodes A and B connected directly.

Branches

Branches allow the representation of complex branching structures within molecules. Branches are indicated by enclosing them in parentheses. For instance, to connect node D to node B in a sequence from A to C, the notation would be [#A][#B]([#D])[#C].

Example: [#A][#B]([#D])[#C] shows a branch with D connected to B.

Rings and Non-linear Edges

This feature allows the description of rings and other complex topologies. Rings are indicated by integers following the node identifiers. An edge will connect nodes marked with the same integer. For example, a triangle connecting nodes A, B, and C would be written as [#A]1[#B][#C]1.

Example: [#A]1[#B][#C]1 forms a triangular ring structure.

String Encapsulation

For clarity and to define boundaries, CGsmiles strings are enclosed in curly braces.

Example: {[#A][#B]([#D])[#C]}
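The basic features above (nodes, implicit edges, branches, ring markers, and the enclosing braces) can be illustrated with a small hand-written parser. This is only a sketch for reading the notation, not the parser shipped with CGsmiles: the function name parse_basic is hypothetical, and the sketch assumes single-digit ring markers and purely alphanumeric node names.

```python
import re

# One token per match: a node "[#name]", a branch parenthesis, or a ring marker.
# Assumption of this sketch: ring markers are single digits.
TOKEN = re.compile(r"\[#(?P<node>[A-Za-z0-9]+)\]|(?P<paren>[()])|(?P<ring>\d)")


def parse_basic(cgsmiles):
    """Return (node_names, edges) for a string using only the basic syntax."""
    if not (cgsmiles.startswith("{") and cgsmiles.endswith("}")):
        raise ValueError("string must be enclosed in curly braces")
    body = cgsmiles[1:-1]

    nodes = []        # node names; the list index serves as the node id
    edges = []        # (i, j) pairs of node ids
    prev = None       # id of the node the next node attaches to
    anchors = []      # stack of nodes at which branches were opened
    open_rings = {}   # ring marker -> id of the node that opened it

    pos = 0
    while pos < len(body):
        match = TOKEN.match(body, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.group("node"):
            # Consecutive nodes are connected by an implicit edge.
            nodes.append(match.group("node"))
            current = len(nodes) - 1
            if prev is not None:
                edges.append((prev, current))
            prev = current
        elif match.group("paren") == "(":
            # Remember the anchor so we can return to it after the branch.
            anchors.append(prev)
        elif match.group("paren") == ")":
            prev = anchors.pop()
        else:
            # Ring marker: the second occurrence closes the edge.
            marker = match.group("ring")
            if marker in open_rings:
                edges.append((open_rings.pop(marker), prev))
            else:
                open_rings[marker] = prev
    return nodes, edges


print(parse_basic("{[#A][#B]([#D])[#C]}"))
# (['A', 'B', 'D', 'C'], [(0, 1), (1, 2), (1, 3)])
```

Applied to the ring example {[#A]1[#B][#C]1}, the same function returns the edges (0, 1), (1, 2) and (0, 2), i.e. the triangle.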

Advanced Syntax Features

Bond orders

One can specify a bond order for edges between nodes. At the atomic resolution these bond orders describe the order of covalent bonds as in SMILES. There are five bond order symbols, one for each bond order from 0 to 4: ‘.’ (0), ‘-’ (1), ‘=’ (2), ‘#’ (3), and ‘$’ (4). If the edge is implicit, the bond order symbol must be placed between the two nodes:

Zero bond order
Example: {[#A].[#B]}

If the symbol refers to a ring bond, it must be placed between the node and the ring marker:

Zero bond order only for the ring bond between A and C
Example: {[#A].1[#B][#C]1}

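For branches, the bond order symbol must be placed between the node and the opening parenthesis of the branch if it refers to the edge to the first node in the branch. If it refers to the edge to the node that follows the branch, it must be placed after the closing parenthesis.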

Zero bond order between A and B
Example: {[#A].([#B][#C])[#D]}
Zero bond order between A and B and between A and D
Example: {[#A].([#B][#C]).[#D]}

The meaning of bond orders at the atomic resolution is well defined. At coarse resolutions, bond orders may be used to describe virtual edges (i.e. bond order 0). Virtual edges have no corresponding connectivity between the nodes at the atomic resolution. Additionally, bond orders of 1-4 are used to indicate that rings at the finer resolution are mapped to linear graphs at the coarse level. See the section Linearized rings under Layering Resolutions.
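The placement rules for bond order symbols can be checked with a short sketch that attaches an order to every edge. It is an illustration under stated assumptions, not the CGsmiles implementation: the names BOND_ORDERS and parse_with_bond_orders are hypothetical, ring markers are assumed to be single digits, edges without a symbol are assumed to have order 1, and characters the pattern does not recognize are silently skipped.

```python
import re

# Symbol -> bond order, as listed in the text above.
BOND_ORDERS = {".": 0, "-": 1, "=": 2, "#": 3, "$": 4}

# Nodes are tried first, so the "#" inside "[#A]" is never read as a bond.
TOKEN = re.compile(
    r"\[#(?P<node>[A-Za-z0-9]+)\]|(?P<paren>[()])|(?P<ring>\d)|(?P<bond>[.\-=#$])"
)


def parse_with_bond_orders(cgsmiles):
    """Return (node_names, {(i, j): order}) for a CGsmiles string."""
    body = cgsmiles.strip("{}")
    names, edges = [], {}
    prev, anchors, open_rings = None, [], {}
    pending = None  # bond order waiting for the next edge, if any

    for match in TOKEN.finditer(body):
        if match.group("bond"):
            pending = BOND_ORDERS[match.group("bond")]
        elif match.group("node"):
            names.append(match.group("node"))
            current = len(names) - 1
            if prev is not None:
                # Assumption: an edge without a symbol has order 1.
                edges[(prev, current)] = 1 if pending is None else pending
            pending = None
            prev = current
        elif match.group("paren") == "(":
            # A pending symbol stays pending and applies to the first branch node.
            anchors.append(prev)
        elif match.group("paren") == ")":
            prev = anchors.pop()
        else:
            marker = match.group("ring")
            if marker in open_rings:
                start, order = open_rings.pop(marker)
                if pending is not None:
                    order = pending
                edges[(start, prev)] = 1 if order is None else order
            else:
                # Store the symbol with the ring opening; it belongs to the ring bond.
                open_rings[marker] = (prev, pending)
            pending = None
    return names, edges


print(parse_with_bond_orders("{[#A].1[#B][#C]1}")[1])
# {(0, 1): 1, (1, 2): 1, (0, 2): 0}
```

In the ring example only the ring bond between A and C receives order 0, while the implicit edges A-B and B-C keep the assumed default of 1.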

Annotations

Some important information is not encoded in the graph representation of a molecule, for example charges or chirality. CGsmiles supports a general annotation syntax that lets users store this kind of information as symbol=value pairs. Any node name may be followed by one or more of these pairs, separated by semicolons. For example, to specify that node A has a charge of 1 but node B has none, one can write:

Example: {[#A;q=1][#B;q=0]}

We could also specify the mass in addition to the charge.

Example: {[#A;q=1;mass=72][#B;q=0;mass=36]}

The symbol is a string of arbitrary length, though one-letter strings are most convenient for the sake of brevity.

Some symbols are predefined and work like arguments to a Python function: each has a default value, and the symbol may be omitted if all preceding positions are filled. For example, charge q and weight w are predefined symbols for any coarse resolution. A weight can be set either with the symbol, as in [#A;w=0.5], or without it by also giving the charge value that precedes it, as in [#A;0;0.5]. Because the charge is the first predefined symbol, the strings [#A;0] and [#A;q=0] are identical.

Additionally, these symbols are converted to longer keywords upon reading. For example, the symbol q gets assigned the keyword charge. A set of such symbols is named a dialect and can be specified using the functionality in the dialect module. Note that currently dialects are not easily accessible for modification.
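The positional and keyword forms, and the conversion of symbols to longer keywords, can be sketched as follows. This is not the code of the dialect module: COARSE_DIALECT and parse_node are hypothetical names, the dialect contents are copied from the reserved-symbol table, and filling in defaults for omitted symbols and keeping user-defined values as strings are assumptions of this sketch.

```python
# Assumed "coarse" dialect: symbol -> (keyword, default), in positional order.
COARSE_DIALECT = {"q": ("charge", 0.0), "w": ("weight", 1.0)}


def parse_node(node, dialect=COARSE_DIALECT):
    """Split a node such as '[#A;q=1;mass=72]' into its name and annotations."""
    name, *fields = node.strip("[]").lstrip("#").split(";")
    symbols = list(dialect)
    # Start from the defaults of all predefined symbols.
    attrs = {keyword: default for keyword, default in dialect.values()}
    for position, field in enumerate(fields):
        if "=" in field:
            symbol, value = field.split("=", 1)
        else:
            # No symbol given: the position selects the predefined symbol.
            symbol, value = symbols[position], field
        if symbol in dialect:
            # Predefined symbols are stored under their longer keyword.
            attrs[dialect[symbol][0]] = float(value)
        else:
            # User-defined symbol: kept under its own name, as a string.
            attrs[symbol] = value
    return name, attrs


print(parse_node("[#A;0;0.5]"))
# ('A', {'charge': 0.0, 'weight': 0.5})
print(parse_node("[#A;q=1;mass=72]"))
# ('A', {'charge': 1.0, 'weight': 1.0, 'mass': '72'})
```

With this sketch, parse_node("[#A;0]") and parse_node("[#A;q=0]") return the same result, mirroring the equivalence described above.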

CGsmiles comes with two predefined dialects. One is used for coarse resolution fragments and graphs, the other for those at atomic resolution. The table below lists the symbols of both dialects. Note that it is always permissible to use the keyword explicitly.

Reserved Annotation Symbols

Symbol  Resolution  Keyword    Type    Example                       Default
------  ----------  ---------  ------  ----------------------------  -------
q       coarse      charge     float   {[#A;q=1]} or {[#A;1]}        0.0
w       coarse      weight     float   {[#A;w=0.5]} or {[#A;0;0.5]}  1.0
w       atomic      weight     float   same as above                 1.0
x       atomic      chirality  S or R  {#frag=Br[C;x=S]ClI}          None

Multiplication Operator

To efficiently represent repeated units in large molecules, such as polymers, CGsmiles syntax includes a multiplication operator |. This operator can be applied after a node or a branch to repeat it a specified number of times.

  • Node Multiplication: The multiplication operator is placed after a node and followed by an integer indicating the number of repetitions. For example, [#A]|5 represents five consecutive nodes of type A, which is equivalent to writing [#A][#A][#A][#A][#A].

Example: [#A]|5 simplifies the representation of five A nodes.
  • Branch Multiplication: When the multiplication operator is placed after a branch, the entire branch including the anchoring node is repeated as specified. This feature is particularly useful for describing structures like graft polymers. For instance, [#A]([#B][#B])|5 represents a chain of five units where each unit starts with node A followed by two B nodes.

Example: [#A]([#B][#B])|5 describes repeated branches of [#B][#B] anchored to [#A].
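The effect of the multiplication operator can be shown by expanding it back into the verbose form. The sketch below works purely on the string and covers only the two cases described above, a node or a node with one branch followed by |n. The name expand_multiplication is hypothetical, and nested branches are assumed not to occur.

```python
import re

# A node "[#...]", optionally followed by one branch without nested
# parentheses, then the operator "|" and the number of repetitions.
REPEAT = re.compile(r"(\[#[^\]]+\](?:\([^()]*\))?)\|(\d+)")


def expand_multiplication(cgsmiles):
    """Replace every 'unit|n' by n consecutive copies of the unit."""
    return REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), cgsmiles)


print(expand_multiplication("{[#A]|5}"))
# {[#A][#A][#A][#A][#A]}
print(expand_multiplication("{[#A]([#B][#B])|3}"))
# {[#A]([#B][#B])[#A]([#B][#B])[#A]([#B][#B])}
```

In the second case the anchoring node A is repeated together with its branch, so the A nodes form the backbone and each carries a [#B][#B] side chain, as in a graft polymer.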

Syntax Features Lookup Table

The table below is a quick reference for the essential features of the CGsmiles syntax:

Feature         Description                                                      Example
--------------  ---------------------------------------------------------------  --------------------------------------------------------
Nodes           Represent nodes in the graph.                                    [#A]
Edges           Direct connections between nodes.                                [#A][#B]
Branches        Indicate branching off the main chain.                           [#A][#B]([#D])[#C]
Rings           Describe rings and non-linear connections.                       [#A]1[#B][#C]1
Encapsulation   Enclose CGsmiles strings for clarity.                            {[#a][#b]([#d])[#c]}
Bond Orders     Specify the bond order (0-4): 0 = ., 1 = -, 2 = =, 3 = #, 4 = $  {[#a]=[#b]} (double bond); {[#a].[#b]} (zero order bond)
Annotations     Store node labels as key-value pairs.                            {[#A;q=1][#B;q=0]} (charges q)
Multiplication  Repeat a node or branch a specified number of times.             [#A]|5, [#A]([#B][#B])|5