Chirality, Isomerism & Aromaticity¶
When transitioning between CG and atomistic representations, certain atomistic features have no direct counterparts in CG models and require special treatment.
Implicit Hydrogen¶
The simplest case is the treatment of implicit hydrogen atoms. SMILES allows for
shorthand notation where hydrogen atoms can be omitted and CGsmiles adopts this
approach. Hydrogen atoms are automatically assigned once the full atomistic
molecule is resolved. This procedure ensures proper handling of any unconsumed
bonding operators, which are interpreted as additional hydrogen atoms where
applicable. However, hydrogen atoms requiring specific annotations, such as a
weight (e.g. [H;w=0.5]) must be explicitly included.
cis/trans Isomerism¶
cis and trans isomers are distinguished using a ‘/’ or ‘’ between atoms to indicate their relative orientation around a double bond, following the OpenSMILES definition. A pair of these symbols defines the isomerism of the two atoms as outlined in Table S2 of the main paper describing the syntax. We note that this notation is permutation invariant, i.e. when double bond substituents are split across fragments, the relative position needs to be assigned only once as if constructing the complete SMILES string.
Chirality¶
CGsmiles adopts an explicit method of chirality assignment using annotations. A
chiral atom can be annotated using the x keyword as shorthand for chirality.
For example, S-Alanine is represented as C[C;x=S]C(=O)ON, while R-Alanine is
written as C[C;x=R]C(=O)ON. The x may be omitted if a weight is defined
beforehand, such as in C[C;1;S]C(=O)ON, which is also valid. Consult the
general annotation syntax for more information.
Aromaticity¶
In SMILES, aromaticity is encoded using lowercase letters as a shorthand for aromatic atoms or a colon as a marker for aromatic bonds. CGsmiles utilizes the same convention. In addition, aromatic systems may also be split across multiple fragments by simply keeping the shorthand. For example, Martini Benzene is represented as:
{[#TC5]1[#TC5][#TC5]1}.{#TC5=[$]cc[$]}
Although the shorthand for aromaticity is well-defined, its interpretation in SMILES remains somewhat ambiguous. To ensure unambiguous valance assignment, necessary for tasks like adding hydrogen atoms, CGsmiles employs the following definition: only atoms capable of participating in delocalization-induced molecular equivalence (i.e., systems where multiple resonance structures can be drawn without introducing charges) are considered aromatic. By this definition Benzene is aromatic but thiophene is not. CGsmiles uses the same definition as Pysmiles package, which provides a more detailed discussion of this topic. To enhance user-friendliness, the CGsmiles API automatically corrects strings with incorrectly assigned aromaticity at the time of reading. If corrections cannot be made unambiguously, an error is raised, ensuring robust and accurate handling of aromaticity.