Use of SMILES strings in INTERCHEM, INTERCHEM-pc, and STRAUSS

Introduction
SMILES strings afford an easy way of entering certain types of structures in INTERCHEM, and the facility is available from the first (input and output) menu. It should be emphasised that this may not be the easiest method for all types of structure; the facilities provided in the build package and through the sketch pad might be more suitable in some cases.

Origin of SMILES
The definition of SMILES (Simplified Molecular Input Line Entry System) is contained in three papers by David Weininger and his associates (1, 2, 3).

Basic rules of SMILES as used in INTERCHEM
SMILES is a linear notation for chemical structures, following lines developed in earlier systems such as the Dyson and Wiswesser notations. In contrast to these two, it has one advantage: any string which is syntactically correct should be capable of being interpreted (parsed) by the modelling system to yield a 2D or 3D structure. Both the Dyson and the Wiswesser systems accepted only the unique correct string (the canonical string). SMILES does allow for a canonical string, unique for each structure and suitable for database searching, and a facility for generating such strings from 3D structures is likely to be provided in future editions of INTERCHEM.

The basic rules of SMILES (1) have been modified for use by INTERCHEM, and these modifications are noted below.

SMILES Strings. A SMILES string is written as a single sequence of characters without any space characters. A space character denotes the end of the string.
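The termination rule can be illustrated with a minimal Python sketch. The function name extract_smiles is hypothetical and is not part of INTERCHEM; the sketch simply assumes that everything after the first space on a line is to be ignored.

```python
def extract_smiles(line: str) -> str:
    """Return the characters before the first space in a line of input."""
    # Drop a trailing line ending, then cut at the first space character.
    return line.rstrip("\r\n").split(" ", 1)[0]


print(extract_smiles("CS(=O)(=O)O methanesulphonic acid"))
```

With this convention a name or comment can follow the string on the same line without becoming part of the structure.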

Specifying atoms. Atoms are represented by their atomic symbols. In general, the first or only letter of the symbol is written in upper case; the second letter (if present) must be lower case. Atoms of the "organic set" (B, C, N, O, P, S, F, Cl, Br, and I) are written without brackets. Hydrogen is not usually specified; when it needs to be, the original rules require it to be enclosed in square brackets, [H]. As an extension to the rules, INTERCHEM allows hydrogen to be specified without square brackets. Square brackets are needed to specify atoms of other elements (e.g. gold [Au]), and where multiple attached hydrogen atoms and/or charges are present (e.g. ammonium cation [NH4+]). Aromatic atoms (carbon, nitrogen, oxygen, and phosphorus) are specified by use of lower-case letters.
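The atom rules above can be sketched as a small tokenizer in Python. This is an illustration only, not the INTERCHEM parser: the names ATOM_PATTERN and atom_tokens are hypothetical, bracket contents are not validated, and text inside braces (the stereochemical labels described later) is simply skipped.

```python
import re

# Alternatives are tried in order at each position of the string.
ATOM_PATTERN = re.compile(
    r"\{[^}]*\}"             # stereochemical label in braces: skipped
    r"|(\[[^\]]+\])"         # bracket atom, e.g. [Au] or [NH4+]
    r"|(Cl|Br|[BCNOPSFIH])"  # organic set, plus bare H (INTERCHEM extension)
    r"|([cnop])"             # aromatic atoms listed in the text
)


def atom_tokens(smiles: str) -> list:
    """Return the atom symbols found in a SMILES string, in order."""
    tokens = []
    for match in ATOM_PATTERN.finditer(smiles):
        token = match.group(1) or match.group(2) or match.group(3)
        if token:  # brace labels match without a captured group
            tokens.append(token)
    return tokens


print(atom_tokens("c1cc(N(:O):O)ccc1S(=O)(=O)O"))
```

Placing Cl and Br before the single-letter symbols ensures that a two-letter symbol is not read as two separate atoms.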

Specification of Bonds. Single, double, triple, and aromatic bonds are represented by the symbols - (hyphen), = (equals sign), # (hash symbol), and : (colon) respectively. Single bonds need not be specified between atoms that are not aromatic, and aromatic bonds need not be specified between atoms that are aromatic. Nitro groups are coded with "aromatic" bonds between the nitrogen atom and both of the oxygen atoms (see the example of p-nitrobenzenesulphonic acid below).
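A short Python sketch of the bond rules follows. The names BOND_SYMBOLS, AROMATIC_ATOMS, and bond_type are hypothetical. The treatment of an unspecified bond between an aromatic and a non-aromatic atom as single is an assumption, inferred from the examples later in this document rather than stated in the rules.

```python
BOND_SYMBOLS = {"-": "single", "=": "double", "#": "triple", ":": "aromatic"}
AROMATIC_ATOMS = {"c", "n", "o", "p"}  # lower-case symbols listed in the text


def bond_type(atom_a: str, atom_b: str, symbol=None) -> str:
    """Return the bond type between two adjacent atoms."""
    if symbol is not None:
        return BOND_SYMBOLS[symbol]  # an explicit symbol always wins
    if atom_a in AROMATIC_ATOMS and atom_b in AROMATIC_ATOMS:
        return "aromatic"            # implied between two aromatic atoms
    return "single"                  # implied otherwise (assumption for mixed pairs)


print(bond_type("c", "c"), bond_type("C", "C"), bond_type("N", "O", ":"))
```

The nitro group coding mentioned above corresponds to the explicit ":" case, since N and O are written in upper case there.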

Specification of Branches. Branches are specified by enclosing the atoms within parentheses. Branches can be nested to any depth and stacked.
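Nesting and stacking of branches map naturally onto a stack. The Python sketch below is an illustration only: branch_bonds is a hypothetical name, atoms are restricted to single-letter symbols, and bond symbols and ring labels are ignored. It returns the pairs of atom indices that are joined by the chain and its branches.

```python
def branch_bonds(smiles: str) -> list:
    """Return (parent, child) atom-index pairs for a branched, acyclic string."""
    bonds = []
    stack = []        # atoms to return to when a branch closes
    previous = None   # atom to which the next atom will be attached
    count = 0         # number of atoms read so far
    for ch in smiles:
        if ch == "(":
            if previous is None:
                raise ValueError("branch opened before any atom")
            stack.append(previous)
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched closing parenthesis")
            previous = stack.pop()
        elif ch.isalpha():
            if previous is not None:
                bonds.append((previous, count))
            previous = count
            count += 1
        # any other character (e.g. a bond symbol) is ignored in this sketch
    if stack:
        raise ValueError("unclosed parenthesis")
    return bonds


print(branch_bonds("CC(C)(C)(C)"))  # neopentane: every bond involves atom 1
```

A stacked branch re-uses the same parent atom after each closing parenthesis, while a nested branch pushes a second parent onto the stack.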

Cyclic Structures. Cyclic structures are represented by formally breaking one single or aromatic bond in each ring and labeling the atoms that participated in the broken bond with the same integer. The integers 1 to 9 are appended immediately after the relevant atomic symbols. An atom may have more than one such label attached (thus C67 would specify a carbon atom labeled for two ring closures, 6 and 7). Labels can be re-used when no longer "active", so it is possible to specify many more than nine ring closures with single-digit labels. In addition, two-digit labels can be specified by prefacing them with % (the percent sign).
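The pairing of ring-closure labels, including re-use of a label and the % prefix, can be sketched in Python as follows. The function name ring_closures is hypothetical and the sketch is not the INTERCHEM implementation: atoms are restricted to single-letter symbols (so Cl or bracket atoms would be miscounted), and the digit 0 is accepted although the text mentions only 1 to 9.

```python
DIGITS = "0123456789"


def ring_closures(smiles: str) -> list:
    """Return pairs of atom indices joined by ring-closure labels."""
    closures = []
    open_labels = {}  # label -> index of the atom awaiting its partner
    atom = -1         # index of the most recently read atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch.isalpha():
            atom += 1
            i += 1
            continue
        if ch == "%":                      # two-digit label, e.g. %10
            text = smiles[i + 1:i + 3]
            if len(text) != 2 or any(c not in DIGITS for c in text):
                raise ValueError("% must be followed by two digits")
            label, i = int(text), i + 3
        elif ch in DIGITS:                 # single-digit label
            label, i = int(ch), i + 1
        else:                              # bonds, parentheses: ignored here
            i += 1
            continue
        if atom < 0:
            raise ValueError("ring label before any atom")
        if label in open_labels:           # second occurrence closes the ring
            closures.append((open_labels.pop(label), atom))
        else:                              # first occurrence opens it
            open_labels[label] = atom
    if open_labels:
        raise ValueError("unclosed ring labels: %s" % sorted(open_labels))
    return closures


print(ring_closures("CC12C(=O)CC(C1(C)(C))CC2"))  # camphor
```

Because a label is removed from open_labels as soon as it is closed, the same digit can be used again for a later ring.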

Disconnected Structures. The original specification of SMILES allowed disconnected compounds (e.g. ionic structures, salts, etc) to be specified by placing a period (full stop) between the individual structures. This facility is not implemented in INTERCHEM.

Treatment of Aromatic structures. The original SMILES specification requires that aromatic structures can be entered either using aromatic atoms (c, n, etc.) and implied aromatic bonds, or using alternating single and double bonds (e.g. benzene as C1=CC=CC=C1). The alternating-bond form is not available in the INTERCHEM system; aromatic structures must be entered using lower-case letters.

Specification of Stereochemistry. The three papers published by Weininger make only passing mention of stereochemistry. At least four molecular modeling packages have SMILES implementations that handle stereochemistry, and these are all incompatible with each other. They all have defects, so a separate method of specifying stereochemistry has been used in INTERCHEM. The rules are as follows:

(1) All stereochemical designations are enclosed within braces (curly brackets {…}).

(2) The single letters R or S (or the corresponding lower-case letters) denote the configuration of the following atom according to the Cahn-Ingold-Prelog convention. While any number of such designations may be present, INTERCHEM takes account of only the first one specified.

(3) The stereochemical relationship of pairs of atoms is denoted by preceding each atom of a pair by identical groups of characters:

Xnn

where X is one of the characters T or t, C or c, G or g, +, or -, and nn is a one- or two-digit integer.

Atom pairs labeled in this way and separated by two (or more) other atoms define a torsion (or pseudo-torsion) angle subject to the following limitations:

Symbol      Lower limit    Upper limit
T or t      +170°          +190°
C or c      -10°           +10°
G or g      -70°           +70°
+           +50°           +70°
-           -70°           -50°

Each atom may have up to four stereochemical labels (each label must be enclosed in a separate pair of braces).

(4) Hydrogen atoms may be written explicitly in order to specify stereochemistry.
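The stereochemical rules above can be sketched in Python as follows. All names here (TORSION_LIMITS, first_configuration, pair_labels, torsion_allowed) are hypothetical and the code is not taken from INTERCHEM. The limits are copied from the table of symbols above; treating angles that differ by 360° as equivalent (so that -175° satisfies the T window of +170° to +190°) is an assumption.

```python
import re

# (lower, upper) limits in degrees, copied from the table of symbols.
TORSION_LIMITS = {
    "T": (170.0, 190.0),
    "C": (-10.0, 10.0),
    "G": (-70.0, 70.0),
    "+": (50.0, 70.0),
    "-": (-70.0, -50.0),
}

CONFIG_LABEL = re.compile(r"\{([RrSs])\}")
PAIR_LABEL = re.compile(r"\{([TtCcGg+-])(\d{1,2})\}")


def first_configuration(smiles: str):
    """Return 'R' or 'S' from the first such label, or None (rule 2)."""
    match = CONFIG_LABEL.search(smiles)
    return match.group(1).upper() if match else None


def pair_labels(smiles: str) -> list:
    """Return (symbol, number) for every Xnn label in the string (rule 3)."""
    return [(sym.upper(), int(num)) for sym, num in PAIR_LABEL.findall(smiles)]


def torsion_allowed(symbol: str, angle: float) -> bool:
    """Check an angle in degrees against the window for the given symbol."""
    lower, upper = TORSION_LIMITS[symbol.upper()]
    # Shift the angle by multiples of 360 into the range [lower, lower + 360).
    shifted = lower + (angle - lower) % 360.0
    return shifted <= upper


example = "C{R}C1({T1}H)C({T1}H)(C)CCCC1"
print(first_configuration(example), pair_labels(example))
print(torsion_allowed("T", -175.0), torsion_allowed("-", 60.0))
```

Two atoms carrying the same symbol and number, such as the two {T1} hydrogens in the example, form the pair whose torsion angle is constrained.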

Examples of SMILES Strings.

Name                                  SMILES string
n-Butane                              CCCC
Neopentane                            CC(C)(C)(C)
Benzene                               c1ccccc1
Methanesulphonic acid                 CS(=O)(=O)O
p-Nitrobenzenesulphonic acid          c1cc(N(:O):O)ccc1S(=O)(=O)O
Cyclohexane                           C1CCCCC1
1,4-Dimethylcyclohexane               CC1CCC(C)CC1
Camphor                               CC12C(=O)CC(C1(C)(C))CC2
Hexahelicene                          c1ccc2ccc3ccc4ccc5ccc6ccccc6c5c4c3c2c1
Diazepam                              c12ccc(Cl)cc1C(c3ccccc3)=NCC(=O)N2C
(1R)-trans-1,2-Dimethylcyclohexane    C{R}C1({T1}H)C({T1}H)(C)CCCC1
(1S)-cis-1,2-Dimethylcyclohexane      C{S}C1({G1}H)C({G1}H)(C)CCCC1

Using SMILES in INTERCHEM
Input of structures via SMILES strings is available from the first (input-output) menu of INTERCHEM.

(1) Select from the menu Build New Structure using SMILES

A large text window appears together with a popup menu

(2) Select from this menu Enter New SMILES String

(3) Type (in the text window) the SMILES string, followed by <ENTER>

If the string is syntactically correct, it is accepted by the program and processed using ICMECH. The progress of the optimization is shown in the text window. Finally a new popup menu appears.

(4) Select from this menu Deposit in Structure Area –-A- (or one of the alternatives).

(5) Select EXIT.

The text window disappears and the structure will be drawn as a numbered wire-frame diagram in the appropriate window, where it can be treated in the usual ways.

If the SMILES string is rejected by the program, this will be indicated by an error message, and the part of the string, up to the point where the error occurred, will be displayed. The first popup menu re-appears with an extra option Correct the Previous SMILES String. If this is selected the correct part of the string can be grabbed using the "mouse editing" technique, re-entered into the text window and the string completed by typing in the correction. If the string is now accepted, the above sequence of steps can be continued from (4).

Even after a structure has been generated successfully, it may not be the one desired; it might, for example, be the wrong conformer. In these circumstances, it is possible to attempt the generation of a lower energy conformer. In general, the last SMILES string entered will have been retained by the program. Go back to the input-output menu, reselect Build New Structure using SMILES, and then select from the next menu either Lowest Energy Structure from Previous SMILES String or Store Series of Conformers from Previous SMILES String. In both cases a further menu requests the number of structures required (4, 8, or 16). The string is then processed the required number of times. If the option to select the lowest energy conformer has been chosen, this conformer will be left in the system and can be processed from step (4) above. If the option to retain a number of structures was selected, you will be required to enter a root file name (five characters). The required number of structures will be retained as a series of INTERCHEM D-files. These may be examined using the Read Series of Stored Data Files option from the input-output menu.

Using SMILES in INTERCHEM-pc
The SMILES input facility is available on the first menu in INTERCHEM-pc. The basic steps are similar to those used above but the standard Windows operating tools facilitate the process. On-line help is provided.

Using SMILES in STRAUSS-cmd
When entering a single SMILES string into the command line version of STRAUSS it is essential to enclose the entire string in double quotes (e.g. "CS(=O)(=O)O" for methanesulphonic acid). The reason for this is that the UNIX shell would otherwise treat the parentheses as special characters and reject the command. This precaution does not need to be taken when using the menu driven version of STRAUSS, or when using STRAUSS-cmd for processing SMILES strings contained in an input file.
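When a command line is assembled by a script rather than typed by hand, the quoting can be left to the standard Python shlex module. The sketch below is an illustration only: quote_smiles is a hypothetical helper, and no STRAUSS-cmd command name or option is assumed. Note that shlex.quote uses single quotes rather than double quotes; both protect parentheses in a POSIX shell.

```python
import shlex


def quote_smiles(smiles: str) -> str:
    """Return the SMILES string quoted for safe use on a POSIX shell command line."""
    return shlex.quote(smiles)


print(quote_smiles("CS(=O)(=O)O"))  # quoted, because of the parentheses
print(quote_smiles("CCCC"))         # left unchanged, nothing to protect

# shlex.split mimics shell word splitting: the double quotes recommended
# above are removed and the parentheses reach the program intact.
print(shlex.split('"CS(=O)(=O)O"'))
```

Strings containing #, braces, or square brackets are quoted in the same way, so the triple-bond symbol and the stereochemical labels are also protected.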

References.

(1) SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Coding Rules. David Weininger, J. Chem. Inf. Comput. Sci., 1988, 28, 31.

(2) SMILES. 2. Algorithm for Generation of Unique SMILES Notation. David Weininger, Arthur Weininger, and Joseph Weininger, J. Chem. Inf. Comput. Sci., 1989, 29, 97.

(3) SMILES. 3. Depict. Graphical Depiction of Chemical Structures. David Weininger, J. Chem. Inf. Comput. Sci., 1990, 30, 237.

Revised 18th June 2007