Searching for Fragments using Smarts Strings

On the last page you saw how to search for molecules using their smiles strings. Smarts strings provide a much more powerful way to search for molecules, or for fragments within molecules.

They can be viewed as a “regular expression” for chemical structures. This blog post gives an excellent introduction to what they are and how they can be used.

You can search for matches using a smarts string just by putting smarts before the query, e.g.

>>> mols = sr.load(sr.expand(sr.tutorial_url, "ala.crd", "ala.top"))
>>> print(mols["smarts [#6]"])
AtomMatch( size=6
0: [1] CH3:2
1: [1] C:5
2: [1] CA:9
3: [1] CB:11
4: [1] C:15
5: [1] CH3:19
)

This query searches for all carbon atoms, because [#6] matches any atom with atomic number 6. In this case there are six matches, which are listed in the returned AtomMatch.

>>> m = mols["smarts [#6]"]
>>> print(m.num_groups())
6

You can extract individual matches using the group() function, and all matches using the groups() function.

>>> print(m.group(3))
Selector<SireMol::Atom>( size=1
0:  Atom( CB:11   [  16.05,    6.39,   15.26] )
)
>>> print(m.groups())
[Selector<SireMol::Atom>( size=1
0:  Atom( CH3:2   [  18.98,    3.45,   13.39] )
), Selector<SireMol::Atom>( size=1
0:  Atom( C:5     [  18.48,    4.55,   14.35] )
), Selector<SireMol::Atom>( size=1
0:  Atom( CA:9    [  16.54,    5.03,   15.81] )
), Selector<SireMol::Atom>( size=1
0:  Atom( CB:11   [  16.05,    6.39,   15.26] )
), Selector<SireMol::Atom>( size=1
0:  Atom( C:15    [  15.37,    4.19,   16.43] )
), Selector<SireMol::Atom>( size=1
0:  Atom( CH3:19  [  13.83,    3.94,   18.35] )
)]
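The loop below is a minimal sketch of walking over every match by index, using only the num_groups() and group() calls shown above. The StandInMatch class is hypothetical and is not part of sire; it exists only so that the sketch runs without sire installed. With sire you would pass the AtomMatch object m instead, assuming that len() works on the selector returned by group().

```python
def group_sizes(match):
    """Return the number of atoms in each match group.

    Works on any object that exposes num_groups() and group(i), which is
    the interface used on the AtomMatch object in the examples above.
    """
    return [len(match.group(i)) for i in range(match.num_groups())]


class StandInMatch:
    """Hypothetical stand-in for an AtomMatch (not part of sire)."""

    def __init__(self, groups):
        self._groups = [list(g) for g in groups]

    def num_groups(self):
        return len(self._groups)

    def group(self, i):
        return self._groups[i]


# Atom labels copied from the printed output above.
carbons = StandInMatch(
    [["CH3:2"], ["C:5"], ["CA:9"], ["CB:11"], ["C:15"], ["CH3:19"]]
)
sizes = group_sizes(carbons)
print(sizes)
```

Every group has size one here because the pattern [#6] contains a single atom.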

The atoms contained in the AtomMatch object are all the ones that are involved in any of the matches. Since this object is also a standard molecule view, it can be used like any other molecule view, e.g.

>>> print(m.atoms())
Selector<SireMol::Atom>( size=6
0:  Atom( CH3:2   [  18.98,    3.45,   13.39] )
1:  Atom( C:5     [  18.48,    4.55,   14.35] )
2:  Atom( CA:9    [  16.54,    5.03,   15.81] )
3:  Atom( CB:11   [  16.05,    6.39,   15.26] )
4:  Atom( C:15    [  15.37,    4.19,   16.43] )
5:  Atom( CH3:19  [  13.83,    3.94,   18.35] )
)

More complex searches

Smarts searches can match multiple atoms in a group. For example, [#6]!:[#6] matches a carbon joined to another carbon by a non-aromatic bond (the !: bond primitive means “not aromatic”). There are two atoms in this pattern, so each match group contains two atoms.

>>> m = mols["smarts [#6]!:[#6]"]
>>> print(m)
AtomMatch( size=3
0: [2] CH3:2,C:5
1: [2] CA:9,CB:11
2: [2] CA:9,C:15
)

The atoms in each group match are returned in the same order as they are matched in the smarts string. They can be accessed via the standard index operator of the atom container, e.g.

>>> print(m.group(0)[1])
Atom( C:5     [  18.48,    4.55,   14.35] )

This prints the second atom of the first match group.
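Note that CA:9 appears in two of the three groups above. Since the atoms of an AtomMatch are those involved in any of the matches, an atom shared between groups is only counted once. The sketch below illustrates this in plain Python using the atom labels from the printed output; it is not a sire call, and the order in which sire itself returns the atoms may differ.

```python
def atoms_in_any_match(groups):
    """Return each atom once, in order of first appearance across groups."""
    seen = set()
    unique = []
    for group in groups:
        for atom in group:
            if atom not in seen:
                seen.add(atom)
                unique.append(atom)
    return unique


# Groups copied from the AtomMatch printed above for "smarts [#6]!:[#6]".
groups = [("CH3:2", "C:5"), ("CA:9", "CB:11"), ("CA:9", "C:15")]

unique_atoms = atoms_in_any_match(groups)
print(unique_atoms)       # ['CH3:2', 'C:5', 'CA:9', 'CB:11', 'C:15']
print(len(unique_atoms))  # 5
```

Three groups of two atoms give six entries in total, but only five distinct atoms.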

Matching multiple molecules

A smarts search may match atoms in more than one molecule. For example, [#1]-[#8]-[#1] matches H-O-H, in that order, and so matches every water molecule.

>>> m = mols["smarts [#1]-[#8]-[#1]"]
>>> print(m)
AtomMatchM( size=630
0: WAT:1965 [3] H1:24,O:23,H2:25
1: WAT:1974 [3] H1:27,O:26,H2:28
2: WAT:1984 [3] H1:30,O:29,H2:31
3: WAT:1993 [3] H1:33,O:32,H2:34
4: WAT:2002 [3] H1:36,O:35,H2:37
...
625: WAT:2395 [3] H1:1899,O:1898,H2:1900
626: WAT:2419 [3] H1:1902,O:1901,H2:1903
627: WAT:2430 [3] H1:1905,O:1904,H2:1906
628: WAT:2463 [3] H1:1908,O:1907,H2:1909
629: WAT:2475 [3] H1:1911,O:1910,H2:1912
)

Note here how an AtomMatchM object is returned, signifying that the match spans multiple molecules. The same num_groups(), group() and groups() functions can be used to extract individual match groups.

>>> print(m.group(5))
Selector<SireMol::Atom>( size=3
0:  Atom( H1:39   [  19.17,   10.39,   17.12] )
1:  Atom( O:38    [  18.83,    9.65,   16.62] )
2:  Atom( H2:40   [  19.39,    9.61,   15.84] )
)

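Note again how the atoms are returned in the order in which they matched the smarts string (H, O, H), not the order in which they appear in the molecule. You can see the difference by printing the atoms of the same water molecule directly.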

>>> print(mols[6].atoms())
Selector<SireMol::Atom>( size=3
0:  Atom( O:38    [  18.83,    9.65,   16.62] )
1:  Atom( H1:39   [  19.17,   10.39,   17.12] )
2:  Atom( H2:40   [  19.39,    9.61,   15.84] )
)
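To make the ordering concrete, here is a toy matcher written in plain Python. It is only a sketch under stated assumptions: it is not how sire performs smarts matching, it handles only linear patterns of atomic numbers joined by any bond, and the function name match_linear and the bond list for water are introduced here purely for illustration.

```python
def match_linear(pattern, atoms, bonds):
    """Find matches of a linear pattern of atomic numbers in a small graph.

    pattern: list of atomic numbers, each bonded to the next.
    atoms:   dict of atom label -> atomic number, in molecule order.
    bonds:   iterable of (label, label) pairs.
    Returns tuples of labels ordered as in the pattern, keeping one
    match per distinct set of atoms.
    """
    neighbours = {label: set() for label in atoms}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    matches = []
    seen_sets = set()

    def extend(path):
        if len(path) == len(pattern):
            key = frozenset(path)
            if key not in seen_sets:  # skip the mirror-image match
                seen_sets.add(key)
                matches.append(tuple(path))
            return
        wanted = pattern[len(path)]
        if path:
            # Only atoms bonded to the previous atom, in molecule order.
            candidates = [x for x in atoms if x in neighbours[path[-1]]]
        else:
            candidates = list(atoms)
        for label in candidates:
            if label not in path and atoms[label] == wanted:
                extend(path + [label])

    extend([])
    return matches


# Water in molecule order (O first), labels taken from the output above.
# The two O-H bonds are an assumption made for this illustration.
water = {"O:38": 8, "H1:39": 1, "H2:40": 1}
bonds = [("O:38", "H1:39"), ("O:38", "H2:40")]

found = match_linear([1, 8, 1], water, bonds)
print(found)  # [('H1:39', 'O:38', 'H2:40')]
```

The molecule lists oxygen first, but the match lists it second, because the pattern places the oxygen between the two hydrogens.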

Generating smarts strings from molecules

You can generate a smarts string from a molecule by calling its smarts() function, e.g.

>>> print(mols[0].smarts())
[#6H3]-[#6](=[#8])-[#7H]-[#6H](-[#6H3])-[#6](=[#8])-[#7H]-[#6H3]

You can include hydrogens by passing include_hydrogens=True, e.g.

>>> print(mols[0].smarts(include_hydrogens=True))
[H]-[#6](-[H])(-[H])-[#6](=[#8])-[#7](-[H])-[#6](-[H])(-[#6](-[H])(-[H])-[H])-[#6](=[#8])-[#7](-[H])-[#6](-[H])(-[H])-[H]

You can get the smarts string as a sire search term by passing as_search=True, e.g.

>>> print(mols[0].smarts(as_search=True))
smarts [#6H3]-[#6](=[#8])-[#7H]-[#6H](-[#6H3])-[#6](=[#8])-[#7H]-[#6H3]
>>> print(mols[mols[0].smarts(as_search=True)])
AtomMatch( size=1
0: [10] CH3:2,C:5,O:6,N:7,CA:9...
)
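As a quick sanity check on the [10] shown in the match above, you can count the atoms in the generated string with the standard library. This is a minimal sketch that assumes every atom is written in the bracketed [#N...] form, as in the output of smarts() without hydrogens; it would not count the [H] atoms that appear when include_hydrogens=True is used.

```python
import re

smarts = "[#6H3]-[#6](=[#8])-[#7H]-[#6H](-[#6H3])-[#6](=[#8])-[#7H]-[#6H3]"

# Each bracketed atom starts with "[#" followed by its atomic number.
atomic_numbers = [int(n) for n in re.findall(r"\[#(\d+)", smarts)]
print(len(atomic_numbers))  # 10
print(atomic_numbers)       # [6, 6, 8, 7, 6, 6, 6, 8, 7, 6]

# The search term printed with as_search=True above is the same string
# with the keyword "smarts" in front of it.
search_term = "smarts " + smarts
print(search_term)
```

The ten atomic numbers correspond to the ten heavy atoms (carbon, oxygen and nitrogen) of the molecule.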

Creating molecules from smarts

You can also create fragment molecules from smarts strings using the sire.smarts() function. Note that these can only be used for substructure searching, because they lack properties such as coordinates.

>>> mol = sr.smarts("[#6H3]-[#6](=[#8])-[#7H]-[#6H](-[#6H3])-[#6](=[#8])-[#7H]-[#6H3]")
>>> print(mol.atoms())
Selector<SireMol::Atom>( size=10
0:  Atom( C1:1 )
1:  Atom( C2:2 )
2:  Atom( O3:3 )
3:  Atom( N4:4 )
4:  Atom( C5:5 )
5:  Atom( C6:6 )
6:  Atom( C7:7 )
7:  Atom( O8:8 )
8:  Atom( N9:9 )
9:  Atom( C10:10 )
)
>>> print(mol.bonds())
SelectorBond( size=9
0: Bond( C1:1 => C2:2 )
1: Bond( C2:2 => O3:3 )
2: Bond( C2:2 => N4:4 )
3: Bond( N4:4 => C5:5 )
4: Bond( C5:5 => C6:6 )
5: Bond( C5:5 => C7:7 )
6: Bond( C7:7 => O8:8 )
7: Bond( C7:7 => N9:9 )
8: Bond( N9:9 => C10:10 )
)
>>> print(mol.properties())
Properties(
    formal_charge => SireMol::AtomCharges( size=10
0: 0 |e|
1: 0 |e|
2: 0 |e|
3: 0 |e|
4: 0 |e|
5: 0 |e|
6: 0 |e|
7: 0 |e|
8: 0 |e|
9: 0 |e|
),
    hybridization => SireMol::AtomHybridizations( size=10
0: unspecified
1: unspecified
2: unspecified
3: unspecified
4: unspecified
5: unspecified
6: unspecified
7: unspecified
8: unspecified
9: unspecified
),
    element => SireMol::AtomElements( size=10
0: Carbon (C, 6)
1: Carbon (C, 6)
2: Oxygen (O, 8)
3: Nitrogen (N, 7)
4: Carbon (C, 6)
5: Carbon (C, 6)
6: Carbon (C, 6)
7: Oxygen (O, 8)
8: Nitrogen (N, 7)
9: Carbon (C, 6)
),
    isotope => SireMol::AtomIntProperty( size=10
0: 0
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
),
    mass => SireMol::AtomMasses( size=10
0: 12.011 g mol-1
1: 12.011 g mol-1
2: 15.999 g mol-1
3: 14.007 g mol-1
4: 12.011 g mol-1
5: 12.011 g mol-1
6: 12.011 g mol-1
7: 15.999 g mol-1
8: 14.007 g mol-1
9: 12.011 g mol-1
),
    chirality => SireMol::AtomChiralities( size=10
0: unspecified
1: unspecified
2: unspecified
3: unspecified
4: unspecified
5: unspecified
6: unspecified
7: unspecified
8: unspecified
9: unspecified
),
    is_aromatic => SireMol::AtomIntProperty( size=10
0: 0
1: 0
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0
),
    connectivity => Connectivity: nConnections() == 9.
Connected residues:
Connected atoms:
  * Atom C1:LIG:1 bonded to C2:LIG:1.
  * Atom C2:LIG:1 bonded to O3:LIG:1 N4:LIG:1 C1:LIG:1.
  * Atom O3:LIG:1 bonded to C2:LIG:1.
  * Atom N4:LIG:1 bonded to C2:LIG:1 C5:LIG:1.
  * Atom C5:LIG:1 bonded to N4:LIG:1 C7:LIG:1 C6:LIG:1.
  * Atom C6:LIG:1 bonded to C5:LIG:1.
  * Atom C7:LIG:1 bonded to N9:LIG:1 O8:LIG:1 C5:LIG:1.
  * Atom O8:LIG:1 bonded to C7:LIG:1.
  * Atom N9:LIG:1 bonded to C10:LIG:1 C7:LIG:1.
  * Atom C10:LIG:1 bonded to N9:LIG:1.
)