Commit 39e70bea authored by jspiegel

Update Gypsum-DL and Dimorphite-DL

parent 00c54110
......@@ -4,6 +4,8 @@ Changes
4.0.2
-----
* Updated Gypsum-DL to version 1.1.5.
* Updated Dimorphite-DL to version 1.2.4.
* Fixed failure in `$PATH/autogrow4/accessory_scripts/make_lineage_figures.py`
to detect source compounds when run had `--use_docked_source_compounds` set
to False. This patch required the added user variable
......
Changes
=======
1.1.5
-----
* Updated Dimorphite-DL to 1.2.4. Now better handles compounds with
polyphosphate chains (e.g., ATP).
* Minor updates to the Durrant-lab filters:
* When running Gypsum-DL without the `--use_durrant_lab_filters` parameter,
Gypsum-DL now displays a warning. We strongly recommend using these
filters, but we choose not to turn them on by default in order to maintain
backwards compatibility.
* Added filter to compensate for a phosphate-related bug in MolVS, one of
Gypsum-DL's dependencies. MolVS sometimes tautomerizes `[O]P(O)([O])=O` to
`[O][PH](=O)([O])=O`, so the Durrant-lab filters now remove any tautomers
with substructures that match the SMARTS string `O=[PH](=O)([#8])([#8])`.
* Added filters to compensate for frequently seen, unusual MolVS
tautomerization of adenine. The Durrant-lab filters now remove tautomers
with substructures that match `[#7]=C1[#7]=C[#7]C=C1` and
`N=c1cc[#7]c[#7]1`.
* Added filter to remove terminal iminols. While amide-iminol
tautomerization is valid, amides are far more common, and accounting for
this tautomerization produces many improbable iminol compounds. The
Durrant-lab filters now remove compounds with substructures that match
`[$([NX2H1]),$([NX3H2])]=C[$([OH]),$([O-])]`.
* Added filter to remove molecules containing `[Bi]`.
* Gypsum-DL now outputs molecules with total charges between -4e and +4e.
Before, the cutoff was -2e to 2e. We expanded the range to permit ATP and
other similar molecules.
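
The widened charge window can be illustrated with a minimal sketch. It assumes, as the accompanying change to `remove_highly_charged_molecules` suggests, that the cutoff is measured relative to the variant whose charge is closest to neutral. The helper name `keep_near_neutral` and the sample charges are hypothetical and are not part of Gypsum-DL.

```python
def keep_near_neutral(charges, max_deviation=4):
    """Return the indices of charges within max_deviation of the most neutral one.

    Hypothetical helper for illustration; not the Gypsum-DL implementation.
    """
    if not charges:
        return []
    # The charge closest to zero serves as the reference point.
    closest_to_neutral = min(charges, key=abs)
    return [
        i
        for i, charge in enumerate(charges)
        if abs(charge - closest_to_neutral) <= max_deviation
    ]


# Formal charges of made-up, ATP-like ionization variants.
charges = [-4, -3, -2, -1, 0]
print(keep_near_neutral(charges, max_deviation=2))  # old cutoff: [2, 3, 4]
print(keep_near_neutral(charges, max_deviation=4))  # new cutoff: [0, 1, 2, 3, 4]
```

With the old cutoff, the -3 and -4 variants in this example would have been discarded.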
1.1.4
-----
......
# Gypsum-DL 1.1.4
# Gypsum-DL 1.1.5
Gypsum-DL is a free, open-source program for preparing 3D small-molecule
models. Beyond simply assigning atomic coordinates, Gypsum-DL accounts for
......@@ -217,9 +217,9 @@ outputs to ensure they are chemically feasible.
### Durrant-Lab Filters
In looking over many Gypsum-DL-generated variants, we have identified several
substructures that, though technically possible, strike us as improbable. Here
are some examples:
In looking over many Gypsum-DL-generated variants, we have identified a number
of substructures that, though technically possible, strike us as improbable or
otherwise poorly suited for virtual screening. Here are some examples:
* `C=[N-]`
* `[N-]C=[N+]`
......@@ -227,9 +227,14 @@ are some examples:
* `[#7+]~[#7+]`
* `[#7-]~[#7-]`
* `[!#7]~[#7+]~[#7-]~[!#7]`
If you'd like to discard molecular variants with these substructures, use the
`--use_durrant_lab_filters` flag.
* `[#5]` (boron)
* `O=[PH](=O)([#8])([#8])`
* `N=c1cc[#7]c[#7]1`
* `[$([NX2H1]),$([NX3H2])]=C[$([OH]),$([O-])]`
* Metals
If you'd like to discard molecular variants with substructures such as these,
use the `--use_durrant_lab_filters` flag.
### Advanced Methods for Eliminating Problematic Compounds
......
......@@ -96,10 +96,11 @@ def remove_highly_charged_molecules(mol_lst):
charge_closest_to_neutral = charges[idx_of_closest_to_neutral]
# Now create a new mol list, where the charges deviate from the most
# neutral by no more than 2.
# neutral by no more than 4. Note that this used to be 2, but I increased
# it to 4 to accommodate ATP.
new_mol_lst = []
for i, charge in enumerate(charges):
if abs(charge - charge_closest_to_neutral) <= 2:
if abs(charge - charge_closest_to_neutral) <= 4:
new_mol_lst.append(mol_lst[i])
else:
Utils.log(
......
......@@ -180,8 +180,20 @@ def prepare_molecules(args):
)
params["add_html_output"] = False
# Warn the user if he or she is not using the Durrant lab filters.
if params["use_durrant_lab_filters"] == False:
Utils.log(
"WARNING: Running Gypsum-DL without the Durrant-lab filters. In looking over many Gypsum-DL-generated " +
"variants, we have identified a number of substructures that, though technically possible, strike us " +
"as improbable or otherwise poorly suited for virtual screening. We strongly recommend removing these " +
"by running Gypsum-DL with the --use_durrant_lab_filters option.",
trailing_whitespace="\n"
)
# Load SMILES data
if isinstance(params["source"], str):
Utils.log("Loading molecules from " + os.path.basename(params["source"]) + "...")
# Smiles must be array of strs.
src = params["source"]
if src.lower().endswith(".smi") or src.lower().endswith(".can"):
......
......@@ -28,7 +28,8 @@ try:
except:
Utils.exception("You need to install rdkit and its dependencies.")
# Get the substructures you won't permit (per substructure matching)
# Get the substructures you won't permit (per substructure matching, not
# substring matching)
prohibited_smi_substrs_for_substruc = [
"C=[N-]",
"[N-]C=[N+]",
......@@ -38,6 +39,10 @@ prohibited_smi_substrs_for_substruc = [
"[!#7]~[#7+]~[#7-]~[!#7]", # Doesn't hit azide.
# Vina can't process boron anyway...
"[#5]", # B
"O=[PH](=O)([#8])([#8])", # molvs does odd tautomer: OP(O)(O)=O => O=[PH](=O)(O)O
"[#7]=C1[#7]=C[#7]C=C1", # Prevents an odd tautomer sometimes seen with adenine.
"N=c1cc[#7]c[#7]1", # Variant of above
"[$([NX2H1]),$([NX3H2])]=C[$([OH]),$([O-])]" # Terminal iminol
]
# Get the substrings you won't permit (per substring matching)
......@@ -63,7 +68,8 @@ prohibited_smi_substrs_for_substr = [
"[Mo", # Mo
"[Cd", # Cd
"[Au", # Au
"[Pb" "[Bi", # Pb # Bi
"[Pb", # Pb
"[Bi", # Bi
]
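
The replaced line `"[Pb" "[Bi",` lacked a comma between the two literals. Python joins adjacent string literals, so the old list held the single entry `"[Pb[Bi"` and matched neither element. The sketch below shows the effect under the assumption that substring filtering is a plain `in` test. The helper `has_prohibited_substring` and the example SMILES are hypothetical.

```python
# Old list fragment: the missing comma makes Python join the two adjacent
# literals into the single string "[Pb[Bi".
old_prohibited = ["[Au", "[Pb" "[Bi"]
# New list fragment: one entry per element.
new_prohibited = ["[Au", "[Pb", "[Bi"]


def has_prohibited_substring(smiles, prohibited):
    """Hypothetical helper: True if any prohibited substring occurs in smiles."""
    return any(substr in smiles for substr in prohibited)


bismuth_smiles = "C[Bi](C)C"  # trimethylbismuth, chosen only for illustration
lead_smiles = "C[Pb](C)(C)C"  # tetramethyllead, chosen only for illustration

print(old_prohibited)  # ['[Au', '[Pb[Bi']
print(has_prohibited_substring(bismuth_smiles, old_prohibited))  # False
print(has_prohibited_substring(lead_smiles, old_prohibited))  # False
print(has_prohibited_substring(bismuth_smiles, new_prohibited))  # True
print(has_prohibited_substring(lead_smiles, new_prohibited))  # True
```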
......@@ -147,7 +153,7 @@ def parallel_durrant_lab_filter(contnr, prohibited_substructs):
:param contnr: The molecule container.
:type contnr: MolContainer.MolContainer
:param prohibited_substructs: A list of the prohibited subsstructures.
:param prohibited_substructs: A list of the prohibited substructures.
:type prohibited_substructs: list
:return: Either the container with bad molecules removed, or a None
object.
......
Changes
=======
1.2.4
-----
* Dimorphite-DL now better protonates compounds with polyphosphate chains
(e.g., ATP). See `site_substructures.smarts` for the rationale behind the
added pKa values.
* Added test cases for ATP and NAD.
* `site_substructures.smarts` now allows comments (lines that start with `#`).
* Fixed a bug that affected how Dimorphite-DL deals with new protonation
states that yield invalid SMILES strings.
* Previously, it simply returned the original input SMILES in these rare
cases (better than nothing). Now, it instead returns the last valid SMILES
produced, not necessarily the original SMILES.
* Consider `O=C(O)N1C=CC=C1` at pH 3.5 as an example.
* Dimorphite-DL first deprotonates the carboxyl group, producing
`O=C([O-])n1cccc1` (a valid SMILES).
* It then attempts to protonate the aromatic nitrogen, producing
`O=C([O-])[n+]1cccc1`, an invalid SMILES.
* Previously, it would output the original SMILES, `O=C(O)N1C=CC=C1`. Now
it outputs the last valid SMILES, `O=C([O-])n1cccc1`.
* Improved support for the `--silent` option.
* Reformatted code per the [*Black* Python code
formatter](https://github.com/psf/black).
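
The invalid-SMILES fix described above can be sketched as follows. This is a simplified model, not the Dimorphite-DL code: it assumes protonation states are produced one after another, and it replaces real SMILES validation with a lookup in a hard-coded set. The names `old_behavior`, `new_behavior`, and `is_valid` are hypothetical.

```python
def old_behavior(original, candidates, is_valid):
    """Pre-1.2.4 fallback: any invalid state reverts to the original input."""
    for smi in candidates:
        if not is_valid(smi):
            return original
    return candidates[-1] if candidates else original


def new_behavior(original, candidates, is_valid):
    """1.2.4 fallback: keep the last valid state reached before the failure."""
    last_valid = original
    for smi in candidates:
        if not is_valid(smi):
            break  # stop at the first invalid state
        last_valid = smi
    return last_valid


# Stand-in validity check based on the changelog example above.
INVALID = {"O=C([O-])[n+]1cccc1"}


def is_valid(smi):
    return smi not in INVALID


original = "O=C(O)N1C=CC=C1"
steps = ["O=C([O-])n1cccc1", "O=C([O-])[n+]1cccc1"]

print(old_behavior(original, steps, is_valid))  # O=C(O)N1C=CC=C1
print(new_behavior(original, steps, is_valid))  # O=C([O-])n1cccc1
```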
1.2.3
-----
......
Dimorphite-DL 1.2.3
Dimorphite-DL 1.2.4
===================
What is it?
......@@ -34,7 +34,7 @@ usage: dimorphite_dl.py [-h] [--min_ph MIN] [--max_ph MAX]
[--smiles_file FILE] [--output_file FILE]
[--label_states] [--test]
Dimorphite 1.2.3: Creates models of appropriately protonated small molecules.
Dimorphite 1.2.4: Creates models of appropriately protonated small molecules.
Apache 2.0 License. Copyright 2020 Jacob D. Durrant.
optional arguments:
......
......@@ -10,8 +10,30 @@ Carboxyl [C:1](=[O:2])-[O:3]-[H] 2 3.456652971502591 1.2871420886834017
Thioic_acid [C,c,N,n:1](=[O,S:2])-[SX2,OX2:3]-[H] 2 0.678267 1.497048763660801
Phenyl_Thiol [c,n:1]-[SX2:2]-[H] 1 4.978235294117647 2.6137000480499806
Thiol [C,N:1]-[SX2:2]-[H] 1 9.12448275862069 1.3317968158171463
# [*]OP(=O)(O[H])O[H]. Note that this matches terminal phosphate of ATP, ADP, AMP.
Phosphate [PX4:1](=[O:2])(-[OX2:3]-[H])(-[O+0:4])-[OX2:5]-[H] 2 2.4182608695652172 1.1091177991945305 5 6.5055 0.9512787792174668
# Note that Internal_phosphate_polyphos_chain and
# Initial_phosphate_like_in_ATP_ADP were added on 6/2/2020 to better deal with
# molecules that have polyphosphate chains (e.g., ATP, ADP, NADH, etc.). Unlike
# the other protonation states, these two were not determined by analyzing a set
# of many compounds with experimentally determined pKa values.
# For Internal_phosphate_polyphos_chain, we use a mean pKa value of 0.9, per
# DOI: 10.7554/eLife.38821. For the precision value we use 1.0, which is roughly
# the precision of the two ionizable hydroxyls from Phosphate (see above). Note
# that when using recursive SMARTS strings, RDKit considers only the first atom
# to be a match. Subsequent atoms define the environment.
Internal_phosphate_polyphos_chain [$([PX4:1](=O)([OX2][PX4](=O)([OX2])(O[H]))([OX2][PX4](=O)(O[H])([OX2])))][O:2]-[H] 1 0.9 1.0
# For Initial_phosphate_like_in_ATP_ADP, we use the same values found for the
# lower-pKa hydroxyl of Phosphate (above).
Initial_phosphate_like_in_ATP_ADP [$([PX4:1]([OX2][C,c,N,n])(=O)([OX2][PX4](=O)([OX2])(O[H])))]O-[H] 1 2.4182608695652172 1.1091177991945305
# [*]P(=O)(O[H])O[H]. Cannot match terminal phosphate of ATP because O not among [C,c,N,n]
Phosphonate [PX4:1](=[O:2])(-[OX2:3]-[H])(-[C,c,N,n:4])-[OX2:5]-[H] 2 1.8835714285714287 0.5925999820080644 5 7.247254901960784 0.8511476450801531
Phenol [c,n,o:1]-[O:2]-[H] 1 7.065359866910526 3.277356122295936
Peroxide1 [O:1]([$(C=O),$(C[Cl]),$(CF),$(C[Br]),$(CC#N):2])-[O:3]-[H] 2 8.738888888888889 0.7562592839596507
Peroxide2 [C:1]-[O:2]-[O:3]-[H] 2 11.978235294117647 0.8697645895163075
......@@ -31,9 +53,17 @@ Anilines_secondary [c:1]-[NX3+0:2]([H:3])[!H:4] 1 4.335408163265306 2.1768842022
Anilines_tertiary [c:1]-[NX3+0:2]([!H:3])[!H:4] 1 4.16690685045614 2.005865735782679
Aromatic_nitrogen_unprotonated [n+0&H0:1] 0 4.3535441240733945 2.0714072661859584
Amines_primary_secondary_tertiary [C:1]-[NX3+0:2] 1 8.159107682388349 2.5183597445318147
# e.g., [*]P(=O)(O[H])[*]. Note that cannot match the internal phosphates of ATP, because
# oxygen is not among [C,c,N,n,F,Cl,Br,I]
Phosphinic_acid [PX4:1](=[O:2])(-[C,c,N,n,F,Cl,Br,I:3])(-[C,c,N,n,F,Cl,Br,I:4])-[OX2:5]-[H] 4 2.9745 0.6867886750744557
# e.g., [*]OP(=O)(O[H])O[*]. Cannot match ATP because P not among [C,c,N,n,F,Cl,Br,I]
Phosphate_diester [PX4:1](=[O:2])(-[OX2:3]-[C,c,N,n,F,Cl,Br,I:4])(-[O+0:5]-[C,c,N,n,F,Cl,Br,I:4])-[OX2:6]-[H] 6 2.7280434782608696 2.5437448856908316
# e.g., [*]P(=O)(O[H])O[*]. Cannot match ATP because O not among [C,c,N,n,F,Cl,Br,I].
Phosphonate_ester [PX4:1](=[O:2])(-[OX2:3]-[C,c,N,n,F,Cl,Br,I:4])(-[C,c,N,n,F,Cl,Br,I:5])-[OX2:6]-[H] 5 2.0868 0.4503028610465036
Primary_hydroxyl_amine [C,c:1]-[O:2]-[NH2:3] 2 4.035714285714286 0.8463816543155368
*Indole_pyrrole [c;R:1]1[c;R:2][c;R:3][c;R:4][n;R:5]1[H] 4 14.52875 4.06702491591416
*Aromatic_nitrogen_protonated [n:1]-[H] 0 7.17 2.94602395490212
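
Since `site_substructures.smarts` now permits `#` comment lines, a reader of the file has to skip them. The sketch below is one way to do that and is not the Dimorphite-DL loader. It assumes each data line is a name, a SMARTS string, and then one or more triples of atom index, mean pKa, and precision, which is an inference from the comments in the file. The function `parse_site_lines` is hypothetical.

```python
def parse_site_lines(lines):
    """Parse site definitions, ignoring blank lines and '#' comments.

    Assumed layout per line (inferred, not confirmed):
    name, SMARTS, then triples of (atom index, mean pKa, precision).
    """
    sites = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        fields = line.split()
        name, smarts, numbers = fields[0], fields[1], fields[2:]
        if not numbers or len(numbers) % 3 != 0:
            raise ValueError("Unexpected number of columns for " + name)
        states = [
            (int(numbers[i]), float(numbers[i + 1]), float(numbers[i + 2]))
            for i in range(0, len(numbers), 3)
        ]
        sites.append({"name": name, "smarts": smarts, "states": states})
    return sites


sample = [
    "# [*]OP(=O)(O[H])O[H]. Note that this matches terminal phosphate of ATP, ADP, AMP.",
    "Phosphate [PX4:1](=[O:2])(-[OX2:3]-[H])(-[O+0:4])-[OX2:5]-[H] "
    "2 2.4182608695652172 1.1091177991945305 5 6.5055 0.9512787792174668",
    "",
    "Thiol [C,N:1]-[SX2:2]-[H] 1 9.12448275862069 1.3317968158171463",
]

for site in parse_site_lines(sample):
    print(site["name"], len(site["states"]))
```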
......@@ -88,11 +88,14 @@ def random_sample(lst, num, msg_if_cut=""):
return lst
def log(txt):
def log(txt, trailing_whitespace=""):
"""Prints a message to the screen.
:param txt: The message to print.
:type txt: str
:param trailing_whitespace: White space to add to the end of the
message, after the trim. "" by default.
:type trailing_whitespace: string
"""
whitespace_before = txt[: len(txt) - len(txt.lstrip())].replace("\t", " ")
......@@ -102,7 +105,7 @@ def log(txt):
width=80,
initial_indent=whitespace_before,
subsequent_indent=whitespace_before + " ",
)
) + trailing_whitespace
)
......
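
The new `trailing_whitespace` parameter is presumably needed because the message is stripped before wrapping, so a newline appended to `txt` itself would be lost. The following sketch reproduces that behavior using only `textwrap.fill` from the standard library. It returns the string instead of printing it, and the four-space indent widths are assumptions, because the indentation in the hunk above did not survive. The name `format_log` is hypothetical.

```python
import textwrap


def format_log(txt, trailing_whitespace=""):
    """Wrap txt to 80 columns, then append trailing_whitespace unchanged.

    Hypothetical stand-in for Utils.log that returns the text instead of
    printing it. The four-space indents are assumed values.
    """
    # Preserve the caller's leading indentation, with tabs expanded.
    whitespace_before = txt[: len(txt) - len(txt.lstrip())].replace("\t", "    ")
    wrapped = textwrap.fill(
        txt.strip(),
        width=80,
        initial_indent=whitespace_before,
        subsequent_indent=whitespace_before + "    ",
    )
    # Appended after wrapping, so it is not removed by strip() or fill().
    return wrapped + trailing_whitespace


message = "WARNING: Running Gypsum-DL without the Durrant-lab filters. " * 3
print(repr(format_log("  short message\n")))  # the newline in txt is stripped
print(repr(format_log("  short message", trailing_whitespace="\n")))
print(format_log(message, trailing_whitespace="\n"))
```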
......@@ -15,7 +15,7 @@
# limitations under the License.
"""
Gypsum-DL 1.1.4 is a conversion script to transform smiles strings and 2D SDFs
Gypsum-DL 1.1.5 is a conversion script to transform smiles strings and 2D SDFs
into 3D models.
"""
......@@ -73,7 +73,7 @@ from gypsum_dl import Utils
PARSER = argparse.ArgumentParser(
formatter_class=argparse.RawDescriptionHelpFormatter,
description="""
Gypsum-DL 1.1.4, a free, open-source program for preparing 3D small-molecule
Gypsum-DL 1.1.5, a free, open-source program for preparing 3D small-molecule
models. Beyond simply assigning atomic coordinates, Gypsum-DL accounts for
alternate ionization, tautomeric, chiral, cis/trans isomeric, and
ring-conformational forms.""",
......