Commit 77411954 authored by hgarrereyn

large refactor

parent 38b5cee4
# Lead Optimization
This repository contains code for machine learning based lead optimization.

# Overview

- `config`: fixed configuration information (e.g. TRAIN/VAL/TEST partitions)
- `configurations`: benchmark model configurations
- `data`: training/inference data (see [`data/README.md`](data/README.md))
- `docker`: Docker environment
- `leadopt`: main module code
  - `models`: pytorch architecture definitions
  - `data_util.py`: utility code for reading packed fragment/fingerprint data files
  - `grid_util.py`: GPU-accelerated grid generation code
  - (outdated) `infer.py`: code for inference with a trained model
  - `metrics.py`: pytorch implementations of several metrics
  - `model_conf.py`: code to configure and train models
  - `util.py`: utility code for rdkit/openbabel processing
- `pretrained`: pretrained models (see [`pretrained/README.md`](pretrained/README.md))
- (outdated) `scripts`: data processing scripts (see [`scripts/README.md`](scripts/README.md))
- `train.py`: CLI interface to launch training runs
- (outdated) `leadopt.py`: CLI interface to run inference on new samples
# Dependencies
You can build a virtualenv with the requirements:
```sh
$ python3 -m venv leadopt_env
$ source ./leadopt_env/bin/activate
$ pip install -r requirements.txt
```
Note: CUDA 10.1 is required during training.
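You can verify that the GPU is visible from this environment with a quick pytorch check (assuming `torch` is installed from `requirements.txt`):

```py
import torch

# Should print True and the CUDA version pytorch was built against.
print(torch.cuda.is_available(), torch.version.cuda)
```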
# Training
To train a model, use the `train.py` utility script. You can specify model parameters as command line arguments or load them from a configuration file (e.g. `./configurations/args.json`):
```bash
python train.py \
--save_path=/path/to/model \
--wandb_project=my_project \
{model_type} \
--model_arg1=x \
--model_arg2=y \
...
```
or
```bash
python train.py \
--save_path=/path/to/model \
--wandb_project=my_project \
--configuration=./configurations/args.json
```
`save_path` is a directory to save the best model. The directory will be created if it doesn't exist. If this is not provided, the model will not be saved.
`wandb_project` is an optional wandb project name. If provided, the run will be logged to wandb.
See below for available models and model-specific parameters:
# Leadopt Models
In this repository, trainable models are subclasses of `model_conf.LeadoptModel`. This class encapsulates the model configuration arguments and the underlying pytorch models, and supports saving and loading multi-component models.
```py
from leadopt.model_conf import LeadoptModel, MODELS
model = MODELS['voxel']({args...})
model.train(save_path='./mymodel')
...
model2 = LeadoptModel.load('./mymodel')
```
Internally, model arguments are configured by setting up an `argparse` parser and passing around a `dict` of configuration parameters in `self._args`.
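For example, a new model type could be registered by subclassing `LeadoptModel`. The sketch below is illustrative only: the hook name (`setup_parser`) and the argument names are assumptions, not necessarily the exact API in `leadopt/model_conf.py`:

```py
from leadopt.model_conf import LeadoptModel, MODELS

class MyModel(LeadoptModel):
    """Hypothetical model type, shown only to illustrate the pattern."""

    @staticmethod
    def setup_parser(parser):
        # Model-specific arguments (names made up for this example).
        parser.add_argument('--hidden_size', type=int, default=128)
        parser.add_argument('--learning_rate', type=float, default=1e-3)

    def train(self, save_path=None):
        # Parsed configuration is available as a dict in self._args.
        lr = self._args['learning_rate']
        ...

# Register the new model type so it can be looked up by name.
MODELS['my_model'] = MyModel
```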
## VoxelNet
```
--no_partitions If set, disable the use of TRAIN/VAL partitions during
training.
-f FRAGMENTS, --fragments FRAGMENTS
Path to fragments file.
-fp FINGERPRINTS, --fingerprints FINGERPRINTS
Path to fingerprints file.
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
--num_epochs NUM_EPOCHS
Number of epochs to train for.
--test_steps TEST_STEPS
Number of evaluation steps per epoch.
-b BATCH_SIZE, --batch_size BATCH_SIZE
--grid_width GRID_WIDTH
--grid_res GRID_RES
--fdist_min FDIST_MIN
Ignore fragments closer to the receptor than this
distance (Angstroms).
--fdist_max FDIST_MAX
Ignore fragments further from the receptor than this
distance (Angstroms).
--fmass_min FMASS_MIN
Ignore fragments smaller than this mass (Daltons).
--fmass_max FMASS_MAX
Ignore fragments larger than this mass (Daltons).
--ignore_receptor
--ignore_parent
-rec_typer {single,single_h,simple,simple_h,desc,desc_h}
-lig_typer {single,single_h,simple,simple_h,desc,desc_h}
-rec_channels REC_CHANNELS
-lig_channels LIG_CHANNELS
--in_channels IN_CHANNELS
--output_size OUTPUT_SIZE
--pad
--blocks BLOCKS [BLOCKS ...]
--fc FC [FC ...]
--use_all_labels
--dist_fn {mse,bce,cos,tanimoto}
--loss {direct,support_v1}
```
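For example, a VoxelNet run might be launched as follows. This is an illustrative sketch: the data paths are placeholders, and the `{model_type}` name should match the corresponding entry in `MODELS` (the benchmark configurations use the version string `voxelnet`):

```bash
python train.py \
--save_path=/path/to/model \
--wandb_project=my_project \
voxelnet \
--fragments=./data/fragments_desc.h5 \
--fingerprints=./data/fp_rdk_desc.h5 \
--learning_rate=0.001 \
--num_epochs=50 \
--batch_size=32 \
--grid_width=24 \
--grid_res=1.0
```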
This folder contains benchmark model configurations referenced in the paper.
You can retrain from these configurations using the `train.py` script:
```sh
python train.py \
--save_path=/path/to/model \
--wandb_project=my_project \
--configuration=./configurations/model.json
```
Note: these configuration files assume the working directory is the `leadopt` base directory and that the data directory is accessible at `./data`.
```json
{
    "version": "voxelnet",
    "no_partitions": false,
    "fragments": "./data/fragments_desc.h5",
    "fingerprints": "./data/fp_rdk_desc.h5",
    "learning_rate": 0.001,
    "num_epochs": 300,
    "test_steps": 500,
    "batch_size": 32,
    "grid_width": 24,
    "grid_res": 1.0,
    "fdist_min": 0.5,
    "fdist_max": 3,
    "fmass_min": null,
    "fmass_max": 150,
    "ignore_receptor": false,
    "ignore_parent": false,
    "rec_typer": "simple",
    "lig_typer": "simple",
    "rec_channels": 4,
    "lig_channels": 3,
    "in_channels": 7,
    "output_size": 2048,
    "pad": false,
    "blocks": [32, 64],
    "fc": [2048],
    "use_all_labels": true,
    "dist_fn": "mse",
    "loss": "direct"
}
```
This folder contains data used during training and inference.
You can download the data here: https://pitt.box.com/s/ubohnl10idnarpam40hq6chggtaojqv7
Overview:
- `fragments_desc.h5` (3.9 GB): pdbbind fragment information created by `scripts/process_pdbbind2.py`
- `fragments_mini.h5` (824 KB): a minimal version of `fragments_desc.h5` with 3 receptors and 50 fragments for testing
- `fp_rdk_desc.h5` (625 MB): precomputed rdk fingerprints created by `scripts/make_fingerprints.py`
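To sanity-check a download, each `.h5` file can be opened with `h5py` (a minimal sketch; the exact group and dataset names inside each file are defined by the generating scripts):

```py
import h5py

# Print every dataset stored in the packed fragment file, with its shape.
with h5py.File('./data/fragments_mini.h5', 'r') as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))
```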
"""
Contains code for gpu-accelerated grid generation.
"""
import math
import ctypes
GPU_DIM = 8
@numba.cuda.jit
def gpu_gridify(grid, atom_num, atom_coords, atom_layers, layer_offset,
batch_idx, width, res, center, rot):
"""Adds atoms to the grid in a GPU kernel.
This kernel converts atom coordinate information to 3d voxel information.
Each GPU thread is responsible for one specific grid point. This function
receives a list of atomic coordinates and atom layers and simply iterates
over the list to find nearby atoms and add their effect.
Voxel information is stored in a 5D tensor of type: BxTxNxNxN where:
B = batch size
T = number of atom types (receptor + ligand)
N = grid width (in gridpoints)
Each invocation of this function will write information to a specific batch
index specified by batch_idx. Additionally, the layer_offset parameter can
be set to specify a fixed offset to add to each atom_layer item.
How it works:
1. Each GPU thread controls a single gridpoint. This gridpoint coordinate
is translated to a "real world" coordinate by applying rotation and
translation vectors.
2. Each thread iterates over the list of atoms and checks for atoms within
a threshold to add to the grid.
Args:
grid: DeviceNDArray tensor where grid information is stored
atom_num: number of atoms
atom_coords: array containing (x,y,z) atom coordinates
atom_layers: array containing (idx) offsets that specify which layer to
store this atom. (-1 can be used to ignore an atom)
layer_offset: a fixed offset added to each atom layer index
batch_idx: index specifying which batch to write information to
width: number of grid points in each dimension
res: distance between neighboring grid points in angstroms
(1 == gridpoint every angstrom)
(0.5 == gridpoint every half angstrom, i.e. a tighter grid)
center: (x,y,z) coordinate of grid center
rot: (w,x,y,z) rotation quaternion
"""
x,y,z = numba.cuda.grid(3)
# center around origin
while i < atom_num:
# fetch atom
fx, fy, fz = atom_coords[i]
ft = atom_layers[i]
i += 1
# invisible atoms
if ft == -1:
continue
# fixed radius (^2)
r2 = 4
# quick cube bounds check
if abs(fx-tx) > r2 or abs(fy-ty) > r2 or abs(fz-tz) > r2:
continue
# add effect
if d2 < r2:
grid[batch_idx, layer_offset+ft, x, y, z] += v
def mol_gridify(grid, atom_coords, atom_layers, layer_offset, batch_idx,
width, res, center, rot):
"""Wrapper around gpu_gridify.
(See gpu_gridify() for details)
"""
dw = ((width - 1) // GPU_DIM) + 1
gpu_gridify[(dw,dw,dw), (GPU_DIM,GPU_DIM,GPU_DIM)](
grid, len(atom_coords), atom_coords, atom_layers, layer_offset,
batch_idx, width, res, center, rot
)
def make_tensor(shape):
"""Creates a pytorch tensor and numba array with shared GPU memory backing.
Args:
shape: the shape of the array
Returns:
(torch_arr, cuda_arr)
"""
# get cuda context
ctx = numba.cuda.cudadrv.driver.driver.get_active_context()
def rand_rot():
"""Returns a random uniform quaternion rotation."""
q = np.random.normal(size=4) # sample quaternion from normal distribution
q = q / np.sqrt(np.sum(q**2)) # normalize
return q
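# Example usage (an illustrative sketch, not part of the training pipeline):
# gridify two made-up atoms into batch slot 0 of a 2-channel, 24-point grid.
# Argument order and shapes follow the docstrings above.
#
#   import numpy as np
#   torch_grid, cuda_grid = make_tensor((1, 2, 24, 24, 24))
#   coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], dtype=np.float32)
#   layers = np.array([0, 1], dtype=np.int32)
#   mol_gridify(cuda_grid, coords, layers, layer_offset=0, batch_idx=0,
#               width=24, res=1.0, center=np.array([0.0, 0.0, 0.0]),
#               rot=rand_rot())
#   print(torch_grid.sum())  # voxel density accumulated on the GPU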
def get_batch(data, rec_channels, parent_channels, batch_size=16, batch_set=None,
width=48, res=0.5, ignore_receptor=False, ignore_parent=False):
"""Builds a batch grid from a FragmentDataset.
Args:
data: a FragmentDataset object
rec_channels: number of receptor channels
parent_channels: number of parent channels
batch_size: size of the batch
batch_set: if not None, specify a list of data indexes to use for each
item in the batch
width: grid width
res: grid resolution
ignore_receptor: if True, ignore receptor atoms
ignore_parent: if True, ignore parent atoms
Returns: (torch_grid, examples)
torch_grid: pytorch Tensor with voxel information
examples: list of examples used
"""
assert (not (ignore_receptor and ignore_parent)), "Can't ignore parent and receptor!"
dim = 0
dim += parent_channels
# create a tensor with shared memory on the gpu
torch_grid, cuda_grid = make_tensor((batch_size, dim, width, width, width))
if batch_set is None:
batch_set = np.random.choice(len(data), size=batch_size, replace=False)
examples = [data[idx] for idx in batch_set]
for i in range(len(examples)):
example = examples[i]
rot = rand_rot()
if ignore_receptor:
mol_gridify(
cuda_grid,
example['p_coords'],
example['p_types'],
layer_offset=0,
batch_idx=i,
width=width,
res=res,
center=example['conn'],
rot=rot
)
elif ignore_parent:
mol_gridify(
cuda_grid,
example['r_coords'],
example['r_types'],
layer_offset=0,
batch_idx=i,
width=width,
res=res,
center=example['conn'],
rot=rot
)
else:
mol_gridify(
cuda_grid,
example['p_coords'],
example['p_types'],
layer_offset=0,
batch_idx=i,
width=width,
res=res,
center=example['conn'],
rot=rot
)
mol_gridify(
cuda_grid,
example['r_coords'],
example['r_types'],
layer_offset=parent_channels,
batch_idx=i,
width=width,
res=res,
center=example['conn'],
rot=rot
)
return torch_grid, examples
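# Example usage (illustrative; FragmentDataset construction is handled by
# data_util.py and its arguments may differ):
#
#   dataset = ...  # a FragmentDataset over fragments/fingerprints .h5 files
#   torch_grid, examples = get_batch(dataset, rec_channels=4, parent_channels=3,
#                                    batch_size=16, width=24, res=1.0)
#   # torch_grid.shape == (16, 7, 24, 24, 24)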
def get_raw_batch(r_coords, r_types, p_coords, p_types, conn, num_samples=32,
width=24, res=1, rec_channels=9, parent_channels=9):
"""Sample a raw batch with provided atom coordinates.
Args:
r_coords: receptor coordinates
r_types: receptor types (layers)
p_coords: parent coordinates
p_types: parent types (layers)
conn: (x,y,z) connection point
num_samples: number of rotations to sample
width: grid width
res: grid resolution
rec_channels: number of receptor channels
parent_channels: number of parent channels
"""
B = num_samples
T = rec_channels + parent_channels
N = width
torch_grid, cuda_grid = make_tensor((B,T,N,N,N))
for i in range(num_samples):
# random rotation
rot = rand_rot()
mol_gridify(cuda_grid, p_coords, p_types, layer_offset=0, batch_idx=i,
width=width, res=res, center=conn, rot=rot)
mol_gridify(cuda_grid, r_coords, r_types, layer_offset=parent_channels, batch_idx=i,
width=width, res=res, center=conn, rot=rot)
return torch_grid
import torch
import torch.nn as nn
import torch.nn.functional as F
def mse(yp, yt):
"""Mean squared error loss."""
return torch.sum((yp - yt) ** 2, axis=1)
def bce(yp, yt):
"""Binary cross entropy loss."""
return torch.sum(F.binary_cross_entropy(yp, yt, reduction='none'), axis=1)
def tanimoto(yp, yt):
"""Tanimoto distance metric."""
intersect = torch.sum(yt * torch.round(yp), axis=1)
union = torch.sum(torch.clamp(yt + torch.round(yp), 0, 1), axis=1)
return 1 - (intersect / union)
_cos = nn.CosineSimilarity(dim=1, eps=1e-6)
def cos(yp, yt):
"""Cosine distance as a loss (inverted)."""
return 1 - _cos(yp,yt)
def broadcast_fn(fn, yp, yt):
"""Broadcast a distance function."""
yp_b, yt_b = torch.broadcast_tensors(yp, yt)
return fn(yp_b, yt_b)
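# Example (illustrative): broadcasting lets one predicted fingerprint be
# compared against an entire candidate set at once, e.g. for ranking.
#
#   yp = torch.rand(1, 2048)            # single predicted fingerprint
#   candidates = torch.rand(100, 2048)  # candidate fingerprint set
#   dist = broadcast_fn(mse, yp, candidates)  # shape: (100,)
#   best = int(torch.argmin(dist))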
def average_position(fingerprints, fn, norm=True):
"""Returns the average ranking of the correct fragment relative to all
possible fragments.
Args:
fingerprints: NxF tensor of fingerprint data
fn: distance function to compare fingerprints
norm: if True, normalize position in range (0,1)
"""
def _average_position(yp, yt):
# distance to correct fragment
p_dist = broadcast_fn(fn, yp, yt.detach())