CADD Webapp by Curtis W.

Method

Computing Process

When a user uploads a file, JavaScript first validates the file type and reads the file contents. The resulting data are then passed to a Python function, which connects to the ChEMBL Database API and retrieves the canonical SMILES string for each submitted ChEMBL ID.
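The lookup step can be sketched roughly as follows. This is a minimal illustration rather than the webapp's actual code: it assumes the public ChEMBL REST endpoint for single molecule records, and the helper names extract_canonical_smiles and fetch_canonical_smiles are hypothetical.

```python
import json
from urllib.request import urlopen

# Public ChEMBL REST endpoint for a single molecule record (assumed here;
# the webapp may use a client library instead).
CHEMBL_MOLECULE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule/{chembl_id}.json"


def extract_canonical_smiles(record):
    """Return the canonical SMILES from a molecule record, or None if absent."""
    # "molecule_structures" can be null for entries without a structure.
    structures = record.get("molecule_structures") or {}
    return structures.get("canonical_smiles")


def fetch_canonical_smiles(chembl_id, timeout=10):
    """Download one molecule record and return its canonical SMILES (hypothetical helper)."""
    url = CHEMBL_MOLECULE_URL.format(chembl_id=chembl_id)
    with urlopen(url, timeout=timeout) as response:
        record = json.load(response)
    return extract_canonical_smiles(record)
```

Calling fetch_canonical_smiles("CHEMBL25") would request a single record over the network. IDs for which the function returns None can be treated as invalid and skipped before the descriptor step.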

Next, a Java subprocess is launched from the shell to compute the PaDEL descriptors for the valid molecules. The descriptor table is then reduced to the features selected during the data pre-processing phase, so that it matches the input expected by the final step: a pre-saved machine learning model that predicts the bioactivity value from the PaDEL descriptors.
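A rough sketch of the descriptor and feature-reduction steps is shown below. The PaDEL-Descriptor options used (-2d, -dir, -file) and all function names are assumptions made for illustration; the options the webapp actually passes are not stated here.

```python
import subprocess


def build_padel_command(jar_path, input_dir, output_csv):
    """Build the argument list for a PaDEL-Descriptor command-line run."""
    return [
        "java", "-jar", jar_path,
        "-2d",                # compute 1D/2D descriptors (assumed setting)
        "-dir", input_dir,    # directory containing the molecule files
        "-file", output_csv,  # CSV file the descriptors are written to
    ]


def run_padel(jar_path, input_dir, output_csv):
    """Run PaDEL-Descriptor and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_padel_command(jar_path, input_dir, output_csv), check=True)


def select_features(descriptor_rows, kept_features):
    """Keep only the descriptor columns retained during pre-processing.

    descriptor_rows is a list of dicts (for example from csv.DictReader);
    the result is a list of numeric rows in the order of kept_features.
    """
    return [[float(row[name]) for name in kept_features] for row in descriptor_rows]
```

The rows returned by select_features could then be passed to the predict method of the loaded model, provided kept_features lists the columns in the same order used at training time.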

The results are then displayed to the user. For details on the methods used for URL redirection, data value validation, and session value transmission, please refer to the GitHub repository.

Data Pre-processing

All raw data were obtained directly from the ChEMBL Database, and for each target molecule the bioactivity type with the highest number of recorded data points was selected. As a result, the webapp predicts IC50 values for some targets and potency values for others. Pre-processing involved removing compounds that lacked a recorded bioactivity value or a canonical SMILES string, and converting the bioactivity values to a more usable scale by taking their logarithm. PaDEL-Descriptor was then used to calculate the molecular descriptors. To improve the quality of the dataset, features with low variance (σ² < 0.02) and features that were highly correlated with one another (r > 0.8) were identified and removed. Some data points with abnormally low concentration values were also removed.
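The log conversion and the two feature filters can be illustrated with a small standard-library sketch. Several details are assumptions, since they are not specified above: concentrations are taken to be in nM and converted to a negative log10 molar scale, the variance is the population variance, the correlation is the absolute Pearson coefficient, and of two correlated features the one appearing later is dropped.

```python
import math
from statistics import pvariance


def to_negative_log_molar(value_nm):
    """Convert a concentration in nM to -log10 of its molar value (assumed scale)."""
    return -math.log10(value_nm * 1e-9)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm


def filter_features(columns, min_variance=0.02, max_correlation=0.8):
    """Return the names of the features that survive both filters.

    columns maps each feature name to its list of values.
    """
    # Step 1: drop low-variance features (this also guarantees norm > 0 below).
    candidates = [
        name for name, values in columns.items() if pvariance(values) >= min_variance
    ]
    # Step 2: keep a feature only if it is not highly correlated with one already kept.
    kept = []
    for name in candidates:
        if all(
            abs(pearson_r(columns[name], columns[other])) <= max_correlation
            for other in kept
        ):
            kept.append(name)
    return kept
```

The list returned by filter_features is the kind of feature list that would be saved and reused at prediction time to reduce newly computed descriptors to the same columns.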

Model Training

For each target molecule, a variety of machine learning models with different parameters were compared, and the best-performing ones were selected. In general, regressors such as HistGradientBoostingRegressor or LGBMRegressor performed best. For example, for acetylcholinesterase, LGBMRegressor outperformed more than 40 other regressors, as well as a feedforward artificial neural network that was tested with more than 60 different parameter combinations. The models for many target molecules overfitted significantly, and attempts to reduce the number of features using different variance or correlation thresholds yielded only marginal improvements in model performance. For most target molecules, the artificial neural networks also tended to perform better with fewer hidden layers, fewer nodes per layer, and a higher number of epochs.
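One simple way to compare candidate models and to quantify overfitting is to score each on both the training data and held-out data and look at the gap. The sketch below assumes R² as the metric, which is not stated above; the function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rank_models(scores):
    """Rank models by held-out R², best first.

    scores maps a model name to a (train_r2, validation_r2) pair.
    Each result is (name, validation_r2, overfit_gap), where the gap is
    train_r2 - validation_r2; a large gap suggests overfitting.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
    return [(name, val, train - val) for name, (train, val) in ranked]
```

A model with a high training score but a much lower held-out score would appear far down this ranking with a large gap, which is the pattern described above as overfitting.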

IC50

IC50 (half maximal inhibitory concentration) is a measure of the concentration of a drug or compound required to inhibit 50% of the target activity. It is commonly used in pharmacology to evaluate drug potency and to compare different compounds. The lower the IC50 value, the more potent the drug or compound is considered to be.

Potency

Potency is a pharmacological measure that quantifies the amount of a drug or compound required to produce a specific therapeutic effect. It is a critical parameter in drug discovery and development, as it reflects the strength of the drug or compound's biological activity. Because the value expresses the amount required, the lower the potency value, the more potent the drug or compound is considered to be.