Chris Mungall

Berkeley Lab. I work on #GeneOntology #MonarchInitiative #AllianceGenome #NationalMicrobiomeDataCollaborative #OBOFoundry.

Here's a link to the pre-print again https://arxiv.org/abs/2505.18470, and many thanks to all the others involved, Janna Hastings, @justaddcoffee, Daniel Korn and special thanks to the ChEBI team Adnan Malik and Noel O'Boyle for checking the results. May the force be with you!

We have more of our findings up on a website here: https://chemkg.github.io/c3p, along with the code we used to learn the classes.

One thing to note is that even though we used LLMs to learn the classes, the resulting programs can be executed for classification with no runtime dependencies beyond Python and RDKit, and the results are fully interpretable!

Slide emphasizing the interpretability of the program-based approach. C3PO and R2D2 on Tatooine: C3PO says to R2D2, "Master, I regret to inform you that your SMILES is not classified as a thienopyrimidine, because I could not find a 6-membered ring." R2D2 answers with an impenetrable "boop-beep-boop".

Additionally, we suspected that a number of misclassifications in ChEBI end up diverting the learning process. We used a combination of LLM vision models (viewing the chemical structures) and literature search to confirm some of these cases, and after validation by a ChEBI curator, these were fixed in the ontology.

Examples of terpenoid classifications flagged by a vision model. Full table can be explored on https://docs.google.com/spreadsheets/d/1lqHS2DSKax6TwbsBibN2Sw1bC6CJRwV3sYqk9FwUYGQ/edit?gid=1854221255#gid=1854221255

We investigated the cases where we expected the LLM to do better than it did. For example, terpenoids should in theory be easily classifiable by counting the number of isoprene units. But it turns out not to be that straightforward: chemicals can be classified either by origin or by structure....

Overall the results were promising, although we fell short of the current state of the art for automated classification, chebifier.

Macro stats for Chebifier vs. the ensemble of LLM-learned programs, with Chebifier achieving higher F1 and comparable recall.
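For readers less familiar with macro-averaged statistics: a macro F1 scores each class separately and then takes the unweighted mean, so a rare class counts as much as a common one. A minimal sketch in plain Python follows. The per-class counts are made-up illustration values, not results from the paper, and the function names are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from true-positive, false-positive, false-negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts: dict) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: dict) -> float:
    """F1 computed once over the counts pooled across all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_score(tp, fp, fn)


# Illustrative (tp, fp, fn) counts only; NOT taken from the C3PO benchmark.
ILLUSTRATIVE_COUNTS = {
    "common_class": (90, 10, 10),
    "rare_class": (1, 1, 3),
}

print(f"macro F1: {macro_f1(ILLUSTRATIVE_COUNTS):.3f}")
print(f"micro F1: {micro_f1(ILLUSTRATIVE_COUNTS):.3f}")
```

With these toy counts the macro figure is pulled down by the poorly learned rare class, while the micro figure is dominated by the common one. That is why macro statistics are the stricter summary when many classes are small.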

The agent’s decision process was recorded via GitHub commits so you can look at the evolution of each individual program, along with the LLM’s “thinking process”. Here is glycerophosphocholine, which eventually converged on an f1 of 0.97
https://github.com/chemkg/c3p/commits/main/c3p/programs/glycerophosphocholine.py

Learned program for glycerophosphocholine, which converged on matching any one of three SMARTS strings:

# Pattern 1: fully free glycerol backbone.
pattern_free = "OCC(O)CO[P](=O)(O)OCC[N+](C)(C)C"
# Pattern 2: lysophosphatidylcholine with one acyl chain at position sn-1.
pattern_lyso_sn1 = "OC(=O)OCC(O)CO[P](=O)(O)OCC[N+](C)(C)C"
# Pattern 3: lysophosphatidylcholine with one acyl chain at position sn-2.
# In the SMARTS below, the '*' after OC(=O) allows for any carbon chain.
pattern_lyso_sn2 = "OCC(OC(=O)*)CO[P](=O)(O)OCC[N+](C)(C)C"
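
To show how patterns like these are applied, here is a minimal sketch using RDKit substructure matching. It is not the learned program itself (see the GitHub link above for that): the function name, the dictionary labels, and the test molecule are assumptions made for illustration, and the SMARTS strings are copied as they appear in the post.

```python
from rdkit import Chem

# SMARTS strings copied verbatim from the post; the labels are illustrative.
PATTERNS = {
    "free glycerol backbone": "OCC(O)CO[P](=O)(O)OCC[N+](C)(C)C",
    "lyso, acyl chain at sn-1": "OC(=O)OCC(O)CO[P](=O)(O)OCC[N+](C)(C)C",
    "lyso, acyl chain at sn-2": "OCC(OC(=O)*)CO[P](=O)(O)OCC[N+](C)(C)C",
}


def match_any_pattern(smiles: str) -> tuple:
    """Return (matched, reason), trying each SMARTS pattern in turn.

    Hypothetical helper, not part of the C3PO code base.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False, "Invalid SMILES string"
    for label, smarts in PATTERNS.items():
        query = Chem.MolFromSmarts(smarts)
        if mol.HasSubstructMatch(query):
            return True, f"Matched pattern: {label}"
    return False, "No glycerophosphocholine pattern matched"


# Glycerol esterified to phosphocholine, written with a neutral phosphate OH.
print(match_any_pattern("C[N+](C)(C)CCOP(O)(=O)OCC(O)CO"))
print(match_any_pattern("CCO"))  # ethanol: no match expected
```

Returning a reason string alongside the boolean is what makes the result readable: a curator sees which pattern fired, or that none did.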

But some classes just couldn’t be learned well – and some of these were surprising.

https://chemkg.github.io/c3p/analysis/classes_bottom.html

Classes that did not learn well: lactol, glycosaminoglycan, tertiary amine.

We ran an experiment on a subset of ChEBI classes (we excluded the more obscure ones, focusing on those of most relevance to biologists). It turns out that a lot of chemical classes can easily be learned (well-defined structures like triglycerides, as well as trivial classes for things like polonium atoms, since we didn’t filter out non-polyatomic classes!).

https://chemkg.github.io/c3p/analysis/classes_top.html

But there is no open library of programs (“ontology”) for all major chemical classes, and writing one by hand would take some time. We thought, what if we use LLMs to generate a program for each class? We could benchmark the results against existing curated classifications in ChEBI and iteratively improve them, eventually building up a ChEBI Chemical Classification Programs Ontology (C3PO). This could be used to classify new structures added to ChEBI, or any structures.
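
The benchmark-and-improve idea can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not the C3PO implementation: in the real workflow an LLM proposes each new attempt, whereas here a fixed list of hand-written attempts stands in for that step. The toy class ("contains nitrogen"), the naive string checks, and all names are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Program = Callable[[str], bool]
Benchmark = List[Tuple[str, bool]]  # (SMILES, curated class membership)


def f1_on_benchmark(program: Program, benchmark: Benchmark) -> float:
    """Score one candidate program against curated labels."""
    tp = fp = fn = 0
    for smiles, is_member in benchmark:
        predicted = program(smiles)
        if predicted and is_member:
            tp += 1
        elif predicted:
            fp += 1
        elif is_member:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def keep_best(
    attempts: Iterable[Program], benchmark: Benchmark, target: float = 0.95
) -> Tuple[Optional[Program], float]:
    """Evaluate attempts in order, keep the best, stop once the target is met."""
    best, best_score = None, -1.0
    for program in attempts:
        score = f1_on_benchmark(program, benchmark)
        if score > best_score:
            best, best_score = program, score
        if best_score >= target:
            break
    return best, best_score


# Toy benchmark for a made-up class; real benchmarks would come from ChEBI.
TOY_BENCHMARK = [("CCN", True), ("c1ccncc1", True), ("CCO", False), ("CC", False)]


def attempt_1(smiles: str) -> bool:
    return "N" in smiles  # misses aromatic nitrogen, written as lowercase 'n'


def attempt_2(smiles: str) -> bool:
    return "n" in smiles.lower()  # still naive: would also fire on 'Na' or 'Zn'


best_program, best_f1 = keep_best([attempt_1, attempt_2], TOY_BENCHMARK)
print(best_program.__name__, best_f1)
```

Scoring against curated labels gives each attempt an objective number, which is what lets the loop decide whether a revision is an improvement.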

Image taken from Star Wars of Luke talking to C3PO. C3PO says "Luke, what if we use LLMs to create an RDKit program for every class in ChEBI?", and Luke responds "I’m starting to think you’ve blown a circuit, Threepio".

Unlike some other domains where ontologies are used, classification of chemical structures is in theory relatively crisp and deterministic, and we should be able to write python programs using libraries like RDKit that can accurately classify a SMILES string based on objective features like number of rings, counting atoms and so on.

Example Python program for classifying alkanes; the program is reproduced below. The import and the function wrapper are added here so the excerpt runs on its own, and the name is_alkane is an assumption.

from rdkit import Chem
def is_alkane(smiles: str):
    # Parse SMILES
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False, "Invalid SMILES string"
    # Check for presence of only carbon and hydrogen atoms
    elements = {atom.GetAtomicNum() for atom in mol.GetAtoms()}
    if elements.difference({6, 1}):  # Atomic number 6 is C, 1 is H
        return False, "Contains atoms other than carbon and hydrogen"
    # Check if the molecule is saturated (only single bonds)
    for bond in mol.GetBonds():
        if bond.GetBondTypeAsDouble() != 1.0:
            return False, "Contains unsaturated bonds (double or triple bonds present)"
    # Check for acyclic structure (cannot have rings)
    if mol.GetRingInfo().NumRings() > 0:
        return False, "Contains rings, not acyclic"
    # Correctly count carbons
    carbon_count = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
    # Correctly calculate the total hydrogen count, including implicit hydrogens
    hydrogen_count = sum(atom.GetTotalNumHs() for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
    # Calculate expected hydrogen count based on alkanes' CnH2n+2 rule
    expected_hydrogen_count = 2 * carbon_count + 2
    if hydrogen_count != expected_hydrogen_count:
        return False, f"Formula C{carbon_count}H{hydrogen_count} does not match CnH2n+2 (expected H{expected_hydrogen_count})"
    return True, "Molecule matches the definition of an alkane"

This achieves high accuracy, but the underlying embeddings are hard for humans to introspect, and the resulting classifications are hard to tweak. Instead of learning latent representations, what if we learn interpretable rules in the form of programs?

Deep learning approaches to classification typically learn a latent representation of the inputs and use this to predict classes. For chemical classification, the leading deep learning approach is the awesome Chebifier, which works off of embeddings of SMILES strings. See https://chebifier.hastingslab.org/ and Glauer et al https://pubs.rsc.org/en/content/articlelanding/2024/dd/d3dd00238a

How can we scale up manual classification of chemical structures in databases like ChEBI? Can we help curators place new structures into classes like "terpenoid", based on their chemical structure? We describe a new approach in our manuscript "Chemical classification program synthesis using generative artificial intelligence" (aka the “C3PO” project), pre-print here: https://arxiv.org/abs/2505.18470

A subset of the ChEBI ontology, with terms organized in a hierarchy. The leaf nodes typically correspond to discrete structures (e.g. specific stereoisomers of artemisinin (CHEBI:223316)) with ground (non-wildcard) SMILES strings, while parents and ancestors represent chemical class groupings (e.g. triterpenoid saponin (CHEBI:61778), or more general classes such as lipid (CHEBI:18059)). Although ChEBI itself does not have an explicit distinction between classes and structures, we show CHEMROF metaclasses indicating this additional level of organization.

@jerven Done, and thanks for your contributions, these are out in a new release: https://github.com/chemkg/chemrof/releases/tag/v0.3.0

@jerven answered your issue in GitHub, thanks for the prompt!

@jerven @cthoyt but many people just need the properties (https://chemkg.github.io/chemrof/#slots). ChEBI themselves are using some of the properties for the new ChEBI, and we are using them for our materials KG at my institute; this is where a lot of current effort is going...

@jerven @cthoyt Using it to create our benchmarks for our AI-driven chemical classification program synthesis https://chemkg.github.io/c3p/

@jerven @cthoyt more database-style ontologies need this explicit level of metaclasses to manage the different levels in the hierarchy (see https://github.com/OBOFoundry/OBOFoundry.github.io/issues/2454) -- but this is especially the case for ChEBI

@jerven @cthoyt chemrof provides the framework by which the different levels of the hierarchy should be managed https://chemkg.github.io/chemrof/ontology/

@jerven @cthoyt ChEBI desperately needs to be simplified and moved towards a modern system for managing the classes. This was agreed on after the last ChEBI workshop, and the new group lead is on board. Slides here: Mungall, C. (2024, November 19). The need for a simpler collaboratively maintained CHEBI hierarchy. CHEBI 2024 Workshop, Hinxton, UK. Zenodo. https://doi.org/10.5281/zenodo.14298221
