Lmst

Huh. Was using my new #cheminformatics fingerprint generator code generator on the PubChem fingerprints which can be defined by a single match.

It told me bits 472 and 506 are the same.

I pulled up the primary documentation, at https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt and .. indeed they are, with the same patterns in reverse order!

472 C:N:C-C
506 C-C:N:C

The full list of such pairs is:

472/506, 585/657, 589/626, 620/632, 462/537, 581/642, 594/666, 470/520, 584/677, 595/608, 634/668, 490/556, 660/678
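
A minimal sketch of how such reversed pairs could be detected, assuming the patterns are simple linear chains like the two above (no branches, ring closures, or directional bonds). The function names are hypothetical and this is not the code generator mentioned in the post:

```python
import re

# Tokens: bracket atoms, two-letter halogens, single-letter atoms, or bond symbols.
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[A-Za-z*]|[-=#:~]")


def reverse_linear(pattern: str) -> str:
    """Reverse an unbranched atom/bond chain, e.g. 'C:N:C-C' -> 'C-C:N:C'."""
    tokens = _TOKEN.findall(pattern)
    if "".join(tokens) != pattern:
        # Branches, ring closures, etc. are outside the scope of this sketch.
        raise ValueError(f"not a simple linear pattern: {pattern!r}")
    return "".join(reversed(tokens))


def find_reversed_pairs(bits: dict[int, str]) -> list[tuple[int, int]]:
    """Return (lower_bit, higher_bit) pairs whose patterns are reversals of each other."""
    seen: dict[str, int] = {}  # pattern text -> first bit that used it
    pairs = []
    for bit, pattern in sorted(bits.items()):
        rev = reverse_linear(pattern)
        if rev in seen:
            pairs.append((seen[rev], bit))
        seen.setdefault(pattern, bit)
    return pairs


if __name__ == "__main__":
    print(find_reversed_pairs({472: "C:N:C-C", 506: "C-C:N:C"}))
```

Comparing text only catches literal reversals; a general check would need a real SMARTS canonicalizer or a substructure-equivalence test.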

scientific discussion moved from "letters to the editor" to PubPeer. There is a lot to be said about that, but one thing that must be said is that PubPeer can disappear.

But Letters to the Editor are preserved, for better or worse. If I'm not mistaken, it was @dalke who pointed me at #ICCS2025 to a 1977-1978 discussion via such letters. I looked them up, annotated some of the citations with the Citation Typing Ontology, made @nanopub and put them in @wikidata

#openscience #cheminformatics

Graph depiction with five articles as nodes with the article titles as node labels, and edges between them based on CiTO intent annotations. We see two research papers, resulting in three "letters to the editor" creating a lively discussion with scientific critiques going back and forth. The graph can be recreated with this SPARQL: https://w.wiki/EYE5

the average number of citations to articles in the Journal of Cheminformatics dropped for the second year in a row. Still leading the #cheminformatics field, but no longer significantly.

Meanwhile, I can publish in RSC's Digital Discovery for free (NL big deal) and J.Cheminform. not (excluded from the NL big deal).

Both are CC-BY, similar review standards, similar average citation count, similar open data standards, publisher support by RSC better than BMC.

You see my problem?

Does anyone find the Klekota-Roth #cheminformatics fingerprints useful, or even potentially interesting? I'm considering adding them to chemfp.

Original paper, https://academic.oup.com/bioinformatics/article/24/21/2518/192573 . "Using diverse phenotypic assays, we defined bioactivity for multiple compound libraries. Many substructures were associated with bioactivity ... validating the privileged substructure concept."

CDK implements them (4860 SMARTS tests). At #ICCS2025 I found light curiosity but no strong interest.

"How many of the compounds that appear in the chemical literature are mentioned just once?" https://doi.org/10.59350/rzepa.28802

"I am actually impressed that as many as 61.5% are mentioned more than once, since before learning the answer, I had intuitively guessed that percentage as being much lower."

(me too)

#chemistry #cheminformatics

new preprint with #opensource #cheminformatics by @Kohulan et al.: "Cheminformatics Microservice V-3: A Web Portal for Chemical Structure Manipulation and Analysis" https://doi.org/10.26434/chemrxiv-2025-xjkxl

"Here, we present Cheminformatics Microservice V3, a significant update to the existing platform that provides unified programmatic access to cheminformatics libraries, including RDKit, Chemistry Development Kit (CDK), and Open Babel through a RESTful API framework."

so, maybe you are wondering how we're doing with the 1 million IUPAC names? https://chem-bla-ics.linkedchemistry.info/2025/03/08/iupac-names.html https://chem-bla-ics.linkedchemistry.info/2025/04/27/one-million-iupac-names-2-the-100-thousand-milestone.html

Well, last week the 200k milestone was passed. But we're also running out of #openaccess articles from Europe PMC.

By doing some simple string replacements (e.g. methyl/ethyl/propyl) we can grow the size easily to beyond 1 million! That is not bad, with three months of running scripts in the background.

But first #ICCS2025 and teaching/grading.

#cheminformatics
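
A rough sketch of what "simple string replacements" might look like, assuming a plain swap of the alkyl stems meth/eth/prop. The function name is hypothetical, and the approach is deliberately naive: it can produce invalid or unintended names (for example from "isopropyl" or "methylene"), so every generated name would still need to be validated, e.g. with a name-to-structure parser.

```python
import re

ALKYL_STEMS = ("meth", "eth", "prop")
# Leftmost match wins, so the "ethyl" inside "methyl" is not matched separately.
_ALKYL = re.compile(r"(meth|eth|prop)yl")


def alkyl_variants(name: str) -> set[str]:
    """Generate candidate names by swapping one alkyl stem at a time."""
    variants = set()
    for match in _ALKYL.finditer(name):
        for stem in ALKYL_STEMS:
            if stem != match.group(1):
                variants.add(name[:match.start()] + stem + "yl" + name[match.end():])
    return variants


if __name__ == "__main__":
    print(sorted(alkyl_variants("methyl acetate")))
```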

Did my first timings of chemfp's "shardsearch" for searching ~1 billion #cheminformatics fingerprints by aggregate search of smaller shards.

Was annoyed that k=3 NN search of 1024-bit Morgan fingerprints took 10 minutes on my desktop. It should have been much faster, like less than a minute.

Then realized "wc shard*.fpb" takes *23 minutes*.

I'm gonna need a faster disk. Have spinning rust for 7 TB of space, not speed.

Zstd should help. It uses 1/4 the space. I'll need smaller shards to test.
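
The idea of an aggregate search over shards can be sketched as follows: find the k best hits in each shard, then keep the k best of those. This is a toy pure-Python illustration with fingerprints stored as integers; it is not chemfp's "shardsearch" implementation or API, and all names are hypothetical.

```python
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def search_shard(query: int, shard, k: int):
    """Return the k best (score, id) hits from one shard of (id, fingerprint) pairs."""
    return heapq.nlargest(k, ((tanimoto(query, fp), id_) for id_, fp in shard))


def shard_search(query: int, shards, k: int):
    """Search each shard independently, then merge the per-shard top-k lists."""
    per_shard = (search_shard(query, shard, k) for shard in shards)
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))


if __name__ == "__main__":
    shards = [
        [("a", 0b1111), ("b", 0b0001)],
        [("c", 0b0111), ("d", 0b1000)],
    ]
    print(shard_search(0b0111, shards, k=2))
```

Because each shard only contributes k candidates, the merge step is cheap; the cost is dominated by reading and scanning the shards, which matches the disk-bound timings above.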

Vortex script to change display https://macinchem.org/2025/05/18/vortex-script-to-change-display-of-workspace/ #cheminformatics #vortex

OPSIN (systematic IUPAC nomenclature) now lives at the EBI https://www.ebi.ac.uk/opsin/, as part of a collaboration with Daniel Lowe.

#cheminformatics

When I started chemfp, PubChem and CAS had <100M records. Real-world #cheminformatics data doubles about every 10 years (since WWII), so I figured 200M was good enough.

Chemfp's FPB format fails at around 268M recs due to its hash table format layout. (I stored 16*X instead of X in a 4 byte field.)

I've now been working with bigger synthetic data sets.

I've tweaked the FPB format to handle 1B records. 🎉

Can't go much bigger as id lookups get slow due to 32-bit hash collisions/pigeon-holing.
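
The numbers above can be checked with a little arithmetic. A 4-byte unsigned field holds 2^32 values, so storing 16*X caps X at about 2^32/16 = 2^28, or roughly 268M. The collision estimate below is a sketch that assumes a uniformly distributed 32-bit hash, which is an assumption and not a statement about chemfp's actual hash function.

```python
import math


def max_records(field_values: int = 2 ** 32, scale: int = 16) -> int:
    """Largest X (approximately) such that scale*X still fits in the field."""
    return field_values // scale


def expected_colliding_ids(n_ids: int, hash_bits: int = 32) -> float:
    """Expected number of ids whose hash value is already used by an earlier id.

    Assumes a uniform hash: occupied slots = m * (1 - (1 - 1/m)**n).
    """
    m = 2 ** hash_bits
    occupied = m * -math.expm1(n_ids * math.log1p(-1.0 / m))
    return n_ids - occupied


if __name__ == "__main__":
    print(max_records())
    print(round(expected_colliding_ids(1_000_000_000)))
```

Under that assumption, on the order of a tenth of 1B ids would share a hash value with another id, which is consistent with lookups slowing down at that scale.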

At the request of a journal editor, I reviewed a paper by leading researchers on one of my favorite #chemistry topics - tautomers! This article was featured in the Journal of Chemical Information and Modeling. I am grateful for the #PeerReview certificate presented by the American Chemical Society. It was an honor to be entrusted with this responsibility.

Reminder that I'm #OpenToWork for #cheminformatics or #scientificSoftware development. Let's discuss how my skills can benefit your team.

2024 ACS Publications Peer Reviewer Certificate of Recognition & Appreciation to Jeremy Monat

Hi everyone, we have completed the draft schedules for the oral presentations and the poster sessions. Please find the full program at https://iccs-nl.org/general-information/scientific-program/. You can also find linked the descriptions of the two pre-conference workshops on Sunday. See you in three weeks!

#ICCS2025 #chemistry #cheminformatics

Hah! Finally resolved a chemfp bug that's been bothering me for 5+ years.

chemfp handles #cheminformatics fingerprints in two formats - the easy-to-read text FPS format, and the fast-to-load binary FPB format.

The default FPS->FPB converter takes several times more RAM than the final FPB file. That's a problem with 30 GB files! I have an option to break processing into chunks (eg, 6GB), but it never worked right (eg, it creates <1GB files).

Turns out my memory use estimator was quite wrong.

A comprehensive article on reaction prediction. https://macinchem.org/2025/05/05/reaction-prediction/ #cheminformatics.

The 2025_03_1 release of #RDKit includes my contribution to speed up part of getting 2D fingerprints for a molecule by ~75x! So if you generate #chemical fingerprints, now is a good time to upgrade.

Reminder that I'm #OpenToWork so if you're hiring for #cheminformatics or #scientificSoftware development, let's talk.

#chemistry #DrugDiscovery #pharma #PythonForChemists

https://github.com/rdkit/rdkit/releases/tag/Release_2025_03_1

thanks to the @fosstodon admins for giving statements. Not all our #fosstodon questions have been answered.

We live in difficult times where tensions run high and where independent justice on social media is absent. @fosstodon welcomed our project, with hesitance, not knowing who is behind this account or who is behind the Blue Obelisk movement. This brings risks, courage, and misuse.

We would like to thank @fosstodon for allowing us to share our #openscience #cheminformatics news here for 2.5 years.

@jhylin I also find that #cheminformatics project ideas evolve as I work on them. I sometimes start out with one idea, then when I solve it in code I realize that it opened a vista to another problem that I also need to solve to address the goal of the blog post.

Here's the expanded CYP-ADRs dataset on adverse drug reactions for cytochrome P450 substrates (drugs) with ideas behind this work.

Dataset: https://github.com/jhylin/Adverse_drug_reactions/blob/main/Data/cyp_substrates_adrs.csv

Ideas: https://jhylin.github.io/Data_in_life_blog/posts/22_Simple_dnn_adrs/0_Ideas.html

(I seem to be working in reverse lately... where project ideas are only more fully formed after I have partially worked on them)

#prescription_drugs #cytochromep450 #AdverseReactions #cheminformatics

itching to put these 500+ experimental boiling points in @wikidata ... but this 2004 paper does not have SMILES; it uses this shorthand notation instead (screenshot). Should be doable, but it is also a nice B.Sc. student project, I guess. https://doi.org/10.1021/ci049802u

#cheminformatics
