#Cheminformatics

Egon Willighagen egonw@social.edu.nl
2025-05-24

so, maybe you are wondering how we're doing with the 1 million IUPAC names? chem-bla-ics.linkedchemistry.i

Well, last week the 200k milestone was passed. But we're also running out of #openaccess articles from Europe PMC.

By doing some simple string replacements (e.g. methyl/ethyl/propyl) we can grow the size easily to beyond 1 million! That is not bad, with three months of running scripts in the background.
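
A minimal sketch of the string-replacement idea, assuming plain substring swaps between the three alkyl prefixes named in the post. The function name `alkyl_variants` is hypothetical and this is not the script actually used. Note that "methyl" contains "ethyl", so the regex lists "methyl" first and relies on leftmost matching to avoid rewriting the inside of "methyl".

```python
import re

# Alkyl prefixes named in the post; everything else here is illustrative.
ALKYL = ("methyl", "ethyl", "propyl")
ALKYL_RE = re.compile("|".join(ALKYL))


def alkyl_variants(name):
    """Return new names made by swapping one alkyl prefix for another.

    Each occurrence is replaced on its own, so a name with two alkyl
    groups yields variants for each position separately.
    """
    variants = set()
    for match in ALKYL_RE.finditer(name):
        for other in ALKYL:
            if other != match.group():
                variants.add(name[:match.start()] + other + name[match.end():])
    return sorted(variants)


print(alkyl_variants("methyl benzoate"))
```

Generated names are not guaranteed to be preferred IUPAC names, or even chemically sensible, so each one would still need to be checked, for example by parsing it with a name-to-structure tool.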

But #ICCS2025 and teaching/grading come first.

#cheminformatics

Andrew Dalke dalke@toots.nu
2025-05-21

Did my first timings of chemfp's "shardsearch" for searching ~1 billion #cheminformatics fingerprints by aggregate search of smaller shards.

Was annoyed that k=3 NN search of 1024-bit Morgan fingerprints took 10 minutes on my desktop. It should have been much faster, like less than a minute.

Then realized "wc shard*.fpb" takes *23 minutes*.

I'm gonna need a faster disk. Have spinning rust for 7 TB of space, not speed.

Zstd should help. It uses 1/4 the space. I'll need smaller shards to test.
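
A minimal sketch of the general idea behind aggregating a k-nearest-neighbor search over shards: take the top k hits from each shard, then keep the best k overall. This is not chemfp's API or implementation. Fingerprints are modeled as Python integers, and the names `search_shard` and `shard_search` are hypothetical.

```python
import heapq


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def search_shard(query, shard, k):
    """Top-k (score, id) hits within one shard of (id, fingerprint) pairs."""
    return heapq.nlargest(k, ((tanimoto(query, fp), fp_id) for fp_id, fp in shard))


def shard_search(query, shards, k):
    """Merge per-shard top-k hits into a global top-k list."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(query, shard, k))
    return heapq.nlargest(k, hits)


shards = [
    [("a", 0b1111), ("b", 0b0011)],
    [("c", 0b1110), ("d", 0b1000)],
]
print(shard_search(0b1111, shards, k=3))
```

The merge is exact because any fingerprint in the global top k must also be in the top k of its own shard, so only k candidates per shard need to be kept.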

2025-05-17

OPSIN (systematic IUPAC nomenclature) now lives at the EBI ebi.ac.uk/opsin/, as part of a collaboration with Daniel Lowe.

#cheminformatics

Andrew Dalke dalke@toots.nu
2025-05-14

When I started chemfp, PubChem and CAS had <100M records. Real-world #cheminformatics data doubles about every 10 years (since WWII), so I figured 200M was good enough.

Chemfp's FPB format fails at around 268M recs due to its hash table format layout. (I stored 16*X instead of X in a 4 byte field.)
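
The ~268M figure follows from the arithmetic, assuming the 4-byte field is unsigned (the post does not say). A small worked check, with hypothetical constant names:

```python
# Assumption: the 4-byte field is an unsigned 32-bit integer.
FIELD_BYTES = 4
SCALE = 16  # the post says 16*X was stored instead of X

max_field_value = 2 ** (8 * FIELD_BYTES) - 1   # largest storable value
max_x = max_field_value // SCALE               # largest X with 16*X still fitting

print(max_x)                                   # 268435455, i.e. just under 2**28
print(SCALE * (max_x + 1) > max_field_value)   # the next X overflows the field
```

Storing X directly in the same field would raise the ceiling by a factor of 16, to about 4.29 billion.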

I've now been working with bigger synthetic data sets.

I've tweaked the FPB format to handle 1B records. 🎉

Can't go much bigger as id lookups get slow due to 32-bit hash collisions/pigeon-holing.
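
A rough estimate of why 32-bit hashes get crowded at this scale, assuming ids hash uniformly and independently into 2**32 values (an idealization, not a description of chemfp's actual hash function). The function names are hypothetical.

```python
import math

HASH_BITS = 32            # from the post: 32-bit hashes
SLOTS = 2 ** HASH_BITS    # number of distinct hash values


def expected_colliding_pairs(n):
    """Expected number of id pairs that share a hash value."""
    return n * (n - 1) / (2 * SLOTS)


def fraction_sharing_hash(n):
    """Approximate chance that a given id shares its hash with another id.

    Uses 1 - (1 - 1/SLOTS)**(n - 1) ~= 1 - exp(-(n - 1) / SLOTS).
    """
    return 1.0 - math.exp(-(n - 1) / SLOTS)


for n in (10 ** 8, 10 ** 9, 4 * 10 ** 9):
    print(n, round(fraction_sharing_hash(n), 3))
```

Under this model roughly one id in five already shares its hash value at 1 billion records, and past 2**32 records collisions are unavoidable by the pigeonhole principle.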

2025-05-14

At the request of a journal editor, I reviewed a paper by leading researchers on one of my favorite #chemistry topics - tautomers! This article was featured in the Journal of Chemical Information and Modeling. I am grateful for the #PeerReview certificate presented by the American Chemical Society. It was an honor to be entrusted with this responsibility.

Reminder that I'm #OpenToWork for #cheminformatics or #scientificSoftware development. Let's discuss how my skills can benefit your team.

2024 ACS Publications Peer Reviewer Certificate of Recognition & Appreciation to Jeremy Monat

Hi everyone, we have completed the draft schedules for the oral presentations and the poster sessions. Please find the full program at iccs-nl.org/general-informatio. You can also find linked the descriptions of the two pre-conference workshops on Sunday. See you in three weeks!

#ICCS2025 #chemistry #cheminformatics

Andrew Dalke dalke@toots.nu
2025-05-09

Hah! Finally resolved a chemfp bug that's been bothering me for 5+ years.

chemfp handles #cheminformatics fingerprints in two formats - the easy-to-read text FPS format, and the fast-to-load binary FPB format.

The default FPS->FPB converter takes several times more RAM than the final FPB file. That's a problem with 30 GB files! I have an option to break processing into chunks (e.g., 6 GB), but it never worked right (e.g., it creates <1 GB files).

Turns out my memory use estimator was quite wrong.

2025-05-05

A comprehensive article on reaction prediction. macinchem.org/2025/05/05/react #cheminformatics.

2025-05-01

The 2025_03_1 release of #RDKit includes my contribution to speed up part of getting 2D fingerprints for a molecule by ~75x! So if you generate #chemical fingerprints, now is a good time to upgrade.

Reminder that I'm #OpenToWork so if you're hiring for #cheminformatics or #scientificSoftware development, let's talk.

#chemistry #DrugDiscovery #pharma #PythonForChemists

github.com/rdkit/rdkit/release

2025-04-29

thanks to the @fosstodon admins for giving statements. Not all our #fosstodon questions have been answered.

We live in difficult times where tensions run high and where independent justice on social media is absent. @fosstodon welcomed our project, with hesitance, not knowing who is behind this account or who is behind the Blue Obelisk movement. This brings risks, courage, and misuse.

We would like to thank @fosstodon for allowing us to share our #openscience #cheminformatics news here for 2.5 years.

2025-04-28

@jhylin I also find that #cheminformatics project ideas evolve as I work on them. I sometimes start out with one idea, then when I solve it in code I realize that it opened a vista to another problem that I also need to solve to address the goal of the blog post.

2025-04-27

Here's the expanded CYP-ADRs dataset on adverse drug reactions for cytochrome P450 substrates (drugs) with ideas behind this work.

Dataset: github.com/jhylin/Adverse_drug

Ideas: jhylin.github.io/Data_in_life_

(I seem to be working in reverse lately... where project ideas are only more fully formed after having partially worked on them)

#prescription_drugs #cytochromep450 #AdverseReactions #cheminformatics

Egon Willighagen egonw@social.edu.nl
2025-04-27

itching to put these 500+ experimental boiling points in @wikidata ... but this 2004 paper does not have SMILES, but this shorthand notation (screenshot). Should be doable, but also is a nice B.Sc. student project, I guess. doi.org/10.1021/ci049802u

#cheminformatics

Screenshot of the supplementary info of the article linked in the post, zoomed in on the "Structure" column, with entries like this:

CH3-CH2-CH2Cl
CH2F2
CH3-CF2-CF3
etc
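
A minimal sketch of how the three entries shown could be turned into SMILES, assuming the notation is always a hyphen-separated chain of single carbons carrying only H and halogens. That assumption is inferred from the screenshot entries alone, the function names are hypothetical, and no valence checking is done.

```python
import re

# One token is an element symbol (C, H, F, Cl, Br, I) plus an optional count.
TOKEN = re.compile(r"(Cl|Br|[CHFI])(\d*)")


def group_to_smiles(group, is_last):
    """Convert one group such as 'CH2Cl' or 'CF3' to a SMILES fragment."""
    tokens = TOKEN.findall(group)
    rebuilt = "".join(symbol + count for symbol, count in tokens)
    if rebuilt != group or not tokens or tokens[0] != ("C", ""):
        raise ValueError(f"unsupported group: {group!r}")
    substituents = []
    for symbol, count in tokens[1:]:
        if symbol == "C":
            raise ValueError(f"unsupported group: {group!r}")
        if symbol != "H":  # hydrogens stay implicit in SMILES
            substituents.extend([symbol] * (int(count) if count else 1))
    if is_last and substituents:
        # The final substituent of the chain can be written without a branch.
        branches, tail = substituents[:-1], substituents[-1]
    else:
        branches, tail = substituents, ""
    return "C" + "".join(f"({s})" for s in branches) + tail


def shorthand_to_smiles(structure):
    """Convert a chain like 'CH3-CH2-CH2Cl' to a SMILES string."""
    groups = structure.split("-")
    return "".join(
        group_to_smiles(group, index == len(groups) - 1)
        for index, group in enumerate(groups)
    )


for entry in ("CH3-CH2-CH2Cl", "CH2F2", "CH3-CF2-CF3"):
    print(entry, "->", shorthand_to_smiles(entry))
```

Entries that use anything beyond this pattern (rings, double bonds, branched carbons, other elements) raise a `ValueError` here and would need hand curation or a fuller parser.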

Andrew Dalke dalke@toots.nu
2025-04-25

Woo-hoo! My #ICCS2025 poster was accepted. I'm going to Noordwijkerhout in a few months.

Like the last couple of times, I'll be going there by train.

Anyone on the route (Trollhättan→Copenhagen→Hamburg→Noordwijkerhout and vice versa) want me to visit? I can talk about SMILES, #cheminformatics history, and fingerprint similarity for hours. :)

Or perhaps interested in licensing chemfp?

For that matter, I'm also available to modernize old in-house cheminformatics code.

2025-04-13

Identify PDB id associated with Uniprot id vortex script
macinchem.org/2025/04/13/unipr #cheminformatics

2025-04-08

I'm excited to present "Finding Tautomers" at the first North American #RDKit User Group Meeting in the #Boston #MA area on Friday April 11!

Reminder that I'm #OpenToWork so if you're in the area and hiring for #cheminformatics or #scientificSoftware development, let me know and we can meet to discuss your needs.

Finding Tautomers title slide

Egon Willigh☮gen 🟥 egonw
2025-04-06

it seems I just released my first PyPI package ever.

pyBacting 2.14 with Bacting 1.0.5 is now out: pypi.org/project/pybacting/0.2

This gives you access in Python to (some of) the functionality of the Chemistry Development Kit, OPSIN, ChemSpider, PubChem, InChI, Excel files, BridgeDb, and BioJava

2025-04-05

#openscience #cheminformatics dates back to the late nineties with the emerging collaborative development of JChemPaint, Jmol, and the Chemical Markup Language. Sketch of the history by Chris Steinbeck: "The evolution of open science in cheminformatics: a journey from closed systems to collaborative innovation" jcheminf.biomedcentral.com/art
