Write-up of #ISMBECCB2023 and our 11th birthday celebrations in Lyon.
Going Large (Language Models) at ISMB2023/@BOSC@genomic.social http://gigasciencejournal.com/blog/going-large-language-models-at-ismb2023/
We had an excellent #CompMS session at the #ISMBECCB2023 conference last week.
Many thanks to keynote speakers @lgoracci01@twitter.com, @RenardLab@twitter.com, and @tomas_pluskal@twitter.com; all selected speakers; and poster presenters for showcasing the latest computational advances in mass spectrometry, with applications across #proteomics, #metabolomics, #lipidomics, and more.
RT @: Keep calm, Pfam is still running! But now it's hosted on the InterPro website! At #ISMBECCB2023, we had the opportunity to learn more about @PfamDB and its integration with the @InterProDB website. We even won these really cool t-shirts. Thanks!
Day 5 recap from #ISMBECCB2023: text mining, ChatGPT, and 2024 locations. Papers for highlights are in the description.
https://youtube.com/shorts/SiqudkjnN1Q?feature=share
Mark Gerstein at #ISMBECCB2023: Deep learning is exciting, but let's not forget about the physical and biological models underlying the science we're interested in. Let's make biomedical data science more like weather forecasting.
Névéol: What can we do?
Understand the stakes better.
Use levers like data sharing, shared tasks, and policy.
Write more documentation, for protocols and beyond; elicit audits.
See Cohen-Boulakia et al. 2017, Future Gener Comput Syst.
Aurélie Névéol:
How can we make clinical NLP more reproducible? Can NLP also help with reproducibility? Even word or sentence tokenization can be inconsistent. Most NLP folks have, at least once, failed to repeat someone else's experiment, or even their own. Sometimes it's due to differences in preprocessing, software versions, training vs test splits, or other boring things. Availability issues, page limits, and the bias toward novelty don't help either.
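To make the tokenization point concrete, here's a minimal sketch (the clinical-style sentence is my own invention) of two perfectly reasonable word tokenizers disagreeing on the same text:

```python
# Two "reasonable" word tokenizers, two different token sequences.
import re

text = "Pt. denies seizure-like activity since 2021-03."

whitespace_tokens = text.split()                 # split on whitespace only
regex_tokens = re.findall(r"\w+|[^\w\s]", text)  # also split off punctuation

print(whitespace_tokens)
# ['Pt.', 'denies', 'seizure-like', 'activity', 'since', '2021-03.']
print(regex_tokens)
# ['Pt', '.', 'denies', 'seizure', '-', 'like', 'activity', 'since', '2021', '-', '03', '.']
```

Anything downstream that depends on token counts or offsets (NER spans, evaluation scripts) silently diverges between the two.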
One perk of attending #ISMBECCB2023 virtually: watching the recording of a keynote I missed instead of the talk I had planned to watch but turned out not to be interested in.
(I guess you could also plug in your headphones and do the same if you're there in person, but that's noticeably ruder.)
KB: cell type matching across species https://github.com/kbiharie/TACTiCS #ISMBECCB2023
Sylwia Szymanska: Word embeddings capture functions of low complexity regions: scientific literature analysis using a transformer-based language model
Low-complexity regions in proteins are biologically important. But there isn't a database or even a list of these relationships. So let's extract them with a language model.
#ismbeccb2023
#textmining
Brett Beaulieu-Jones: Can we use large language models on clinical notes to estimate the likelihood of seizure recurrence? Yes, and with good results, but the models are difficult to interpret. So can we build a model that includes the things we really care about, then add an instructable layer? Yes! Use note metadata as weak supervision -> instructions for the model. A tuned Flan-T5 model does really well.
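For a flavor of the instruction setup, a minimal sketch of prompting Flan-T5 on a note. The model name is real; the prompt wording and the note are my own illustration, not the speaker's pipeline (which fine-tunes the model and derives instructions from note metadata):

```python
# Sketch only: instruction-style prompting with Flan-T5.
# Assumes `pip install transformers torch`; prompt and note are invented.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

note = "Follow-up visit. No seizure activity reported since medication change."
prompt = (
    "Instruction: Based on this clinical note, answer yes or no: "
    f"is the patient likely to experience seizure recurrence?\nNote: {note}"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```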
Robert Leaman: Relation extraction needs BioNER across multiple entity types, but human-annotated NER data are scarce: under 0.01% of PubMed articles. You could use a pre-trained language model for multitask NER... but instead we can modify the data: include annotations for negative mentions by type, plus tokens for sentence start/end.
AIONER converts training data to this form and aggregates data sets. Moderate improvement on most BioNER types.
Repo here: https://github.com/ncbi/AIONER
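Roughly the flavor of that data transformation, as I understood it; the tag tokens and helper below are my own placeholders, not AIONER's actual format (see the repo for the real thing):

```python
# Placeholder sketch: wrap each sentence in start/end tokens, emit one
# labeled sequence per entity type, and mark types with no gold
# annotations as explicit all-"O" negatives so corpora annotated for
# different types can be pooled.

def to_all_in_one(tokens, gold_spans, entity_type, all_types):
    """tokens: word list; gold_spans: (start, end) spans for entity_type."""
    labels = ["O"] * len(tokens)
    for start, end in gold_spans:
        labels[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"
    examples = []
    for t in all_types:
        seq = labels if t == entity_type else ["O"] * len(tokens)
        examples.append((
            [f"<task:{t}>", "<s>"] + tokens + ["</s>"],
            ["O", "O"] + seq + ["O"],
        ))
    return examples

for toks, labs in to_all_in_one(
    ["BRCA1", "is", "mutated", "in", "breast", "cancer"],
    gold_spans=[(0, 1)], entity_type="Gene",
    all_types=["Gene", "Disease"],
):
    print(list(zip(toks, labs)))
```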
Katerina Nastou: Benchmarking species-name NER is something the S800 corpus was used for, but otherwise well-performing models were doing poorly on it. The problem? Annotation inconsistencies in S800. It's been manually revised using stricter rules, covering just species, strain, and genus names (each with its own tag). 200 more documents too, so now it's S1000.
How do NER models do on it? F1 is up, around 89 to 91.
Get corpus at https://zenodo.org/record/7064902
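For context, benchmarks like this are typically scored with strict entity-level F1, where a prediction counts only if span and type match gold exactly; a minimal sketch with invented spans:

```python
# Strict entity-level F1: a predicted (start, end, type) triple counts
# only on an exact match with gold. Spans below are invented.

def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "Species"), (10, 11, "Genus")}
pred = {(0, 2, "Species"), (10, 11, "Strain")}  # wrong type on the second
print(f"{entity_f1(gold, pred):.2f}")  # 0.50
```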
Esmaeil Nourani: Health involves lifestyle factors. Can we extract relations connecting those to disease?
Developing a draft lifestyle ontology. Started with 869 concepts across multiple branches. Needed to get synonyms, too - embeddings helped with that, and also allowed discovery of new candidate terms. Full draft is now 1652 concepts. Ready for NER and RE.
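A minimal sketch of the embedding trick for synonym and new-term discovery; the terms and stand-in random vectors are mine, not the actual model or ontology:

```python
# Rank candidate terms by cosine similarity to an ontology concept in
# embedding space. Random vectors stand in for a real text encoder.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["physical activity", "exercise", "sedentary lifestyle", "smoking"]
vectors = {term: rng.normal(size=128) for term in vocab}  # stand-in embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

concept = "physical activity"
ranked = sorted(
    (t for t in vocab if t != concept),
    key=lambda t: cosine(vectors[concept], vectors[t]),
    reverse=True,
)
print(ranked)  # top hits become synonym/expansion candidates for review
```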
Krallinger: Organizing shared tasks. Some processes can take years. Examples: CANTEMIST, CodiEsp, MESINESP, MEDDOCAN, MEDDOPROF, ClinSpEn, DisTEMIST. Most recently: MEDDOPLACE, PharmaCoNER.
#ismbeccb2023
#textmining
Krallinger: It's important to engage clinical experts from the beginning. That includes their considerations on the content sources.
Annotation guidelines are necessary. See the guides at http://zenodo.org/communities/medicalnlp
Translating these to languages beyond English helps the community.
Krallinger: Developing language models for clinical data in Spanish. Since clinical text varies so much in structure and content, you need a balance between general language and domain-specific optimization. Need some clear annotation guidelines too.
Really need a set of clear clinical use cases, too.
#ismbeccb2023 #textmining
Hi #ismbeccb2023.
I'm in Text Mining today.
Martin Krallinger: Unstructured text from clinical narratives is still underused. There are many other text sources too, like patient forums or drug leaflets, but clinical narratives are especially difficult. No out-of-the-box NLP solution works. We need data, infrastructure, and reproducible benchmarks.
Day 4 recap from #ISMBECCB2023: gene regulation, single-cell data, and visualization of spatial transcriptomics. Papers/preprints/links for highlights are in the description.
https://youtube.com/shorts/TkKDmY6lmZU?feature=share
Oh today I saw more alternative splicing goodies at #ismbeccb2023 #ismb2023