Daniel van Strien

📖🤗 Machine learning Librarian at Hugging Face

2025-01-03

Was 2024 the year of datasets? Is 2025 the year for community-built datasets?

It's exciting to see the progress of many languages in FineWeb-C:
- Total annotations submitted: 41,577
- Languages with annotations: 106
- Total contributors: 363

2024-10-29

Researchers: Want your ML datasets to have more impact? Share them on @huggingface Hub!

✨ Benefits:
• Visibility in the ML community
• Interactive data viewer
• Support for TB-scale datasets
• Integration with @DataPolars @pandas_dev @duckdb and more
huggingface.co/blog/researcher

2024-09-26

ColPali is revolutionizing multimodal retrieval. Can we make it even more effective with domain-specific fine-tuning?

Check out my latest blog post, where I create a dataset for fine-tuning a ColPali model for a new domain using an open Vision Language Model.

danielvanstrien.xyz/posts/post

Screenshot of the "I want to Believe" poster from X-Files with the text "in multimodal retrieval" inserted above.
Screenshot of a document alongside a JSON description of the document.

2024-09-12

Can we search for datasets on the @huggingface Hub based on their content?

> Some datasets lack good documentation 😢
> The dataset viewer preview offers a wealth of information

🤔 How about: query -> dataset based on structure and content?

Check out V1: huggingface.co/spaces/libraria

2024-09-10

You can help improve this project by rating synthetic user search queries for hub datasets. If you have a @huggingface login, you can start annotating in @argilla_io in < 5 seconds here: davanstrien-my-argilla.hf.spac

2024-09-10

I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!

2024-09-10

Almost ready: search for a @huggingface dataset on the Hub from information in the datasets viewer preview!

Soon, you can find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)

2024-09-09

The @huggingface Semantic Dataset Search is back in action! Find similar datasets by ID or do a semantic search of dataset cards.

Give it a try:
huggingface.co/spaces/libraria

2024-08-07

@arnicas Occasionally the books sound interesting, but often the blurbs are not very good. Think LLMs are still very lacking in this kind of task tbh.

2024-08-07

Is your summer reading list still empty? Curious if an LLM can generate a book blurb you'd enjoy and help build a KTO preference dataset at the same time?

A demo using @huggingface Spaces and @gradio to collect LLM output preferences: huggingface.co/spaces/davanstr

2024-07-15

SPIQA from @Google is a large-scale question-answering dataset centred on figures, tables, and text paragraphs from scientific research papers in various computer science domains.
huggingface.co/datasets/google

2024-06-17

HelpSteer2 from @nvidia is an open-source dataset to train top-performing reward models!
- 21,362 samples with annotated attributes
- Attributes: Helpfulness, Correctness, Coherence, Complexity, Verbosity
- Multi-turn prompts
- 88.8% on RewardBench
huggingface.co/datasets/nvidia

2024-05-14

Created an "Awesome Synthetic Datasets" list in my ongoing quest to learn more about building synthetic datasets using large language models. Currently includes important tools, datasets, and papers.

Check it out here: github.com/davanstrien/awesome

2024-04-30

Translations from 56 contributors, based on a dataset by 314 community members! These translations will facilitate the creation of evaluations, experimentation with SPIN, building DPO datasets, and more. Interested in contributing to datasets? github.com/huggingface/data-is

2024-04-10

As part of the Multilingual Prompt Evaluation Project (MPEP), we are now automatically exporting the @argilla_io datasets to the @huggingface Hub. We have more than 15 active community-led translation efforts collaborating to enhance datasets for various languages. ❤️

2024-04-10

Big thanks to @davidbstein1957 and @ignacio_at_nlp for working on this 🤗

2024-04-10

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs: huggingface.co/datasets/xcodem

2024-04-03

Doing some rare "front end" work 😬

2024-03-28

Experimenting with TL;DR summaries for @huggingface datasets using a Chrome plugin.
