#Dataset

2025-06-20

How we made the largest Russian-language dataset of LLM queries useful

Hi! My name is Roman Kutsev, and I am the founder of LLM Arena. Every day hundreds of people on our platform talk to language models, test them, compare them, and ask questions. At some point it became clear that these logs are more than user sessions: they are a live picture of how people actually use LLMs. That is how the idea was born: to assemble an open, structured dataset of prompts and give the AI community a tool for not just viewing but also exploring, filtering, and understanding the logic of users' queries to LLMs. While studying Arena Explorer from LMSYS, we first wanted to take their approach as a basis. But it quickly became clear that we could, and should, go further. So we built a system grounded in the Russian-language context, with a different level of transparency and attention to quality.

habr.com/ru/articles/920150/

#llm #llmarena #датасет #dataset #ai #ии #разметка_данных #валидация_данных

2025-06-20

16WW Energy Series Dataset - Data series of electricity and gas and other energy generation and use. #energy #microgen #dataset - earth.org.uk/energy-series-dat

2025-06-19

On 16WW Mains Inlet Water Temperature - Domestic mains water temperature data for 16WW on tap; seasonal min/max about 10C/20C in winter/summer. #dataset #water #temperature - earth.org.uk/note-on-data-for-

2025-06-15

Preprint: "Institutional #Books 1.0: A 242B Token #Dataset From Harvard Library's Collections, Refined for Accuracy and Usability" arxiv.org/abs/2506.08300 & "AI Chatbots Need More Books to Learn From. These #Libraries are Opening Their Stacks" infodocket.com/2025/06/12/ap-a #AI #LLMs #data

2025-06-14

@tioan@peering.social

But how do I unlock them automatically?
What do you mean by private copy?

Ah, I just found something:
https://schroederdennis.de/storage/proxmox-backup-server-mit-verschluesselten-zfs-dataset/

What works for #PBS should also work for #Proxmox itself.

And
https://blog.berrnd.de/proxmox-voll-verschluesselt-luks-lvm-dropbear-initramfs

Then Proxmox would be encrypted on the M.2 drives and the #Dataset on the HDDs as well, and I could decrypt Proxmox remotely and the dataset automatically.

The VMs themselves would then not need to be encrypted at all.

However, set up this way the M.2 drives will not run in a #ZFS RAID 1 🤔

PUPUWEB Blog @pupuweb
2025-06-14

Harvard launches Institutional Books 1.0, a public domain dataset for AI with 242B tokens from 394M scanned pages and 983K books in 254 languages, supporting open research and language diversity.

Broken Feather @monk3yspid3r
2025-06-12

I am thrilled to share our new global dataset on the drivers of forest loss at 1 km resolution, which has been 2+ years in the making! We developed the data using a customized ResNet model trained on a set of samples we collected through visual interpretation of very high-resolution satellite imagery. The model used satellite imagery (Landsat & Sentinel-2) and ancillary data to classify seven driver categories: permanent agriculture, hard commodities (e.g. mining and energy infrastructure), shifting cultivation, logging, wildfires, settlements and infrastructure, and other natural disturbances.

2025-06-09

Bats Around our Home aka 16WW - Our house in a bat hotspot - pipistrelles rather than vampires! #bat #Chiroptera #dataset - earth.org.uk/bats-at-16WW.html

Chema Alonso @chemaalonso@ioc.exchange
2025-06-08

El lado del mal - ReplayDF: Replay Attacks against Audio DeepFake detection models elladodelmal.com/2025/06/repla #DeepFake #Audio #hacking #Dataset #IA #AI #GenAI

2025-06-07

Benchmark, the destroyer of LLMs, or how we built our own multilingual SWE-Bench

The article presents a multilingual extension of SWE-Bench from the team

habr.com/ru/companies/doubleta

#AI #ML #DS #SWE #bench #ML4se #Dataset #Датасет #Разметка_данных #benchmark

devSJR @devSJR@fosstodon.org
2025-06-06

Researchers have created a dataset for training and evaluating language models without intellectual property infringement!

The Common Pile includes diverse text sources like web content, books, research papers (from PubMed & ArXiv), and online discussions. ArXiv is not peer-reviewed. 🤔 It's designed to be high-quality and reproducible for NLP & LLM research.

Resource:
github.com/r-three/common-pile

#LLM #LargeLanguageModels #Dataset #AIresearch #CommonPile

Bar chart displaying the sizes of various datasets that comprise the Common Pile, an 8 TB collection of openly licensed text. The sources are categorized by textual domains such as code, government and legal documents, wikis, web content, academic papers, online forums, public domain texts, and educational resources. Each bar represents a different source, with the dataset size indicated on a logarithmic scale ranging from 1 MB to 1 TB. Notable sources include Stack Exchange at approximately 275 GB, PubMed Central at around 306 GB, Wikipedia at about 89 GB, ArXiv at roughly 400 GB, and various other domains contributing smaller proportions to the total size. The chart visually emphasizes the diversity and scale of the data sources integrated into the Common Pile.
Original study: https://github.com/r-three/common-pile/blob/main/paper.pdf, Figure 1.

2025-06-01

In this #newsletter we talk about:
🟢 #ChatGPT and the other #LLM are eating up the #internet
🔴 #WiFi on trains in Italy (#Italia): here is how the company that holds #Starlink works
🟠 Italian schools (#Scuole): we are falling short! A sweeping reform on digital technologies (#tecnologie) is needed.
🔵 When #AI reflects bias: gender inequalities in #dataset contents

@informatica

bit.ly/43GzT79

2025-05-26

This is also something I am proud of, not only because the effort finally paid off, but also because it is my first article published in #RSC (something I had wanted for a long time), the one where I convinced my coauthors to publish our data (link to figshare inside!), and the one where another coauthor proposed that we check the box for open review, so you can also see the reviewers' comments.
Beyond the formal acknowledgments, huge thanks to our editor (unsure if naming them is appropriate, since editors aren't usually credited publicly), who patiently extended deadlines until we finished the required experiments. It is also common to be half grateful to and half annoyed by reviewers, but this time the gratitude is stronger because one reviewer's push made me find conditions for producing larger quantities of material. Not enough for solid-state NMR, sadly (we are working on scaling it a bit further), but enough for KBr FTIR, which is not always possible for conducting polymers.
And the figshare link for the #DataSet: doi.org/10.6084/m9.figshare.28
No OA, sorry, but I will be happy to send you a copy if you reach out.

2025-05-25

Grid-Tied Rooftop Solar PV Generation Stats via Sunny Beam - High-frequency (per-minute) grid-tied generation stats for 16WW roof-mounted 5.16kWp solar PV. #PV #dataset - earth.org.uk/grid-tie-generati
