#DATAset

PUPUWEB Blog @pupuweb
2025-06-14

Harvard launches Institutional Books 1.0, a public domain dataset for AI with 242B tokens from 394M scanned pages and 983K books in 254 languages, supporting open research and language diversity.

Broken Feather @monk3yspid3r
2025-06-12
I am thrilled to share our new global dataset on the drivers of forest loss at 1 km resolution, which has been 2+ years in the making! We developed the data using a customized ResNet model trained on a set of samples we collected through visual interpretation of very high-resolution satellite imagery. The model used satellite imagery (Landsat & Sentinel-2) and ancillary data to classify seven driver categories: permanent agriculture, hard commodities (e.g. mining and energy infrastructure), shifting cultivation, logging, wildfires, settlements and infrastructure, and other natural disturbances.
2025-06-09

Bats Around our Home aka 16WW - Our house in a bat hotspot - pipistrelles rather than vampires! #bat #Chiroptera #dataset - earth.org.uk/bats-at-16WW.html

2025-06-09

On 16WW Mains Inlet Water Temperature - Domestic mains water temperature data for 16WW on tap; seasonal min/max about 10C/20C in winter/summer. #dataset #water #temperature - earth.org.uk/note-on-data-for-

Chema Alonso :verified: @chemaalonso@ioc.exchange
2025-06-08

El lado del mal - ReplayDF: Replay attacks against audio deepfake detection models elladodelmal.com/2025/06/repla #DeepFake #Audio #hacking #Dataset #IA #AI #GenAI #hacking

2025-06-07

Benchmark, the destroyer of LLMs, or how we built our own multilingual SWE-Bench

The article presents a multilingual extension of SWE-Bench from the team

habr.com/ru/companies/doubleta

#AI #ML #DS #SWE #bench #ML4se #Dataset #Датасет #Разметка_данных #benchmark

devSJR :python: :rstats: @devSJR@fosstodon.org
2025-06-06

Researchers have created a dataset for training and evaluating language models without intellectual property infringement!

The Common Pile includes diverse text sources like web content, books, research papers (from PubMed & ArXiv), and online discussions. ArXiv is not peer-reviewed. 🤔 It's designed to be high-quality and reproducible for NLP & LLM research.

Resource:
github.com/r-three/common-pile

#LLM #LargeLanguageModels #Dataset #AIresearch #CommonPile

Bar chart displaying the sizes of various datasets that comprise the Common Pile, an 8 TB collection of openly licensed text. The sources are categorized by textual domains such as code, government and legal documents, wikis, web content, academic papers, online forums, public domain texts, and educational resources. Each bar represents a different source, with the dataset size indicated on a logarithmic scale ranging from 1 MB to 1 TB. Notable sources include Stack Exchange at approximately 275 GB, PubMed Central at around 306 GB, Wikipedia at about 89 GB, ArXiv at roughly 400 GB, and various other domains contributing smaller proportions to the total size. The chart visually emphasizes the diversity and scale of the data sources integrated into the Common Pile.
Original study: https://github.com/r-three/common-pile/blob/main/paper.pdf, Figure 1.
2025-06-01

In this #newsletter we talk about:
🟢 #ChatGPT and the other #LLM are eating up the #internet
🔴 #WiFi on trains in #Italia: how the company that holds #Starlink works
🟠 Italian schools (#Scuole): this won't do! A sweeping reform of digital technologies (#tecnologie) is needed.
🔵 When #AI reflects bias: gender inequalities in datasets (#dataset)

@informatica

bit.ly/43GzT79

2025-05-26

I am also proud of this one, not only because the effort finally paid off but also because it is my first article published in #RSC (something I had wished for for a long time). It is also the one where I convinced my coauthors to publish our data (link to figshare inside!), and the one where another coauthor proposed that we check the box for open review, so you can also see the reviewers' comments.
Beyond the formal acknowledgments, huge thanks to our editor (unsure if naming them is appropriate since editors aren't usually credited publicly), who patiently extended deadlines until we finished the required experiments. It is also common to be half-grateful, half-annoyed with reviewers, but this time the gratitude wins out, because one reviewer's push made me find conditions for producing larger quantities of material. Not enough for solid-state NMR, sadly (we are working on scaling it up a bit further), but enough for KBr FTIR, which is not always possible for conducting polymers.
And the figshare link for the #DataSet: doi.org/10.6084/m9.figshare.28
No OA, sorry, but I will be happy to send you a copy if you reach out.

2025-05-25

Grid-Tied Rooftop Solar PV Generation Stats via Sunny Beam - High-frequency (per-minute) grid-tied generation stats for 16WW roof-mounted 5.16kWp solar PV. #PV #dataset - earth.org.uk/grid-tie-generati

George Macgregor @g3om4c@code4lib.social
2025-05-22

"neither Mendeley nor EndNote’s App functions recognized dataset DOIs" ...which is rather astonishing for 'in-app' #DOI look-ups. #Zotero, #Paperpile & #Sciwheel are at least at the forefront, though there is more to be done.

I would comment that analysis of 'plugin' import of #dataset metadata into ref managers is more complex to understand, as it is highly dependent on the structured data exposed by the host #repository.

Obstacles to Dataset Citation Using #Bibliographic Management Software doi.org/10.5334/dsj-2025-017

Hacker News @h4ckernews
2025-05-21

Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)

arxiv.org/abs/2502.00627

OpenAIRE @OpenAIRE
2025-05-20

We’re going live soon!
Join the Community Call and explore:
- tracking
- Linking to research outputs
- New indicators
Shape the future of dataset visibility and responsible reuse!
Join us: shorturl.at/kbgYM

Jesus Castagnetto 🇵🇪 @jmcastagnetto
2025-05-19

From "Introducing HealthBench"
openai.com/index/healthbench/

A to models. Seems to have a wide range of topics (from to global health) and languages (from to ).

Zeitgeisty Aphorisms @zeitgeisty
2025-05-18

There is not less of life in human than of in .

💧🌏 Greg Cocks @GregCocks@techhub.social
2025-05-15

US National Inventory Of Dams [NID] [open spatial data]
--
H/T Data Is Plural data-is-plural.com/
--
“The National Inventory of Dams [nid.sec.usace.army.mil/] [is a congressionally authorized] ‘document[ing] all known dams in the U.S. and its territories that meet certain criteria’ related [nid.sec.usace.army.mil/#/about] to the dam’s height, reservoir size, and likely impacts of its ‘failure or mis-operation.’..."
#GIS #spatial #mapping #fedscience #NationalInventoryofDams #NLDI #US #USA #dams #criteria #inventory #opendata #downloadable #risk #hazard #infrastructure #energy #HEP #waterresources #hydroelectric #webmap #USACE #operations #dataset #safety #inspection #uses
@USACE @FEMA @EPA
