#corpora

Christof Schöch christof@fedihum.org
2024-12-05

Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each other's languages while keeping #corpus composition stable.

Interactive showcase: showcases.clsinfra.io/stylomet

Full paper: ceur-ws.org/Vol-3834/paper9.pd

This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.

Two colorful heatmaps, in hues of blue, yellow and red; one above the other; with various dropdown lists on the left to vary parameters.

Tristan Miller Logological
2024-11-06

My lab, Computational Linguistics at Manitoba, is seeking motivated PhD students for and research in computational humour, historical born-digital #corpora, and language technology: clam.cs.umanitoba.ca/open-posi

Jan :rust: :ferris: janriemer@floss.social
2024-10-09

So you wanna parse/manipulate some #PDF's, huh!?

Well, you better #test your #software thoroughly or bad things will happen!🧪

So how about "this corpus [which] contains nearly 8 million PDFs gathered from across the web in July/August of 2021":
digitalcorpora.org/corpora/fil

The entire corpus when uncompressed takes up nearly 8 TB!

You can find some more links to different #corpora (even to ones deemed #unsafe!😬) at pdf-association's Github:

github.com/pdf-association/pdf

#Parsing #Testing

Fakespeak Fakespeak
2024-09-02

👋 Greetings! 👋

We wanted to remind all that the project is still alive and kicking – especially after a long and filled summer vacation.

We have some great events and research output coming out in the next few months, including a conference, fake news , publications bringing together advanced linguistic features and , and a special issue in Linguistics Vanguard on the language of fake news.

Follow along!

2024-06-24

Interestingly, very few psychologists are aware of #linguistic #corpora 📊 and their immense research potential. Platforms like CLARIN-PL offer invaluable data that can significantly enhance our understanding of human behaviour and social interactions. 🤝🗣️ It's time more of us psych folk tapped into these resources to advance our field! 🌟🔍

2024-04-25

And another one for fellow linguists interested in compiling #corpora of digital discourse: MastoScraper takes advantage of the Mastodon API to collect toots based on a keyword search.
Here goes, feedback welcome!
#linguistics @linguistics
fmoncomble.github.io/mastoscra

The Enemy is Within Deixis9
2024-04-20

Finally a corpus containing foul language.

Lexical tutor concordance now has a corpus of movie language, COCA Movies 1.6m, so we can see how language is actually used therein.

A potential game changer for corpus linguistics, considering the vast number of humans who only use dictionaries to look up swear words?

lextutor.ca/cgi-bin/conc/wwwas

Tatjana Scheffler tschfflr@fediscience.org
2024-03-15

Next week, we'll be discussing how to archive and research social media data on a large scale "After Twitter". Very excited to see what comes out of this conference, and also the following data sprint delving into huge German Twitter corpora.
dnb.de/twittertagung
#AfterTwitter #corpora #research

Digital Neo-Latin Studies digneolatin@hcommons.social
2024-02-16

Interesting publication on medieval Latin text corpora by @TimGeelhaar: 🔖 Geelhaar, Tim. „Hamsterrad oder Himmelsleiter? Oder warum die Digitalisierung so endlos scheint“. 2024. doi.org/10.15499/KDS-005-016.

#Latin #Neolatin #Corpora #OpenAccess

2024-02-07

#Eduhub days 2024 at #ZHAW and I cannot be there 😢 🩼

If you go, stop by the marketplace: in the afternoon, my colleague Maren Runte will show our work on creating a learning space for working with linguistic #corpora (to be released later this year).

#EduhubDays24 #DigitalLinguistics

eduhubdays2024.events.switch.c

Christof Schöch christof@fedihum.org
2024-01-22

CLS INFRA Training School Vienna 2024 June 10th–12th, 2024:

ExploreCor: "Using Programmable #Corpora in #Computational #Literary #Studies"

This intensive program covers some of the most important steps in the research cycle of CLS, focusing on “Programmable Corpora” – dynamic collections of literary texts manipulated programmatically.

Apply now! pretix.eu/CLSINFRA-trainingsch

Colocated with #CCLS2024, June 13-14, 2024: jcls.io/site/conference/

@CLSinfra #CLSINFRA

CLS INFRA Logo – Computational Literary Studies Infrastructure

Tatjana Scheffler tschfflr@fediscience.org
2023-11-30

@ZBW_MediaTalk The conference programme is set (and will be published soon); applications for the subsequent data sprint are still open! #datasprint #twitter #dataScience #corpora
dnb.de/twitterdatasprint

Michael Piotrowski mxp@mastodon.acm.org
2023-10-12

Today in History and Theory of #DigitalHumanities, @fabianmoss is talking about #corpora and #models!

Fabian Moss standing next to a blackboard, giving a talk.

2023-10-01

Our paper on lexical innovation in contemporary Italian has been accepted for the 9th Italian Conference on Computational Linguistics (Venice, 30 November-02 December 2023).

#NLProc #languagechange #neologism #corpora

@GretaFranzini

Tatjana Scheffler tschfflr@fediscience.org
2023-09-14

Oliver Watteler and Ulrike Schneider are talking about "Can I publish my social media corpus" @ #cmc2023 #corpora #socialMedia #linguistics #gdpr

Tatjana Scheffler tschfflr@fediscience.org
2023-09-14

One of the prettiest (if not very practical) university locations in Germany. 👋 from #cmc2023 at the University of Mannheim! #corpora #linguistics #socialMedia

Castle in front of blue sky

2023-09-14

Looking forward to presenting at #cmc2023 in #mannheim on publishing social media data for secondary use, and to interesting talks.
#corpora #gdpr #dsgvo #Dataservices #socialmedia

uni-mannheim.de/cmc-corpora202

Opening ceremony of the linguistics conference in the Great Hall of Mannheim Castle; view of the stage

Quinn Dombrowski quinnanya@mstdn.social
2023-09-12

#corpora I bet we all have some anxiety about them. How do you choose what texts to look at? How do you know when you have enough? The #DataSittersClub is back, asking those questions and more to corpus linguist Shelley Staples, while exploring pizza and the Newbery Award for youth literature. #DigitalHumanities datasittersclub.github.io/site

Cover of DSC 19: Shelley and the Bad Corpus, with Claudia in the hospital with a broken leg.

Victoria Stuart 🇨🇦 🏳️‍⚧️ persagen
2023-09-03

CommunityFish: A Poisson-based Document Scaling With Hierarchical Clustering
arxiv.org/abs/2308.14873

* document scaling is a key component in text-as-data applications for social scientists
* a major field of interest for political researchers
* uncovers differences between speakers or parties with the help of different probabilistic / non-probabilistic approaches

The Enemy is Within Deixis9
2023-08-13

@clarkesworld
Biological data-scraping bots, i.e. corpus linguists, have been getting away with this for years.
