#Textprocessing

2025-05-29

How to print multiline file content starting from start_pattern and ending at end_pattern? #commandline #textprocessing

askubuntu.com/q/1549597/612
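
A minimal Python sketch of one way to approach the question, assuming both patterns are regular expressions matched line by line and that the matching lines themselves should be included. The name `lines_between` and the sample markers are illustrative and not taken from the linked question.

```python
import re
from typing import Iterable, Iterator


def lines_between(lines: Iterable[str], start_pattern: str, end_pattern: str) -> Iterator[str]:
    """Yield each block of lines from a start match through the next end match."""
    start = re.compile(start_pattern)
    end = re.compile(end_pattern)
    inside = False
    for line in lines:
        if not inside and start.search(line):
            inside = True  # a block opens on this line
        if inside:
            yield line
            if end.search(line):
                inside = False  # the block closes after the end line is emitted


sample = ["header", "BEGIN", "alpha", "beta", "END", "footer"]
for line in lines_between(sample, r"^BEGIN$", r"^END$"):
    print(line)
```

For a real file, an open file object can be passed instead of the list; lines then keep their trailing newline, so `print(line, end="")` avoids blank lines. In this sketch a line that matches both patterns opens and closes a block by itself, and a block without an end match runs to the end of the input.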

Admin:Docs itdocs
2025-05-16

From displaying file contents to precise searching and comparing: these Linux commands are indispensable for day-to-day administration and for the LPIC-1 exam.

itdocs.wiki/lpic-1-serie/lpic-

2025-05-09

Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

Slide from Information Service Engineering 2025, Lecture 02, Natural Language Processing 01, A Brief History of NLP, NLP timeline. The timeline is located in the middle of the slide from top to bottom. The pointer on the timeline indicates the 1990s. On the left, the formula for the conditional probability of a word following a given series of words is shown. Below, an AI-generated portrait of William Shakespeare is displayed with 4 speech bubbles, representing artificially generated text based on 1-grams, 2-grams, 3-grams and 4-grams. The 4-gram text example looks a lot like original Shakespeare text. On the right side the following text is displayed:
N-grams for statistical language modeling were introduced and popularised by Frederick Jelinek and Stanley F. Chen from IBM Thomas J. Watson Research Center, who developed efficient algorithms and techniques for estimating n-gram probabilities from large text corpora for speech recognition and machine translation.

Bibliographical reference:
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA.
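
The conditional probability of a word given the preceding words, as mentioned in the slide description, can be illustrated with a small maximum-likelihood bigram model. This is only a sketch under simple assumptions: whitespace tokenization, no smoothing, and a toy sentence in place of a large corpus. The function name is illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram_model(tokens: List[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(word | previous word) as count(previous, word) / count(previous)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that have a successor can act as a context
    context_counts = Counter(tokens[:-1])
    model: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (previous, word), count in pair_counts.items():
        model[previous][word] = count / context_counts[previous]
    return model


tokens = "the cat sat on the mat".split()
model = train_bigram_model(tokens)
print(model["the"])  # "the" is followed once by "cat" and once by "mat"
```

Higher-order n-grams follow the same pattern with a longer context tuple in place of the single previous word.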
Stephan durchaus
2025-04-19

You can pass variables to an awk script with the option -v. This is useful, for example, when you want to include the file name in the output:

```
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;
```

Even though seemingly awkward at first glance, awk is definitely one of the most versatile and useful command-line tools.
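
For comparison, a rough Python counterpart of the one-liner above. It assumes plain comma-separated files without quoted fields, the same simplification that `awk -F,` makes, and UTF-8 encoded input. The function name is hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def second_column_with_filename(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (file name, second field) for every line of every *.csv file under root."""
    for path in sorted(Path(root).rglob("*")):
        # Mirrors find's -type f and the case-insensitive -iname '*.csv'
        if path.is_file() and path.name.lower().endswith(".csv"):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    fields = line.rstrip("\n").split(",")
                    # Like awk's $2, a missing second field becomes an empty string
                    second = fields[1] if len(fields) > 1 else ""
                    yield str(path), second
```

Looping over `second_column_with_filename(".")` and printing each pair gives output comparable to the awk command, with the file name in front of each value.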

N-gated Hacker News ngate
2025-04-14

🚀 Behold the epic tale of Janet's , where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
bakpakin.com/writing/how-janet

Holle Meding hmeding
2025-03-06

🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows

At #dhd2025, Nina Rastinger explores how well handles abbreviations & NER:

✅ NER works well, even with small, low-cost models
❌ Abbreviations are tricky—costs & resource demands skyrocket
🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
Balancing accuracy & efficiency in text processing remains a challenge! ⚖️

Nina Rastinger at Panel More than Chatbots: Multimodal Large Language Models in Humanities Workflows #dhd2025
2025-02-11

Master regular expressions for efficient substring extraction in Python! Learn greedy vs. non-greedy matching & capturing groups for precise results.
tech-champion.com/data-science
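
A short sketch of the three ideas named in the post, using Python's standard `re` module. The sample string and patterns are made up for illustration and are not taken from the linked article.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the last ">"
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first ">" that lets the pattern succeed
lazy = re.findall(r"<.+?>", text)

# Capturing groups: pull out the tag name and its content; \1 refers back to group 1
pairs = re.findall(r"<(\w+)>(.*?)</\1>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(pairs)   # [('b', 'bold'), ('i', 'italic')]
```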

Steven Sanderson spsanderson@rstats.me
2025-01-24

🐧 Struggling with text formatting in Linux? You're not alone! My new article breaks down the essentials for beginners.

Read more at spsanderson.com/steveondata/po and let’s discuss!

#Programming #linux #tech #printf #nl #fold #groff #TextProcessing #Formatting #Blog #CLI #CommandLineTips

**ALT Text:** A terminal window displaying a command line input that formats and prints system information, including hostname, memory, and disk usage. The command uses `printf` and `awk` to neatly align the output, showing the hostname as "server," memory as "31G," and disk usage as "37%." The background is a soft gradient, enhancing the focus on the terminal text.
Pragmatic Bookshelf 📚 pragprog@techhub.social
2025-01-23

New at PragProg

Staffan Nöteberg helps you really understand how the regex machinery works under the hood. Learn advanced concepts like reluctant quantifiers, lookbehind, and nondeterministic finite automata to write efficient and elegant regexes with ease.

In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of regular expressions.

pragprog.com/titles/d-snrem

@staffannoteberg

#regularexpressions #patternmatching #regex #regexp #textprocessing

Steven Sanderson spsanderson@rstats.me
2025-01-17

💡 Struggling with text processing in Linux?

My latest blog post at spsanderson.com/steveondata/po breaks down common challenges and offers practical solutions! Learn how to tackle duplicates and sort data with ease. 📊🔍

#Text #Blog #Technology #Textprocessing #Programming #Linux #Help

Let me know what you think!

**ALT Text:** A terminal window displaying a Linux command-line session. The user creates a file named `fruits.txt` containing the words "apple," "banana," and "cherry" using the `echo` command. The `sort` command is used to sort the file, and `uniq` removes duplicates, resulting in a sorted list of unique words: "apple," "banana," and "cherry." Finally, the `cut -c 1-3 fruits.txt` command extracts the first three characters of each line, outputting "app," "ban," "app," and "che."
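
The commands in the screenshot description can be mirrored in Python roughly as follows. The file contents are an assumption inferred from the `cut` output in the description (with "apple" appearing twice); the article's actual file is not shown here.

```python
# Assumed contents of fruits.txt, inferred from the cut output in the description
lines = ["apple", "banana", "apple", "cherry"]

# sort fruits.txt | uniq: sort the lines, then drop the duplicates
unique_sorted = sorted(set(lines))

# cut -c 1-3 fruits.txt: the first three characters of every line, in file order
first_three = [line[:3] for line in lines]

print(unique_sorted)  # ['apple', 'banana', 'cherry']
print(first_three)    # ['app', 'ban', 'app', 'che']
```

Note that `sorted` orders by code point, while `sort` follows the current locale, so results can differ for mixed-case or non-ASCII input.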
Veronica Olsen 🏳️‍🌈🇳🇴🌻 veronica@mastodon.online
2024-11-12

Back when I first wrote text processing code in the 90s on my Amiga 1200, I always used the ¤ symbol as a placeholder character for splitting and replacing to exclude things I wanted skipped without affecting character count. It was available on the Norwegian keyboard, and practically never used in text.

Recently I discovered that Unicode has two "Not a character" symbols perfect for the same usage: \uFFFE and \uFFFF.
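
A small sketch of the placeholder idea, assuming the goal is to hide quoted passages from a search while keeping every character offset unchanged. The function name and the choice of double quotes as the thing to skip are illustrative, not taken from the post.

```python
import re

PLACEHOLDER = "\uffff"  # Unicode noncharacter, assumed never to occur in real input


def mask_quoted(text: str) -> str:
    """Replace each double-quoted span with placeholders of the same length."""
    return re.sub(r'"[^"]*"', lambda match: PLACEHOLDER * len(match.group()), text)


text = 'say "cat" to the cat'
masked = mask_quoted(text)

# The length is unchanged, so offsets found in the masked text are valid in the original
print(len(masked) == len(text))
print(masked.index("cat"))
```

Because noncharacters are intended for internal use, the placeholders should be swapped back or dropped before the text is written out or exchanged with other programs.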

They can be really useful!

#Code #Python #TextProcessing #Unicode

Veronica Olsen 🏳️‍🌈🇳🇴🌻 veronica@mastodon.online
2024-10-23

2. Immediately after the split, replace U+FFFF with newline, but keep both versions of the line, and pass the one with the U+FFFF to the text paragraph parser. Everything else (like headings) gets the cleaned one.

3. After paragraph lines with a single break between them (belonging to the same paragraph) have been processed, THEN I replace the U+FFFF characters there.

It seems to work, but it took me like 3-4 hours to crack. 😅

4/4

#Python #TextProcessing #Unicode

Veronica Olsen 🏳️‍🌈🇳🇴🌻 veronica@mastodon.online
2024-10-23

I tried using the alternative line and paragraph separators from Unicode, but splitlines accepts them too. Then I discovered these Unicode characters:

U+FFFE <noncharacter-FFFE> not a character.
U+FFFF <noncharacter-FFFF> not a character.

The solution, then, was:

1. Replace all occurrences of [br] with or without a trailing newline, using regex pattern "(?i)(?<!\\)(\[br\]\n?)", with a U+FFFF character.
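
A minimal sketch of this step, using the pattern quoted above as a raw string. It also shows why the Unicode line and paragraph separators did not help while U+FFFF does. The helper name and sample text are illustrative; the editor's real code is not shown in the thread.

```python
import re

HARD_BREAK = "\uffff"  # noncharacter used as an internal marker
# [br] in any letter case, optionally followed by a newline, unless escaped with a backslash
BR_PATTERN = re.compile(r"(?i)(?<!\\)(\[br\]\n?)")


def mark_hard_breaks(text: str) -> str:
    """Swap every unescaped [br] shortcode (plus trailing newline) for the marker."""
    return BR_PATTERN.sub(HARD_BREAK, text)


# splitlines() treats U+2028 and U+2029 as line boundaries ...
print("one\u2028two\u2029three".splitlines())  # ['one', 'two', 'three']
# ... but leaves U+FFFF alone
print(f"one{HARD_BREAK}two".splitlines())

marked = mark_hard_breaks("first[br]\nsecond [BR]third \\[br] literal")
print(len(marked.splitlines()))  # 1: the marked text still counts as a single line
```
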

3/4

#Python #TextProcessing #Unicode

Veronica Olsen 🏳️‍🌈🇳🇴🌻 veronica@mastodon.online
2024-10-23

This works fine in principle, but it is incredibly hard to figure out exactly when to make the replacement.

For instance, if I do it too early, the parser will split on the breaks, as I use splitlines() early on. If I do it too late, I get double line breaks in some places.

2/4

#Python #TextProcessing #Unicode

Veronica Olsen 🏳️‍🌈🇳🇴🌻 veronica@mastodon.online
2024-10-23

I've been struggling with solving an issue with my text editor project. The editor is plain text and uses a blank line to separate paragraphs.

The editor has an option to preserve or not preserve single line breaks inside paragraphs when generating the output.

However, some users want to not preserve them, but still want to be able to add hard breaks sometimes. So I've been trying out using [br] as a hard break shortcode.

1/4

#Python #TextProcessing #Unicode
