#TextData

2025-05-28

I needed an excuse to make a word cloud shaped like a brain and I finally found one. Who said dreams don’t come true?

But in all seriousness, I learned a couple of things about working with text data, especially short text across multiple languages, and I’ve shared them here:

neurofrontiers.blog/how-people

#WordCloud #TextData #DataAnalysis #DataIsBeautiful #multilingual

iCode2Ifeanyi5
2025-03-31

Do your eyes still get stung by profanity in text data?

Well, there is an R script that enables you to handle it. Really useful for working with social media data.

Check out the script on GitHub:

github.com/Ifeanyi55/noProfani

CoListy colisty
2025-01-16

Steven Sanderson spsanderson@rstats.me
2024-09-19

A script I just wrote to process text files for a reconciliation for a vendor.

#R #RStats #RProgramming #Programming #Coding #TextData

#purrr #map #dplyr #readLines

wdir <- "W:/path/to/files"

fl <- list.files(wdir, pattern = "\\.txt$", full.names = TRUE)

fl <- purrr::discard(fl, stringr::str_detect(fl, "updated"))
# Build the excluded path from wdir so it matches what list.files() returns.
fl <- fl[!fl %in% file.path(wdir, "bad_file.txt")]

fl <- setNames(fl, basename(fl))

# Read one pipe-delimited text file into a tibble with every column as character.
txt_to_df <- function(txt_file) {
  txt_lines <- readLines(txt_file)
  # Parse the lines in memory instead of round-tripping through a shared "test.csv".
  data <- read.csv(text = txt_lines, header = TRUE, sep = "|")
  data <- dplyr::as_tibble(data) |>
    dplyr::mutate(dplyr::across(.cols = dplyr::everything(), .fns = as.character))
  return(data)
}

add_missing_cols <- function(df) {
  if (ncol(df) != 102) {
    missing_cols <- c("MissingColumns1","MissingColumn2","MissingColumn3")
    df[missing_cols] <- NA
  }
  return(df)
}

ret <- purrr::map(fl, txt_to_df)

ret <- purrr::map(ret, add_missing_cols)

ret <- purrr::map(ret, janitor::clean_names)

bad_cols_txt <- purrr::keep(ret, \(x) ncol(x) != 102)

good_files_txt <- purrr::keep(ret, \(x) ncol(x) == 102)

good_files_tbl <- good_files_txt |>
  purrr::map(\(x) x |>
    dplyr::mutate(dplyr::across(.cols = dplyr::everything(), .fns = as.character))) |>
    purrr::list_rbind(names_to = "id")

if (length(bad_cols_txt) > 0) {
  bad_files_tbl <- bad_cols_txt |>
    purrr::map(\(x) x |>
      dplyr::mutate(dplyr::across(.cols = dplyr::everything(), .fns = as.character))) |>
      purrr::list_rbind(names_to = "id")
}
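
The same read, check, and stack pattern can be sketched in base R alone. This is a minimal illustration, not the author's script: the three sample files, the column names (`key`, `name`, `amount`), the `source_file` column, and the expected column count of 3 are all made up for the example.

```r
# Create a throwaway folder with three hypothetical pipe-delimited files.
tmp_dir <- tempfile("recon_demo_")
dir.create(tmp_dir)
writeLines(c("key|name|amount", "1|widget|9.50", "2|gadget|12.00"),
           file.path(tmp_dir, "a.txt"))
writeLines(c("key|name|amount", "3|gizmo|4.25"),
           file.path(tmp_dir, "b.txt"))
writeLines(c("key|name", "4|doohickey"),
           file.path(tmp_dir, "short.txt"))

# Name each path by its file name so the source survives the stacking step.
files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
files <- setNames(files, basename(files))

# Read every column as character so files with mixed types stack cleanly.
read_pipe <- function(path) {
  read.delim(path, sep = "|", colClasses = "character", stringsAsFactors = FALSE)
}
tables <- lapply(files, read_pipe)

# Split the files by whether they have the expected number of columns.
expected_cols <- 3  # illustrative; the script above checks for 102
is_good <- vapply(tables, ncol, integer(1)) == expected_cols
good <- tables[is_good]
bad <- tables[!is_good]

# Stack the good files and record which file each row came from.
add_source <- function(df, id) {
  data.frame(source_file = id, df, stringsAsFactors = FALSE)
}
good_tbl <- do.call(rbind, Map(add_source, good, names(good)))
rownames(good_tbl) <- NULL
```

Files that fail the column check stay in `bad`, keyed by file name, so they can be inspected separately rather than silently dropped.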

Steven Sanderson spsanderson@rstats.me
2024-09-09

In today's post, I discuss using `grep()` in R for extracting substrings from text data.

While `grep()` finds which elements match a pattern, it doesn’t return the matched substrings themselves. I explain combining it with `regexpr()` and `substr()`, or `gregexpr()` and `regmatches()` to achieve this.

Practical examples include filtering email addresses and data frames.
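
A minimal sketch of both combinations, not taken from the linked post: the sample strings and the deliberately simple email pattern are assumptions chosen for illustration, and real addresses can be more varied than this pattern allows.

```r
# Hypothetical sample data; the addresses are made up.
text_vector <- c(
  "Contact: alice@example.com",
  "No address on this line",
  "bob@example.org and carol@example.net"
)

# A simplified email pattern (an assumption, not a full validator).
email_pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# grep() only reports which elements contain a match.
hits <- grep(email_pattern, text_vector)

# regexpr() + substr(): pull the first match out of each matching element.
m <- regexpr(email_pattern, text_vector[hits])
first_match <- substr(text_vector[hits], m, m + attr(m, "match.length") - 1)

# gregexpr() + regmatches(): pull every match out of every element.
all_matches <- regmatches(text_vector, gregexpr(email_pattern, text_vector))
emails <- unlist(all_matches)
```

Here `first_match` holds one address per matching element, while `emails` also picks up the second address in the last string.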

Post: spsanderson.com/steveondata/po

#R #RStats #Programming #Coding #textdata

Steven Sanderson spsanderson@rstats.me
2024-09-04

In today's blog post, I introduce the `grep()` function in R, a key tool for searching patterns in text data.

Searches are case-sensitive by default; set the `ignore.case` argument to `TRUE` for case-insensitive matching.

This flexibility is essential for text mining, data cleaning, and analysis. I outline the basic syntax, usage examples, and common mistakes.
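
A short sketch of the basic calls, using a made-up vector. The `fixed = TRUE` example at the end is one common pitfall with regex metacharacters and is not necessarily one the linked post covers.

```r
# Hypothetical sample data.
fruits <- c("Apple pie", "banana bread", "apple juice", "Cherry tart")

# Default: case-sensitive, returns the indices of matching elements.
idx_sensitive <- grep("apple", fruits)

# ignore.case = TRUE also matches "Apple".
idx_insensitive <- grep("apple", fruits, ignore.case = TRUE)

# value = TRUE returns the matching strings instead of their indices.
matched <- grep("apple", fruits, ignore.case = TRUE, value = TRUE)

# In a regular expression "." matches any character;
# fixed = TRUE treats the pattern as a literal string.
dot_regex <- grep(".", c("a.b", "ab"))
dot_literal <- grep(".", c("a.b", "ab"), fixed = TRUE)
```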

Post: spsanderson.com/steveondata/po

#R #RStats #RProgramming #Programming #Coding #textdata #stringr #grep

Steven Sanderson spsanderson@rstats.me
2024-09-03

Today's blog post discusses using OR logic with the `grep()` function in R, which enhances pattern matching in character vectors.

By employing the pipe symbol (`|`), users can search for multiple patterns simultaneously, such as `grep("apple|banana", text_vector)`.

It also highlights the option to ignore case with `ignore.case = TRUE`.
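
A minimal sketch of the OR pattern with a made-up `text_vector`. Building the pattern from a vector with `paste()` is an added convenience, not something the post is known to show.

```r
# Hypothetical sample data.
text_vector <- c("Apple tart", "banana split", "cherry cola", "BANANA bread")

# "apple|banana" matches elements containing either word (case-sensitive).
or_idx <- grep("apple|banana", text_vector)

# With ignore.case = TRUE, "Apple" and "BANANA" match as well.
or_idx_any_case <- grep("apple|banana", text_vector, ignore.case = TRUE)

# The same pattern can be assembled from a vector of search terms.
terms <- c("apple", "banana")
or_pattern <- paste(terms, collapse = "|")
or_values <- grep(or_pattern, text_vector, ignore.case = TRUE, value = TRUE)
```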

Post: spsanderson.com/steveondata/po

#R #RStats #RProgramming #Programming #Coding #textdata #grep

Steven P. Sanderson II, MPH stevensanderson@mstdn.social
2024-08-16

In today's post, I explain how to use the `grepl()` function in base R to search for multiple patterns in strings. I break down the syntax and show how to combine patterns using the OR operator (`|`) for simultaneous searches.

A practical example demonstrates searching for "cat" or "dog" in a list of phrases, highlighting case-insensitive searching and extracting results.
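
A small sketch of that kind of search, with made-up phrases. The word-boundary variant is an extra illustration of how substring matching can over-match; it is not claimed to be part of the post.

```r
# Hypothetical sample data.
phrases <- c(
  "The Cat sat on the mat",
  "A dog barked",
  "Birds can fly",
  "Hot dogs and catsup"
)

# grepl() returns one TRUE/FALSE per element; "|" means cat OR dog.
has_pet <- grepl("cat|dog", phrases, ignore.case = TRUE)

# Use the logical vector to extract the matching phrases.
pet_phrases <- phrases[has_pet]

# Plain patterns match substrings ("dogs", "catsup");
# \\b restricts the match to whole words.
whole_word <- grepl("\\b(cat|dog)\\b", phrases, ignore.case = TRUE)
```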

Post: spsanderson.com/steveondata/po

#R #RStats #Coding #textdata #grepl #regex

2018-03-26

Challenge on Automatic Misogyny Detection
amiibereval2018.wordpress.com

> The AMI shared task proposes the automatic identification of misogynous content both in English and in Spanish languages in Twitter.

#textdata
