Oskar van der Wal

PhD candidate working on bias in NLP.

odvanderwal.nl

2024-09-08

This is a friendly reminder that there are 7 days left to submit your extended abstract to this workshop!

(Since the workshop is non-archival, previously published work is welcome too. So consider submitting previous/future work to join the discussion in Amsterdam!)

sigmoid.social/@oskarvanderwal

2024-08-07

This workshop is organized by @AmsterdamNLP @uva_amsterdam & @UvAHumanities researchers Katrin Schulz, Leendert van Maanen, @wzuidema, Dominik Bachmann, and myself.
More information on the workshop can be found on the website, which will be updated regularly.
wai-amsterdam.github.io

2024-08-07

One of the central issues discussed in the context of the societal impact of language technology is that ML systems can contribute to discrimination. Despite efforts to address these issues, we are far from solving them.

🌟The goal of this workshop is to bring together researchers from different fields to discuss the state of the art in bias measurement and mitigation in language technology and to explore new approaches.

2024-08-07

We're super excited to host
@dongng, John Lalor, @zeerak and @az_jacobs as invited speakers at this workshop!

Submit an extended abstract to join the discussions, either with a 20-minute talk or in the poster session.

2024-08-07

Working on #bias & #discrimination in #LanguageTechnology? Passionate about integrating insights from different disciplines? And do you want to discuss current limitations of #LLM 🤖 research on bias mitigation & algorithmic discrimination?
👋Join the #workshop "New Perspectives on Bias and Discrimination in Language Technology", held 4 & 5 Nov in #Amsterdam!
📝The Call for Abstracts is currently open; the deadline is September 15th, 2024
wai-amsterdam.github.io/

2024-01-24

Special thanks go to Dominik Bachmann (shared first author), whose insights from the perspective of psychometrics helped shape not only this paper, but also my views of current AI fairness practices more broadly.

2024-01-24

This paper has long been in the works and is the result of many discussions trying to bridge the worlds of NLP and psychometrics. I am grateful to my co-authors Dominik Bachmann, @alinaleidinger, Leendert van Maanen, @wzuidema, and Katrin Schulz.

2024-01-24

If you use/develop bias measures, we encourage you to apply a psychometric lens for reliably measuring the construct of interest. We end our paper with guidelines for developing bias measurement tools, which complement the excellent advice by Dev et al., @zeerak, Blodgett et al., and others!

2024-01-24

Maybe you want to decouple gender bias from grammatical gender (see e.g., Limisiewicz & Mareček). Or—when comparing the bias of models of different sizes—you want to make sure model capability is not a confounding factor. For instance, it is possible that smaller models do not respond well to prompting (being effectively random) and appear to be less biased.
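
A minimal sketch of such a sanity check (all model names and numbers below are made up): compare bias scores against task accuracy across model sizes, to see whether "less biased" might just mean "responds near-randomly to the prompt format".

import numpy as np
from scipy.stats import pearsonr

models = ["125m", "1.3b", "6.7b", "13b"]            # hypothetical model sizes
bias_score = np.array([0.02, 0.08, 0.15, 0.16])     # hypothetical bias scores
task_accuracy = np.array([0.26, 0.55, 0.78, 0.81])  # hypothetical accuracy on the same prompts (chance = 0.25)

for name, b, acc in zip(models, bias_score, task_accuracy):
    print(f"{name}: bias={b:.2f}, accuracy={acc:.2f}")

r, _ = pearsonr(bias_score, task_accuracy)
print(f"correlation between bias score and capability: r={r:.2f}")
# A strong correlation, plus near-chance accuracy for the smallest model, warns that
# the apparent lack of bias may reflect random responses rather than fairness.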

2024-01-24

However, we believe that its flip side, divergent validity, deserves attention as well! Here we ask whether the bias measure is perhaps too similar to another (easily confounded) measure or construct. We do not want to accidentally measure something else as well!

Screenshot of a figure with the caption: "This figure illustrates the difference between convergent and divergent validity (see Section 4.2). In this example, the convergent validity is assessed by testing how related a gender bias measure is to another gender bias measure. The divergent validity, instead, is assessed by testing whether the gender bias measure is not strongly correlated with a measure for another, but easily confounded construct (e.g., grammatical gender)."
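
A minimal sketch of such a divergent-validity check (the scores below are randomly generated placeholders, not real measurements): correlate the gender bias measure with a measure of the confounded construct, e.g. grammatical gender, and hope for a weak association.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 200                                        # e.g., templates or target words in a bias benchmark
gender_bias_score = rng.normal(size=n_items)         # placeholder for your gender bias measure
grammatical_gender_score = rng.normal(size=n_items)  # placeholder for the confound measure

rho, p = spearmanr(gender_bias_score, grammatical_gender_score)
print(f"divergent validity check: rho={rho:.2f} (p={p:.3f})")
# A strong correlation would suggest the bias measure partly captures grammatical
# gender rather than the social construct we care about.
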
2024-01-24

An obvious approach is to test the convergent validity: How well do our bias scores relate to other bias measures? And, more importantly, to the downstream harms of LMs (sometimes called predictive validity)? See, for example, the work by Delobelle et al., Goldfarb-Tarrant et al., and others.
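
A minimal sketch of a convergent-validity check (model names and scores are made up): compare how a set of models ranks under two different operationalizations of the same bias construct.

import numpy as np
from scipy.stats import spearmanr

models = ["lm-a", "lm-b", "lm-c", "lm-d", "lm-e"]          # hypothetical models
intrinsic_bias = np.array([0.10, 0.35, 0.22, 0.50, 0.05])  # e.g., an embedding-based bias measure
extrinsic_bias = np.array([0.12, 0.30, 0.25, 0.61, 0.08])  # e.g., a downstream-harm measure

print("ranking under measure 1:", [models[i] for i in np.argsort(-intrinsic_bias)])
print("ranking under measure 2:", [models[i] for i in np.argsort(-extrinsic_bias)])

rho, p = spearmanr(intrinsic_bias, extrinsic_bias)
print(f"convergent validity: rank correlation across models rho={rho:.2f} (p={p:.3f})")
# High agreement supports convergent (or predictive) validity; low agreement suggests
# the two operationalizations do not capture the same construct.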

2024-01-24

Ideally, one would test the validity by comparing one’s bias results with a gold standard (criterion validity). Unfortunately, we do not have access to this for something like model bias... But there exist alternative (weaker) strategies for validating bias benchmarks!

2024-01-24

2️⃣ Construct validity: How sure are we that we measure what we actually want to measure (the construct)? Critical work by, e.g., Gonen & Goldberg, Blodgett et al., and Orgad & Belinkov shows many flaws that could hurt validity. How do we design bias measures that actually measure what we want?

Screenshot of a table with the caption: "An overview of the types of construct validity we discuss in Section 4. Examples are given in the last column."
2024-01-24

Relatedly, this interesting post makes a compelling argument: it is not the size of a benchmark that matters, but its reliability! We waste valuable resources computing results for test items that do not contribute to the overall reliability of the dataset.

@LChoshen sigmoid.social/@LChoshen/11096
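
A minimal sketch of one way to quantify this, borrowing Cronbach's alpha from psychometrics (the per-item scores below are simulated): treat models as "test takers" and benchmark items as test questions, and check how internally consistent the items are.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: (n_models, n_items) matrix of per-item bias scores."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)
latent_bias = rng.normal(size=(8, 1))                  # simulated "true" bias level of 8 models
scores = latent_bias + 0.5 * rng.normal(size=(8, 40))  # 40 noisy benchmark items per model

print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
# Items that barely correlate with the rest mostly add noise (and compute cost)
# without improving the benchmark's overall reliability.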

2024-01-24

Parallel-form reliability tests if different—but designed to be equivalent—versions of a measure are consistent. E.g., how consistent are different prompt formulations for evaluating the LM responses on the same bias dataset? Are LMs sensitive to minor changes to how the questions are phrased?
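
A minimal sketch of a parallel-form check (the scores are simulated stand-ins for real evaluation runs): score the same items under several paraphrased prompt templates and look at how strongly the resulting scores agree.

import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
item_effect = rng.normal(size=100)  # shared signal per benchmark item
scores_per_template = [item_effect + 0.3 * rng.normal(size=100) for _ in range(3)]  # three prompt paraphrases

correlations = []
for a, b in combinations(scores_per_template, 2):
    rho, _ = spearmanr(a, b)
    correlations.append(rho)
print(f"mean inter-template correlation: {np.mean(correlations):.2f}")
# Low agreement means the measure reflects prompt wording as much as the bias
# construct, so conclusions drawn from a single formulation are fragile.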

2024-01-24

1️⃣ Reliability: How much precision can we get when applying the bias measure? How resilient is it to random measurement error? Naturally, we prefer measurement tools with a higher reliability! We discuss four forms of reliability we think can be applied easily to the NLP context.

Screenshot of a table with the caption: "Examples of the reliability types we discuss in Section 3. We specify, for each reliability type, across which variations (e.g., random seeds) the consistency is measured. In the last column, we provide examples of where these reliability types could be applied."
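
As one concrete example of these checks, here is a minimal test-retest style sketch across random seeds (the measure_bias function is a hypothetical placeholder, not a real API): run the same bias evaluation twice with different seeds and check how well the per-item scores agree.

import numpy as np
from scipy.stats import pearsonr

def measure_bias(seed: int, n_items: int = 50) -> np.ndarray:
    """Placeholder for a real evaluation run returning per-item bias scores."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.2, scale=0.1, size=n_items)

run_1 = measure_bias(seed=1)
run_2 = measure_bias(seed=2)
r, _ = pearsonr(run_1, run_2)
print(f"test-retest reliability across seeds: r={r:.2f}")
# If scores vary a lot between runs, observed differences between models may simply
# reflect measurement noise rather than genuine differences in bias.
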
2024-01-24

We discuss two important concepts that say something about the quality of bias measures: 1️⃣ reliability and 2️⃣ construct validity. For both, we discuss strategies for how to assess these in the NLP setting.

2024-01-24

It's important to understand the difference! For instance, a twice-as-high bias score (operationalization) does not necessarily mean that the model is twice as biased (construct).

Similarly, the bias measure may be excellent at distinguishing high from extremely high bias, but not nearly as good at comparing models with low levels of bias.

Making this distinction allows us to be more explicit about our assumptions and conceptualizations.

2024-01-24

Borrowing from psychometrics (a field specialized in the measurement of concepts that are not directly observable), we argue that it is useful to decouple the "construct" (what we want to know about but cannot observe directly) from its "operationalization" (the imperfect proxy).

Screenshot of a figure with the caption: "We assume that a training dataset's bias influences the bias of a model trained on that data (but other sources of bias are possible, e.g., model compression may amplify existing biases (Hooker et al., 2020)). Training dataset bias and model bias are unobservable constructs (circle) that both have different possible operationalizations (squares)."
2024-01-24

I am super excited to share that our paper "Undesirable Biases in NLP: Addressing Challenges of Measurement" has been published in JAIR!
doi.org/10.1613/jair.1.15195

Developing tools for measuring & mitigating bias is hard: LM bias is a complex sociocultural phenomenon + we have no access to a ground truth. We voice our concerns about current bias eval practices, and discuss how we can test the quality of bias measures despite these challenges.

A 🧵
