Matthew Honnibal

Co-founder and CTO of @explosion

Matthew Honnibal boosted:
2025-02-14

Just published part 3 of my blog post series on making beautiful slides for your talks 🎨✨

This one is about presenting technical content and making dry and abstract topics more interesting. Featuring many examples, including talks by Vitaly Meursault and @sofie!

ines.io/blog/beautiful-slides-

Matthew Honnibal @honnibal@sigmoid.social
2024-07-17

spaCy and Prodigy started as indie projects, but in 2021 we decided to raise capital and have a larger team. We couldn’t make that configuration work, so we’re back to how we were before. I’ll be spending most of my time hands-on with spaCy again, and we have a lot of updates and improvements planned for Prodigy.

I hate how vaguely these things are usually discussed, so I also wrote a long post about it all: honnibal.dev/blog/back-to-our-

Matthew Honnibal boosted:
2024-07-17

Company update: We're going back to our roots!

We're back to running Explosion as a smaller, independent-minded and self-sufficient company. spaCy and Prodigy will stay stable and sustainable and we'll keep updating our stack with the latest technologies, without changing its core identity or purpose 💙

explosion.ai/blog/back-to-our-

Matthew Honnibal boosted:
2023-06-03

We are really excited to share that we have just released the alpha version of Prodigy v1.12! This includes LLM-assisted workflows for data annotation and prompt engineering as well as extended, fully customizable support for multi-annotator workflows.

support.prodi.gy/t/prodigy-1-1

Matthew Honnibal boosted:
2023-06-03

We present a brand new workflow for prompt engineering that allows you to compare the quality of several prompts in a tournament. The algorithm uses the Glicko rating system [en.wikipedia.org/wiki/Glicko_r] to select the best prompt.

future--prodi-gy.netlify.app/d
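The linked article has the Glicko details; as a simpler illustration of the same tournament idea, here is an Elo-style rating update applied after each pairwise prompt comparison (Glicko extends this by also tracking a rating deviation per prompt). A hypothetical sketch, not the recipe's actual implementation:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update two prompt ratings after one head-to-head comparison."""
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Two prompts start equal; the winner of the comparison gains rating.
a, b = elo_update(1500.0, 1500.0, a_won=True)
```

Repeating this over many comparisons lets the best prompt rise to the top without every prompt having to face every other one.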

Matthew Honnibal boosted:
2023-06-03

Here are the slides for my #PyDataLondon keynote on LLMs from prototype to production ✨

Including:
◾ visions for NLP in the age of LLMs
◾ a case for LLM pragmatism
◾ solutions for structured data
◾ spaCy LLM + prodi.gy

speakerdeck.com/inesmontani/la

Matthew Honnibal @honnibal@sigmoid.social
2023-05-18

What will production NLP look like, once the dust settles around LLMs? One view is basically “prompts are all you need”. I disagree. I wrote a bit about this when we released #spaCy LLM last week, but the topic deserves its own post, so here it is.

explosion.ai/blog/against-llm-

Matthew Honnibal boosted:
2022-12-21

Hi #MastoCats! Let me introduce Rizhik and Alaska, our guest cats from Ukraine.

[Image: Ginger cat next to a tattoo of himself]
[Image: Don sphynx kitten wearing a red sweater with a fluffy white collar]

Matthew Honnibal @honnibal@sigmoid.social
2022-12-20

@kjr Good! We have several users using it with r2l and bidirectional text happily. Here's the config setting: prodi.gy/docs/install#config

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

If you don't have Prodigy, you can get a copy here: prodi.gy/buy

We sell Prodigy in a very old-school way, with a once-off fee for software you run yourself. There's no free download, but we're happy to issue refunds, and we can host trials for companies.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

We didn't have to make any changes to Prodigy itself for this workflow — everything happens in the "recipe" script. You can build other things at least this complex for yourself, or you can start from one of our scripts and modify it according to your requirements.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

The key to iteration speed is letting a small group of people — ideally just you! — annotate faster. That's where the scriptability comes in. Every problem is different, and we can't guess exactly what tool assistance or interface will be best. So we let you control that.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

Modern neural networks are very sample efficient, because they use transfer learning to acquire most of their knowledge. You just need enough examples to define your problem. If annotation is mostly about problem definition, iteration is much more important than scaling.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

I especially like this zero-shot learning workflow because it's a great example of what we've always set out to achieve with Prodigy. Two distinct features of Prodigy are its scriptability and the ease with which you can scale down to a single-person workflow.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

This workflow looks pretty promising from initial testing. The model provides useful suggestions for categories like "ingredient", "dish" and "equipment" just from the labels, with no examples. And the precision isn't bad — I was impressed that it avoided marking "Goose" here.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

So, let's compromise. We'll pipe our data through the OpenAI API, prompting it to suggest entities for us. But instead of just shipping whatever it suggested, we're going to go through and correct its annotations. Then we'll save those out and train a much smaller supervised model.
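As a rough sketch of the middle step, here is how the model's entity suggestions might be turned back into character-offset spans that an annotator can accept or correct. The response format and the helper below are hypothetical stand-ins, not the actual prodigy-openai recipe code:

```python
# Hypothetical sketch: turn a model response like
#   "ingredient: butter, flour\nequipment: oven"
# into character-offset spans for review in an annotation UI.
def parse_response(text: str, response: str) -> list[dict]:
    spans = []
    for line in response.splitlines():
        if ":" not in line:
            continue  # skip chatter the model adds around the answer
        label, _, names = line.partition(":")
        for name in names.split(","):
            name = name.strip()
            start = text.find(name)  # naive: first occurrence only
            if name and start != -1:
                spans.append(
                    {"start": start, "end": start + len(name), "label": label.strip()}
                )
    return spans
```

The corrected spans are what you save out as gold data for training the smaller supervised model.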

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

Machine learning is basically programming by example: instead of specifying a system's behaviour with code, you (imperfectly) specify the desired behaviour with training data.

Well, zero-shot learning is like that, but without the training data. That does have some advantages — you don't have to tell it much about what you want it to do. But it's also pretty limiting. You can't tell it much about what you want it to do.

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

So how can models like GPT3 help? One answer is zero- or few-shot learning: you prompt the model with something like "Annotate this text for these entities", and you append your text to the prompt. This works surprisingly well! It was one of the headline results in the original paper.

However, zero-shot classifiers really aren't good enough for most applications. The prompt just doesn't give you enough control over the model's behaviour.
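A minimal sketch of such a zero-shot prompt, assuming a made-up template (the actual template used by the recipe differs):

```python
# Illustrative only: build a zero-shot NER prompt from a label list.
def build_ner_prompt(text: str, labels: list[str]) -> str:
    label_list = ", ".join(labels)
    return (
        f"Annotate the following text for these entity types: {label_list}.\n"
        "Answer with one line per entity type, formatted as 'Label: span1, span2'.\n\n"
        f"Text: {text}"
    )

prompt = build_ner_prompt(
    "Roast the goose with butter.", ["ingredient", "dish", "equipment"]
)
```

Everything the model knows about your task has to fit in that one template, which is exactly why the prompt gives you so little control.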

Matthew Honnibal @honnibal@sigmoid.social
2022-12-19

We've been working on new prodi.gy workflows that let you use the OpenAI API to kickstart your annotations, via zero- or few-shot learning. We've just published the first recipe, for NER annotation 🎉 github.com/explosion/prodigy-o

Here's what, why and how. 🧵

Let's say you want to do some 'traditional' NLP thing, like extracting information from text. The information you want to extract isn't on the public web — it's in this pile of documents you have sitting in front of you.
