Redact + Reduce

About

This is a small app we built to help us experiment with inputs and outputs for Large Language Models (LLMs). It allows you to preprocess content - either by redacting or reducing it - before copying it into an LLM.

Problems

Problem 1

It's very easy to share content with LLMs. This is great unless that content has sensitive data in it. As we talked about in protect your data on The Art of the AI Prompt, it's important to redact sensitive content. We wanted a quick redact tool that would allow us to preprocess our content before copying it into an LLM.
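
We haven't documented the app's exact redaction rules here, but as a rough sketch of the idea, pattern-based redaction can be done in a few lines of Python. The patterns and the redact helper below are illustrative assumptions on our part, not the app's actual implementation.

    import re

    # Illustrative patterns only - real redaction needs patterns tuned to your own data.
    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
        "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    }

    def redact(text: str) -> str:
        """Replace anything matching a sensitive pattern with a placeholder."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"[{label} REDACTED]", text)
        return text

    print(redact("Email jane@example.com or call +44 7700 900123."))
    # Email [EMAIL REDACTED] or call [PHONE REDACTED].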


Problem 2

LLMs have both a token limit and - assuming you're using an API - a cost for every token sent to them.

This is annoying if you want to quickly summarise an article or PDF that is over the token limit.
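
A useful first step is simply counting tokens before you paste anything in. Here is a minimal sketch, assuming OpenAI's tiktoken library; the token limit in the example is illustrative and varies by model.

    import tiktoken

    # cl100k_base is the encoding used by recent OpenAI chat models.
    enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    TOKEN_LIMIT = 8192  # illustrative; check the limit for your model
    article = "Paste the article or PDF text you want to summarise here."
    n = count_tokens(article)
    print(f"~ {n} tokens -", "over the token limit" if n > TOKEN_LIMIT else "below the token limit")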

Solution

We haven't yet seen a good solution for problem 1.


For problem 2, the most popular solution to emerge over the past few months is to use a vector database as an intermediary. This is what we've explored in Conversations with your Content.


This makes a lot of sense for complex, or synthesised, data, but it's also time-consuming and expensive.


We were interested to see how much we could do on the cheap.

Stemming

In the past, computers had very little physical, or accessible, memory. There was a time when every character written to the machine had a cost. It's why systems like flight control systems have impenetrable code: it was written in the most concise form possible.


Stemming - where you cut out vowels, or cut off the ends of words - was a solution used for early compression. This felt like it would be a good solution for preprocessing content for LLMs.

tl;dr
It wasn't!

It turns out that stemming words increases the number of tokens used, which is the opposite of what we wanted. That's a shame, because LLMs are great at understanding stemmed words.
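
You can see the effect for yourself with a few lines of Python. This sketch assumes NLTK's PorterStemmer and OpenAI's tiktoken, which may not match the app's exact internals.

    import tiktoken
    from nltk.stem import PorterStemmer

    enc = tiktoken.get_encoding("cl100k_base")
    stemmer = PorterStemmer()

    text = "The engineers were carefully documenting the compression experiments."
    stemmed = " ".join(stemmer.stem(word) for word in text.split())

    # Stemmed words often fall outside the tokeniser's vocabulary and get split
    # into several sub-word tokens, so the "compressed" text can cost more tokens.
    print(len(enc.encode(text)), "tokens original")
    print(len(enc.encode(stemmed)), "tokens stemmed")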

We've kept the ability to stem words in this app, but caveat emptor: this will increase your costs if you use an API.

Reduction

The other strategies were all more destructive. They involved deleting stop words, sentences, paragraphs, or the beginning, end or middle of content. Our thought was that humans are quite repetitive when we write, and our early testing indicates this is true: LLMs are very good at taking this reduced content and understanding its meaning.
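
As an example of the cruder end of that spectrum, keeping the start and end of a document and deleting the middle is only a few lines. The split ratios and the naive sentence splitting below are illustrative choices, not the app's defaults.

    def drop_middle(text: str, keep_head: float = 0.4, keep_tail: float = 0.2) -> str:
        """Keep the first and last chunks of a text and delete everything in between."""
        sentences = text.split(". ")  # naive sentence splitting, good enough for a sketch
        head = sentences[: int(len(sentences) * keep_head)]
        tail = sentences[len(sentences) - max(1, int(len(sentences) * keep_tail)) :]
        return ". ".join(head + tail)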


The great thing here is that removing words leads to a reduction in tokens.


In particular, removing stop words, adjectives and adverbs seems to have the most utility. The effect on LLM comprehension appears to be minimal, but it reduces the content by 25-35%, which potentially has some very positive knock-on benefits for token limits and cost.
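
Here is a minimal sketch of that kind of reduction, assuming NLTK's English stopword list and part-of-speech tagger; the relevant NLTK data packages need to be downloaded first, and the exact rules the app applies may differ.

    import nltk
    from nltk.corpus import stopwords

    # One-off downloads: nltk.download("punkt"), nltk.download("stopwords"),
    # nltk.download("averaged_perceptron_tagger")
    STOP_WORDS = set(stopwords.words("english"))

    def reduce_text(text: str) -> str:
        """Drop stop words, adjectives (JJ*) and adverbs (RB*); keep everything else."""
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        kept = [word for word, tag in tagged
                if word.lower() not in STOP_WORDS and not tag.startswith(("JJ", "RB"))]
        return " ".join(kept)

    original = "The extremely detailed report clearly shows that the new system performs well."
    print(reduce_text(original))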
