If there are already AIs pretending to be people, and people pretending to be AIs... we may as well have an algorithm that pretends to be an AI.
Because why not?!
This algorithm will generate 50,000 words using:
- text file input data: in this case 4 Project Gutenberg books, although you can change this to any other text file
- keywords that the user inputs manually
So you can think of it as a tiny language model, but in this case it's purely algorithmic.
A fellow participant is also working on a very small language model (see #4), but the two projects are distinct and implemented differently.
How the algorithm works
My algorithm will be a meld of Markov chaining and the Cut-Up method, mashed together into one algorithm. Cut-Up is when a text is segmented and re-arranged to form new text.
Briefly, I start by segmenting the text at its most commonly used words. These are the words that occur most often in the text sources used (not necessarily in the English language in general).
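Picking the most commonly used words could be sketched roughly like this. It's only an illustration: the function name `most_common_words`, the lowercase tokenizing regex, and the cutoff `n` are my assumptions, not settled parts of the project.

```python
import re
from collections import Counter


def most_common_words(text, n):
    """Return the set of the n most frequent words in `text`.

    Assumption: words are lowercased and split on anything that is not
    a letter or an apostrophe. A real tokenizer may need more care.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return {word for word, _count in Counter(words).most_common(n)}
```

The cutoff `n` would need tuning: too small and the fragments get very long, too large and nearly every word becomes a cut point.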
Segmentation is done such that each resulting fragment begins and ends with one of the most commonly used words.
So, as a brief example, take the text:
a cat was sitting on a mat and a hamster was running on a wheel and playing
Now, there aren't a ton of words here, so let's just assume our most commonly used words are a, on, and, was.
The fragments would become:
a cat was
was sitting on
on a
a mat and
and a
a hamster was
was running on
on a
a wheel and
The trailing "and playing" is omitted because it doesn't end with a commonly used word.
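The segmentation in the example above can be sketched as follows. This is a minimal version under my own assumptions: the function name `segment` is hypothetical, the input is already split into words, and any words before the first common word are dropped just like the trailing ones.

```python
def segment(words, common):
    """Cut a word list into fragments that begin and end with a common word.

    Consecutive fragments overlap by one word: the common word that
    closes one fragment also opens the next.
    """
    fragments = []
    start = None  # index of the common word that opens the current fragment
    for i, word in enumerate(words):
        if word in common:
            if start is not None:
                fragments.append(tuple(words[start:i + 1]))
            start = i
    # Whatever follows the last common word never gets closed, so it is omitted.
    return fragments


text = "a cat was sitting on a mat and a hamster was running on a wheel and playing"
for fragment in segment(text.split(), {"a", "on", "and", "was"}):
    print(" ".join(fragment))
```

Run on the example sentence, this prints the nine fragments listed above, with "on a" appearing twice.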
Those fragments then form the basis of a Markov chain lookup table. For example, suppose I randomly start with the fragment "a hamster was". The next fragment must start with "was", so the algorithm narrows the options down to "was sitting on" and "was running on". It will pick one of these at random, and so on. However, if any of these options contains one of the user keywords, it will pick from the ones containing the keyword instead.
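Here is a rough sketch of the lookup table and the keyword preference. The names `build_table`, `next_fragment` and `generate` are hypothetical. Two details are my own assumptions rather than decided design: duplicate fragments stay in the table (so frequent fragments are picked more often), and the shared word is written only once when two fragments are joined.

```python
import random
from collections import defaultdict


def build_table(fragments):
    """Index fragments by their first word."""
    table = defaultdict(list)
    for fragment in fragments:
        table[fragment[0]].append(fragment)
    return table


def next_fragment(table, current, keywords, rng=random):
    """Pick a fragment starting with the last word of `current`.

    Candidates containing a user keyword win if there are any;
    otherwise the choice is uniform over all candidates.
    """
    candidates = table.get(current[-1], [])
    if not candidates:
        return None  # dead end: nothing starts with this word
    preferred = [f for f in candidates if any(w in keywords for w in f)]
    return rng.choice(preferred or candidates)


def generate(table, start, keywords, max_fragments, rng=random):
    """Chain up to `max_fragments` fragments into one string."""
    output = list(start)
    current = start
    for _ in range(max_fragments - 1):
        current = next_fragment(table, current, keywords, rng)
        if current is None:
            break
        output.extend(current[1:])  # assumption: skip the overlapping word
    return " ".join(output)
```

With the example fragments, starting from "a hamster was" with the keyword "running" always continues with "was running on"; without keywords it picks between the two "was ..." fragments at random.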
This is a bit different to regular Markov chaining in that the lengths of the fragments in my table will vary. For example, if the original text contains a phrase with several uncommon words in a row, they will all wind up together in one fragment, because the text is being cut at the most common words. By contrast, Markov chaining is usually implemented with fragments of constant length, typically 2 or 3 words.
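For comparison, a conventional fixed-length table might look something like this. It is a generic sketch of the standard technique, not part of my algorithm, and the name `fixed_length_table` and the `order` parameter are just illustrative.

```python
from collections import defaultdict


def fixed_length_table(words, order=2):
    """Map every run of `order` consecutive words to the words that follow it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key].append(words[i + order])
    return table
```

Every key here has exactly `order` words, whereas the fragments in my table can be any length from two words upwards.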
I'm hoping that this method might generate slightly more coherent text than standard Markov chaining, but it's not clear whether this would really be the case in practice.