An algorithm pretending to be an AI

If there are already AIs pretending to be people, and people pretending to be AIs... we may as well have an algorithm that pretends to be an AI.

Because why not?! 

This algorithm will generate 50,000 words using:

- the input text files: in this case 4 Project Gutenberg books, although you can swap in any other text file
- keywords that the user enters manually

So you can think of it as a tiny language model, but in this case it's purely algorithmic. 

A fellow participant is also working on a very small language model (see https://github.com/NaNoGenMo/2024/issues/4), but the two projects are distinct and implemented differently.

## How the algorithm works
My algorithm will be a meld of Markov chaining and the Cut-Up method, mashed together into one algorithm. Cut-Up is when a text is segmented and rearranged to form new text.

Briefly, I start by segmenting the text at its most commonly used words. These are the most commonly used words in the source texts themselves (not necessarily in the English language in general).

Segmentation is done such that each resulting fragment begins and ends with one of the most commonly used words.
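To make the "most commonly used words" step concrete, here is a minimal sketch of how they could be counted. The function name `most_common_words`, the lowercasing, and the simple regex tokenizer are all assumptions made for illustration, not necessarily what the final project does.

```python
import re
from collections import Counter

def most_common_words(text, n):
    """Return the n most frequent words in `text`, most frequent first.

    Assumption: words are lowercased runs of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _count in Counter(words).most_common(n)]

# Hypothetical usage on the sample sentence from this post
sample = ("a cat was sitting on a mat and a hamster was running "
          "on a wheel and playing")
common = most_common_words(sample, 4)
print(common)
```

With real books you would pick a much larger `n`; the right value is something to tune by experiment.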

As a brief example, take the text

`a cat was sitting on a mat and a hamster was running on a wheel and playing`

There aren't a ton of words here, so let's just assume our most commonly used words are `a, on, and, was`.

The fragments would become:

- `a cat was`
- `was sitting on`
- `on a`
- `a mat and`
- `and a`
- `a hamster was`
- `was running on`
- `on a`
- `a wheel and`

`and playing` is omitted because it doesn't end with a commonly used word.
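The cutting step can be sketched roughly as follows. This is only an illustration under a few assumptions: the text is already split into words, the hypothetical helper `cut_into_fragments` is my own name for it, and anything before the first common word is dropped in the same way as the trailing `and playing`.

```python
def cut_into_fragments(words, common):
    """Cut `words` so each fragment begins and ends with a common word.

    Adjacent fragments overlap by one word (the common word they share).
    Words before the first common word and after the last one are dropped.
    """
    fragments = []
    start = None  # index of the most recent common word seen
    for i, word in enumerate(words):
        if word in common:
            if start is not None:
                fragments.append(tuple(words[start:i + 1]))
            start = i
    return fragments

text = ("a cat was sitting on a mat and a hamster was running "
        "on a wheel and playing")
fragments = cut_into_fragments(text.split(), {"a", "on", "and", "was"})
for fragment in fragments:
    print(" ".join(fragment))
```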

Those fragments then form the basis of a Markov chain lookup table. Suppose I randomly start with the fragment `a hamster was`. The next fragment must start with `was`, so the algorithm narrows the options down to `was sitting on` and `was running on`, picks one of them at random, and so on. However, if any of these options contain one of the user keywords, it picks from those instead.
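Here is a rough sketch of the lookup table and the keyword-biased choice described above. The names `build_table`, `next_fragment` and `generate` are hypothetical, and two details are my own assumptions: the shared common word is only emitted once when fragments are joined, and generation simply stops if no fragment starts with the required word.

```python
import random
from collections import defaultdict

def build_table(fragments):
    """Map each starting word to the fragments that begin with it."""
    table = defaultdict(list)
    for fragment in fragments:
        table[fragment[0]].append(fragment)
    return table

def next_fragment(table, current, keywords, rng):
    """Pick a fragment starting with the last word of `current`.

    Fragments containing a user keyword are preferred when available.
    Returns None if nothing starts with that word.
    """
    candidates = table.get(current[-1], [])
    if not candidates:
        return None
    preferred = [f for f in candidates if any(w in keywords for w in f)]
    return rng.choice(preferred or candidates)

def generate(fragments, keywords, max_words, seed=None):
    """Chain fragments until at least `max_words` words are produced."""
    rng = random.Random(seed)
    table = build_table(fragments)
    current = rng.choice(fragments)
    output = list(current)
    while len(output) < max_words:
        current = next_fragment(table, current, keywords, rng)
        if current is None:
            break
        output.extend(current[1:])  # skip the overlapping common word
    return " ".join(output)

# The fragments from the example in this post
FRAGMENTS = [tuple(s.split()) for s in (
    "a cat was", "was sitting on", "on a", "a mat and", "and a",
    "a hamster was", "was running on", "on a", "a wheel and",
)]
print(generate(FRAGMENTS, {"hamster"}, max_words=20, seed=1))
```

Passing a `seed` is just a convenience for reproducible runs; leave it out for a different text each time.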

This is a bit different to regular Markov chaining in that the lengths of the fragments in my table will vary. For example, if the original text contains a phrase with several uncommon words in a row, they will all wind up together in one fragment, because the text is only being cut at the most common words. By contrast, Markov chaining is usually implemented with fixed-length fragments, typically 2 or 3 words.
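For comparison, a conventional fixed-length word-level Markov table might look something like the sketch below, where every state is exactly `order` words long. The function name `build_ngram_table` and the choice of `order=2` are assumptions for illustration only.

```python
from collections import defaultdict

def build_ngram_table(words, order=2):
    """Map every run of `order` consecutive words to the words seen after it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        table[state].append(words[i + order])
    return table

words = ("a cat was sitting on a mat and a hamster was running "
         "on a wheel and playing").split()
ngram_table = build_ngram_table(words, order=2)
print(ngram_table[("on", "a")])
```

Here the state length never changes, whereas the cut-up fragments from the same sentence are 2 or 3 words long and would grow longer wherever several uncommon words appear in a row.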

I'm hoping that this method might generate slightly more coherent text than standard Markov chaining, but it's not clear whether this would really be the case in practice.
