Evaluation
To assess the performance of our models, we used both automated and human evaluation metrics.
The BLEU score was used to compare the generated texts against the original dataset. Below are the results:
| Model | BLEU Score (%) |
|---|---|
| n-grams | 9.89 |
| LLaMa2 | 8.55 |
| Falcon | 7.65 |
| GPT-Neo | 9.62 |
- n-grams performed surprisingly well on BLEU because the metric rewards exact n-gram overlaps, which the n-gram model reproduces directly.
- The fine-tuned models (LLaMa2, Falcon, GPT-Neo) achieved lower BLEU scores but generated more coherent and realistic text (see the BLEU sketch below).
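As a rough illustration of how such scores can be computed, here is a minimal sketch using NLTK's `corpus_bleu` with plain whitespace tokenization. The actual evaluation script and tokenization behind the numbers above are not shown on this page, so the helper name and toy data below are hypothetical.

```python
# Minimal sketch of a corpus-level BLEU computation (hypothetical; the real
# evaluation may use different tokenization, weights, or smoothing).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_percent(references, generations):
    """Compute corpus BLEU (%) of generated texts against original dataset texts."""
    # Each generation is scored against its corresponding reference text.
    refs = [[ref.split()] for ref in references]   # list of reference-token lists
    hyps = [gen.split() for gen in generations]    # tokenized hypotheses
    score = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
    return 100 * score

# Toy example (not data from the actual dataset):
originals = ["the cat sat on the mat", "dogs bark at night"]
generated = ["the cat sat on a mat", "dogs bark loudly at night"]
print(f"BLEU: {bleu_percent(originals, generated):.2f}%")
```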
We conducted a human evaluation inspired by the Turing Test. Evaluators were asked to identify whether a given text was generated by a model or was part of the original dataset.
| Model | Human Accuracy (%) |
|---|---|
| n-grams | 100 |
| LLaMa2 | 100 |
| Falcon | 100 |
| GPT-Neo | 100 |
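As a rough sketch of how the per-model accuracy figures above could be tallied, the helper below counts how often an evaluator's guess ("model" vs. "original") matches the true source of a text. The record format and function name are assumptions for illustration, not the actual evaluation code.

```python
# Hypothetical scoring of the guessing game: each record holds the model,
# the evaluator's guess, and the true source of the text shown.
from collections import defaultdict

def human_accuracy(judgements):
    """Return per-model accuracy (%) of evaluators at spotting generated text."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, guess, truth in judgements:   # e.g. ("Falcon", "model", "model")
        total[model] += 1
        if guess == truth:
            correct[model] += 1
    return {m: 100 * correct[m] / total[m] for m in total}

# Toy example (not the real evaluation data):
judgements = [
    ("LLaMa2", "model", "model"),
    ("Falcon", "model", "model"),
    ("GPT-Neo", "original", "model"),  # evaluator fooled by the generation
]
print(human_accuracy(judgements))
```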
Below is an example screenshot of the Guessing Game interface:
Both evaluation methods have limitations:
- BLEU Score: may not fully capture how human-like or fluent the generated text is.
- Human Evaluation: subjective and potentially biased, since evaluators were aware of the models' weaknesses.
While automated metrics like BLEU offer quantitative insights, human evaluation remains the gold standard for assessing natural language generation quality.
For more details on the evaluation setup, visit the Model Training page.