
Evaluation and Results

Overview

To assess the performance of our models, we used both an automated metric (BLEU) and human evaluation.

Automated Evaluation: BLEU Score

The BLEU score was used to compare the generated texts against the original dataset. Below are the results:

| Model   | BLEU Score (%) |
|---------|----------------|
| n-grams | 9.89 |
| LLaMa2  | 8.55 |
| Falcon  | 7.65 |
| GPT-Neo | 9.62 |
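
As a minimal sketch of how such corpus-level BLEU scores can be computed (assuming NLTK; the exact tooling and preprocessing behind the numbers above are not documented on this page):

```python
# Minimal sketch of corpus-level BLEU evaluation, assuming NLTK.
# Illustrative only; the tooling used for this wiki's numbers is unspecified.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction


def bleu_percent(reference_texts, generated_texts):
    """Return corpus BLEU (in %) of generated_texts against reference_texts.

    Both arguments are equal-length lists of strings; text i of each list
    is compared pairwise (one reference per hypothesis).
    """
    references = [[ref.split()] for ref in reference_texts]  # one reference per hypothesis
    hypotheses = [hyp.split() for hyp in generated_texts]
    smoothing = SmoothingFunction().method1                  # avoids zero scores on short texts
    return 100 * corpus_bleu(references, hypotheses, smoothing_function=smoothing)


# Example usage (illustrative placeholder variables):
# print(f"BLEU: {bleu_percent(original_texts, model_outputs):.2f}%")
```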

Key Insights

  • The n-gram baseline performed surprisingly well on BLEU because the metric rewards exact n-gram overlap with the reference texts (see the formula below).
  • The fine-tuned models (LLaMa2, Falcon, GPT-Neo) achieved lower BLEU scores but generated more coherent and realistic text.
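
For reference, BLEU is the standard corpus-level metric of Papineni et al. (2002), combining modified n-gram precisions $p_n$ with a brevity penalty $BP$ (shown here for context; not specific to this project):

$$
\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
$$

where $c$ is the length of the generated text, $r$ the reference length, and $w_n$ uniform weights (typically $N = 4$, $w_n = 1/4$). This helps explain why the n-gram baseline, which reproduces exact n-gram sequences from its training data, scores highest on BLEU despite producing less coherent text.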

Human Evaluation: Guessing Game

We conducted a human evaluation inspired by the Turing Test. Evaluators were asked to identify whether a given text was generated by a model or was part of the original dataset.

| Model   | Human Accuracy (%) |
|---------|--------------------|
| n-grams | 100 |
| LLaMa2  | 100 |
| Falcon  | 100 |
| GPT-Neo | 100 |
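
The accuracy figures above can be tallied with a simple helper like the following (a hypothetical sketch; the actual study used the interactive interface shown below):

```python
# Hypothetical sketch of per-model guessing-game scoring: each trial records
# which model produced the text, the true label, and the evaluator's guess.
from collections import defaultdict


def human_accuracy(trials):
    """trials: iterable of (model_name, true_label, guessed_label) tuples,
    where each label is either "model" or "original".
    Returns {model_name: accuracy in %}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, truth, guess in trials:
        total[model] += 1
        correct[model] += int(truth == guess)
    return {model: 100 * correct[model] / total[model] for model in total}


# Example (made-up trials):
# human_accuracy([("Falcon", "model", "model"), ("Falcon", "original", "original")])
# -> {"Falcon": 100.0}
```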

Guessing Game Illustration

Below is an example screenshot of the Guessing Game interface:

[Screenshot: Guessing Game interface]


Discussion

Limitations

  • BLEU Score: Rewards surface n-gram overlap and may not fully capture the fluency or coherence of human-like text.
  • Human Evaluation: Subjective and potentially biased, as evaluators were aware of the models' characteristic weaknesses.

Conclusion

While automated metrics like BLEU offer quantitative insights, human evaluation remains the gold standard for assessing natural language generation quality.


References

For more details on the evaluation setup, visit the Model Training page.
