Evaluation
To assess the performance of our models, we used both automated and human evaluation metrics.
The BLEU score was used to compare the generated texts against the original dataset. Below are the results:
| Model | BLEU Score (%) |
|---|---|
| n-grams | 9.89 |
| LLaMa2 | 8.55 |
| Falcon | 7.65 |
| GPT-Neo | 9.62 |
- n-grams performed surprisingly well on BLEU because the metric rewards exact n-gram overlaps, which the n-gram model reproduces directly.
- The fine-tuned models (LLaMa2, Falcon, GPT-Neo) achieved lower BLEU scores but generated more coherent and realistic text (see the BLEU sketch below).
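As a rough illustration of how such scores can be computed, here is a minimal sketch using NLTK's `corpus_bleu` with plain whitespace tokenization. The actual evaluation script and tokenization behind the numbers above are not shown on this page, so the helper name and toy data below are hypothetical.

```python
# Minimal sketch of a corpus-level BLEU computation (hypothetical; the real
# evaluation may use different tokenization, weights, or smoothing).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_percent(references, generations):
    """Compute corpus BLEU (%) of generated texts against original dataset texts."""
    # Each generation is scored against its corresponding reference text.
    refs = [[ref.split()] for ref in references]   # list of reference-token lists
    hyps = [gen.split() for gen in generations]    # tokenized hypotheses
    score = corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
    return 100 * score

# Toy example (not data from the actual dataset):
originals = ["the cat sat on the mat", "dogs bark at night"]
generated = ["the cat sat on a mat", "dogs bark loudly at night"]
print(f"BLEU: {bleu_percent(originals, generated):.2f}%")
```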
We conducted a human evaluation inspired by the Turing Test. Evaluators were asked to identify whether a given text was generated by a model or was part of the original dataset.
| Model | Human Accuracy (%) |
|---|---|
| n-grams | 100 |
| LLaMa2 | 100 |
| Falcon | 100 |
| GPT-Neo | 100 |
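As a rough sketch of how the per-model accuracy figures above could be tallied, the helper below counts how often an evaluator's guess ("model" vs. "original") matches the true source of a text. The record format and function name are assumptions for illustration, not the actual evaluation code.

```python
# Hypothetical scoring of the guessing game: each record holds the model,
# the evaluator's guess, and the true source of the text shown.
from collections import defaultdict

def human_accuracy(judgements):
    """Return per-model accuracy (%) of evaluators at spotting generated text."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, guess, truth in judgements:   # e.g. ("Falcon", "model", "model")
        total[model] += 1
        if guess == truth:
            correct[model] += 1
    return {m: 100 * correct[m] / total[m] for m in total}

# Toy example (not the real evaluation data):
judgements = [
    ("LLaMa2", "model", "model"),
    ("Falcon", "model", "model"),
    ("GPT-Neo", "original", "model"),  # evaluator fooled by the generation
]
print(human_accuracy(judgements))
```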
Below is an example screenshot of the Guessing Game interface:
Both evaluation methods have limitations:
- BLEU Score: may not fully capture how human-like or fluent the generated text is.
- Human Evaluation: subjective and potentially biased, since evaluators were aware of the models' weaknesses.
While automated metrics like BLEU offer quantitative insights, human evaluation remains the gold standard for assessing natural language generation quality.
For more details on the evaluation setup, visit the Model Training page.