Forza Horizon 5 is a 2021 racing video game set in an open-world environment based on a fictional representation of Mexico. We use a dataset of the game's cars to learn about car pricing in this game.
This experiment has two separate sections, each implementing a different ML concept. (1) A Gaussian Naive Bayes model is implemented with scikit-learn to predict prices, demonstrating Learning from Examples. However, its predictions turn out to be extremely inaccurate. (2) General-to-specific ordering over hypotheses is implemented.
- Read the `Forza_Horizon_Cars.csv` dataset.
- Call `preprocess(data)`, using the `preprocess()` function defined in `conditioning.py`:
  a. 'In_Game_Price' is the target variable for prediction, so we convert string prices to numerical values (remove commas)
  b. Handle '??' and 'info_not_found' values in all columns by replacing them with 0
  c. One-hot encode the 'stock_specs' column
  d. One-hot encode the 'Drive_Type' column
  e. Process a few other columns, converting them to floating point values
  f. Eliminate NaN values by replacing them with 0
- Define the required features in the `features` variable.
- Split the dataset into training and testing data using `train_test_split()`.
- Predict the prices using GaussianNB, write the predictions to a CSV file, and capture the performance measures of the model in a TXT file.
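Preprocessing steps (a), (b), (e) and (f) above can be sketched in plain Python as follows. This is a minimal illustration, not the actual `preprocess()` from `conditioning.py`: the helper names `clean_price`, `to_float` and `preprocess_row` are hypothetical, and a row is represented as a plain dict rather than a DataFrame row.

```python
# Markers that the dataset uses for missing values (step b).
MISSING_MARKERS = {"??", "info_not_found", ""}


def clean_price(raw):
    """Step a: turn a price string such as '1,500,000' into an int (missing -> 0)."""
    if raw is None or str(raw).strip() in MISSING_MARKERS:
        return 0
    return int(str(raw).replace(",", ""))


def to_float(raw):
    """Steps b, e, f: convert a value to float; missing, unparsable or NaN -> 0.0."""
    if raw is None or str(raw).strip() in MISSING_MARKERS:
        return 0.0
    try:
        value = float(str(raw).replace(",", ""))
    except ValueError:
        return 0.0
    # NaN is the only float that is not equal to itself.
    return 0.0 if value != value else value


def preprocess_row(row, numeric_columns):
    """Return a cleaned copy of one row (hypothetical helper)."""
    cleaned = dict(row)
    cleaned["In_Game_Price"] = clean_price(row.get("In_Game_Price"))
    for column in numeric_columns:
        cleaned[column] = to_float(row.get(column))
    return cleaned


# Illustrative row; the values are made up for the example.
sample = {
    "In_Game_Price": "1,500,000",
    "Horse_Power": "603",
    "Top_Speed": "??",
    "Weight_lbs": "info_not_found",
}
cleaned = preprocess_row(sample, ["Horse_Power", "Top_Speed", "Weight_lbs"])
print(cleaned)
```

Replacing missing values with 0 keeps every row usable, but it also means a 0 in columns such as Top_Speed stands for "unknown" rather than a real measurement.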
- Generate n hypotheses (defaulted to 100 hypotheses).
  a. Specificity levels are randomly assigned and shuffled to ensure randomness in selecting constraints' specificity.
  b. For each specificity level in specificity_levels:
     i. A random row is selected from the provided dataset (dataframe).
     ii. An empty hypothesis list is initialized to store column constraints.
  c. For each column in the randomly selected row, a decision is made based on the specificity level:
     i. If a randomly generated number falls within the specificity threshold (as defined by the level), the column and its specific value from the row are included in the hypothesis.
     ii. If the random number is outside the specificity threshold, a placeholder (?) is used to indicate an unspecified value for that column in the hypothesis.

  We ensure that 33% of hypotheses are specific, 33% are general and the rest lie in between.
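The generation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the level names, the threshold values in `SPECIFICITY_THRESHOLDS`, and the function names are all hypothetical, and rows are plain dicts instead of DataFrame rows.

```python
import random

# Assumed probability of keeping a column's concrete value at each level.
SPECIFICITY_THRESHOLDS = {"general": 0.2, "mid": 0.5, "specific": 0.95}


def make_specificity_levels(n, rng):
    """A third specific, a third general, the rest in between, then shuffled."""
    third = n // 3
    levels = ["specific"] * third + ["general"] * third + ["mid"] * (n - 2 * third)
    rng.shuffle(levels)
    return levels


def generate_hypotheses(rows, columns, n=100, seed=None):
    """Build n hypotheses; each is a list with one constraint per column."""
    rng = random.Random(seed)
    hypotheses = []
    for level in make_specificity_levels(n, rng):
        row = rng.choice(rows)  # random row that seeds this hypothesis
        threshold = SPECIFICITY_THRESHOLDS[level]
        hypothesis = []
        for column in columns:
            # Keep the row's value with probability `threshold`, else use "?".
            if rng.random() < threshold:
                hypothesis.append(row[column])
            else:
                hypothesis.append("?")
        hypotheses.append(hypothesis)
    return hypotheses


# Two illustrative rows using values that appear in the tables below.
columns = ["Drive_Type", "Horse_Power", "Weight_lbs"]
rows = [
    {"Drive_Type": "RWD", "Horse_Power": 60.0, "Weight_lbs": 1235.0},
    {"Drive_Type": "AWD", "Horse_Power": 603.0, "Weight_lbs": 4515.0},
]
for h in generate_hypotheses(rows, columns, n=6, seed=0):
    print(h)
```

Because every hypothesis is derived from a real row, it is guaranteed to match at least one instance in the dataset.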
- Map all the instances of records in the dataset to the generated hypotheses.
- Store this output to an excel document, and then visualize this data in a graph.
- Graph 1 would contain a scatter plot of all Instances X
- Graph 2 would contain a scatter plot of all Hypotheses H
- Hypotheses are sorted from bottom to top, from general to specific
- 3 Example Hypotheses and 3 sample corresponding instances are printed and highlighted/plotted and labelled on the graph.
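Mapping instances to hypotheses comes down to checking, column by column, whether an instance satisfies every constraint that is not `?`. The sketch below is one possible way to express this; the function names `matches` and `map_instances` are hypothetical, and it assumes hypothesis and instance values share the same types. The example values are taken from hypotheses 1 and 59 in the tables below.

```python
def matches(hypothesis, instance):
    """True if the instance satisfies every non-'?' constraint of the hypothesis."""
    if len(hypothesis) != len(instance):
        return False
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))


def map_instances(instances, hypotheses):
    """For each instance index, list the IDs of all hypotheses that match it."""
    return {
        index: [hid for hid, h in hypotheses.items() if matches(h, instance)]
        for index, instance in enumerate(instances)
    }


h1 = ["?", "?", False, True, False, False, "?", False, True,
      3.1, 4.3, 2.9, 5.3, 2.6, 5.9, 0.0, 0.0, 0.0, 60.0, 1235.0]
h59 = [True, False, False, False, False, False, True, False, False,
       7.6, 6.0, 8.7, 9.7, 4.8, 5.1, 0.0, 0.0, 0.0, 603.0, 4515.0]

x_a = [False, False, False, True, False, False, False, False, True,
       3.1, 4.3, 2.9, 5.3, 2.6, 5.9, 0.0, 0.0, 0.0, 60.0, 1235.0]
x_b = [True, False, False, False, False, False, True, False, False,
       7.6, 6.0, 8.7, 9.7, 4.8, 5.1, 0.0, 0.0, 0.0, 603.0, 4515.0]

mapping = map_instances([x_a, x_b], {"H1": h1, "H59": h59})
print(mapping)
```

Note that a value stored as the string `'3.1'` does not equal the float `3.1`, so the comparison only works if both sides went through the same preprocessing.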
- One-hot encoding is used to represent categorical data as binary vectors.
- We use it to convert categorical variables into a format that can be provided to machine learning algorithms to better interpret categorical data.
- Example: 'stock_specs' in our dataset assumes values of ['A', 'B', 'C', 'D', 'S1', 'S2', NaN, 'info_not_found']. These are considered as separate boolean features.
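As a small illustration of the encoding, the sketch below turns 'stock_specs' values into boolean columns named like the `stock_A` ... `stock_S2` columns in the tables further down. The function `one_hot` is hypothetical, and it assumes that missing values ('info_not_found', NaN) get no column of their own and simply produce an all-False row.

```python
def one_hot(values, categories, prefix):
    """Encode each value as a dict of boolean features, one per category."""
    return [
        {f"{prefix}_{category}": (value == category) for category in categories}
        for value in values
    ]


stock_classes = ["A", "B", "C", "D", "S1", "S2"]
encoded = one_hot(["A", "S1", "info_not_found"], stock_classes, "stock")

for row in encoded:
    print(row)
```

Each known class sets exactly one feature to True, so the model never treats the classes as if they had a numeric order.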
- Naive Bayes uses Bayes' theorem to predict the probability of a class given certain features.
- Assumes that all features are independent of each other, which is why it's termed "naive" (even if it's not always true in real-world scenarios).
- Bayes' theorem calculates the probability of a hypothesis (class) given the evidence (features) using conditional probabilities.
  Formula: P(Class|Features) = (P(Features|Class) * P(Class)) / P(Features)
- Gaussian Naive Bayes: Assumes that continuous features follow a Gaussian distribution.
- The assumption of feature independence might not hold true in real-world scenarios.
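Written out, the two assumptions above (feature independence and Gaussian-distributed features) give the following standard form. The symbols are introduced here for illustration and are not in the original: $C$ is a class, $x_1, \dots, x_n$ are the feature values, and $\mu_{C,i}$, $\sigma_{C,i}^2$ are the mean and variance of feature $i$ within class $C$.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
P(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_{C,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{C,i})^{2}}{2\sigma_{C,i}^{2}}\right)
```

The denominator P(Features) is the same for every class, so it can be dropped when only the most probable class is needed.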
- Here, h is a conjunction of constraints on attributes.
- Each constraint can be:
  - a specific value (e.g., "Water = Warm")
  - don't care (e.g., "Water = ?")
  - no value allowed (e.g., "Water = ∅") [Not used in our program]
- Given:
  - Instances X
  - Target function c: In_Game_Price -> int
  - Hypotheses H: Conjunctions of literals.
  - Training examples D
- Determine: A hypothesis h in H such that h(x) = c(x) for all x in D.
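For conjunctive hypotheses of this kind, the general-to-specific ordering can be checked constraint by constraint: h1 is at least as general as h2 if every constraint of h1 is either `?` or identical to the corresponding constraint of h2. The sketch below illustrates this; the function names are hypothetical, and sorting by the number of specified constraints is an assumed way to lay hypotheses out from general to specific, not necessarily the one used in the project.

```python
def more_general_or_equal(h1, h2):
    """True if every instance that satisfies h2 also satisfies h1."""
    return len(h1) == len(h2) and all(
        a == "?" or a == b for a, b in zip(h1, h2)
    )


def specificity(hypothesis):
    """Number of constraints that pin a column to a concrete value."""
    return sum(1 for constraint in hypothesis if constraint != "?")


def sort_general_to_specific(hypotheses):
    """Order hypotheses so that fewer specified constraints come first."""
    return sorted(hypotheses, key=specificity)


# Toy hypotheses over (stock class, drive type, horsepower).
general = ["?", "?", "?"]
middle = ["D", "?", "?"]
specific = ["D", "RWD", 60.0]
other = ["?", "AWD", "?"]

ordered = sort_general_to_specific([specific, general, middle])
print(ordered)
print(more_general_or_equal(middle, specific))  # middle covers specific
print(more_general_or_equal(other, middle))     # neither covers the other
```

The relation is only a partial order: `middle` and `other` are incomparable, which is why a count-based sort is used for the plot rather than pairwise comparison alone.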
The Gaussian Naive Bayes model fails on this dataset, returning the following scores:
- Accuracy: 0.0
- Precision: 0.0
- Recall: 0.0
- F1 Score: 0.0

This indicates that GNB is not a good fit for the data we are handling, and we must implement a different model.
- Unseen combinations of features can lead to zero probabilities, which distorts predictions.
- Sensitivity to irrelevant features can affect performance.
- Gaussian Naive Bayes (GNB) assumes that features are independent and have a Gaussian distribution. This assumption might not hold true for complex real-world data, especially when dealing with predicting car prices in a video game like Forza Horizon 5.
- Car prices in our dataset could have complex, non-linear relationships with the different attributes/features. Naive Bayes treats each feature as independent given the class, so it might not capture interactions between features accurately.
| Hypothesis Number | stock_A | stock_B | stock_C | stock_D | stock_S1 | stock_S2 | drive_AWD | drive_FWD | drive_RWD | speed | handling | acceleration | launch | braking | Offroad | Top_Speed | 0-60_Mph | g-force | Horse_Power | Weight_lbs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ? | ? | False | True | False | False | ? | False | True | 3.1 | 4.3 | 2.9 | 5.3 | 2.6 | 5.9 | 0.0 | 0.0 | 0.0 | 60.0 | 1235.0 |
| 6 | False | True | False | False | False | False | False | False | True | 5.7 | 3.8 | 4.1 | 3.1 | 3.0 | 6.4 | 157.8 | 6.2 | 0.82 | 410.0 | 3244.0 |
| 45 | False | False | True | False | False | False | False | False | True | 5.4 | 5.1 | 4.1 | 2.7 | 3.0 | 5.1 | 0.0 | 0.0 | 0.0 | 206.0 | 3329.0 |
| 49 | ? | False | False | False | False | False | ? | ? | ? | 7.3 | 6.5 | ? | 6.1 | 6.1 | 4.7 | 0.0 | ? | 0.0 | 608.0 | 4134.0 |
| 59 | True | False | False | False | False | False | True | False | False | 7.6 | 6.0 | 8.7 | 9.7 | 4.8 | 5.1 | 0.0 | 0.0 | 0.0 | 603.0 | 4515.0 |
| 79 | False | False | False | True | False | False | False | False | True | 4.8 | 4.1 | 3.8 | 3.3 | 2.5 | 5.2 | 130.5 | 7.3 | 0.81 | 148.0 | 2315.0 |
| 82 | False | False | True | False | False | False | False | False | True | 4.7 | 3.6 | 3.3 | 3.2 | 2.4 | 5.4 | 0.0 | 0.0 | 0.0 | 450.0 | 3799.0 |
| 90 | False | False | False | False | True | False | False | False | True | 7.9 | 7.2 | 6.3 | 6.8 | 6.9 | 4.1 | 219.9 | 3.0 | 1.02 | 562.0 | 3274.0 |
| 99 | ? | ? | ? | False | ? | ? | True | False | ? | 6.3 | 3.8 | 5.8 | 5.6 | 3.7 | 6.7 | ? | 0.0 | 0.0 | 475.0 | ? |
| stock_A | stock_B | stock_C | stock_D | stock_S1 | stock_S2 | drive_AWD | drive_FWD | drive_RWD | speed | handling | acceleration | launch | braking | Offroad | Top_Speed | 0-60_Mph | g-force | Horse_Power | Weight_lbs | Hypothesis Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 4.7 | 3.6 | 3.3 | 3.2 | 2.4 | 5.4 | 0 | 0 | 0 | 450 | 3799 | 82 |
| FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | 7.9 | 7.2 | 6.3 | 6.8 | 6.9 | 4.1 | 219.9 | 3 | 1.02 | 562 | 3274 | 90 |
| FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | 4.8 | 4.1 | 3.8 | 3.3 | 2.5 | 5.2 | 130.5 | 7.3 | 0.81 | 148 | 2315 | 79 |
| FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 5.7 | 3.8 | 4.1 | 3.1 | 3.0 | 6.4 | 157.8 | 6.2 | 0.82 | 410 | 3244 | 6 |
| TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | 7.6 | 6.0 | 8.7 | 9.7 | 4.8 | 5.1 | 0 | 0 | 0 | 603 | 4515 | 59 |
Instance ID: I6129,
Instance Data: [False, False, False, True, False, False, False, False, True, '3.1', '4.3', '2.9', '5.3', '2.6', '5.9', 0.0, 0.0, 0.0, 60.0, 1235.0],
Hypothesis ID: H1
Instance ID: I7845,
Instance Data: [False, False, False, True, False, False, False, False, True, '3.1', '4.3', '2.9', '5.3', '2.6', '5.9', 0.0, 0.0, 0.0, 60.0, 1235.0],
Hypothesis ID: H49
Instance ID: I8257,
Instance Data: [False, False, False, True, False, False, False, False, True, '3.1', '4.3', '2.9', '5.3', '2.6', '5.9', 0.0, 0.0, 0.0, 60.0, 1235.0],
Hypothesis ID: H99
