## Naive Bayes (by Kian Parchekani)
### Introduction
Naive Bayes classifiers are based on Bayesian classification methods that are derived from Bayes's theorem. This theorem is an equation that describes the relationship between the conditional probabilities of statistical quantities. In Bayesian classification, the objective is to find the probability of a label given a set of observed features. Bayes's theorem enables us to express this in terms of more computationally feasible quantities.
$$P(y|x_1, x_2, ..., x_n) = \frac{P(y) \times P(x_1|y) \times P(x_2|y) \times ... \times P(x_n|y)}{P(x_1, x_2, ..., x_n)}$$
The "naive" part of Naive Bayes comes from the assumption of independence between features, which is a simplifying assumption that allows for efficient computation and makes the algorithm particularly well-suited for high-dimensional data. However, this assumption may not hold true in all cases, and there are variants of Naive Bayes that relax this assumption. These classifiers are most commonly used in fields such as natural language processing, image recognition, and spam filtering.
##### Installation
Naive Bayes classifiers are available in `scikit-learn`, which can be installed with pip:
```
pip install scikit-learn
```
##### Types of Naive Bayes classifiers
There are three main types of Naive Bayes classifiers (a minimal import sketch follows the list):
1. Gaussian Naive Bayes - used for continuous input variables that are normally distributed
2. Multinomial Naive Bayes - used for discrete input variables such as text
3. Bernoulli Naive Bayes - used for binary input variables such as spam detection
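All three share the same `fit`/`predict` interface and live in the `sklearn.naive_bayes` module:
```{python}
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
```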
### Gaussian Naive Bayes
Gaussian Naive Bayes is possibly the simplest of the three, with the assumption being that data from each label is drawn from a simple Gaussian distribution.
In doing this, all the model needs for prediction is the mean and standard deviation of each target's distribution.
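Concretely, each per-class, per-feature likelihood is modeled as a normal density,
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y}^{2}}}\exp\!\left(-\frac{(x_i - \mu_{y})^{2}}{2\sigma_{y}^{2}}\right),$$
where the mean $\mu_{y}$ and variance $\sigma_{y}^{2}$ are estimated from the training samples of each class (one pair per feature).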
Before getting into anything, we need to load all the necessary packages, including those outside of `sklearn`.
`load_iris` is a dataset available from `sklearn` to use for classification testing.
```{python}
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
```
First, we load the iris dataset, transform it into a pandas dataframe, adjust the labels, and preview the data.
```{python}
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["target"] = iris.target
df["species"] = df["target"].replace(dict(enumerate(iris.target_names)))
print(df.head())
```
We plot the data using a pairplot to visualize the relationships between the features.
```{python}
sns.pairplot(df, hue="species")
plt.show()
```
As we can see, certain variables graphed against one another show distinct clusters. `GaussianNB()` captures these differences by fitting a separate Gaussian to each feature within each class, and uses those per-class distributions for classification.
Split the data into training and testing sets before calling `GaussianNB()` and fitting the data.
```{python}
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
```
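Since the fitted model stores only a per-class mean and variance for each feature, we can inspect them directly. A quick sketch (the attribute names assume a recent `scikit-learn` release, where the variances are exposed as `var_`; older versions used `sigma_`):
```{python}
# Per-class feature means and variances learned by Gaussian NB;
# both arrays have shape (n_classes, n_features).
print(gnb.theta_)
print(gnb.var_)
```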
We then create `y_pred`, evaluate the accuracy of the model, and view the confusion matrix.
```{python}
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
```
This dataset is designed for demonstrating classification methods, so it is no surprise that our model is perfect. In general, however, data with many targets/labels can be a good fit for Gaussian Naive Bayes, as in these instances the independence assumption tends to be less detrimental.
### Multinomial Naive Bayes Classifiers
Like Gaussian Naive Bayes, Multinomial Naive Bayes rests on a distributional assumption: the features are assumed to be generated from a simple multinomial distribution.
Multinomial Naive Bayes is well-suited for features that represent counts or count rates since the multinomial distribution models the probability of observing counts across multiple categories.
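Concretely, for each class $y$ the model estimates a smoothed probability for every feature (e.g., word) $i$ from training counts, roughly as described in the `scikit-learn` documentation:
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n},$$
where $N_{yi}$ is the count of feature $i$ in class $y$, $N_{y}$ is the total count over all features in class $y$, $n$ is the number of features, and $\alpha$ is a smoothing parameter (Laplace smoothing when $\alpha = 1$).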
##### Text Classification
One of the best uses for this is in text classification, as it deals with counts/frequencies of words.
To start this example, we load a dataset from `sklearn`, along with `MultinomialNB`.
`CountVectorizer` is also loaded, as we are no longer dealing with numerical data.
```{python}
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
```
This example will use the `fetch_20newsgroups` dataset, which is a collection of newsgroup posts on various topics. Now we split it into a training set and a test set.
We can also preview the data and see the target names, data, and targets.
```{python}
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
print(newsgroups_train.target_names)
print(newsgroups_train.data[0])
print(newsgroups_train.target[0])
```
Here we get an idea of our targets, as well as one data example and the target it belongs to.
We then preprocess the data, converting the raw text into numerical count features.
```{python}
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
```
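It can help to sanity-check what the vectorizer produced: a sparse document-term matrix with one row per post and one column per vocabulary token. A quick, optional check:
```{python}
# Shape of the training matrix and the size of the learned vocabulary.
print(X_train.shape)
print(len(vectorizer.vocabulary_))
```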
We then train the classifier and make our predictions using `.predict()`.
```{python}
clf = MultinomialNB()
clf.fit(X_train, newsgroups_train.target)
y_pred1 = clf.predict(X_test)
```
Finally, we evaluate the performance.
```{python}
accuracy1 = accuracy_score(newsgroups_test.target, y_pred1)
print('Accuracy:', accuracy1)
```
We can plot a heatmap of the confusion matrix with `seaborn` to better visualize the accuracy.
```{python}
conf_mat = confusion_matrix(newsgroups_test.target, y_pred1)
sns.heatmap(conf_mat, annot=True, cmap='Blues')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
```
The true labels are shown on the y-axis and the predicted labels are shown on the x-axis. The cells of the heatmap show the number of test instances that belong to a particular class (true label) and were classified as another class (predicted label).
### Complement Naive Bayes
In cases of imbalanced datasets, where certain targets have far more observations than others, `ComplementNB` can be used to improve performance on the under-represented targets. First we should view our targets.
```{python}
unique, counts = np.unique(newsgroups_test.target, return_counts=True)
for target_name, count in zip(newsgroups_test.target_names, counts):
print(f"{target_name}: {count}")
```
Our data is fairly balanced here, so `ComplementNB` may not be necessary.
However, using the `nyc311` data from the midterm, we can see that the frequency of values by `Agency` is very imbalanced.
```{python}
from sklearn.naive_bayes import ComplementNB
nyc = pd.read_csv('data/nyc311_011523-012123_by022023.csv')
nyc = nyc[['Descriptor', 'Agency']]
sns.countplot(x='Agency', data=nyc)
plt.show()
```
As we can see, the NYPD skews the data. We can try using this Naive Bayes classifier to predict what agency a call was made to based on the description.
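Roughly, following the `scikit-learn` documentation, `ComplementNB` estimates each class's feature statistics from the counts of all samples *not* in that class, which makes the estimates for small classes more stable:
$$\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:y_j \neq c} d_{ij}}{\alpha + \sum_{j:y_j \neq c} \sum_{k} d_{kj}},$$
where $d_{ij}$ is the count of feature $i$ in sample $j$ and $\alpha$ is a smoothing parameter; the resulting log-weights are then normalized before being used for prediction.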
We once again convert the data to vectors, fit our model, and evaluate the accuracy and confusion matrix.
```{python}
nyc = nyc.dropna()
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(nyc['Descriptor'])
y = nyc['Agency']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = ComplementNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Reds", robust=True)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
```
Our model is very accurate despite the very different class frequencies. Let's compare it to an ordinary Multinomial model.
```{python}
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Reds", robust=True)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
```
It appears that the Multinomial model actually has a higher accuracy; however, if we look a bit closer, we can see that the `ComplementNB` model does a better job of classifying some of the targets with less data. For instance, labels 1 and 5 are classified more accurately. This does not hold for all labels, though, and `ComplementNB` is often slightly less accurate overall, so its use is a matter of judgment and trial and error.
### Bernoulli Naive Bayes
Unlike the Multinomial Naive Bayes algorithm, which works with count-based features, Bernoulli Naive Bayes operates on binary features. This makes it particularly well-suited for modeling the presence or absence of certain words or phrases in a document, and makes it especially useful for spam detection. Each feature is assumed to be conditionally independent of all the other features given the class label.
$$P(x_i|y) = P(x_i = 1|y)\,x_i + \bigl(1 - P(x_i = 1|y)\bigr)(1 - x_i)$$
In this example, we look at the NYC Crash data and try to classify injuries based on the number-one contributing factor.
```{python}
from sklearn.naive_bayes import BernoulliNB
crash = pd.read_csv('data/nyc_crashes_202301_cleaned.csv')
crash["INJURY"] = crash["NUMBER OF PERSONS INJURED"].apply(lambda x: 1 if x >= 1 else 0)
crash = crash.dropna(subset=['INJURY', 'CONTRIBUTING FACTOR VEHICLE 1'])
```
Once again, we split the data into training and testing sets and create a `CountVectorizer` to convert the text data into a binary matrix.
```{python}
X_train, X_test, y_train, y_test = train_test_split(
crash["CONTRIBUTING FACTOR VEHICLE 1"], crash["INJURY"], test_size=0.2, random_state=42)
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
```
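A quick, optional check that the matrix really is binary rather than count-valued (not required for the model):
```{python}
# With binary=True, entries record token presence/absence, so the
# maximum value in the sparse matrix should be 1.
print(X_train.max())
```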
Then, a `BernoulliNB` object is created and fit to the training data. We then make predictions on the testing data.
```{python}
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred4 = bnb.predict(X_test)
```
Finally, we calculate the accuracy of the model and view the confusion matrix.
```{python}
accuracy4 = accuracy_score(y_test, y_pred4)
print(f"Accuracy: {accuracy4:.2f}")
cm = confusion_matrix(y_test, y_pred4)
print("Confusion Matrix:")
print(cm)
```
Obviously, our model has some shortcomings. Namely, the accuracy is not great, and it produces a high number of false negatives. That can partially be attributed to the variable we used, as there are not many distinctions that can be made within the text describing the contributing factor of each crash. With that being said, it is not a bad baseline, and with some tuning and larger, more diverse data to work with, it could classify injuries much better.
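One way to see the false-negative problem more directly is a per-class precision/recall summary; a minimal sketch using `classification_report` from `sklearn.metrics` (assuming the objects fitted above are still in scope):
```{python}
from sklearn.metrics import classification_report

# Precision, recall, and F1 for the no-injury (0) and injury (1) classes.
print(classification_report(y_test, y_pred4))
```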
### Conclusion
In conclusion, using Naive Bayes as a means of classification does have its benefits and shortcomings.
Pros:
- Efficiency
- Interpretability
- Few parameters
Cons:
- Simplicity
- Data does not always match the assumptions
- Few parameters
Overall, Naive Bayes classifiers are best used in finding a solid baseline to build off of.
### References
“1.9. Naive Bayes.” Scikit, https://scikit-learn.org/stable/modules/naive_bayes.html.
Learning, UCI Machine. “SMS Spam Collection Dataset.” Kaggle, 2 Dec. 2016, https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset?resource=download.
Vanderplas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, 2023.