sklearn-like api for a general implementation of Naive Bayes
0. Intro:
Although the computations that go into it can get tedious, Naive Bayes is one of the more accessible ML classification algorithms out there: it is easy to understand, and it makes sense intuitively.
However, the sklearn implementations of Naive Bayes (excluding GaussianNB) are built with certain assumptions that make them tough to use without a lot of pre-processing, as we explored in the previous post. This time, I’d like to implement the Naive Bayes idea for a general input dataset. I will not optimize the code, so it won’t be particularly scalable; the goal is just to hack something together quickly.
1. Refresher on Naive Bayes
Let’s start by revisiting the algorithm. In the following dataset, there are 4 feature columns used to predict whether someone will play tennis on a given day. Let’s take a quick look at this small dataset:
from __future__ import division
import pandas as pd
# A really simple dataset to demonstrate the algorithm
data = pd.read_csv(
    'https://raw.githubusercontent.com/petehunt/c4.5-compiler/master/example/tennis.csv',
    usecols=['outlook', 'temp', 'humidity', 'wind', 'play']
)
data
| | outlook | temp | humidity | wind | play |
|---|---|---|---|---|---|
0 | Sunny | Hot | High | Weak | No |
1 | Sunny | Hot | High | Strong | No |
2 | Overcast | Hot | High | Weak | Yes |
3 | Rain | Mild | High | Weak | Yes |
4 | Rain | Cool | Normal | Weak | Yes |
5 | Rain | Cool | Normal | Strong | No |
6 | Overcast | Cool | Normal | Strong | Yes |
7 | Sunny | Mild | High | Weak | No |
8 | Sunny | Cool | Normal | Weak | Yes |
9 | Rain | Mild | Normal | Weak | Yes |
10 | Sunny | Mild | Normal | Strong | Yes |
11 | Overcast | Mild | High | Strong | Yes |
12 | Overcast | Hot | Normal | Weak | Yes |
13 | Rain | Mild | High | Strong | No |
Given this dataset, we can use Bayesian thinking to predict whether or not someone will play tennis on a new day, given the weather. Specifically, Naive Bayes proceeds as follows:
Let’s say, on day 14, the weather is
outlook | temp | humidity | wind |
---|---|---|---|
Overcast | Mild | Normal | Weak |
We want to compute and compare the following two probabilities:

$P(\text{yes} \mid \text{overcast}, \text{mild}, \text{normal}, \text{weak})$ and $P(\text{no} \mid \text{overcast}, \text{mild}, \text{normal}, \text{weak})$.

Instead of computing these exactly, we compute proxies of them that preserve the comparison. Since this part is supposed to be a refresher only, I’ll only include the computation of the first case. Using Bayes’ rule together with the "naive" assumption that the columns are conditionally independent given the class, it goes like this:

$$P(\text{yes} \mid \text{o}, \text{m}, \text{n}, \text{w}) \;\propto\; P(\text{o} \mid \text{yes})\, P(\text{m} \mid \text{yes})\, P(\text{n} \mid \text{yes})\, P(\text{w} \mid \text{yes})\, P(\text{yes}),$$

where o, m, n, w stand for overcast, mild, normal and weak. The terms on the right side of the equation can be derived from looking at the dataset and counting. For example, $P(\text{mild} \mid \text{yes})$ is just the count of days with play = yes and temp = mild, divided by the count of days with play = yes. Let’s compute:
mild = data.temp == 'Mild'
yes = data.play == 'Yes'
print('P(mild | yes) = {}/{}'.format((mild&yes).sum(), yes.sum()))
P(mild | yes) = 4/9
The final term in the formula above is the class prior $P(\text{yes})$. That’s even simpler to compute, since it is just the count of rows with play = yes divided by the total number of rows.
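For instance, continuing with the yes mask defined in the cell above:

# P(yes) is just the fraction of days where play == 'Yes'
print('P(yes) = {}/{}'.format(yes.sum(), len(data)))
# P(yes) = 9/14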
Let’s put the pieces together by computing proxies for both probabilities in code. We get:
yes = data.play == 'Yes'
o = data.outlook == 'Overcast'
m = data.temp == 'Mild'
n = data.humidity == 'Normal'
w = data.wind == 'Weak'

ans = {
    'yes': yes.sum(),
    'no': (~yes).sum()
}
for elt in [o, m, n, w]:
    ans['yes'] *= (yes & elt).sum() / yes.sum()
    ans['no'] *= ((~yes) & elt).sum() / (~yes).sum()

print('P(yes | o, m, n, w) approximately is {}'.format(ans['yes']))
print('P(no | o, m, n, w) approximately is {}'.format(ans['no']))
P(yes | o, m, n, w) approximately is 0.79012345679
P(no | o, m, n, w) approximately is 0.0
Before we end this section, notice that the algorithm thinks there is 0 chance that the person does not play tennis under these circumstances (none of the play = no days in the training data are overcast, so $P(\text{overcast} \mid \text{no}) = 0$ and the whole product collapses to zero). Not only is that a strong opinion, it also causes numerical difficulties in a real-world implementation. To remedy this, one uses Laplace smoothing: add a constant $\alpha$ to every count so that no conditional probability is ever exactly zero.
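As a quick sketch of what the smoothing does to the hand computation above (reusing the o, m, n, w and yes masks, with $\alpha = 1$): every conditional count gets $+\alpha$ and every denominator gets $+\alpha$ times the number of distinct values in that column, so the 'no' estimate becomes strictly positive.

alpha = 1.0
no = ~yes
ans_no = no.sum()
for col, elt in zip(['outlook', 'temp', 'humidity', 'wind'], [o, m, n, w]):
    # number of distinct values this column can take (3 for outlook/temp, 2 for humidity/wind)
    n_vals = data[col].nunique()
    ans_no *= ((no & elt).sum() + alpha) / (no.sum() + alpha * n_vals)
print('smoothed proxy for P(no | o, m, n, w): {}'.format(ans_no))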
2. Implementation
2.a. Preprocessing
You may have noticed that in our toy dataset every attribute was categorical; there were no continuous numerical or ordinal fields. Also, unlike sklearn’s NB algorithms, we did not make any distributional assumptions about the incoming training data beyond the usual naive one: columns are assumed to be conditionally independent given the class. As a design choice, we will present a version of Naive Bayes that works with such a training set: every column will be treated as categorical, and we will implement the computations we carried out above. In practice, though, not every training set consists of categorical columns only. To that end, let’s first code a Discretizer object. The Discretizer will do the following:
- Takes in a continuous pandas series and bins the values.
- Remembers the cutoffs so that it can apply the same transformation later, i.e. find which bin a new value belongs to.
Feel free to skip over the code and jump straight to the usage example below.
from operator import itemgetter
from bisect import bisect_right
import numpy as np
class Discretizer(object):
    def __init__(self, upperlim=20, bottomlim=0, mapping=False):
        self.mapping = mapping
        self.set_lims(upperlim, bottomlim)

    @property
    def cutoffs(self):
        return [i[0] for i in self.mapping]

    def set_lims(self, upperlim, bottomlim):
        if not self.mapping:
            self.bottomlim = bottomlim
            self.upperlim = upperlim
        else:
            vals = sorted(np.unique(list(map(itemgetter(1), self.mapping))))
            self.bottomlim = vals[0]
            self.upperlim = vals[-1]
        assert self.bottomlim < self.upperlim

    def fit(self, continuous_series, subsample=None):
        self.mapping = []
        continuous_series = pd.Series(continuous_series).reset_index(drop=True).dropna()
        if subsample is not None:
            n = len(continuous_series)*subsample if subsample < 1 else subsample
            continuous_series = np.random.choice(continuous_series, int(n), replace=False)
        # rank-based (quantile) binning: map percentile ranks onto [bottomlim, upperlim]
        ranked = pd.Series(continuous_series).rank(pct=True, method='average')
        ranked *= self.upperlim - self.bottomlim
        ranked += self.bottomlim
        ranked = ranked.map(round)
        nvals = sorted(np.unique(ranked))  # sorted in case numpy changes
        for nval in nvals:
            cond = ranked == nval
            # remember the smallest raw value that fell into this bin as its cutoff
            self.mapping.append((continuous_series[cond].min(), int(nval)))

    def transform_single(self, val):
        if not self.mapping:
            raise NotImplementedError('Haven\'t been fitted yet')
        elif pd.isnull(val):
            return None
        i = bisect_right(self.cutoffs, val) - 1
        if i == -1:
            return 0
        return self.mapping[i][1]

    def transform(self, vals):
        if isinstance(vals, float):
            return self.transform_single(vals)
        elif vals is None:
            return None
        return pd.Series(vals).map(self.transform_single)

    def fit_transform(self, vals):
        self.fit(vals)
        return self.transform(vals)
The api is similar to sklearn, with fit, transform and fit_transform methods. Let’s see the usage on a simple example. You can check that the values of y are binned correctly by looking at the cutoffs.
D = Discretizer(upperlim=5)
y = np.random.uniform(0, 1, 1000)
D.fit(y)
print('cutoffs:')
print(D.cutoffs[1:])
print('-'*30)
print('values:')
print(y[:4])
print('-'*30)
print('bins:')
print(D.transform(y[:4]))
cutoffs:
[0.10316905194019832, 0.28798638771620189, 0.48254030703944994, 0.69792871673000578, 0.89006354032939106]
------------------------------
values:
[ 0.50908875 0.00113411 0.73404021 0.3087637 ]
------------------------------
bins:
0 3
1 0
2 4
3 2
dtype: int64
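Since the Discretizer remembers its cutoffs, it can also place a value it has not seen during fitting; for instance (the exact bin depends on the random y drawn above):

# bin a single new value using the learned cutoffs
print(D.transform(0.55))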
Now comes the part where we apply the Discretizer to the whole dataset. To that end, we will define a NaiveBayesPreprocessor object. If a field is discrete (i.e. categorical), it will leave it mostly untouched; in reality, it lumps every value that occurs less than 1% of the time into a catch-all OTHER category. If a field is continuous, it will bin it as above.
from collections import defaultdict


class NaiveBayesPreprocessor(object):
    """
    Bins continuous columns and prunes rare categorical values.
    Don't pass in NaNs; fill them (e.g. with the FILLNA keyword) first.
    """
    OTHER = '____OTHER____'
    FILLNA = '____NA____'

    def __init__(self, alpha=1.0, min_freq=0.01, bins=20):
        self.alpha = alpha        # Laplace smoothing
        self.min_freq = min_freq  # drop values occurring less frequently than this
        self.bins = bins          # number of bins for continuous fields

    def learn_continuous_transf(self, series):
        D = Discretizer(upperlim=self.bins)
        D.fit(series)
        return D

    def learn_discrete_transf(self, series):
        vcs = series.value_counts(dropna=False, normalize=True)
        vcs = vcs[vcs >= self.min_freq]
        keep = set(vcs.index)
        transf = lambda r: r if r in keep else self.OTHER
        return transf

    def learn_transf(self, series):
        if series.dtype == np.float64:
            return self.learn_continuous_transf(series)
        else:
            return self.learn_discrete_transf(series)

    def fit(self, X_orig, y=None):
        """
        Expects a pandas DataFrame
        """
        # get dtypes
        self.dtypes = defaultdict(set)
        for fld, dtype in X_orig.dtypes.iteritems():
            self.dtypes[dtype].add(fld)
        X = X_orig
        # X = X_orig.fillna(self.FILLNA)
        # learn one transformation per column
        self.transformations = {
            fld: self.learn_transf(series)
            for fld, series in X.iteritems()}

    def transform(self, X_orig, y=None):
        """
        Expects a pandas DataFrame
        """
        X = X_orig.copy()
        # X = X_orig.fillna(self.FILLNA)
        for fld, func in self.transformations.items():
            if isinstance(func, Discretizer):
                X[fld] = func.transform(X[fld])
            else:
                X[fld] = X[fld].map(func)
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
We will see the preprocessor in action shortly. Before that though, now is the time to code up the NaiveBayesClassifier.
2.b. NaiveBayesClassifier
The following class implements the vanilla Naive Bayes algorithm as seen above. I will try to stick to a sklearn-like api, with methods such as fit, predict, predict_proba, etc. Once again, the code is not optimal at all; the goal is to get to something working quickly.
class NaiveBayesClassifier(object):
    def __init__(self, alpha=1.0, class_priors=None, **kwargs):
        self.alpha = alpha
        self.class_priors = class_priors

    def get_class_log_priors(self, y):
        self.classes_ = y.unique()
        if self.class_priors is None:
            self.class_priors = y.value_counts(normalize=True)
        elif isinstance(self.class_priors, str) and self.class_priors == 'equal':
            raise NotImplementedError
        self.class_log_priors = self.class_priors.map(np.log)

    def get_log_likelihoods(self, fld):
        # rows: classes, columns: observed values of `fld`, cells: counts
        table = self.groups[fld].value_counts(dropna=False).unstack(fill_value=0)
        table += self.alpha  # Laplace smoothing
        sums = table.sum(axis=1)
        likelihoods = table.apply(lambda r: r/sums, axis=0)
        log_likelihoods = likelihoods.applymap(np.log)
        # map NaN column labels to None so that lookups at scoring time are consistent
        return {k if pd.notnull(k) else None: v for k, v in log_likelihoods.items()}

    def fit(self, X, y):
        y = pd.Series(y)
        self.get_class_log_priors(y)
        self.groups = X.groupby(y)
        self.log_likelihoods = {
            fld: self.get_log_likelihoods(fld)
            for fld, series in X.iteritems()
        }

    def get_approx_log_posterior(self, series, class_):
        log_posterior = self.class_log_priors[class_]  # prior
        for fld, val in series.iteritems():
            # there are cases where `val` has not been seen before,
            # e.g. a `nan` in the scoring dataset but no `nan`s in the training set;
            # in those cases we do not add anything to the log posterior.
            # Mapping NaNs to None here keeps the lookups consistent.
            val = val if pd.notnull(val) else None
            if val not in self.log_likelihoods[fld]:
                continue
            log_posterior += self.log_likelihoods[fld][val][class_]
        return log_posterior

    def decision_function_series(self, series):
        approx_log_posteriors = [
            self.get_approx_log_posterior(series, class_)
            for class_ in self.classes_]
        return pd.Series(approx_log_posteriors, index=self.classes_)

    def decision_function_df(self, df):
        return df.apply(self.decision_function_series, axis=1)

    def decision_function(self, X):
        """
        returns the approximate log posteriors
        """
        if isinstance(X, pd.DataFrame):
            return self.decision_function_df(X)
        elif isinstance(X, pd.Series):
            return self.decision_function_series(X)
        elif isinstance(X, dict):
            return self.decision_function_series(pd.Series(X))

    def predict_proba(self, X, normalize=True):
        """
        returns the (normalized) posterior probability.
        normalization is just division by the evidence; it doesn't change the argmax.
        """
        log_post = self.decision_function(X)
        if isinstance(log_post, pd.Series):
            post = log_post.map(np.exp)
        elif isinstance(log_post, pd.DataFrame):
            post = log_post.applymap(np.exp)
        else:
            raise NotImplementedError('type of X is "{}"'.format(type(X)))
        if normalize:
            if isinstance(post, pd.Series):
                post /= post.sum()
            elif isinstance(post, pd.DataFrame):
                post = post.div(post.sum(axis=1), axis=0)
        return post

    def predict(self, X):
        probas = self.decision_function(X)
        if isinstance(probas, pd.Series):
            return np.argmax(probas)
        return probas.apply(np.argmax, axis=1)

    def score(self, X, y):
        preds = self.predict(X)
        return np.mean(np.array(y) == preds.values)
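Before moving on to real datasets, here is a quick sanity check of the classifier on the toy tennis data from section 1 (a sketch; the posterior should heavily favour playing, in line with the hand computation above):

# fit on the toy data and score the day-14 weather
X_toy = data[['outlook', 'temp', 'humidity', 'wind']]
y_toy = data['play']

nb = NaiveBayesClassifier(alpha=1.0)
nb.fit(X_toy, y_toy)

day14 = {'outlook': 'Overcast', 'temp': 'Mild', 'humidity': 'Normal', 'wind': 'Weak'}
print(nb.predict_proba(day14))  # smoothed, normalized posterior over the two classes
print(nb.predict(day14))        # the predicted class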
3. Example usage
3.1. Wine quality dataset
The first example works with the wine quality dataset, hosted at https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.
We will load it locally and predict whether a given wine is red or white. The reader would need to do some minimal prework to get the dataset hosted at the link above into this format; a sketch of that prework follows.
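Something along these lines should do (a sketch: the file names and the ';' separator come from the UCI page, while the column renaming and the added color column are my assumptions to match the Wine_Quality_Data.csv file used below):

# combine the red and white wine files into a single dataframe
base = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/'
red = pd.read_csv(base + 'winequality-red.csv', sep=';')
white = pd.read_csv(base + 'winequality-white.csv', sep=';')
red['color'] = 'red'
white['color'] = 'white'
wine = pd.concat([red, white], ignore_index=True)
# e.g. 'fixed acidity' -> 'fixed_acidity', to match the column names shown below
wine.columns = [c.replace(' ', '_') for c in wine.columns]
wine.to_csv('Wine_Quality_Data.csv', index=False)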
Let’s take a quick look at the dataset:
winedata = pd.read_csv('~/metis/github/ct_intel_ml_curriculum/data/Wine_Quality_Data.csv')
print(winedata.shape)
print('-'*30)
print(winedata.color.value_counts(normalize=True))
winedata.sample(3)
(6497, 13)
------------------------------
white 0.753886
red 0.246114
Name: color, dtype: float64
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2196 | 7.0 | 0.29 | 0.37 | 4.9 | 0.034 | 26.0 | 127.0 | 0.99280 | 3.17 | 0.44 | 10.8 | 6 | white |
5327 | 6.2 | 0.20 | 0.33 | 5.4 | 0.028 | 21.0 | 75.0 | 0.99012 | 3.36 | 0.41 | 13.5 | 7 | white |
2911 | 9.6 | 0.25 | 0.54 | 1.3 | 0.040 | 16.0 | 160.0 | 0.99380 | 2.94 | 0.43 | 10.5 | 5 | white |
Let’s see what this dataset becomes once we transform it with the NaiveBayesPreprocessor:
NaiveBayesPreprocessor().fit_transform(winedata).sample(3)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5998 | 7 | 6 | 4 | 18 | 12 | 13 | 14 | 14 | 5 | 17 | 4 | 5 | white |
3876 | 13 | 0 | 9 | 2 | 3 | 8 | 6 | 6 | 19 | 3 | 12 | 6 | white |
2823 | 12 | 5 | 15 | 8 | 3 | 10 | 8 | 2 | 13 | 12 | 18 | 7 | white |
Observe that the last column (which is categorical) is left untouched.
Now let’s try the Naive Bayes algorithms on this.
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from collections import defaultdict
from sklearn.pipeline import Pipeline

pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
        ('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)

nbs = {'_vanilla_': pipe,
       'Multinomial': MultinomialNB(),
       'Bernoulli': BernoulliNB(),
       'Gaussian': GaussianNB()}

X, y = winedata[winedata.columns[:-1]], winedata[winedata.columns[-1]]
rs = ShuffleSplit(test_size=0.25, n_splits=5)
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | Bernoulli | Gaussian | Multinomial | _vanilla_ |
|---|---|---|---|---|
0 | 0.764308 | 0.973538 | 0.920615 | 0.987077 |
1 | 0.754462 | 0.969231 | 0.913846 | 0.983385 |
2 | 0.777231 | 0.971077 | 0.924308 | 0.988308 |
3 | 0.777846 | 0.966154 | 0.915077 | 0.990769 |
4 | 0.780308 | 0.965538 | 0.928615 | 0.989538 |
As you can see above, the Bernoulli and Multinomial Naive Bayes algorithms aren’t performing well, simply because this dataset isn’t suitable for them, as explored in the previous post. GaussianNB performs OK, but is beaten by our implementation of NB. To satisfy the curious among us, let’s also throw GradientBoostingClassifier and RandomForestClassifier (without any parameter tuning) at this:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

nbs = {'GBT': GradientBoostingClassifier(), 'RF': RandomForestClassifier()}
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | GBT | RF |
|---|---|---|
0 | 0.994462 | 0.994462 |
1 | 0.993846 | 0.995692 |
2 | 0.995077 | 0.993846 |
3 | 0.993231 | 0.993231 |
4 | 0.993231 | 0.993846 |
As expected, these more powerful algorithms yield better results. Let’s move on to our second example.
3.2. Human Activity Recognition using Smartphones
The dataset and its description can be found at https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones
activity_data = pd.read_csv('~/metis/github/ct_intel_ml_curriculum/data/Human_Activity_Recognition_Using_Smartphones_Data.csv')
print(activity_data.shape)
print('-'*30)
print(activity_data.Activity.value_counts(normalize=True))
activity_data.sample(3)
(10299, 562)
------------------------------
LAYING 0.188756
STANDING 0.185067
SITTING 0.172541
WALKING 0.167201
WALKING_UPSTAIRS 0.149917
WALKING_DOWNSTAIRS 0.136518
Name: Activity, dtype: float64
| | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | fBodyBodyGyroJerkMag-skewness() | fBodyBodyGyroJerkMag-kurtosis() | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5882 | 0.275918 | -0.018486 | -0.100284 | -0.993155 | -0.958860 | -0.962892 | -0.993935 | -0.953637 | -0.960433 | -0.934424 | ... | -0.605982 | -0.868852 | 0.048834 | 0.094434 | -0.922991 | 0.715411 | -0.833794 | 0.212639 | 0.007696 | STANDING |
1263 | 0.324749 | -0.012294 | -0.053416 | -0.336888 | 0.157011 | -0.387988 | -0.409819 | 0.152768 | -0.426130 | -0.069544 | ... | -0.252935 | -0.653851 | -0.294646 | -0.449551 | 0.601912 | -0.138953 | -0.790079 | 0.242435 | 0.002180 | WALKING |
9466 | 0.287556 | -0.018570 | -0.108500 | -0.984497 | -0.981637 | -0.987341 | -0.985297 | -0.982858 | -0.987223 | -0.927790 | ... | -0.656437 | -0.892907 | 0.034444 | 0.195809 | 0.352398 | -0.039287 | 0.411160 | -0.307855 | -0.682102 | LAYING |
3 rows × 562 columns
Because this dataset contains negative numbers, MultinomialNB will not work with it. We could shift everything up by the minimum value to make it work, but, again, that just doesn’t make a lot of sense here.
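If one insisted, the shift might look like the following sketch (the X_act / y_act names are mine, and this is not used anywhere below):

# shift each column so that its minimum becomes 0, just to make MultinomialNB run
X_act = activity_data[activity_data.columns[:-1]]
y_act = activity_data[activity_data.columns[-1]]
MultinomialNB().fit(X_act - X_act.min(), y_act)

Instead, we simply leave MultinomialNB out of the comparison: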
pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
        ('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)

nbs = {'_vanilla_': pipe,
       # 'Multinomial': MultinomialNB(),
       'Bernoulli': BernoulliNB(),
       'Gaussian': GaussianNB()}

X, y = activity_data[activity_data.columns[:-1]], activity_data[activity_data.columns[-1]]
rs = ShuffleSplit(test_size=0.25, n_splits=5)
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | Bernoulli | Gaussian | _vanilla_ |
|---|---|---|---|
0 | 0.850485 | 0.725049 | 0.793398 |
1 | 0.840388 | 0.699417 | 0.792621 |
2 | 0.846214 | 0.716893 | 0.764660 |
3 | 0.852039 | 0.776311 | 0.801165 |
4 | 0.853592 | 0.739806 | 0.801942 |
This time, BernoulliNB does well. This is because it binarizes the dataset before fitting the Bernoulli Naive Bayes model, and the default threshold it uses to binarize is 0. Incidentally, thresholding at 0 happens to carry a lot of signal for predicting the activity. In the wine example it added almost no value, since every feature there was nonnegative and nearly every value binarized to 1.
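The thresholding is exposed through BernoulliNB's binarize parameter (0.0 by default), so binarizing by hand should give essentially the same result; a quick sketch, reusing the last train/test split from the loop above:

# BernoulliNB thresholds each feature at `binarize` before fitting, so doing the
# thresholding by hand and switching the built-in binarization off is equivalent
Xb_train, Xb_test = (X_train > 0).astype(int), (X_test > 0).astype(int)
print(BernoulliNB().fit(X_train, y_train).score(X_test, y_test))
print(BernoulliNB(binarize=None).fit(Xb_train, y_train).score(Xb_test, y_test))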
At this point, one could do some parameter tuning, play with the number of bins, etc. In any case, this is not a great dataset for Naive Bayes type algorithms, but I wanted to see how this implementation does on such an example.
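For instance, a rough sweep over the number of bins could look like this sketch (reusing the last train/test split; a proper comparison would cross-validate):

# try a few bin counts for the vanilla pipeline
for bins in [5, 10, 20, 40]:
    pipe_b = Pipeline([('preprocessor', NaiveBayesPreprocessor(bins=bins)),
                       ('nb', NaiveBayesClassifier())])
    pipe_b.fit(X_train, y_train)
    print(bins, pipe_b.score(X_test, y_test))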
In the next post, I will explore a weighted version of this implementation of Naive Bayes, and use it as a weak learner in a boosting scheme.