sklearn-like api for a general implementation of Naive Bayes
0. Intro:
Although the computations that go into it can get tedious, Naive Bayes is one of the more accessible ML classification algorithms out there: it is easy to understand, and it makes sense intuitively.
However, the sklearn implementations of Naive Bayes (excluding GaussianNB) are built with certain assumptions that make them tough to use without a lot of pre-processing, as we explored in the previous post. This time, I’d like to implement the Naive Bayes idea for a general input dataset. I will not optimize the code, so it won’t be particularly scalable; the goal is just to hack something together quickly.
1. Refresher on Naive Bayes
Let’s start by revisiting the algorithm. In the following dataset, there are 4 feature columns used to predict whether someone will play tennis on a given day. Let’s take a quick look at this small dataset:
from __future__ import division
import pandas as pd
# A really simple dataset to demonstrate the algorithm
data = pd.read_csv(
    'https://raw.githubusercontent.com/petehunt/c4.5-compiler/master/example/tennis.csv',
    usecols=['outlook', 'temp', 'humidity', 'wind', 'play']
)
data
| | outlook | temp | humidity | wind | play |
|---|---|---|---|---|---|
0 | Sunny | Hot | High | Weak | No |
1 | Sunny | Hot | High | Strong | No |
2 | Overcast | Hot | High | Weak | Yes |
3 | Rain | Mild | High | Weak | Yes |
4 | Rain | Cool | Normal | Weak | Yes |
5 | Rain | Cool | Normal | Strong | No |
6 | Overcast | Cool | Normal | Strong | Yes |
7 | Sunny | Mild | High | Weak | No |
8 | Sunny | Cool | Normal | Weak | Yes |
9 | Rain | Mild | Normal | Weak | Yes |
10 | Sunny | Mild | Normal | Strong | Yes |
11 | Overcast | Mild | High | Strong | Yes |
12 | Overcast | Hot | Normal | Weak | Yes |
13 | Rain | Mild | High | Strong | No |
Given this dataset, we can use Bayesian thinking to predict whether or not someone will play tennis on a new day, given the weather. Specifically, Naive Bayes proceeds as follows:
Let’s say, on day 14, the weather is
outlook | temp | humidity | wind |
---|---|---|---|
Overcast | Mild | Normal | Weak |
We want to compute and compare the following two probabilities:

$P(\text{yes} \mid \text{overcast}, \text{mild}, \text{normal}, \text{weak})$ and $P(\text{no} \mid \text{overcast}, \text{mild}, \text{normal}, \text{weak})$.

Instead of computing these exactly, we compute proxies of them that preserve the comparison. Since this part is supposed to be a refresher only, I’ll only include the computation of the first case. Using Bayes’ rule together with the "naive" assumption that the columns are conditionally independent given the class, it goes like this:

$$P(\text{yes} \mid \text{o}, \text{m}, \text{n}, \text{w}) \;\propto\; P(\text{o} \mid \text{yes})\, P(\text{m} \mid \text{yes})\, P(\text{n} \mid \text{yes})\, P(\text{w} \mid \text{yes})\, P(\text{yes}),$$

where o, m, n, w stand for overcast, mild, normal and weak. The terms on the right side of the equation can be derived from looking at the dataset and counting. For example, $P(\text{mild} \mid \text{yes})$ is just the count of days with play = yes and temp = mild, divided by the count of days with play = yes. Let’s compute:
mild = data.temp == 'Mild'
yes = data.play == 'Yes'
print('P(mild | yes) = {}/{}'.format((mild&yes).sum(), yes.sum()))
P(mild | yes) = 4/9
The final term in the formula above is the class prior $P(\text{yes})$. That’s even simpler to compute, since it is just the count of rows with play = yes divided by the total number of rows.
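For instance, continuing with the yes mask defined in the cell above:

# P(yes) is just the fraction of days where play == 'Yes'
print('P(yes) = {}/{}'.format(yes.sum(), len(data)))
# P(yes) = 9/14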
Let’s put the pieces together by computing proxies for both probabilities in code. We get:
yes = data.play == 'Yes'
o = data.outlook == 'Overcast'
m = data.temp == 'Mild'
n = data.humidity == 'Normal'
w = data.wind == 'Weak'

ans = {
    'yes': yes.sum(),
    'no': (~yes).sum()
}
for elt in [o, m, n, w]:
    ans['yes'] *= (yes & elt).sum() / yes.sum()
    ans['no'] *= ((~yes) & elt).sum() / (~yes).sum()

print('P(yes | o, m, n, w) approximately is {}'.format(ans['yes']))
print('P(no | o, m, n, w) approximately is {}'.format(ans['no']))
P(yes | o, m, n, w) approximately is 0.79012345679
P(no | o, m, n, w) approximately is 0.0
Before we end this section, notice that the algorithm thinks there is 0 chance that the person does not play tennis under these circumstances (none of the play = no days in the training data are overcast, so $P(\text{overcast} \mid \text{no}) = 0$ and the whole product collapses to zero). Not only is that a strong opinion, it also causes numerical difficulties in a real-world implementation. To remedy this, one uses Laplace smoothing: add a constant $\alpha$ to every count so that no conditional probability is ever exactly zero.
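As a quick sketch of what the smoothing does to the hand computation above (reusing the o, m, n, w and yes masks, with $\alpha = 1$): every conditional count gets $+\alpha$ and every denominator gets $+\alpha$ times the number of distinct values in that column, so the 'no' estimate becomes strictly positive.

alpha = 1.0
no = ~yes
ans_no = no.sum()
for col, elt in zip(['outlook', 'temp', 'humidity', 'wind'], [o, m, n, w]):
    # number of distinct values this column can take (3 for outlook/temp, 2 for humidity/wind)
    n_vals = data[col].nunique()
    ans_no *= ((no & elt).sum() + alpha) / (no.sum() + alpha * n_vals)
print('smoothed proxy for P(no | o, m, n, w): {}'.format(ans_no))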
2. Implementation
2.a. Preprocessing
You may have noticed that in our toy dataset every attribute was categorical; there were no continuous numerical or ordinal fields. Also, unlike sklearn’s NB algorithms, we did not make any distributional assumptions about the incoming training data beyond the usual naive one: columns are assumed to be conditionally independent given the class. As a design choice, we will present a version of Naive Bayes that works with such a training set: every column will be treated as categorical, and we will implement the computations we carried out above. In practice, though, not every training set consists of categorical columns only. To that end, let’s first code a Discretizer object. The Discretizer will do the following:
- Takes in a continuous pandas series and bins the values.
- Remembers the cutoffs so that it can apply the same transformation later, i.e. find which bin a new value belongs to.
Feel free to skip over the code and jump straight to the usage example below.
from operator import itemgetter
from bisect import bisect_right
import numpy as np
class Discretizer(object):
    def __init__(self, upperlim=20, bottomlim=0, mapping=False):
        self.mapping = mapping
        self.set_lims(upperlim, bottomlim)

    @property
    def cutoffs(self):
        return [i[0] for i in self.mapping]

    def set_lims(self, upperlim, bottomlim):
        if not self.mapping:
            self.bottomlim = bottomlim
            self.upperlim = upperlim
        else:
            vals = sorted(np.unique(list(map(itemgetter(1), self.mapping))))
            self.bottomlim = vals[0]
            self.upperlim = vals[-1]
        assert self.bottomlim < self.upperlim

    def fit(self, continuous_series, subsample=None):
        self.mapping = []
        continuous_series = pd.Series(continuous_series).reset_index(drop=True).dropna()
        if subsample is not None:
            n = len(continuous_series)*subsample if subsample < 1 else subsample
            continuous_series = np.random.choice(continuous_series, int(n), replace=False)
        # rank-based (quantile) binning: map percentile ranks onto [bottomlim, upperlim]
        ranked = pd.Series(continuous_series).rank(pct=True, method='average')
        ranked *= self.upperlim - self.bottomlim
        ranked += self.bottomlim
        ranked = ranked.map(round)
        nvals = sorted(np.unique(ranked))  # sorted in case numpy changes
        for nval in nvals:
            cond = ranked == nval
            # remember the smallest raw value that fell into this bin as its cutoff
            self.mapping.append((continuous_series[cond].min(), int(nval)))

    def transform_single(self, val):
        if not self.mapping:
            raise NotImplementedError('Haven\'t been fitted yet')
        elif pd.isnull(val):
            return None
        i = bisect_right(self.cutoffs, val) - 1
        if i == -1:
            return 0
        return self.mapping[i][1]

    def transform(self, vals):
        if isinstance(vals, float):
            return self.transform_single(vals)
        elif vals is None:
            return None
        return pd.Series(vals).map(self.transform_single)

    def fit_transform(self, vals):
        self.fit(vals)
        return self.transform(vals)
The api is similar to sklearn, with fit, transform and fit_transform methods. Let’s see the usage on a simple example. You can check that the values of y are binned correctly by looking at the cutoffs.
D = Discretizer(upperlim=5)
y = np.random.uniform(0, 1, 1000)
D.fit(y)
print('cutoffs:')
print(D.cutoffs[1:])
print('-'*30)
print('values:')
print(y[:4])
print('-'*30)
print('bins:')
print(D.transform(y[:4]))
cutoffs:
[0.10316905194019832, 0.28798638771620189, 0.48254030703944994, 0.69792871673000578, 0.89006354032939106]
------------------------------
values:
[ 0.50908875 0.00113411 0.73404021 0.3087637 ]
------------------------------
bins:
0 3
1 0
2 4
3 2
dtype: int64
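Since the Discretizer remembers its cutoffs, it can also place a value it has not seen during fitting; for instance (the exact bin depends on the random y drawn above):

# bin a single new value using the learned cutoffs
print(D.transform(0.55))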
Now comes the part where we apply the Discretizer to the whole dataset. To that end, we will define a NaiveBayesPreprocessor object. If a field is discrete (i.e. categorical), it will leave it mostly untouched; in reality, it lumps every value that occurs less than 1% of the time into a catch-all OTHER category. If a field is continuous, it will bin it as above.
from collections import defaultdict


class NaiveBayesPreprocessor(object):
    """
    Bins continuous columns and prunes rare categorical values.
    Don't pass in NaNs; fill them (e.g. with the FILLNA keyword) first.
    """
    OTHER = '____OTHER____'
    FILLNA = '____NA____'

    def __init__(self, alpha=1.0, min_freq=0.01, bins=20):
        self.alpha = alpha        # Laplace smoothing
        self.min_freq = min_freq  # drop values occurring less frequently than this
        self.bins = bins          # number of bins for continuous fields

    def learn_continuous_transf(self, series):
        D = Discretizer(upperlim=self.bins)
        D.fit(series)
        return D

    def learn_discrete_transf(self, series):
        vcs = series.value_counts(dropna=False, normalize=True)
        vcs = vcs[vcs >= self.min_freq]
        keep = set(vcs.index)
        transf = lambda r: r if r in keep else self.OTHER
        return transf

    def learn_transf(self, series):
        if series.dtype == np.float64:
            return self.learn_continuous_transf(series)
        else:
            return self.learn_discrete_transf(series)

    def fit(self, X_orig, y=None):
        """
        Expects a pandas DataFrame
        """
        # get dtypes
        self.dtypes = defaultdict(set)
        for fld, dtype in X_orig.dtypes.iteritems():
            self.dtypes[dtype].add(fld)
        X = X_orig
        # X = X_orig.fillna(self.FILLNA)
        # learn one transformation per column
        self.transformations = {
            fld: self.learn_transf(series)
            for fld, series in X.iteritems()}

    def transform(self, X_orig, y=None):
        """
        Expects a pandas DataFrame
        """
        X = X_orig.copy()
        # X = X_orig.fillna(self.FILLNA)
        for fld, func in self.transformations.items():
            if isinstance(func, Discretizer):
                X[fld] = func.transform(X[fld])
            else:
                X[fld] = X[fld].map(func)
        return X

    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
We will see the preprocessor in action shortly. Before that though, now is the time to code up the NaiveBayesClassifier.
2.b. NaiveBayesClassifier
The following class implements the vanilla Naive Bayes algorithm as seen above. I will try to stick to a sklearn-like api, with methods such as fit, predict, predict_proba, etc. Once again, the code is not optimal at all; the goal is to get to something working quickly.
class NaiveBayesClassifier(object):
    def __init__(self, alpha=1.0, class_priors=None, **kwargs):
        self.alpha = alpha
        self.class_priors = class_priors

    def get_class_log_priors(self, y):
        self.classes_ = y.unique()
        if self.class_priors is None:
            self.class_priors = y.value_counts(normalize=True)
        elif isinstance(self.class_priors, str) and self.class_priors == 'equal':
            raise NotImplementedError
        self.class_log_priors = self.class_priors.map(np.log)

    def get_log_likelihoods(self, fld):
        # rows: classes, columns: observed values of `fld`, cells: counts
        table = self.groups[fld].value_counts(dropna=False).unstack(fill_value=0)
        table += self.alpha  # Laplace smoothing
        sums = table.sum(axis=1)
        likelihoods = table.apply(lambda r: r/sums, axis=0)
        log_likelihoods = likelihoods.applymap(np.log)
        # map NaN column labels to None so that lookups at scoring time are consistent
        return {k if pd.notnull(k) else None: v for k, v in log_likelihoods.items()}

    def fit(self, X, y):
        y = pd.Series(y)
        self.get_class_log_priors(y)
        self.groups = X.groupby(y)
        self.log_likelihoods = {
            fld: self.get_log_likelihoods(fld)
            for fld, series in X.iteritems()
        }

    def get_approx_log_posterior(self, series, class_):
        log_posterior = self.class_log_priors[class_]  # prior
        for fld, val in series.iteritems():
            # there are cases where `val` has not been seen before,
            # e.g. a `nan` in the scoring dataset but no `nan`s in the training set;
            # in those cases we do not add anything to the log posterior.
            # Mapping NaNs to None here keeps the lookups consistent.
            val = val if pd.notnull(val) else None
            if val not in self.log_likelihoods[fld]:
                continue
            log_posterior += self.log_likelihoods[fld][val][class_]
        return log_posterior

    def decision_function_series(self, series):
        approx_log_posteriors = [
            self.get_approx_log_posterior(series, class_)
            for class_ in self.classes_]
        return pd.Series(approx_log_posteriors, index=self.classes_)

    def decision_function_df(self, df):
        return df.apply(self.decision_function_series, axis=1)

    def decision_function(self, X):
        """
        returns the approximate log posteriors
        """
        if isinstance(X, pd.DataFrame):
            return self.decision_function_df(X)
        elif isinstance(X, pd.Series):
            return self.decision_function_series(X)
        elif isinstance(X, dict):
            return self.decision_function_series(pd.Series(X))

    def predict_proba(self, X, normalize=True):
        """
        returns the (normalized) posterior probability.
        normalization is just division by the evidence; it doesn't change the argmax.
        """
        log_post = self.decision_function(X)
        if isinstance(log_post, pd.Series):
            post = log_post.map(np.exp)
        elif isinstance(log_post, pd.DataFrame):
            post = log_post.applymap(np.exp)
        else:
            raise NotImplementedError('type of X is "{}"'.format(type(X)))
        if normalize:
            if isinstance(post, pd.Series):
                post /= post.sum()
            elif isinstance(post, pd.DataFrame):
                post = post.div(post.sum(axis=1), axis=0)
        return post

    def predict(self, X):
        probas = self.decision_function(X)
        if isinstance(probas, pd.Series):
            return np.argmax(probas)
        return probas.apply(np.argmax, axis=1)

    def score(self, X, y):
        preds = self.predict(X)
        return np.mean(np.array(y) == preds.values)
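Before moving on to real datasets, here is a quick sanity check of the classifier on the toy tennis data from section 1 (a sketch; the posterior should heavily favour playing, in line with the hand computation above):

# fit on the toy data and score the day-14 weather
X_toy = data[['outlook', 'temp', 'humidity', 'wind']]
y_toy = data['play']

nb = NaiveBayesClassifier(alpha=1.0)
nb.fit(X_toy, y_toy)

day14 = {'outlook': 'Overcast', 'temp': 'Mild', 'humidity': 'Normal', 'wind': 'Weak'}
print(nb.predict_proba(day14))  # smoothed, normalized posterior over the two classes
print(nb.predict(day14))        # the predicted class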
3. Example usage
3.1. Wine quality dataset
The first example works with the wine quality dataset, hosted at https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.
We will load it locally and predict whether a given wine is red or white. The reader would need to do some minimal prework to get the dataset hosted at the link above into this format; a sketch of that prework follows.
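Something along these lines should do (a sketch: the file names and the ';' separator come from the UCI page, while the column renaming and the added color column are my assumptions to match the Wine_Quality_Data.csv file used below):

# combine the red and white wine files into a single dataframe
base = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/'
red = pd.read_csv(base + 'winequality-red.csv', sep=';')
white = pd.read_csv(base + 'winequality-white.csv', sep=';')
red['color'] = 'red'
white['color'] = 'white'
wine = pd.concat([red, white], ignore_index=True)
# e.g. 'fixed acidity' -> 'fixed_acidity', to match the column names shown below
wine.columns = [c.replace(' ', '_') for c in wine.columns]
wine.to_csv('Wine_Quality_Data.csv', index=False)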
Let’s take a quick look at the dataset:
winedata = pd.read_csv('~/metis/github/ct_intel_ml_curriculum/data/Wine_Quality_Data.csv')
print(winedata.shape)
print('-'*30)
print(winedata.color.value_counts(normalize=True))
winedata.sample(3)
(6497, 13)
------------------------------
white 0.753886
red 0.246114
Name: color, dtype: float64
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2196 | 7.0 | 0.29 | 0.37 | 4.9 | 0.034 | 26.0 | 127.0 | 0.99280 | 3.17 | 0.44 | 10.8 | 6 | white |
5327 | 6.2 | 0.20 | 0.33 | 5.4 | 0.028 | 21.0 | 75.0 | 0.99012 | 3.36 | 0.41 | 13.5 | 7 | white |
2911 | 9.6 | 0.25 | 0.54 | 1.3 | 0.040 | 16.0 | 160.0 | 0.99380 | 2.94 | 0.43 | 10.5 | 5 | white |
Let’s see what this dataset becomes once we transform it with the NaiveBayesPreprocessor:
NaiveBayesPreprocessor().fit_transform(winedata).sample(3)
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5998 | 7 | 6 | 4 | 18 | 12 | 13 | 14 | 14 | 5 | 17 | 4 | 5 | white |
3876 | 13 | 0 | 9 | 2 | 3 | 8 | 6 | 6 | 19 | 3 | 12 | 6 | white |
2823 | 12 | 5 | 15 | 8 | 3 | 10 | 8 | 2 | 13 | 12 | 18 | 7 | white |
Observe that the last column (which is categorical) is left untouched.
Now let’s try the Naive Bayes algorithms on this.
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from collections import defaultdict
from sklearn.pipeline import Pipeline

pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
        ('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)

nbs = {'_vanilla_': pipe,
       'Multinomial': MultinomialNB(),
       'Bernoulli': BernoulliNB(),
       'Gaussian': GaussianNB()}

X, y = winedata[winedata.columns[:-1]], winedata[winedata.columns[-1]]
rs = ShuffleSplit(test_size=0.25, n_splits=5)
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | Bernoulli | Gaussian | Multinomial | _vanilla_ |
|---|---|---|---|---|
0 | 0.764308 | 0.973538 | 0.920615 | 0.987077 |
1 | 0.754462 | 0.969231 | 0.913846 | 0.983385 |
2 | 0.777231 | 0.971077 | 0.924308 | 0.988308 |
3 | 0.777846 | 0.966154 | 0.915077 | 0.990769 |
4 | 0.780308 | 0.965538 | 0.928615 | 0.989538 |
As you can see above, the Bernoulli and Multinomial Naive Bayes algorithms aren’t performing well, simply because this dataset isn’t suitable for them, as explored in the previous post. GaussianNB performs OK, but is beaten by our implementation of NB. To satisfy the curious among us, let’s also throw GradientBoostingClassifier and RandomForestClassifier (without any parameter tuning) at this:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

nbs = {'GBT': GradientBoostingClassifier(), 'RF': RandomForestClassifier()}
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | GBT | RF |
|---|---|---|
0 | 0.994462 | 0.994462 |
1 | 0.993846 | 0.995692 |
2 | 0.995077 | 0.993846 |
3 | 0.993231 | 0.993231 |
4 | 0.993231 | 0.993846 |
As expected, these more powerful algorithms yield better results. Let’s move on to our second example.
3.2. Human Activity Recognition using Smartphones
The dataset and its description can be found at https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones
activity_data = pd.read_csv('~/metis/github/ct_intel_ml_curriculum/data/Human_Activity_Recognition_Using_Smartphones_Data.csv')
print(activity_data.shape)
print('-'*30)
print(activity_data.Activity.value_counts(normalize=True))
activity_data.sample(3)
(10299, 562)
------------------------------
LAYING 0.188756
STANDING 0.185067
SITTING 0.172541
WALKING 0.167201
WALKING_UPSTAIRS 0.149917
WALKING_DOWNSTAIRS 0.136518
Name: Activity, dtype: float64
| | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | fBodyBodyGyroJerkMag-skewness() | fBodyBodyGyroJerkMag-kurtosis() | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5882 | 0.275918 | -0.018486 | -0.100284 | -0.993155 | -0.958860 | -0.962892 | -0.993935 | -0.953637 | -0.960433 | -0.934424 | ... | -0.605982 | -0.868852 | 0.048834 | 0.094434 | -0.922991 | 0.715411 | -0.833794 | 0.212639 | 0.007696 | STANDING |
1263 | 0.324749 | -0.012294 | -0.053416 | -0.336888 | 0.157011 | -0.387988 | -0.409819 | 0.152768 | -0.426130 | -0.069544 | ... | -0.252935 | -0.653851 | -0.294646 | -0.449551 | 0.601912 | -0.138953 | -0.790079 | 0.242435 | 0.002180 | WALKING |
9466 | 0.287556 | -0.018570 | -0.108500 | -0.984497 | -0.981637 | -0.987341 | -0.985297 | -0.982858 | -0.987223 | -0.927790 | ... | -0.656437 | -0.892907 | 0.034444 | 0.195809 | 0.352398 | -0.039287 | 0.411160 | -0.307855 | -0.682102 | LAYING |
3 rows × 562 columns
Because this dataset contains negative numbers, MultinomialNB will not work with it. We could shift everything up by the minimum value to make it work, but, again, that just doesn’t make a lot of sense here.
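If one insisted, the shift might look like the following sketch (the X_act / y_act names are mine, and this is not used anywhere below):

# shift each column so that its minimum becomes 0, just to make MultinomialNB run
X_act = activity_data[activity_data.columns[:-1]]
y_act = activity_data[activity_data.columns[-1]]
MultinomialNB().fit(X_act - X_act.min(), y_act)

Instead, we simply leave MultinomialNB out of the comparison: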
pipe = [('preprocessor', NaiveBayesPreprocessor(bins=20)),
        ('nb', NaiveBayesClassifier())]
pipe = Pipeline(pipe)

nbs = {'_vanilla_': pipe,
       # 'Multinomial': MultinomialNB(),
       'Bernoulli': BernoulliNB(),
       'Gaussian': GaussianNB()}

X, y = activity_data[activity_data.columns[:-1]], activity_data[activity_data.columns[-1]]
rs = ShuffleSplit(test_size=0.25, n_splits=5)
scores = defaultdict(list)
for train_index, test_index in rs.split(X):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    for key, est in nbs.items():
        est.fit(X_train, y_train)
        scores[key].append(est.score(X_test, y_test))
pd.DataFrame(scores)
| | Bernoulli | Gaussian | _vanilla_ |
|---|---|---|---|
0 | 0.850485 | 0.725049 | 0.793398 |
1 | 0.840388 | 0.699417 | 0.792621 |
2 | 0.846214 | 0.716893 | 0.764660 |
3 | 0.852039 | 0.776311 | 0.801165 |
4 | 0.853592 | 0.739806 | 0.801942 |
This time, BernoulliNB does well. This is because it binarizes the dataset before fitting the Bernoulli Naive Bayes model, and the default threshold it uses to binarize is 0. Incidentally, thresholding at 0 happens to carry a lot of signal for predicting the activity. In the wine example it added almost no value, since every feature there was nonnegative and nearly every value binarized to 1.
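The thresholding is exposed through BernoulliNB's binarize parameter (0.0 by default), so binarizing by hand should give essentially the same result; a quick sketch, reusing the last train/test split from the loop above:

# BernoulliNB thresholds each feature at `binarize` before fitting, so doing the
# thresholding by hand and switching the built-in binarization off is equivalent
Xb_train, Xb_test = (X_train > 0).astype(int), (X_test > 0).astype(int)
print(BernoulliNB().fit(X_train, y_train).score(X_test, y_test))
print(BernoulliNB(binarize=None).fit(Xb_train, y_train).score(Xb_test, y_test))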
At this point, one could do some parameter tuning, play with the number of bins, etc. In any case, this is not a great dataset for Naive Bayes type algorithms, but I wanted to see how this implementation does on such an example.
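For instance, a rough sweep over the number of bins could look like this sketch (reusing the last train/test split; a proper comparison would cross-validate):

# try a few bin counts for the vanilla pipeline
for bins in [5, 10, 20, 40]:
    pipe_b = Pipeline([('preprocessor', NaiveBayesPreprocessor(bins=bins)),
                       ('nb', NaiveBayesClassifier())])
    pipe_b.fit(X_train, y_train)
    print(bins, pipe_b.score(X_test, y_test))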
In the next post, I will explore a weighted version of this implementation of Naive Bayes, and use it as a weak learner in a boosting scheme.