Naive Bayes Parameter Estimation
Maximum likelihood estimates
Consider the following training data:
n        1    2    3    4    5    6    7    8
yn      -1   -1   -1   +1   +1   +1   +1   +1
xn[1]   -1   -1   +1   +1   +1   +1   -1   -1
xn[2]    1    2    3    1    1    1    2    3
xn[3]   1.0  2.0  3.0  1.0  1.0  1.0  2.0  3.0
Naive Bayes. The data above show a training set Dtr = {(xn, yn ∈ {+1, −1})}_{n=1}^{8} for binary classification, where xn[1] ∈ {+1, −1} (binary), xn[2] ∈ {1, 2, 3} (categorical), and xn[3] ∈ R (please use Gaussian distributions). Let us fit a Naive Bayes model to it; i.e., p(y|x) ∝ p(x[1]|y) p(x[2]|y) p(x[3]|y) p(y). Derive the maximum likelihood estimates (MLE) for the parameters and answer the following questions.
Q 1. What is the value of p(Y = +1)?
Q 2. What is the value of p(x[1] = −1 | Y = +1)?
Q 3. What is the value of p(x[2] = 1 | Y = +1)?
Q 4. What is the value of p(x[2] = 3 | Y = −1)?
Q 5. What is the value of p(x[3] = 1.5 | Y = +1)?
Q 6. What is the value of p(x[3] = 0.0 | Y = −1)?
Q 7. What is the predicted label ŷ given x = [−1, 3, 2.5]^⊤?
Answer:
To derive the maximum likelihood estimates for the Naive Bayes model parameters, we need to estimate the prior probabilities and the conditional probabilities of each feature given the class label. We can use the following formulas:
p(Y=y) = count(Y=y) / N
p(x[i]=v|Y=y) = count(x[i]=v, Y=y) / count(Y=y)
where count(X=x,Y=y) denotes the number of examples in the training set with feature X=x and class label Y=y, and N is the total number of examples in the training set.
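As a quick sanity check of these counting formulas (a throwaway sketch, assuming the data table exactly as printed above), the following Python snippet reproduces the categorical estimates used in the answers below:

y  = [-1, -1, -1, +1, +1, +1, +1, +1]
x1 = [-1, -1, +1, +1, +1, +1, -1, -1]
x2 = [ 1,  2,  3,  1,  1,  1,  2,  3]

def prior(c):
    # p(Y=c) = count(Y=c) / N
    return sum(yn == c for yn in y) / len(y)

def cond(feat, v, c):
    # p(x[i]=v | Y=c) = count(x[i]=v, Y=c) / count(Y=c)
    return sum(f == v and yn == c for f, yn in zip(feat, y)) / sum(yn == c for yn in y)

print(prior(+1))          # 0.625     (Q1)
print(cond(x1, -1, +1))   # 0.4       (Q2)
print(cond(x2, 1, +1))    # 0.6       (Q3)
print(cond(x2, 3, -1))    # 0.333...  (Q4)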
- The value of p(Y = +1) is:
count(Y = +1) / N = 5/8 = 0.625
- The value of p(x[1] = −1 | Y = +1) is:
count(x[1] = −1, Y = +1) / count(Y = +1) = 2/5 = 0.4
- The value of p(x[2] = 1 | Y = +1) is:
count(x[2] = 1, Y = +1) / count(Y = +1) = 3/5 = 0.6
- The value of p(x[2] = 3 | Y = −1) is:
count(x[2] = 3, Y = −1) / count(Y = −1) = 1/3 ≈ 0.3333
- The value of p(x[3] = 1.5 | Y = +1) is calculated using a Gaussian density fitted by MLE to the x[3] values of the Y = +1 examples (1.0, 1.0, 1.0, 2.0, 3.0), which gives mean 1.6 and variance 0.64:
p(x[3] = 1.5 | Y = +1) = N(x[3] = 1.5; mean = 1.6, var = 0.64) ≈ 0.495
where N(x; mean, var) denotes the probability density function of a Gaussian distribution with the given mean and variance.
- The value of p(x[3] = 0.0 | Y = −1) is likewise calculated using a Gaussian density fitted to the Y = −1 examples (1.0, 2.0, 3.0), which gives mean 2.0 and variance 2/3:
p(x[3] = 0.0 | Y = −1) = N(x[3] = 0.0; mean = 2.0, var = 2/3) ≈ 0.0243
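The Gaussian pieces can be checked the same way. The sketch below (again taking the table at face value) fits the per-class MLE mean and variance of x[3] and evaluates the two densities asked for in Q5 and Q6:

import math

y  = [-1, -1, -1, +1, +1, +1, +1, +1]
x3 = [1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 2.0, 3.0]

def gauss_params(c):
    # MLE mean and (biased, divide-by-n) variance of x[3] among class-c examples
    vals = [v for v, yn in zip(x3, y) if yn == c]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    return mu, var

def gauss_pdf(v, mu, var):
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu_pos, var_pos = gauss_params(+1)        # (1.6, 0.64)
mu_neg, var_neg = gauss_params(-1)        # (2.0, 0.666...)
print(gauss_pdf(1.5, mu_pos, var_pos))    # ~0.495   (Q5)
print(gauss_pdf(0.0, mu_neg, var_neg))    # ~0.0243  (Q6)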
7. To predict the label ŷ given x = [−1, 3, 2.5]^⊤, we compute an unnormalized posterior score for each class and choose the class with the higher score:
p(Y = +1 | x) ∝ p(Y = +1) · p(x[1] = −1 | Y = +1) · p(x[2] = 3 | Y = +1) · p(x[3] = 2.5 | Y = +1)
= 0.625 · 0.4 · 0.2 · N(x[3] = 2.5; mean = 1.6, var = 0.64)
= 0.625 · 0.4 · 0.2 · 0.2649
≈ 0.0132
p(Y = −1 | x) ∝ p(Y = −1) · p(x[1] = −1 | Y = −1) · p(x[2] = 3 | Y = −1) · p(x[3] = 2.5 | Y = −1)
= 0.375 · (2/3) · (1/3) · N(x[3] = 2.5; mean = 2.0, var = 2/3)
= 0.375 · 0.6667 · 0.3333 · 0.4051
≈ 0.0338
Since 0.0338 > 0.0132, the predicted label ŷ given x = [−1, 3, 2.5]^⊤ is Y = −1.
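A self-contained sketch of the whole Q7 computation is given below; it reads the query point as x = [−1, 3, 2.5]^⊤ (the bare "1" in the question taken as −1, as in the answers above), so the numbers are only as reliable as that reading of the data:

import math

# rows are (x[1], x[2], x[3], y), copied from the table above
data = [(-1, 1, 1.0, -1), (-1, 2, 2.0, -1), (+1, 3, 3.0, -1),
        (+1, 1, 1.0, +1), (+1, 1, 1.0, +1), (+1, 1, 1.0, +1),
        (-1, 2, 2.0, +1), (-1, 3, 3.0, +1)]

def score(c, q1, q2, q3):
    # unnormalized posterior: p(Y=c) * p(x[1]=q1|c) * p(x[2]=q2|c) * N(q3; mu_c, var_c)
    rows = [r for r in data if r[3] == c]
    prior = len(rows) / len(data)
    p1 = sum(r[0] == q1 for r in rows) / len(rows)
    p2 = sum(r[1] == q2 for r in rows) / len(rows)
    mu = sum(r[2] for r in rows) / len(rows)
    var = sum((r[2] - mu) ** 2 for r in rows) / len(rows)
    p3 = math.exp(-(q3 - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return prior * p1 * p2 * p3

s_pos = score(+1, -1, 3, 2.5)
s_neg = score(-1, -1, 3, 2.5)
print(round(s_pos, 4), round(s_neg, 4))      # ~0.0132 vs ~0.0338
print(+1 if s_pos > s_neg else -1)           # -1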
Naive Bayes as linear classifiers
Question
Given a training set Dtr = {(xn ∈ R^D, yn ∈ {+1, −1})}_{n=1}^{N} for binary classification, let us fit a Naive Bayes model to it; i.e., p(y|x) ∝ p(y) ∏_{d=1}^{D} p(x[d]|y), where p(x[d]|y) is a Gaussian distribution N(μ_{d,y}, σ_d^2). That is, we assume that for each dimension d, the two one-dimensional Gaussian distributions (one for each class) share the same variance σ_d^2. For p(y), let p(+1) = π.
Please show
1. p(y|x) can be re-written as 1 / (1 + e^{−y(w^⊤x + b)}) for y ∈ {+1, −1}.
2. What the corresponding w and b are. Your answers should be based on the notations μ_{d,y}, σ_d, and π. For w ∈ R^D, you may simply write the expression of w[d] for a specific d. Similar to the HW #2 GDA question, this question shows you that under certain assumptions, Naive Bayes will lead to a linear classifier. Please write down your derivation with no more than 15 lines for each answer.
Hint-1: You do not need to perform maximum likelihood estimation. You can assume that μ_{d,y} and σ_d are given and fixed.
Hint-2: p(x) = p(y = +1, x) + p(y = −1, x), which comes from the marginal probability.
Answer
- Part 1: Starting with Bayes' rule:
p(y|x) = p(x|y)p(y) / p(x)
The Naive Bayes assumption says the dimensions are conditionally independent given the class, so:
p(x|y) = p(x[1]|y) p(x[2]|y) … p(x[D]|y)
Using the Gaussian assumption:
p(x[d]|y) = (1 / sqrt(2π σ_d^2)) exp(−(x[d] − μ_{d,y})^2 / (2 σ_d^2)),
where μ_{d,y} is the mean of dimension d for class y and σ_d^2 is the variance of dimension d shared by both classes.
Since the two classes are mutually exclusive and exhaustive, 1 − p(y|x) = p(−y|x), and by Hint-2 the evidence p(x) cancels in the ratio, so the log-odds are
log( p(y|x) / (1 − p(y|x)) ) = log( p(x|y)p(y) / (p(x|−y)p(−y)) )
= log( p(y) / p(−y) ) + Σ_{d=1}^{D} [ log p(x[d]|y) − log p(x[d]|−y) ]
= log( p(y) / p(−y) ) + Σ_{d=1}^{D} [ (x[d] − μ_{d,−y})^2 − (x[d] − μ_{d,y})^2 ] / (2 σ_d^2)
= log( p(y) / p(−y) ) + Σ_{d=1}^{D} [ (μ_{d,y} − μ_{d,−y}) x[d] − (μ_{d,y}^2 − μ_{d,−y}^2)/2 ] / σ_d^2,
where the quadratic terms x[d]^2 cancel exactly because the two classes share the same variance σ_d^2.
Denote this log-odds by a_y, so that p(y|x) = 1 / (1 + e^{−a_y}). For y = +1, a_{+1} is an affine function of x, i.e., a_{+1} = w^⊤x + b; swapping the roles of the two classes gives a_{−1} = −a_{+1} = −(w^⊤x + b). Hence, for both classes,
p(y|x) = 1 / (1 + e^{−y(w^⊤x + b)}), y ∈ {+1, −1}.
- Part 2: Reading off the coefficients of the affine function a_{+1} for y = +1, with p(+1) = π and p(−1) = 1 − π, we get
w[d] = (μ_{d,+1} − μ_{d,−1}) / σ_d^2
b = log( π / (1 − π) ) − Σ_{d=1}^{D} (μ_{d,+1}^2 − μ_{d,−1}^2) / (2 σ_d^2)
Thus p(y = +1|x) = 1 / (1 + exp(−(w^⊤x + b))), which is exactly the logistic regression model. This shows that under the shared-variance Gaussian assumption, Naive Bayes leads to a linear classifier.
Note that the derivation assumes the class prior is known and fixed to p(+1) = π. If we instead estimate the class prior from the training data, only b changes (through the log(π/(1−π)) term); w does not depend on π.
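As a numeric spot-check of this closed form (not part of the required 15-line derivation), the Python sketch below draws arbitrary values for μ_{d,y}, σ_d^2, π, and x, computes the Naive Bayes posterior directly via the marginal p(x) from Hint-2, and compares it with the logistic form built from the w and b above; every concrete number in it is made up for illustration:

import math, random

random.seed(0)
D = 4
pi = 0.3                                                   # the class prior p(y = +1)
mu = {c: [random.gauss(0, 1) for _ in range(D)] for c in (+1, -1)}
sigma2 = [random.uniform(0.5, 2.0) for _ in range(D)]      # shared per-dimension variances
x = [random.gauss(0, 1) for _ in range(D)]

def gauss(v, m, s2):
    return math.exp(-(v - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

# Naive Bayes posterior p(y=+1 | x), computed directly via p(x) = p(+1, x) + p(-1, x)
joint = {c: (pi if c == +1 else 1 - pi)
            * math.prod(gauss(x[d], mu[c][d], sigma2[d]) for d in range(D))
         for c in (+1, -1)}
posterior = joint[+1] / (joint[+1] + joint[-1])

# The closed form derived above: w[d] = (mu_{d,+1} - mu_{d,-1}) / sigma_d^2,
# b = log(pi / (1 - pi)) - sum_d (mu_{d,+1}^2 - mu_{d,-1}^2) / (2 sigma_d^2)
w = [(mu[+1][d] - mu[-1][d]) / sigma2[d] for d in range(D)]
b = math.log(pi / (1 - pi)) - sum((mu[+1][d] ** 2 - mu[-1][d] ** 2) / (2 * sigma2[d]) for d in range(D))
logistic = 1 / (1 + math.exp(-(sum(w[d] * x[d] for d in range(D)) + b)))

print(abs(posterior - logistic) < 1e-9)    # True: the two expressions agree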
Equivalent problems for linear SVM
Q:
Given a linearly separable training set {(xn ∈ R^D, yn ∈ {+1, −1})}_{n=1}^{N}, the following optimization problem
min_{w,b} (1/2) w^⊤w, s.t. min_n yn(w^⊤xn + b) = 1    (1)
is equivalent to
min_{w,b} (1/2) w^⊤w, s.t. ∀n, yn(w^⊤xn + b) ≥ 1,    (2)
in the sense that they have the same solution. Please show that, if (ŵ, b̂) is a solution to Equation 2, then min_n yn(ŵ^⊤xn + b̂) = 1. In other words, (ŵ, b̂) will also be a solution to Equation 1.
Please show your derivation and reasoning with no more than 15 lines.
Hint-1: Note that Equation 2 is a relaxed optimization problem of Equation 1, since if (w, b) satisfies min_n yn(w^⊤xn + b) = 1, then it satisfies ∀n, yn(w^⊤xn + b) ≥ 1. In other words, {w s.t. min_n yn(w^⊤xn + b) = 1} ⊂ {w s.t. ∀n, yn(w^⊤xn + b) ≥ 1}. This also tells us that if (ŵ^(1), b̂^(1)) is a solution of Equation 1 and (ŵ^(2), b̂^(2)) is a solution of Equation 2, then (1/2) ŵ^(2)⊤ŵ^(2) ≤ (1/2) ŵ^(1)⊤ŵ^(1).
Hint-2: You may prove by contradiction. That is, if min_n yn(ŵ^⊤xn + b̂) > 1, you can find another pair (w̄, b̄) such that min_n yn(w̄^⊤xn + b̄) = 1 and (1/2) w̄^⊤w̄ < (1/2) ŵ^⊤ŵ, which contradicts the claim that (ŵ, b̂) is a solution of Equation 2. To apply this, you must show what (w̄, b̄) will be (e.g., in terms of (ŵ, b̂)).
Extension: Note that, by solving this problem, you also show that if (ŵ, b̂) is a solution to Equation 2, then ∃n such that yn(ŵ^⊤xn + b̂) = 1. We call these examples the support vectors.
Answer
We prove by contradiction. Suppose (ŵ, b̂) is a solution to Equation 2 but min_n yn(ŵ^⊤xn + b̂) = γ > 1 (feasibility for Equation 2 already guarantees γ ≥ 1).
Construct the new pair w̄ = ŵ/γ, b̄ = b̂/γ. Dividing every constraint yn(ŵ^⊤xn + b̂) ≥ γ by γ > 1 gives
∀n, yn(w̄^⊤xn + b̄) = yn(ŵ^⊤xn + b̂)/γ ≥ 1, and min_n yn(w̄^⊤xn + b̄) = γ/γ = 1,
so (w̄, b̄) is feasible for Equation 2. Its objective, however, is strictly smaller:
(1/2) w̄^⊤w̄ = (1/2) ŵ^⊤ŵ / γ^2 < (1/2) ŵ^⊤ŵ,
since γ > 1 and ŵ ≠ 0 (w = 0 cannot satisfy all the constraints when both classes are present).
This contradicts the assumption that (ŵ, b̂) is a solution of Equation 2. Therefore min_n yn(ŵ^⊤xn + b̂) = 1, and (ŵ, b̂) is also a solution to Equation 1.
This result also implies that some training examples attain yn(ŵ^⊤xn + b̂) = 1, i.e., they satisfy the constraint in Equation 2 with equality; these examples are called the support vectors.
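To make the rescaling step concrete, here is a small numeric illustration on a made-up toy dataset (not from the homework): a pair (w, b) that satisfies the constraints of Equation 2 with margin γ > 1 is rescaled to (w/γ, b/γ), which stays feasible, attains minimum margin exactly 1, and has a strictly smaller objective:

import numpy as np

# toy linearly separable data, chosen only for this illustration
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

w, b = np.array([1.0, 1.0]), 0.0                  # feasible for Equation 2, but not tight
gamma = np.min(y * (X @ w + b))
print(gamma)                                      # 3.0 > 1

w_bar, b_bar = w / gamma, b / gamma               # the rescaled pair used in the proof
print(np.min(y * (X @ w_bar + b_bar)))            # 1.0: still feasible, now tight
print(0.5 * w @ w, 0.5 * w_bar @ w_bar)           # 1.0 vs ~0.111: strictly smaller objective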
Hinge-loss for linear SVM
Q:
Hinge loss for linear SVM. Given a training set {(xn ∈ R^D, yn ∈ {+1, −1})}_{n=1}^{N}, the following optimization problem
min_{w,b,{ξn}} (1/2) w^⊤w + C Σ_n ξn, s.t. ∀n, yn(w^⊤xn + b) ≥ 1 − ξn, ξn ≥ 0    (3)
is equivalent to
min_{w,b} (1/2) w^⊤w + C Σ_n max{1 − yn(w^⊤xn + b), 0},    (4)
in the sense that they have the same solution.
Let us first rewrite the constraint in Equation 3 as ∀n, ξn ≥ 1 − yn(w^⊤xn + b), ξn ≥ 0; that is, ∀n, ξn ≥ max{1 − yn(w^⊤xn + b), 0}. Please show that if (ŵ, b̂, {ξ̂n}) is a solution to Equation 3, then ∀n, ξ̂n = max{1 − yn(ŵ^⊤xn + b̂), 0}. Please show your derivation and reasoning with no more than 15 lines.
Hint-1: Note that Equation 3 is a relaxed optimization problem of Equation 4. Can you tell this now, following the hints in section 3?
Hint-2: You may prove by contradiction. If ∃n such that ξ̂n > max{1 − yn(ŵ^⊤xn + b̂), 0}, then you can find another (w̄, b̄, {ξ̄n}) such that for the same n, ξ̄n = max{1 − yn(w̄^⊤xn + b̄), 0} and (1/2) w̄^⊤w̄ + C Σ_n ξ̄n < (1/2) ŵ^⊤ŵ + C Σ_n ξ̂n, which contradicts the claim that (ŵ, b̂, {ξ̂n}) is a solution to Equation 3. To apply this, you must show one such tuple (w̄, b̄, {ξ̄n}) (e.g., in terms of (ŵ, b̂, {ξ̂n})).
Answer
We rewrite the constraint in Equation 3 as
∀n, ξn ≥ 1 − yn(w^⊤xn + b), ξn ≥ 0, i.e., ∀n, ξn ≥ max{1 − yn(w^⊤xn + b), 0}.
That is, each ξn must be at least the amount by which the n-th example violates the margin (and at least 0).
Now suppose (ŵ, b̂, {ξ̂n}) is a solution to Equation 3 and, for contradiction, that ∃n with ξ̂n > max{1 − yn(ŵ^⊤xn + b̂), 0}. Construct (w̄, b̄, {ξ̄m}) with w̄ = ŵ, b̄ = b̂, ξ̄m = ξ̂m for all m ≠ n, and ξ̄n = max{1 − yn(ŵ^⊤xn + b̂), 0}. This tuple still satisfies all constraints of Equation 3, because ξ̄n ≥ 1 − yn(w̄^⊤xn + b̄) and ξ̄n ≥ 0 by construction. But since C > 0 and ξ̄n < ξ̂n,
(1/2) w̄^⊤w̄ + C Σ_m ξ̄m < (1/2) ŵ^⊤ŵ + C Σ_m ξ̂m,
which contradicts the claim that (ŵ, b̂, {ξ̂n}) is a solution to Equation 3. Hence ∀n, ξ̂n = max{1 − yn(ŵ^⊤xn + b̂), 0}.
Substituting these slacks back into the objective of Equation 3 turns it into exactly the objective of Equation 4 evaluated at (ŵ, b̂), which is why the two problems share the same solution in (w, b).
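The same argument can be checked numerically: for a fixed (w, b), setting each slack to max{1 − yn(w^⊤xn + b), 0} is feasible for Equation 3, any extra slack only increases the objective, and at these slacks the objective of Equation 3 equals the hinge objective of Equation 4. The snippet below uses random, made-up data purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.choice([-1, +1], size=10)
w, b, C = rng.normal(size=3), 0.5, 1.0            # an arbitrary fixed (w, b) and C > 0

margins = y * (X @ w + b)
xi_star = np.maximum(0.0, 1.0 - margins)          # the claimed optimal slacks

def objective3(xi):
    # objective of Equation 3, with a feasibility check on the slacks
    assert np.all(xi >= 1.0 - margins - 1e-12) and np.all(xi >= -1e-12)
    return 0.5 * w @ w + C * xi.sum()

objective4 = 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).sum()

print(np.isclose(objective3(xi_star), objective4))        # True
print(objective3(xi_star + 0.1) > objective3(xi_star))    # True: extra slack only costs more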