Classification and Representation

Classification

To attempt classification, one method is to use linear regression and map all predictions greater than $0.5$ as a $1$, and all less than $0.5$ as a $0$. However, this method doesn’t work well because classification is not actually a linear function.
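
As a hedged sketch (not from the original notes; NumPy and the made-up prediction values are assumptions), the thresholding rule described above would look like this:

```python
import numpy as np

# Hypothetical linear-regression outputs on a few examples
# (values made up for illustration; note some fall outside [0, 1]).
predictions = np.array([0.2, 0.45, 0.51, 0.9, 1.3, -0.2])

# Map predictions >= 0.5 to class 1, everything else to class 0.
labels = (predictions >= 0.5).astype(int)
print(labels)  # [0 0 1 1 1 0]
```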

The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.

For now, we will focus on the binary classification problem in which $y$ can take only two values, $0$ and $1$. (Most of what we say here will also generalize to the multiple-class case.)

For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be $1$ if it is a piece of spam mail, and $0$ otherwise.

Hence $y \in \{ 0, 1 \}$. $0$ is also called the negative class, and $1$ the positive class, and they are sometimes also denoted by the symbols “-” and “+”. Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.

Hypothesis Representation

We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn’t make sense for $h_{\theta}(x)$ to take values larger than $1$ or smaller than $0$ when we know that $y \in \{ 0, 1 \}$.

To fix this, let’s change the form of our hypothesis $h_{\theta}(x)$ to satisfy $0 \le h_{\theta}(x) \le 1$. This is accomplished by plugging $\theta^Tx$ into the Logistic Function.

Our new form uses the Sigmoid Function, also called the Logistic Function:

$$ h_{\theta}(x) = g(\theta^Tx) \\ z = \theta^Tx \\ g(z) = \frac{1}{1+ e^{-z}} $$
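
A minimal Python sketch of this hypothesis, assuming NumPy (the helper names `sigmoid` and `hypothesis` are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x)."""
    return sigmoid(np.dot(theta, x))

# g(z) maps any real number into (0, 1):
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```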

The following image shows us what the sigmoid function looks like:

The function $g(z)$ shown here maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary real-valued function into a function better suited for classification.

$h_{\theta}(x)$ will give us the probability that our output is $1$.

For example, $h_{\theta}(x) = 0.7$ gives us a probability of $70\%$ that our output is $1$. The probability that our prediction is $0$ is just the complement of the probability that it is $1$ (e.g., if the probability that it is $1$ is $70\%$, then the probability that it is $0$ is $30\%$).

$$ \begin{align} &h_{\theta}(x) = P(y = 1 | x; \theta) = 1 - P(y = 0 | x; \theta) \\ &P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1 \end{align} $$

Decision Boundary

In order to get our discrete $0$ or $1$ classification, we can translate the output of the hypothesis function as follows:

$$ h_{\theta}(x) \ge 0.5 \to y = 1 \\ h_{\theta}(x) < 0.5 \to y = 0 $$

The way our logistic function $g$ behaves is that when its input is greater than or equal to zero, its output is greater than or equal to $0.5$:

$$ g(z) \ge 0.5 \\ \text{when} \quad z \ge 0 $$

Remember:

$$ z = 0, \quad e^{0} = 1 \Rightarrow g(z) = \frac{1}{2} \\ z \to \infty, \quad e^{-z} \to 0 \Rightarrow g(z) \to 1 \\ z \to -\infty, \quad e^{-z} \to \infty \Rightarrow g(z) \to 0 $$

So if our input to $g$ is $\theta^Tx$, then that means:

$$ h_{\theta}(x) = g(\theta^Tx) \ge 0.5 \\ \text{when} \quad \theta^Tx \ge 0 $$
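
A minimal sketch of this decision rule (reusing the illustrative `sigmoid` helper above; the `predict` name is an assumption): checking $\theta^Tx \ge 0$ is equivalent to thresholding $h_{\theta}(x)$ at $0.5$.

```python
import numpy as np

def predict(theta, x):
    """Return 1 when theta^T x >= 0 (i.e. h_theta(x) >= 0.5), else 0."""
    return 1 if np.dot(theta, x) >= 0 else 0
```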

From these statements we can now say:

$$ \begin{align} &\theta^Tx \ge 0 \Rightarrow y = 1 \\ &\theta^Tx < 0 \Rightarrow y = 0 \end{align} $$

The decision boundary is the line that separates the area where $y = 0$ and where $y = 1$. It is created by our hypothesis function.

Example:

$$ \theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix} \\ y = 1 \quad \text{if} \quad 5 + (-1)x_1 + 0x_2 \ge 0 \\ 5 - x_1 \ge 0 \\ -x_1 \ge -5 \\ x_1 \le 5 $$

In this case, our decision boundary is a straight vertical line placed on the graph where $x_1 = 5$, and everything to the left of it denotes $y = 1$, while everything to the right denotes $y = 0$.
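
A short sketch of this particular example (illustrative only; the feature vector is assumed to be $[1, x_1, x_2]$ so that $\theta_0$ multiplies the intercept term):

```python
import numpy as np

theta = np.array([5.0, -1.0, 0.0])

def predict_example(x1, x2):
    # Feature vector with intercept term x0 = 1.
    x = np.array([1.0, x1, x2])
    return 1 if theta @ x >= 0 else 0

print(predict_example(3.0, 7.0))   # 1, since x1 = 3 <= 5
print(predict_example(6.0, -2.0))  # 0, since x1 = 6 > 5
```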

Again, the input to the sigmoid function $g(z)$ (e.g. $\theta^Tx$) doesn’t need to be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1x_1^2 + \theta_2x_2^2$) or any shape to fit our data.
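
As a hedged illustration of such a non-linear boundary (the particular $\theta$ and feature mapping below are assumptions, not from the notes): with squared features, $z = \theta_0 + \theta_1x_1^2 + \theta_2x_2^2$ can describe a circular decision boundary.

```python
import numpy as np

# Illustrative choice: z = -1 + x1^2 + x2^2, so the decision boundary
# is the unit circle x1^2 + x2^2 = 1.
theta = np.array([-1.0, 1.0, 1.0])

def predict_circle(x1, x2):
    z = theta @ np.array([1.0, x1**2, x2**2])
    return 1 if z >= 0 else 0

print(predict_circle(0.0, 0.0))  # 0, inside the circle
print(predict_circle(2.0, 0.0))  # 1, outside the circle
```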

These notes were based on this course by Andrew Ng.