Classification and Representation
Classification
To attempt classification, one method is to use linear regression and map all predictions greater than $0.5$ to $1$ and all predictions less than $0.5$ to $0$. However, this method doesn't work well because classification is not actually a linear function.
The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values.
For now, we will focus on the binary classification problem in which $y$ can take only two values, $0$ and $1$. (Most of what we say here will also generalize to the multiple-class case.)
For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and $y$ may be $1$ if it is a piece of spam mail, and $0$ otherwise.
Hence $y \in \{0, 1\}$. $0$ is also called the negative class, and $1$ the positive class, and they are sometimes also denoted by the symbols "-" and "+". Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.
Hypothesis Representation
We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_{\theta}(x)$ to take values larger than $1$ or smaller than $0$ when we know that $y \in \{0, 1\}$.
To fix this, let's change the form of our hypothesis $h_{\theta}(x)$ to satisfy $0 \le h_{\theta}(x) \le 1$. This is accomplished by plugging $\theta^T x$ into the Logistic Function.
Our new form uses the Sigmoid Function, also called the Logistic Function:
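$$h_{\theta}(x) = g(\theta^T x)$$

$$z = \theta^T x$$

$$g(z) = \frac{1}{1 + e^{-z}}$$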
The following image shows us what the sigmoid function looks like:
The function $g(z)$, shown here, maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary real-valued function into a function better suited for classification.
$h_{\theta}(x)$ will give us the probability that our output is $1$.
For example, $h_{\theta}(x) = 0.7$ gives us a probability of $70\%$ that our output is $1$. The probability that our prediction is $0$ is just the complement of the probability that it is $1$ (e.g., if the probability that it is $1$ is $70\%$, then the probability that it is $0$ is $30\%$).
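To make this concrete, here is a minimal sketch in Python (assuming NumPy is available; the parameter and feature values are made up for illustration) of how the sigmoid hypothesis turns $\theta^T x$ into a probability and how its complement is obtained:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta).
    return sigmoid(theta @ x)

# Hypothetical parameters and a single example with an intercept term x_0 = 1.
theta = np.array([-1.0, 0.5, 0.25])
x = np.array([1.0, 2.0, 3.0])

p_one = hypothesis(theta, x)   # probability that y = 1
p_zero = 1.0 - p_one           # complementary probability that y = 0
print(f"P(y=1) = {p_one:.3f}, P(y=0) = {p_zero:.3f}")
```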
Decision Boundary
In order to get our discrete $0$ or $1$ classification, we can translate the output of the hypothesis function as follows:
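$$h_{\theta}(x) \ge 0.5 \rightarrow y = 1$$

$$h_{\theta}(x) < 0.5 \rightarrow y = 0$$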
The way our logistic function $g$ behaves is that when its input is greater than or equal to zero, its output is greater than or equal to $0.5$:
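$$g(z) \ge 0.5 \quad \text{when} \quad z \ge 0$$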
Remember:
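$$z = 0,\; e^{-z} = 1 \Rightarrow g(z) = \tfrac{1}{2}$$

$$z \to \infty,\; e^{-z} \to 0 \Rightarrow g(z) \to 1$$

$$z \to -\infty,\; e^{-z} \to \infty \Rightarrow g(z) \to 0$$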
So if our input to $g$ is $\theta^T x$, then that means:
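$$h_{\theta}(x) = g(\theta^T x) \ge 0.5 \quad \text{when} \quad \theta^T x \ge 0$$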
From these statements we can now say:
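$$\theta^T x \ge 0 \Rightarrow y = 1$$

$$\theta^T x < 0 \Rightarrow y = 0$$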
The decision boundary is the line that separates the region where $y = 0$ from the region where $y = 1$. It is created by our hypothesis function.
Example:
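One choice of parameters consistent with the boundary described below (the specific values here are illustrative) is:

$$\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}$$

$$y = 1 \;\text{ if }\; 5 + (-1)x_1 + 0 \cdot x_2 \ge 0 \;\Leftrightarrow\; 5 - x_1 \ge 0 \;\Leftrightarrow\; x_1 \le 5$$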
In this case, our decision boundary is a straight vertical line placed on the graph where $x_1 = 5$, and everything to the left of that denotes $y = 1$, while everything to the right denotes $y = 0$.
Again, the input to the sigmoid function $g(z)$ (e.g. $\theta^T x$) doesn't need to be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$) or any shape that fits our data.
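To make the non-linear case concrete, here is a small Python sketch (NumPy assumed; the parameter values are illustrative, not taken from the text) that classifies points using the circular form $z = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_circle(theta, x1, x2):
    # Non-linear input to the sigmoid: z = theta_0 + theta_1 * x1^2 + theta_2 * x2^2.
    z = theta[0] + theta[1] * x1**2 + theta[2] * x2**2
    # Predict y = 1 exactly when z >= 0 (equivalently, when g(z) >= 0.5).
    return (sigmoid(z) >= 0.5).astype(int)

# Illustrative parameters: z = -1 + x1^2 + x2^2, so the decision boundary is the unit circle.
theta = np.array([-1.0, 1.0, 1.0])

x1 = np.array([0.0, 0.5, 2.0])
x2 = np.array([0.0, 0.5, 0.0])
print(predict_circle(theta, x1, x2))  # points outside the unit circle are predicted as y = 1
```

Here the boundary is not a straight line at all: every point on the unit circle satisfies $z = 0$, points inside it get $y = 0$, and points outside it get $y = 1$.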