Real World Data
Confusion Matrix
Measuring Classifiers
Bias and Variance
BIAS is how far removed the mean of your predicted values is from the “real” answer.
VARIANCE is how scattered your predicted values are from the “real” answer.

low variance relative to these observations and high bias

high variance, low bias
Increasing K in KNearestNeighbors decreases variance and increases bias (by averaging together more neighbors)
A single decision tree is prone to overfitting  high variance
 but a random forest decreases that variance
KFold Cross Validation
 split your data into K randomlyassigned segments
 reserve one segment as your test data
 train on the combined remaining K1 segments and measure performance against the test set
 repeat for each segment
 take the average of the Ksquared scores
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel="linear", C=1).fit(X_train, Y_train)
clf.score(X_test, Y_test)
0.9666666666666667
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores
array([0.96666667, 1. , 0.96666667, 0.96666667, 1. ])
scores.mean()
0.9800000000000001
clf = svm.SVC(kernel="poly", C=1).fit(X_train, Y_train)
clf.score(X_test, Y_test)
0.9
Dealing with Outliers
import matplotlib.pyplot as plt
incomes = np.random.normal(27000, 15000, 10000)
incomes = np.append(incomes, [1000000000])
plt.hist(incomes, 50)
plt.show()
incomes.mean()
126959.41748252927
def reject_outliers(data):
u = np.median(data)
s = np.std(data)
filtered = [a for a in data if (u  2 * s < a < u + 2 * s)]
return filtered
filtered = reject_outliers(incomes)
plt.hist(filtered, 50)
plt.show()
np.mean(filtered)
26972.11342427752
Data
OVERSAMPLING
Duplicate samples from the minrity class
Can be done at random
UNDERSAMPLING
Instead of creating more positive samples, remove negative ones
Throwing data away is usually not the best approach