Predictive Models - Regression

Linear Regression

Least squares minimizes the sum of squared errors between the observed points and the fitted line.

If you frame the problem in terms of probabilities, assuming Gaussian noise around the line, minimizing the sum of squared errors is equivalent to maximizing the likelihood of the observed data.

That is why this approach is sometimes called “maximum likelihood estimation”.
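
Here is a minimal sketch of that equivalence on made-up toy data (everything in this snippet is illustrative and separate from the example below): the slope and intercept found by least squares are the same ones that maximize a Gaussian log-likelihood.

import numpy as np
from scipy import stats, optimize

# toy data: a known line plus Gaussian noise
rng = np.random.default_rng(42)
x = rng.normal(3.0, 1.0, 200)
y = 100 - 3 * x + rng.normal(0, 0.5, 200)

# least-squares fit
slope, intercept, *_ = stats.linregress(x, y)

def neg_log_likelihood(params):
    m, b, sigma = params
    # Gaussian likelihood of the observations around the candidate line
    return -np.sum(stats.norm.logpdf(y, loc=m * x + b, scale=abs(sigma)))

# maximum-likelihood fit over (slope, intercept, noise scale)
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 90.0, 1.0],
                           method="Nelder-Mead")
print(slope, intercept)   # least-squares estimates
print(result.x[:2])       # maximum-likelihood estimates: the same line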

import numpy as np
import matplotlib.pyplot as plt

# fabricated data: page load time vs. amount purchased, with a linear relationship
pageSpeed = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = 100 - (pageSpeed + np.random.normal(0, 0.1, 1000)) * 3

plt.scatter(pageSpeed, purchaseAmount)
plt.show()

[Figure: scatter plot of purchaseAmount vs. pageSpeed, showing a clear linear trend]

from scipy import stats

# fit a straight line; linregress returns the slope, intercept, correlation
# coefficient, p-value, and standard error of the estimate
slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeed, purchaseAmount)

The r-squared value shows a really good fit (values close to 1 mean the line explains almost all of the variance):

r_value ** 2
0.9902693736994506
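
To unpack what that number means, r-squared can be computed by hand as the fraction of variance the line explains, reusing the variables from the cells above:

# r-squared = 1 - (residual sum of squares) / (total sum of squares)
residuals = purchaseAmount - (slope * pageSpeed + intercept)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((purchaseAmount - np.mean(purchaseAmount)) ** 2)
1 - ss_res / ss_tot   # matches r_value ** 2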
def predict(x):
    return slope * x + intercept

# run every observed page speed through the fitted model to get the line
fitline = predict(pageSpeed)

plt.scatter(pageSpeed, purchaseAmount)
plt.plot(pageSpeed, fitline, c="r")
plt.show()

[Figure: the same scatter plot with the fitted regression line drawn in red]

Polynomial Regression

np.random.seed(2020)

# fabricated data with a nonlinear (roughly 1/x) relationship this time
pageSpeed = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeed

plt.scatter(pageSpeed, purchaseAmount)
plt.show()

[Figure: scatter plot of the nonlinear purchaseAmount vs. pageSpeed data]

x = np.array(pageSpeed)
y = np.array(purchaseAmount)

# fit a 4th-degree polynomial to the data
p4 = np.poly1d(np.polyfit(x, y, 4))
p4
poly1d([   2.43355681,  -33.88400278,  168.02869346, -354.05850288,
        286.61089763])
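
poly1d stores coefficients from the highest power down, so the array above reads as 2.43x⁴ − 33.88x³ + 168.03x² − 354.06x + 286.61. A quick sanity check, reusing p4 from above:

# coefficients are ordered from the x^4 term down to the constant
np.polyval(p4.coeffs, 3.0), p4(3.0)   # identical values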
# evaluate the fit over a smooth range of x values for plotting
xp = np.linspace(0, 7, 100)
plt.scatter(x, y)
plt.plot(xp, p4(xp), c="r")
plt.show()

[Figure: scatter plot with the 4th-degree polynomial fit drawn in red]

from sklearn.metrics import r2_score

r2 = r2_score(y, p4(x))
r2
0.6980067595161712
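
An r-squared of about 0.7 is a decent fit, but a higher polynomial degree would always push the training r-squared higher, so it is worth scoring the model on data it never saw. A minimal sketch holding out the last 20% of the points (the 80/20 split is an arbitrary choice):

# fit on the first 800 points, score on the held-out 200
train_x, test_x = x[:800], x[800:]
train_y, test_y = y[:800], y[800:]

p4_train = np.poly1d(np.polyfit(train_x, train_y, 4))
r2_score(test_y, p4_train(test_x))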

Multiple Regression

import pandas as pd

# reading a .xls file requires an Excel reader engine such as xlrd
df = pd.read_excel("http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls")
df.head()
          Price  Mileage   Make    Model      Trim   Type  Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129     8221  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        1
1  17542.036083     9135  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
2  16218.847862    13196  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      1        0
3  16336.913140    16342  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        0
4  16339.170324    19832  Buick  Century  Sedan 4D  Sedan         6    3.1      4       1      0        1
df1 = df[["Mileage", "Price"]]

# bucket mileage into 10,000-mile bins (bin edges 0 through 40,000)
bins = np.arange(0, 50000, 10000)
groups = df1.groupby(pd.cut(df1["Mileage"], bins)).mean()
groups.head()
                     Mileage         Price
Mileage
(0, 10000]       5588.629630  24096.714451
(10000, 20000]  15898.496183  21955.979607
(20000, 30000]  24114.407104  20278.606252
(30000, 40000]  33610.338710  19463.670267
groups["Price"].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x292b9325488>

[Figure: line plot of mean price per mileage bin, trending downward with mileage]

import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
# take a copy of the feature columns so the scaled values can be written
# back without pandas' SettingWithCopyWarning; standardizing the features
# puts them on comparable scales, so their coefficients can be compared
X = df[["Mileage", "Cylinder", "Doors"]].copy()
y = df["Price"]
X[["Mileage", "Cylinder", "Doors"]] = scale.fit_transform(
    X[["Mileage", "Cylinder", "Doors"]])
X
      Mileage  Cylinder     Doors
0   -1.417485   0.52741  0.556279
1   -1.305902   0.52741  0.556279
2   -0.810128   0.52741  0.556279
3   -0.426058   0.52741  0.556279
4    0.000008   0.52741  0.556279
..        ...       ...       ...
799 -0.439853   0.52741  0.556279
800 -0.089966   0.52741  0.556279
801  0.079605   0.52741  0.556279
802  0.750446   0.52741  0.556279
803  1.932565   0.52741  0.556279

804 rows × 3 columns

est = sm.OLS(y, X).fit()
est.summary()
                            OLS Regression Results
=======================================================================================
Dep. Variable:                  Price   R-squared (uncentered):                   0.064
Model:                            OLS   Adj. R-squared (uncentered):              0.060
Method:                 Least Squares   F-statistic:                              18.11
Date:                Wed, 28 Oct 2020   Prob (F-statistic):                    2.23e-11
Time:                        14:40:09   Log-Likelihood:                         -9207.1
No. Observations:                 804   AIC:                                  1.842e+04
Df Residuals:                     801   BIC:                                  1.843e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Mileage    -1272.3412    804.623     -1.581      0.114   -2851.759     307.077
Cylinder    5587.4472    804.509      6.945      0.000    4008.252    7166.642
Doors      -1404.5513    804.275     -1.746      0.081   -2983.288     174.185
==============================================================================
Omnibus:                      157.913   Durbin-Watson:                   0.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              257.529
Skew:                           1.278   Prob(JB):                     1.20e-56
Kurtosis:                       4.074   Cond. No.                         1.03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
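
One caveat on the summary above: sm.OLS does not add an intercept automatically (hence the uncentered R-squared), so this model forces the regression through the origin. A sketch of the same fit with an explicit constant column:

# statsmodels only fits an intercept if you add a constant column yourself
X_const = sm.add_constant(X)
est_const = sm.OLS(y, X_const).fit()
est_const.params   # now includes a 'const' (intercept) term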

y.groupby(df.Doors).mean()
Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors doesn’t mean a higher price: the 2-door cars in this dataset actually average a higher price than the 4-door cars.

So the number of doors is pretty useless as a predictor here.

# scale a new sample (45,000 miles, 8 cylinders, 4 doors) with the same
# scaler the model was trained on, then predict its price
scaled = scale.transform([[45000, 8, 4]])
predicted = est.predict(scaled[0])
scaled, predicted
(array([[3.07256589, 1.96971667, 0.55627894]]), array([6315.01330583]))
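
One caveat: recent scikit-learn versions warn when a scaler fitted on a DataFrame is handed a bare list, because the feature names are lost. A sketch that keeps the column names attached (the sample values are the same as above):

# wrap the new sample in a DataFrame whose columns match the training data
sample = pd.DataFrame([[45000, 8, 4]], columns=["Mileage", "Cylinder", "Doors"])
est.predict(scale.transform(sample))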

Multi-Level Models