The following notebook performs simple and multivariate linear regression for an air pollution dataset, comparing the results of a maximum-likelihood regression with a manual gradient descent implementation.
Practical Lab: Linear Regression by using the gradient descent algorithm¶
Utilities¶
In [1]:
from sklearn import datasets # datasets
import os # working directory
import pandas as pd # data analysis
from scipy import stats # descriptive statistics
import matplotlib.pyplot as plt # plots
import numpy as np # maths
import seaborn as sns # statistical plots
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import display, Image
Data¶
In [2]:
#-- Reading the (training) data in a data frame
df = pd.read_csv("pm25_train_data.csv",delimiter=";")
df.dropna(inplace=True)
df.head()
Out[2]:
| | PM2.5 | SO2 | NO2 | CO | O3 | temperature | pressure | dew point | rainfall | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 7.0 | 13.0 | 300.0 | 74.0 | 3.9 | 1027.3 | -19.7 | 0.0 | 5.1 |
| 1 | 93.0 | 25.0 | 76.0 | 900.0 | 22.0 | 2.7 | 1027.3 | -16.4 | 0.0 | 2.7 |
| 2 | 117.0 | 77.0 | 99.0 | 1600.0 | 14.0 | 13.8 | 1012.5 | -13.3 | 0.0 | 1.1 |
| 3 | 58.0 | 12.0 | 14.0 | 400.0 | 77.0 | 14.2 | 1018.9 | -13.9 | 0.0 | 2.7 |
| 4 | 226.0 | 104.0 | 136.0 | 2299.0 | 15.0 | 11.9 | 1009.7 | -7.5 | 0.0 | 1.3 |
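Since dropna is applied in place above, a quick way to see how many rows were affected is to count the missing values in the raw file first; a minimal sketch (the name raw is only used here for illustration):

# Sketch: inspect missing values before dropping them (raw is a hypothetical throwaway name)
raw = pd.read_csv("pm25_train_data.csv", delimiter=";")
print(raw.isna().sum())                      # missing values per column
print(raw.shape, "->", raw.dropna().shape)   # number of rows before/after dropna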
In [3]:
#-- Save the explanatory variables in a variable X (and their names in a variable feature_names), and the target variable in Y
feature_names = list(df.columns[1:])
print("Feature Names: ",feature_names)
# extract target
Y = np.array(df["PM2.5"])
# extract variables
X = np.array(df[feature_names])
Feature Names: ['SO2', 'NO2', 'CO', 'O3', 'temperature', 'pressure', 'dew point', 'rainfall', 'windspeed']
Analysing and selecting data¶
In [4]:
#-- Display some statistics on the data by using the describe function on the dataframe
df.describe()
Out[4]:
| | PM2.5 | SO2 | NO2 | CO | O3 | temperature | pressure | dew point | rainfall | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 | 11160.000000 |
| mean | 144.785609 | 21.803471 | 44.614596 | 1165.918100 | 74.123981 | 17.943513 | 1009.810802 | 2.826747 | 0.046918 | 2.235968 |
| std | 102.926739 | 26.880259 | 32.895568 | 1010.439512 | 51.904421 | 10.751609 | 10.075603 | 13.450111 | 0.535652 | 1.337821 |
| min | 3.000000 | 0.856800 | 2.000000 | 100.000000 | 0.214200 | -6.800000 | 984.500000 | -31.300000 | 0.000000 | 0.000000 |
| 25% | 71.000000 | 4.000000 | 19.000000 | 500.000000 | 34.000000 | 8.200000 | 1001.800000 | -8.200000 | 0.000000 | 1.300000 |
| 50% | 120.000000 | 12.000000 | 36.000000 | 900.000000 | 66.000000 | 20.000000 | 1009.300000 | 3.100000 | 0.000000 | 1.900000 |
| 75% | 192.000000 | 28.000000 | 62.000000 | 1500.000000 | 103.000000 | 27.400000 | 1017.600000 | 15.000000 | 0.000000 | 2.700000 |
| max | 844.000000 | 224.000000 | 273.000000 | 10000.000000 | 345.000000 | 39.800000 | 1036.300000 | 28.500000 | 31.200000 | 12.900000 |
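The summary shows strongly right-skewed pollutant columns (for instance the CO maximum of 10000 against a mean of roughly 1166). As a sketch, histograms of every column make this skewness visible at a glance:

# Sketch: histograms of all columns to visualise skewness and potential outliers
df.hist(figsize=(12, 8), bins=50)
plt.tight_layout()
plt.show()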
In [5]:
# PM2.5 vs CO
df.plot.scatter("CO","PM2.5")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc537bd22d0>
In [6]:
# plot scatter for each variable
for i in feature_names:
    df.plot.scatter("PM2.5",i,title="PM2.5 vs. "+str(i))
In [7]:
# correlation matrix
df.corr()
Out[7]:
| | PM2.5 | SO2 | NO2 | CO | O3 | temperature | pressure | dew point | rainfall | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|
| PM2.5 | 1.000000 | 0.485462 | 0.572813 | 0.636648 | -0.218340 | -0.246685 | 0.140336 | -0.029720 | -0.037032 | -0.187595 |
| SO2 | 0.485462 | 1.000000 | 0.598753 | 0.630310 | -0.287325 | -0.366533 | 0.231831 | -0.242011 | -0.041511 | -0.163247 |
| NO2 | 0.572813 | 0.598753 | 1.000000 | 0.744382 | -0.417748 | -0.217173 | 0.146548 | 0.014724 | -0.009562 | -0.347697 |
| CO | 0.636648 | 0.630310 | 0.744382 | 1.000000 | -0.323917 | -0.273665 | 0.139458 | 0.041599 | 0.022054 | -0.313852 |
| O3 | -0.218340 | -0.287325 | -0.417748 | -0.323917 | 1.000000 | 0.683293 | -0.555046 | 0.459740 | -0.030031 | 0.100689 |
| temperature | -0.246685 | -0.366533 | -0.217173 | -0.273665 | 0.683293 | 1.000000 | -0.811459 | 0.810234 | -0.001144 | -0.093390 |
| pressure | 0.140336 | 0.231831 | 0.146548 | 0.139458 | -0.555046 | -0.811459 | 1.000000 | -0.716093 | -0.030913 | 0.062114 |
| dew point | -0.029720 | -0.242011 | 0.014724 | 0.041599 | 0.459740 | 0.810234 | -0.716093 | 1.000000 | 0.087952 | -0.350106 |
| rainfall | -0.037032 | -0.041511 | -0.009562 | 0.022054 | -0.030031 | -0.001144 | -0.030913 | 0.087952 | 1.000000 | -0.035374 |
| windspeed | -0.187595 | -0.163247 | -0.347697 | -0.313852 | 0.100689 | -0.093390 | 0.062114 | -0.350106 | -0.035374 | 1.000000 |
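The variables selected below (CO, NO2, SO2) are the ones most correlated with PM2.5. The same ranking can be read off programmatically; a minimal sketch:

# Sketch: rank the explanatory variables by absolute correlation with the target
corr_with_target = df.corr()["PM2.5"].drop("PM2.5")
print(corr_with_target.abs().sort_values(ascending=False))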
In [8]:
#-- Select the explanatory variables for the simple linear regression, then the multiple linear regression, and display the scatter plots
plt.figure(figsize=(10,5))
plt.subplot(131)
plt.scatter(df["PM2.5"],df["CO"])
plt.title("PM2.5 vs. CO")
plt.subplot(132)
plt.scatter(df["PM2.5"],df["NO2"],c="orange")
plt.title("PM2.5 vs. NO2")
plt.subplot(133)
plt.scatter(df["PM2.5"],df["SO2"],c="green")
plt.title("PM2.5 vs. SO2")
plt.show()
In [9]:
#-- Extract the data and create the two X matrices that will be used for the regression (see page 26 for the form of X):
#---- Xs for simple lin reg and Xm for multiple lin reg
from sklearn.model_selection import train_test_split
df = df[["PM2.5","CO","NO2","SO2"]]
X_train_single, X_test_single, y_train_single, y_test_single = train_test_split(df["CO"], df["PM2.5"], test_size=0.1, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(df[["CO","NO2","SO2"]], df["PM2.5"], test_size=0.1, random_state=42)
# transfer to np arrays
X_train_single = np.array(X_train_single)
X_test_single = np.array(X_test_single)
y_train_single = np.array(y_train_single)
y_test_single = np.array(y_test_single)
X_train_multi = np.array(X_train_multi)
X_test_multi = np.array(X_test_multi)
y_train_multi = np.array(y_train_multi)
y_test_multi = np.array(y_test_multi)
#-- Check the size of both matrices
print("X_train_single:\t",len(X_train_single),type(X_train_single))
print("X_test_single:\t",len(X_test_single),type(X_test_single))
print("X_train_multi:\t",len(X_train_multi),type(X_train_multi))
print("X_test_multi:\t",len(X_test_multi),type(X_test_multi))
## Hint: use stack/hstack/vstack
X_train_single:	 10044 <class 'numpy.ndarray'>
X_test_single:	 1116 <class 'numpy.ndarray'>
X_train_multi:	 10044 <class 'numpy.ndarray'>
X_test_multi:	 1116 <class 'numpy.ndarray'>
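The hint above refers to assembling the design matrix by stacking a column of ones next to the feature columns; a minimal sketch with np.column_stack (the add_ones helper defined further down does the same job with np.c_):

# Sketch: build a design matrix [1, x1, ..., xp] by column-stacking
ones = np.ones(X_train_multi.shape[0])
design = np.column_stack((ones, X_train_multi))
print(design.shape)   # (n_samples, 1 + n_features)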
In [10]:
#--- Write the standardisation function to centre and scale the X data
from sklearn.linear_model import LinearRegression
def standardisation(X):
    """
    from sklearn import preprocessing
    scaler = preprocessing.StandardScaler().fit(X)
    return(scaler.transform(X))
    """
    scaled = (X-X.mean(axis=0))/X.std(axis=0,ddof=1)
    return(scaled)
#-- Test 1 - simple lin reg
X_train_single = standardisation(X_train_single)
print("Single:\n",stats.describe(X_train_single))
#-- Test 2 - multiple lin reg
X_train_multi = standardisation(X_train_multi)
print("\nMulti:\n",stats.describe(X_train_multi))
Single:
 DescribeResult(nobs=10044, minmax=(-1.0594583269265279, 8.78194240057042), mean=-2.3345191437757175e-17, variance=1.0, skewness=2.3847267827581526, kurtosis=9.414789477257468)

Multi:
 DescribeResult(nobs=10044, minmax=(array([-1.05945833, -1.29954115, -0.78008697]), array([8.7819424 , 6.8883484 , 7.57199728])), mean=array([-2.33451914e-17, -1.32112561e-16, -3.07732069e-17]), variance=array([1., 1., 1.]), skewness=array([2.38472678, 1.22296976, 2.363561 ]), kurtosis=array([9.41478948, 1.77233902, 7.08565568]))
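Note that the test matrices should later be scaled with the training statistics rather than re-standardised on their own. A minimal sketch of one way to organise this (standardisation_fit, mu and sigma are hypothetical names, not part of the lab template):

# Sketch: variant of standardisation that also returns the training statistics,
# so the same transform can be applied to the test data without refitting on it
def standardisation_fit(X):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    return (X - mu) / sigma, mu, sigma

# usage (hypothetical):
# X_train_multi, mu, sigma = standardisation_fit(X_train_multi)
# X_test_multi = (X_test_multi - mu) / sigma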
In [11]:
#-- Preparing the matrices used for linear regression when using the gradient descent algorithm
# adding a column of 1s at the beginning
def add_ones(array):
    ones = np.full((array.shape[0]), 1)
    return(np.c_[ones,array])
X_train_single = add_ones(X_train_single)
X_train_multi = add_ones(X_train_multi)
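A quick sanity check, as a sketch: after adding the intercept column the single-feature matrix should have 2 columns and the multi-feature matrix 4.

# Sketch: check the design-matrix shapes after adding the intercept column
print(X_train_single.shape)   # expected (10044, 2)
print(X_train_multi.shape)    # expected (10044, 4)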
In [12]:
#-- To compare with the results of the gradient descent algorithm, we first implement the exact closed-form solution given by maximum likelihood
# Formula recall: beta = (X^T X)^-1 X^T Y
def coef_ml(X,Y):
    beta = np.dot(np.dot(np.linalg.inv(np.dot(X.T,X)),X.T),Y)
    return(beta)
#-- Test 1 - simple reg
beta_single = coef_ml(X_train_single,y_train_single)
print("beta single: \t",beta_single)
#-- Test 2 - multiple reg
#-- (We can also use the native functions of Scikit-Learn, but there are more parameters that need to be set)
beta_multi = coef_ml(X_train_multi,y_train_multi)
print("beta multiple: \t",beta_multi)
# output: n(input)+1
beta single: 	 [145.02815611 65.1422156 ]
beta multiple: 	 [145.02815611 44.21803698 20.05823086 9.41672601]
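As mentioned above, the closed-form coefficients can be cross-checked with scikit-learn; a minimal sketch using the LinearRegression class already imported (fit_intercept=False because the column of ones is already part of X):

# Sketch: cross-check the closed-form solution with scikit-learn
lr = LinearRegression(fit_intercept=False)   # the ones column already encodes the intercept
lr.fit(X_train_multi, y_train_multi)
print(lr.coef_)   # should match beta_multi up to numerical precision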
Gradient descent algorithm¶
In the following we will implement several functions to apply linear regression. These functions should be generic and work for any number of explanatory variables: you should be able to apply them to the standardised Xs and Xm matrices.
WARNING: the parameters of the functions need to be completed
In [13]:
#-- Model
def f(X,beta):
    return(np.dot(X,beta))
#-- Test 1 - simple reg
Yhat1 = f(X_train_single,beta_single)
print(Yhat1[:5])
#-- Test 2 - multiple reg
Yhat3 = f(X_train_multi,beta_multi)
print(Yhat3[:5])
[114.86667203 212.0016187 140.76932447 212.0016187 95.43968269]
[117.14728052 230.27222273 146.12890938 198.47826959 90.11001981]
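The fit of these maximum-likelihood predictions can already be quantified with the metrics imported at the top, which gives a baseline for the gradient descent implementation; a minimal sketch:

# Sketch: training error of the closed-form fit (baseline for gradient descent)
print("MSE (single):", mean_squared_error(y_train_single, Yhat1))
print("R2  (single):", r2_score(y_train_single, Yhat1))
print("MSE (multi): ", mean_squared_error(y_train_multi, Yhat3))
print("R2  (multi): ", r2_score(y_train_multi, Yhat3))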
Loss¶
In [14]:
display(Image(filename='cost_function.png'))
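The image above recalls the cost function minimised by gradient descent. As a hedged sketch, assuming the usual mean-squared-error form J(beta) = 1/(2n) * sum_i (f(x_i, beta) - y_i)^2 (the exact convention may differ from the one shown in cost_function.png):

# Sketch (assumption): mean-squared-error cost with the 1/(2n) convention
def loss(X, Y, beta):
    residuals = f(X, beta) - Y
    return np.dot(residuals, residuals) / (2 * len(Y))

# usage example:
# print(loss(X_train_single, y_train_single, beta_single))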