Linear Regression
In this blog post, we showcase a simple linear regression model on a real dataset, the “Auto Insurance” dataset. The objective is to illustrate the application of simple linear regression: predicting the total payment for all claims, in thousands of currency (Y), from the total number of claims (X).
Importing Libraries
In this section, the essential libraries for data manipulation, mathematical operations, and visualization are imported.
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import plotly.express as px
Exploring the Dataset and Visualization
In this section, the “insurance.csv” file is loaded into a pandas DataFrame. Exploration begins by printing the DataFrame’s columns and its first 10 rows. The data is then visualized with box plots of the ‘X’ and ‘Y’ columns and a scatter plot of the relationship between ‘X’ and ‘Y’.
data = pd.read_csv('insurance.csv')
print(data.columns)
data.head(10)
Index(['X', 'Y'], dtype='object')
|   | X   | Y     |
|---|-----|-------|
| 0 | 108 | 392.5 |
| 1 | 19  | 46.2  |
| 2 | 13  | 15.7  |
| 3 | 124 | 422.2 |
| 4 | 40  | 119.4 |
| 5 | 57  | 170.9 |
| 6 | 23  | 56.9  |
| 7 | 14  | 77.5  |
| 8 | 45  | 214.0 |
| 9 | 10  | 65.3  |
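A quick way to extend this first look, as a hedged sketch not shown in the original post, is pandas’ built-in summary of both columns:

```python
# Count, mean, standard deviation, min/max and quartiles for X and Y
data.describe()
```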
fig = px.box(data['X'], points='all')
fig.update_layout(title='Distribution of X', title_x=0.5, yaxis_title="Number of Insurance Claims")
fig.show()
fig = px.box(data['Y'], points='all')
fig.update_layout(title='Distribution of Y', title_x=0.5, yaxis_title="Amount of Insurance Paid")
fig.show()
fig = px.scatter(x=data['X'], y=data['Y'])
fig.update_layout(xaxis_title="Number of Claims", yaxis_title="Payment in Claims", height=500, width=700)
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()
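To quantify the linear trend visible in the scatter plot, a small optional addition (not part of the original post) is the Pearson correlation between ‘X’ and ‘Y’:

```python
# Pearson correlation coefficient between number of claims and payment
print(data['X'].corr(data['Y']))
```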
Calculating Statistics, Covariance and Coefficients of Linear Regression
In this section, the mean and variance of the ‘X’ and ‘Y’ columns are computed and printed. Additionally, the covariance between ‘X’ and ‘Y’ is calculated. Subsequently, the coefficients for the linear regression model are computed.
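For reference, these are the standard textbook formulas that the code below implements (the notation here is ours, with $\bar{x}$ and $\bar{y}$ the sample means):

$$
\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right), \qquad
b_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad
b_0 = \bar{y} - b_1\,\bar{x}
$$

The fitted line is then $\hat{y} = b_0 + b_1 x$.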
mean_x = np.mean(data['X'])
mean_y = np.mean(data['Y'])
var_x = np.var(data['X'])
var_y = np.var(data['Y'])
print('x stats: mean= %.3f variance= %.3f' % (mean_x, var_x))
print('y stats: mean= %.3f variance= %.3f' % (mean_y, var_y))
x stats: mean= 22.905 variance= 536.658
y stats: mean= 98.187 variance= 7505.052
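Note that np.var uses the population convention (ddof=0) by default, which is consistent with the covariance function below dividing by n rather than n - 1; as long as both use the same denominator, the slope b1 is unaffected. A quick side check, not in the original post:

```python
# Population variance (ddof=0, NumPy's default) vs. sample variance (ddof=1)
print(np.var(data['X']), np.var(data['X'], ddof=1))
```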
def covariance(x, y):
    mean_x = np.mean(x)
    mean_y = np.mean(y)
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar / len(x)
covar_xy = covariance(data['X'], data['Y'])
print(f'Cov(X,Y): {covar_xy}')
Cov(X,Y): 1832.0543461829182
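As a sanity check (not part of the original post), NumPy’s built-in covariance gives the same value when told to use the population convention:

```python
# Off-diagonal entry of the 2x2 covariance matrix; bias=True divides by n
print(np.cov(data['X'], data['Y'], bias=True)[0, 1])
```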
b1 = covar_xy / var_x
b0 = mean_y - b1 * mean_x

print(f'Coefficients:\n b0: {b0} b1: {b1}')
Coefficients:
 b0: 19.99448575911478 b1: 3.413823560066368
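The coefficients can be cross-checked with NumPy’s least-squares polynomial fit (an optional verification, not in the original post); np.polyfit returns the slope first, then the intercept:

```python
# Degree-1 fit: returns [b1, b0]
b1_np, b0_np = np.polyfit(data['X'], data['Y'], 1)
print(f'b0: {b0_np} b1: {b1_np}')
```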
Predicting and Visualizing the Results
In this segment, the fitted coefficients are used to predict ‘Y’ for every ‘X’ in the dataset, and a plot is generated to visually compare the actual values with the predicted regression line. This completes a simple linear regression on the dataset, modelling the relationship between a single feature (independent variable ‘X’) and a real-valued target (dependent variable ‘Y’).
x = data['X'].values.copy()
y_hat = b0 + b1 * x
y = data['Y'].values

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=data['X'], y=data['Y'], name='train', mode='markers', marker_color='rgba(152, 0, 0, .8)'))
fig.add_trace(go.Scatter(x=data['X'], y=y_hat, name='prediction', mode='lines+markers', marker_color='rgba(0, 152, 0, .8)'))

fig.update_layout(title_x=0.5, xaxis_title="Number of Claims", yaxis_title="Payment in Claims")
fig.update_xaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.update_yaxes(showline=True, linewidth=2, linecolor='black', mirror=True)
fig.show()
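The post compares the fit only visually; as a small hedged extension, the quality of the fitted line can be summarized with the root mean squared error (RMSE) between the actual and predicted payments:

```python
# RMSE of the fitted line over the training data
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(f'RMSE: {rmse:.3f}')
```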
Conclusion
The fundamental objective of linear regression is to model the relationship between a single feature, the independent variable ‘X’, and a real-valued target, the dependent variable ‘Y’. The underlying assumption is that a linear relationship exists between ‘X’ and ‘Y’, and the model seeks the best-fit line that captures this relationship among the data points. The aim is a linear equation that can predict ‘Y’ from the values of ‘X’ with the least amount of error.
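For readers who prefer a library implementation, the same coefficients can be recovered with scikit-learn (a sketch assuming scikit-learn is installed; it is not used in the original post):

```python
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2-D feature matrix, hence data[['X']]
model = LinearRegression().fit(data[['X']], data['Y'])
print(model.intercept_, model.coef_[0])  # should match b0 and b1 computed above
```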