reading-notes

Linear Regression

Can you explain the basic concept of linear regression and its purpose in the context of machine learning and data analysis?

Linear regression is a way to find the straight line that best fits a set of scattered data points, capturing the relationship between variables. That line can then be used to predict new data points.

In machine learning and data analysis, linear regression is used when we need to predict a value from other variables or features, for example predicting a person’s happiness from their education level.
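
As a small illustration of the idea, the sketch below fits a straight line to a handful of points with NumPy's `np.polyfit` and uses it to predict a new value. The education and happiness numbers are invented for this example and are not real data.

```python
import numpy as np

# Hypothetical data: years of education (x) and a happiness score (y).
education_years = np.array([8, 10, 12, 14, 16, 18], dtype=float)
happiness_score = np.array([4.1, 4.8, 5.6, 6.1, 6.9, 7.4])

# Fit a degree-1 polynomial (a straight line): y = slope * x + intercept.
slope, intercept = np.polyfit(education_years, happiness_score, deg=1)

# Use the fitted line to predict the score for an unseen input.
new_years = 13.0
predicted = slope * new_years + intercept
print(f"slope={slope:.3f}, intercept={intercept:.3f}, prediction={predicted:.2f}")
```

The slope says how much the predicted score changes per extra year of education, and the intercept is the predicted score at zero years.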

Describe the process of implementing a linear regression model using Python’s Scikit Learn library, including the necessary steps and functions.

  1. We first need to import the `LinearRegression` class from `sklearn.linear_model`, along with `numpy` and `matplotlib.pyplot`. The `LinearRegression` class has default parameters.

```python
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Default parameters, shown for reference only (not executed).
# Note: normalize was removed in scikit-learn 1.2.
# sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True)
```


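    - `fit_intercept` (bool, default=True): Whether to calculate the intercept for the model. If set to False, no intercept is used in the calculation, so the data is expected to be centered.

    - `normalize` (bool, default=False): If True, the features in X are normalized before regression by subtracting the mean and dividing by the l2-norm. Ignored when `fit_intercept` is False. This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.

    - `copy_X` (bool, default=True): If True, X is copied; otherwise it may be overwritten.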

  2. Next we need some random data. Here we generate it as a NumPy array and plot it with matplotlib:

```python
# Seeded generator so the example is reproducible
rnstate = np.random.RandomState(1)
x = 10 * rnstate.rand(50)            # 50 x values in [0, 10)
y = 2 * x - 5 + rnstate.randn(50)    # line y = 2x - 5 plus Gaussian noise
plt.scatter(x, y)
plt.show()
```

  3. Create a linear regression model with an intercept, fit it to the data, and predict the line of best fit:

```python
model = LinearRegression(fit_intercept=True)
# scikit-learn expects a 2D feature array, so add a column axis to x
model.fit(x[:, np.newaxis], y)

# Evenly spaced x values at which to evaluate the fitted line
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
```

  4. Plot the estimated linear regression line over the data with matplotlib:

```python
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.show()
```
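
One way to check the fit without a plot is to read the learned parameters back from the model. The sketch below regenerates the same synthetic data (true slope 2, true intercept -5) so that it runs on its own, then inspects `coef_`, `intercept_`, and `score()`. The second model with `fit_intercept=False` is added only to illustrate that parameter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the synthetic data: y = 2x - 5 plus Gaussian noise.
rnstate = np.random.RandomState(1)
x = 10 * rnstate.rand(50)
y = 2 * x - 5 + rnstate.randn(50)
X = x[:, np.newaxis]  # scikit-learn expects a 2D feature array

model = LinearRegression(fit_intercept=True).fit(X, y)

# coef_ holds one slope per feature; intercept_ is the fitted offset.
slope = model.coef_[0]
intercept = model.intercept_
# score() returns the coefficient of determination (R^2) on the given data.
r_squared = model.score(X, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")

# With fit_intercept=False the line is forced through the origin.
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)
print(f"no-intercept slope={no_intercept.coef_[0]:.3f}, intercept={no_intercept.intercept_}")
```

The fitted slope and intercept should land close to the values used to generate the data, while the no-intercept model has to compensate with a different slope.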

What is the purpose of splitting the dataset into train and test sets, and how does this contribute to the evaluation of a machine learning model’s performance?

We split the dataset into train and test sets so that the model learns from training data with known outputs, while the test data, which the model has not seen during training, is used to check the model’s predictions.
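
A minimal sketch of such a split, using scikit-learn's `train_test_split` on synthetic data. The sample size, `test_size=0.25`, and `random_state=0` are illustrative choices, not values from the readings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for this demo): y = 2x - 5 plus Gaussian noise.
rng = np.random.RandomState(1)
X = 10 * rng.rand(200, 1)
y = 2 * X[:, 0] - 5 + rng.randn(200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the training rows versus the held-out rows.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

The test score is the more honest estimate of how the model would do on new data, because those rows played no part in fitting.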

When evaluating a machine learning model’s performance, splitting the dataset lets us detect overfitting or underfitting, both of which hurt the model’s ability to predict new data. Both terms describe how well the model was fit to the training data.

Overfitting means the model is fit too closely to the training dataset, so it picks up noise and does not generalize to new data. Underfitting means the model is not fit well enough and misses the trends in the data.
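
To make the difference visible, the sketch below compares training and test scores for models of increasing flexibility. It uses polynomial features on curved synthetic data, which is a technique chosen here for illustration and not taken from the readings. The degrees (1, 4, 15) and the sine-shaped data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, curved data (assumption for this demo): y = sin(2*pi*x) + noise.
rng = np.random.RandomState(0)
X = rng.rand(60, 1)
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.randn(60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Map each polynomial degree to its (train R^2, test R^2) pair.
scores = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    scores[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"degree={degree:2d}  train R^2={scores[degree][0]:.3f}  "
          f"test R^2={scores[degree][1]:.3f}")
```

Low scores on both sets suggest underfitting, while a training score well above the test score suggests overfitting.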

Things I want to know more about

References

How To Run Linear Regressions In Python Scikit-learn

Linear Regression in Python

Train/Test Split and Cross Validation in Python