Machine learning is a subset of artificial intelligence that focuses on building systems that learn from and make decisions based on data. This project showcases a fundamental machine learning technique known as linear regression, implemented using the Gradient Descent algorithm. The goal is to understand the relationship between dependent and independent variables in a dataset and use this relationship to make predictions.
Linear regression is an algorithm that models the linear relationship between an independent variable and a dependent variable and uses that relationship to predict future outcomes. It is a statistical method used in data science and machine learning for predictive analysis.
The goal is to find the formula (or equation) of a line to predict the value of the output (dependent/outcome) variable based on the input (independent/predictor) variable with maximum accuracy or minimum error.
In the above figure,
X-axis: Independent variable
Y-axis: Output / dependent variable
Line of regression: Best fit line for a model
The line is plotted so that it fits the given data points as closely as possible; hence, it is called the "best fit line." The goal of the linear regression algorithm is to find this best fit line, as seen in the above figure.
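Since the project installs matplotlib anyway (see the setup section below), a figure like the one described can be reproduced with a short sketch; the data points here are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8])

# Slope and intercept of the best fit line (least squares)
a, b = np.polyfit(x, y, 1)

plt.scatter(x, y, label="data points")
plt.plot(x, a * x + b, color="red", label="line of regression")
plt.xlabel("Independent variable (x)")
plt.ylabel("Dependent / output variable (y)")
plt.legend()
plt.show()
```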
Suppose the equation of the best-fitted line is given by
y = ax + b
Where,
y: dependent variable
x: independent variable
a: slope or regression coefficient
b: y-intercept
then, the regression coefficient formulas are given as follows:

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \frac{\sum y - a\sum x}{n}$$

Here, $n$ refers to the number of data points in the given data set.
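As a quick sanity check of these formulas, here is a minimal sketch in Python; the helper name fit_closed_form and the sample arrays are illustrative only:

```python
import numpy as np

def fit_closed_form(x, y):
    # Least-squares slope (a) and intercept (b) from the formulas above;
    # x and y are 1-D NumPy arrays of equal length.
    n = len(x)
    a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    b = (np.sum(y) - a * np.sum(x)) / n
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])
a, b = fit_closed_form(x, y)  # a = 1.97, b = 0.15
```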
Gradient descent is one of the best-known techniques in machine learning and is used to train all sorts of neural networks. It is not limited to neural networks, however: it can train many other machine learning models as well. In particular, gradient descent can be used to train a linear regression model!
It is an iterative optimization algorithm that tries to find the optimum value (minimum or maximum) of an objective function. It is one of the most widely used optimization techniques in machine learning projects for updating the parameters of a model in order to minimize a cost function.
The main aim of gradient descent is to find the best parameters of a model, i.e., those that give the highest accuracy on the training data. By parameters, I mean the coefficients or weights. At each iteration, both parameters are updated simultaneously:

$$\theta_0 := \theta_0 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)$$

$$\theta_1 := \theta_1 - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right) x_i$$

where $\hat{y}_i = \theta_0 + \theta_1 x_i$ is the model's prediction for the $i$-th data point, and:

- $\alpha$: learning rate
- $m$: total number of data points
- $\theta_0$: bias, y-intercept
- $\theta_1$: weight, slope
1. Initialize Parameters
Start with initial values for the parameters, which can be set to zeros.
theta0 = 0
theta1 = 0
2. Compute Gradients

Using the update formulas above:

pred = theta0 + (theta1 * x)
tempTheta0 = sum(pred - y) / Decimal(len(x))
tempTheta1 = sum((pred - y) * x) / Decimal(len(x))
3. Update Parameters
theta0 -= learning_rate * tempTheta0
theta1 -= learning_rate * tempTheta1
4. Calculate Cost Function
mse = sum(((theta0 + (theta1 * x)) - y) ** 2) / Decimal(len(x))
5. Repeat
- Repeat steps 2-4 for a number of iterations of your choice, or until the parameters converge or the change falls below a predefined stopping threshold (a combined sketch of the full loop follows this list).
- Convergence means that the parameters are no longer changing significantly with further iterations, indicating that the algorithm has found the minimum of the cost function.
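Putting the five steps together, here is a minimal sketch of the whole training loop. It uses plain NumPy floats rather than Decimal, and the function name train and the default hyperparameter values are illustrative, not the project's actual code:

```python
import numpy as np

def train(x, y, learning_rate=0.1, iterations=100000, stopping_threshold=1e-9):
    """Fit y = theta0 + theta1 * x by gradient descent.
    x and y are 1-D NumPy arrays, ideally normalized beforehand."""
    theta0, theta1 = 0.0, 0.0                              # step 1: initialize parameters
    m = len(x)
    prev_mse = float("inf")
    for _ in range(iterations):
        pred = theta0 + theta1 * x                         # step 2: compute gradients
        temp_theta0 = np.sum(pred - y) / m
        temp_theta1 = np.sum((pred - y) * x) / m
        theta0 -= learning_rate * temp_theta0              # step 3: update parameters
        theta1 -= learning_rate * temp_theta1
        mse = np.sum((theta0 + theta1 * x - y) ** 2) / m   # step 4: cost function
        if abs(prev_mse - mse) < stopping_threshold:       # step 5: stop on convergence
            break
        prev_mse = mse
    return theta0, theta1
```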
Overfitting occurs when a model learns the training data too well, resulting in poor performance on new, unseen data. Overfitting is not acceptable because it compromises the model's ability to generalize from the training data to other data.
In this implementation, the stopping threshold and learning rate are used to avoid overfitting:
- Stopping Threshold: By defining a threshold for minimal changes in the cost function or parameter values, we can prevent the model from continuing to learn the noise in the training data, which helps avoid overfitting.
- Learning Rate (α): A carefully chosen learning rate ensures that the model learns efficiently without making excessively large parameter updates, which can cause overfitting or divergence.
The learning rate α, the number of iterations, and the stopping threshold can be defined by the developer. These hyperparameters may vary from model to model based on the specific requirements and nature of the data.
- Learning Rate (α): Controls the size of the steps taken to reach the minimum of the cost function.
- Number of Iterations: Defines how many times the gradient descent steps are repeated.
- Stopping Threshold: Determines when to stop the iterations based on minimal change in the cost function or parameter values.
These settings can be adjusted to optimize the model's performance and ensure it converges appropriately without overfitting.
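For example, reusing the train sketch above (with x_norm and y_norm standing in for normalized training data), a plausible configuration might look like this; the values are illustrative, not the project's actual settings:

```python
LEARNING_RATE = 0.1        # alpha: size of each gradient step
ITERATIONS = 100000        # upper bound on gradient descent iterations
STOPPING_THRESHOLD = 1e-9  # stop once the cost improves less than this

theta0, theta1 = train(x_norm, y_norm,
                       learning_rate=LEARNING_RATE,
                       iterations=ITERATIONS,
                       stopping_threshold=STOPPING_THRESHOLD)
```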
Before starting the gradient descent algorithm, data needs to be normalized. Normalization scales the features of your data to a specific range, often [0, 1] or [-1, 1], to ensure they have a consistent scale.
Min-max normalization rescales each value into the $[0, 1]$ range:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Its benefits for gradient descent are:
- Faster Convergence: Normalized features lead to more efficient gradient descent steps, reducing the number of iterations needed.
- Numerical Stability: Prevents overflow/underflow issues by keeping feature values within a manageable range.
- Equal Influence: Ensures all features contribute equally to the gradient updates.
def normalize(x):
    # Min-max normalization: scale values into the [0, 1] range
    x = (x - x.min()) / (x.max() - x.min())
    return x
Because the model is trained on normalized data, its predictions are also produced on the normalized scale. To obtain accurate values, denormalize the predictions at the end of the algorithm, converting them back to their original scale. This maintains the integrity of the data and yields meaningful predictions.
def denormalize(x, original_x):
    # Map normalized values back to the original scale of original_x
    x = x * (max(original_x) - min(original_x)) + min(original_x)
    return x
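Tying the helpers together, a hypothetical end-to-end flow might look like the following; the mileage and price arrays are made-up stand-ins for the dataset, and train refers to the sketch above:

```python
import numpy as np

mileage = np.array([240000.0, 139800.0, 150500.0, 185530.0])  # made-up inputs
price = np.array([3650.0, 3800.0, 4400.0, 3700.0])            # made-up targets

# Train on normalized data for stable, fast convergence
theta0, theta1 = train(normalize(mileage), normalize(price))

# Predict on the normalized scale, then map back to the original scale
pred_norm = theta0 + theta1 * normalize(mileage)
predictions = denormalize(pred_norm, price)
```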
To evaluate the precision of the linear regression model, I used the coefficient of determination, $R^2$:

$$R^2 = 1 - \frac{RSS}{TSS}$$

- $TSS$ (Total Sum of Squares) measures the total variance in the target variable:

  $$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

  Here, $\bar{y}$ is the mean of the observed values $y_i$.

- $RSS$ (Residual Sum of Squares) measures the variance that the model fails to explain:

  $$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

  Here, $\hat{y}_i$ is the value predicted by the model for the $i$-th data point.

Using the Python code, the $R^2$ score is computed as follows:
y_mean = np.mean(y)
y_pred = theta0 + (theta1 * x_data)
tss = np.sum((y - y_mean) ** 2)
rss = np.sum((y_pred - y) ** 2)
r_squared = 1 - (rss / tss)
The closer the $R^2$ value is to 1, the more of the variance in the data the model explains; a value near 0 means the model performs no better than always predicting the mean.
To isolate the packages installed from your local system and to maintain package consistency in each launch, I implemented two approaches: Docker with X11 forwarding and a Python virtual environment (venv). Both methods ensure that you can display graphical outputs from Python applications without any issues, including those displayed via MobaXterm.
- Docker and X11 Forwarding approach can be found in the main branch.
- Python Virtual Environment approach can be found in the submit_version branch.
What is X11 Forwarding?
X11 forwarding allows you to run applications with a graphical user interface (GUI) on a remote machine while displaying the GUI on your local machine. This is useful for running applications in a Docker container and viewing the GUI on your host system.
- X11 Server: Ensure you have an X11 server running on your host system. You can use:
- macOS: XQuartz
- Windows: MobaXterm
- Linux: An X11 server is usually pre-installed.
- Run the Script
sh launch.sh setup
- When prompted:
- Confirm that your X11 server is ready.
- Specify whether your X11 server is launched locally or remotely.
- For local: The script will automatically set the IP address of your host.
- For remote: Enter the IP address of the host running the X11 server when prompted.
This setup ensures that the GUI from the Docker container will be forwarded to your host system's X11 server.
launch.ft_linear_regression.mp4
- To Train the Model
sh launch.sh train
- To Predict Prices
sh launch.sh predict
- To Calculate Precision
sh launch.sh precise
- To Clean the Environment (if needed)
sh launch.sh clean
What is venv?
venv is a tool in Python that creates an isolated environment for your Python projects. This means that all dependencies and packages are installed in an isolated directory, avoiding conflicts with other projects and system-wide packages.
- Run the Script
source ./launch.sh setup
source ./launch.sh install
This command will:
- Create a virtual environment in the venv directory (if it doesn't already exist).
- Activate the virtual environment.
- Install the necessary Python packages (matplotlib, numpy, pillow, and pandas).
launch.ft_linear_regression.mp4
- To Train the Model
source ./launch.sh train
- To Predict Prices
source ./launch.sh predict
- To Calculate Precision
source ./launch.sh precise
- To Clean the Environment (if needed)
source ./launch.sh clean