Comprehension questions

These questions should be answerable using only the core resources above. Write your answers in a Google Doc.

  1. In a neural network, a _____ is a number that defines how much one neuron should influence another. Answer: Weight

  2. In a neural network, a _____ is a number that adds a constant amount to a neuron's activation. Answer: Bias

  3. What's the value of A, B and C in this neural network that uses the sigmoid function?

Answer: A = 0.6456563, B = 0.4255575, C = 0.93801480
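The network diagram itself isn't reproduced here, but the calculation can be sketched: the sigmoid function squashes a neuron's weighted input (pre-activation) into the range (0, 1). The pre-activation values 0.6 and -0.3 below are assumptions chosen to reproduce A and B above; the diagram's actual weights and inputs would determine them.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1 / (1 + math.exp(-z))

# Pre-activations of 0.6 and -0.3 (assumed) reproduce A and B:
print(round(sigmoid(0.6), 7))   # 0.6456563  (= A)
print(round(sigmoid(-0.3), 7))  # 0.4255575  (= B)
```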

  1. What does the cost function represent when training a neural network? Answer: The cost function, also known as the loss function, is a mathematical function that takes the model's prediction and the ground-truth (target) value as inputs and outputs the error, or loss, of that prediction. This error measures how far the model's predictions are from the targets, and therefore how much the model has learnt from the data. It is a relative metric, not an absolute one: you cannot meaningfully compare the loss of one model with the loss of a different model.
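As a minimal sketch of this idea, here is one common cost function, mean squared error; the prediction and target values are illustrative:

```python
def mse(predictions, targets):
    # Mean squared error: the average of the squared prediction errors.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Predictions close to the targets give a small loss;
# predictions far from the targets give a larger one.
close = mse([0.9, 0.1], [1.0, 0.0])  # ~0.01
far = mse([0.5, 0.5], [1.0, 0.0])    # ~0.25
```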

  2. Why does minimising the cost function improve the neural network's performance? Answer: Minimising the cost function improves the network's performance because it reduces the difference between the predicted outputs and the actual target values. Training is an iterative process in which an optimisation algorithm such as gradient descent adjusts the network's weights and biases: the gradient of the cost function with respect to each parameter tells the network how to make small adjustments that reduce the error. As the cost decreases, the network's predictions become more accurate, improving its performance on the task at hand. The gradients themselves are computed efficiently by backpropagation, which is fundamental to training neural networks.
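As a toy illustration of cost minimisation (the one-weight model `y_hat = w * x` and the data are hypothetical), the loss falls toward zero as gradient descent repeatedly adjusts the weight:

```python
# Toy model: y_hat = w * x, trained to fit y = 2x with squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05

def loss(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

history = [loss(w)]
for _ in range(20):
    # dL/dw for squared error: the mean of 2 * (w*x - y) * x.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
    history.append(loss(w))

# The cost shrinks toward 0 as w approaches the best value, 2.
print(history[0], history[-1])
```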

  3. How does gradient descent work - i.e. minimise the cost function? (at a high level) Answer: Gradient descent is an optimization algorithm used to minimize the cost function in neural networks. At a high level, it works by calculating the gradient (or derivative) of the cost function with respect to the network's parameters (weights and biases). The gradient indicates the direction and rate of the steepest increase in the cost function. By moving in the opposite direction of the gradient, we can reduce the cost. In each iteration, the weights and biases are updated by subtracting a fraction of the gradient (controlled by the learning rate), gradually leading to a minimum cost and improved network performance.
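The update rule described above can be sketched on a one-parameter cost function (the function and learning rate here are illustrative):

```python
# Minimise f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
w = 0.0
learning_rate = 0.1

for _ in range(100):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient  # step in the opposite direction of the gradient

print(w)  # ~3.0, the minimum of f
```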

  4. Does gradient descent guarantee finding the best model? Why or why not? Answer: Gradient descent does not guarantee finding the best model due to issues like local minima, saddle points, the choice of learning rate, and the initial weights.

Advanced techniques such as momentum, adaptive learning rates (e.g., Adam, RMSprop), and ensemble methods are often used to improve the chances of finding a better model.
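The local-minimum problem can be demonstrated on a hypothetical one-dimensional cost surface with two minima: depending on where it starts, plain gradient descent settles into different minima with different final costs.

```python
def f(w):
    # A cost surface with two minima; the left one is lower (the global minimum).
    return (w**2 - 1)**2 + 0.3 * w

def grad(w):
    return 4 * w * (w**2 - 1) + 0.3

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_left, w_right = descend(-0.5), descend(0.5)
# The right-hand start gets stuck in the worse (local) minimum.
print(f(w_left), f(w_right))
```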

  1. Explain how this excerpt from our website might be turned into data to train an LLM: “Axiom Futures was founded in March 2024 in India”

Answer: To turn this excerpt from the website into data to train an LLM, we first have to consider what type of LLM we want to build. In this case, let's assume we are building a causal LLM, where text generation, i.e. next-token prediction, is the goal. Next-token prediction is a self-supervised, autoregressive task: the token predicted from the previous sequence is appended to that sequence, and the result becomes the input sequence for predicting the next token.
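The setup described above can be sketched with a toy whitespace tokenizer (real LLMs use subword tokenizers such as BPE; everything here is illustrative):

```python
text = "Axiom Futures was founded in March 2024 in India"

# Toy whitespace "tokenizer" -- real pipelines use subword tokenizers (e.g. BPE).
tokens = text.split()

# Self-supervised next-token pairs: each prefix predicts the token that follows it,
# so the text provides its own labels with no separate annotation needed.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs[:3]:
    print(context, "->", target)
# ['Axiom'] -> Futures
# ['Axiom', 'Futures'] -> was
# ['Axiom', 'Futures', 'was'] -> founded
```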

Since this task needs no separate labelling, we can proceed straight to data processing, which involves the following steps.