When we train a deep learning model, we need to set a loss function to minimize the error. The loss function indicates how much each variable contributes to the value being optimized.
In the deep learning context, the loss function measures the quality of a particular set of parameters based on how well the output of the network agrees with the ground-truth labels in the training data.
- The loss function is used to guide the training process: we search for a set of parameters that reduces its value
- loss function = cost function = objective function = error function
- The loss function can be written as an average over the losses of individual training examples:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i; \theta),\, y_i\big)$$
Empirical Risk Minimization (ERM)
Let a loss function $\ell(y, \hat{y})$ be given that penalizes deviations between the true class $y$ and the estimated one $\hat{y}$. The empirical risk (the average loss, or error, of an estimator) of a decision strategy $h$ is the total loss over the training set:

$$R_{\mathrm{emp}}(h) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(y_i,\, h(x_i)\big)$$

It should be minimized with respect to the decision strategy $h$.
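As a minimal sketch of the definition above (the 0–1 loss and the toy threshold rule are my own choices for illustration), the empirical risk is just the average per-example loss of a decision strategy:

```python
import numpy as np

def empirical_risk(predict, X, y, loss):
    """Average loss of decision strategy `predict` over the training set."""
    return np.mean([loss(yi, predict(xi)) for xi, yi in zip(X, y)])

# 0-1 loss: penalize any deviation between the true and estimated class
zero_one = lambda y_true, y_pred: float(y_true != y_pred)

X = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([0, 0, 1, 1])
threshold_rule = lambda x: int(x > 0)   # toy decision strategy

print(empirical_risk(threshold_rule, X, y, zero_one))  # 0.0 — the rule classifies all four points correctly
```

Minimizing the empirical risk then means searching over decision strategies (here, e.g., over the threshold) for the one with the smallest average loss.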
For a regression problem, the network predicts continuous, numeric variables. Loss functions for regression include the absolute value (L1) and the square error (L2), among others.
L1-Norm (Absolute Value)
The L1-norm loss function minimizes the sum of the absolute differences between the target values $y_i$ and the estimated values $f(x_i)$:

$$L_1 = \sum_{i=1}^{N} \big|\, y_i - f(x_i) \,\big|$$
Pros:
- Robust: less sensitive to outliers
- Produces sparser solutions
- Good in high-dimensional spaces
- Fast prediction

Cons:
- Unstable solution (possibly multiple solutions)
- Computationally inefficient in non-sparse cases
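A small numeric illustration of the robustness claim (the arrays and names are my own): when one target value is corrupted by an outlier, the L1 loss grows only linearly in the outlier's magnitude.

```python
import numpy as np

def l1_loss(y_true, y_pred):
    # Sum of absolute differences between targets and estimates
    return np.sum(np.abs(y_true - y_pred))

y_pred  = np.array([1.0, 2.0, 3.0])
clean   = np.array([1.1, 2.1, 3.1])
outlier = np.array([1.1, 2.1, 13.0])   # one corrupted target

print(l1_loss(clean, y_pred))    # ≈ 0.3
print(l1_loss(outlier, y_pred))  # ≈ 10.2 — grows only linearly in the outlier
```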
L2-Norm (Square Error, Euclidean Loss)
The L2-norm loss function minimizes the sum of the squares of the differences between the target values $y_i$ and the estimated values $f(x_i)$:

$$L_2 = \sum_{i=1}^{N} \big( y_i - f(x_i) \big)^2$$
Pros:
- Penalizes large errors more strongly, and is often more precise than the L1-norm
- Stable solution (always exactly one solution)
- Computationally efficient due to having analytical solutions

Cons:
- Sensitive to outliers
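The "analytical solutions" point can be made concrete with least squares (a sketch; the synthetic data and variable names are mine): minimizing the L2 loss of a linear model has a unique closed-form solution via the normal equations.

```python
import numpy as np

def l2_loss(y_true, y_pred):
    # Sum of squared differences between targets and estimates
    return np.sum((y_true - y_pred) ** 2)

# Fit y ≈ X w by minimizing the L2 loss: the unique minimizer solves
# the normal equations (X^T X) w = X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true))  # True — a single, stable solution
```

No such closed form exists for the L1 loss, which is why L1 regression typically needs iterative solvers.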
For a classification problem, the network predicts categorical variables. Loss functions for classification include the hinge loss and the cross-entropy loss, among others.
Square Loss
Square loss is more commonly used in regression, but it can be utilized for classification (with labels $y \in \{-1, +1\}$) by re-writing it as a function of the margin $y f(x)$:

$$V\big(f(x), y\big) = \big(1 - y f(x)\big)^2$$

The square loss function is both convex and smooth, and matches the 0–1 loss when $y f(x) = 0$ and when $y f(x) = 1$.
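A quick numeric check of the two matching points (function names are mine; the 0–1 loss here counts a margin of exactly 0 as an error):

```python
def square_loss(margin):
    # Square loss for classification, as a function of the margin y*f(x)
    return (1.0 - margin) ** 2

def zero_one_loss(margin):
    # 0-1 loss: 1 for a misclassification (margin <= 0), else 0
    return 1.0 if margin <= 0 else 0.0

print(square_loss(0.0), zero_one_loss(0.0))  # 1.0 1.0 — agree at margin 0
print(square_loss(1.0), zero_one_loss(1.0))  # 0.0 0.0 — agree at margin 1
```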
Hinge Loss
The hinge loss is used for maximum-margin classification tasks, most notably for support vector machines (SVMs). For an intended output $t = \pm 1$ and a classifier score $y$, the hinge loss of the prediction $y$ is defined as

$$\ell(y) = \max\big(0,\; 1 - t \cdot y\big)$$

Note that $y$ should be the raw output of the classifier's decision function, not the predicted class label. For example, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the point to classify. When $t$ and $y$ have the same sign ($y$ predicts the right class) and $|y| \geq 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $y$.
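A minimal sketch of the three regimes described above, with the score values chosen for illustration:

```python
def hinge_loss(t, y):
    # t: intended output in {-1, +1}; y: raw classifier score w·x + b
    return max(0.0, 1.0 - t * y)

print(hinge_loss(+1, 2.5))   # 0.0 — same sign and |y| >= 1: outside the margin
print(hinge_loss(+1, 0.3))   # 0.7 — right class, but inside the margin
print(hinge_loss(+1, -2.0))  # 3.0 — opposite signs: grows linearly with the score
```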
Logistic Loss
The logistic loss displays a similar convergence rate to the hinge loss, and since it is continuous (indeed, smooth), gradient descent methods can be utilized. The logistic loss function is defined as

$$V\big(f(x), y\big) = \frac{1}{\ln 2} \ln\big(1 + e^{-y f(x)}\big)$$
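A quick check of this definition (the $1/\ln 2$ normalization makes the loss exactly 1 at margin 0, matching the 0–1 loss at the decision boundary; the function name is mine):

```python
import math

def logistic_loss(margin):
    # Logistic loss as a function of the margin y*f(x), scaled by 1/ln(2)
    return math.log(1.0 + math.exp(-margin)) / math.log(2.0)

print(logistic_loss(0.0))   # 1.0 — same value as the 0-1 loss at the boundary
print(logistic_loss(4.0))   # ≈ 0.026 — decays smoothly for confident correct predictions
print(logistic_loss(-4.0))  # ≈ 5.8 — grows roughly linearly for confident mistakes
```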
Cross Entropy Loss
Cross-entropy loss is a commonly used loss function for deep neural network training. It is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. It is not naturally represented as a product of the true label and the predicted value, but it is convex and can be minimized using stochastic gradient descent methods. Using the alternative label convention $t = (1 + y)/2$, so that $t \in \{0, 1\}$, the cross-entropy loss is defined as

$$V\big(f(x), t\big) = -t \ln\big(p(x)\big) - (1 - t) \ln\big(1 - p(x)\big)$$

where $p(x)$ is the predicted probability that $x$ belongs to the positive class.
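The binary case can be sketched as follows (the clamping with `eps` is my own guard against `log(0)`, a standard numerical-stability trick):

```python
import math

def cross_entropy(t, p, eps=1e-12):
    # t in {0, 1}: true label; p in (0, 1): predicted probability of class 1
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    return -t * math.log(p) - (1.0 - t) * math.log(1.0 - p)

print(round(cross_entropy(1, 0.9), 4))  # 0.1054 — confident correct prediction: small loss
print(round(cross_entropy(1, 0.1), 4))  # 2.3026 — confident wrong prediction: large loss
```

Note how the loss depends on the predicted probability, not just the predicted class: pushing $p(x)$ toward the wrong extreme is penalized without bound, which is what makes the gradient signal informative during training.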