"Loss function" is one of the most basic concepts today in deep learning.Despite that,it is actually not necessarily a good programming abstraction whendesigning general-purpose systems. A system should not assume thata model always comes together with a "loss function".
"Loss Function": Separation of Logic
"Loss function" may mean different things in different systems.The version I'm going to criticize is the most common one that looks like below:
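(A sketch; the exact bodies vary by framework. The second variant is worse because it additionally assumes that the data splits into inputs and labels.)

```python
# Bad: the trainer assumes the loss comes from a separate function
# applied to the model's outputs.
def trainer_bad(data, model, loss_func):
    for inputs in data:
        outputs = model(inputs)
        loss = loss_func(outputs, inputs)
        loss.backward()
        # ... optimizer step, etc.


# Worse: additionally assumes data splits into (inputs, labels), and
# that the loss needs nothing from the model except its outputs.
def trainer_worse(data, model, loss_func):
    for inputs, labels in data:
        outputs = model(inputs)
        loss = loss_func(outputs, labels)
        loss.backward()
        # ... optimizer step, etc.
```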
The key property of the bad "loss function" abstraction is: users are asked to provide a "loss function" that's executed after the "model / forward logic". Such an abstraction appears in a few open source systems: Keras model.compile(loss=), fast.ai Learner(loss_func=), Lingvo BaseModel.ComputeLoss.
The main problem is not with the function itself, but that users' algorithm logic is forced to be split into two parts: model and loss_func.
As an alternative, trainer_good below no longer separates "loss_func" from the model, yet has the same functionality as trainer_bad.
```python
def trainer_good(data, model):
    for inputs in data:
        loss = model(inputs)  # the model itself computes the loss
        loss.backward()
        # ... optimizer step, etc.
```
In this article, I want to argue that this is a better design because:
- Separating out "loss function" can be troublesome for many reasons, so we should not force it.
- Users can still split their model into two parts if they like, but they don't have to.
- There is not much value in letting the trainer be aware of the separation.
(Apparently, trainer_good == partial(trainer_bad, loss_func=lambda x, y: x). So trainer_bad can still be used - we just set loss_func to a no-op if we don't like it. But trainer_good is cleaner.)
Problems of a Forced Separation
It's true that the separation can be useful for certain types of models. But it's not always the case, and enforcing it can be harmful instead.
Duplication between "Model" and "Loss"
The separation is not convenient for a model with many optional losses. Take a multi-task model for example.
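Below is a sketch of the two designs (backbone, head1, loss1, and the task1_on flags are illustrative names):

```python
import torch.nn as nn

# With separation: the branches that enable each task appear twice.
class MultiTaskModel(nn.Module):
    def __call__(self, inputs):
        features = backbone(inputs)
        outputs = {}
        if task1_on:
            outputs["task1"] = head1(features)
        if task2_on:
            outputs["task2"] = head2(features)
        return outputs


def loss_func(outputs, inputs):
    losses = []
    if task1_on:  # the same branching, duplicated
        losses.append(loss1(outputs["task1"], inputs))
    if task2_on:
        losses.append(loss2(outputs["task2"], inputs))
    return sum(losses)
```

```python
# Without separation: each branch appears exactly once.
class MultiTaskModel(nn.Module):
    def __call__(self, inputs):
        features = backbone(inputs)
        losses = []
        if task1_on:
            losses.append(loss1(head1(features), inputs))
        if task2_on:
            losses.append(loss2(head2(features), inputs))
        return sum(losses)
```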
The second version is simpler in that it does not duplicate the branches that enable different tasks/losses. In reality, these conditions can be more complex than a simple if, and branching is generally less straightforward to maintain. So it's beneficial to not have to repeat the logic.
Note: If you think a wrapper like multi_loss_func({"task1": loss_func1, "task2": loss_func2}) will help (like what Keras supports), it is not going to work well because it doesn't know how to route the inputs/outputs to loss functions.
"Loss" is not Independent of "Model"
One may argue that separating "loss" from "model" is nice because then we can easily switch between different loss functions independently of the "model". However, in many algorithms, loss computation is simply not independent of the model and should not be switched arbitrarily. This could be due to:
Loss computation depends on internal states computed during model.forward, e.g.:
- Loss needs to know which part of training data is sampled during forward.
- Some predicted auxiliary attributes control whether a sample in a batch should participate in losses.
- Losses such as activation regularization should naturally happen during forward.
In these cases, forcing a separation of "loss" and "model" requires the "model" to return its internal states, causing an abstraction leak.
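For instance, a sketch of the first bullet above (sample_subset and the other helpers are hypothetical names):

```python
import torch.nn as nn

# With separation: the model must also return sampled_idx, an internal
# detail that exists only so the external loss function can consume it.
class Model(nn.Module):
    def __call__(self, inputs):
        features = backbone(inputs)
        sampled_idx = sample_subset(features)  # internal sampling decision
        return predictor(features[sampled_idx]), sampled_idx


def loss_func(outputs, inputs):
    predictions, sampled_idx = outputs  # the abstraction leak
    return cross_entropy(predictions, inputs["labels"][sampled_idx])
```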
Different loss functions expect different representations of the model's predictions. These representations could be:
- Discrete vs. one-hot encoding of class labels.
- Boxes as absolute coordinates, or as reference anchors plus offsets.
- Segmentation masks as polygons, binary bitmasks, or many other formats.
Since conversion between representations may be expensive or lossy, we'd want the model to produce the exact representation needed by loss computation. Therefore, a separation would not make the model independent of losses. On the contrary, it's even worse, because loss-related logic will be unnaturally split like this:
Separation:

```python
class Model(nn.Module):
    def __call__(self, inputs):
        hidden_representation = layers(inputs)
        if use_loss1:
            # Use proper representation for loss1
            return predict1(hidden_representation)
        if use_loss2:
            return predict2(hidden_representation)


def loss_func(predictions, inputs):
    if use_loss1:
        return loss1(predictions, inputs)
    if use_loss2:
        return loss2(predictions, inputs)
```

No separation:

```python
class Model(nn.Module):
    def __call__(self, inputs):
        hidden_representation = layers(inputs)
        if use_loss1:
            pred = predict1(hidden_representation)
            return loss1(pred, inputs)
        if use_loss2:
            pred = predict2(hidden_representation)
            return loss2(pred, inputs)
```

Separation also increases the chance of bugs where a loss function is called on the wrong representation. In the version with no separation, it's very clear that the losses are computed using the right representation.
No Clean Separation
One may argue that the separation is helpful because it's nice to let the "model" return the same data in training and inference. This makes sense for simple models where training and inference share most of the logic. For example, in a standard classification model shown below, we can let the "model" object return logits, which will be useful in both training and inference.
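A minimal sketch (with made-up layer sizes):

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
        self.fc = nn.Linear(128, 10)

    def forward(self, images):
        # Logits serve both modes: training applies cross-entropy to
        # them; inference applies softmax / argmax.
        return self.fc(self.backbone(images))
```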
But many models don't have a clean separation like this. In theory, training and inference only have to share (some) trained weights, but don't necessarily have to share any logic. Many object detection models, for example, do not compute "predictions" in training and do not compute losses in inference. In the Region Proposal Network (RPN) of a two-stage detector, for instance, the (simplified) training-time data flow goes from image features and ground-truth boxes through anchor matching straight into losses; the decoded box predictions that inference would produce never appear.
Any attempt to split a complicated algorithm like this into "model" and "loss function" will:
- Force "model" to return its internal algorithmic details that are not useful anywhere,except in a corresponding "loss function". This is an abstraction leak.
- Make code harder to read, because logic is separated in an unnatural way.
- Lead to higher memory usage (if executing eagerly), because some internal states have to be kept alive until "loss function" is called after the "model".
Therefore, it's unrealistic to expect there is a nice separation, and that "model" can produce a consistent format in both training and inference. A better design is to include loss computation in the model's training-mode forward, i.e., let the model output losses in training, but predictions in inference.
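A sketch of this design (compute_losses and predict are illustrative names):

```python
import torch.nn as nn

class DetectionModel(nn.Module):
    def forward(self, batch):
        features = self.backbone(batch["image"])
        if self.training:  # nn.Module's built-in training flag
            # Training consumes ground truth and returns only losses.
            return self.compute_losses(features, batch["annotations"])
        # Inference decodes and post-processes only predictions.
        return self.predict(features)
```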
Trainer Does Not Need to Know about the Separation
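Concretely, these are the trainer_bad and trainer_good from the beginning of the article:

```python
# Separation: the trainer computes, and therefore sees, the "outputs".
def trainer_bad(data, model, loss_func):
    for inputs in data:
        outputs = model(inputs)
        loss = loss_func(outputs, inputs)
        loss.backward()


# No separation: the trainer only ever sees the final loss.
def trainer_good(data, model):
    for inputs in data:
        loss = model(inputs)
        loss.backward()
```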
In the "no separation" design, users provide a "model" that returns losses.This model internally can still have separation of "loss function" and "forward logic"as long as it makes sense.However, this trainer is no longer aware of the separation.It makes a difference because the trainer can no longer obtain the "outputs".
Will this become a limitation of the "no separation" design, if we'd like to do something with "outputs"? My answer is:
- For 99% of the use cases where the "outputs" don't directly affect the training loop structure, the trainer doesn't need to know about "outputs".
- For some use cases where the trainer does something (not affecting loop structure) with "outputs", a proper design would move such responsibility elsewhere.
- For example, writing "outputs" to tensorboard shouldn't be a responsibility of the trainer. A common approach is to use a context-based system that allows writing to tensorboard from anywhere during training, allowing users to simply call write_summary(outputs) in their model (see the sketch after this list).
- For other obscure use cases, users should probably write their own custom trainers anyway.
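A minimal sketch of such a context-based system (summary_context and write_summary are illustrative names, not any particular library's API):

```python
from contextlib import contextmanager

_WRITER = None  # the active writer, installed by the training script

@contextmanager
def summary_context(writer):
    # Make `writer` available globally for the duration of training.
    global _WRITER
    _WRITER = writer
    try:
        yield
    finally:
        _WRITER = None

def write_summary(name, value):
    # Callable from anywhere, e.g. inside a model's forward.
    if _WRITER is not None:
        _WRITER.add_scalar(name, value)

# Usage: the training script opens the context; models write from within.
# with summary_context(SummaryWriter(log_dir)):
#     for inputs in data:
#         loss = model(inputs)  # forward may call write_summary() freely
```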
Summary
Design is always a trade-off. Adding assumptions to a system might result in some benefits, but at the same time can cause trouble when the assumption isn't true. Finding a balance in between is difficult and often subjective.
The assumption that models have to come together with a separate "loss function", in my opinion, brings more trouble than it's worth.