YOLOv3 — Implementation with Training setup from Scratch

Sanna Persson
22 min read · Mar 21, 2021


For such a popular paper there are still few fully explained implementations of the YOLOv3 architecture written completely from scratch. I’ll do my best to add something useful to the list. The code is written together with Aladdin Persson and can be found on GitHub. You can also download pretrained weights on Pascal VOC that obtain 78.1 mAP here for the implementation below.

Prerequisites

  • Understanding the major parts in YOLOv1
  • Coding in PyTorch
  • Familiarity with convolutional networks and their training

With this article I hope to convey:

  • Understanding of the key ideas necessary for implementing and training YOLOv3 from scratch in PyTorch
  • Complete code to use for training of YOLOv3
  • The relevant details of the algorithm to succeed if you choose to make your own implementation of YOLOv3

The code is completely runnable if you download the utils.py and config.py files from the GitHub repo above, which contain a few supporting functions and constants not specific to the YOLOv3 model.

Disclaimer: there are minor differences between this implementation and the original and I will point them out when we get to them.

Understanding the model

Let’s begin by understanding the fundamentals of the model. The YOLO (You Only Look Once) algorithm is based on the idea of dividing the image into a grid with side S. The grid size depends on which YOLO version we are implementing as well as on the input image size, but the details will become clearer when we implement it. Each grid cell is responsible for making predictions of bounding boxes. You may then wonder what happens if an object covers several grid cells: will all of them predict a bounding box for the object? YOLO solves this by making only the cell containing the object’s midpoint responsible for predicting the bounding box. This means that only one grid cell is responsible for each object’s bounding box in the image. One drawback of this is that there can only be one bounding box in each grid cell. In YOLOv2 and onward they mitigate the issue by making several bounding box predictions in the same grid cell. They also introduce anchor boxes, an idea also seen in previous object detection papers such as Faster R-CNN.

The network makes a prediction for each grid cell.

An anchor box is essentially a pair of a width and a height chosen to represent a part of the training data. For example, a standing rectangle may suit a human while a wide rectangle is a better fit for a car. Using anchor boxes is a way of encoding knowledge about the training data into the model to help it make appropriate predictions. It has been discussed whether this is actually desirable, and there are more recent end-to-end approaches where anchor boxes are not used. The question is then how to choose the anchors. An early approach was to hand-design the anchor boxes by studying the training data; however, the authors of YOLOv2 found that using K-means clustering to generate them yielded better results. The anchors allow the model to anchor its prediction to a predetermined box: the model predicts how much the true bounding box is offset compared with the anchor. This is one of the major differences from the original YOLO model. Each grid cell will have several anchor boxes and each anchor box can make one bounding box prediction. Each bounding box prediction will also be coupled with an object score as well as class predictions. The object score should reflect the product of the probability that there is an object in the bounding box and the intersection over union between the predicted bounding box and the actual object. That means that if there is no object in the grid cell corresponding to the specific anchor, the target is zero, and otherwise it is the intersection over union between the predicted box and the target bounding box.

The predictions from the model, tx, ty, tw and th, are offsets to the anchors and will be converted to bounding boxes according to the following equations from the paper:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

where (cx, cy) is the top-left corner of the grid cell the prediction belongs to, pw and ph are the width and height of the corresponding anchor box, σ is the sigmoid function and (bx, by, bw, bh) is the resulting bounding box.

In YOLOv3 the backbone network is Darknet-53 and its structure can be understood from the following table. This network was pretrained on ImageNet and is used as a feature extractor in the YOLOv3 model. The paper, however, completely skips detailing the following 53 convolutional layers of the YOLOv3 model, where the actual prediction of bounding boxes takes place.

Table from YOLOv3 paper

The prediction of bounding boxes happens at three different places in the network, on three different scales. In this context a scale means the grid size, S, that we divide the image into. In YOLOv3 we predict bounding boxes on three different grid sizes. The intuition behind this is that larger objects are more easily detected on a coarser grid and, vice versa, smaller objects on finer grids. We therefore also divide the anchor boxes we have found such that we assign the smallest anchors to the last and finest scale and the largest anchor boxes to the coarsest grid. In YOLOv3 the grid sizes used are [13, 26, 52] for an image size of 416x416. If you use another image size the first grid size will be the image size divided by 32 and each of the following ones will be double the previous one. The details of the model will become clear when we implement it, but the following image gives great insight into the model architecture.

Image by Ayoosh Kathuria (check out his Medium)

The backbone network is a standard convolutional network, quite similar to previous Darknet versions, with the addition of residual connections. It is really after layer 53 that the interesting parts happen. As the image visualizes, there are three downward paths corresponding to predictions on three different grid scales. The network then continues forward from the place it was at before the prediction path. After the first and second scale prediction paths there is an upsampling layer that doubles the size of the feature map, and the result is concatenated with a route from a previous layer along the channel dimension. The image details which convolutional layers the routes come from, but we will instead use a trick to find them in our implementation.

We are now ready to start actually coding the model. All model details are found in the configuration file for YOLOv3 on the GitHub of Joseph Redmon, the author of the paper.

Coding the model

This is the part of the YOLOv3 implementation that I spent both the least and the most time debugging. I found it manageable to make the model work, but it took some time to correct details to make sure the original weights could be loaded.
Everything in this section will be in a model.py file on GitHub. Let’s start with the imports:
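A minimal version of the imports (the file on GitHub may pull in a few more helpers):

```python
import torch
import torch.nn as nn
```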

First we will define the architecture’s building blocks in a list, as a way of parsing the original config file that greatly increases the readability and grasp of the complete model.
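A sketch of such a config list, roughly mirroring the Darknet-53 backbone plus the detection head: tuples are convolutional blocks (out_channels, kernel_size, stride), ["B", n] is a residual block repeated n times, "S" marks a scale prediction branch and "U" an upsampling layer. Double-check the exact numbers against the original configuration file.

```python
config = [
    (32, 3, 1),
    (64, 3, 2),
    ["B", 1],
    (128, 3, 2),
    ["B", 2],
    (256, 3, 2),
    ["B", 8],
    (512, 3, 2),
    ["B", 8],
    (1024, 3, 2),
    ["B", 4],  # up to here: the Darknet-53 backbone
    (512, 1, 1),
    (1024, 3, 1),
    "S",
    (256, 1, 1),
    "U",
    (256, 1, 1),
    (512, 3, 1),
    "S",
    (128, 1, 1),
    "U",
    (128, 1, 1),
    (256, 3, 1),
    "S",
]
```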

Defining the building blocks

We will now define the most common building blocks of the architecture as separate classes to avoid repeating code over and over again. Each tuple in the config list signifies a convolutional block with batch normalization and leaky ReLU added to it.

This layer also allows us to set bn_act to False and skip the batch normalization and activation function, which we will use in the last layer before the output. In the case where we use batch normalization the bias term of the convolutional layer would have no effect other than occupying VRAM, so we turn it off.
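A sketch of the convolutional block under these assumptions (the class name CNNBlock is the one used throughout the rest of the article):

```python
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
        super().__init__()
        # the bias is redundant when followed by batch norm, so turn it off in that case
        self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.leaky = nn.LeakyReLU(0.1)
        self.use_bn_act = bn_act

    def forward(self, x):
        if self.use_bn_act:
            return self.leaky(self.bn(self.conv(x)))
        return self.conv(x)
```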

We then define the residual block, which is essentially a combination of two convolutional blocks with a residual connection. The number of channels is halved in the first convolutional layer and then doubled again in the second. The input size is therefore maintained through the residual block. As in the CNNBlock, we will have an argument that allows us to skip the residual connection, which we will use in parts of the architecture.
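A corresponding sketch of the residual block:

```python
class ResidualBlock(nn.Module):
    def __init__(self, channels, use_residual=True, num_repeats=1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_repeats):
            self.layers += [
                nn.Sequential(
                    CNNBlock(channels, channels // 2, kernel_size=1),
                    CNNBlock(channels // 2, channels, kernel_size=3, padding=1),
                )
            ]
        self.use_residual = use_residual
        self.num_repeats = num_repeats

    def forward(self, x):
        for layer in self.layers:
            x = layer(x) + x if self.use_residual else layer(x)
        return x
```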

The last predefined block we will use is the ScalePrediction, which consists of the last two convolutional layers leading up to the prediction for each scale. Here the image of the architecture above is actually slightly incorrect: this block includes the whole downward path except for the loss function. We will reshape the output such that it has the shape (batch size, anchors per scale, grid size, grid size, 5 + number of classes), where 5 refers to the object score and the four bounding box coordinates. To obtain this shape we have to permute the output such that the class predictions end up in the last dimension.
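A sketch of the scale prediction block, assuming three anchors per scale:

```python
class ScalePrediction(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pred = nn.Sequential(
            CNNBlock(in_channels, 2 * in_channels, kernel_size=3, padding=1),
            # no batch norm or activation on the output layer
            CNNBlock(2 * in_channels, (num_classes + 5) * 3, bn_act=False, kernel_size=1),
        )
        self.num_classes = num_classes

    def forward(self, x):
        # (batch, anchors per scale, S, S, 5 + num_classes)
        return (
            self.pred(x)
            .reshape(x.shape[0], 3, self.num_classes + 5, x.shape[2], x.shape[3])
            .permute(0, 1, 3, 4, 2)
        )
```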

Putting it all together into YOLOv3

We will now put it all together into the YOLOv3 model for the detection task. Most of the action takes place in the _create_conv_layers function where we build the model using the blocks defined above. Essentially we just loop through the config list that we created above and add the blocks in the correct order.

The trickiest part here is the case where there is an "S" in the config list, which means that we are on the last layers leading up to a prediction on a specific scale. In these cases we will have three convolutional layers (one residual block and one convolutional block) following the same pattern on all prediction scales. To avoid creating a mess in the config list it is easiest to just add them here before the ScalePrediction.

It should also be noted that we triple the in_channels after we add the upsampling layer, and this is due to the route that we will concatenate in the forward propagation, which has twice as many channels as the output from the upsampling layer.

This leads us into the structure of the forward function. In the first if-statement we check if the layer is a ScalePrediction block, and in that case we append its output to a list and later on compute the loss for each of the predictions separately. We then continue on in the model from the place where the ScalePrediction started.

I earlier mentioned that we will use a trick to find the layers that are routed forward. The second if-statement takes care of this and finds the route layers specified in the image of the architecture above, without us keeping track of unnecessarily complicated indices. The two routes will be the outputs from the residual blocks in the config list which have 8 repeats, which we found by just reading the original model configuration carefully. When we encounter an upsampling layer we will concatenate the output with the last route previously found, following the image of the architecture above.
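Putting the two methods together, a sketch of the full YOLOv3 module could look as follows; the exact layer bookkeeping may differ slightly from the file on GitHub:

```python
class YOLOv3(nn.Module):
    def __init__(self, in_channels=3, num_classes=20):
        super().__init__()
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.layers = self._create_conv_layers()

    def _create_conv_layers(self):
        layers = nn.ModuleList()
        in_channels = self.in_channels
        for module in config:
            if isinstance(module, tuple):
                out_channels, kernel_size, stride = module
                layers.append(
                    CNNBlock(
                        in_channels, out_channels,
                        kernel_size=kernel_size, stride=stride,
                        padding=1 if kernel_size == 3 else 0,
                    )
                )
                in_channels = out_channels
            elif isinstance(module, list):
                layers.append(ResidualBlock(in_channels, num_repeats=module[1]))
            elif module == "S":
                # three conv layers followed by the scale prediction
                layers += [
                    ResidualBlock(in_channels, use_residual=False, num_repeats=1),
                    CNNBlock(in_channels, in_channels // 2, kernel_size=1),
                    ScalePrediction(in_channels // 2, num_classes=self.num_classes),
                ]
                in_channels = in_channels // 2
            elif module == "U":
                layers.append(nn.Upsample(scale_factor=2))
                # the concatenated route has twice as many channels as the upsampled map
                in_channels = in_channels * 3
        return layers

    def forward(self, x):
        outputs = []
        route_connections = []
        for layer in self.layers:
            if isinstance(layer, ScalePrediction):
                outputs.append(layer(x))
                continue  # keep going from where the prediction branch started
            x = layer(x)
            if isinstance(layer, ResidualBlock) and layer.num_repeats == 8:
                route_connections.append(x)
            elif isinstance(layer, nn.Upsample):
                x = torch.cat([x, route_connections[-1]], dim=1)
                route_connections.pop()
        return outputs
```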

Before we move on to the data loading I’ll add a test function below that acts as a sanity check that the model at least outputs the correct shapes.
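A minimal sanity check along those lines:

```python
if __name__ == "__main__":
    num_classes = 20
    IMAGE_SIZE = 416
    model = YOLOv3(num_classes=num_classes)
    x = torch.randn((2, 3, IMAGE_SIZE, IMAGE_SIZE))
    out = model(x)
    assert out[0].shape == (2, 3, IMAGE_SIZE // 32, IMAGE_SIZE // 32, num_classes + 5)
    assert out[1].shape == (2, 3, IMAGE_SIZE // 16, IMAGE_SIZE // 16, num_classes + 5)
    assert out[2].shape == (2, 3, IMAGE_SIZE // 8, IMAGE_SIZE // 8, num_classes + 5)
    print("Success!")
```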

Loading the data

In the dataset class we will load an image and the corresponding bounding boxes, perform augmentation using the Albumentations library and then create the matrix form of the target that will be used to compute the loss. If you are not familiar with Albumentations, it is a library for data augmentation with official support for PyTorch that can be used for detection, segmentation and other tasks which require that augmentations are performed on both the image and the target.

We earlier mentioned that each scale will have anchor boxes associated with it, and in the data loading we will compute which cell and which anchor should be responsible for a particular target bounding box. Everything in this section will be in a dataset.py file.

Imports

Most of the imports we are using are standard for a dataset class in PyTorch, with the additional Albumentations package for the data augmentation. The imports from utils, however, require some additional explanation. In a utils.py file, which you can find on GitHub, we store some functions for handling bounding box conversions, non-max suppression and mean average precision. The only function that we will use in the data loading is the intersection over union function, which takes as input two tensors with the widths and heights of bounding boxes and outputs the corresponding intersection over union. The other functions we import from utils are only for checking that the data loading actually works. Plotting images and bounding boxes each time you modify the dataset class or augmentations can save you a lot of debugging time.
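A sketch of the imports; the helper names imported from utils are assumptions based on the description above, so match them to the actual utils.py:

```python
import os

import numpy as np
import pandas as pd
import torch
from PIL import Image, ImageFile
from torch.utils.data import Dataset, DataLoader

from utils import (
    iou_width_height as iou,     # IoU computed from widths and heights only (assumed name)
    non_max_suppression as nms,  # only used when visually checking the data loading
    cells_to_bboxes,
    plot_image,
)

ImageFile.LOAD_TRUNCATED_IMAGES = True
```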

Data Format

The part of the data loading that differs from image classification is the way we process the bounding boxes and format them so that they can be fed to the model. The data loading below assumes that the data is formatted such that you have a folder with all images, a folder with a text file for each image detailing the bounding boxes, and one or several csv files for the train, development and test sets. The text file for an image should be formatted such that each row corresponds to a bounding box of the image with class label, x coordinate, y coordinate, width, height in that specific order. The bounding box coordinates should be relative to the image, such that if an object has its midpoint in the middle of the image and covers half of it in both width and height we would specify: class label 0.5 0.5 0.5 0.5, on a row in the text file. In the csv file you specify the image file name and the text file name in two different columns.
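As a small, made-up example, a label file with two objects and the matching csv rows could look like this:

```
# 000001.txt — one row per box: class x y width height (all relative to the image)
11 0.344 0.611 0.416 0.262
14 0.509 0.516 0.978 0.967

# train.csv — image file name, label file name
000001.jpg,000001.txt
```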

If you just want to get started without having to format the data you can download the Pascal-VOC dataset from Kaggle here where the data is already formatted.

Even if your dataset is not formatted this way it should be manageable to modify the data loading such that you can still make the training labels the same way.

Dataset Class Overview

In a PyTorch dataset there are three building blocks: the __init__ method, the __len__ method giving the dataset length, and the __getitem__ method.

The most important part of the dataset class is how we handle the anchor boxes. We will specify the anchor boxes in the following manner
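(The values below are the paper’s anchors rescaled to be relative to the image; verify them against the config.py on GitHub.)

```python
ANCHORS = [
    [(0.28, 0.22), (0.38, 0.48), (0.90, 0.78)],  # largest anchors -> coarsest grid (13x13)
    [(0.07, 0.15), (0.15, 0.11), (0.14, 0.29)],  # medium anchors -> 26x26 grid
    [(0.02, 0.03), (0.04, 0.07), (0.08, 0.06)],  # smallest anchors -> finest grid (52x52)
]
```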

where each tuple corresponds to the width and the height of an anchor box relative to the image size, and each list grouping together three tuples corresponds to the anchors used on a specific prediction scale. The first list contains the largest anchor boxes, which will be used for prediction on the coarsest grid where it’s presumably easier to predict larger bounding boxes. The following lists, containing medium and small anchor boxes, will be used for the medium and finest grids following the same reasoning. The anchors above are the ones used in the original paper but have been scaled to be relative to the image.

Even if you are training on another dataset these anchors will probably work quite well; however, if your dataset is very different from MS COCO you would probably generate your own anchor boxes, and then it is wise to assign the anchor boxes to the different scales by their size, as was done in the paper. In this case you would collect the widths and heights of the bounding boxes in your dataset and run them through K-means clustering with the intersection over union as the distance measure. The resulting centroids would be your anchor boxes.

Below is the complete dataset class. We will load an image and its bounding boxes and perform augmentations on both. For each bounding box we will then assign it to the grid cell which contains its midpoint and decide which anchor is responsible for it by determining which anchor the bounding box has the highest intersection over union with. Exactly how we build the targets is explained more in depth below the code.
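Here is a sketch of the dataset class under the format assumptions above. It assumes the Albumentations transform is built with bbox_params in the "yolo" format and that utils provides the width/height IoU helper imported above; details may differ slightly from the file on GitHub.

```python
class YOLODataset(Dataset):
    def __init__(
        self, csv_file, img_dir, label_dir, anchors,
        image_size=416, S=[13, 26, 52], C=20, transform=None,
    ):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.transform = transform
        self.S = S
        # flatten the per-scale anchor lists into one (9, 2) tensor
        self.anchors = torch.tensor(anchors[0] + anchors[1] + anchors[2])
        self.num_anchors = self.anchors.shape[0]
        self.num_anchors_per_scale = self.num_anchors // 3
        self.C = C
        self.ignore_iou_thresh = 0.5

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
        # np.roll moves the class label last: [class, x, y, w, h] -> [x, y, w, h, class]
        bboxes = np.roll(np.loadtxt(fname=label_path, delimiter=" ", ndmin=2), 4, axis=1).tolist()
        img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
        image = np.array(Image.open(img_path).convert("RGB"))

        if self.transform:
            augmentations = self.transform(image=image, bboxes=bboxes)
            image = augmentations["image"]
            bboxes = augmentations["bboxes"]

        # one target per scale: (anchors per scale, S, S, 6) with [obj score, x, y, w, h, class]
        targets = [torch.zeros((self.num_anchors // 3, S, S, 6)) for S in self.S]
        for box in bboxes:
            iou_anchors = iou(torch.tensor(box[2:4]), self.anchors)
            anchor_indices = iou_anchors.argsort(descending=True, dim=0)
            x, y, width, height, class_label = box
            has_anchor = [False] * 3
            for anchor_idx in anchor_indices:
                anchor_idx = int(anchor_idx)
                scale_idx = anchor_idx // self.num_anchors_per_scale
                anchor_on_scale = anchor_idx % self.num_anchors_per_scale
                S = self.S[scale_idx]
                i, j = int(S * y), int(S * x)  # cell row and column of the midpoint
                anchor_taken = targets[scale_idx][anchor_on_scale, i, j, 0]
                if not anchor_taken and not has_anchor[scale_idx]:
                    targets[scale_idx][anchor_on_scale, i, j, 0] = 1
                    x_cell, y_cell = S * x - j, S * y - i          # midpoint relative to the cell
                    width_cell, height_cell = width * S, height * S  # size in cell units
                    targets[scale_idx][anchor_on_scale, i, j, 1:5] = torch.tensor(
                        [x_cell, y_cell, width_cell, height_cell]
                    )
                    targets[scale_idx][anchor_on_scale, i, j, 5] = int(class_label)
                    has_anchor[scale_idx] = True
                elif not anchor_taken and iou_anchors[anchor_idx] > self.ignore_iou_thresh:
                    targets[scale_idx][anchor_on_scale, i, j, 0] = -1  # ignore this prediction

        return image, tuple(targets)
```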

In the __init__ method we just combine the lists above into a tensor of shape (9, 2) by self.anchors = torch.tensor(anchors[0] + anchors[1] + anchors[2]), corresponding to the anchor boxes on all scales. We also specify an ignore threshold which will be used when building the targets, as explained below.

The second challenging part of the data loading is in the __getitem__ method, where we load the image and the corresponding text file with the bounding boxes and process them so that they can be input to the model. For data augmentation we use the Albumentations library, which requires the image and bounding boxes to be numpy arrays. The bounding boxes are also expected to be in the format [x, y, width, height, class label], which is different from how we have formatted them in the text file, and we therefore use np.roll to change this. The reason for this inconsistency is that the text files are structured the same way as in the original implementation; if you are formatting a custom dataset you may consider modifying this if you are also using Albumentations.

Here it should be noted that if you download the Pascal-VOC or MS COCO dataset from the official sites or from Joseph Redmon’s website, you may run into some out-of-range issues when using Albumentations depending on how you convert the labels to the format x, y, width, height where (x, y) signifies the object’s midpoint. If you do, make sure you have converted the labels as specified in this GitHub issue and you will save a couple of hours of debugging.

Building targets

When we load the labels for a specific image it will only be an array with all the bounding boxes, and to be able to calculate the loss we want to format the targets similarly to the model output. The model outputs predictions on three different scales, so we also build three different targets. Each target for a particular scale and image will have the shape (number of anchors // 3, grid size, grid size, 6) where 6 corresponds to the object score, four bounding box coordinates and the class label. We make two assumptions: that there is only one label per bounding box and that the anchor boxes are divided equally between the scales. We start by initializing the three target tensors to zeros with targets = [torch.zeros((self.num_anchors // 3, S, S, 6)) for S in self.S] where self.S is a list with the different grid sizes, e.g. for an image size of 416x416 we have S = [13, 26, 52], or more generally S = [image_size // 32, image_size // 16, image_size // 8], since at the prediction stage the feature map will have been downscaled by the factors in the denominators.

The next step is to loop through all the bounding boxes in this particular image. If you have a lot of bounding boxes this will be quite expensive, but I haven’t yet figured out a way to remove this step without taking shortcuts when assigning the anchor boxes. Let me know if you have any ideas on how to optimize this! We then compute the intersection over union between the target’s width and height and all the anchor boxes and sort the result such that the index of the anchor with the largest intersection over union with the target box appears first in the list.

We will then loop through the nine indices to assign the target to the best anchors. Our goal is to assign each target bounding box to an anchor on each scale, i.e. in total assign each target to one anchor in each of the target matrices we initialized above. In addition we also check if an anchor is not the most suitable for the bounding box but still has an intersection over union higher than 0.5, as specified in the ignore_iou_thresh, and then we mark this target such that no loss is incurred for the prediction of this anchor box. From my understanding the reasoning behind this is that during inference this anchor could also make valid predictions on similar objects, and non-max suppression will remove surplus bounding boxes. During training we therefore do not want to force that particular anchor to predict that there is no object. We first compute which cell the bounding box belongs to by i, j = int(S * y), int(S * x) and then we check whether the anchor we are currently at is taken in this cell by anchor_taken = targets[scale_idx][anchor_on_scale, i, j, 0]. As you can probably imagine it is relatively uncommon for most datasets to have two objects with midpoints in the same cell of such similar size that they fit the same anchor box; however, if you run this through a couple of hundred examples you'll notice it occurs several times on, for example, the Pascal-VOC dataset. In addition to checking if the particular anchor is taken, we also check if the current bounding box already has an anchor on this particular prediction scale. We only want one target anchor on each scale, to allow for specialization between the anchor boxes such that they focus on predicting different kinds of objects.

If we find an anchor that is unoccupied and our current bounding box does not have an anchor on the scale which the anchor belongs to, we want to assign this anchor to the bounding box. First we set the object score on this anchor to 1 by targets[scale_idx][anchor_on_scale, i, j, 0] = 1, to indicate that there is an object in this cell. We then compute the box coordinates relative to the cell such that the midpoint (x, y) states where in the cell the object is, and the width and the height correspond to how many cells the bounding box covers. This is computed by x_cell, y_cell = S * x - j, S * y - i and width_cell, height_cell = width * S, height * S.

We will then add the bounding box coordinates as well as the class label to the cell and the anchor box indicated by i, j and anchor_on_scale respectively. Lastly we will update the flag has_anchor[scale_idx] to True to indicate that the particular prediction scale now has an anchor.

Only doing the data loading in the way above would be sufficient. In the YOLOv3 paper they, however, also check whether the anchor we are currently at has an intersection over union greater than ignore_iou_thresh = 0.5, and then they do not incur loss for this anchor's prediction. We do this by setting the object score of the anchor in the object's cell to -1, i.e. targets[scale_idx][anchor_on_scale, i, j, 0] = -1. In the loss function we will later make sure that no loss is incurred for these anchors.

To make sure that the data loading works it is beneficial to plot a few augmented examples together with their bounding boxes. The code below should do the trick, possibly with some modifications depending on how you structure the data.
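A rough sketch of such a test, assuming utils.py exposes helpers roughly like cells_to_bboxes(predictions, anchors, S, is_preds) returning boxes as [class, score, x, y, w, h] and plot_image(image, boxes); these signatures are assumptions, so adapt them to your utils.py:

```python
def test():
    dataset = YOLODataset("train.csv", "images/", "labels/", anchors=ANCHORS, transform=None)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)
    for image, targets in loader:
        boxes = []
        for scale_idx, target in enumerate(targets):
            S = target.shape[2]
            scaled_anchors = torch.tensor(ANCHORS[scale_idx]) * S
            # convert the target grid back to image-relative boxes (assumed helper)
            boxes += cells_to_bboxes(target, scaled_anchors, S=S, is_preds=False)[0]
        boxes = [box for box in boxes if box[1] == 1]  # keep only cells with an assigned object
        plot_image(image[0], boxes)  # image layout depends on your transform (HWC here)
        break
```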

YOLOv3 loss function

In the original YOLO paper the author states the loss function, and the same expression can be found in articles on YOLOv2 and v3, but it is at best a simplification of the actual implementation. If you are familiar with the original YOLO loss you will recognize all the parts below, but they are tweaked to match the idea of anchor boxes. The loss function can be divided into four parts; I will go through each separately and then combine them in the end.

First we will form two binary tensors signifying which cells and anchors have objects assigned to them and which do not.
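In code, with target being the tensor built in the dataset class for one scale:

```python
obj = target[..., 0] == 1    # anchors assigned to an object
noobj = target[..., 0] == 0  # anchors with no object (ignored anchors are -1 and excluded)
```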

The reason for not simply using the negation of one of them is that in the data loading we set the object score of the anchors we should ignore to -1. Indexing with only these two masks in all parts of the loss function makes sure that we do not incur any loss on the ignored anchors. I will also state each part of the loss as a mathematical formula based on the way it is implemented in the code. They are just translations from the code for those who find it easier to understand the loss in that format, so don’t worry if they’re not your cup of tea.

No Object Loss

For the anchors in all cells that do not have an object assigned to them, i.e. all indices that are set to one in noobj, we only want to incur loss for their object score. The target will be all zeros, since we want these anchors to predict an object score of zero, and we will apply a sigmoid function to the network outputs and use a binary cross-entropy loss. In code we have that
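(a sketch, using the attribute names from the complete loss class further down)

```python
no_object_loss = self.bce(
    predictions[..., 0:1][noobj], target[..., 0:1][noobj],
)
```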

where self.bce refers to an instance of PyTorch's BCEWithLogitsLoss(), which applies the sigmoid function and then calculates the binary cross-entropy loss.

In mathematical terms we have

L_noobj = -(1/N) Σ_{i,j,a} 1^noobj_{ija} [ y_{ija} · log(σ(t_{ija})) + (1 - y_{ija}) · log(1 - σ(t_{ija})) ]

where N is the batch size, i, j signify the cell, a is the anchor index and 1^noobj_{ija} is a binary tensor with ones on anchors not assigned to an object. The output from the network is denoted t, the target is denoted y (here all zeros) and the sigmoid function is given by σ(t) = 1 / (1 + e^(-t)).

Object Loss

For the anchors that have an object assigned to them we want the model to predict an appropriate bounding box for the object. When building the target tensors we assigned these anchors an object score of 1. One idea is then to do similarly as in the no object loss and train the network to output large values in the cells and anchors for which we have assigned a target bounding box. This would, however, mean that no matter how horrible a bounding box prediction the network makes, it would still try to predict a high object score. During inference we are guided by the object score when choosing which bounding boxes to output, and if we did as proposed the object score would not actually reflect how likely it is that there is an object in the outputted bounding box. The idea in the YOLOv3 paper is instead that the object score the model predicts should reflect the intersection over union between the prediction and the target bounding box. It is slightly unclear how this is actually implemented originally and I have seen several different versions in others’ code. In our implementation we will, during training, calculate the intersection over union between the target bounding boxes and the predicted bounding boxes in the output and use this as the target for the object score. This does not seem to slow down training noticeably.

In the code we will convert the model predictions to bounding boxes according to the formulas in the paper,

bx = σ(tx),  by = σ(ty),  bw = pw · e^tw,  bh = ph · e^th,

where pw and ph are the anchor box dimensions and (bx, by, bw, bh) is the resulting bounding box relative to the cell. We will then calculate the intersection over union with the target that we defined in the dataset class and lastly, as in the no object loss above, apply the binary cross-entropy loss between the object score predictions and the calculated intersection over union. Note that the loss will only be applied to the anchors assigned to a target bounding box, signified by indexing by obj.
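In code this could look like the following, where intersection_over_union is assumed to be a helper from utils.py and anchors has been reshaped to broadcast over the grid:

```python
# convert predictions to cell-relative boxes: sigmoid on (x, y), anchors * exp on (w, h)
box_preds = torch.cat(
    [self.sigmoid(predictions[..., 1:3]), torch.exp(predictions[..., 3:5]) * anchors], dim=-1
)
ious = intersection_over_union(box_preds[obj], target[..., 1:5][obj]).detach()
object_loss = self.bce(predictions[..., 0:1][obj], ious * target[..., 0:1][obj])
```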

The mathematical formula is similar to the one above,

L_obj = -(1/N) Σ_{i,j,a} 1^obj_{ija} [ y_{ija} · log(σ(t_{ija})) + (1 - y_{ija}) · log(1 - σ(t_{ija})) ]

with the target y_{ija} = IoU(b_{ija}, ground truth box), where b is the bounding box computed above and 1^obj_{ija} corresponds to the binary tensor with ones for the anchors assigned to a target bounding box.

Box Coordinates Loss

For the box coordinates we simply use a mean squared error loss in the positions where there actually are objects. All predictions where there is no corresponding target bounding box are ignored. We apply a sigmoid function to the x and y coordinates to make sure that they are between [0, 1], but instead of converting the widths and heights as above we compute the ground truth value that the network should predict. We find it by inverting the formula above for the bounding boxes:

tw = log(yw / pw),  th = log(yh / ph)

where yw and yh are the target width and height. We then apply the mean squared error loss between the targets and the predictions.
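A sketch of the corresponding code:

```python
predictions[..., 1:3] = self.sigmoid(predictions[..., 1:3])          # x, y between 0 and 1
target[..., 3:5] = torch.log(1e-16 + target[..., 3:5] / anchors)     # invert the exponential for w, h
box_loss = self.mse(predictions[..., 1:5][obj], target[..., 1:5][obj])
```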

The equivalent formula is given by

L_box = (1/N) Σ_{i,j,a} 1^obj_{ija} [ (σ(tx) - yx)² + (σ(ty) - yy)² + (tw - log(yw/pw))² + (th - log(yh/ph))² ]

where we use the ground truth labels calculated above for what values the model should predict.

Class Loss

We only incur loss for the class predictions where there actually is an object. Our implementation differs slightly from the paper's in the case of the class loss, and we use a cross-entropy loss to compute it. This assumes that each bounding box only has one label. The YOLOv3 paper motivates that it does not want to have this limitation and instead uses a binary cross-entropy such that several labels can be assigned to a single object, e.g. woman and person.
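In code this becomes (a sketch, again with the attribute names from the complete class below)

```python
class_loss = self.entropy(
    predictions[..., 5:][obj], target[..., 5][obj].long(),
)
```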

where self.entropy refers to an instance of PyTorch's CrossEntropyLoss(), which combines the softmax function and the negative log-likelihood loss. This corresponds to

L_class = -(1/N) Σ_{i,j,a} 1^obj_{ija} log(p_{ija}(c))

where p_{ija}(c) is the predicted probability (after softmax) for the correct class c.

Complete YOLOv3 Loss

I will not attempt to put the entire loss function into a single formula, as this only creates an unnecessarily complicated expression when each part can be understood and computed separately. The total loss is computed by
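In code, with the weighting constants stored as attributes:

```python
loss = (
    self.lambda_box * box_loss
    + self.lambda_obj * object_loss
    + self.lambda_noobj * no_object_loss
    + self.lambda_class * class_loss
)
```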

or equivalently

L = λ_box · L_box + λ_obj · L_obj + λ_noobj · L_noobj + λ_class · L_class

where each λ is a constant signifying the importance of that part of the loss. It seems that the original implementation uses 1 for all constants, but during training we found better convergence by modifying them.

The complete code for the loss function is found below and the code is placed in a separate loss.py file.
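Assembling the pieces above, a sketch of the complete loss class could look like this. The λ constants are placeholders to tune (the article only notes that deviating from all-ones helped convergence), and intersection_over_union is assumed to come from utils.py.

```python
import torch
import torch.nn as nn

from utils import intersection_over_union  # assumed helper from utils.py


class YoloLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()
        self.bce = nn.BCEWithLogitsLoss()
        self.entropy = nn.CrossEntropyLoss()
        self.sigmoid = nn.Sigmoid()

        # weighting constants for the different parts of the loss (example values)
        self.lambda_class = 1
        self.lambda_noobj = 10
        self.lambda_obj = 1
        self.lambda_box = 10

    def forward(self, predictions, target, anchors):
        obj = target[..., 0] == 1
        noobj = target[..., 0] == 0

        # no object loss
        no_object_loss = self.bce(predictions[..., 0:1][noobj], target[..., 0:1][noobj])

        # object loss: the target is the IoU between predicted and target boxes
        anchors = anchors.reshape(1, 3, 1, 1, 2)
        box_preds = torch.cat(
            [self.sigmoid(predictions[..., 1:3]), torch.exp(predictions[..., 3:5]) * anchors],
            dim=-1,
        )
        ious = intersection_over_union(box_preds[obj], target[..., 1:5][obj]).detach()
        object_loss = self.bce(predictions[..., 0:1][obj], ious * target[..., 0:1][obj])

        # box coordinate loss
        predictions[..., 1:3] = self.sigmoid(predictions[..., 1:3])
        target[..., 3:5] = torch.log(1e-16 + target[..., 3:5] / anchors)
        box_loss = self.mse(predictions[..., 1:5][obj], target[..., 1:5][obj])

        # class loss
        class_loss = self.entropy(predictions[..., 5:][obj], target[..., 5][obj].long())

        return (
            self.lambda_box * box_loss
            + self.lambda_obj * object_loss
            + self.lambda_noobj * no_object_loss
            + self.lambda_class * class_loss
        )
```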

Training the model

The training configuration is completely contained in the config.py file that can be found on GitHub. This is where we specify the image size, dataset paths, augmentations, learning rate and all other constants. I will not include it here; if you implement YOLOv3 you can just copy it from the repo or write your own training configuration.

What we will instead focus on is building the training loop, which should be quite straightforward. Everything from here on will be placed in a train.py file which we can then run to train the model. First we define the imports, where we import our previously defined modules and in addition a couple of helper functions from the utils.py file you can find on GitHub.
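A sketch of the imports; the helper names from utils are assumptions, so match them to the actual utils.py on GitHub:

```python
import torch
import torch.optim as optim
from tqdm import tqdm

import config
from model import YOLOv3
from loss import YoloLoss
from utils import (
    get_loaders,              # assumed helpers from utils.py on GitHub
    get_evaluation_bboxes,
    mean_average_precision,
)
```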

We then define a training function which trains the network for one epoch. It takes as input the model, the data loader, the optimizer, the loss function, a scaler for mixed precision training and the scaled anchors, such that each anchor is relative to its prediction scale. Originally the anchors are relative to the entire image, but to the loss we want to input them relative to the cell, and this is accomplished by scaling them with the grid size of the prediction scale.

We calculate the total loss as the sum of the losses for each prediction scale, three of them in total. We use mixed precision training to train the model.
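A sketch of such a training function, assuming a DEVICE constant is defined in config.py:

```python
def train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors):
    loop = tqdm(train_loader, leave=True)
    losses = []
    for batch_idx, (x, y) in enumerate(loop):
        x = x.to(config.DEVICE)
        y0, y1, y2 = (y[0].to(config.DEVICE), y[1].to(config.DEVICE), y[2].to(config.DEVICE))

        with torch.cuda.amp.autocast():  # mixed precision
            out = model(x)
            # total loss is the sum of the losses on the three prediction scales
            loss = (
                loss_fn(out[0], y0, scaled_anchors[0])
                + loss_fn(out[1], y1, scaled_anchors[1])
                + loss_fn(out[2], y2, scaled_anchors[2])
            )

        losses.append(loss.item())
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        loop.set_postfix(loss=sum(losses) / len(losses))
```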

We have now come to the part where we are ready to actually train the model. The main function takes care of setting up the model, loss function, data loaders etc., and in each epoch we run the train function defined above. Once every ten epochs we evaluate the model by checking the mean average precision on the test loader. Note that this can be costly if your model's performance is bad, because there may be many false positives that non-max suppression and mean average precision have to loop through.
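A sketch of the main function; the config constants and the utils helpers (get_loaders, get_evaluation_bboxes, mean_average_precision) are assumed names and signatures, so adapt them to the actual config.py and utils.py:

```python
def main():
    model = YOLOv3(num_classes=config.NUM_CLASSES).to(config.DEVICE)
    optimizer = optim.Adam(
        model.parameters(), lr=config.LEARNING_RATE, weight_decay=config.WEIGHT_DECAY
    )
    loss_fn = YoloLoss()
    scaler = torch.cuda.amp.GradScaler()
    train_loader, test_loader = get_loaders(
        train_csv_path="train.csv", test_csv_path="test.csv",
    )  # assumed helper building the data loaders

    # anchors scaled to be relative to each prediction grid instead of the image
    scaled_anchors = (
        torch.tensor(config.ANCHORS)
        * torch.tensor(config.S).unsqueeze(1).unsqueeze(1).repeat(1, 3, 2)
    ).to(config.DEVICE)

    for epoch in range(config.NUM_EPOCHS):
        train_fn(train_loader, model, optimizer, loss_fn, scaler, scaled_anchors)

        if epoch > 0 and epoch % 10 == 0:
            pred_boxes, true_boxes = get_evaluation_bboxes(
                test_loader, model,
                iou_threshold=config.NMS_IOU_THRESH,
                anchors=config.ANCHORS,
                threshold=config.CONF_THRESHOLD,
            )
            mapval = mean_average_precision(
                pred_boxes, true_boxes,
                iou_threshold=config.MAP_IOU_THRESH,
                num_classes=config.NUM_CLASSES,
            )
            print(f"mAP: {mapval}")


if __name__ == "__main__":
    main()
```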

We have now reached the end of this YOLOv3 implementation, and if you feel that everything is crystal clear then: wow, I’ve really outdone myself. It is more likely that you will have to revisit this and possibly others’ implementations if your goal is to implement YOLOv3 yourself. Anyhow, I hope you take with you some key implementation details of YOLOv3 from this article, and if you have any lingering thoughts, leave a comment!

Originally published at GitHub.
