
Prototxt files are included which can be used to train the various models. Note that these models may differ somewhat from the original models as they have been modified for end use in the DPU IP. Some of the types of modifications that were made to these models include:

  • Replacing the un-pooling layer with a deconvolution layer in the decoder module
  • Replacing all PReLU with ReLU
  • Removing spatial dropout layers
  • Replacing Batchnorm layers with a merged Batchnorm + Scale layer
  • Positioning Batchnorm layers in parallel with ReLU
  • In the Unet-full/Unet-lite models, inserting Batchnorm/Scale layer combinations before the ReLU layers (after d0c, d1c, d2c, and d3c), as the DPU doesn't support the data flow from Convolution to both the Concat and ReLU simultaneously
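To illustrate why a Batchnorm layer followed by a Scale layer can be merged, here is a minimal Python sketch. It assumes the usual formulation (normalize with a stored mean and variance, then apply a learned gamma and beta) and shows that the pair collapses into one affine transform. The numeric values and function names are made up for illustration and are not taken from the tutorial's models.

```python
import math

def batchnorm_then_scale(x, mean, var, gamma, beta, eps=1e-5):
    """Separate layers: normalize first, then apply the learned scale and shift."""
    normalized = (x - mean) / math.sqrt(var + eps)
    return gamma * normalized + beta

def merged_batchnorm_scale(x, mean, var, gamma, beta, eps=1e-5):
    """Same math folded into a single affine transform: y = a * x + b."""
    a = gamma / math.sqrt(var + eps)
    b = beta - a * mean
    return a * x + b

# Hypothetical per-channel statistics and learned parameters.
x, mean, var, gamma, beta = 0.8, 0.5, 0.04, 1.5, -0.2
separate = batchnorm_then_scale(x, mean, var, gamma, beta)
merged = merged_batchnorm_scale(x, mean, var, gamma, beta)
print(abs(separate - merged) < 1e-9)
```

Both functions give the same result up to floating point rounding, which is why the two layers can be treated as one.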

If further analysis is desired, the model prototxt files have been included so they can simply be diffed against the original caffe prototxt files.

In terms of augmentation, the mean values from the dataset and a scale factor of 0.022 are applied to the input layers for each model.
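As a rough illustration of what this preprocessing does to each pixel, the sketch below subtracts a per-channel mean and then multiplies by the scale factor, which is the order Caffe's data transformer applies them in. The mean values here are placeholders I picked for the example; the real ones should be read from the input layers of the provided prototxt files.

```python
SCALE = 0.022  # scale factor stated in the text

# Hypothetical per-channel means in B, G, R order (placeholders only).
MEAN_BGR = (73.0, 82.0, 72.0)

def preprocess_pixel(pixel_bgr, mean_bgr=MEAN_BGR, scale=SCALE):
    """Subtract the channel mean, then multiply by the scale factor."""
    return tuple((value - mean) * scale
                 for value, mean in zip(pixel_bgr, mean_bgr))

example = preprocess_pixel((173.0, 82.0, 22.0))
print(example)
```

With these placeholder means, a channel value 100 above its mean maps to roughly 2.2, and a value equal to the mean maps to 0.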

3.0.1 Training the Models from Scratch

When training from scratch, it is necessary to train the ESPNet and ENet models in two stages. For ESPNet, we first train a model similar to the (c) ESPNet-C architecture, which is shown in the figure below:



This essentially removes the decoder stage that is present in the (d) ESPNet model, and in place of that decoder stage, a single deconvolution layer is added to resize up 8x back to the original input size which matches the annotation size.

For ENet, a similar approach is taken: we train only the encoder stage by removing the decoder portion of the model and adding a single deconvolution layer that resizes by a factor of 8x up to the original input size, which matches the annotation size.
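The 8x resize can be sanity-checked with the standard output size formula for a deconvolution (transposed convolution) layer. The kernel, stride, and pad values below are assumptions chosen so that the output is exactly 8x the input; check the encoder prototxt files for the values actually used.

```python
def deconv_output_size(input_size, kernel_size, stride, pad):
    """Output length of a deconvolution along one axis (dilation of 1)."""
    return stride * (input_size - 1) + kernel_size - 2 * pad

# Assumed layer settings for 8x upsampling (not read from the prototxt files).
KERNEL, STRIDE, PAD = 16, 8, 4

# An encoder output at 1/8 scale of a 1024x512 (width x height) input.
encoder_h, encoder_w = 64, 128
out_h = deconv_output_size(encoder_h, KERNEL, STRIDE, PAD)
out_w = deconv_output_size(encoder_w, KERNEL, STRIDE, PAD)
print(out_h, out_w)  # 512 1024
```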

The FPN, Unet, and Unet-lite models can all be trained end to end, so the encoder/decoder two stage training process is not necessary for those models (though a similar process could be employed if desired and it may end up producing better results).

The pre-trained ESPNet/ENet encoder models were trained for 20K iterations with an effective batch size of 50 and a base_lr of 0.0005. Note that larger batch sizes can also be used and may ultimately produce more accurate results, though training time would increase.

If you encounter a "CUDA out of memory" error, try reducing the batch size in the train_val.prototxt or train_val_encoder.prototxt and increasing the corresponding iter_size in the solver.prototxt to maintain the same effective batch size. Note that all these models were trained with GTX 1080ti GPUs, which have 11GB of memory. If your GPU has less memory, you may need to adjust the batch sizes to fit within your GPU memory limits.
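When lowering the batch size to fit in memory, the matching iter_size can be worked out as in the small sketch below. The helper name is my own; it simply finds the smallest iter_size that keeps the effective batch size at or above the target.

```python
import math

def iter_size_for(effective_batch, batch_size):
    """Smallest iter_size with batch_size * iter_size >= effective_batch."""
    if effective_batch < 1 or batch_size < 1:
        raise ValueError("batch sizes must be positive")
    return math.ceil(effective_batch / batch_size)

# Keep an effective batch size of 50 while shrinking the per-step batch.
for batch_size in (5, 2, 1):
    print(batch_size, iter_size_for(50, batch_size))
```

If the batch size does not divide the target evenly (for example 3), the effective batch size ends up slightly larger than the target (17 x 3 = 51).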

After this first step has been completed, we can train the full ESPNet prototxt using the weights from the first step to fine tune the model.

Note that the initial training step for these models takes about 36 hours on my Xeon workstation using a GTX 1080ti graphics card.

Training the ESPNet (ESPNet-C) and ENet Encoder Models

The encoder models have been included with the caffe distribution. The files which will be needed as a starting point for this are the solver_encoder.prototxt and the train_val_encoder.prototxt. These files are located under the Segment/workspace/model/espnet and Segment/workspace/model/enet paths respectively, and can be used for training the encoder only portions of these networks.

The full model prototxt files are also available under these paths and it is recommended to compare the two files using a text editor to understand what has been removed from the full model for the encoder only portion.

If you would like to skip training the encoder portion of the model, I have included pre-trained encoder models for both networks, which are stored under the Segment/workspace/model/espnet/encoder_models and Segment/workspace/model/enet/encoder_models directories.

  1. The first step to train these models is to open the solver_encoder.prototxt file for the associated model. It is important to understand the training parameters and paths for this file. Notice the lines containing the "net: " definition and "snapshot_prefix: ".

The first line specifies a relative path to where the train_val_encoder.prototxt exists and the second should point to an existing directory where you would like the model snapshots to be stored. Note that relative paths have been specified so that if run from $CAFFE_ROOT, then no modifications should be needed to run the training.
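Before starting a long training run, it can be worth checking that both paths resolve from the directory you plan to run in. The following is only a sketch: the solver text is a hypothetical excerpt (the snapshot file prefix in particular is invented), and the parser handles just simple top-level `key: value` lines.

```python
import os
import re

# Hypothetical solver excerpt; the real file ships with the tutorial.
SOLVER_TEXT = '''
net: "../workspace/model/espnet/train_val_encoder.prototxt"
base_lr: 0.0005
snapshot_prefix: "../workspace/model/espnet/encoder_models/snapshot"
'''

def parse_solver(text):
    """Collect simple top-level `key: value` pairs from solver prototxt text."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r'^(\w+)\s*:\s*"?([^"]*)"?$', line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params

def check_paths(params, run_dir="."):
    """Report whether the net file and the snapshot directory exist."""
    net_path = os.path.join(run_dir, params["net"])
    snapshot_dir = os.path.dirname(
        os.path.join(run_dir, params["snapshot_prefix"]))
    return {"net_exists": os.path.isfile(net_path),
            "snapshot_dir_exists": os.path.isdir(snapshot_dir)}

params = parse_solver(SOLVER_TEXT)
print(params["net"])
print(check_paths(params))
```

To check a real solver file, read its contents with `open(...).read()` and pass your $CAFFE_ROOT as `run_dir`.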

Prototxt file

Note also how the other hyper-parameters are set in the solver prototxt. The base_lr, max_iter, iter_size, and device_id are all important training parameters.

The base_lr is probably the most important parameter, and if it is set too big or too small, the training process will never converge. For this tutorial, it has been found that a value of 0.0005 is appropriate for training the models.

The iter_size is used to determine the effective batch size. If the batch size is set in the train_val_encoder.prototxt file to '5' in the input layer, then the iter_size essentially applies a multiplier to that batch size by not updating the weights until iter_size number of batches have been completed. For example, if the iter_size is set to 10, then 10 x 5 = 50 is the effective batch size. Batch size has a significant effect on the convergence of the training process as well as the accuracy of the model, so it is important to maintain a larger batch size when training the full models. In the case of this tutorial, this parameter is used to enable the training process to maintain a larger effective batch size where there is a limited amount of GPU memory.
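The idea behind iter_size can be demonstrated without Caffe at all. The toy sketch below (a one-weight linear model with synthetic data, all of it invented for illustration) averages the gradients of 10 batches of 5 samples and compares the result with the gradient of a single batch of 50. It assumes the accumulated gradient is averaged over iter_size, which is my understanding of how Caffe normalizes it.

```python
import math

def mean_gradient(samples, weight):
    """Gradient of mean squared error for y = weight * x over the samples."""
    return sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)

def accumulated_gradient(samples, weight, batch_size):
    """Average per-batch gradients, as if updating only every iter_size batches."""
    batches = [samples[i:i + batch_size]
               for i in range(0, len(samples), batch_size)]
    iter_size = len(batches)
    total = sum(mean_gradient(batch, weight) for batch in batches)
    return total / iter_size, iter_size

# 50 synthetic (x, y) pairs; batch size 5 gives iter_size 10.
samples = [(float(i), 3.0 * i + 1.0) for i in range(50)]
full = mean_gradient(samples, weight=0.5)
accumulated, iter_size = accumulated_gradient(samples, weight=0.5, batch_size=5)
print(iter_size, math.isclose(full, accumulated, rel_tol=1e-9))
```

The two gradients agree, so the weight update is the same as with a true batch of 50, while only 5 samples need to fit in GPU memory at a time.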

The device_id parameter specifies the device id of the GPU card which will be used to accelerate the training process. If you have only one GPU, specify '0'. Multiple GPUs can also be used by providing a comma-separated list, and you can also train multiple models on different GPUs.

The max_iter parameter sets the number of training iterations, which together with the effective batch size determines how many times the model will see the training data during the training process. If a dataset has N images, the batch size is B, and P is the number of epochs, then the relationship between epochs and iterations is defined as:

Iterations = (N * P) / B

Since the training dataset is around 3000 images, we can re-arrange this equation to calculate the number of epochs by:

P = (Iterations * B) / N = (20K * 50) / 3K ≈ 333 epochs
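The same arithmetic as a small Python sketch, in case you want to try other batch sizes or iteration counts (the function names are mine):

```python
def iterations_for(num_images, epochs, batch_size):
    """Iterations = (N * P) / B"""
    return (num_images * epochs) / batch_size

def epochs_for(num_images, iterations, batch_size):
    """P = (Iterations * B) / N"""
    return (iterations * batch_size) / num_images

# 20K iterations at an effective batch size of 50 on about 3000 images.
epochs = epochs_for(3000, 20000, 50)
print(round(epochs))  # 333
```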

  2. Once the solver_encoder.prototxt has been verified, the model can be trained by changing directory to $CAFFE_ROOT and running one of the following commands:

For ESPNet:

./build/tools/caffe train \
--solver ../workspace/model/espnet/solver_encoder.prototxt \
2>&1 | tee ../workspace/model/espnet/encoder_models/train_encoder_log.txt

For ENet:

./build/tools/caffe train \
--solver ../workspace/model/enet/solver_encoder.prototxt \
2>&1 | tee ../workspace/model/enet/encoder_models/train_encoder_log.txt

If errors occur regarding missing libcudart or similar, run sudo ldconfig /usr/local/cuda/lib64, then retry.

I have included an example log file from my console output during the training process for ESPNet, which is stored under Segment/workspace/model/espnet/encoder_models/train_encoder_log.txt. You should see something similar during the training process.

Once the training process has completed, the full model can be trained, using these pre-trained weights as a starting point for the encoder portion of the model.

Training the Full Models

The full models for ENet, ESPNet, FPN, Unet-Full, and Unet-Lite have been included with the caffe distribution. The files which will be needed as a starting point for this are the solver.prototxt and the train_val.prototxt. These files are located under the Segment/workspace/model/espnet, Segment/workspace/model/enet, Segment/workspace/model/FPN, Segment/workspace/model/unet-full, and Segment/workspace/model/unet-lite paths respectively, and can be used for training the full networks.

Since FPN, Unet-full, and Unet-lite can be trained end-to-end from scratch, there is no need to train the encoder portion separately. Generally for training the full models, a larger batch size is desirable as it helps the model to approximate the full dataset better than a smaller batch size. For this tutorial, I have used batch sizes >= 100 for training the full models.

  1. Just like with the encoder training, the first step to train the full models is to open the associated solver.prototxt file and view the properties of the various hyper-parameters. Note again that relative paths are used for the "net" and "snapshot_prefix" parameters, so if the training is run from $CAFFE_ROOT, the intended locations should be used for these files.

  2. Once the solver.prototxt has been verified, the model can be trained by changing directory to $CAFFE_ROOT and running one of the following commands (assuming the pretrained models are used, otherwise specify the name of your caffemodel):

For ESPNet:

./build/tools/caffe train \
--solver ../workspace/model/espnet/solver.prototxt \
--weights ../workspace/model/espnet/encoder_models/pretrained_encoder.caffemodel \
2>&1 | tee ../workspace/model/espnet/final_models/train_log.txt

For ENet:

./build/tools/caffe train \
--solver ../workspace/model/enet/solver.prototxt \
--weights ../workspace/model/enet/encoder_models/pretrained_encoder.caffemodel \
2>&1 | tee ../workspace/model/enet/final_models/train_log.txt

For FPN:

./build/tools/caffe train \
--solver ../workspace/model/FPN/solver.prototxt \
2>&1 | tee ../workspace/model/FPN/final_models/train_log.txt

For Unet-Full:

./build/tools/caffe train \
--solver ../workspace/model/unet-full/solver.prototxt \
2>&1 | tee ../workspace/model/unet-full/final_models/train_log.txt

For Unet-Lite:

./build/tools/caffe train \
--solver ../workspace/model/unet-lite/solver.prototxt \
2>&1 | tee ../workspace/model/unet-lite/final_models/train_log.txt

If errors occur regarding missing libcudart or similar, run sudo ldconfig /usr/local/cuda/lib64, then retry.

I have included log files for each of the networks which show the output of the training process:

Note that these are stored respectively at:






You should see something similar during the training process for your full models.

In general, training the full models is quite time consuming, in many cases >72 hours per model using my ML workstation.

3.0.2 Training the Models using Transfer Learning

If you would like to accelerate the process of training the models, you can also train using transfer learning, starting from the existing models that I have provided.

The pre-trained full models exist at the following paths:

For ESPNet:


For ENet:


For FPN:


For UNet-Full:


For UNet-Lite:


The following steps can be used with transfer learning to retrain either the encoder portion only or the full model. The caffemodel passed as an argument to the training step should match the desired approach:

  • If you intend to use transfer learning with the encoder portion only, then use the pre-trained model under the encoder_models directory for the associated network. Afterwards, you can train the full model using the output of this step as the input weights.

  • If you intend to use transfer learning with the full model, then use the pre-trained model under the final_models directory for the associated network.

  1. Just like with the training from scratch steps, the first step to train the model is to open the associated solver.prototxt file and verify the hyper-parameters.

  2. Once the solver.prototxt has been verified, the models can be trained by changing directory to $CAFFE_ROOT and running one of the following commands (modify the weights argument to specify the desired caffemodel for this step):

For ESPNet:

./build/tools/caffe train \
--solver ../workspace/model/espnet/solver.prototxt \
--weights ../workspace/model/espnet/final_models/pretrained.caffemodel \
2>&1 | tee ../workspace/model/espnet/caffe-fine-tune-full.log

For ENet:

./build/tools/caffe train \
--solver ../workspace/model/enet/solver.prototxt \
--weights ../workspace/model/enet/final_models/pretrained.caffemodel \
2>&1 | tee ../workspace/model/enet/caffe-fine-tune-full.log

The equivalent commands can be used to perform transfer learning on the other models as well.
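If you are scripting several runs, one way to assemble the equivalent command for another model is sketched below. It only builds the command string following the pattern of the commands above. The helper name, the log file name, and the caffemodel file name in the example call are placeholders of mine, so substitute the actual file for your network.

```python
import shlex

def caffe_train_command(model_dir, weights=None,
                        solver="solver.prototxt", log_name="train_log.txt"):
    """Build a `caffe train` command line that logs to a file via tee."""
    parts = ["./build/tools/caffe", "train",
             "--solver", f"{model_dir}/{solver}"]
    if weights:
        parts += ["--weights", weights]
    command = " ".join(shlex.quote(part) for part in parts)
    log_path = shlex.quote(f"{model_dir}/{log_name}")
    return f"{command} 2>&1 | tee {log_path}"

# Placeholder weights file; use the caffemodel you actually want to start from.
cmd = caffe_train_command(
    "../workspace/model/FPN",
    weights="../workspace/model/FPN/final_models/your_model.caffemodel",
    log_name="transfer_train_log.txt",
)
print(cmd)
```

The printed string can be pasted into a shell opened at $CAFFE_ROOT.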

If errors occur regarding missing libcudart or similar, run sudo ldconfig /usr/local/cuda/lib64, then retry.

At this point, the model training has been completed and you can proceed to the next step, which is to evaluate the floating point model. However, if you are interested in the performance/training of the pre-trained models, please take a gander at the next section: "3.1.0 About the Pre-Trained Models".

3.1.0 About the Pre-Trained Models

The pre-trained models included with this tutorial have been trained for various numbers of iterations and with various batch sizes. Note that training all of the models end to end took about 3-4 weeks on my Xeon server with 2x GTX 1080ti graphics cards.

The full settings used for training these models are captured in the log files and solver prototxt files. The initial training approach is outlined in the following table, and from this it can be seen that an attempt was made to train the models for ~1000 epochs each. This extended amount of training allows for exploratory options when picking a suitable model for deployment.

Trained models

Each of the pre-trained models achieves different levels of accuracy in terms of mIOU and some of this variation is due to the training parameters used. An initial effort was made to keep the total training epochs around 1000 for each model and the effective batch size around 170-175, with the exception of Unet-full as it was an exceptionally large model, so a reduced batch size (and therefore number of epochs) was used to speed up the training process.

Note again that the intention of this tutorial is not to benchmark different models against each other, or even to show a model that works exceptionally well, but rather to show how different segmentation models can be trained, quantized, then deployed in Xilinx SoCs while maintaining the floating point model accuracy.

As the training progressed, regular mIOU measurements were taken using decent_q (don't worry if you don't understand this yet, it's covered in section 5 - part 3) to score the models against the Cityscapes validation dataset (500 images). When viewing the plot, recall that ENet and ESPNet had separate encoder training, so the iteration counts shown for those two models do not include the encoder training iterations.

Training Progress

It can be seen from the plot that the model with the highest number of iterations does not necessarily correspond to the highest mIOU. You can also see from the fluctuations in mIOU that perhaps it might be possible to achieve better results by adjusting the learning rate and lr_policy, or by training some of the models for more iterations. In general, the models with the highest mIOUs were included as the pre-trained model for each respective network in this tutorial.

  • For ENet -> 6K iteration model
  • For ESPNet -> 18K iteration model
  • For FPN -> 10K iteration model
  • For Unet-Lite -> 10K/13K iteration models
  • For Unet-Full -> 16K iteration model
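For readers unfamiliar with the metric used to pick these snapshots, mIOU is the per-class intersection over union, averaged across classes. The sketch below computes it from a confusion matrix. It uses a made-up 3-class matrix and is only meant to show the definition, not how the tutorial's scoring scripts are implemented.

```python
def mean_iou(confusion):
    """confusion[i][j] = pixels of true class i that were predicted as class j."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Made-up pixel counts for a toy 3-class problem.
toy_confusion = [[8, 2, 0],
                 [1, 6, 1],
                 [0, 2, 5]]
toy_miou = mean_iou(toy_confusion)
print(round(toy_miou, 4))
```

For this toy matrix the per-class IoUs are 8/11, 6/12, and 5/8, which average to about 0.617.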

Note that ESPNet continued to increase in mIOU at 12K iterations, so an additional 8K iterations of training were performed to find a higher mIOU model. Additional exploratory training was done for some of the other models as well. The final models included as pre-trained are captured in the table below, which shows the training snapshot used as well as the mIOU measured for the floating point model, the quantized model on the host machine, and the model deployed on the ZCU102 hardware. Again, don't worry if it isn't yet clear how these results were achieved. The later sections in this tutorial explain how to measure the mIOU for each of these scenarios.

Training snapshot

The results as seen in the table are a bit confusing for Unet-lite. Essentially, when the 13K iteration model was evaluated against the Cityscapes validation dataset on the ZCU102 using the 1024x512 image size, the DPU timed out. The 10K model had lower accuracy in terms of mIOU but did not exhibit this issue, so the 10K model was used for the evaluation with a size of 1024x512, and the 13K iteration model is used with an input size of 512x256 when playing back the video, as it produces better accuracy and works OK with the smaller input size. This issue is currently being investigated and the tutorial will be updated when a solution is found.



About Jon Cory

Jon Cory is located near Detroit, Michigan and serves as an Automotive focused Machine Learning Specialist Field Applications Engineer (FAE) for Xilinx.  Jon’s key roles include introducing Xilinx ML solutions, training customers on the ML tool flow, and assisting with deployment and optimization of ML algorithms in Xilinx devices.  Previously, Jon spent two years as an Embedded Vision Specialist (FAE) with a focus on handcrafted feature algorithms, Vivado HLS, and ML algorithms, and six years prior as a Xilinx Generalist FAE in Michigan/Northern Ohio.  Jon is happily married for four years to Monica Cory and enjoys a wide variety of interests including music, running, traveling, and losing at online video games.