“U-Net: Convolutional Networks for Biomedical Image Segmentation” introduced a well-known segmentation model that is used not only for biomedical tasks but also for general segmentation tasks, such as text, house, and ship segmentation.
Summary
- Proposed Solution
- Present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently.
- The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
- Contribution
- U-net can be trained end-to-end from very few images and outperforms the prior best method on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
- It is fast: segmenting a 512x512 image takes less than a second on a recent GPU.
U-Net
Figure 1: U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
Network Architecture
U-net consists of a contracting path (left side) and an expansive path (right side).
- Contracting path
- typical architecture of a convolutional network
- repeated application of two 3x3 convolutions (unpadded), each followed by a rectified linear unit (ReLU)
- after each pair of convolutions, a 2x2 max pooling operation with stride 2 for downsampling
- at each downsampling step, we double the number of feature channels
- Expansive path
- every step starts with an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels
- then a concatenation with the correspondingly cropped feature map from the contracting path
- then two 3x3 convolutions, each followed by a ReLU
- the cropping is necessary because every unpadded convolution loses border pixels
At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
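To make the size bookkeeping concrete, here is a minimal sketch that traces the feature-map size and channel count through the two paths and counts the convolutional layers. It does not build a real network. The function names, the 572x572 input tile, the four pooling steps, and the 64 base channels are assumptions for illustration (they follow the numbers in the paper's Figure 1, not anything stated in the text above).

```python
def conv3x3_unpadded(size):
    """An unpadded 3x3 convolution trims one pixel from every border."""
    return size - 2


def trace_unet(input_size=572, depth=4, base_channels=64):
    """Trace sizes and channels through a U-Net-style layout (illustrative only)."""
    size, channels = input_size, base_channels
    skips = []    # feature-map size saved before each pooling step
    stages = []   # (stage name, size after the stage's 3x3 convs, channels)
    crops = []    # pixels cropped from each border before concatenation
    n_convs = 0

    # Contracting path: two 3x3 convs, then 2x2 max pooling with stride 2.
    for level in range(depth):
        size = conv3x3_unpadded(conv3x3_unpadded(size))
        n_convs += 2
        stages.append((f"down{level}", size, channels))
        skips.append(size)
        if size % 2 != 0:
            raise ValueError(f"odd size {size} before 2x2 pooling")
        size //= 2       # max pooling halves the resolution
        channels *= 2    # channels double at each downsampling step

    # Lowest resolution: two more 3x3 convs.
    size = conv3x3_unpadded(conv3x3_unpadded(size))
    n_convs += 2
    stages.append(("bottom", size, channels))

    # Expansive path: up-conv, crop-and-concatenate, two 3x3 convs.
    for level in reversed(range(depth)):
        size *= 2        # upsampling doubles the resolution
        channels //= 2   # the 2x2 up-convolution halves the channels
        n_convs += 1
        crops.append((skips[level] - size) // 2)
        size = conv3x3_unpadded(conv3x3_unpadded(size))
        n_convs += 2
        stages.append((f"up{level}", size, channels))

    n_convs += 1         # final 1x1 convolution to the class scores
    return {"stages": stages, "crops": crops,
            "output_size": size, "n_convs": n_convs}


if __name__ == "__main__":
    result = trace_unet(572)
    for name, size, channels in result["stages"]:
        print(f"{name:7s} {size:4d} x {size:<4d} channels={channels}")
    print("output size:", result["output_size"])
    print("convolutional layers:", result["n_convs"])
```

Under these assumptions a 572x572 tile comes out as a 388x388 segmentation map, and the count lands on the 23 convolutional layers mentioned above (18 3x3 convolutions, 4 up-convolutions, and the final 1x1 convolution).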
Training
The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as
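\[p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}\]
where \(a_k(x)\) is the activation in feature channel \(k\) at pixel position \(x\) and \(K\) is the number of classes. The cross entropy then penalizes at each position the deviation of \(p_{l(x)}(x)\) from 1 using
\[E = \sum_{x\in \Omega} w(x) \log(p_{l(x)}(x))\]
where \(l(x)\) is the true label of pixel \(x\) and \(w(x)\) is a weight map that gives some pixels more importance during training. The separation border is computed using morphological operations. The weight map is then computed as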
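\[w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x)+d_2(x))^2}{2\sigma^2}\right)\]
where \(w_c\) is the weight map that balances the class frequencies, \(d_1\) is the distance to the border of the nearest cell, and \(d_2\) is the distance to the border of the second nearest cell.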
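The three formulas above can be sketched in plain Python for a handful of pixels. This is a toy illustration, not the original training code. The function names are hypothetical, labels are 0-indexed here (the formulas use 1..K), the defaults `w0=10` and `sigma=5` are the values the paper reports for its experiments, and the loss is written with a leading minus sign (an assumption, following the usual convention) so that it is a quantity to minimize.

```python
import math


def softmax(activations):
    """p_k = exp(a_k) / sum_k' exp(a_k') for one pixel's activations."""
    shift = max(activations)  # subtracting the max avoids overflow
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def border_weight(w_c, d1, d2, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2))."""
    return w_c + w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


def weighted_cross_entropy(activations, labels, weights):
    """Sum over pixels of -w(x) * log(p_{l(x)}(x)).

    activations: one list of K class activations per pixel
    labels:      true class index per pixel (0-indexed)
    weights:     weight w(x) per pixel
    """
    loss = 0.0
    for a, label, w in zip(activations, labels, weights):
        p = softmax(a)
        loss -= w * math.log(p[label])
    return loss


if __name__ == "__main__":
    # Two pixels, two classes. The first pixel sits in a thin gap between
    # two cells (small d1 + d2), the second is far from any border.
    activations = [[2.0, 0.5], [0.1, 3.0]]
    labels = [0, 1]
    weights = [border_weight(1.0, 1.0, 2.0), border_weight(1.0, 30.0, 40.0)]
    print("weights:", weights)
    print("loss:", weighted_cross_entropy(activations, labels, weights))
```

The pixel in the narrow gap gets a much larger weight than the distant one, so mistakes on separation borders between touching cells cost more.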