“Harmonious Attention Network for Person Re-Identification” suggests a joint learning of soft pixel attention and hard regional attention for person re-identification tasks. It is in arxiv yet and the authors are from Queen Mary University of London and Vision Semantics Ltd.
- Problem Statement
- Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input or rely on constrained attention selection mechanisms to calibrate misaligned images.
- They are therefore sub-optimal for re-id matching in arbitrarily aligned person images potentially with large human pose variations and unconstrained auto-detection errors.
- Research Objective
- Show the advantages of jointly learning attention selection and feature representation in a CNN by maximising the complementary information of dfferenct levels of visual attention subject to re-id discriminative learning constraints.
- Proposed Solution
- Proposed Harmonious Attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person re-id in uncontrolled (misaligned) images.
- Formulate a idea of jointly learning multi-granularity attention selection and feature representation for optimising person re-id.
- Propose Harmonious Attention Convolutional Neural Network (HA-CNN) to simultaneously learn hard region-level and soft pixel-level attention.
- Introduce a cross-attention interaction learning scheme for further enhancing the compatibility between attention selection and feature representation given re-id discriminative constraints.
Harmonious Attention Network
Harmonious Attention Convolutional Neural Network (HA-CNN) aims to concurrently learn a set of harmonious attention, global and local feature representations for maximising their complementary benefit and compatibility in terms of both discrimination power and architecture simplicity.
Figure 1: The Harmonious Attention Convoluntional Neural Network.
HA-CNN has multi-branch network architecture and its objective is to minimise the model complexity therefore reduce the network parameter size whilst maintaining the optimal network depth. There are two branches in HA-CNN:
- Local branch: Each stream aims to learn the most discriminative visual features for one of local image regions of a person bounding box image.
- Use three Inception-B blocks
- Every stream ends with global average pooling and one fully-connected (FC) layer
- Use cross-entropy classification
- Global branch: This aims to learn the optimal global level features from the entire person image.
- Use three Inception-A and Inception-B blocks
- The network ends with a global average pooling layer and a fully-connected (FC) layer
- Use cross-entropy classification
Harmonious Attention Learning
Figure 2: The structure of each Harmonious Attention module consists of (a) Soft Attention which includes (b) Spatial Attention (pixel-wise) and (c) Channel Attention (scale-wise), and (d) Hard Regional Attention (part-wise). Layer type is indicated by background colour: grey for convolutional (conv), brown for global average pooling, and blue for fully-connected layers. The three items in the bracket of a conv layer are: filter number, filter shape, and stride. The ReLU  and Batch Normalisation (BN)  (applied to each conv layer) are not shown for brevity.
Soft Spatial-Cannel Attention
- Spatial Attention
- First, it uses global cross-channel averaging pooling layer
- Second, it uses a conv layer of 3x3 filter with stride 2
- Channel Attention
- 4-layers squeeze-and-excitation sub-network
- It use global average pooling and two convolutional layers
Finally, given the largely independent nature between spatial (inter-pixel) and channel (inter-scale) attention, the authors propose to learn them in a joint but factorised way as:
Hard Regional Attention
- The hard attention learning aims to locate latent (weakly supervised) discriminative T regions/parts (e.g. human body parts).
- This regional attention is modeled by learning a transformation matrix below.
- It allows for image cropping, translation, and isotropic scaling operations by varying two scale factors () and the 2-D spatial position ().
- Use pre-defined region size by sixing and for limiting the model complex.
- Therefore, the effective modelling part of is only and , with the output dimension as .
- The hard regional attention is enforced on that of the corresponding network block to generate T different parts which are subsequently fed into the corresponding streams of the local branch.
Cross-attention Interaction Learning
At the l-th level, the authors utilise the global-branch feature of the k-th region to enrich the corresponding local-branch feature by tensor addition as
where is computed by applying the hard regional attention of the (l+1)-th level’s HA attention module.