“Harmonious Attention Network for Person Re-Identification” proposes jointly learning soft pixel attention and hard regional attention for person re-identification. The paper is currently available on arXiv, and the authors are from Queen Mary University of London and Vision Semantics Ltd.

# Summary

• Problem Statement
• Existing person re-identification (re-id) methods either assume the availability of well-aligned person bounding box images as model input or rely on constrained attention selection mechanisms to calibrate misaligned images.
• They are therefore sub-optimal for re-id matching in arbitrarily aligned person images potentially with large human pose variations and unconstrained auto-detection errors.
• Research Objective
• Show the advantages of jointly learning attention selection and feature representation in a CNN by maximising the complementary information of different levels of visual attention subject to re-id discriminative learning constraints.
• Proposed Solution
• Propose a Harmonious Attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimising person re-id in uncontrolled (misaligned) images.
• Contribution
• Formulate an idea of jointly learning multi-granularity attention selection and feature representation for optimising person re-id.
• Propose Harmonious Attention Convolutional Neural Network (HA-CNN) to simultaneously learn hard region-level and soft pixel-level attention.
• Introduce a cross-attention interaction learning scheme for further enhancing the compatibility between attention selection and feature representation given re-id discriminative constraints.

# Harmonious Attention Network

Harmonious Attention Convolutional Neural Network (HA-CNN) aims to concurrently learn a set of harmonious attention, global and local feature representations for maximising their complementary benefit and compatibility in terms of both discrimination power and architecture simplicity.

Figure 1: The Harmonious Attention Convolutional Neural Network.

HA-CNN has a multi-branch network architecture designed to minimise model complexity, and therefore reduce the number of network parameters, whilst maintaining the optimal network depth. There are two branches in HA-CNN:

1. Local branch: Each stream aims to learn the most discriminative visual features for one of the local image regions of a person bounding box image.
• Use three Inception-B blocks
• Every stream ends with global average pooling and one fully-connected (FC) layer
• Use cross-entropy classification
2. Global branch: This aims to learn the optimal global level features from the entire person image.
• Use three Inception-A and Inception-B blocks
• The network ends with a global average pooling layer and a fully-connected (FC) layer
• Use cross-entropy classification
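Both branches end in the same kind of head: global average pooling, one FC layer, and a cross-entropy loss over person identities. The following is a minimal NumPy sketch of such a head, not the authors' implementation. The function names, tensor shapes, and the number of identities are all hypothetical and chosen only for illustration, and the Inception blocks that would produce the feature map are not modelled.

```python
import numpy as np

def stream_head(feature, weight, bias):
    """Global average pooling over (h, w), then one FC layer -> identity logits."""
    pooled = feature.mean(axis=(0, 1))        # (c,)
    return pooled @ weight + bias             # (num_ids,)

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

rng = np.random.default_rng(0)
num_ids, c = 5, 16                            # hypothetical sizes
global_feat = rng.standard_normal((8, 4, c))  # stand-in for the last block's output

w_g = 0.1 * rng.standard_normal((c, num_ids)) # stand-in for learned FC weights
b_g = np.zeros(num_ids)
global_loss = cross_entropy(stream_head(global_feat, w_g, b_g), label=2)
```

The same `stream_head` could be applied to each local stream's feature map; only the spatial size of the input would differ.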

## Harmonious Attention Learning

Figure 2: The structure of each Harmonious Attention module consists of (a) Soft Attention which includes (b) Spatial Attention (pixel-wise) and (c) Channel Attention (scale-wise), and (d) Hard Regional Attention (part-wise). Layer type is indicated by background colour: grey for convolutional (conv), brown for global average pooling, and blue for fully-connected layers. The three items in the bracket of a conv layer are: filter number, filter shape, and stride. The ReLU [15] and Batch Normalisation (BN) [10] (applied to each conv layer) are not shown for brevity.

### Soft Spatial-Channel Attention

1. Spatial Attention
• First, it uses a global cross-channel average pooling layer
• Second, it uses a conv layer with a 3x3 filter and stride 2
2. Channel Attention
• A 4-layer squeeze-and-excitation sub-network
• It uses global average pooling and two convolutional layers

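Finally, given the largely independent nature of spatial (inter-pixel) and channel (inter-scale) attention, the authors propose to learn them in a joint but factorised way as:

$$A^l = S^l \times C^l$$

where $S^l$ and $C^l$ denote the spatial and channel attention maps at the l-th level.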
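The factorised combination can be illustrated with a small NumPy sketch, assuming a feature map of shape (h, w, c), a spatial map of shape (h, w, 1), and a channel map of shape (1, 1, c), so that their product broadcasts to the full (h, w, c) tensor. This is a simplification rather than the authors' code: the learned 3x3 stride-2 conv of the spatial branch is omitted, the two conv layers of the channel branch are replaced by random matrices, the reduction ratio `r` is hypothetical, and the final sigmoid normalisation is an assumption made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(x):
    """Cross-channel average pooling: (h, w, c) -> (h, w, 1).
    The learned 3x3 stride-2 conv described in the text is omitted."""
    return x.mean(axis=2, keepdims=True)

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation style: global average pooling, then two
    1x1 'convs' (matrix products on the pooled vector) -> (1, 1, c)."""
    squeezed = x.mean(axis=(0, 1))             # squeeze: (c,)
    hidden = np.maximum(squeezed @ w1, 0.0)    # bottleneck with ReLU: (c // r,)
    excited = hidden @ w2                      # back to (c,)
    return excited.reshape(1, 1, -1)

def soft_attention(x, w1, w2):
    s = spatial_attention(x)                   # (h, w, 1)
    c = channel_attention(x, w1, w2)           # (1, 1, c)
    return sigmoid(s * c)                      # broadcasts to (h, w, c)

rng = np.random.default_rng(0)
h, w, c, r = 8, 4, 16, 4                       # hypothetical sizes
x = rng.standard_normal((h, w, c))
w1 = rng.standard_normal((c, c // r))          # stand-ins for learned weights
w2 = rng.standard_normal((c // r, c))
attn = soft_attention(x, w1, w2)
attended = x * attn                            # re-weighted feature map
```

Factorising this way means the spatial and channel maps together hold h·w + c values instead of the h·w·c values a full attention tensor would need.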

### Hard Regional Attention

• The hard attention learning aims to locate T latent (weakly supervised) discriminative regions/parts (e.g. human body parts).
• This regional attention is modeled by learning a transformation matrix $A^l = \begin{bmatrix} s_h & 0 & t_x \\ 0 & s_w & t_y \end{bmatrix}$.
• It allows for image cropping, translation, and isotropic scaling operations by varying two scale factors ($s_h, s_w$) and the 2-D spatial position ($t_x, t_y$).
• Use a pre-defined region size by fixing $s_h$ and $s_w$ to limit the model complexity.
• Therefore, the effective modelling part of $A^l$ is only $t_x$ and $t_y$, with the output dimension as $2 \times T$.
• The hard regional attention is enforced on that of the corresponding network block to generate T different parts which are subsequently fed into the corresponding streams of the local branch.
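The cropping step can be sketched in NumPy in the style of a spatial transformer, under several assumptions that are not from the paper: coordinates are normalised to [-1, 1], the first row of the matrix acts on the vertical image axis and the second on the horizontal one, and sampling is nearest-neighbour instead of a differentiable bilinear kernel. The helper names `region_matrix` and `crop_region` and the hand-picked offsets are hypothetical; in the model the offsets would be predicted by the network.

```python
import numpy as np

def region_matrix(t_x, t_y, s_h=0.25, s_w=1.0):
    """2x3 transformation with fixed scales; only (t_x, t_y) would be learned."""
    return np.array([[s_h, 0.0, t_x],
                     [0.0, s_w, t_y]])

def crop_region(feature, theta, out_h, out_w):
    """Sample an (out_h, out_w) region from feature (H, W, C), nearest-neighbour."""
    H, W = feature.shape[:2]
    u = np.linspace(-1.0, 1.0, out_h)                      # normalised target rows
    v = np.linspace(-1.0, 1.0, out_w)                      # normalised target cols
    uu, vv = np.meshgrid(u, v, indexing="ij")
    grid = np.stack([uu, vv, np.ones_like(uu)], axis=-1)   # (out_h, out_w, 3)
    src = grid @ theta.T                                   # source coords, (out_h, out_w, 2)
    rows = np.clip(np.rint((src[..., 0] + 1) * 0.5 * (H - 1)), 0, H - 1).astype(int)
    cols = np.clip(np.rint((src[..., 1] + 1) * 0.5 * (W - 1)), 0, W - 1).astype(int)
    return feature[rows, cols]                             # (out_h, out_w, C)

T = 4                                                      # hypothetical number of regions
feature = np.arange(8 * 4, dtype=float).reshape(8, 4, 1)
# Hand-picked offsets standing in for the 2 x T values the model would predict.
offsets = np.array([[-0.75, 0.0], [-0.25, 0.0], [0.25, 0.0], [0.75, 0.0]])
parts = [crop_region(feature, region_matrix(tx, ty), 2, 4) for tx, ty in offsets]
```

With the scales fixed, each region is described by two numbers, which is why the learned output has dimension 2 × T.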

### Cross-attention Interaction Learning

At the l-th level, the authors utilise the global-branch feature $X^{(l,k)}_G$ of the k-th region to enrich the corresponding local-branch feature $X^{(l,k)}_L$ by tensor addition as

$$\tilde{X}^{(l,k)}_L = X^{(l,k)}_L + X^{(l,k)}_G$$

where $X^{(l,k)}_G$ is computed by applying the hard regional attention of the (l+1)-th level’s HA attention module.
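As a small illustration of the interaction step, the sketch below adds each region's global-branch feature to the matching local-stream feature. It assumes the two tensors for a region already have identical shapes; the function name and the toy tensors are hypothetical, and how the global region features are obtained is not modelled here.

```python
import numpy as np

def cross_attention_interaction(local_feats, global_region_feats):
    """Enrich each local-stream feature with its global-branch counterpart
    by element-wise (tensor) addition, one pair per region k."""
    if len(local_feats) != len(global_region_feats):
        raise ValueError("expected one global region feature per local stream")
    enriched = []
    for x_l, x_g in zip(local_feats, global_region_feats):
        if x_l.shape != x_g.shape:
            raise ValueError("local and global region features must share a shape")
        enriched.append(x_l + x_g)
    return enriched

# Toy tensors for T = 3 regions, each of shape (2, 4, 3).
local_feats = [np.full((2, 4, 3), float(k)) for k in range(3)]
global_region_feats = [np.full((2, 4, 3), 10.0) for _ in range(3)]
enriched = cross_attention_interaction(local_feats, global_region_feats)
```

Addition keeps the local feature's shape unchanged, so the enriched tensor can be passed to the next block of the same stream without any extra projection layer.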
