
Too Far to See? Not Really!

--- Pedestrian Detection with Scale-aware Localization Policy



Xiaowei Zhang and Li Cheng, @Machine Learning For Bioimage Analysis Group, BII, A*STAR, Singapore
Xiaowei Zhang, Bo Li, and Hai-miao Hu, @Beihang University, China
 
  • Motivation

  • A major bottleneck of pedestrian detection lies in the sharp performance deterioration on small-size pedestrians, which are relatively far from the camera. As presented in Fig. 1, a typical image often contains multiple pedestrians of different scales, and current detection performance varies significantly across scales: state-of-the-art detectors typically work reasonably well on large pedestrians near the camera (also referred to as near-scale), whereas their performance on small-size (i.e. far-scale) pedestrians becomes considerably worse. Take one recent effort, MS-CNN, as an example: it has been reported to achieve a 3.30% log-average miss rate for near-scale pedestrians (taller than 80 pixels) on the Caltech Pedestrian Benchmark, yet the miss rate increases to 60.51% for medium- and far-scale pedestrians (shorter than 80 pixels). A short sketch of how this log-average miss-rate metric is commonly computed is given below, after the caption of Fig. 1.


    Figure 1: In pedestrian detection, a typical input image usually contains multiple pedestrian instances over different scales. (a) An input image from the Caltech benchmark. (b) The scale distribution of pedestrian heights from the same Caltech dataset; one can observe that far-scale instances in fact dominate the distribution. (c) and (d) compare exemplar visual appearances of near- and far-scale instances, together with the corresponding neural feature representations from the appropriate layers.
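
    For readers unfamiliar with the metric quoted above, the following is a minimal Python sketch of how the log-average miss rate is commonly computed on Caltech-style benchmarks: the miss rate is sampled at nine FPPI (false positives per image) values spaced evenly in log space over [1e-2, 1e0] and averaged geometrically. The function name and the toy curve are illustrative assumptions, not part of any released evaluation code.

      import numpy as np

      def log_average_miss_rate(fppi, miss_rate, n_points=9):
          """Sample the miss rate at n_points FPPI values spaced evenly in
          log space over [1e-2, 1e0], then take their geometric mean."""
          fppi = np.asarray(fppi, dtype=float)
          miss_rate = np.asarray(miss_rate, dtype=float)
          refs = np.logspace(-2.0, 0.0, n_points)
          sampled = []
          for ref in refs:
              below = np.where(fppi <= ref)[0]
              # if the curve never reaches this FPPI, pessimistically use 1.0
              sampled.append(miss_rate[below[-1]] if below.size else 1.0)
          return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))

      # Toy usage with a made-up (FPPI, miss rate) curve, sorted by FPPI:
      fppi_curve = [0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
      mr_curve   = [0.80, 0.70, 0.55, 0.45, 0.35, 0.30]
      print(log_average_miss_rate(fppi_curve, mr_curve))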

  • Our Approach

  • To address this challenge, we propose an active pedestrian detector that explicitly operates over multi-layer neural representations of the input still image. More specifically, convolutional neural networks, namely ResNet and Faster R-CNN, are exploited to provide a rich and discriminative hierarchy of feature representations as well as initial pedestrian proposals. Each pedestrian observation of a distinct size can be best characterized by the ResNet feature representation at a certain layer of the hierarchy. Meanwhile, initial pedestrian proposals are obtained with Faster R-CNN techniques: a region proposal network followed by a region-of-interest (RoI) pooling layer applied right after the specific ResNet convolutional layer of interest, which jointly predicts the locations and categories (i.e. pedestrian or not) of the bounding-box proposals. These proposals then serve as input to our active detector, where for each initial proposal a sequence of coordinate-transformation actions is carried out to determine its proper x-y 2D location and layer of feature representation, or to eventually terminate the proposal as background; see the sketch after Fig. 2 below. Empirically, our approach produces overall lower detection errors on widely-used benchmarks, and it works particularly well for far-scale pedestrians. For example, compared with the 60.51% log-average miss rate of the state-of-the-art MS-CNN for far-scale pedestrians (those below 80 pixels in bounding-box height) on the Caltech benchmark, the miss rate of our approach is 41.85%, a notable reduction of 18.66%.

    Pipeline of our approach.

    Figure 2: The flowchart of our proposed approach. Multi-layer representations of ResNet are utilized, layer by layer, to generate pedestrian proposals of different sizes, which are then passed to our localization policy module to produce the final detections.
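
    To make the scale-aware pipeline described above more concrete, here is a minimal Python sketch under stated assumptions: the layer thresholds, the action set, the step sizes, and the policy_fn/score_fn callables are illustrative stand-ins for the learned RoI scoring and localization policy, not the released implementation.

      # Illustrative sketch only: thresholds, action names, and the
      # policy_fn / score_fn callables are assumptions, not released code.

      def select_feature_layer(box_height):
          """Pick the ResNet stage whose stride/receptive field best matches
          the proposal height (near-scale -> deeper, far-scale -> shallower)."""
          if box_height >= 80:
              return "conv5"
          elif box_height >= 40:
              return "conv4"
          return "conv3"

      def refine_proposal(box, policy_fn, score_fn, max_steps=10):
          """Apply a sequence of coordinate-transformation actions chosen by a
          learned localization policy until the proposal is accepted as a
          pedestrian, rejected as background, or the step budget runs out."""
          x, y, w, h = box
          for _ in range(max_steps):
              layer = select_feature_layer(h)
              action = policy_fn((x, y, w, h), layer)  # decided from RoI features
              if action == "terminate_background":
                  return None                           # discard this proposal
              if action == "terminate_pedestrian":
                  return (x, y, w, h), score_fn((x, y, w, h), layer)
              # otherwise translate / rescale the box by a fraction of its size
              if action == "left":       x -= 0.1 * w
              elif action == "right":    x += 0.1 * w
              elif action == "up":       y -= 0.1 * h
              elif action == "down":     y += 0.1 * h
              elif action == "wider":    w *= 1.1
              elif action == "narrower": w /= 1.1
              elif action == "taller":   h *= 1.1
              elif action == "shorter":  h /= 1.1
          return (x, y, w, h), score_fn((x, y, w, h), layer)

      # Example usage with trivial stand-in policy/score functions:
      final = refine_proposal((10.0, 20.0, 30.0, 60.0),
                              policy_fn=lambda b, l: "terminate_pedestrian",
                              score_fn=lambda b, l: 0.9)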

  • Demo video

    Video

  • Related publications, code, and results

  1. Xiaowei Zhang, Li Cheng, Bo Li, and Hai-miao Hu. Too Far to See? Not Really! --- Pedestrian Detection with Scale-aware Localization Policy. arXiv preprint, 2017. [pdf] [Supplementary file] [Source code] [Detection results on Caltech, ETH, and TUD-Brussels benchmarks]