Multiple Target Localisation at over 100 FPS
Simon Taylor
Tom Drummond
Department of Engineering
University of Cambridge
Cambridge, UK
Abstract
This paper presents a method for fast feature-based matching which enables 7 independent targets to be localised in a video sequence with an average total processing time of 7.46 ms per frame. We extend recent work [14] on fast matching using Histogrammed Intensity Patches (HIPs) by adding a rotation invariant framework and a tree-based lookup table. Compared to state-of-the-art fast localisation schemes [15] we achieve better matching robustness in under a quarter of the computation time, while requiring 5-10 times less memory.
1 Introduction
Finding points in different views of a scene which correspond to the same real world locations is a fundamental problem in computer vision, and a vital component of applications such as automated panorama stitching (e.g. [2]), image retrieval (e.g. [13]) and object localisation (e.g. [7]).
A common theme in many successful approaches to these problems is the extraction of a set of local features from the images to be matched. An overall match between two images can then be established by combining information from many local feature correspondences. The use of information from many local matches adds redundancy and allows the methods to cope with partial occlusion and some incorrect correspondences.
The first stage of all state-of-the-art matching schemes is to apply interest region detection to factor out common imaging transformations. The Harris corner detector [4] has been used frequently, but modern methods commonly use more expensive searches for scale-space interest regions such as the DoG detector [7]. Finding affine-invariant interest regions has also been extensively studied [8, 9]. A canonical orientation can be assigned to an interest region, for example by considering the blurred gradient at the centre of the region [2].
The most basic representation of the interest region is obtained by extracting a pixel patch from the canonical frame that has been assigned. Simple patch-matching schemes such as Normalised Cross Correlation (NCC) or Sum-of-Squared Differences (SSD) do not perform well when subject to the small registration errors introduced by interest region detectors, and hence a more complicated matching scheme is usually employed. Two broad approaches have been studied in the literature. The first class of methods perform further processing on the extracted patches to compute a feature descriptor vector which is ideally equal for different views of the same feature. The second class of methods treats the matching problem as one of classification, and can obtain matches with very little computation on the input patches after classifiers for each database feature have been learnt in an offline training stage.

(c) 2009. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.
BMVC 2009 doi:10.5244/C.23.58

Figure 1: Four frames from a sequence demonstrating independent multiple object localisation. No frame-to-frame tracking is performed; the objects are localised in each frame. The mean total computation time per frame is 7.46 ms using a single core of a 2.4 GHz CPU.
David Lowe's SIFT method [7] is one of the most common descriptor-based approaches. SIFT uses blurring and soft-bin histogramming of local gradients to extract a descriptor vector which is robust to the errors introduced by interest region detection and orientation assignment. Many other approaches to transforming an image patch into a feature vector have been proposed, such as GLOH [10], MOPS [2], and CS-LBP [5]. Winder and Brown applied a learning approach to find optimal parameters for these types of descriptor [16].
Lepetit et al. [6] demonstrated the viability of the alternative matching-by-classification approach by showing that an offline training phase could be used to train randomised tree classifiers for features. The Ferns method [11] uses a different classifier with improved performance. Both of these methods only require a small number of simple pixel tests on the runtime images to classify features, and hence require very little computation. However the classifiers represent joint distributions of the tests and so have large memory requirements.
Both the SIFT and Ferns approaches as originally presented require too much memory and computation to be suitable for real-time applications on small devices such as mobile phones. Recent work by Wagner et al. [15] adapted both approaches to make them suitable for low-powered platforms. A key change to both methods was replacing the expensive scale-space search for interest regions with the very efficient fixed-scale FAST-9 detector [12], and instead achieving scale invariance at runtime by ensuring the database contains features from multiple scales. Their optimised methods were both able to localise a planar target in a 320×240 image in a total frame time of around 5 ms on a desktop PC. The technique we propose achieves more robust localisation on the same test sequences and reduces both the total frame time and memory requirements by a factor of more than 4.
Our approach is based on simple pixel patches extracted from around interest points. Although SSD-based matching is not robust when subject to small registration errors, registration errors do not affect all pixels equally; samples from the interior of large regions of solid colour in a patch are more robust to registration errors. We employ a training phase to learn a model for the range of patches expected for each feature using independent per-pixel distributions, which we refer to as a Histogrammed Intensity Patch (HIP). This model allows runtime matching to use simple pixel patches whilst providing sufficient viewpoint invariance to handle registration errors from interest point detection. The histograms are quantised to give a small binary representation that can be very efficiently matched at runtime.
In previous work [14] we introduced the approach and used the efficient FAST-9 detector to factor out translation changes. We trained independent sets of features from many different viewpoint bins covering scale, rotation and affine variations, which resulted in large databases of around 13,000 features for a single target. Despite the large database size, a simple indexing scheme combined with the binary representation's efficient matching score enabled localisation of a target in a total frame time of around 1.5 milliseconds.
This paper presents a number of significant improvements to the approach. Firstly, canonical orientation computation is added to the interest point detection stage, allowing the number of features for a target to be reduced by a factor of around 15. A novel two-pass approach to training accounts for errors related to interest point detection and orientation assignment, and allows us to choose efficient methods for those stages without sacrificing robustness of the overall system. A tree-based matching scheme is introduced to exploit common information in different features and avoid the need for an exhaustive comparison against all database features. Finally, a framework for rapid independent multiple-target localisation using HIPs is presented.
2 Training Features for a Target
As in the Ferns [11] approach we utilise a training phase to reduce the amount of computation required at runtime. We use a training set of views of the entire target covering the range of viewpoints for which localisation is desired. The set of views is artificially generated by warping a single reference image.
For maximum runtime performance we avoid scale- or affine-invariant interest region detectors and instead train independent sets of features for different viewpoint "bins", each of which covers a small range of scale and affine viewpoint parameters. The experiments in this paper use 9 scale bins in total, 3 per octave. In practice we only use one range of affine parameters representing viewpoints centred on a direct view of the target but including out-of-plane rotations of up to 40 degrees in all directions. This single set of affine parameters enables reasonable matching beyond this range, but additional extreme affine viewpoint bins could be added if required.
Around 1000 images are generated for each viewpoint bin. Each image is generated by warping the reference image with a random camera-axis rotation and randomly chosen scale and affine parameters from within the range of the viewpoint bin. Additionally a small random Gaussian blur and pixel noise are added so the training set more accurately represents the poor quality images likely at runtime. Our current implementation takes around 20 minutes to generate all the training images for a typically-sized target, and uses large kernels convolved with the reference image to avoid aliasing artifacts.
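The sampling of random viewpoint parameters for one training image can be sketched as follows. The decomposition of the affine component into a tilt angle and tilt axis, and the exact blur and noise ranges, are assumptions for illustration; the text does not specify them.

```python
import math
import random

def sample_warp(scale_bin, rng=random):
    """Sample one random warp for a training image in a given scale bin.

    Scale bins are spaced 3 per octave as in the text; the remaining
    parameter ranges are hypothetical illustrative choices.
    """
    base_scale = 2.0 ** (-scale_bin / 3.0)          # 3 bins per octave
    scale = base_scale * 2.0 ** rng.uniform(-1 / 6.0, 1 / 6.0)
    in_plane = rng.uniform(0.0, 2.0 * math.pi)      # camera-axis rotation
    tilt = rng.uniform(0.0, math.radians(40.0))     # out-of-plane rotation
    tilt_axis = rng.uniform(0.0, 2.0 * math.pi)     # direction of the tilt
    blur_sigma = rng.uniform(0.0, 1.0)              # small random blur (assumed range)
    noise_sigma = rng.uniform(0.0, 5.0)             # pixel noise (assumed range)
    return {"scale": scale, "in_plane": in_plane, "tilt": tilt,
            "tilt_axis": tilt_axis, "blur": blur_sigma, "noise": noise_sigma}
```

Each sampled parameter set defines one warp of the reference image; around 1000 such warps are drawn per viewpoint bin.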
The images in the training set are similar to the frames we expect at runtime, although as they are warped views of the entire target they are often larger than standard camera resolutions. To ensure all possible camera views contain sufficient features for localisation we split the viewpoints into 200×200 pixel regions defined in a viewpoint reference frame, which is taken as an unrotated view of the target from the centre of the viewpoint bin.
A two-stage training approach is used to identify repeatable features in each region and build feature models for them. The first stage is to run interest point detection and orientation assignment on all of the training images. The position and orientation of each detection and the appearance of the surrounding image region are stored in a structure termed a subfeature. The second stage then clusters subfeatures based on position and orientation to identify repeatable features, and builds a Histogrammed Intensity Patch model for each feature by combining the appearance information from the subfeatures in each cluster.
Figure 2: Left: the 8×8 sample grid used for the HIPs and the 5 sample locations selected for indexing, relative to the FAST-9 interest point (shown by the grey circle). Right: the orientation assignment scheme uses a sum of the gradients between opposite pixels in the 16-pixel FAST ring.
2.1 Selecting Repeatable Feature Positions and Orientations
Runtime performance considerations led us to select FAST-9 [12] as the interest point detector. Typical approaches to assigning orientation require computationally expensive blurring [2] or histogramming [7] and would add significant computation to the runtime processing. Instead we simply sum gradients computed between opposite pixels in the 16-pixel ring used in FAST corner detection, as shown in Figure 2. The directions are fixed, so the x and y components of the orientation vector can be computed very quickly from weighted sums of the 8 pixel differences.
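The fixed-direction gradient sum can be sketched directly on the FAST ring. The offsets below are the standard radius-3 Bresenham circle used by FAST; the per-pair normalisation is an assumed detail, since the text only states that the components are weighted sums of the 8 pixel differences.

```python
import math
import numpy as np

# Offsets of the 16-pixel Bresenham ring (radius 3) used by FAST, listed so
# that entry k and entry k+8 are diametrically opposite.
RING = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
        (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def orientation(img, x, y):
    """Orientation angle at corner (x, y) from the 8 opposite-pixel gradients."""
    ox, oy = 0.0, 0.0
    for k in range(8):
        dx1, dy1 = RING[k]
        dx2, dy2 = RING[k + 8]
        # Gradient between a pair of opposite ring pixels.
        g = float(img[y + dy1, x + dx1]) - float(img[y + dy2, x + dx2])
        # Fixed direction of this pair; the weights are constants in practice.
        n = math.hypot(dx1 - dx2, dy1 - dy2)
        ox += g * (dx1 - dx2) / n
        oy += g * (dy1 - dy2) / n
    return math.atan2(oy, ox)
```

On an image with a pure horizontal intensity ramp the opposite-pair y-contributions cancel and the returned angle is 0, as expected.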
We run FAST-9 on each training image within a viewpoint bin and represent the 35 highest-scoring corners from each 200×200 region with subfeatures. Proportionally fewer subfeatures are extracted from smaller regions at the edges of the viewpoint reference frame. For smaller scale viewpoint bins where the entire target is under 200×200 pixels a 35-corner minimum is enforced, which effectively increases the feature density for these smaller targets. The orientation measure of Figure 2 is also computed at each detected corner. The position (x_r, y_r) and orientation θ_r of the subfeature in the coordinate system of the viewpoint reference frame can be computed as the warp used to generate the training image is known.
The appearance of the subfeature is represented by a sparsely-sampled quantised patch. We use a square 8×8 sampling grid centred on the interest point, with a 1-pixel gap between samples, as shown in Figure 2. Before sampling the pixel values the sampling grid is first rotated so that it is aligned with the detected orientation of the subfeature. The 64 samples are then extracted with bilinear interpolation, normalised for mean and standard deviation to give lighting invariance, and quantised into 5 intensity bins. The 5-bit index value explained in Section 3.2 is also computed and stored in the subfeature.
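A sketch of the rotated sparse sampling and quantisation described above, assuming bin boundaries at fixed multiples of the patch standard deviation (the text does not give the exact quantisation thresholds):

```python
import numpy as np

def sample_patch(img, cx, cy, theta, n=8, spacing=2.0, bins=5):
    """Rotated n x n sparse patch, mean/std normalised, quantised to `bins` levels.

    `spacing=2.0` gives the 1-pixel gap between samples; the bin boundaries
    below are an illustrative assumption.
    """
    c, s = np.cos(theta), np.sin(theta)
    coords = (np.arange(n) - (n - 1) / 2.0) * spacing
    samples = np.empty((n, n), dtype=np.float64)
    for iy, gy in enumerate(coords):
        for ix, gx in enumerate(coords):
            # Rotate the grid offset into the detected orientation.
            x = cx + c * gx - s * gy
            y = cy + s * gx + c * gy
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            fx, fy = x - x0, y - y0
            # Bilinear interpolation of the four neighbouring pixels.
            samples[iy, ix] = ((1 - fx) * (1 - fy) * img[y0, x0]
                               + fx * (1 - fy) * img[y0, x0 + 1]
                               + (1 - fx) * fy * img[y0 + 1, x0]
                               + fx * fy * img[y0 + 1, x0 + 1])
    # Normalise for mean and standard deviation to give lighting invariance.
    samples = (samples - samples.mean()) / (samples.std() + 1e-9)
    edges = np.linspace(-1.5, 1.5, bins - 1)  # assumed bin boundaries
    return np.digitize(samples.ravel(), edges)
```

The result is a vector of 64 bin indices in {0, ..., 4}, one per sample.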
The most repeatable feature positions and orientations for a viewpoint bin appear as dense clusters of subfeatures in the (x_r, y_r, θ_r) space when all the training images in the bin have been processed. Every subfeature is considered as the potential centre of a HIP feature, and the set of other subfeatures (from other training images) that lie within a certain distance of the centre is found. We manually decide the allowable distance; in this paper we allow 2 pixels of localisation error and 10 degrees of orientation error. Sets of subfeatures within these allowed distances share enough similarity in appearance to be represented by a single HIP feature in a target database. The largest set of subfeatures represents the most repeatably-detected feature, and will be the first feature we select to add to the database. Sets continue to be selected in a greedy manner, disregarding any which overlap already-added HIPs. We continue adding features until the average number of subfeatures per training image region which are represented in the database has reached a specified fraction of the number of subfeatures detected. We set this parameter to 0.5 in our experiments. This criterion naturally compensates for inaccuracies in interest point detection and orientation assignment by adding multiple database features to represent a single cluster if the errors are too large. Hence our use of an inexpensive and inaccurate orientation assignment scheme may result in more features being added to the database but should not cause a major degradation in matching robustness. The trade-off between matching robustness and database size can be made by adjusting the parameters for desired corner density and desired fraction of corners in the database.
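The greedy cluster selection can be sketched as follows. This simplified version stops on a global coverage fraction rather than a per-region one and ignores appearance; the 2-pixel and 10-degree tolerances follow the text.

```python
import math

def select_features(subfeatures, pos_tol=2.0, angle_tol=10.0, target_fraction=0.5):
    """Greedy selection of HIP feature centres from subfeature clusters.

    `subfeatures` is a list of (x, y, theta_degrees) tuples in the viewpoint
    reference frame. Returns the chosen cluster centres, largest cluster first.
    """
    def members(centre):
        cx, cy, ct = centre
        out = []
        for i, (x, y, t) in enumerate(subfeatures):
            d_angle = abs((t - ct + 180.0) % 360.0 - 180.0)  # wrapped angle error
            if math.hypot(x - cx, y - cy) <= pos_tol and d_angle <= angle_tol:
                out.append(i)
        return out

    covered, chosen = set(), []
    # Keep adding features until the required fraction of subfeatures is covered.
    while len(covered) < target_fraction * len(subfeatures):
        best = max(subfeatures, key=lambda c: len(set(members(c)) - covered))
        gain = set(members(best)) - covered
        if not gain:
            break
        covered |= gain
        chosen.append(best)
    return chosen
```

With a tight cluster of detections plus scattered outliers, the cluster centre is picked first and selection stops once the coverage fraction is met.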
2.2 Creating HIPs from Subfeature Sets
After a particular set of subfeatures has been selected for addition to the database, quantised patches from the subfeatures are combined to give the Histogrammed Intensity Patch representation for the feature. The HIP model contains 64 independent histograms of 5 quantised intensity levels; one histogram for each sample of the quantised patches. The histograms are empirical distributions of the quantised intensity in a particular sample across all the subfeatures in the set. Thus samples which have a consistent quantised level in all of the constituent subfeatures will have sharply peaked histograms in the HIP. To save on memory and computation required for matching we quantise the histograms to a binary representation. Histogram bins with probabilities less than 5% are rare bins and represented by a 1, and other bins by a 0. A single HIP requires 5 bits for each of 64 samples, a total of 40 bytes. We can refer to each bit of a database HIP D as D_i,j where i ∈ {0,...,63} is the sample number and j ∈ {0,...,4} is the quantised intensity level.
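Building the binary HIP from a cluster's quantised patches is then a simple thresholding of the empirical histograms, as a minimal sketch:

```python
import numpy as np

def build_hip(quantised_patches, rare_threshold=0.05, bins=5):
    """Build the binary HIP from a cluster's quantised patches.

    `quantised_patches` is an (N, 64) integer array of bin indices in [0, bins).
    Returns a (64, bins) boolean array in which True marks a 'rare' bin
    (empirical probability below `rare_threshold`), matching the
    5-bits-per-sample, 40-byte representation described in the text.
    """
    patches = np.asarray(quantised_patches)
    n = patches.shape[0]
    hip = np.empty((patches.shape[1], bins), dtype=bool)
    for j in range(bins):
        # Empirical probability of level j at each of the 64 sample positions.
        prob = (patches == j).sum(axis=0) / float(n)
        hip[:, j] = prob < rare_threshold
    return hip
```

A sample with a perfectly consistent level across the cluster yields one common bin (0) and four rare bins (1), i.e. a sharply peaked histogram.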
The process of extracting subfeatures from training images, finding the largest subfeature sets and building the HIP representations takes less than 5 minutes on a desktop machine for a typical target with around 10000 images in the training set.
3 Runtime Matching
We use a fixed-scale detector to avoid a dense scale-space search at runtime, but we still find it useful to build a sparse image pyramid by half-sampling the input image twice to obtain half- and quarter-scale images. As well as having significantly reduced blur, which improves repeatability of the fixed-scale FAST detector, these sub-sampled images also enable matching over a wider range of scales, as features can still be detected in the reduced images if the target scale in the input image is larger than the trained scale. Using half-sampling by averaging 4 pixels for each 1 of the reduced image permits a very efficient SIMD implementation which produces both reduced images from a 640×480 frame in under 0.1 ms.
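One level of the averaging half-sample (without the SIMD optimisation) can be sketched as; applying it twice gives the quarter-scale image:

```python
import numpy as np

def half_sample(img):
    """Half-sample a greyscale image by averaging each 2x2 block of pixels
    (one level of the sparse pyramid described in the text)."""
    h, w = img.shape[0] & ~1, img.shape[1] & ~1  # drop any odd row/column
    img = img[:h, :w].astype(np.uint16)          # widen so 4-pixel sums fit
    # Rounded average of the four pixels in each 2x2 block.
    return ((img[0::2, 0::2] + img[0::2, 1::2]
             + img[1::2, 0::2] + img[1::2, 1::2] + 2) >> 2).astype(np.uint8)
```

A 640×480 frame reduces to 320×240, then 160×120 on the second application.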
The runtime images are treated similarly to those from the training set: FAST-9 is used to extract the highest-scoring corners from the images, the orientation assignment scheme of Figure 2 is employed, and a patch is extracted from the rotated sparse sampling grid and normalised for mean and standard deviation. We use bilinear interpolation with precomputed pixel positions and weights in 2-degree increments, so the additional cost of rotation normalisation is minimal, and find that around 150 corners from the full-scale image and 75 from each of the reduced-scale images are sufficient for excellent matching robustness.
For efficient matching the normalised patch is converted to a binary representation R. This is slightly different to the database feature D as it represents just a single patch. Like D, we use a 320-bit value but R has exactly one bit set for each pixel, corresponding to the intensity bin for each sample in the patch:

R_i,j = 1 if B_j <= I_i < B_(j+1), and 0 otherwise,

where I_i is the normalised intensity of sample i and B_j, B_(j+1) are the boundaries of quantised intensity bin j.
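Given the two binary representations, a dissimilarity score can be computed by counting samples whose observed bin is rare in the database feature. The AND-plus-popcount form below is a minimal sketch of this bitwise matching, using Python integers to stand in for the 320-bit values; the acceptance threshold on the score is not specified here.

```python
def hip_dissimilarity(d_bits, r_bits):
    """Count samples whose observed intensity bin is rare in the database HIP.

    `d_bits` has a 1 for every rare histogram bin of database feature D, and
    `r_bits` has exactly one bit set per sample, marking the observed bin.
    Each set bit of the AND is one sample falling in a rare bin.
    """
    return bin(d_bits & r_bits).count("1")
```

A runtime patch whose samples all land in common (non-rare) bins of D scores 0, indicating a strong match candidate.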