600字范文 > 多目标跟踪论文 Deep SORT 解读

多目标跟踪论文 Deep SORT 解读

时间：2021-04-07 05:12:48

多目标跟踪论文 Deep SORT 解读

flyfish

Deep SORT 是简称，原题目是《simple online and realtime tracking with a deep association metric》

由于目标识别（object detection）的进展，基于检测的跟踪（ tracking-by-detection）变成了多目标跟踪（multiple object tracking）的主导范式（ leading paradigm）。

paradigm：a model or example that shows how something works 。

Simple online and realtime tracking (SORT)

Track处理和状态估计（Track Handling and State Estimation）

状态估计:

我们的Track场景定义了8维的状态空间 ( u , v , γ , h , x ˙ , y ˙ , γ ˙ , h ˙ ) (u, v, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h}) (u,v,γ,h,x˙,y˙,γ˙,h˙),分别表示边框（bounding box）中心的位置、纵横比、高度、以及在图像坐标（image coordinate）中对应的各自速度。

边框的中心位置 ( u , v ) (u,v) (u,v),

纵横比 γ \gamma γ,例如显示器常见的宽高比，宽度和高度的比是 16:10，则宽度16，高度10

高度 h h h,有了高度，根据宽高比就能计算宽度

我们使用一个标准的Kalman filter得到 ( u , v , γ , h ) (u, v, \gamma, h) (u,v,γ,h)，边界坐标（bounding coordinates） ( u , v , γ , h ) (u, v, \gamma, h) (u,v,γ,h)作为直接观察对象的状态。这个标准的Kalman filter采用匀速运动模型（ constant velocity motion）和线性观测模型（ linear observation ）。

文中的state space按照计算机科学理解

状态空间作为一个简单的机器模型在计算机科学中很有用。可以用多元组（有限个元素组成的序列）[N, A, S, G]来定义，其中

N是由状态组成的集合。

A是连接集合N中所有状态的弧的集合。

S是一个集合N的非空子集合，其中包括启始状态。

G是一个集合N的非空子集合，其中包括目的状态。

简单理解就是可能状态的集合

文中的空间不仅仅有状态空间（state space）还有

appearance space

measurement space

image space

Track的三种状态

enum TrackState {Tentative = 1, Confirmed, Deleted};

在新创建的Track没有收集到足够的证据之前都是暂定（Tentative）状态

收集到足够证据之后是已确认（Confirmed）状态

不再alive的状态都是标记为已删除（Deleted）状态

alive状态包括Tentative和Confirmed

since the last successful measurement association 从上次成功测量关联

有两个索引向量也就是装对象的容器

一个是Track Indices T = { 1 , … , N } \mathcal{T}=\{1, \dots, N\} T={1,…,N}

一个是Detection indices D = { 1 , … , M } \mathcal{D}=\{1, \dots, M\} D={1,…,M}

Maximum Age A m a x = 30 A_{max}=30 Amax=30

typedef std::vector<DetectionObject> DetectionObjects;typedef std::vector<TrackObject> TrackObjects;

一个是Track对象向量，一个是Detection对象向量，这两个对象之间建立关联

对于每一个Track k，我们计算从上次成功测量关联的帧数,该帧数是 a k a_k ak。

此计数器在Kalman filter预测期间递增，当Track与测量相关联就重新设置为0。

超过预先设定的最大阈值 A m a x = 30 A_{max}=30 Amax=30的Track被认为已经离开场景并从Track 向量中删除。

对新Track出现的判断则是: 如果某次检测结果中的某个Detection始终无法与已经存在的Track进行关联，那么则认为可能出现了新Track。这些新的Track的前三帧的状态是Tentative。三帧后认为收集到足够的证据Track状态可以是Confirmed

在此期间，我们期望在每个时间步骤中都有一个成功的度量关联。

在前三帧中没有成功关联到度量的Track被删除。

再解释下

第一帧处理

假设程序从摄像头中读取了一帧，一帧里面有很多检测物体，这里为了方便说明先说只有一个物体，检测到物体之后，可视化的话就是带了一个边框，我们把这个目标检测结果叫Detection 目标检测完了，还需要目标跟踪。Detection 与 track 关联，如果没有与旧的Track建立关系，那么就新建一个Track，因为要跟踪多个物体就有很多Track，它们组成Track列表，先假设这个Track叫TrackA，这个TrackA状态为Tentative，处于暂定的不确定状态。

后续帧处理

紧接着开始检测第二帧的Detection ，当发现可以与旧的Track 即TrackA建立关联，那就继续

当Detection 没有与Track列表的任何一个Track建立关联的时候，与上面的第一帧处理一样。

当三帧的Detection 都与TrackA建立关联那么TrackA的状态就变成了Confirmed

代码类似这样

初始的时候int n_init_ = 3;在Track的状态设置为confirmed之前的连续检测帧数。

循环检测帧的时候，状态发生改变

当TrackA的状态已经是Confirmed状态，后面的每帧如果Detection与TrackA相关联的时候time_since_update_就设置为了0，后面每帧的Detection没有与TrackA相关联的时候

time_since_update_每帧都会加1，直到大于max_age_（30）的时候，TrackA的状态设置为Deleted。

建立关联或者相关联就像Detection给TrackA发心跳，30帧后TrackA还没有收到心跳，TrackA就要从Track列表删除。

如果是初始的时候Track的状态还是Tentative状态，连续3帧，TrackA还没收到心跳，TrackA也要从Track列表删除。

建立关联或者相关联文中用的 measurement association

measurement n. 测量；[计量] 度量；尺寸；量度制

association n. 协会，联盟，社团；联合；联想；关联

measurement的解释

measurement不仅仅包含描述一个边框，边框就像rect，这里是建立类似或者结果体能够转换成x,y,a,h (center x, center y, ration, h)和t,l,b,r (top,left,bottom,right)

我更熟悉的顺序是 left,top,right,bottom，还可以包括confidence 置信度等等主要是看代码实现的思路，confidence翻译成可信度，可靠度还是容易理解的

防止说度量关联或者测量关联不易理解，就写成了相关联

这就解释了 Track是如何建立的，在什么情况下要删除Track

hits_ += 1;time_since_update_ = 0;if (state_ == Tentative && hits_ >= n_init_){state_ = Confirmed;}

最大阈值 A m a x = 30 A_{max}=30 Amax=30

Track被删除的情况

int max_age_ = 30;void mark_missed(){if (state_ == Tentative){state_ = Deleted; }else if (time_since_update_ > max_age_){state_ = Deleted;}}

分配问题（Assignment Problem）

名词

Assignment n. 分配；任务；作业；功课

Mahalanobis distance：马氏距离

confidence interval：置信区间

mean：数学期望、均值、期望

cosine distance：更准确的单词是Cosinesimilarity，余弦相似度，余弦相似性

short-term ：短期反义词 long-term ：长期

on the one hand;on the other hand 一方面…在另一方面

intersection over union 也就是常说的IoU 交并比

文中的Invese χ 2 \chi^{2} χ2 distribution，英文里是inverse-chi-squared distribution或者inverted-chi-square distribution

中文可以翻译成逆卡方分布或倒卡方分布，它的倒数就是卡方分布

chi-square distribution或者 χ 2 \chi^{2} χ2distribution看上去字母像x实际是 chi ，卡方分布

若k个随机变量 Z 1 , … … . Z k Z_{1}, \ldots \ldots . Z_{k} Z1,…….Zk是相互独立，符合标准正态分布的随机变量（数学期望为0、方差为1），则随机变量Z的平方和

X = ∑ i = 1 k Z i 2 X=\sum_{i=1}^{k} Z_{i}^{2} X=∑i=1kZi2被称为服从自由度为 k 的卡方分布,记做

X ∼ χ 2 ( k ) X ∼ χ k 2 \begin{array}{l}{X \sim \chi^{2}(k)} \\ {X \sim \chi_{k}^{2}}\end{array} X∼χ2(k)X∼χk2

理解步骤一个算法先当做黑盒操作，知道输入输出，会用。再看内容流程，流程知道了之后，再看理论基础。

卡尔曼滤波是什么？有什么用？

什么是卡尔曼？什么是滤波？

《卡尔曼滤波原理及仿真应用》书上这么说的

滤波一词起源于通信理论,它是从含有干扰的接收信号中提取有用信号的一种技术。

利用一种手段抑制无用信号，增强有用信号的数字信号处理过程

是一种统计估计理论，用于预测动态系统的未来的变化趋势。

卡尔曼滤波器的操作包括两个阶段：预测与更新。

也就编写代码的时候，一个函数名字是predict，另一个是update

KalmanFilter::predict(KAL_MEAN &mean, KAL_COVA &covariance)

KalmanFilter::update(

const KAL_MEAN &mean,

const KAL_COVA &covariance,

const DETECTBOX &measurement)

吴军的谷歌方法论中描述登月中卡尔曼滤波的应用

登月首先要保证在月球上着陆的地点准确，而且要保证返回火箭和飞船能够在月球轨道上准确对接，这就要用到控制论了，这一点我们在前面介绍控制论时介绍过了。对于阿波罗登月控制方面贡献最大的科学家是数学家卡尔曼，他在维纳控制论的基础上提出了卡尔曼滤波，这从理论上保证了火箭能够准确无误地抵达登月的地点。

不过，在实现卡尔曼滤波的过程中又遇到了麻烦，当时控制阿波罗登月的大型计算机还没有今天一台智能手机快，因此完成不了卡尔曼滤波的计算，于是许多控制专家和计算机专家经过努力，在误差允许的范围内简化了数学模型，这才实现了登月的控制。

视频中连续两帧中的所有边框，该边框是目标检测模块检测出来，然后画个框包围起来的目标，第一帧所有边框的集合称为U，第二帧所有边框的集合称为V。U自己内部的边框只想和V联系，自己内部不需要什么联系，V也一样，内部不联系只想和U联系。相邻两帧的边框需要相互联系的，问题就出来了，问题是如何将相邻两帧的边框，U和V中尽量完美地两两匹配起来，而求解这个问题的最优解就要用到匈牙利算法

匈牙利算法是美国数学家哈罗德·库恩于1955年提出的算法，因为算法很大一部分是基于匈牙利数学家Dénes Kőnig和Jenő Egerváry的工作之上创建起来的，所以叫匈牙利算法。

之后詹姆士·芒克勒斯在1957年优化该算法，算法又有了新的名字叫 KM算法或者 Kuhn-Munkres Algorithm 或者 Munkres assignment algorithm。

匈牙利那两个数学家的名字只能抄维基，打字都不好打。

两边的边框的集合 U和V 专业术语叫做二分图（ bipartite graphs）

这种算法最后的运算都是采用矩阵运算的。

式子一

d ( 1 ) ( i , j ) = ( d j − y i ) T S i − 1 ( d j − y i ) d^{(1)}(i, j)=\left(\boldsymbol{d}_{j}-\boldsymbol{y}_{i}\right)^{\mathrm{T}} \boldsymbol{S}_{i}^{-1}\left(\boldsymbol{d}_{j}-\boldsymbol{y}_{i}\right) d(1)(i,j)=(dj−yi)TSi−1(dj−yi)

解释

问题：计算的马氏距离是谁和谁之间的距离？

计算detection中边框 d j ( u , v , r , h ) d_j (u, v, r, h) dj(u,v,r,h)和Track中的边框 y i y_i yi之间的马氏距离

计算 predicted Kalman states 和newly arrived measurements之间的马氏距离

for(size_t i = 0; i < track_indices.size(); i++) {Track& track = tracks[track_indices[i]];Eigen::Matrix<float, 1, -1> gating_distance = kf->gating_distance(track.mean, track.covariance, measurements, only_position);for (int j = 0; j < gating_distance.cols(); j++) {if (gating_distance(0, j) > gating_threshold) cost_matrix(i, j) = gated_cost;}}

式子二

b i , j ( 1 ) = 1 [ d ( 1 ) ( i , j ) ≤ t ( 1 ) ] b_{i, j}^{(1)}=\mathbb{1}\left[d^{(1)}(i, j) \leq t^{(1)}\right] bi,j(1)=1[d(1)(i,j)≤t(1)]

解释

t ( 1 ) = 9.4877 t^{(1)}=9.4877 t(1)=9.4877

double chi2inv95[10] = {0,3.8415,5.9915,7.8147,9.4877,11.070,12.592,14.067,15.507,16.919 };int gating_dim = (only_position == true?2:4);double gating_threshold = KalmanFilter::chi2inv95[gating_dim];bool only_position=false){int gating_dim = only_position ? 2 : 4;float gating_threshold = chi2inv95[gating_dim];

也就是选择的是9.4877

式子三

d ( 2 ) ( i , j ) = min ⁡ { 1 − r j T r k ( i ) ∣ r k ( i ) ∈ R i } d^{(2)}(i, j)=\min \left\{1-\boldsymbol{r}_{j}^{\mathrm{T}} \boldsymbol{r}_{k}^{(i)} | \boldsymbol{r}_{k}^{(i)} \in \mathcal{R}_{i}\right\} d(2)(i,j)=min{1−rjTrk(i)∣rk(i)∈Ri}

第 i i i个 track 和第 j j j个detection 在 appearance space中的最小余弦距离（余弦相似度）

外观描述器（appearance descriptor） r j \boldsymbol{r}_{j} rj， ∥ r j ∥ = 1 \left\|\boldsymbol{r}_{j}\right\|=1 ∥rj∥=1

R k = { r k ( i ) } k = 1 L k \mathcal{R}_{k}=\left\{\boldsymbol{r}_{k}^{(i)}\right\}_{k=1}^{L_{k}} Rk={rk(i)}k=1Lk

last L k = 100 L_{k}=100 Lk=100

nn_budget=100tracker::tracker(/*NearNeighborDisMetric *metric,*/float max_cosine_distance, int nn_budget,float max_iou_distance, int max_age, int n_init){this->metric = new NearNeighborDisMetric(NearNeighborDisMetric::METRIC_TYPE::cosine,max_cosine_distance, nn_budget);this->max_iou_distance = max_iou_distance;this->max_age = max_age;this->n_init = n_init;this->kf = new KalmanFilter();this->tracks.clear();this->_next_idx = 1;}

欧式距离或者余弦相似性

NearNeighborDisMetric::NearNeighborDisMetric(NearNeighborDisMetric::METRIC_TYPE metric,float matching_threshold, int budget){if(metric == euclidean) {_metric = &NearNeighborDisMetric::_nneuclidean_distance;} else if (metric == cosine) {_metric = &NearNeighborDisMetric::_nncosine_distance;} else {errMsg::getInstance()->out("nn_matching.cpp","NearestNeighborDistanceMetric::NearestNeighborDistanceMetric","Invalid metric; must be either 'euclidean' or 'cosine'", true);}this->mating_threshold = matching_threshold;this->budget = budget;this->samples.clear();}

式子四

b i , j ( 2 ) = 1 [ d ( 2 ) ( i , j ) ≤ t ( 2 ) ] b_{i, j}^{(2)}=\mathbb{1}\left[d^{(2)}(i, j) \leq t^{(2)}\right] bi,j(2)=1[d(2)(i,j)≤t(2)]

式子五

c i , j = λ d ( 1 ) ( i , j ) + ( 1 − λ ) d ( 2 ) ( i , j ) c_{i, j}=\lambda d^{(1)}(i, j)+(1-\lambda) d^{(2)}(i, j) ci,j=λd(1)(i,j)+(1−λ)d(2)(i,j)

为了构建关联问题，我们使用加权和来组合两个度量

λ \lambda λ是超参数（ hyperparameter ）

式子六

b i , j = ∏ m = 1 2 b i , j ( m ) b_{i, j}=\prod_{m=1}^{2} b_{i, j}^{(m)} bi,j=m=1∏2bi,j(m)

式子五和式子六会在下面用到

级联匹配(Matching Cascade)

名词

proposed adj. 被提议的；所推荐的

account for 对…做出解释；说明……的原因

sudden appearance changes 快速的appearance变化

static scene geometry 静态场景几何（看起来好生硬）

erroneous 形容词错误的

indices 分为Track Indices和Detection indices

matrix 分为gate matrix和 cost matrix

match结果分为

matches

unmatched_tracks

unmatched_detections

伪代码

Input:Trackindices T = { 1 , … , N } , Detectionindices D = { 1 , … , M } , Maximumage A max ⁡ \begin{array}{l}{\text { Input: Track indices } \mathcal{T}=\{1, \ldots, N\}, \text { Detection indices } \mathcal{D}=} {\quad\{1, \ldots, M\}, \text { Maximum age } A_{\max }}\end{array} Input:TrackindicesT={1,…,N},DetectionindicesD={1,…,M},MaximumageAmax

类似这样

std::vector track_indices,

std::vector detection_indices

1:Computecostmatrix C = [ c i , j ] usingEq. 5 2:Computegatematrix B = [ b i , j ] usingEq. 6 \begin{array}{l}{\text { 1: Compute cost matrix } C=\left[c_{i, j}\right] \text { using Eq. } 5} \\ {\text { 2: Compute gate matrix } B=\left[b_{i, j}\right] \text { using Eq. } 6}\end{array} 1:ComputecostmatrixC=[ci,j]usingEq.52:ComputegatematrixB=[bi,j]usingEq.6

上面的等式5和等式6都用上了

3:Initializesetofmatches M ← ∅ 4:Initializesetofunmatcheddetections U ← D \begin{array}{l}{\text { 3: Initialize set of matches } \mathcal{M} \leftarrow \emptyset} \\ {\text { 4: Initialize set of unmatched detections } \mathcal{U} \leftarrow \mathcal{D}}\end{array} 3:InitializesetofmatchesM←∅4:InitializesetofunmatcheddetectionsU←D

mathches的初始化和unmatched detection的初始化（看上面match分类）

5:for n ∈ { 1 , … , A max ⁡ } do 6 : Selecttracksbyage T n ← { i ∈ T ∣ a i = n } 7 : [ x i , j ] ← min ⁡ − cost ⁡ mat ⁡ ching ⁡ ( C , T n , U ) 8 : M ← M ∪ { ( i , j ) ∣ b i , j ⋅ x i , j > 0 } 9 : U ← U \ { j ∣ ∑ i b i , j ⋅ x i , j > 0 } a : e n d f o r b : r e t u r n M , U \begin{array}{l}{\text { 5: for } n \in\left\{1, \ldots, A_{\max }\right\} \text { do }} \\ {6 : \quad \text { Select tracks by age } \mathcal{T}_{n} \leftarrow\left\{i \in \mathcal{T} | a_{i}=n\right\}}\\ {7 : \quad} {\left[x_{i, j}\right] \leftarrow \min _{-} \operatorname{cost} \operatorname{mat} \operatorname{ching}\left(C, \mathcal{T}_{n}, \mathcal{U}\right)} \\ {8 : \quad } {\mathcal{M} \leftarrow \mathcal{M} \cup\left\{(i, j) | b_{i, j} \cdot x_{i, j}>0\right\}} \\ {9 : \quad} {\mathcal{U} \leftarrow \mathcal{U} \backslash\left\{j | \sum_{i} b_{i, j} \cdot x_{i, j}>0\right\}} \\ {a : { end for }} \\ {b : { return } \mathcal{M}, \mathcal{U}}\end{array} 5:forn∈{1,…,Amax}do6:SelecttracksbyageTn←{i∈T∣ai=n}7:[xi,j]←min−costmatching(C,Tn,U)8:M←M∪{(i,j)∣bi,j⋅xi,j>0}9:U←U\{j∣∑ibi,j⋅xi,j>0}a:endforb:returnM,U

上面的 for 到 end for 是一段代码,因为我我想排版对齐，所以就把10改成了a,11改成了b，就当16进制用了。

最后的匹配阶段的处理

vector<int> iou_track_candidates;iou_track_candidates.assign(unconfirmed_tracks.begin(), unconfirmed_tracks.end());vector<int>::iterator it;for(it = matcha.unmatched_tracks.begin(); it != matcha.unmatched_tracks.end();) {int idx = *it;if(tracks[idx].time_since_update == 1) { //push into unconfirmediou_track_candidates.push_back(idx);it = matcha.unmatched_tracks.erase(it);continue;}++it;}TRACHER_MATCHD matchb = linear_assignment::getInstance()->min_cost_matching(this, &tracker::iou_cost,this->max_iou_distance,this->tracks,detections,iou_track_candidates,matcha.unmatched_detections);

深度外观描述器 Deep Appearance Descriptor

预训练的网络是一个在大规模行人重识别（Person Re-Identification）的数据集上训练得到的。

Person Re-Identification，简称Person Re-ID，又简称 ReID，其他翻译

行人重检测，行人再识别，行人重识别

训练集包括

1,261个行人（pedestrian）

1,100,000张图像

CNN的结构如下

三列分别是

Name Patch Size/Strider Output Size

Conv 1 3 × 3 / 1 32 × 128 × 64 Conv 2 3 × 3 / 1 32 × 128 × 64 MaxPool 3 3 × 3 / 2 32 × 64 × 32 \begin{array}{lll}{\text { Conv } 1} & {3 \times 3 / 1} & {32 \times 128 \times 64} \\ {\text { Conv } 2} & {3 \times 3 / 1} & {32 \times 128 \times 64} \\ {\text { Max Pool } 3} & {3 \times 3 / 2} & {32 \times 64 \times 32}\end{array} Conv1Conv2MaxPool33×3/13×3/13×3/232×128×6432×128×6432×64×32

Residual 4 3 × 3 / 1 32 × 64 × 32 Residual 5 3 × 3 / 1 32 × 64 × 32 Residual 6 3 × 3 / 2 64 × 32 × 16 Residual 7 3 × 3 / 1 64 × 32 × 16 Residual 8 3 × 3 / 2 128 × 16 × 8 Residual 9 3 × 3 / 1 128 × 16 × 8 \begin{array}{lll}{\text { Residual } 4} & {3 \times 3 / 1} & {32 \times 64 \times 32} \\ {\text { Residual } 5} & {3 \times 3 / 1} & {32 \times 64 \times 32} \\ {\text { Residual } 6} & {3 \times 3 / 2} & {64 \times 32 \times 16} \\ {\text { Residual } 7} & {3 \times 3 / 1} & {64 \times 32 \times 16} \\ {\text { Residual } 8} & {3 \times 3 / 2} & {128 \times 16 \times 8} \\ {\text { Residual } 9} & {3 \times 3 / 1} & {128 \times 16 \times 8}\end{array} Residual4Residual5Residual6Residual7Residual8Residual93×3/13×3/13×3/23×3/13×3/23×3/132×64×3232×64×3264×32×1664×32×16128×16×8128×16×8

Dense 10 128 Batchand ℓ 2 normalization 128 \begin{array}{ll}{\text { Dense } 10} & {128} \\ {\text { Batch and } \ell_{2} \text { normalization }} & {128}\end{array} Dense10Batchandℓ2normalization128128

网络共有2,800,864个参数

Nvidia GeForce GTX 1050 mobile GPU下框出32个bounding boxes大约花费30ms

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。