## Machine Learning Methods For Solving Assignment Problems In Multiple Traget Tracking

Data association and track-to-track association, two fundamental problems in single-sensor and multi-sensor multi-target tracking, are instances of an NP-hard combination optimization problem known as the multidimensional assignment problem (MDAP).

Multi-target tracking with one or more sensors plays a significant role in many surveillance and robotics applications, A tracking algorithm provides higher-level systems with the ability to make real-time decisions based on an accurate picture of the surrounding environment. Within ITS, it can be used for pedestrian detection at intersections, self-driving cars, and for traffic surveillance. Multi-target tracking also has a myriad of other application ranging from general security systems to tracking cells in microscopy images. There are many different sensor modalities that can be used for these applications; the most common are video, radar, and LiDAR. As a motivating example, consider a vision system that tracks all traffic participants at an urban intersection. The real-time tracking data can be used for adaptive traffic signal control and optimize the flow of traffic at that intersection. However, urban traffic intersections contain numberous challenges for multi-target tracking. Heavy traffic occupying multiple lanes and unpredictable pedestrian motion makes for a cluttered scene with lots of occulusion, false alarms and missed detections. Variability in the appearance of the targets caused by poor lighting and weather conditions is especially problematic for visual tracking. On the other hand, new technologies such as vehicle-to-infrastructure communication enables vehicles to transmit information directly to traffic intersections, augmenting the data collected by traffic cameras and other sensors. However, tall buildings, trees, and other vehicles can increase GPS signal interference, a effects of multipath is still an ongoing area of research.

Prior to the proliferation of vision-based tracking, tracking methods primarily relies on kinematic data. A sensible intuition is that combining knematic information with the learned representations of high-dimensional sensor data will improve tracking performance. The aim of this survey is to review the algorithms used in data-driven multi-target tracking and discuss rectnltly proposed extensions. We believe that considering tracking from the perspective of an assignment problem is a good way to abstract away a lot of application-specific details and unify the many difference approaches.

The goal of data assignment is to identify a correspondence between a collection of new sensor measurements and preexisting tracks. New measurements can be generated by previously undetected targets, so care must be taken to not erroneously asign one of these measurements to a preexisting track. Likewise, the measurements that stem from clutter within the surveillance region must be identified to avioid false alarms. When there are multiple sensors, there is also the additional problem of track-to-track association. This problem seeks to find a correspondence between tracks of the same target that were generated by different sensors. In the track-to-track case, the problem becomes even harder as the sensors could produce vastly dfferent types of data. Note taht in this work, we use detections and measurements interchangeably; similarly, we equivocate targets with the term objects. We will attempt to be as consisitent as possible with our usage while also adhering to the norms of the different tracking communities when appropriate. For example, in vision-based tracking, the term detections is typically used instead of measurements.

Broadly speaking, algorithms for solving these two association tasks can be classified as either single-scan or multi-scan. A single-scan algorithm only uses track information from the most recent time step. whereas multi-scane algorithms use track information from multiple previous or future are closely spaced and there are a lot of false alarms and missed detections. Generally, multi-scan methods are preferable in situations where the objects of interest are closely spaced and there are a lot of false alarms and missed detections. However, delaying the association to leverage future information negatively affects the real-time capabilities of the tracker. The accuracy and precision of the tracks produced by multi-scan methods are usually superior and they offer fewer track ID switches, track breaks, and missed targets. Naturally, multi-scane methods are more computationally expensive and difficult to implement than their single-scan counterparts.

A common way to formualte these association tasks is as an assignment problem. The simples version is the 2D assignment problem, also known as bipartite matching or linear assignment (LAP), which seeks to match $m$ workers, e.g., tracks, to $n$ jobs, e.g., sensor measurements. This combinatorial optimization problem constrains the space of solution so that each trck is assigned to exactly one measurement, but measurements are allowed to not be assigned (i.e. false alarms) or to be assigned to a “dummy track” (i.e., a missed detection). The multidimensional extension to the assignment problem for track-to-track association stipulates that each track from each sensor be assigned exactly once. For multidimensional data association, constraints ensure that each sensor measurement at each time step is assigned to a track exactly once. Unfortuantely, the MDAP is NP-hard for dimensions $\geq 3$, whereas there exists many polynomial-time algorithms for the LAP. We will formuate these problems more rigorously. The algorithm presented in this sruvey are for solving the various MDAPs encountered in multi-target tracking, and are generally applicable (with modification) to both data association and track-totrack association.

It has been suggested that non-kinematic data obtained from sensors can be incorporated into association algorithms to improve performance. For example, a classifier can be used to prevent two sensor tracks with different target class labels from being associated, which reduces the number of potential assignments. Appearance information has been used etensively in-depth survey. We will be discussing data-driven approaches for discovering features to augment association algorithms. Additionally, we will survey optimization algorithms for finding the solution to a MDAP.

There are several related surveys to ours, and we wish to highlight the relationship between the contributions of these surveys and those of our own.

#### Problem formulation

We will first formally introduce the linear assignment problem (LAP) for single-sensor data association and track-to-track association with two sensors. Following this, we will examine the various MDAP formulations.

##### 2.1 Linear Assignment

Consider a scenario where there are $m$ existing tracks and $n$ new sensor measurements at time $k, k=1,cdots, T$. We assume that there is a matrix $C_k \in \mathbb{R}^{m\times n}$, withine entries $c_k^{ij} \in C$ representing the cost of assigning measurement $j$ to track $i$ at time $k$. The goal is to find the optimal assignment of measurements to tracks so that the total assignment coast is minimized. Using binary decision variable $x^{ij} \in \{0, 1\}$ to represent an assignment of a measurement to a track, we end up with a 0-1 interger program.

with constraints

where $x \in X$ is a binary assignment matrix. There are $mn$ constraints forcing the rows and columns of $X$ to sum to 1. Note that $C_k$ is not required to be a square matrix. To capture the fact that some sensor measurements will either be false alarms or missed detections, a dummy track is added to the set of existing tracks, so that $C_k$ is now an $(m+1)\times n$ matrix. The entries in the $(m+1)^{th}$ row represents the costs of classifying measurements as false alarms. Missed detections are usually handled by forming validation gates around around the $m$ tracks. These gates can be used to determine, with some degree of confidence, whether any of the new measurements might have originated from a track. The canonical approach is use elliptical gates, which are typically computed from the convariance estimates provided by a Kalman Filter. In video-based tracking, a similar tactic is to suppress object detections with low confidence values.

Even though there are $mn!$ possible assignments, many polynomial-time algorithms exist for finding the globally optimal assignment matrix. Most famous is the $O(n^3)$ Hungarian, or Kuhan Munkres algorithm. Another popular method is the Auction algorithm, introduced by Bertsekas. These algorithms are fast and are easy to integrate into real-time multi-target tracking solutions. However, by only considering the previous times step when assigning measurements or tracks, we are making a Markovian assumption about the infomration needed to find the optimal assignment. In situations with lots of clutter, false alarms, missed detections and occlusion, the performance of these algorithms will significantly deteriorate. Indeed, it may be beneficial to instead use a sliding window of previous and ftuture track states to construct assignment cost that model the relationship between tracks and new sensor measurements more accurately. Instead of updating the assignment within the sliding window at each time step, and alternative approach is to simply delay making a decision within the sliding window. In the sequel, we describe how this affects the formulation of the assignment problem. The single-scan track track-to-track association problem with two sensor is also a LAP, where $m$ and $n$ represent the sets of tracks maintained by each sensor. Similar methods for handling false alarms and missed detections in data-association can be used for track-to-track association with uneven sensor track lists, i.e., $m\neq n$. If the assignment costs are known, an optimal track assignment can be found in polynomial-time using one of the previously mentioned algorithms.

##### 2.2 Multidimensional Assignment

Within the single-sensor and multi-sensor tracking paradigms, there are a few different ways to formulate data association and track-to-track association as a MDAP. Each formulation seeks to optimize slightly different criteria, but each solution technique is generally applicable to all of them with minor modifications.

We begin by considering the MDAP for multi-scan data association with one sensor. This scenario is the most commonly encountered, especially in video-based tracking. Let the number of scans or the temporal sliding window size, be given by $T$. Since the objective is to associate new sensor measurements with a set of existing tracks, the resulting MDAP has $T+1$-dimensions. When $T \geq 2$, the assignment problem is NP-hard.

 Let the set of nosiy measurement at time $k$ be referred to as scan $k$ and be represented by $Z_k=\{z_k^i\}$, where $i$ is the $i^{th}$ measurement of scan $k, i=1,\cdots, M_k$. $M_k$ is the number of measurements in each scan, i.e., $$Z_k =M_k$. The main assumption we are making is that each object is responsible for at most one measurement within each scan. We let$Z^T={Z_1, \cdots, Z_T}$represent all measurements in the sliding window of size$T$$.

Let $\mathbb{T}$ be the set of all possible partitions of the set $Z^T$. We seek an optimal prtitioning $\gamma^* \in \mathbb{T}$, also called a hypothesis, of $Z^T$ into tracks. Note that a track is just an ordered set of measuremtns $\{z_1^i, z_2^i, \cdots, z_T^i\}$; one measurement from each scan at each time step is attributed to each track. Hence, a partition $\gamma$ represents a valid collection of tracks that adhere to the MDAP constraints. Now, we define $\gamma^j$ to be the $j^{th}$ track in $\gamma$. Following this, we can define a cost for each track $\gamma^j$ in a partition as $c_{i_1, i_2, \cdots, i_T}$, where the indices $i_1, i_2, \cdots, i_T$ indicate which measurements from each scan belong to this pariticular track. This represents the cost of track j being assigned measurement $i$ from scan 1, measurement $i$ from scan 2, and so on. Curcially, the multidimensional constraints prevent measurements from being assigned to two differenet tracks and ensure that each measurement is matched to a track. If we use binary variables $\rho_{i_1, i_2, \cdots, i_T \in \{0, 1\}}$ to indicate if a track is present in a partition, then we can represent the MDAP objective as

with constraints

The solution $\rho$ to this MDAP is the multidimensional extension of the binary assigment matrix. Simply, one may consider $\rho$ as being a multidimensional array with binary entries, such that the sum along each dimension is 1. Similarly to the LAP, we can augment each scan by including $a z_k^0$ dummy measurement in the set of detections at time $k$ to address false alarms. This is useful for identifying track birth and track death as well, but care should be taken wehn defining the cost for assigning measurements as false alarms or missed detections to avoid high number of alse positives and false negatives.

It is common to solve for an approximate solution within a fixed-sized sliding window $T$, then shift the sliding window forward in time by $% $ so that the new sliding window overlaps with the old region. This allows for tracks to be linked over times, and it provides a compromise between “offline” tracking, when $T$ is set to the length of an entire sequence of measurements, and “online” tracking, when $T=1$.

The other form of the MDAP we are interested in is multi-sensor association with $S\geq 3$ sensors. This scenario is common in centralizd tracking systems, where sensors that are distributed around a surveillance region report raw measurements to a central node. When each sensor sends its local tracks to a central node for track association and fusion. a MDAP must be solved. In this case, the dimensionality of the MDAP is equal to $S$, and hence, is NP-hard. The main difference between this problem and the previous data association problem is that it deals solely with tracks, as opposed to new sensor measurements from all sensors. Multi-scan track-to-track association with two sensors is also a MDAP, as wel as multi-scan multi-sensor data association, but we omit these cases for brevity in our formulation and for the fact that they canbe defined quite similarly from what is presented next.

In this scenario there are $s\geq 3$ sensors, each maintaining a set of local tracks and using a sliding window of size $T \geq 1$. We define $X_k^s = \{x_k^s\}, s=1,\cdots, S$, to represent the set of track state estimates produced by sensor $s$ at time $k$. We have $i=1, \cdots, N_s$, where $N_s$ is the number of tracks being maintained by sensor $s$ and $x_k^{i,s}$ interpreted as the $i^{th}$ track of sensor $s$ at scan $k$. Then for each sebsirm we have $x^{T,s}=\{X_1^s, \cdots, X_T^s\}$, which represents the collection of track state estimates within the sliding window. We seek an optimal partitioning $\gamma^*\in \Gamma$ of $X^T = \{X^{T,1}, \cdots, X^{T,S} \}$ of tracks over all scans and sensors that minimizes the total assignment cost, and we can define a partial assignment hypothesis in a partition $\gamma$ as $\gamma^l = \{\{x_1^{j,1}, x_1^{j,2}, \cdots, x_1^{j,N_s}, \cdots, \{x_T^{j,1}, x_T^{j,2}, \cdots, x_T^{j, N_s}\}\}$. In other words, this states that the $j^{th}$ track of sensor 1 from scan 1, $j^{th}$ track of sensor 2 from scan 1, and so on.

#### Assignment costs

The assignment cost function has a massive impact on tracking performance. In this subsection, we will introduce various perspectives towards defining assignment costs, specifically highlighting probabilistic appraoches.

##### Kinematric Costs

In situations where sensor measurements only consist of noisy esitmates of kinematic data from targets (e.g., position and speed), a prbabilistic framework can be used to recover the unobservable state of the targets. The most common approach is to handle the uncertainty in the sensro measurements and target kinematics with a stochastic Bayesian recursive filter; The Kalman Filter-probably the most popular filter of this falvor-provides the means of updating a posterior distribution over the target state given by corresponding measurement likelihood, i.e., $P(x_k|z_k) \propto P(z_k | z_{k-1})$. We are using the same notation as befor, such taht $x_k$ repressents the target state at time $k$ and $z_k$ is the measurement at time $k$. One of the reasons for the popularity of the Kalman Filter is that by assuming that all distributions of interest are Gaussian, the posterior update can be computed in closed form. Now recall that a partial association hypothesis $gamma^j$ for the multi-scan single-sensor data association problem assigns $T$ measurements to a single track within the sliding window of length $T$. The canonical cost function for data association is to minimize the following negative log-likelihood ratio:

The partial hypothesis $\gamma^j$ represents the $j^{th}$ track of the hypothesis $\gamma$, and $\gamma^0$ represents a dummy track where all measurements attributed to it are considered false alarms. Assuming the sensor has a probability of 1 of detecting each target and a uniform prior over all assignment hypotheses, the likelihood that the $j^{th}$ track generated the assigned measurements is

Assuming independence of the measurements and trck states between time steps, we can decompose the likelihood that the measurements orginated from track $\gamma^j$ as

In the Kalman Filter and its extensions, the right-hand side has an attractive closed form representation as a Mahalanobis distance between the measurement prediction and the observed masurements, scaled in each dimension of the measurement space by the state and measurement convairance. This can easily be derived by tracking and plugging it into the negative log-likelihood ratio.

In track-to-track association, the conventional cost function associated with a partial hypothesis is the likelihood that the tracks from multiple sensors were all generated by the same “true” target. When $S=2$, the simplest appraoch is to consider the random variable $\Delta_{12}=x^1-x^2$, which is the difference between the track state estimate from sensor 1 and sensor 2. When track state estimates are Gaussian random variable, $\Delta_{12}$ is also Gaussian. The cost fuction becomes the likelihood that $\Delta_{12}$ has zero mean and convariance given by $\sum = \sum_1 + \sum_2 - \sum_{12} - \sum_{21}$. The first two terms of the covariance are the uncertainty around the track state estimates, and the second two terms are the corss convariances. A straightforward way to extend to the $S\geq 3$ case is to use star-shaped costs $\Delta_{1S}=\sum_{i=2}^S\Delta_{1i}$. For the Gaussian case, the cost can also be writeen in closed form as a Mahalanobis distance between the track state estimates:

##### Feature-augmented Costs

It is often the case in multiple target trcking that sensor generate high-dimensional observation of the surveillance region from which target information must be extracted. The most obvious example of this i the image data generated by a video surveillance system. Tis data, when featurized, can used to augment or replace the kinematic costs metioned in the previous subsection. The goal of doing this is to improve the association accuracy, and ultimately the overall tracking performance.

Due to the high-dimensonality of the raw measurements, almost all such methods attempt to learn a pairwise cost between measurements or tracks using feature extracted from the data. This pairwise cost can represent the association probability of the two objects, or simply some notation of siilarity, e.g., a distance. There are many ways of formulating the problem of learning assignment costs and using ti for solving data association or track-to-track association as a machine learning problme. For example, one technique is to use metric learning to transform the high dimensonal sensor measurements into a lower-dimensonal geometric space where a Euclidean distance can be used as the assignment cost function. Learning pairwise costs from data is heavily used in the multi-target tracking computer vision community, partially due to the ease at which features can be extracted from images.

There are multiple ways to incorporte learned pairwise cost into a MDAP solver. One common approach is as follows. The probability of association for a pair of measurements or tracks $\Lambda_i$ and $\Lambda_j$ can be written as a joint pdf; assuming independece of the kinematic ($K$) and non-kenematic ($NK$) components of tis probabilistic cost function, the resulting negative log-likelihood pariwise cost is:

Usually, $P_{NK}(\Lambda_i, \Lambda_j)$ is parameterized by weights $\theta$ and is a function of the features exracted from the sensor data and $\theta$. For example, this prbability could be represented as a neural network that outputs a similarity score between 0 and 1.

#### Optimization

In this section, we wil review recent work on a variety of optimization algorithms for solving MDAPs in real-time multi-target tracking systems. Our focus will be on approaches with a machine learning flavour, e.g., approximate inference techniques and deep neural networks, as well as the probabilistic modeling aspects of the problem. We will start by briefly convering non-probabilistic methods that are useful for constracting with whtat is currently popular. The techniques discussed in this section are quite general, and in most cases can be used for both the data association and track-to-track association problems with proper modification. It is important to notice that certain modeling assumptions, such as how the assignment cost function is defined, can case a tracker to make error regardless of how strong the optimization approach is.

Heuristically searching through the space of valid solutions within a time limit is an attractive way of ensuring both real-time performance and that a good local optima will be discovered. The most weill-known method, the greedy Randomized Adaptive Search Procedure (GRASP), was originally introduced for multi-sensor multi-target tracking. The idea behind GRASP is to randomly select each partial assignment from lists of greedily chosen candidates to form a solution $\gamma$. Then a local random search is conducted to attempt to improve this solution. This procedure is repeated until the alloted time runs out or a maximum number of iterations is reached, at which point the best solution that was dicovered is returned. GRASP algorithm also use gating techniques to help reduce the search space, and conduct the local searches by permuting a small number entries within some of the assgnments.

Other greedy search algorithms have been proposed, based on the semi-greedy track selection (SGTS) algorithm introduced. SGTS-based algorithms first perform the usual greedy assignment algorithm step of sorting potential tracks by track score. Then, they gernate a list of candidate hypotheses and return the local optimal result. This processis repeated iteratively in a manner so that candidate hypotheses are generated that best represent the solution space. The construction of SGTS and its extensions are such that they can provide a solution that is within a guarantted factor of the optimal solution.

The main strength of search algorithm appear to be their simi;icity and the extent to which they are embarrassingly parallel. Despite being quite genral, the advent of more sophisiticated techquies that can leverage problem-specific information and the necessary hardware necessary to run them in real-time has most likely contributed to the lack of continued research on GRASP algorithms in the accademic tracking community.

#### Lagrangian Relaxation

The multidimensional binary constraints pose a significant challenge; a standard technique is to relax the constraints so that a polynomial-time algorithm can be used to find an acceptable sub-optimal solution. The existence of $O(n^3)$ algorithms for LAP suggests that if the constraints can be relaxed, a reasonably good solution to the MDAP should be obtainable within an acceptable amount of time. Indeed, Lagrangian relaxation algorithms for association in multi-target tracing, involve iteratively producing increasingly better solutions to the MDAP by successively solving relaxed LAPs and reinforcing the constraints. As set of Lagrange multipliers for the N-dimensional case, $\mathbb{u} = [u_3, u_4, \cdots, u_N]$, are introduced to incorporate the relaxed set of constraints into the cost function. Since there are potentially multiple constraints that are not being enforced at each iteration, the obtained solution is an optimistic lower bound on the acutal optimal solution, referred to as the dual solution. When the constraints are reapplied, a valid solution is obtainted that is an upper bound on the optimal solution, referred to as the primal solution. The idea is then to update $u$ using subgradient methods and to repeat the procedure until the duality gap, the difference between the primal and dual solutions, is below a threshold. To formulate this algorithm for real-time applications, it can also be set to terminate after a maximum number of iterations.

A parallel implementation of this method for the $K$-best case was developed, which enable efficient implementation of MHT algorithm. A variation on this approach using dual decomposition is proposed where the original MDAP is separated into subproblems that contain copies of the orignal variable; a constraints is introduced via Lagrangian relaxation that requires copies of the same variable to share the same value. In experimetns evaluating the performance of the dual decomposition method on a generic tracking problem with six closely spaced targets, it performed comparably with the Lagrangian relaxation algorithm.

Lagrangian relaxation has also been used to convert Equation 3 into a global network flow problem. The motivation behind this approach is a desired to incorporate higher-order motion smoothness constraints, beyond what is capable when only considering pairwise costs in multi-scan problmes. The minimum-cost network flow problem that results from the relaxation can be solved in polynomial-time; updates to the Lagrange multipliers enforcing the constraints are handled by subgradient methods. In the next subsection, we go into more detail on network optimization, one of the leading approaches to solving multi-target tracking association problems.

#### Network Optimization

A popular aproach to solving MDAPs for data association (Equation 3) in the computer vision tracking community is to transform the problem into finding a minimum-cost network flow. In the corresponding network, detections at each discrete time step generally become the nodes of the graph, and a complete flow path represents a target track, or trajectory. The amount of flow sent from the source node to the sink node correspondings the number of targets being tracked, and the total cost of the flow on the network corresponds to the log-likelihood of the association hypothesis. The globally optimmal solution to a minimum-cost network flow problem can be found in polynomial-time, e.g., with the push-relabel algorithm.

Another benefit of using minimum-cost network flow is that the graph can be constructed to significantly reduce the potential number of association hypothesis by limiting transition edges between nodes with a spatio-temporal nearness criteria, similar to gating. Furthermore, occlusion can be explicitly modeled by adding nodes to the graph corresponding to the case where target is partialy or fully occluded by another target for some amount of time. A sliding window approach can be used for real-time performance, rather than using the complete history of previous detections. To help illuminate the mapping from Equation 3 to a network flow problem, we adapte the following equatios, rewritten using the notation from Section 2.

Recall that we defined a dta association hypothessi $gamma$ as a partitioning of the set of all available measurements $Z^T$. Then, a MAP formulation of the MDAP for data association is given by

 where the product over tracks in the objctive reflects an assumption of track motion independence, and the potentially prohibitive constraint guarantees that no two tracks ever intersect. It is possible to derive the measurement likelihood using Equation 9; it’s factored as $$P(Z^T \gamma)=\prod_z P({z\in Z^T} \gamma)$, where each term in this product is a Bernoulli distribution with parameter$\beta$encoding the probability of false alarm and missed detection. The track probabilities$P(\mathcal{T}m)$are modeled as Markov chains to capture track initialization, termination, and state transition probabilities. A network flow graph can now be defined as a graph with source$s$and sink$t$as follow. For every measurement$z_k^i \in Z^T$, create two nodes$u_r, v_r$, create an arc$(u_r, v_r)$with cost$c(u_r, v_r)$and flow$f(u_r, v_r)$, an arc$(s, u_r)$and flow$f(s, u_r)$, and an arc$(v_r, t)$with cost$c(v_r, t)$and flow$f(v_r, t)$. For every transition$P(z{k+1}^i z_k^i)\neq 0$, create an arc$(v_r, u_s) with cost $c(u_r, u_s)$ and flow $f(v_r, u_s)$. An example of such a graph is given in Figure 4. The flows $f$ are indicator functions defined by defined by $% $\mathcal{T}_m$starts from$u_r$$}
0 & otherwise \end{matrix}

f(v_r, t) = \left{ \begin{matrix} 1 & if \exists\mathcal{T}_m \in \mathcal{T} \text{, $\mathcal{T}_m$ ends at $v_r$}
0 & otherwise \end{matrix}

f(u_r, v_r) = \left{\begin{matrix} 1 & if \exists\mathcal{T}_m \in \mathcal{T} \text{, z_k^i \in \mathcal{T}_m}
0 & otherwise \end{matrix}

f(v_r, u_s) = \left{\begin{matrix} 1 & if \exits\mathcal{T}_m \in \mathcal{T} \text{, $z_{k+1}^i comes after$z_k^i \in \mathcal{T}_m$% $

and the costs are defined as

and can be derived by taking the logarithm of Equation 12; The minimum cost flow through the network corresponds to the assignment $\gamma^*$ with the maximum log-liklihood.

Quite a few variation on this model have been proposed in the literature.

#### Conditional Random Fields

Probabilistic graphical models provide us with a powerful set of tools for modeling spatio-temporal relationships amongst sensor meaurements in data association and amongst tracks in track-to-track association. Indeed, conditional random fields (cRFs), a class of Markov random fileds, have been used extensively for solving MDAPs in visual tracking. A CRF is an undirected graphical model, often used for structured prediction tasks, that can represent a conditional probability distribution between sets of random variables. CRFs are well-known for their ability to expoloit grid-like structure in the underlying graphical model. We define a CRF over a graph $G=(V,E)$ with nodes $x_{v\in V} \in X$ such that each node emits a label $y\in Y$. For simplicity of notation, we refer to nodes as $x$ and omit the subscript. The labels take on values from a discrete set, e.g., $\{0, 1\}$; in the context of multi-target tracking, a realization of labels $y$ usually corresponds to an assignment hypothesis. A key theorem concerning random field state that the probability distribution being modeled can be written in terms of the cliques $c$ of the graph. For example in chain-structured graphs, each pair of nodes and corresponding edge is a clique.

CRFs, like the network flow models discussed in the previous subsection, are essentially a tool for modeling probabilistic relationships between a collection of random variable, and hence still require a separte optimization process for handling training and inference (such as the graph cut algorithm or message passing algorithm). We will focus on presenting how the data association problem is mapped onto a CRF and direct the reder to other sources such as for details on how to do approximate inference for these models. One of the benefits of using graphical models is that we have the flexibility to construct our graph using either sensor measurements, tracklets, or full tracks. Tracklets are a common choice for CRFs since they give an attractive heirarchical quality to the tracking solution; low-level mearuement are first associated into tracklets via, e.g., the Hungarian algorithm, and then stitched together into full tracks via a CRF. By working at higher level of abstraction, the original MDAP constraints 4 and 6 are reformulated; all that is needed tat the higher level is to ensure that each tracklet is only associated to one and only one track. This can also help reduce processing time for running in real-time.

Each clique $c$ in the graph has a clique potential i.e. $\psi_c = exp(w_c^T\phi(x,y_c))$. Note that the implied normalization term in Equation 15 can be ommited when solving for the maximum-liklihood labeling $y$ for a particlar set of observations $x$.

Features $\phi$ must be provided ( or can be extracted from data with supervised or unsupervised learning) and weights $\omega$ are learned from data. The observations $x$ can be either sensor measurements (for data association) or sensor-level tracks ( for track-to-track association). The Markov property of CRFs can be interpreted in the context of multi-target tracking as assuming that the assignment of the observation to tracks within a particular spatio-temporal section of the surveillance region is independent of how they are assigned to tracks elsewhere-conditional on all observations. This adds an aspect of local optimality and, in a way, embeds similar assumptions as a gating heuristic. A solution to Equation 15, i.e. the maximum-likelihood set of labels $y$, can be used as a solution to the corresponding MDAP.

As is common with CRFs, the problem of solving for the most likely assignment hypothesis is cast as energy minimization. The objective to minimize is an energy function, computed by summing over the clique potentials; each potential is interpreted as contributing to the energy of the assignment hypothesis. Each lique consists of a set of vertices and edges, where each vertex is a pair of tracklets that could opotentially be linked together. The corresponding labels for each vertex take values from the set $\{0, 1\}$ and indicates whether a pair of tracklets are to be linked or not. The energy term for each clique is decomposed into the sum of a unary term for the vertices and a pair wise term for the edges. The weights $\omega$ are learned with the RankBoost algorithm. Other techniques for learning the parameters of a CRF that maximize the log-likelihood of the training data include iterative scaling algorithms and gradient based techniques. In Section 4, we will examine the proglem of learning weights for assignment costs in more detal. The features used to construct these terms include appearacne, motion, and occlusion information, among others. CRF and network optimization-based trackers are by nature global optimizers, and must be run with a temporal sliding-window to get near real-time performance. For example, extensions to the generic CRF formulation are presented that enable it to run in real-time.

A CRF formulation, Near Online Multi-Traget Tracking (NOMT), is proposed that also builds its graphs of track hypotheses using tracklets. The novelty of this work is in the use of an affinity measure beween detections called the Aggregated Local Flow Descriptor, and in the specfic form ofthe unary and pairwise terms in the energy function of the CRF. Inference in the CRF is sped up by first analyzing the structure of the graphical model so that independent subgraphs can be solved in parallel.

Other variations on the approaches above have seen as well. The energy term of a CRF is augmented with a continuous component to jointly solve the discrete data association and continuous trajectory estimation problems. A factor graph is embeded in the CRF to add more sructure and help model pairwise association explicityly. In the next subsection, we will investigate thow factor graphs, the belief progagation inference algorhtm, and itsvariants can be used to solve the MDAP. To summarize, applying CRFs to a specific multi-target tracking prblem involves defining hwo the graphical model will be constructed from the sensor data, speccifying an objective function, selecting or learning features for the terms within the objective function. training the model to learn the weights, and then performing approximate inference to extract the predicted assignment hypothesis.

#### Belief Propagation

In this section, we highlight recent work that formulate the association problems as MAP inference and use belief propation or one of its variants to obtain a solution. The effectiveness of BP at finding the MAP assignment hypothesis for the single and multi-sensor data association problems. BP is a general message-passing algorithm that can carry out exact inference on tree-structured graphs and approximate inference on graphs with cycles, or “loopy” graphs. The types of graphs under considieration are once again Markov random fields, albeit more general ones than the ones discussed in the previous subsections. Indeed, BP can be used on graphs that model joint distributions $P(x)=P(x_1, x_2, \cdots, x_N)$ that can be factorized into a product of clique potentials. As before, the clique potentials are assumed to be factorizable into pairwise terms. Therefore, for cliques $c$, we have

It is common to use factor graphs to explicitly encode dependencies between variables. A factor graph decomposes a joint distribution into a product of serveral local functions $f_j(X_j)$, where each $X_j$ is some subset of $\{x_1, x_2, \cdots, x_N\}$. The graph is bipartite and has nodes $x$ (i.e. discrete random variables) and factors (i.e., dependencies) $f\in\mathbb{F}$, and edges between the nodes and factors. For example, the graph of $g(x_1, x_2, x_3) = f_A(x1)f_b(x_2,x_3)f_C(x_1,x_3)$ has factors $f_A, f_B$ and $f_C$ and nodes $x_1, x_2, x_3$. The joint distribution for a factor graph can be written similarly to Equation 16 as

where $n_f$ represents the set of nodes $x$ that are connected to the factor $f$.

 Parallel message-passing algorithms, such as BP, operate by having each node of the graph interatively send messsages to tis neighbors simultaneously. We define messanges from a nodes $x_s$ to its neighbor $x_t\in\mathcal{N}(s)$ as $u_{s\leftarrow t}(x_s). In a factor graph, the set of neighbor$\mathcal{N}(s)$for a node$x_s$are its corresponding factors. The max-product algorithm is useful for finding the MAP configuration$x^={x_s^ s\in V} which corresponds to the best assignment hypothesis $\gamma^*$. In this algorithm messages are computed recursively in gernal pairwise Markov random field by

### Deep Learning

Neural networks have a rich history of being used to solve combinatorial optimization problme. One of the most influential papers in this line of research, by Hopfield and Tank, describes how to use Hoefield nets to approximately solve instances of the traveling Salesman Problem (TSP). Despite the controversy associated with their reuslts, this work inspired many others to pursue these ideas. This has lead to the present day, where research on the use of deep neural networks to solve problmes like the TSP has tarted to pick up speed.

#### Deep Reinforcement Learning.

The assignment probles in multi-target tracking are, at their core, combinatorial problems. Naturally, the question of whether deep neural networks are useful for finding near-optimal solutions to the LAP or MDAP is of significatn interest. A preliminary ansower to this quesion, in recent work, suggests the affirmative; they used a recurrent neural network to solve a small MDAP in a simulated multi-target tracking scenario. Impressively, they were also able to get good performance on a quadratic assignemnt problems that involved finding point correspondences between pair of images. It was also suggested that using a problem-specfic objective function for training neural network in a supervised manner as opposed to using, e.g., a gression loss, is preferable. One of the key chanllenges that supervised learning approaches face in this domain is obtaining labeled ground-truth samples, since generating optimal solutions to NP-hard combinatorial optimization problems can be time-consuming or enven impossible. To address this, ()[8] and ()[29] use reinforcement learning to avoid the requirement of labeled data. The main difficulties here are deciding hwo to represent the data for efficient learning and enforcing the original constraints of the problem during training, e.g., Equation 4 and 6. Naively seraching in the space of assignment hypothesies forces a reinforcement learning aggent to select an action from an action space of size $n!$. Furhtermore, if the agent’s policy is parameterized by a deep neural network, as is the case in deep reinforcement learning, the ouput of the policy network (if searching directly in the space of valid soutions) is a permutation matrix; more formally, an extreme point of the Birkhoff polytope. This has been known to be quite difficult to do with neural networks. An alternative to this the approach, where a Deep Q-Network augmented with a graph-embedding layer is used to greedily construct valid solutions to graph combinatorial optimization problems. Principled approaches for doing inference over permutations have been proposed based on annealing a temperature-controled parameter to produce a discrete permuation matrix from a continuous doubly-stochastic matrix. However, this technique has yet to be extended to the reinforcement learning setting. We note that reinforcement learning has already been applied successfully to multi-target tracking, where a policy is learned to control track initialization, maintenance, and removal.

##### Deep learning on Graphs and Sets

Featurization of the assignment hypothesis graph seems useful for a deep learning-based approach. Graph-embedding techniques can potentiallly enable deep neural network to handle missed direction and false alarms by providing the means to model missing edges in the assignment hypothesis graph. However, learning useful inductive representation of graph-structured data is still an open problem in machine learning. Notably, the deep reinformcement learning algorithms makes use of a pwoerful graph embedding technique called struct2vec proposed in their earlir work. Also see for a general disscussion on applying deep learning to graphs, including the recently proposed Graph Convolutional Networks. In particular, it is observed that the transductive natrue of the grpah embedding learned by GCNs prohibits them from generalizing to graphs with different structure at test time, rendering this approach unusable for multi-target tracking. The review papers by Goyal et. al. and Hamilton et. al. are also useful for learning more about recent efforts at embedding graph into a feature sapce. When applying deep neural network to cimbinatirial optmization problem where the solution sapce consists of permuation or subsets of the input, vinyals et.al. proposed Pointer Networks, which leverage attention mechanisms and the powerful seq2seq architecture to greedily construct valid solutions. A deep learning architecture inspiredby the theory of Random Finite Sets is proposed to predict outputs that are structured as sets by simultaneously predicting the cardinality of the set. They include promising result from pedestrian detection benchmarks, showing result that are slightly worse than state-of-the-art.

#### End to End Multi-target tracking.

As is common with deep learning research, some have already gone further to ask wherher multi-target tracking can be solved in an entirely end-to-end fashion. In other words, given noisy measurements of the environment, the objective is for deep learning system to directly output the filtered tracks, combining the association problem with state estimation. An investigation revealed that a recurrent-convolutional neural network is able to learn to track multiple traget from raw inputs in a synthetic problem without access to labeled training data. Curcially, rather than maximizing the likelihood of the next state of the system at each time step, as would by natural for standard Bayesian recursive filtering, they modified the cost function to maximize the likelihood at some time $t+n$ in the future to force the network to leanr a model of the system dynamics. More recently, they extended this work for use with raw LiDAR data collected by an automous vehicle. In short, they showed that their system is able to predict an unoccluded verion of the occupancy grid derived from the sensor’s field-of-view. Recnelty, researcher proposed Recurrent Autogressive Network, an apparoch to online multi-target tracking that seeks to incorporate interneal and external memory componenets into a deep learning framework to help handle occlusion and appearance changes. Curcially, they are able to show that RAN indeed makes use of its external memory to maintain tracks while the target are occluded. The appreaance and motion feature in the external memory and the hidden state of the recurrent netowkr used ofr the internal memory are combined to produce association scores for data association with the Hungarian alogrithm.

Instead of pursuing the monolithic end-to-end appraoch, represents the state estimation and data association problems separtely in their deep learning architecture, arguing that doing so provides the means to separately train and debug these components. They design a Long Short-Term Memory (LSTM) cell specifically for solving the MDAP in data association. Despite not using any visual feature, their appraoch achieves reasonalbe performace relative to tother similar system on the MoT Challenge 2015 dataset.

Reasearch on applying deep learning to the LAP and MDAP in multi-target tracking is still in its infancy; based on the flurry of recent work on this problem, it is likely that we will see significant prograess on this in the near future. However, using data-driven solution bring up the question aobut whether such a system could generalize to any environemnt it may be deployed in. A interesting research direction for addressing this is zero-shot leaning. Next, we will trackle the other major machine leanring task in multi-target tracking - learning the asisgnment cost.