The post Building a Content-Based Search Engine IV: Earth Mover’s Distance appeared first on deep ideas.
]]>We have seen how we can represent multimedia objects efficiently and expressively by summarizing a set of feature vectors into a data structure called a feature signature. Given two multimedia objects represented as feature signatures, we can measure the dissimilarity of the objects using a distance measure on their feature signatures. Numerous distance measures for feature signatures have been proposed (see [BS13] for an overview). The distance measure that has turned out to be the most effective is called the Earth Mover’s Distance.
Proposed in [RTG00] for the domain of content-based image retrieval, the Earth Mover’s Distance (short: EMD) is a distance measure on feature signatures that can be thought of as the minimum required cost for transforming one feature signature into the other one. This cost is formulated by means of a transportation problem: We determine the optimal way to move the weights from the representatives of the first signature ($X$) to the representatives of the second signature ($Y$). The cost for moving a certain amount of weight is given by the amount of weight multiplied by the distance over which it is transported.
The following image depicts an example. On the top, we see the feature signatures of two videos. Let’s call the left-hand signature $X$ and the right-hand signature $Y$. On the bottom, we see an isolated representative of $X$ and the representatives of $Y$ to which it moves weight. As we can see, the representatives in $Y$ to which $X$’s representative moves weight are quite similar to $X$’s representative, resulting in a relatively small “movement cost” or “transformation cost” for this representative.
Let’s see how we can formulate the Earth Mover’s Distance as an optimization problem. Let $\mathbb{F}$ be the set of all possible features, $\delta : \mathbb{F} \times \mathbb{F} \rightarrow \mathbb{R}_{\geq 0}$ be a distance function on features (called ground distance, e.g. the Euclidean distance) and $X, Y \in \mathbb{S}$ be two feature signatures. We call $f : R_X \times R_Y \rightarrow \mathbb{R}$ a flow from signature $X$ to signature $Y$. For two representatives $g \in R_X, h \in R_Y$, it tells us how much weight is moved from $g$ to $h$. $f$ is called a feasible flow if it fulfills the following constraints:
Now let $F = \{f \; | \; f \; \text{is a feasible flow}\}$. There are infinitely many feasible flows. We are interested in the flow with the minimum cost, where the cost is defined as the sum over all pairs of representatives $g$, $h$ of the flow $f(g, h)$ multiplied by their ground distance $\delta(g, h)$. Intuitively, this means that we want to find a flow that tends to move weights from representatives in $X$ to nearby (i.e. similar) representatives in $Y$. The Earth Mover’s Distance is then defined as the cost of the minimum cost flow, i.e. the cost required to transform one signature into the other one.
$$EMD_\delta(X, Y) = min_{f \in F} \left \{ \frac{ \sum_{g \in R_X} \sum_{h \in R_Y} f(g, h) \cdot \delta(g, h) }{ min\{ \sum_{g \in R_X} X(g), \sum_{h \in R_Y} Y(h) \} } \right \}$$
The denominator acts as a normalization term.
The definition of the EMD corresponds to a linear program, i.e. an optimization problem with a linear objective function and linear constraints. It can be solved, for instance, using the Simplex algorithm (cf. [Van01]), which has an exponential worst-time complexity. In practice, we would just use a library that calculates the Earth Mover’s Distance for us directly. According to [SJ08], the empirical time complexity for calculating the Earth Mover’s Distance between two signatures $X$ and $Y$ using the simplex algorithm lies between $\mathcal{O}(n^3)$ and $\mathcal{O}(n^4)$ where $n = |R_X| + |R_Y|$. An approximation to the Earth Mover’s Distance can, however, be computed in linear time [SJ08]. The Earth Mover’s Distance is a metric, provided that the ground distance is a metric and the compared signatures are normalized to the same total weight.
In the next section, we present another distance measure on feature signatures that is only slightly less effective, but can be computed in quadratic time. You can either subscribe to deep ideas by Email, subscribe to the Facebook page or follow on Twitter to stay updated.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content based multimedia retrieval. PhD thesis, Aachen, 2013. Zsfassung in dt. und engl. Sprache; Aachen, Techn. Hochsch., Diss., 2013.
[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
[Van01] Robert J Vanderbei. Linear programming. Foundations and extensions, International Series in Operations Research & Management Science, 37, 2001.
[SJ08] Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s distance in linear time. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
The post Building a Content-Based Search Engine IV: Earth Mover’s Distance appeared first on deep ideas.
]]>The post Building a Content-Based Search Engine III: Feature Signatures appeared first on deep ideas.
]]>When computing the distance between two multimedia objects, it would be highly inefficient to take into account all of the extracted feature vectors. For most practical purposes, however, it is not necessary to do this in order to achieve a good discriminability of the objects. Most of the vectors carry redundant information or fine-grained details that do not have a significant influence on the overall similarity of two objects. Hence, we summarize all of the extracted features into a structure called a feature signature.
Intuitively, a feature signature is characterized by a relatively small set of feature vectors, called the representatives, along with a weight for each representative.
Formally, if we let $\mathbb{F}$ denote the set of all possible features, a feature signature $X$ is a function $X : \mathbb{F} \rightarrow \mathbb{R}$ such that $|\{f \in \mathbb{F} \; | \; X(f) = 0\}| < \infty$ (i.e. it assigns a weight to only a finite number of vectors and assigns 0 everywhere else). We refer to $R_X =|\{f \in \mathbb{F} \; | \; X(f) = 0\}|$ as the representatives of $X$. We use $\mathbb{S}$ to refer to the set of all feature signatures.
A common way to calculate a feature signature is to apply a clustering algorithm (e.g. k-means) to the extracted set of feature vectors. From the resulting clustering, we devise a feature signature by defining the cluster means as the representatives and assigning them a weight corresponding to the relative size of the cluster, i.e. the number of cluster elements divided by the total number of extracted features. Here is an example depicting this process for 2-dimensional feature vectors:
First, the vectors are clustered, yielding 3 clusters (red, green and blue). Then we compute the cluster centers (depicted as the large red, green and blue dots), which we define as the representatives of the feature signature S, and assign them a weight corresponding to the relative cluster size (i.e. the number of feature vectors in the cluster divided by the total number of feature vectors).
Let’s formalize this process. Let $C = C_1, …, C_m$ be a clustering of feature vectors. We define the clustering-induced normalized feature signature $X_C$ as $X_C : \mathbb{F} \rightarrow \mathbb{R}$ with:
$$
X_C(f)=
\begin{cases}
\frac{|C_i|}{\sum_{1 \leq j \leq m} |C_j|} & \text{if } f = \frac{1}{|C_i|} \sum_{g \in C_i} g\\
0 & \text{else}
\end{cases}
$$
Before applying the clustering algorithm, we multiply each dimension by a certain weight, which allows us to control the importance of that dimension for the clustering. When using k-means to calculate the clustering, we can specify the desired number of representatives k in advance. This allows us to control the expressiveness of the feature signature. The higher we choose k, the more expressive the feature signature gets, with the downside of increasing the storage size and the computational complexity of the distance computation. There is a monotonous relation between k and the effectiveness (i.e. adequacy of the results) as well as the query processing time. Hence, k allows us to control the tradeoff between effectiveness and efficiency.
The following image depicts 3D visualizations of two videos and their feature signatures with k = 100. Here, the clusters are represented as spheres. Their position in the 3D coordinate system corresponds to the position (x and y) and the time (t), the color of the sphere corresponds to the L*a*b* color dimensions of the representatives and the volume of the sphere corresponds to the weight that the feature signature assigns to the representative. As we can see, the feature signature summarizes the visual content of the video as it unfolds over time using just 100 vectors.
We have seen how feature signatures reduce the rather large amount of information inherent in the feature vectors into a compact representation that still reveals a lot of information about the feature distribution, since it summarizes how many feature vectors are located at which locations in the feature space. In the next section, we will see how we can compute the similarity between two feature signatures. Continue with the next section: Earth Mover’s Distance
The post Building a Content-Based Search Engine III: Feature Signatures appeared first on deep ideas.
]]>The post Building a Content-Based Search Engine II: Extracting Feature Vectors appeared first on deep ideas.
]]>This is part 2 in a series of tutorials in which we learn how to build a content-based search engine that retrieves multimedia objects based on their content rather than based on keywords, title or meta description.
In the previous section, we saw how similarity between multimedia objects can be formalized and which types of queries exist with respect to this formalization. In a step towards efficiently computing similarity between two multimedia objects, we are now going to see how we can characterize the contents of individual multimedia objects (in our example, a video) by extracting a set of so-called feature vectors, which are vectors from the Euclidean space that describe certain local characteristic properties.
Since we are interested in visual similarity between two videos, our goal is to extract a set of vectors, each of which describes certain local visual properties of the video numerically. This process can be depicted visually as follows:
We first select a certain amount of sample frames from the video (e.g. 10 frames per second). For each of these frames, we select a fixed amount of equidistant sample pixels. Finally, for each sample pixel, we compute an 8-dimensional Euclidean vector $(x, y, L, a, b, \chi, \eta, t)$ describing the visual appearance of the pixel and its context. The choice of this vector is just a suggestion and it isn’t necessary to include all of features presented here.
The first two dimensions of this vector correspond to the x and y coordinates of the pixel inside the frame. The next 3 dimensions correspond to the color of the pixel in the L*a*b* color space, i.e. the lightness, the position between red and green and the position between blue and yellow (cf. [Wik15b]). The reason we choose this color space instead of e.g. RGB is the fact that Euclidean distances in this space have a significantly higher correlation with perceptual dissimilarity than other color spaces, making it more suitable for our task of measuring visual similarity. Additionally, we calculate the contrast $\chi$ of a 12 x 12 neighborhood of the pixel as proposed by Tamura et al. in [TMY78], which is a measure of the dynamic range of the colors. Furthermore, we calculate the coarseness $\eta$ of the pixel, as proposed in [TMY78], which is a measure of how big the structures surrounding that pixel are. Finally, we add the time t of the frame from which the pixel was sampled as another dimension (in seconds from the beginning of the video).
The whole set of extracted feature vectors, then, comprises a summary of how the visual contents of the video unfold over time.
The entries of the vectors all measure different aspects and stem from different ranges. Since we want all dimensions to have equal importance in the distance computations, irrespective of their value range, we normalize all 8 dimensions individually, yielding a vector whose entries lie between 0 and 1: The positions x and y are divided by the image width and height, respectively. The L* color coordinate ranges from 0 to 100 and is hence divided by 100. The a* and b* color coordinates range from -128 to 127. Therefore, we add 128 and divide by 255. The contrast $\chi$ ranges from 0 to 128 and is therefore divided by 128. The coarseness $\eta$ ranges from 0 to 5 and is hence divided by 5. Finally, the time is divided by the video duration.
The first 7 dimensions that describe a pixel in the context of its frame have yielded high effectiveness for the task of retrieving visually similar images (cf. [BUS10a]) and were hence adopted. Since a video can be thought of as a generalization of an image along another dimension (the time dimension), the image retrieval approach was extended simply by adding the time as another dimension to the feature vectors. The rationale for this is that there is no conceptual difference between the spatial dimensions and the time dimension. A video can be imagined to be an image changing over time. The fact that a video is usually represented as a sequence of frames is just a way to store a video digitally, and it has lead many of the video retrieval approaches to base their video representations on frame sequences, even though semantically a video can be treated reasonably as an image changing continuously over time rather than as a sequence of images.
We now know how we can express local visual properties of a video by means of a set of feature vectors. In the next section, we will see how we can summarize these vectors into a more compact representation scheme that allows us to store the contents of the video using less space, and to compute the visual similarity between two videos more efficiently. Continue with the next section: Feature Signatures
[Wik15b] Wikipedia. Lab color space http://en.wikipedia.org/wiki/Lab_color_space, 2015.
[TMY78] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460–473, 1978.
[BUS10a] Christian Beecks, Merih Seran Uysal, and Thomas Seidl. A comparative study of similarity measures for content-based multimedia retrieval. In Multimedia and Expo (ICME), 2010 IEEE International Conference on, pages 1552–1557. IEEE, 2010.
The post Building a Content-Based Search Engine II: Extracting Feature Vectors appeared first on deep ideas.
]]>The post Building a Content-Based Search Engine I: Quantifying Similarity appeared first on deep ideas.
]]>The explosion of user-generated content on the internet during the last decades has left the world of querying multimedia data with unprecedented challenges. There is a demand for this data to be processed and indexed in order to make it available for different types of queries, whilst ensuring acceptable response times.
An arguably important task is the retrieval of multimedia objects (e.g. images or videos) that are visually similar to a certain query object (e.g. a query image or a query video). We define two multimedia objects to be visually similar if they depict contents that “look similar” to humans. So far, this task has gained comparatively little research recognition.
Most of the major search engines or content suggestion engines only allow for text-based queries and the search only considers metadata such as title, description text or user-specified tags. The content of the multimedia objects is not taken into account. Such systems are very limited with respect to the types of queries that are possible, and with respect to the actual relevance of the retrieved results.
In this series of tutorials, I introduce a method for retrieving visually similar multimedia objects to a specified query object. This method is based on so-called feature signatures, which comprise an expressive summary of the content of a multimedia object that is significantly more compact than the object itself, allowing for an efficient comparison between objects. This method is applicable to virtually all kinds of multimedia objects. In the course of this tutorial, we’ll take content-based video similarity search as our main example. However, it should be clear how to adapt this method to your particular needs.
This section introduces some fundamental preliminaries for the problem of retrieving similar multimedia objects to a given query object. We will review how similarity between multimedia objects can be formalized and which types of queries exist with respect to this formalization.
In order to retrieve similar multimedia objects to a given query object, we need a way to compare the query object to the database objects and quantify the similarity or dissimilarity numerically. There are many ways in which similarity or dissimilarity between two objects can be measured, and it is highly dependent on the nature of the compared objects and on the aspects which we want to compare. For example, videos could be compared with respect to their visual content, their auditory content or meta-data such as the title or a description text.
The most common way to model similarity is by means of a distance function. A distance function assigns high values to objects that are dissimilar and small values to objects that are similar, reaching 0 when the two compared objects are the same. Mathematically, a distance function is defined as follows:
Let $X$ be a set. A function $\delta : X \times X \rightarrow R$ is called a distance function if it holds for all $x, y \in X$:
When it comes to efficient query processing, as we will see later, it is useful if the utilized distance function is a metric.
Let $\delta : X \times X \rightarrow R$ be a distance function. $\delta$ is called a metric if it holds for all $x, y, z \in X$:
Once we have modeled the similarity for pairs of multimedia objects by means of a distance function, we can reformulate the problem of retrieving similar objects to the query object by utilizing such a function. A prominent query type is the so-called range query, which retrieves all database objects for which the distance to the query object lies below a certain threshold. The formal definition is given below (adopted from [BS13]).
Let $X$ be a set of objects, $\delta : X \times X \rightarrow R$ be a distance function, $DB \subseteq X$ be a database of objects, $q \in X$ be a query object and $\epsilon \in \mathbb{R}$ be a search radius. The range query $range(q, \delta, X)$ is defined as
$range_\epsilon(q, \delta, X) = \{x \in X \; | \; \delta(q, x) \leq \epsilon\}$
For range queries, it is hard to determine a suitable threshold $\epsilon$ to yield a result set of a desired size. When $\epsilon$ is too low, the result set might be very small or even empty. On the other hand, when choosing it too large, the result set might come near to including the entire database. This problem can be solved by issuing a k-Nearest Neighbor Query (short: kNN query) instead. In this query type, we specify the desired number of retrieved objects $k$ instead of a distance threshold. If we assume that the distances between the query object and the database objects are pairwise distinct, the k-Nearest Neighbors are the $k$ objects that have the smallest distance to the query object. The formal definition is given below (adopted from [SK98]).
Let $X$ be a set of objects, $\delta : X \times X \rightarrow R$ be a distance function, $DB \subseteq X$ be a database of objects, $q \in X$ be a query object and $k \in \mathbb{N}, k \leq |DB|$. We define the k-Nearest Neighbors of $q$ w.r.t. $\delta$ as the smallest set $NN_q(k) \subseteq DB$ with $|NN_q(k)| \geq k$ such that the following holds:
$\forall o \in NN_q(k), \forall o^\prime \in DB − NN_q(k) : \delta(o, q) < \delta(o^\prime , q)$
Our goal now is to devise a distance function that reflects human judgement of similarity. In the next section, we will learn how to extract features from multimedia objects, which are sets of vectors that characterize the content of that object. Continue with the next section: Extracting Feature Vectors.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content based multimedia retrieval. PhD thesis, Aachen, 2013. Zsfassung in dt. und engl. Sprache; Aachen, Techn. Hochsch., Diss., 2013.
[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. In ACM SIGMOD Record, volume 27, pages 154–165. ACM, 1998.
The post Building a Content-Based Search Engine I: Quantifying Similarity appeared first on deep ideas.
]]>The post Deep Learning From Scratch VI: TensorFlow appeared first on deep ideas.
]]>It is now time to say goodbye to our own toy library and start to get professional by switching to the actual TensorFlow.
As we’ve learned already, TensorFlow conceptually works exactly the same as our implementation. So why not just stick to our own implementation? There are a couple of reasons:
TensorFlow is the product of years of effort in providing efficient implementations for all the algorithms relevant to our purposes. Fortunately, there are experts at Google whose everyday job is to optimize these implementations. We do not need to know all of these details. We only have to know what the algorithms do conceptually (which we do now) and how to call them.
TensorFlow allows us to train our neural networks on the GPU (graphical processing unit), resulting in an enormous speedup through massive parallelization.
Google is now building Tensor processing units, which are integrated circuits specifically built to run and train TensorFlow graphs, resulting in yet more enormous speedup.
TensorFlow comes pre-equipped with a lot of neural network architectures that would be cumbersome to build on our own.
TensorFlow comes with a high-level API called Keras that allows us to build neural network architectures way easier than by defining the computational graph by hand, as we did up until now. We will learn more about Keras in a later lesson.
So let’s get started. Installing TensorFlow is very easy.
pip install tensorflow
If we want GPU acceleration, we have to install the package tensorflow-gpu
:
pip install tensorflow-gpu
In our code, we import it as follows:
import tensorflow as tf
Since the syntax we are used to from the previous sections mimics the TensorFlow syntax, we already know how to use TensorFlow. We only have to make the following changes:
tf.
to the front of all our function calls and classessession.run(tf.global_variables_initializer())
after building the graphThe rest is exactly the same. Let’s recreate the multi-layer perceptron from the previous section using TensorFlow:
In the next lesson, we will learn about Keras, which is a high-level API on top of TensorFlow that allows us to define and train neural networks more abstractly – without having to specify the internal composition of all the operations everytime. You can either subscribe to deep ideas by Email or subscribe to my Facebook page to stay updated.
The post Deep Learning From Scratch VI: TensorFlow appeared first on deep ideas.
]]>The post Connectionist Models of Cognition appeared first on deep ideas.
]]>In this video, I give an introduction to the field of computational cognitive modeling (i.e. modeling minds through algorithms) in general, and connectionist modeling (i.e. using artificial neural networks for the modeling) in particular. We deal with the following topics:
The post Connectionist Models of Cognition appeared first on deep ideas.
]]>The post Robot Localization IV: The Particle Filter appeared first on deep ideas.
]]>The last filtering algorithm we are going to discuss is the Particle Filter. It is also an instance of the Bayes Filter and in some ways superior to both the Histogram filter and the Kalman Filter. For instance, it is capable of handling continuous state spaces like the Kalman Filter. Unlike the Kalman Filter, however, it is capable of approximately representing deliberate belief distributions, not only normal distributions. It is therefore suitable for non-linear dynamic systems as well.
The idea of the Particle Filter is to approximate the belief $bel(x_t)$ as a set of $n$ so-called particles $p_t^{[i]} \in dom(x_t)$: $\chi_t := \{ p_t^{[1]}, p_t^{[2]}, …, p_t^{[n]} \}$. Each of these particles is a concrete guess of the actual state vector. At each time step the particles are randomly sampled from the state space in such a way that $P(p_t^{[i]} \in \chi_t)$ is proportional to $P(x_t = p_t^{[i]} \, \vert \, e_{1:t})$.
This means that the probability of a particle being included in $\chi_t$ is proportional to the probability of it being the correct representation of the state, given the sensor measurements so far. This way, the update step can be thought of as a process similar to the evolutionary mechanism of natural selection: Strong theories, that are compatible with the new measurement, are likely to live on and reproduce, whereas poor theories are likely to die out. This results in the fact that the particles are likely to be centered around strong theories. We will see a visual example for this later.
We take the same approach as we did with all the previous Bayes Filters. First, we calculate a particle representation of $\overline{bel}(x_{t+1})$ from $\chi_t$, which we denote $\overline{\chi}_{t+1}$: For each particle $p_t^{[i]} \in \chi_t$, we sample a new particle $\overline{p}_{t+1}^{[i]}$ from the distribution $P(x_{t+1} \, \vert \, x_t = p_t^{[i]})$, which can be obtained from the transition model. We put all these new particle into the set $\overline{\chi}_{t+1}$.
As an example, let’s consider a moving robot in one dimension. The state contains only one variable, the location. From time $t$ to $t + 1$ the robot has moved an expected distance of 1 meter to the right with Gaussian movement noise. In this case we would just add 1 to the locations of all the particles plus a random number that is sampled from the transition model.
Now we calculate the particle representation of $bel(x_{t+1})$, namely $\chi_{t+1}$, from $\overline{\chi}_{t+1}$. The key idea here is to assign a so-called importance weight, denoted $\omega[i]$, to each of the particles in $\overline{\chi}_{t+1}$. This importance weight is a measure of how compatible the particle $\overline{p}_{t+1}^{[i]}$ is with the new measurement $e_{t+1}$. This probability can be obtained from the sensor model. $\chi_{t+1}$ is then constructed by randomly picking $n$ particles from $\overline{p}_{t+1}^{[i]}$ with a probability proportional to their weight. The same particle may be picked multiple times. This procedure is called resampling.
We elucidate the Particle Filter with a localization example that’s similar to the Kalman Filter example, i.e. we use the same transition and sensor models as well as the same position and measurement chains. Since the particles are drawn from the state space, they are simply real numbers. This time, we start with a uniform distribution over the interval $[0, 5]$. In this instance, we use 30 particles. For obvious reasons, a numerical representation of the particle sets at each time step will not be given, but a graphical representation can be seen in the following figure. Each of the black/gray lines represents one or more particles. Since multiple particles can fall on the same pixel, the opacities of the lines are proportional to the number of particles on that pixel. Again, the blue line represents the actual position and the red graph represents $P(x_t \, \vert \, e_t)$.
In this series of articles, we have introduced the Bayes Filter as a means to maintain a belief about the state of a system over time and periodically update it according to how the state evolves and which observations are made. We came across the problem that, for a continuous state space, the belief could generally not be represented in a computationally tractable way. We saw three solutions to this problem, all of which have their advantages and disadvantages.
The first solution, the Histogram Filter, solves the problem by slicing the state space into a finite amount of bins and representing the belief as a discrete probability distribution over these bins. This allows us to approximately represent arbitrary probability distributions.
The second solution, the Kalman Filter, assumes the transition and sensor mod- els to be linear Gaussians and the initial belief to be Gaussian, which makes it inapplicable for non-linear dynamic systems – at least in its original form. As we showed, this assumption results in the fact that the belief distribution is always a Gaussian and can thus be represented by a mean and a variance only, which is very memory efficient.
The last solution, the Particle Filter, solves the problem by representing the belief as a finite set of guesses at the state, which are approximately distributed according to the actual belief distribution and are therefore a good representation for it. Like the Histogram Filter, it is able to represent arbitrary belief distributions, with the difference that the state space is not binned and therefore the approximation is more accurate.
[NORVIG] Peter Norvig, Stuart Russel (2010) Artificial Intelligence – A Modern Approach. 3rd edition, Prentice Hall International
[THRUN] Sebastian Thrun, Wolfram Burgard, Dieter Fox (2005) Probabilistic Robotics
[NEGENBORN] Rudy Negenborn (2003) Robot Localization and Kalman Filters
[DEGROOT] Morris DeGroot, Mark Schervish (2012) Probability and Statistics. 4th edition, Addison-Wesley
[BESSIERE] Pierre Bessire, Christian Laugier, Roland Siegwart (2008) Probabilistic Reasoning and Decision Making in Sensory-Motor Systems
The post Robot Localization IV: The Particle Filter appeared first on deep ideas.
]]>The post Robot Localization III: The Kalman Filter appeared first on deep ideas.
]]>This post deals with another solution to the continuous state space problem, the Kalman Filter, invented by Thiele, Swerling and Kalman. It has successfully been used in many applications, like the mission to Mars or automatic missile guidance systems (cf. [NEGENBORN, abstract]). The classical application is radar tracking, but there is a vast amount of other applications (cf. [NORVIG, pp. 588, 589]).
In its essence, it is an implementation of the Bayes Filter in which the belief is a normal (Gaussian) distribution and can therefore be represented by its parameters: a mean vector and a covariance matrix. In this representation, the mean vector is the expected state and the covariance matrix is a measure of uncertainty.
In order for the Kalman filter to work, we need to make a few assumptions about the system we wish to describe (in addition to the Markov assumptions of the Bayes filter). If these assumptions hold, the belief $bel(x_t)$ will be normally distributed at each time step $t$ and can thus be represented by a mean vector $\mu_t$ and a covariance matrix $\Sigma_t$. It is also true that if either of the three assumptions is violated, then the belief will always be non-Gaussian for $t \geq 1$ (cf. [RISTIC, p. 4]). Thus, these assumptions are necessary and sufficient conditions for the Kalman Filter. In the next section, we will see how the Kalman Filter algorithm follows from these assumptions for one-dimensional state spaces. After that, we will take a look at the multi-dimensional algorithm. The assumptions are as follows:
The assertion that the belief is always normally distributed is very important, since it ensures the computational tractability of the belief update for deliberate time steps, because in the general case, i.e. for deliberate sensor and measurement distributions, a representation of the belief could, as we argued in chapter 2, grow unboundedly over time.
For simplicity, we’ll first assume that we are dealing with a one-dimensional state space (i.e. $x_t$ is just a real number, e.g. a position along a line). We will take a look at the multidimensional case later. The transition phase from time $t$ to $t + 1$ just adds some number $\delta_{t+1}$ to the state, plus some unpredictable Gaussian noise $\epsilon_{t+1}$ (as before, imagine a robot moving at a desired speed of $\delta$ per time step with some unpredictable random error):
$$x_{t+1} = x_t + \delta_{t+1} + \epsilon_{t+1}$$
Then our transition model is given by
$$P(x_{t+1} \, \vert \, x_t) = \mathcal{N}(x_t + \delta_{t+1}, \phi^2)$$
The variance $\phi^2$ acts as a measure of uncertainty, reflecting the transition noise $\epsilon$. In the robot example, assuming we are at position $x_t$ at time step $t$, the position at time step $t+1$ is a Gaussian cloud around an expected position of $x_t + \delta_{t+1}$ with a variance (uncertainty) of $\phi^2$.
Our sensor model is given by
$$P(e_{t+1} \, \vert \, x_{t+1}) = \mathcal{N}(x_{t+1}, \psi^2)$$
Again, the variance $\psi^2$ acts as a measure of uncertainty, this time for the measurement noise $\zeta$. In the robot example, assuming that we are at position $x_{t+1}$, the measurement that we get can be expected to be sampled from a Gaussian cloud around $x_{t+1}$ with a variance of $\psi^2$
Assuming that the belief at some time step $t$ is a normal distribution, i.e. $bel(x_t) = \mathcal{N}(\mu_t, \sigma_t^2)$, it can be shown that the projected belief $\overline{bel}(x_{t+1})$ is also a normal distribution with mean $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$ and variance $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$.
Considering the robot example, it should not surprise us that the expected position at time step $t+1$ is just the expected position at time step $t$ plus the expected distance $\delta_{t+1}$ that we wanted to move. Moreover, it seems reasonable that our new uncertainty in the belief, $\overline{\sigma}_{t+1}^2$, is given by the old uncertainty $\sigma_{t}^2$ plus the uncertainty that we get due to the transition $\phi^2$.
Now, assuming that $\overline{bel}(x_{t+1})$ is normally distributed, it can be shown that the updated belief $bel(x_{t+1})$ after receiving a measurement $e_{t+1}$ is a normal distribution as well, this time with mean $\overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} – \overline{\mu}_{t+1})$ and variance $\sigma_{t+1}^2 = (1 – k_{t+1}) \overline{\sigma}_{t+1}^2$ where $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$.
We can see that the new mean is a weighted average of the new measurement and the old mean, where the weights are the transition noise and the sensor noise, respectively. This makes intuitive sense: The importance of the new measurement increases with the uncertainty of the current belief, whilst the importance of the current belief increases with the uncertainty of the measurement.
A proof of these statements can be found in [NEGENBORN, pp. 34 – 37].
Now that all the preparatory work is done, we can formulate the actual Kalman Filter algorithm. It is basically a variant of the Bayes Filter with the property that the beliefs $bel(x_t)$ and $\overline{bel}(x_t)$ are now represented by their parameterizations $(\mu_t, \sigma_t^2)$ and $(\overline{\mu}_t, \overline{\sigma}_t^2)$, respectively. As with the Bayes Filter, the correctness follows by induction.
One-Dimensional Kalman Filter
- $\overline{\mu}_{t+1} = \mu_t + \delta_{t+1}$
- $\overline{\sigma}_{t+1}^2 = \sigma_{t}^2 + \phi^2$
- $k_{t+1} = \frac{\overline{\sigma}_{t+1}^2}{\overline{\sigma}_{t+1}^2 + \psi^2}$
- $\mu_{t+1} = \overline{\mu}_{t+1} + k_{t+1} \cdot (e_{t+1} – \overline{\mu}_{t+1})$
- $\sigma_{t+1}^2 = (1 – k_{t+1}) \overline{\sigma}_{t+1}^2$
- return $\mu_{t+1}, \sigma_{t+1}^2$
The variable $k$ is often called the Kalman gain (cf. [NORVIG, p. 588]) and functions as a measure of how important the new measurement is. If the uncertainty of the projected belief is low, then the Kalman gain will be low and thus the new measurement will not have a big impact on the belief. Additionally, if the uncertainty of the measurement is high, the Kalman gain will be low as well and if it is low, the Kalman gain will be high.
The Kalman gain is first incorporated in the expectation update. First, the deviation of the measurement from the expectation, $e_{t+1} – \mu_{t+1}$, is calculated, then it is weighted with the Kalman gain and finally it is added to the expectation. This has exactly the desired effect that the new measurement has an impact on the belief that is proportional to its importance. Dependent on how much new information has been incorporated, the uncertainty decreases, which is implemented in the variance update.
We will now shed some light on this algorithm by applying it to a one-dimensional robot localization problem up to time step 4. The state, i.e. the robot’s location, is simply a real number. The robot believes that it starts out at $x_0 = 0$ with some uncertainty, which is reflected by a prior belief of $\mathcal{N}(\mu_0 = 0, \sigma_0^2 = 1.0)$.
We assume that the robot moves at constant average speed $d_t = 1$ with a noise of $\phi^2 = 0.1$. The positions of the robot shall be $x_0 = 0, x_1 = 0.4543, x_2 = 1.3752, x_3 = 2.2080, x_4 = 3.4944$. I sampled these positions randomly using the specified transition model. Of course, they are not known to the algorithm and they shall only be used for a later comparison with the resulting beliefs (and to create the measurements). We can see that the transition noise really had an impact here. For example, from time step 0 to 1, the robot only moved 0.4543 units when the expected distance was 1 unit.
In our example, the robot is able to sense its position with a measurement noise of $\psi^2 = 1.0$. This is very big noise if we consider that it means that, in expectation, about 68.2% of the measurements are within a distance of 1 unit to the actual position (which is already a big interval) and 31.8% of the measurements might even be outside this interval. Let’s assume that we make the following measurements (which have been sampled from the sensor model using the actual positions specified above): $e_1 = 3.3558, e_2 = −0.0570, e_3 = 1.8155, e_4 = 3.7446$. We can see the obvious impact of the measurement noise: Although we were at position 0.4543 at time step 1, we measured the position 3.3558.
The following figure shows the development of the belief for the first four time steps, both numerically and graphically. At each time step, the black graphs show the belief specified in the upper right-hand corner, whereas the red graphs show the measurement probabilities $P(e_t \, \vert \, x_t)$. The blue line shows the position $x_t$ and the green line the expected position, i.e. the mean of the belief distribution. Take some time to go over the graphs and do not let the mass of information confuse you. After having understood this example, you are able to visualize the Kalman Filter, which helps a lot when using it.
We can see that even though the measurements have been very bad, we still arrive at a belief that is quite reasonable, with an error of only 0.144.
Let’s now take a glimpse at the multi-dimensional situation, which looks a little scary but really is completely analogous to the one-dimensional case.
As before, the transition model has to be a linear Gaussian ($x_{t+1} = A_{t+1} x_t + \Delta_{t+1} + \epsilon_{t+1}$), so our transition probability are given by $P(x_{t+1} \, \vert \, x_t) = \mathcal{N}(A_{t+1} x_t + \Delta_{t+1}, \Phi_{t+1})$. Our uncertainty is now reflected by a covariance matrix $\Phi_{t+1}$ instead of a variance along a single dimension.
The sensor model has to be a linear Gaussian as well ($e_{t+1} = B_{t+1} \cdot x_{t+1} + \zeta_{t+1}$), so our measurement probabilities are given analogously by $P(e_{t+1} \, \vert x_{t+1}) = \mathcal{N}(C_{t+1} x_{t+1}, \Psi_{t+1})$.
With these models defined, we can now state the multi-dimensional Kalman Filter algorithm (cf. [THRUN, p. 42]).
Multi-Dimensional Kalman Filter
It is worth noting that $\overline{\Sigma}_t$, $\Sigma_t$ and $K_t$ are independent of the measurements and can therefore be calculated in advance, which reduces the amount of computation that has to be done “live”.
When we are dealing with linear Gaussian systems, the Kalman Filter is the way to go, since it is very efficient, easy to implement and completely exact if the three assumptions really hold. There are, however, only very few systems that really behave like this. Normally, the state transition process is nonlinear, which means it can not be described with a simple matrix multiplication.
The so-called extended Kalman Filter attempts to overcome this issue. The idea here is that if the state transition process is approximately linear in regions that are close to $\mu_t$, then a Gaussian belief is a reasonable approximation. If the system behaves nonlinear in regions close to the mean, the extended Kalman Filter yields bad results.
A different solution is the so-called switching Kalman Filter, which works by running multiple instances of the Kalman Filter in parallel, where each of them uses a different transition model. The overall belief is then calculated as a weighted sum of the belief distributions of the different instances, where the weight is a measure of how compatible this particular instance is with the measurements.
Continue with IV: The Particle Filter.
The post Robot Localization III: The Kalman Filter appeared first on deep ideas.
]]>The post Dealing with Unbalanced Classes in Machine Learning appeared first on deep ideas.
]]>Unbalanced classes create two problems:
Fortunately, these problems are not so difficult to solve. Here are a few ways to tackle them.
If possible, you could collect more data for the underrepresented classes to match the number of samples in the overrepresented classes. This is probably the most rewarding approach, but it is also the hardest and most time-consuming, if not downright impossible. In the cancer example, there is a good reason that we have way more non-cancer samples than cancer samples: These are easier to obtain, since there are more people in the world who haven’t developed cancer.
Artificially increase the number of training samples for the underrepresented classes by creating copies. While this is the easiest solution, it wastes time and computing resources. In the cancer example, we would almost have to double the size of the dataset in order to achieve a 50:50 share between the classes, which also doubles training time without adding any new information.
Similar to 2, but create augmented copies of the underrepresented classes. For example, in the case of images, create slightly rotated, shifted or flipped versions of the original images. This has the positive side-effect of making the model more robust to unseen examples. However, it only does so for the underrepresented classes. Ideally, you would want to do this for all classes, but then the classes are unbalanced again and we’re back where we started.
Remove training samples from the overrepresented classes so that the number of training samples for all classes is the same. This solves our problem and reduces training time, but it makes our model worse. After all, we want to use as much labelled data as we possibly can, even if this causes unbalanced classes. I don’t recommend this solution.
The sensitivity tells us the probability that we detect cancer, given that the patient really has cancer. It is thus a measure of how good we are at correctly diagnosing people who have cancer.
$$sensitivity = Pr(detect\, cancer \; \vert \; cancer) = \frac{\text{true positives}}{\text{positives}}$$
The specificity tells us the probability that we do not detect cancer, given that the patient doesn’t have cancer. It measures how good we are at not causing people to believe that they have cancer if in fact they do not.
$$specificity = Pr(\lnot \, detect\, cancer \; \vert \; \lnot \, cancer) = \frac{\text{true negatives}}{\text{negatives}}$$
A model that always predicts cancer will have a sensitivity of 1 and a specificity of 0. A model that never predicts cancer will have a sensitivity of 0 and a specificity of 1. An ideal model should have both a sensitivity of 1 and a specificity of 1. In reality, however, this is unlikely to be achievable. Therefore, we should look for a model that achieves a good tradeoff between specificity and sensitivity. So which one of the two is more important? This can’t be said in general. It highly depends on the application.
If you build a photo-based skin cancer detection app, then a high sensitivity is probably more important than a high specificity, since you want to cause people who might have cancer to get themselves checked by a doctor. Specificity is a little less important here, but still, if you detect cancer too often, people might stop using your app since they unnecessarily get annoyed and scared.
Now suppose that our desired tradeoff between sensitivity and specificity is given by a number $t \in [0, 1]$ where $t = 1$ means that we only pay attention to sensitivity, $t = 0$ means we only pay attention to specificity and $t = 0.5$ means that we regard both to be equally important. In order to incorporate the desired tradeoff into the training process, we need the samples of the different classes to have a different contribution to the loss. To achieve this, we can simply multiply the contribution of the cancer samples to the loss by
$$\frac{\text{number of non-cancer samples}}{\text{number of cancer samples}} \cdot t$$
In Keras, the class weights can easily be incorporated into the loss by adding the following parameter to the fit function (assuming that 1 is the cancer class):
class_weight={ 1: n_non_cancer_samples / n_cancer_samples * t }
Now, while we train, we want to monitor the sensitivity and specificity. Here is how to do this in Keras. In other frameworks, the implementation should be similar (for instance, you could replace all the K calls by numpy calls).
from keras import backend as K def sensitivity(y_true, y_pred): true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) return true_positives / (possible_positives + K.epsilon()) def specificity(y_true, y_pred): true_negatives = K.sum(K.round(K.clip((1-y_true) * (1-y_pred), 0, 1))) possible_negatives = K.sum(K.round(K.clip(1-y_true, 0, 1))) return true_negatives / (possible_negatives + K.epsilon())
model.compile( loss='binary_crossentropy', optimizer=RMSprop(0.001), metrics=[sensitivity, specificity] )
If we have more than two classes, we can generalize sensitivity and specificity to a “per-class accuracy”:
$$perClassAccuracy(C) = Pr(detect\, C \; \vert \; C)$$
In order to train for maximum per-class accuracy, we have to specify class weights that are inversely proportional to the size of the class:
class_weight={ 0: 1.0/n_samples_0, 1: 1.0/n_samples_1, 2: 1.0/n_samples_2, ... }
Here is a Keras implementation of the per-class accuracy, which I adopted from jdehesa at Stackoverflow.
INTERESTING_CLASS_ID = 0 # Choose the class of interest def single_class_accuracy(y_true, y_pred): class_id_true = K.argmax(y_true, axis=-1) class_id_preds = K.argmax(y_pred, axis=-1) accuracy_mask = K.cast(K.equal(class_id_preds, INTERESTING_CLASS_ID), 'int32') class_acc_tensor = K.cast(K.equal(class_id_true, class_id_preds), 'int32') * accuracy_mask class_acc = K.sum(class_acc_tensor) / K.maximum(K.sum(accuracy_mask), 1) return class_acc
If you have any questions, feel free to leave a comment. If you want to stay updated about new machine learning articles, you can either subscribe to deep ideas by Email, subscribe to my Facebook page or follow me on Twitter.
The post Dealing with Unbalanced Classes in Machine Learning appeared first on deep ideas.
]]>The post Robot Localization II: The Histogram Filter appeared first on deep ideas.
]]>The Histogram Filter is the most straightforward solution to represent continuous beliefs. We simply divide $dom(x_t)$ into $n$ disjoint bins $b_0, …, b_{n−1}$ such that $\cup_i b_{i} = dom(x_t)$. Then we define a new state $x_t^\prime \in \{0, …, n − 1\}$ where $x_t^\prime = i$ if and only if $x_t \in b_i$. Since $x_t^\prime$ has a discrete, finite state space, we can use the discrete Bayes Filter to calculate $bel(x_t^\prime)$.
$bel(x_t^\prime)$ is an approximation for $bel(x_t)$ then: For each bin $b_i$, it gives us the probability that $x_t$ is in that bin. The more bins we use, the more accurate the approximation becomes, with the downside of increasing computational complexity.
To make this more clear, we shall apply the Histogram Filter to a global localization example as displayed in the following image:
A self-driving car lives in a one-dimensional, cyclic world that is 5 meters wide. By cyclic, we mean that if it is in the rightmost cell and moves one step to the right, it’s back in the leftmost cell. The robot’s position at each time step is given as $pos_t \in [0, 5)$, which is the only state variable. It has a sensor that is, under uncertainty, able to tell the color of the wall next to it. We assume that the car is constantly moving right under noise, at an expected speed of one meter per time step.
In order to apply the Histogram Filter, we choose the following decomposition of the state space: $b_0 = [0, 1)$, $b_1 = [1, 2)$, $b_2 = [2, 3)$, $b_3 = [3, 4)$, $b_4 = [4, 5)$. This way, the position can be measured as a discrete variable $pos_t^\prime \in \{0, …, 4\}$, which is an estimate of the true, continuous position. Each discrete position corresponds to exactly one of the distinguished grid cells in the above image.
We can now specify the transition and sensor models. We assume that the car intends to move exactly one grid cell to the right at each time step, but that the inaccuracy of the motor causes it to move 2 grid cells in 5% of the cases, not move at all in 5% of the cases and move exactly 1 grid cell in 90% of the cases. This results in the following transition model:
$$
P(pos_t^\prime = x + 2 \; mod \, 5 \; \vert \; pos_{t−1}^\prime = x) = 0.05\\
P(pos_t^\prime = x + 1 \; mod \, 5 \; \vert \; pos_{t−1}^\prime = x) = 0.9\\
P(pos_t^\prime = x \; \vert \; pos_{t−1}^\prime = x) = 0.05
$$
As for the sensors, we assume that in 90% of the cases the measured color is correct and in 10% of the cases it is incorrect, yielding the following sensor model:
$$
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime = 0, 2, 3) = 0.9\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime = 0, 2, 3) = 0.1\\
P(MeasuredColor_t = Blue \; \vert \; pos_t^\prime = 1, 4) = 0.1\\
P(MeasuredColor_t = Orange \; \vert \; pos_t^\prime = 1, 4) = 0.9
$$
Let’s now use the discrete Bayes filter to calculate the car’s belief for three time steps where the sensor measurements are Orange, Blue and Orange in that order. We assume that the car starts at the very left (but it does not know that it does) and travels exactly one grid cell to the right per time step (which it does not know either). We can represent the belief as a 5-dimensional row vector $bel(pos_t^\prime) = (bel_{t,1}, bel_{t,2} bel_{t,3}, bel_{t,4}, bel_{t,5})$ where $bel_{t,i}$ represents the probability that the robot is in cell $i$ at time-step $t$.
The car has no prior knowledge about its position. Thus, it starts out with the following belief:
$bel(pos_0^\prime) = (0.2, 0.2, 0.2, 0.2, 0.2)$
First, it projects the previous belief to the current time step:
$\overline{bel}(pos_1^\prime) = \sum_{pos_0^\prime} P(pos_1^\prime \; \vert \; pos_0^\prime) \cdot bel(pos_0^\prime)$
$= (0.05, 0.9, 0.05, 0.0, 0.0) \cdot 0.2 + (0.0, 0.05, 0.9, 0.05, 0.0) \cdot 0.2$
$+ (0.0, 0.0, 0.05, 0.9, 0.05) \cdot 0.2 + (0.05, 0.0, 0.0, 0.05, 0.9) \cdot 0.2$
$+ (0.9, 0.05, 0.0, 0.0, 0.05) \cdot 0.2 = (0.2, 0.2, 0.2, 0.2, 0.2)$
This results in the same belief as before, which shouldn’t surprise us, since each cell was equally likely to be the car’s position at time $t = 0$ and therefore, since the robot just moved blindly, each cell is still equally likely to be its position at time $t = 1$.
Now the robot updates the projected belief with the sensor input:
$bel(pos_1^\prime) = \eta \cdot P(MeasuredColor_1 = Orange \; \vert \; pos_1^\prime) \cdot \overline{bel}(pos_1^\prime)$
$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.2, 0.2, 0.2, 0.2, 0.2)$
$= \eta \cdot (0.02, 0.18, 0.02, 0.02, 0.18)$
$= (0.04762, 0.42857, 0.04762, 0.04761, 0.42857)$
where the last step follows by dividing the vector by the sum over all vector values so that the probabilities sum up to 1. We can see that each of the two orange cells are equally likely to have caused the sensor measurement. Thus, the robot currently has two salient theories on where it might be.
$\overline{bel}(pos_2^\prime) = \sum_{pos_1^\prime} P(pos_2^\prime \; \vert \; pos_1^\prime) \cdot bel(pos_1^\prime)$
$= (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$
$bel(pos_2^\prime) = \eta \cdot P(MeasuredColor_2 = Orange \; \vert \; pos_2^\prime) \cdot \overline{bel}(pos_2^\prime)$
$= \eta \cdot (0.9, 0.1, 0.9, 0.9, 0.1) \cdot (0.39048, 0.08571, 0.39048, 0.06667, 0.06667)$
$= (0.45165, 0.01102, 0.45165, 0.07711, 0.00857)$
$\overline{bel}(pos_3^\prime) = \sum_{pos_2^\prime} P(pos_3^\prime \; \vert \; pos_2^\prime) \cdot bel(pos_2^\prime)$
$= (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$$bel(pos_3^\prime) = \eta \cdot P(MeasuredColor_3 = Orange \; \vert \; pos_3^\prime) \cdot \overline{bel}(pos_3^\prime)$
$= \eta \cdot (0.1, 0.9, 0.1, 0.1, 0.9) \cdot (0.03415, 0.40747, 0.05508, 0.41089, 0.09241)$
$= (0.00683, 0.73358, 0.01102, 0.08219, 0.16637)$
We can see that after 3 time steps the robot is already about 73% certain that it is in the second grid cell. After another 3 time steps of travelling right and sensing the correct colors, it is 94% certain about its position.
The disadvantage of the Histogram Filter is obvious: We are not able to tell the probability of each possible state. We are only able to tell the probability that the state is in a certain region of the state space. This disadvantage might be circumvented by using a very fine-grained decomposition of the state space, but this drastically increases the computational complexity.
Continue with the next part: Robot Localization III: The Kalman Filter
The post Robot Localization II: The Histogram Filter appeared first on deep ideas.
]]>