The post Introduction to Evolutionary Psychology appeared first on deep ideas.
A prominent argument given in favor of the SSSM is the fact that genetically determined behavior might be maladaptive due to changing environmental conditions, and therefore the mind evolved towards general-purpose and domain-general learning systems. On this view, the phenotype’s behavior is plastic and tailored toward maximizing individual fitness under changing environmental circumstances. The selective pressures of ancestral environments gave rise to this plasticity, but the concrete adaptive problems that have been faced in these environments play only a minor role in explaining the behavior of modern humans. This is the reason why many social scientists study human behavior in modern conditions more or less independently from their evolutionary history.
Evolutionary psychology, in contrast, holds that psychological mechanisms are evolved adaptations to ancestral adaptive problems. An analogy is drawn here between organs in the body and “cognitive programs” or “mental organs”: Analogous to how organs in the body evolved to solve a particular adaptive problem, e.g. digesting food, cognitive programs evolved to solve a particular adaptive information processing problem, e.g. predator/prey distinction, kin detection, language, etc.
In the following, we will break down the individual tenets of evolutionary psychology and review the arguments that are given in support of these tenets. Since not all tenets are shared by all evolutionary psychologists, we will focus here on the formulation given by Cosmides and Tooby (1987) and Tooby and Cosmides (2005). The tenets are not listed explicitly, but can be reconstructed implicitly from these texts. I will go through each tenet in turn and present a reconstruction of the arguments that motivate these tenets.
This tenet is motivated as follows: Environments pose adaptive information processing problems to organisms. Hence, the genes of organisms that successfully solve these information processing problems spread in the gene pool and such organisms are, by definition, computers.
This tenet, Tooby and Cosmides (2005, p. 31) argue, is shared by proponents of the SSSM. Even a domain-general learning mechanism would be an innate information processing mechanism that evolved at some point to solve adaptive problems. For example, operant conditioning presupposes an innate mechanism to alter the probability of behaviors based on their intrinsically reinforcing consequences (like food or pain). Similarly, classical conditioning presupposes innate unconditioned stimuli and a method to calculate contingencies. Consequently, Tooby and Cosmides (2005, p. 32) conclude that “learning is not an alternative explanation to the claim that natural selection shaped the behavior” and that “a behavior can be, at one and the same time, cultural, learned, and evolved”. This means that the commonly perceived controversy between innateness/evolvedness on the one hand and learnedness on the other is based on a false dichotomy. Rather, it is proposed, evolution created programs as learning mechanisms, and these mechanisms are a prerequisite for learning to be able to occur. The disagreement between the SSSM and evolutionary psychology, therefore, only regards the structure of the evolved learning mechanisms, not the question whether such learning mechanisms evolved at all.
When we accept the theory of evolution through natural selection, it arguably becomes theoretically impossible to deny that the brain evolved to be a computer that solves adaptive information processing problems – unless we claim that (A) evolution hasn’t found this path yet, (B) evolution cannot find this path in principle since it would lead through a fitness valley or (C) adaptive problems aren’t information processing problems and therefore a computer would not be the ideal solution. Discussing these possibilities would be beyond the scope of this introduction, so I am going to suppose (A), (B) and (C) to be false for the rest of this discussion. This leads us to accept this tenet.
Cosmides and Tooby (1987, p. 47) and Tooby and Cosmides (2005, pp. 294-299) argue that there is no domain-general success criterion that is correlated with fitness and, therefore, a domain-general mechanism would not be successful at actually maximizing fitness and could therefore not have evolved. This argument can be summarized as follows: If no domain-specific innate knowledge is present in the organism, then it can only acquire knowledge that can be inferred from perceptual inputs, without relying on innate perceptual heuristics. Similarly, it can learn behaviors only through trial and error learning, which would amount to generating random sequences of actions, observing the fitness outcome (e.g. the number of produced offspring) and then reinforcing or mitigating behaviors based on this outcome. Proposing instead that the mechanism could rely on perceptual cues like smell or taste as a proxy for expected fitness, they argue, amounts to “admitting domain-specific innate knowledge”.
However, when observing a certain positive or negative fitness outcome (like an increase or decrease in the produced offspring), it is virtually impossible to trace it back to the precise actions or sequences of actions that caused it, since virtually any action taken before in the organism’s life could have caused it. Furthermore, whether a sequence of actions promotes fitness is highly context-sensitive. Thus, due to the resulting combinatorial explosion, behaviors cannot reliably be reinforced or mitigated and behavior stays more or less random. Therefore, an organism with adequate innate domain-specific knowledge, perceptual heuristics and perception-action patterns would have a fitness advantage over an organism that only has a domain-general fitness-maximizing mechanism, consequently triggering selection for organisms with these traits.
It should be noted that it is not claimed that all cognitive programs generate behavior deterministically based on the current perceptual input. Rather, some of these programs exhibit what is commonly called experience-dependent plasticity: They are able to learn based on the input they receive throughout the organism’s development (Cosmides and Tooby, 1987, p. 284). For example, the language program learns to acquire the language of a person’s surrounding community. The programs, therefore, did not evolve to produce a certain kind of behavior, but they evolved to produce a mapping from current inputs and the sequence of inputs they received throughout development to behaviors. Different programs have different degrees of experience-dependent plasticity, depending on the fitness advantage that plasticity would provide over genetic determinism in the program’s adaptive domain.
In a similar fashion, programs are experience-expectant: They evolved to be able to develop only if they receive certain informational inputs at critical periods throughout development (Tooby and Cosmides, 2005, pp. 34-35). This entails that a program’s innateness does not mean that it is present at birth – much like teeth are innate but not present at birth. Rather, a cognitive program can develop at any point in an organism’s life, depending on whether it is relevant at that point in life and whether the developmentally relevant informational inputs have been received. Tooby and Cosmides (2005, p. 35) stress that this developmentally relevant information consists not only of contingencies in physical laws and the behavior of other organisms, but also of the physical and cultural environment. The latter comprise a second inheritance system that co-evolves with the genes, and changes in these environments can lead to significant alterations in the operation of the cognitive programs, or even a failure of certain cognitive programs to develop.
It should also be noted that it is not claimed that the cognitive programs can only generate behavior according to their original adaptive function. For example, the language program, which arose as an adaptation for spoken language, can learn to acquire reading and writing (Tooby and Cosmides, 2005, p. 26). The ability to learn reading and writing is not an adaptation but a by-product of the adaptation for spoken language.
However, it is claimed that the perception-behavior relations humans can learn are constrained or patterned by the structure of their innate cognitive programs. Hence, humans are not able to learn to perform arbitrary tasks. Rather, they are able to learn a task only if either a cognitive program to tackle this type of task arose as an adaptation, or the ability to solve this task is a by-product of some cognitive program that arose for some similar adaptive problem (as in the case of reading and writing). This is arguably the strongest and most vigorously debated entailment of evolutionary psychology, since it is in stark contrast to the SSSM, which posits a domain-general learning mechanism.
To make Tenet 3 more vivid, consider an example: the adaptive problem of avoiding inbreeding. Inbreeding is more disastrous the more closely related the mates are. As argued under Tenet 2, a domain-general fitness-maximizing mechanism could not learn the relation between defective offspring and sex with relatives. Hence, Lieberman, Tooby, and Cosmides (2003) propose that humans evolved a kin detection program as a response to the evolutionarily recurrent statistical relationship between inbreeding and reduced fitness. This program, they propose, combines various cues, like duration of coresidence during childhood, the degree to which one’s own mother cared for the person in question, olfactory signature, etc. to compute an estimate of the degree of relatedness to a person. This estimate is not computed every time one encounters a person, but rather it is learned over time and stored. It is then fed into another program – the program that computes the sexual attractiveness of an individual. The higher the degree of relatedness, the lower the sexual attractiveness of that individual should be. This proposed incest avoidance program was tested by conducting a study in which participants were asked about the amount of time they spent with siblings during childhood, and the degree of aversion that they feel when imagining sexual intercourse. As predicted by positing an inbreeding avoidance program, the study found a significant correlation between the period of childhood coresidence and the degree of sexual aversion (Lieberman, Tooby, and Cosmides, 2003, p. 27).
While the argument for Tenet 2 should lead us to accept that the mind must have some sort of innate domain-specific knowledge, it does not actually show that this must be manifested in the form of a large collection of functionally isolable programs that are domain-specific and correspond to particular adaptive information processing problems. Tenet 3, as it stands, is therefore not backed by the theoretical arguments that are given in favor of Tenet 2 – even though Cosmides and Tooby (1987) and Tooby and Cosmides (2005) take it to be that way. This conclusion is based on a false dichotomy between the extreme form of a domain-general fitness-maximizing mechanism, or “blank slate”, described in Tenet 2 and the view described in Tenet 3. Since this dichotomy does not exist, arguments against the blank slate view are not arguments in favor of Tenet 3. This will be elaborated upon in an upcoming article.
Apart from theoretical considerations, other arguments given in favor of this tenet are empirical observations. If a proposed cognitive program predicts certain behaviors that are empirically found to be present, this is taken to be evidence in support of the existence of the proposed cognitive program. However, as with all of science, just because a certain theory is consistent with an observation, this does not verify the proposed theory – it only doesn’t falsify it (Popper, 2005). In an upcoming article, I will construct another model that accounts for the empirical observations taken as evidence for evolutionary psychology without actually positing functionally isolable cognitive programs and while maintaining a domain-general learning mechanism.
As Tooby and Cosmides (2005, p. 56 f.) point out, most of the uniquely human evolution took place in ancestral hunter-gatherer societies, and natural selection acts too slowly to have adapted to post-hunter-gatherer conditions.
Tooby and Cosmides (2005, pp. 36-39) argue that, since the genetic makeup of offspring is basically a random mixture of the genetic makeup of the mother and father, the genetic makeup of both parents has to code for a universal functional architecture – otherwise, the programs cannot function in concert in the offspring. The observed variation between individuals and races can therefore not be explained by differences in the overall functional architecture, but rather by genetic variations that tune quantitative parameters, adaptations that can be coded for by single genes or programs that can be activated or deactivated by single genes. Another source of variation is the difference in perceptual inputs received throughout development. These can cause some of the programs to learn different behaviors. However, these behavioral changes are limited by the degree of plasticity allowed for by the programs.
Tooby and Cosmides (2005, pp. 42-44) suggest that behavior with respect to certain object categories has computational requirements that are functionally incompatible with the demands for other categories. For example, snakes have been a dangerous, recurring predator throughout human evolutionary history and reacting to snakes requires a speed of processing and particular behavioral response packages that would be incompatible with the responses for other categories. Consequently, behavior with respect to snakes should be generated by some other system than, e.g., behavior with respect to humans.
Similarly, reasoning with respect to social relations can include methods of inference that would be invalid in content-free reasoning. They give the following example: “If you take the benefit, then you are obligated to satisfy the requirement” implies “If you satisfy the requirement, then you are entitled to take the benefit” (Tooby and Cosmides, 2005, p. 46). This is an inference of the form (P → Q) → (Q → P) that would be invalid in a domain-general logic. Hence, they argue that domain-specific reasoning systems would allow the organism to draw inferences that would be impossible using a domain-general reasoning system. This, in turn, improves the organism’s fitness, which results in selection for domain-specific reasoning systems.
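That the inference form (P → Q) → (Q → P) is not valid in content-free propositional logic can be checked mechanically. The following is a minimal Python sketch that enumerates all truth assignments and collects those under which the formula is false; the helper name `implies` is introduced here purely for illustration.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b is false only when a is true and b is false."""
    return (not a) or b


# Collect every assignment (P, Q) under which (P -> Q) -> (Q -> P) is false.
counterexamples = [
    (p, q)
    for p, q in product([False, True], repeat=2)
    if not implies(implies(p, q), implies(q, p))
]

print(counterexamples)  # [(False, True)]
```

The single counterexample is P false and Q true: here P → Q holds, but Q → P does not. In the social contract reading, this corresponds to someone who satisfies the requirement without having taken the benefit.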
These and similar considerations lead them to propose that attention systems, reasoning systems, learning systems and memory systems are category-based, i.e., they are not uniform systems but exist as separate cognitive programs for different object categories (e.g., for animals, humans and artifacts). Their unification, they claim, is a relic of folk psychology that ought to be eliminated from scientific psychology (Tooby and Cosmides, 2005, p. 45).
Since the mind is a collection of programs that generate different behaviors and could mutually interfere with each other, mechanism orchestration in particular evolutionarily recurrent situations (e.g. being attacked by a predator) is an adaptive problem, and emotion programs evolved as a solution to this adaptive problem (Tooby and Cosmides, 2005, pp. 52-61). These emotion programs (e.g., fear) involve activating subprograms in a concerted way for solving the particular adaptive problem in question (e.g., activating an action package consisting of flight behavior, physiological changes, a fearful facial expression, screaming, etc.) while deactivating possibly interfering other programs (e.g., hunger).
Given these tenets, Tooby and Cosmides (2005) propose that the field of psychology ought to be a form of reverse engineering of the cognitive programs. This approach can be broken down as follows: First, gather knowledge about ancestral living conditions and environments from fields like paleoanthropology, hunter-gatherer archaeology and studies of living hunter-gatherer societies. These insights can be combined with evolutionary theory to determine the adaptive problems faced in these environments, i.e., the problems that, when solved, would lead to higher evolutionary fitness. From these adaptive problems, specifications of the computational requirements that these adaptive problems pose can be constructed. Given these specifications, models for cognitive programs that comply with these specifications can be developed – i.e., that manage to solve the adaptive problem in question. Alternatively, instead of starting the process by figuring out adaptive problems, one could start by observing behaviors of organisms and work backward to hypothesize a cognitive program that could give rise to this behavior. Finally, the hypothesized cognitive programs can be evaluated for coherence and consistency with previous or novel experimental observations from the cognitive, social and cultural sciences: Either, behavioral predictions of the hypothesized programs can be matched against cross-cultural behavioral observations, or design features of the programs can be identified in brains.
As I have alluded to already, the theoretical arguments that we have reviewed should only lead us to accept Tenets 1 and 2. All the other tenets posit discrete, functionally isolable cognitive programs that correspond to distinct adaptive problems. While Tooby and Cosmides take their argument against the domain-general fitness-maximizing mechanism to be an argument in favor of cognitive programs, I will argue in an upcoming article that this is not actually so. In particular, I will argue for two theses: (A) We can posit a domain-general learning mechanism while maintaining domain-specific innate knowledge. (B) We can posit domain-specific innate knowledge without positing cognitive programs under any fruitful definition of this term.
Cosmides, Leda and John Tooby (1987). “From evolution to behavior: Evolutionary psychology as the missing link”. In: The latest on the best: Essays on evolution and optimality. The MIT Press.
Lieberman, Debra, John Tooby, and Leda Cosmides (2003). “The evolution of human incest avoidance mechanisms: an evolutionary psychological approach”. In: Evolution and the moral emotions: appreciating Edward Westermarck. Citeseer.
Popper, Karl (2005). The logic of scientific discovery. Routledge.
Tooby, John and Leda Cosmides (1992). “The psychological foundations of culture”. In: The adapted mind: Evolutionary psychology and the generation of culture. Oxford University Press.
Tooby, John and Leda Cosmides (2005). “Conceptual foundations of evolutionary psychology”. In: The handbook of evolutionary psychology. John Wiley & Sons.
The post Building a Content-Based Multimedia Search Engine VI: Efficient Query Processing appeared first on deep ideas.
This is part 6 in a series of tutorials in which we learn how to build a content-based search engine that retrieves multimedia objects based on their content rather than based on keywords, title or meta description.
The naive way to process a k-Nearest Neighbor query entails computing the distance between the query object and all database objects, resulting in a time complexity of $\mathcal{O}(|DB|)$ where $|DB|$ refers to the size of the database (cf. [BS13]). If the distance measure is costly to compute, which is usually the case when dealing with complex multimedia objects, this is infeasible. Therefore, we rely on methods that allow us to compute the k-Nearest Neighbors without having to compute the actual distance between the query and all database objects.
One way to speed up the query processing is by means of a lower bound to the distance function $\delta$, i.e. a function $\delta_{LB} : X \times X \rightarrow \mathbb{R}$ for which it holds that $\delta_{LB}(x, y) \leq \delta(x, y)$ for all $x, y \in X$. If we know that the k-Nearest Neighbors have a distance smaller than or equal to $\epsilon_{max} \in \mathbb{R}$, then we can exclude all objects $o$ for which it holds that $\delta_{LB}(q,o) > \epsilon_{max}$ without computing the actual distance $\delta(q, o)$, since this implies that $\delta(q,o) > \epsilon_{max}$.
For most distance functions, lower bounds can be found which are significantly more efficient to compute than the actual distance function, while at the same time allowing us to rule out a large proportion of the database as potential search results.
The Multi-Step kNN Algorithm, proposed by Seidl and Kriegel in [SK98], utilizes a lower bound for efficient kNN search by iteratively updating the pruning distance $\epsilon_{max}$ while scanning the database. As shown in [SK98], this algorithm is optimal with respect to the number of performed computations of the utilized distance function $\delta$. Assuming that we have specified multiple lower bounds $\delta_{LB_1}, \ldots, \delta_{LB_m}$, the algorithm evaluates these lower bounds in turn for each database object and computes the actual distance $\delta$ only if none of them exceeds the current pruning distance $\epsilon_{max}$.
The efficiency of this algorithm highly depends on the utilized lower bounds. A good lower bound should meet the ICES criteria defined by Assent et al. in [AWS06]: It should be indexable such that multidimensional indexing structures like X-Tree or R-Tree can be applied. Furthermore, it should be complete, i.e. no false drops occur, which is guaranteed by the lower-bounding property. Moreover, it should be efficient, i.e. its computational time complexity should be significantly lower than the complexity of the actual distance function. Finally, it should be selective, i.e. it should allow us to exclude as many objects as possible from the actual distance computation, which is achieved by approximating the actual distance as closely as possible.
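To make the filter-and-refine idea concrete, here is a minimal Python sketch of a linear-scan variant with several lower bounds. It is an illustration under stated assumptions, not the pseudocode from [SK98]: the function name `multi_step_knn`, its parameters and the use of a heap to maintain the current candidates are all choices made for this sketch.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def multi_step_knn(
    query: T,
    database: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
    lower_bounds: Sequence[Callable[[T, T], float]],
) -> List[Tuple[float, T]]:
    """Return the k nearest neighbors of `query` as (distance, object) pairs."""
    if k <= 0:
        return []

    # Max-heap (via negated distances) holding the best k candidates so far.
    # The index is a tie-breaker so that objects themselves are never compared.
    heap: List[Tuple[float, int, T]] = []
    eps_max = math.inf  # pruning distance; stays infinite until k candidates exist

    for index, obj in enumerate(database):
        # Filter step: `any` short-circuits, so the lower bounds are evaluated
        # in the given order and we stop at the first one exceeding eps_max.
        if any(lb(query, obj) > eps_max for lb in lower_bounds):
            continue

        # Refinement step: only now compute the (expensive) exact distance.
        d = distance(query, obj)
        if len(heap) < k:
            heapq.heappush(heap, (-d, index, obj))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, index, obj))

        # Once k candidates are known, the k-th smallest distance seen so far
        # becomes the new pruning distance.
        if len(heap) == k:
            eps_max = -heap[0][0]

    return sorted(((-neg_d, obj) for neg_d, _, obj in heap), key=lambda pair: pair[0])
```

Because an object is skipped only when a lower bound already exceeds $\epsilon_{max}$, no true nearest neighbor can be lost as long as every function in `lower_bounds` really is a lower bound of `distance`.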
If the lower bounds have different time complexities and selectivities, then the ordering of the lower bounds plays a role for the efficiency of the algorithm. In each iteration of the outer loop, we skip the actual distance computation when one of the lower bounds exceeds the current pruning radius $\epsilon_{max}$. Hence, a reasonable heuristic in order to skip as early as possible is to sort the lower bounds in ascending order of their time complexities.
Lower bounds can be devised by exploiting the inner workings of the utilized distance function. There are, however, some lower bounds that are generic in nature, i.e. they are applicable to a wide variety of distance functions, as long as these distance functions fulfill certain properties. In the following, we present a generic lower bound for metric distance functions and lower bounds for the Earth Mover’s Distance.
If we are dealing with a metric distance function, we can exploit the fact that the distance fulfills the triangle inequality to devise a lower bound (cf. [ZADB06]). Given a query $q$, a database object $o$ and a set of so-called pivot objects $P$, it follows from the triangle inequality that
$$\delta_{LB-Metric}(q, o) = \max_{p \in P} |\delta(q, p) - \delta(p, o)| \leq \delta(q, o)$$
The distances $\delta(p, o)$ between all pivot objects $p \in P$ and all database objects $o \in DB$ can be computed in advance and stored in a so-called pivot table of size $|P| \cdot |DB|$. When processing a query, we only need to compute the distances $\delta(q, p)$ between the query and all pivot objects $p \in P$ and store those distances in a list. After that, $\delta_{LB-Metric}$ can be computed in $\mathcal{O}(|P|)$ efficiently by looking up the stored distance values. The selectivity of this approach is highly dependent on the choice of pivot objects $P$ and the distribution of the database objects and the query object.
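The pivot table and the resulting lower bound can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the pivot table is a plain nested list, and the sketch assumes that at least one pivot object is given.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def build_pivot_table(
    pivots: Sequence[T], database: Sequence[T], distance: Callable[[T, T], float]
) -> List[List[float]]:
    """Precompute delta(p, o) for every pivot p and database object o (done once)."""
    return [[distance(p, o) for o in database] for p in pivots]


def query_pivot_distances(
    query: T, pivots: Sequence[T], distance: Callable[[T, T], float]
) -> List[float]:
    """Compute delta(q, p) for all pivots (done once per query)."""
    return [distance(query, p) for p in pivots]


def metric_lower_bound(
    query_dists: Sequence[float], pivot_table: Sequence[Sequence[float]], object_index: int
) -> float:
    """max over pivots of |delta(q, p) - delta(p, o)|, using only stored values."""
    return max(
        abs(dq - row[object_index]) for dq, row in zip(query_dists, pivot_table)
    )
```

After the table has been built, evaluating the bound for one database object only requires $|P|$ lookups and subtractions; no call to the distance function is needed.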
As shown by Rubner et al. in [RTG00], when using a norm-induced ground distance, the Earth Mover’s Distance can be lower-bounded by the ground distance between the weighted means of the two signatures.
Let $X, Y$ be two feature signatures and $\delta : \mathbb{F} \times \mathbb{F} \rightarrow \mathbb{R}$ be a norm-induced ground distance. Then it holds that
$$Rubner(X, Y) = \delta(\overline{X}, \overline{Y}) \leq EMD_\delta(X, Y)$$
where $\overline{X}$ is the weighted mean of $X$:
$$\overline{X} = \frac{\sum_{f \in R_X} X(f) \cdot f}{\sum_{f \in R_X} X(f)}$$
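As a small illustration, the Rubner lower bound can be computed as follows. The sketch makes several assumptions that are not part of the text above: a feature signature is represented as a dictionary mapping representatives (tuples of floats) to weights, the ground distance is the Euclidean distance, and both signatures have the same total weight.

```python
import math
from typing import Dict, Tuple

Feature = Tuple[float, ...]
Signature = Dict[Feature, float]  # representative -> weight (assumed representation)


def weighted_mean(signature: Signature) -> Feature:
    """Weighted mean of the representatives of a signature."""
    total = sum(signature.values())
    dim = len(next(iter(signature)))
    return tuple(
        sum(w * f[i] for f, w in signature.items()) / total for i in range(dim)
    )


def rubner_lower_bound(x: Signature, y: Signature) -> float:
    """Euclidean distance between the weighted means of the two signatures."""
    return math.dist(weighted_mean(x), weighted_mean(y))


x = {(0.0, 0.0): 0.5, (2.0, 0.0): 0.5}
y = {(1.0, 3.0): 1.0}
print(rubner_lower_bound(x, y))  # 3.0
```

In this example the weighted means are $(1, 0)$ and $(1, 3)$, giving a bound of 3. The only feasible flow moves half of the weight from each representative of `x` to the single representative of `y`, at a cost of $\sqrt{10} \approx 3.16$, so the bound indeed stays below the actual distance.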
Proposed in [UBSS14], the Independent Minimization Lower Bound for feature signatures (short: IM-Sig) is a lower bound for EMD that corresponds to the EMD when removing the Target constraint and replacing it with the IM-Sig Target constraint defined as follows:
$$\forall g \in R_X, h \in R_Y : f(g, h) \leq Y(h)$$
Intuitively, this modified target constraint allows us to distribute the flow optimally for each representative $g \in R_X$ without considering whether the total flow coming into the target representatives exceeds their weights, as long as the flow from $g$ to $h$ does not exceed the weight $Y(h)$ for all target representatives $h \in R_Y$. We use $IMSig_\delta(X,Y)$ to denote the minimum cost flow with respect to the modified target constraint. Since the set of feasible flows for IM-Sig includes the set of feasible flows for EMD, it holds that $IMSig_\delta(X,Y) \leq EMD_\delta(X,Y)$.
An efficient algorithm for computing $IMSig_\delta(X,Y)$ is given in [UBSS14].
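Since the modified constraint decouples the source representatives from each other, one straightforward way to evaluate the bound is a greedy pass per source representative: send its weight to the closest target representatives first, at most $Y(h)$ to each. The following Python sketch implements this idea. It is not necessarily the algorithm given in [UBSS14], and it rests on assumptions made for this sketch: signatures are dictionaries from representatives to weights, both signatures have the same total weight, every source representative must ship exactly its own weight, and the cost is normalized by the total flow.

```python
import math
from typing import Callable, Dict, Tuple

Feature = Tuple[float, ...]
Signature = Dict[Feature, float]  # representative -> weight (assumed representation)


def im_sig(
    x: Signature,
    y: Signature,
    ground_distance: Callable[[Feature, Feature], float] = math.dist,
) -> float:
    """Greedy evaluation of the IM-Sig lower bound (sketch, equal total weights assumed)."""
    total_flow = sum(x.values())
    cost = 0.0
    for g, weight in x.items():
        remaining = weight
        # Visit the targets from cheapest to most expensive for this source g.
        for h in sorted(y, key=lambda target: ground_distance(g, target)):
            if remaining <= 0.0:
                break
            # IM-Sig target constraint: the flow from g to h may not exceed Y(h),
            # regardless of what other sources send to h.
            flow = min(remaining, y[h])
            cost += flow * ground_distance(g, h)
            remaining -= flow
    return cost / total_flow


x = {(0.0,): 0.5, (1.0,): 0.5}
y = {(0.0,): 0.5, (10.0,): 0.5}
print(im_sig(x, y))  # 0.5
```

In this example both source representatives send all of their weight to the target at 0, which the original target constraint would forbid. Under the original constraint, the second source would have to ship its weight to the target at 10, giving a cost of 4.5, so the bound of 0.5 is valid but not tight here.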
Let us review what we have learned so far. In Part I: Quantifying Similarity, we have learned how we can quantify the similarity or dissimilarity between two multimedia objects (or representations of these objects) by means of a similarity function or a distance function. Moreover, we have seen the types of query that exist with respect to such functions: The range query and the k-Nearest Neighbor Query. In Part II: Extracting Feature Vectors, we have learned how to extract feature vectors from multimedia objects that capture the visual content of those objects. We have demonstrated this for videos, but the general approach is applicable to a wide variety of other multimedia objects. In Part III: Feature Signatures, we have learned how we can summarize a set of feature vectors into a compact representation called a feature signature which effectively comprises a compressed summary of the contents of the multimedia object. In Part IV: Earth Mover’s Distance and Part V: Signature Quadratic Form Distance, we have learned about two distance measures on these feature signatures that allow us to calculate their dissimilarity and, in effect, the dissimilarity between the two multimedia objects that they represent. In Part VI: Efficient Query Processing, we have learned how to perform k-Nearest Neighbor queries efficiently, without having to compute the pairwise distances between the query object and every object in the database.
We now have all the necessary components in place to build a multimedia search engine. First of all, the database has to be fed with objects to query. For each of these objects, it should store a previously computed feature signature. If desired, we can generate a pivot table for efficient query processing. Next, a user interface has to be provided, which allows the user to upload a query object. For this query object, a feature signature has to be computed, and subsequently the database of feature signatures can be queried using the multi-step kNN algorithm.
[AWS06] Ira Assent, Marc Wichterich, and Thomas Seidl. Adaptable distance functions for similarity-based multimedia retrieval. Datenbank-Spektrum, 19:23–31, 2006.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content-based multimedia retrieval. PhD thesis, Aachen, 2013. Abstract in German and English; Aachen, Technische Hochschule, dissertation, 2013.
[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. In: ACM SIGMOD Record, volume 27, pages 154–165. ACM, 1998.
[UBSS14] Merih Seran Uysal, Christian Beecks, Jochen Schmücking, and Thomas Seidl. Efficient filter approximation using the earth mover’s distance in very large multimedia databases with feature signatures. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 979–988. ACM, 2014.
[ZADB06] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity search: the metric space approach, volume 32. Springer Science & Business Media, 2006.
The post Building a Content-Based Multimedia Search Engine V: Signature Quadratic Form Distance appeared first on deep ideas.
We have seen how we can compare two multimedia objects with respect to their similarity by computing the Earth Mover’s Distance on their feature signatures. To date, the Earth Mover’s Distance has been shown experimentally to be the most effective distance measure on feature signatures, i.e. it captures the way that human beings judge the perceptual dissimilarity between the objects most adequately. However, we have also seen that the computational complexity of the Earth Mover’s Distance lies somewhere between $\mathcal{O}(n^3)$ and $\mathcal{O}(n^4)$, where $n$ is the total number of representatives of the two compared signatures. For large-scale multimedia retrieval applications, this becomes infeasible.
An alternative distance measure, which is almost as effective as the Earth Mover’s Distance but significantly more efficient to compute, is the Signature Quadratic Form Distance (SQFD). It has been proposed in [BUS10b] and can be thought of as an adaptation of the Quadratic Form Distance [FBF+94], a distance measure on histograms, to feature signatures.
Let X, Y be two feature signatures and let $s : \mathbb{F} \times \mathbb{F} \rightarrow \mathbb{R}^{\geq 0}$ be a similarity function on features. The Signature Quadratic Form Distance $SQFD_s : \mathbb{S} \times \mathbb{S} \rightarrow \mathbb{R}^{\geq 0}$ is defined as
$$SQFD_s(X, Y) = \sqrt{<X-Y, X-Y>_s}$$
where $<X, Y>_s : \mathbb{S} \times \mathbb{S} \rightarrow \mathbb{R}^{\geq 0}$ is the Similarity Correlation, which is defined as
$$<X, Y>_s = \sum_{f \in R_X} \sum_{g \in R_Y} X(f) Y(g) s(f, g)$$
Intuitively, the similarity correlation yields high values if representatives that are similar to each other also have high weights. If, on the other hand, the weight in one signature is distributed over different (i.e. dissimilar) regions of the feature space than in the other signature, it yields low values. Therefore, $<X, Y>_s$ is a measure of the similarity of the two signatures.
$X-Y$ refers to the difference between the two feature signatures, which is defined as a new feature signature such that $(X-Y)(f) = X(f) - Y(f)$. The term $<X-Y, X-Y>_s$ yields the self-similarity of this difference. If $X = Y$, this value is 0. The more dissimilar $X$ and $Y$ are, the higher the distance value will be.
The previous definition presupposes some similarity function $s$ on feature vectors. Recall from Part I: Quantifying Similarity that a similarity function assigns small values to objects that are dissimilar and larger values to objects that are more similar, reaching its maximum when the two compared objects are the same. Given some distance function $\delta$ on feature vectors (an obvious choice being the Euclidean Distance or the Manhattan Distance), we can define a similarity function by means of the Gaussian kernel:
$$s(f, g) = e^{-\frac{\delta(f, g)^2}{2\sigma^2}}$$
For a distance of 0, this similarity function yields a value of 1. For increasing distances, the similarity decays rapidly to 0. $\sigma$ controls the speed of this decay, with larger values leading to a slower decay.
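The definitions above can be tried out with a small sketch. It assumes that a feature signature is stored as a Python dict mapping each representative (a tuple) to its weight, that the ground distance is the Euclidean distance, and that the Gaussian kernel uses the squared distance in the exponent. The function names are made up for this illustration and do not come from any library.

```python
import math


def euclidean(f, g):
    """Ground distance between two feature vectors given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def gaussian_similarity(f, g, sigma=1.0):
    """Gaussian kernel on the ground distance; equals 1 when f == g."""
    return math.exp(-euclidean(f, g) ** 2 / (2 * sigma ** 2))


def sqfd(x, y, similarity=gaussian_similarity):
    """Signature Quadratic Form Distance between two signatures (dicts)."""
    # Difference signature X - Y: the weights of Y enter with a negative sign.
    diff = dict(x)
    for g, w in y.items():
        diff[g] = diff.get(g, 0.0) - w
    # Similarity correlation of the difference signature with itself.
    total = sum(wf * wg * similarity(f, g)
                for f, wf in diff.items()
                for g, wg in diff.items())
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(total, 0.0))


x = {(0.0, 0.0): 0.5, (1.0, 1.0): 0.5}
y = {(0.0, 0.1): 0.5, (1.0, 0.9): 0.5}   # close to x
z = {(5.0, 5.0): 1.0}                    # far away from x
print(sqfd(x, x), sqfd(x, y), sqfd(x, z))
```

The double loop over the representatives of the difference signature is where the quadratic run-time comes from.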
As we can see from the definition, the run-time complexity for computing the Signature Quadratic Form Distance is $\mathcal{O}(n^2)$, where $n = |R_X| + |R_Y|$ is the total number of representatives, making it significantly cheaper to compute than the Earth Mover’s Distance. Moreover, as opposed to the Earth Mover’s Distance, the computation of the sum can be fully parallelized on a GPU, which in practice makes the computation time nearly independent of the signature size for typical signatures (see [KLBSS11]). Finally, it can be shown that the Signature Quadratic Form Distance is a metric, which will become important for indexing methods, as will be detailed in the next section.
While the Earth Mover’s Distance is slightly more effective than the Signature Quadratic Form Distance (see [BS13] for experimental comparisons), this difference may or may not be worth the increased computational complexity. The choice between the two distance metrics should therefore be based on the requirements of the project at hand.
We are now able to compute the similarity between arbitrary feature signatures. This effectively allows us to retrieve multimedia objects that are similar to a given query object. However, as of now, we would have to compute the distance between the query object and every single object in the database. If our database is large, this can lead to unacceptable computational demands and query times. For this reason, we are going to introduce indexing methods in the next section. These will allow us to retrieve objects similar to a query object without having to compute all distance values explicitly. Continue with the next section: Efficient Query Processing
[BUS10b] Christian Beecks, Merih Seran Uysal, and Thomas Seidl. Signature quadratic form distance. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 438–445. ACM, 2010.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content based multimedia retrieval. PhD thesis, Aachen, 2013. Summary in German and English; Aachen, Techn. Hochsch., Diss., 2013.
[KLBSS11] Kruliš, M., Lokoč, J., Beecks, C., Skopal, T., & Seidl, T. (2011, October). Processing the signature quadratic form distance on many-core gpu architectures. In: Proceedings of the 20th ACM international conference on Information and knowledge management (pp. 2373-2376). ACM.
The post Building a Content-Based Multimedia Search Engine V: Signature Quadratic Form Distance appeared first on deep ideas.
]]>The post Building a Content-Based Multimedia Search Engine IV: Earth Mover’s Distance appeared first on deep ideas.
]]>We have seen how we can represent multimedia objects efficiently and expressively by summarizing a set of feature vectors into a data structure called a feature signature. Given two multimedia objects represented as feature signatures, we can measure the dissimilarity of the objects using a distance measure on their feature signatures. Numerous distance measures for feature signatures have been proposed (see [BS13] for an overview). The distance measure that has turned out to be the most effective is called the Earth Mover’s Distance.
Proposed in [RTG00] for the domain of content-based image retrieval, the Earth Mover’s Distance (short: EMD) is a distance measure on feature signatures that can be thought of as the minimum required cost for transforming one feature signature into the other one. This cost is formulated by means of a transportation problem: We determine the optimal way to move the weights from the representatives of the first signature ($X$) to the representatives of the second signature ($Y$). The cost for moving a certain amount of weight is given by the amount of weight multiplied by the distance over which it is transported.
The following image depicts an example. On the top, we see the feature signatures of two videos. Let’s call the left-hand signature $X$ and the right-hand signature $Y$. On the bottom, we see an isolated representative of $X$ and the representatives of $Y$ to which it moves weight. As we can see, the representatives in $Y$ to which $X$’s representative moves weight are quite similar to $X$’s representative, resulting in a relatively small “movement cost” or “transformation cost” for this representative.
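Let’s see how we can formulate the Earth Mover’s Distance as an optimization problem. Let $\mathbb{F}$ be the set of all possible features, $\delta : \mathbb{F} \times \mathbb{F} \rightarrow \mathbb{R}_{\geq 0}$ be a distance function on features (called ground distance, e.g. the Euclidean distance) and $X, Y \in \mathbb{S}$ be two feature signatures. We call $f : R_X \times R_Y \rightarrow \mathbb{R}$ a flow from signature $X$ to signature $Y$. For two representatives $g \in R_X, h \in R_Y$, it tells us how much weight is moved from $g$ to $h$. $f$ is called a feasible flow if it fulfills the following constraints:
Non-negativity: $f(g, h) \geq 0$ for all $g \in R_X, h \in R_Y$
Source constraint: $\sum_{h \in R_Y} f(g, h) \leq X(g)$ for all $g \in R_X$
Target constraint: $\sum_{g \in R_X} f(g, h) \leq Y(h)$ for all $h \in R_Y$
Total flow: $\sum_{g \in R_X} \sum_{h \in R_Y} f(g, h) = \min\{ \sum_{g \in R_X} X(g), \sum_{h \in R_Y} Y(h) \}$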
Now let $F = \{f \; | \; f \; \text{is a feasible flow}\}$. There are infinitely many feasible flows. We are interested in the flow with the minimum cost, where the cost is defined as the sum over all pairs of representatives $g$, $h$ of the flow $f(g, h)$ multiplied by their ground distance $\delta(g, h)$. Intuitively, this means that we want to find a flow that tends to move weights from representatives in $X$ to nearby (i.e. similar) representatives in $Y$. The Earth Mover’s Distance is then defined as the cost of the minimum cost flow, i.e. the cost required to transform one signature into the other one.
$$EMD_\delta(X, Y) = \min_{f \in F} \left \{ \frac{ \sum_{g \in R_X} \sum_{h \in R_Y} f(g, h) \cdot \delta(g, h) }{ \min\{ \sum_{g \in R_X} X(g), \sum_{h \in R_Y} Y(h) \} } \right \}$$
The denominator acts as a normalization term.
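To make the notions of a feasible flow and its cost concrete, here is a minimal sketch. It does not solve the optimization problem. It only checks the standard transportation constraints from [RTG00] (non-negative flows, no representative sends or receives more than its weight, total flow equal to the smaller total weight) and evaluates the normalized cost of a given flow. Signatures are assumed to be dicts from representatives to weights, and all function names are hypothetical.

```python
import math


def euclidean(f, g):
    """Ground distance between two feature vectors given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def is_feasible(flow, x, y, tol=1e-9):
    """Check the transportation constraints for a flow from X to Y.

    flow maps pairs (g, h) of representatives to the amount of moved weight.
    """
    if any(g not in x or h not in y for (g, h) in flow):
        return False  # flow mentions an unknown representative
    if any(v < -tol for v in flow.values()):
        return False  # flows must be non-negative
    for g, w in x.items():  # a representative cannot send more than it holds
        if sum(v for (a, _), v in flow.items() if a == g) > w + tol:
            return False
    for h, w in y.items():  # and cannot receive more than its own weight
        if sum(v for (_, b), v in flow.items() if b == h) > w + tol:
            return False
    # The total flow must equal the smaller of the two total weights.
    total = min(sum(x.values()), sum(y.values()))
    return abs(sum(flow.values()) - total) <= tol


def flow_cost(flow, x, y, ground_distance=euclidean):
    """Normalized cost of a flow, i.e. the term that the EMD minimizes."""
    total = min(sum(x.values()), sum(y.values()))
    moved = sum(v * ground_distance(g, h) for (g, h), v in flow.items())
    return moved / total


x = {(0.0,): 0.5, (1.0,): 0.5}
y = {(0.0,): 0.5, (2.0,): 0.5}
flow_a = {((0.0,), (0.0,)): 0.5, ((1.0,), (2.0,)): 0.5}  # moves weight to nearby representatives
flow_b = {((0.0,), (2.0,)): 0.5, ((1.0,), (0.0,)): 0.5}  # crosses over
print(flow_cost(flow_a, x, y), flow_cost(flow_b, x, y))  # 0.5 1.5
```

Both flows are feasible, but the first one is cheaper because it moves weight between nearby representatives. The Earth Mover’s Distance is the smallest such cost over all feasible flows, which is what a linear programming solver would search for.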
The definition of the EMD corresponds to a linear program, i.e. an optimization problem with a linear objective function and linear constraints. It can be solved, for instance, using the Simplex algorithm (cf. [Van01]), which has an exponential worst-case time complexity. In practice, we would just use a library that calculates the Earth Mover’s Distance for us directly. According to [SJ08], the empirical time complexity for calculating the Earth Mover’s Distance between two signatures $X$ and $Y$ using the Simplex algorithm lies between $\mathcal{O}(n^3)$ and $\mathcal{O}(n^4)$ where $n = |R_X| + |R_Y|$. An approximation to the Earth Mover’s Distance can, however, be computed in linear time [SJ08]. The Earth Mover’s Distance is a metric, provided that the ground distance is a metric and the compared signatures are normalized to the same total weight.
In the next section, we present another distance measure on feature signatures that is only slightly less effective, but can be computed in quadratic time: the Signature Quadratic Form Distance.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content based multimedia retrieval. PhD thesis, Aachen, 2013. Summary in German and English; Aachen, Techn. Hochsch., Diss., 2013.
[RTG00] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
[Van01] Robert J Vanderbei. Linear programming. Foundations and extensions, International Series in Operations Research & Management Science, 37, 2001.
[SJ08] Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s distance in linear time. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
]]>The post Building a Content-Based Multimedia Search Engine III: Feature Signatures appeared first on deep ideas.
]]>When computing the distance between two multimedia objects, it would be highly inefficient to take into account all of the extracted feature vectors. For most practical purposes, however, it is not necessary to do this in order to achieve a good discriminability of the objects. Most of the vectors carry redundant information or fine-grained details that do not have a significant influence on the overall similarity of two objects. Hence, we summarize all of the extracted features into a structure called a feature signature.
Intuitively, a feature signature is characterized by a relatively small set of feature vectors, called the representatives, along with a weight for each representative.
Formally, if we let $\mathbb{F}$ denote the set of all possible features, a feature signature $X$ is a function $X : \mathbb{F} \rightarrow \mathbb{R}$ such that $|\{f \in \mathbb{F} \; | \; X(f) \neq 0\}| < \infty$ (i.e. it assigns a non-zero weight to only a finite number of vectors and assigns 0 everywhere else). We refer to $R_X = \{f \in \mathbb{F} \; | \; X(f) \neq 0\}$ as the representatives of $X$. We use $\mathbb{S}$ to refer to the set of all feature signatures.
A common way to calculate a feature signature is to apply a clustering algorithm (e.g. k-means) to the extracted set of feature vectors. From the resulting clustering, we devise a feature signature by defining the cluster means as the representatives and assigning them a weight corresponding to the relative size of the cluster, i.e. the number of cluster elements divided by the total number of extracted features. Here is an example depicting this process for 2-dimensional feature vectors:
First, the vectors are clustered, yielding 3 clusters (red, green and blue). Then we compute the cluster centers (depicted as the large red, green and blue dots), which we define as the representatives of the feature signature S, and assign them a weight corresponding to the relative cluster size (i.e. the number of feature vectors in the cluster divided by the total number of feature vectors).
Let’s formalize this process. Let $C = \{C_1, \dots, C_m\}$ be a clustering of feature vectors. We define the clustering-induced normalized feature signature $X_C$ as $X_C : \mathbb{F} \rightarrow \mathbb{R}$ with:
$$
X_C(f)=
\begin{cases}
\frac{|C_i|}{\sum_{1 \leq j \leq m} |C_j|} & \text{if } f = \frac{1}{|C_i|} \sum_{g \in C_i} g\\
0 & \text{else}
\end{cases}
$$
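As a minimal sketch of this definition, the following function turns an already computed clustering into a feature signature. It assumes the clustering is given as a list of non-empty lists of feature tuples (for example, the result of running k-means beforehand) and stores the signature as a dict from representative to weight. The function name is chosen for this example only.

```python
def signature_from_clusters(clusters):
    """Build a normalized feature signature from a clustering.

    clusters is a list of non-empty lists of equal-length feature tuples.
    Returns a dict mapping each cluster mean to the relative cluster size.
    """
    total = sum(len(cluster) for cluster in clusters)
    signature = {}
    for cluster in clusters:
        dims = len(cluster[0])
        # The representative is the component-wise mean of the cluster.
        mean = tuple(sum(v[d] for v in cluster) / len(cluster)
                     for d in range(dims))
        # Accumulate, in case two clusters happen to share the same mean.
        signature[mean] = signature.get(mean, 0.0) + len(cluster) / total
    return signature


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 4.0), (6.0, 4.0), (5.0, 7.0), (5.0, 5.0)],
]
sig = signature_from_clusters(clusters)
print(sig)  # two representatives with weights 1/3 and 2/3
```

Because the weights are relative cluster sizes, they always sum to 1, which is what makes the signature normalized.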
Before applying the clustering algorithm, we multiply each dimension by a certain weight, which allows us to control the importance of that dimension for the clustering. When using k-means to calculate the clustering, we can specify the desired number of representatives k in advance. This allows us to control the expressiveness of the feature signature. The higher we choose k, the more expressive the feature signature gets, with the downside of increasing the storage size and the computational complexity of the distance computation. Both the effectiveness (i.e. adequacy of the results) and the query processing time increase monotonically with k. Hence, k allows us to control the tradeoff between effectiveness and efficiency.
The following image depicts 3D visualizations of two videos and their feature signatures with k = 100. Here, the clusters are represented as spheres. Their position in the 3D coordinate system corresponds to the position (x and y) and the time (t), the color of the sphere corresponds to the L*a*b* color dimensions of the representatives and the volume of the sphere corresponds to the weight that the feature signature assigns to the representative. As we can see, the feature signature summarizes the visual content of the video as it unfolds over time using just 100 vectors.
We have seen how feature signatures reduce the rather large amount of information inherent in the feature vectors into a compact representation that still reveals a lot of information about the feature distribution, since it summarizes how many feature vectors are located at which locations in the feature space. In the next section, we will see how we can compute the similarity between two feature signatures. Continue with the next section: Earth Mover’s Distance
]]>The post Building a Content-Based Multimedia Search Engine II: Extracting Feature Vectors appeared first on deep ideas.
]]>This is part 2 in a series of tutorials in which we learn how to build a content-based search engine that retrieves multimedia objects based on their content rather than based on keywords, title or meta description.
In the previous section, we saw how similarity between multimedia objects can be formalized and which types of queries exist with respect to this formalization. In a step towards efficiently computing similarity between two multimedia objects, we are now going to see how we can characterize the contents of individual multimedia objects (in our example, a video) by extracting a set of so-called feature vectors, which are vectors from the Euclidean space that describe certain local characteristic properties.
Since we are interested in visual similarity between two videos, our goal is to extract a set of vectors, each of which describes certain local visual properties of the video numerically. This process can be depicted visually as follows:
We first select a certain number of sample frames from the video (e.g. 10 frames per second). For each of these frames, we select a fixed number of equidistant sample pixels. Finally, for each sample pixel, we compute an 8-dimensional Euclidean vector $(x, y, L, a, b, \chi, \eta, t)$ describing the visual appearance of the pixel and its context. The choice of this vector is just a suggestion and it isn’t necessary to include all of the features presented here.
The first two dimensions of this vector correspond to the x and y coordinates of the pixel inside the frame. The next 3 dimensions correspond to the color of the pixel in the L*a*b* color space, i.e. the lightness, the position between red and green and the position between blue and yellow (cf. [Wik15b]). The reason we choose this color space instead of e.g. RGB is the fact that Euclidean distances in this space have a significantly higher correlation with perceptual dissimilarity than other color spaces, making it more suitable for our task of measuring visual similarity. Additionally, we calculate the contrast $\chi$ of a 12 x 12 neighborhood of the pixel as proposed by Tamura et al. in [TMY78], which is a measure of the dynamic range of the colors. Furthermore, we calculate the coarseness $\eta$ of the pixel, as proposed in [TMY78], which is a measure of how big the structures surrounding that pixel are. Finally, we add the time t of the frame from which the pixel was sampled as another dimension (in seconds from the beginning of the video).
The whole set of extracted feature vectors, then, comprises a summary of how the visual contents of the video unfold over time.
The entries of the vectors all measure different aspects and stem from different ranges. Since we want all dimensions to have equal importance in the distance computations, irrespective of their value range, we normalize all 8 dimensions individually, yielding a vector whose entries lie between 0 and 1: The positions x and y are divided by the image width and height, respectively. The L* color coordinate ranges from 0 to 100 and is hence divided by 100. The a* and b* color coordinates range from -128 to 127. Therefore, we add 128 and divide by 255. The contrast $\chi$ ranges from 0 to 128 and is therefore divided by 128. The coarseness $\eta$ ranges from 0 to 5 and is hence divided by 5. Finally, the time is divided by the video duration.
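The normalization rules above translate directly into code. The following sketch assumes a raw feature vector is given as a tuple in the order $(x, y, L, a, b, \chi, \eta, t)$ and that the frame size and video duration are known. The function and parameter names are illustrative only.

```python
def normalize_feature(v, width, height, duration):
    """Scale every dimension of a raw feature vector to the range [0, 1]."""
    x, y, L, a, b, chi, eta, t = v
    return (
        x / width,          # horizontal position relative to the frame width
        y / height,         # vertical position relative to the frame height
        L / 100.0,          # L* ranges from 0 to 100
        (a + 128) / 255.0,  # a* ranges from -128 to 127
        (b + 128) / 255.0,  # b* ranges from -128 to 127
        chi / 128.0,        # contrast ranges from 0 to 128
        eta / 5.0,          # coarseness ranges from 0 to 5
        t / duration,       # time relative to the video duration
    )


raw = (320, 120, 50.0, -128, 127, 64.0, 2.5, 30.0)
print(normalize_feature(raw, width=640, height=480, duration=60.0))
# (0.5, 0.25, 0.5, 0.0, 1.0, 0.5, 0.5, 0.5)
```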
The first 7 dimensions that describe a pixel in the context of its frame have yielded high effectiveness for the task of retrieving visually similar images (cf. [BUS10a]) and were hence adopted. Since a video can be thought of as a generalization of an image along another dimension (the time dimension), the image retrieval approach was extended simply by adding the time as another dimension to the feature vectors. The rationale for this is that there is no conceptual difference between the spatial dimensions and the time dimension. A video can be imagined to be an image changing over time. The fact that a video is usually represented as a sequence of frames is just a way to store a video digitally, and it has led many video retrieval approaches to base their video representations on frame sequences, even though semantically a video can reasonably be treated as an image changing continuously over time rather than as a sequence of images.
We now know how we can express local visual properties of a video by means of a set of feature vectors. In the next section, we will see how we can summarize these vectors into a more compact representation scheme that allows us to store the contents of the video using less space, and to compute the visual similarity between two videos more efficiently. Continue with the next section: Feature Signatures
[Wik15b] Wikipedia. Lab color space http://en.wikipedia.org/wiki/Lab_color_space, 2015.
[TMY78] Hideyuki Tamura, Shunji Mori, and Takashi Yamawaki. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 8(6):460–473, 1978.
[BUS10a] Christian Beecks, Merih Seran Uysal, and Thomas Seidl. A comparative study of similarity measures for content-based multimedia retrieval. In Multimedia and Expo (ICME), 2010 IEEE International Conference on, pages 1552–1557. IEEE, 2010.
]]>The post Building a Content-Based Multimedia Search Engine I: Quantifying Similarity appeared first on deep ideas.
]]>The explosion of user-generated content on the internet during the last decades has left the world of querying multimedia data with unprecedented challenges. There is a demand for this data to be processed and indexed in order to make it available for different types of queries, whilst ensuring acceptable response times.
An arguably important task is the retrieval of multimedia objects (e.g. images or videos) that are visually similar to a certain query object (e.g. a query image or a query video). We define two multimedia objects to be visually similar if they depict contents that “look similar” to humans. So far, this task has gained comparatively little research recognition.
Most of the major search engines or content suggestion engines only allow for text-based queries and the search only considers metadata such as title, description text or user-specified tags. The content of the multimedia objects is not taken into account. Such systems are very limited with respect to the types of queries that are possible, and with respect to the actual relevance of the retrieved results.
In this series of tutorials, I introduce a method for retrieving visually similar multimedia objects to a specified query object. This method is based on so-called feature signatures, which comprise an expressive summary of the content of a multimedia object that is significantly more compact than the object itself, allowing for an efficient comparison between objects. This method is applicable to virtually all kinds of multimedia objects. In the course of this tutorial, we’ll take content-based video similarity search as our main example. However, it should be clear how to adapt this method to your particular needs.
This section introduces some fundamental preliminaries for the problem of retrieving similar multimedia objects to a given query object. We will review how similarity between multimedia objects can be formalized and which types of queries exist with respect to this formalization.
In order to retrieve similar multimedia objects to a given query object, we need a way to compare the query object to the database objects and quantify the similarity or dissimilarity numerically. There are many ways in which similarity or dissimilarity between two objects can be measured, and it is highly dependent on the nature of the compared objects and on the aspects which we want to compare. For example, videos could be compared with respect to their visual content, their auditory content or meta-data such as the title or a description text.
The most common way to model similarity is by means of a distance function. A distance function assigns high values to objects that are dissimilar and small values to objects that are similar, reaching 0 when the two compared objects are the same. Mathematically, a distance function is defined as follows:
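Let $X$ be a set. A function $\delta : X \times X \rightarrow \mathbb{R}$ is called a distance function if it holds for all $x, y \in X$:
Reflexivity: $\delta(x, x) = 0$
Non-negativity: $\delta(x, y) \geq 0$
Symmetry: $\delta(x, y) = \delta(y, x)$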
When it comes to efficient query processing, as we will see later, it is useful if the utilized distance function is a metric.
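Let $\delta : X \times X \rightarrow \mathbb{R}$ be a distance function. $\delta$ is called a metric if it holds for all $x, y, z \in X$:
Identity of indiscernibles: $\delta(x, y) = 0 \Leftrightarrow x = y$
Triangle inequality: $\delta(x, z) \leq \delta(x, y) + \delta(y, z)$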
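To get a feeling for these properties, the following sketch spot-checks a candidate distance function on a small sample of points. It tests non-negativity, symmetry, the identity of indiscernibles and the triangle inequality. Passing the check on a sample does not prove that a function is a metric, but a single violation shows that it is not. The helper names are made up for this example.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def first_violation(delta, points, tol=1e-12):
    """Return the name of the first violated property, or None if all hold."""
    for x, y in itertools.product(points, repeat=2):
        d = delta(x, y)
        if d < -tol:
            return "non-negativity"
        if abs(d - delta(y, x)) > tol:
            return "symmetry"
        if (d <= tol) != (x == y):
            return "identity of indiscernibles"
    for x, y, z in itertools.product(points, repeat=3):
        if delta(x, z) > delta(x, y) + delta(y, z) + tol:
            return "triangle inequality"
    return None


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(first_violation(euclidean, points))  # None
print(first_violation(manhattan, points))  # None
# The squared Euclidean distance is a distance function, but not a metric:
print(first_violation(lambda x, y: euclidean(x, y) ** 2, points))  # triangle inequality
```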
An alternative way to model similarity between two objects is by means of a similarity function, which assigns small values to objects that are dissimilar and larger values to objects that are more similar, reaching its maximum when the two compared objects are the same (cf. [BS13]).
Let X be a set. A function $s : X \times X \rightarrow \mathbb{R}$ is called a similarity function if it is symmetric and if it holds for all $x, y \in X$ that $s(x, x) \geq s(x, y)$ (maximum self-similarity).
Once we have modeled the similarity for pairs of multimedia objects by means of a distance function, we can reformulate the problem of retrieving similar objects to the query object by utilizing such a function. A prominent query type is the so-called range query, which retrieves all database objects for which the distance to the query object lies below a certain threshold. The formal definition is given below (adopted from [BS13]).
Let $X$ be a set of objects, $\delta : X \times X \rightarrow \mathbb{R}$ be a distance function, $DB \subseteq X$ be a database of objects, $q \in X$ be a query object and $\epsilon \in \mathbb{R}$ be a search radius. The range query $range_\epsilon(q, \delta, DB)$ is defined as
$range_\epsilon(q, \delta, DB) = \{x \in DB \; | \; \delta(q, x) \leq \epsilon\}$
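A range query can be sketched as a simple linear scan over the database. The example below uses real numbers as objects and the absolute difference as the distance function, both chosen only for illustration.

```python
def range_query(q, delta, db, epsilon):
    """Return all database objects whose distance to q is at most epsilon."""
    return [x for x in db if delta(q, x) <= epsilon]


def absolute_difference(a, b):
    return abs(a - b)


db = [1.0, 2.5, 4.0, 7.0]
print(range_query(3.0, absolute_difference, db, epsilon=1.0))   # [2.5, 4.0]
print(range_query(3.0, absolute_difference, db, epsilon=0.1))   # []
print(range_query(3.0, absolute_difference, db, epsilon=10.0))  # [1.0, 2.5, 4.0, 7.0]
```

The last two calls show how strongly the size of the result set depends on the chosen search radius.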
For range queries, it is hard to determine a suitable threshold $\epsilon$ to yield a result set of a desired size. When $\epsilon$ is too low, the result set might be very small or even empty. On the other hand, when choosing it too large, the result set might come near to including the entire database. This problem can be solved by issuing a k-Nearest Neighbor Query (short: kNN query) instead. In this query type, we specify the desired number of retrieved objects $k$ instead of a distance threshold. If we assume that the distances between the query object and the database objects are pairwise distinct, the k-Nearest Neighbors are the $k$ objects that have the smallest distance to the query object. The formal definition is given below (adopted from [SK98]).
Let $X$ be a set of objects, $\delta : X \times X \rightarrow \mathbb{R}$ be a distance function, $DB \subseteq X$ be a database of objects, $q \in X$ be a query object and $k \in \mathbb{N}, k \leq |DB|$. We define the k-Nearest Neighbors of $q$ w.r.t. $\delta$ as the smallest set $NN_q(k) \subseteq DB$ with $|NN_q(k)| \geq k$ such that the following holds:
$\forall o \in NN_q(k), \forall o^\prime \in DB \setminus NN_q(k) : \delta(o, q) < \delta(o^\prime , q)$
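The definition can be sketched as a linear scan that sorts the database by distance to the query object. Note that the definition requires every returned object to be strictly closer than every object that is not returned, so objects tied with the k-th neighbor are included and the result may contain more than $k$ objects. The objects, the distance function and the function names below are illustrative assumptions.

```python
def knn_query(q, delta, db, k):
    """Return the k nearest neighbors of q, including ties at the k-th distance."""
    if not 1 <= k <= len(db):
        raise ValueError("k must be between 1 and the database size")
    ranked = sorted(db, key=lambda x: delta(x, q))
    cutoff = delta(ranked[k - 1], q)  # distance of the k-th closest object
    return [x for x in ranked if delta(x, q) <= cutoff]


def absolute_difference(a, b):
    return abs(a - b)


print(knn_query(3.0, absolute_difference, [1.0, 2.5, 4.0, 7.0], k=2))  # [2.5, 4.0]
print(knn_query(3.0, absolute_difference, [2.0, 4.0, 10.0], k=1))      # [2.0, 4.0] (tie)
```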
Our goal now is to devise a distance function that reflects human judgement of similarity. In the next section, we will learn how to extract features from multimedia objects, which are sets of vectors that characterize the content of that object. Continue with the next section: Extracting Feature Vectors.
[BS13] Christian Beecks and Thomas Seidl. Distance based similarity models for content based multimedia retrieval. PhD thesis, Aachen, 2013. Summary in German and English; Aachen, Techn. Hochsch., Diss., 2013.
[SK98] Thomas Seidl and Hans-Peter Kriegel. Optimal multi-step k-nearest neighbor search. In ACM SIGMOD Record, volume 27, pages 154–165. ACM, 1998.
]]>The post Deep Learning From Scratch VI: TensorFlow appeared first on deep ideas.
]]>It is now time to say goodbye to our own toy library and start to get professional by switching to the actual TensorFlow.
As we’ve learned already, TensorFlow conceptually works exactly the same as our implementation. So why not just stick to our own implementation? There are a couple of reasons:
TensorFlow is the product of years of effort in providing efficient implementations for all the algorithms relevant to our purposes. Fortunately, there are experts at Google whose everyday job is to optimize these implementations. We do not need to know all of these details. We only have to know what the algorithms do conceptually (which we do now) and how to call them.
TensorFlow allows us to train our neural networks on the GPU (graphics processing unit), resulting in an enormous speedup through massive parallelization.
Google also builds Tensor Processing Units (TPUs), which are integrated circuits specifically built to run and train TensorFlow graphs, resulting in an even larger speedup.
TensorFlow comes pre-equipped with a lot of neural network architectures that would be cumbersome to build on our own.
TensorFlow comes with a high-level API called Keras that allows us to build neural network architectures way easier than by defining the computational graph by hand, as we did up until now.
So let’s get started. Installing TensorFlow is very easy.
pip install tensorflow
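If we want GPU acceleration, we have to install the package tensorflow-gpu instead (this applies to the TensorFlow 1.x releases this tutorial is based on; TensorFlow 2.x ships GPU support in the main tensorflow package):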
pip install tensorflow-gpu
In our code, we import it as follows:
import tensorflow as tf
Since the syntax we are used to from the previous sections mimics the TensorFlow syntax, we already know how to use TensorFlow. We only have to make the following changes:
Add tf. to the front of all our function calls and classes
Call session.run(tf.global_variables_initializer()) after building the graph
The rest is exactly the same. Let’s recreate the multi-layer perceptron from the previous section using TensorFlow:
In the next lesson, we will learn about Keras, which is a high-level API on top of TensorFlow that allows us to define and train neural networks more abstractly, without having to specify the internal composition of all the operations every time.
]]>The post Connectionist Models of Cognition appeared first on deep ideas.
]]>In this video, I give an introduction to the field of computational cognitive modeling (i.e. modeling minds through algorithms) in general, and connectionist modeling (i.e. using artificial neural networks for the modeling) in particular. We deal with the following topics:
]]>The post Robot Localization IV: The Particle Filter appeared first on deep ideas.
]]>The last filtering algorithm we are going to discuss is the Particle Filter. It is also an instance of the Bayes Filter and in some ways superior to both the Histogram Filter and the Kalman Filter. For instance, it is capable of handling continuous state spaces like the Kalman Filter. Unlike the Kalman Filter, however, it is capable of approximately representing arbitrary belief distributions, not only normal distributions. It is therefore suitable for non-linear dynamic systems as well.
The idea of the Particle Filter is to approximate the belief $bel(x_t)$ as a set of $n$ so-called particles $p_t^{[i]} \in dom(x_t)$: $\chi_t := \{ p_t^{[1]}, p_t^{[2]}, \dots, p_t^{[n]} \}$. Each of these particles is a concrete guess of the actual state vector. At each time step, the particles are randomly sampled from the state space in such a way that $P(p_t^{[i]} \in \chi_t)$ is proportional to $P(x_t = p_t^{[i]} \, \vert \, e_{1:t})$.
This means that the probability of a particle being included in $\chi_t$ is proportional to the probability of it being the correct representation of the state, given the sensor measurements so far. This way, the update step can be thought of as a process similar to the evolutionary mechanism of natural selection: Strong theories that are compatible with the new measurement are likely to live on and reproduce, whereas poor theories are likely to die out. As a result, the particles tend to cluster around strong theories. We will see a visual example of this later.
We take the same approach as we did with all the previous Bayes Filters. First, we calculate a particle representation of $\overline{bel}(x_{t+1})$ from $\chi_t$, which we denote $\overline{\chi}_{t+1}$: For each particle $p_t^{[i]} \in \chi_t$, we sample a new particle $\overline{p}_{t+1}^{[i]}$ from the distribution $P(x_{t+1} \, \vert \, x_t = p_t^{[i]})$, which can be obtained from the transition model. We put all these new particles into the set $\overline{\chi}_{t+1}$.
As an example, let’s consider a moving robot in one dimension. The state contains only one variable, the location. From time $t$ to $t + 1$ the robot has moved an expected distance of 1 meter to the right with Gaussian movement noise. In this case we would just add 1 to the locations of all the particles plus a random number that is sampled from the transition model.
Now we calculate the particle representation of $bel(x_{t+1})$, namely $\chi_{t+1}$, from $\overline{\chi}_{t+1}$. The key idea here is to assign a so-called importance weight, denoted $\omega^{[i]}$, to each of the particles in $\overline{\chi}_{t+1}$. This importance weight is a measure of how compatible the particle $\overline{p}_{t+1}^{[i]}$ is with the new measurement $e_{t+1}$, i.e. the probability $P(e_{t+1} \, \vert \, x_{t+1} = \overline{p}_{t+1}^{[i]})$, which can be obtained from the sensor model. $\chi_{t+1}$ is then constructed by randomly picking $n$ particles from $\overline{\chi}_{t+1}$ with a probability proportional to their weight. The same particle may be picked multiple times. This procedure is called resampling.
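The two steps can be put together in a short sketch for the one-dimensional robot. It assumes Gaussian movement noise and, as an additional assumption made for this illustration, a sensor that measures the position directly with Gaussian noise. The function names, noise levels and numbers are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def predict(particles, movement, movement_std, rng):
    """Prediction step: move every particle and add sampled movement noise."""
    return [p + movement + rng.gauss(0.0, movement_std) for p in particles]


def importance_weights(predicted, measurement, sensor_std):
    """Weight each particle by how well it explains the measurement."""
    return [gaussian_pdf(measurement, p, sensor_std) for p in predicted]


def resample(predicted, weights, rng):
    """Draw n particles with replacement, proportionally to their weights.

    Assumes at least one weight is greater than zero.
    """
    return rng.choices(predicted, weights=weights, k=len(predicted))


def particle_filter_step(particles, movement, measurement,
                         movement_std, sensor_std, rng):
    predicted = predict(particles, movement, movement_std, rng)
    weights = importance_weights(predicted, measurement, sensor_std)
    return resample(predicted, weights, rng)


rng = random.Random(0)                               # fixed seed for repeatability
particles = [rng.uniform(0.0, 5.0) for _ in range(30)]  # uniform initial belief
particles = particle_filter_step(particles, movement=1.0, measurement=3.0,
                                 movement_std=0.5, sensor_std=0.5, rng=rng)
print(sorted(particles))
```

After the step, particles far away from the measurement have mostly disappeared, while particles close to it tend to appear several times.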
We elucidate the Particle Filter with a localization example that’s similar to the Kalman Filter example, i.e. we use the same transition and sensor models as well as the same position and measurement chains. Since the particles are drawn from the state space, they are simply real numbers. This time, we start with a uniform distribution over the interval $[0, 5]$. In this instance, we use 30 particles. For obvious reasons, a numerical representation of the particle sets at each time step will not be given, but a graphical representation can be seen in the following figure. Each of the black/gray lines represents one or more particles. Since multiple particles can fall on the same pixel, the opacities of the lines are proportional to the number of particles on that pixel. Again, the blue line represents the actual position and the red graph represents $P(x_t \, \vert \, e_t)$.
In this series of articles, we have introduced the Bayes Filter as a means to maintain a belief about the state of a system over time and periodically update it according to how the state evolves and which observations are made. We came across the problem that, for a continuous state space, the belief could generally not be represented in a computationally tractable way. We saw three solutions to this problem, all of which have their advantages and disadvantages.
The first solution, the Histogram Filter, solves the problem by slicing the state space into a finite amount of bins and representing the belief as a discrete probability distribution over these bins. This allows us to approximately represent arbitrary probability distributions.
The second solution, the Kalman Filter, assumes the transition and sensor models to be linear Gaussians and the initial belief to be Gaussian, which makes it inapplicable for non-linear dynamic systems, at least in its original form. As we showed, this assumption ensures that the belief distribution is always a Gaussian and can thus be represented by a mean and a variance only, which is very memory efficient.
The last solution, the Particle Filter, solves the problem by representing the belief as a finite set of guesses at the state, which are approximately distributed according to the actual belief distribution and are therefore a good representation for it. Like the Histogram Filter, it is able to represent arbitrary belief distributions, with the difference that the state space is not binned and therefore the approximation is more accurate.
[NORVIG] Peter Norvig, Stuart Russell (2010) Artificial Intelligence – A Modern Approach. 3rd edition, Prentice Hall International
[THRUN] Sebastian Thrun, Wolfram Burgard, Dieter Fox (2005) Probabilistic Robotics
[NEGENBORN] Rudy Negenborn (2003) Robot Localization and Kalman Filters
[DEGROOT] Morris DeGroot, Mark Schervish (2012) Probability and Statistics. 4th edition, Addison-Wesley
[BESSIERE] Pierre Bessière, Christian Laugier, Roland Siegwart (2008) Probabilistic Reasoning and Decision Making in Sensory-Motor Systems
]]>