What is Kalman Filter? | A Bayesian Probabilistic View

SUBTITLES:

Subtitles prepared by human
00:01
Welcome to this tutorial on the Kalman filter. Today I'm going to explain the Kalman filter from a Bayesian probabilistic viewpoint. We will end up formulating a generic framework for state estimation called Bayesian filtering, and then from that viewpoint realize that the Kalman filter is just a special case of that generic formulation. But before we embark on this journey, let's be clear about the mission of Bayesian filters. A Bayesian filter, and by extension the Kalman filter, is an algorithm to perform state estimation using measurements. These measurements are of course quite often noisy, that is, they have errors. I am emphasizing this mission statement instead of a simpler or working one that suggests the goal of the Kalman filter is to reduce the noise in the measurement. While correct in its own way, the formulation I have stated enables many other use
01:04
cases. For example, what about the situation when measurements go missing? Maybe having an estimate of the state could be enough. And of course, having both estimated states and measurements is better than having just one of them. This also begs the question: what is this state thingy after all? Let me start by first going over the notations and terminologies, as well as some essential building blocks, so we have a good foundation to build upon. This foundation should be helpful not only for this tutorial but for many other concepts in Bayesian probabilistic modeling. I start with our first notation, which is X. If I put a circle around it, it means that X is a random variable. More specifically, it's an unobserved random variable. If I fill the circle with some color, then it means that X is an observed random variable.
02:07
In control theory and many other fields, we call an observed random variable the measurement. Essentially, this is what your sensor will give you as its output or reading. The other symbol of interest is Z, again a random variable. Since the circle is not filled, it is an unobserved random variable. In statistics this type of random variable is called a latent random variable, and from the control theory viewpoint it is called the state. That said, there is one problem here. The issue is that the notation used by people on the Bayesian statistics side is the opposite of what you would see in the control theory literature, where the Kalman filter was invented. As I am showing you on the screen, in control theory X is called the state and Z is called the measurement. So which one should I choose from this point onwards?
03:07
I firmly believe that paying attention to notations and being compliant with them has a lot of virtue, but this time around I have a dilemma. Since this tutorial is primarily about first creating a generic formulation from the Bayesian view on filtering, I am not able to bring myself to use the symbol Z for the measurement, as is done in control theory. The reason is that in both machine learning and Bayesian statistics, the symbol Z signifies the unknown variable, the latent random variable. So in order to find some middle ground, I'm going to use the symbol S to indicate the state and M for the measurement. The next thing to wonder about is the relationship between these two random variables: the observed one, that is M, and the unobserved one, that is S.
04:11
One simple example of a relationship could be that the measurement M is a distorted reflection of the state variable. I say reflection because the sensors we use to measure physical phenomena are not always capturing them accurately, and this equation relating the two that I have put on screen may quantify that potential relationship. I keep on saying potential relationship because the actual relationship could be way more sophisticated than what I am showing here, and it also depends on what you are modeling. This arrangement of latent and observed variables, with an arrow originating from state to measurement, shows a causal relationship as well, which means that the state could be more than just another version of the measurement. For example, maybe the measurement is the location of a robot in a room given by a sensor, whereas the state is the speed or velocity at which the robot is moving. Clearly the location of the robot was caused by the movement, and one influencing factor could
05:15
have been the velocity of the robot. In simple words, the state does not need to be the same information as what you are measuring. That said, the state here is seen as the cause, and if we can find the cause, it opens a few more opportunities to improve both the modeling of our physical phenomena as well as the algorithms that operate on the model. If we have a good state estimate, then maybe we can work around the steps where our measurement goes missing, as we know that the measurement was a consequence of the state. In Bayesian statistics, this type of formulation, finding unobserved variables using observed variables, is called a latent variable model, where the goal is to first estimate S, the latent variable, and then include it as part of the full parameter estimation. As mentioned earlier, the primary task in a latent variable model or a state variable
06:18
model is to find the state S, and one thought could be to use the measurement to find it. In statistics, the process of finding unknown variables using known variables is called inference. Unfortunately, in machine learning the words inference and prediction are used interchangeably, but they don't really mean the same thing. You predict the label of a class in a classification problem, or you predict the output given some inputs in a regression problem, whereas you infer the weights or unknown variables that relate the inputs to the outputs. So when I say inference, it means that we are after finding the unknown variables of our process, which happen to be S in this case. Also, please note the dashed arrow. This is a convention to indicate that you are doing inference, or that you are in inference mode. Another way to look at this is to say that given an effect M, you aim to find the
07:20
cause of it, that is, the state S. How does one find the cause given an effect? This is where we will take help from the amazing Thomas Bayes, as this is the kind of problem he was chasing. He was curious about understanding the relationship between two events and how one could get one event given the other. Bayes rule, or Bayes theorem, is what will help us get our state from the measurement, so let's see how we could formulate our problem using his rule. This is one way that we can use our random variables, that is S and M, in Bayes rule. Now here there are a few more terms to be understood, as they are not necessarily captured by the original Bayes rule that works on events. Here the formulation indicates that we first express our belief about the state, S, without seeing any data, and it's called the prior.
08:25
The likelihood is the probability of getting the measurement given the state, and the result is what is called the posterior. That is, it is an updated prior after taking into consideration the measurements. Now we need to divide the product of likelihood and prior by a normalizing constant so that the posterior is a valid probability distribution. We will get to revisit this Bayes rule a few more times during the tutorial, so more details will be explained at the appropriate moment. For now, what is important to understand is the fact that you start with an assumption or belief about your state, and this belief is updated when you observe the measurements. The updated belief is called the posterior. Coming back to our latent variable or state estimation model, let's spend a minute or two looking at how we can go beyond one random variable.
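Before scaling up to multiple variables, the prior-likelihood-posterior update just described can be sketched numerically. This is a minimal discrete sketch with made-up numbers, not something taken from the video:

```python
import numpy as np

# Hypothetical discrete state: a robot at one of 5 positions.
prior = np.full(5, 0.2)  # belief before seeing any data: uniform

# Likelihood p(M | S): a noisy sensor reading that favors position 2.
likelihood = np.array([0.05, 0.2, 0.5, 0.2, 0.05])

# Bayes rule: posterior is likelihood * prior, divided by a normalizing
# constant so that the posterior is a valid probability distribution.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print(posterior.argmax())  # the updated belief now favors position 2
```

The normalizing division is exactly the step that makes the posterior a valid distribution, which is the point made above.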
09:30
That is, to scale our thinking to when we have more than one random variable. Here I have picked the measurement variable as an example, but as such this next concept can be explained using the state variables as well; there is nothing specific about the measurement variable here. That said, a phenomenon generally consists of many random variables, and in some cases, while they jointly result in a phenomenon, they are independent of each other. In that case we can express the joint probability like this, that is, simply as the product of the individual or marginal probabilities of our random variables. This makes the computations and the factoring very easy, but it is not necessarily reflective of the sophistication and the complexities that our physical world has. Here is a description of a phenomenon that assumes a relationship between random variables.
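As a quick aside on the independent case just mentioned, the factorization can be sketched with a toy example (made-up numbers): the joint table is just the outer product of the marginals.

```python
import numpy as np

# Three independent binary measurements, each with its own marginal p(m).
p_m1 = np.array([0.7, 0.3])
p_m2 = np.array([0.5, 0.5])
p_m3 = np.array([0.9, 0.1])

# Independence: p(m1, m2, m3) = p(m1) * p(m2) * p(m3) for every combination.
joint = p_m1[:, None, None] * p_m2[None, :, None] * p_m3[None, None, :]

print(joint[0, 0, 0])  # 0.7 * 0.5 * 0.9 ≈ 0.315
print(joint.sum())     # sums to 1: still a valid joint distribution
```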
10:33
For example, when we think about sequential data, a time series or a signal, a measurement or state at a time step depends on the measurement or state at previous steps. In that case, we formulate the probability at a time step as a conditional probability. For example, here the graph is showing that the measurement at step three depends on the measurement at step two, so I have accordingly specified its conditional probability as M3 conditioned on M2. There is a name for this type of conditional formulation where the conditioning is only on the immediate previous step: it is called a first-order Markov chain. The reality is that there is nothing that prevents considering a situation where there is a dependence on the two previous steps, as shown here. This is called a second-order Markov chain, and you can see that the conditional probability
11:33
at step three is now conditioned on both M2 and M1. That said, higher-order Markov chains, while maybe capturing the causal relationship a bit more accurately than the first-order Markov chain, do have higher computational and memory costs. You should appreciate that both modeling and computational aspects are important in delivering a solution that is usable. In this tutorial, I am only considering the first-order Markov chain for both the measurement and the state sequence variables. Time to bring back our state random variables, and this time around they appear as a first-order Markov chain. That is, our hypothesis is that the state at a given step depends only on the state immediately before it. As in the earlier graph, where we had a causal relationship between state S and measurement M, let's form that connection as well. Now, given this formulation, the next hypothesis or claim we can make is that maybe it is sufficient
12:41
to only model the causal relationship between the states and assume that measurements at various time steps are independent, which is reflected by the removal of arrows between the measurement steps. Again, this is done to further reduce the computational needs, and the hypothesis kind of makes sense: if we can successfully model the transitions between the states, the transitions between the measurements may be redundant. This type of model, where the states form a Markov chain and the measurements are considered independent, is called a state space model. The arrows between the random variables also indicate that there is a function that is evolving or changing the random variable from one time step to the other, and in state space modeling there is a name for these functions. The function between the state random variables is called the transition function,
13:43
and the one between the state and the measurement is called the emission function. I like both terms as they really capture the essence of what is happening here. Now let's put our state space model in inference mode. See that the arrows are now dashed arrows, indicating that we are trying to figure out the cause given an effect. In order to develop a more general algorithm or formulation, let me pick one time step and call it K. This way we can start with the simple Bayes rule that we have already seen and then generalize it by including a few more steps. Here is the Bayes rule written without any symbols; you should be familiar with this by now. From the Bayesian or latent variable viewpoint, the state at time step K is considered our posterior. The likelihood is easy to understand, and the normalizing constant as well.
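In symbols, using the S/M notation adopted earlier (the exact on-screen form may differ slightly), this per-step Bayes rule can be written as:

```latex
% Posterior over the state at step k, given all measurements up to step k
p(s_k \mid m_{1:k}) =
  \frac{\overbrace{p(m_k \mid s_k)}^{\text{likelihood}}\;
        \overbrace{p(s_k \mid m_{1:k-1})}^{\text{prior}}}
       {\underbrace{p(m_k \mid m_{1:k-1})}_{\text{normalizing constant}}}
```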
14:45
That said, the constant will be different at different time steps and will need to be recomputed. Now, one of the conceptually difficult parts in Bayesian statistics is related to dealing with the prior. The prior is generally explained as the belief about a random variable before you see the data. In this case, where we are modeling a sequential phenomenon, we can be more specific and say that the prior is the belief before seeing the data at step K. That said, the notion of belief could include the measurements and state evolution that have been observed or computed up to step K-1. In terms of symbols, here is how we could indicate it: our prior is the conditional probability of the state at step K given the measurements until K-1. This means that before we can compute our posterior, we should be able to figure out our prior. Now, at the very first step, where we have
15:50
no measurement, this will indeed be an assumption about our state, and you could use domain knowledge and other empirical observations. But after that, we should be able to have an updated belief given the state evolution and the measurements. This is why the computation of this so-called prior is also called the prediction step, and the computation of the posterior using it is called the update step, as, given the new measurements at time step K, we are now updating our belief in the state at step K. So the problem has now moved to: how do we compute this prior at each step? Let me update the diagram of the model to see what are the different parts that we can make use of from the previous step. You can see that we do not see any direct connection between the measurement
16:51
at step K-1 and the state at step K. However, we do see that the arrows from the measurement at K-1 go via the state at K-1. So if we can express our belief using this pathway, that is, using this one level of indirection, then we should be able to further refine its formulation and, more importantly, compute it. This is where we make use of an equation called the Chapman-Kolmogorov equation (I hope I'm saying it right) that helps in representing our conditional probability of the state at K given the measurements until K-1 using a marginal representation. Explaining this equation is out of scope for this tutorial. For now, you can consider this to be a trick to reformulate our problem in terms of the
17:52
components that we have access to. The only thing I would say or recommend is: please do not be intimidated by the name of this equation. At its heart, it is really a simple trick that can be easily understood, provided you understand the notion of marginalization in probability theory. Given that we have a way to compute the prior, or perform the predict step, using this new formulation thanks to the Chapman-Kolmogorov equation, we can move on and list the predict step and the update step like this. And this, my friend, is called Bayesian filtering. These two Bayesian filtering equations can now be solved or computed using a variety of algorithms and techniques. You should notice that I have kept the notation very generic. I have not said anything about the kind of probability distributions that your state or measurement random variables follow, or for that matter what the transition function
18:55
and emission function would look like. So far this is a Bayesian treatment of state space models, with the primary goal of estimating the state, that is, the posterior at various time steps. Now we will talk about the Kalman filter, and to help you understand it, I need to make a few minor notational adjustments to our state space model, so let's look at it one more time. This is our state space model showing the state variables, the measurement variables, the transition function between the state variables, and the emission function between the state variable and the measurement variable for each time step. The first modification I am making here is to say that both the transition and emission functions are linear, and the second assumption is that the state random variable as well as the measurement
19:59
random variable follow the Gaussian distribution. A state space model with these two assumptions is called a linear Gaussian state space model. Please appreciate that nothing has changed in terms of the connectivity between the random variables or the structure of the model as a whole. It's just that I have put some assumptions, or rather constraints, on the functions, that is, the transition and emission functions, and on the choice of distribution for our random variables: both the state and measurement variables should follow the Gaussian distribution. Now, the Kalman filter is a Bayesian filtering algorithm for state estimation that assumes that your state space model is a linear Gaussian state space model. The next obvious thing to ask is why the Kalman filter algorithm makes these assumptions.
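For reference, the two Bayesian filtering steps discussed earlier can be written as follows (a standard form, with the notation assumed from this discussion; the on-screen version may differ):

```latex
% Predict step (Chapman-Kolmogorov): belief about s_k before seeing m_k
p(s_k \mid m_{1:k-1}) = \int p(s_k \mid s_{k-1})\, p(s_{k-1} \mid m_{1:k-1})\, \mathrm{d}s_{k-1}

% Update step (Bayes rule): refine that belief with the new measurement m_k
p(s_k \mid m_{1:k}) \propto p(m_k \mid s_k)\, p(s_k \mid m_{1:k-1})
```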
21:04
So let's look again at the Bayesian filtering equations to get our answers. The trouble here is that even though the formulation of state estimation from the Bayesian perspective is very intuitive and simple, it is not computationally friendly. The integrals during the predict step, and the normalizing constant, which, by the way, also expands to an integral, are difficult to compute, especially when you are dealing with multivariate or multi-dimensional problems. Here is the same point, just putting the Bayes formulation on the spot. Now, there is a scenario where we can work around this devil, that is, the computation of high-dimensional integrals, and that is when we assume that our prior and the likelihood follow the Gaussian distribution. In that case we know that the posterior is going to be Gaussian and we
22:09
have a functional form for computing the normalizing constant for Gaussian functions. That is, there is no need to compute the integral. Well, we do still have a normalizing constant; it's just that we will have an analytical way of computing it instead of computing the integrals. So if our likelihood and prior are Gaussian, we are good. Now, while we are at the topic of the Gaussian distribution, here are a few more things that you should know if you do not already. This is how the Gaussian or normal distribution looks. When you add two normal distributions, you get a normal distribution as an output. When you multiply two normal distributions, you get a normal distribution as an output. And when you have a random variable that follows a normal distribution and it is conditioned on another random variable that also follows a normal distribution, then you again get a normal distribution.
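The multiplication property can be checked numerically. A small sketch (my own illustration, with arbitrary example values) using the known closed form for the product of two Gaussian densities:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two example Gaussians (arbitrary illustrative values).
mu1, var1 = 0.0, 4.0
mu2, var2 = 3.0, 1.0

# Closed form for the product of two Gaussian densities: the result is
# proportional to a Gaussian with a precision-weighted mean.
var_prod = 1.0 / (1.0 / var1 + 1.0 / var2)
mu_prod = var_prod * (mu1 / var1 + mu2 / var2)

# Check: the pointwise product has exactly the shape of N(mu_prod, var_prod),
# i.e. their ratio is a constant (that constant is the normalizing factor).
x = np.linspace(-5.0, 8.0, 200)
ratio = (gaussian_pdf(x, mu1, var1) * gaussian_pdf(x, mu2, var2)
         / gaussian_pdf(x, mu_prod, var_prod))
print(ratio.std() / ratio.mean())  # ~0: the ratio is constant over x
```

This analytical mean/variance combination is precisely what lets the filter avoid computing integrals.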
23:11
These properties are important to know: as you saw, in our Bayesian filtering equations we have conditional distributions, and then we were multiplying them as well, and as long as we have normal distributions as the components in those equations, we have an analytical solution. That is, we have a workaround for our difficult-to-compute integrals. So the reason for the Gaussian assumption should be clear, but this does not explain the linearity. Now, the assumption of linearity is kind of not comforting, as the world we live in and its physical phenomena are highly non-linear. But let's ignore this for a minute and figure out the reason for the linearity assumption. As a matter of fact, the reason for the linear function assumption is also somewhat related to the requirement of using the Gaussian distribution.
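This connection can be checked numerically as well. Here is a small sketch of my own (not the video's exact demo) comparing a linear and a non-linear function applied to Gaussian samples, using skewness as a crude symmetry check:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # Gaussian input data

def skewness(samples):
    # A Gaussian is symmetric, so its skewness is ~0; a clearly skewed
    # output cannot be Gaussian.
    z = (samples - samples.mean()) / samples.std()
    return float(np.mean(z ** 3))

linear = 2.0 * x + 1.0  # linear function: output remains Gaussian
nonlinear = x ** 2      # non-linear function: output is chi-squared, not Gaussian

print(skewness(linear))     # close to 0
print(skewness(nonlinear))  # strongly positive: the Gaussian shape is lost
```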
24:14
Here I am demonstrating that if you have data that follows a Gaussian distribution and this data is passed through a linear function, then the output is still normally distributed. You can see from these histograms that both inputs and outputs are Gaussian as long as the function is linear. On the contrary, if the function is not linear, then we do not have a Gaussian output; you can see this from this demonstration. Let me show you, in the context of our state space model, what is happening. As described earlier, the state evolves across time steps using the transition function. If this transition function is not linear, then the distribution at time step three in this particular figure will not be Gaussian, and then we will not be able to apply our
25:14
Bayesian simplification that was happening thanks to the assumption of a Gaussian distribution. Well, I should not say that we will not be able to apply it; rather, it will be computationally inefficient. And this is why the Kalman filter, which is nothing but a Bayesian filter, has these assumptions on our state space model. That said, let's talk about these linear transition and emission functions. Here, the matrix A is the transition matrix, and there is a process noise, which is assumed to be normally distributed with a mean of zero and a covariance indicated using the symbol Q. The emission function is also linear, and the matrix there is called H. It has a measurement noise, which is also assumed to be normally distributed with a mean of zero and a covariance indicated using the symbol R.
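In equation form (with notation assumed here: lowercase q and r for the noise terms, uppercase Q and R for their covariances), the linear transition and emission functions are:

```latex
s_k = A\, s_{k-1} + q_{k-1}, \qquad q_{k-1} \sim \mathcal{N}(0, Q) \quad \text{(transition)}
m_k = H\, s_k + r_k,         \qquad r_k \sim \mathcal{N}(0, R)     \quad \text{(emission)}
```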
26:15
Now we can represent these using the probabilistic model like this. You can see that, for the conditional probability of the state at step K when it is conditioned on the state at step K-1, we have a normal distribution parameterized with a mean and a covariance. And the same goes for the measurement given the state, which is, by the way, our likelihood. We can now put these into our prediction and update Bayesian filtering equations, which means that we essentially have an analytical form for the various components that these predict and update steps require. Take a moment to see that the components marked in orange are normally distributed and that we know the mean and covariance of these distributions. By the way, you can also notice that there is a recursion element here.
27:17
The conditional probability of the state at step K-1 given the measurements up to step K-1 can be taken directly from the previous update step. So all the components are covered; that's essentially what I'm trying to convey here. The predict and update steps can then be further simplified, each resulting in a normal distribution with its respective mean and covariance. One more time, remember that products and additions of normal distributions result in another normal distribution, and that is what we are observing here for our predict and update steps. The new mean and covariance equations for both steps are shown here. So how did I get to them? I'm not showing you the proof, which is nothing but the simplification of products of normal distributions, as it takes a few pages. But if you are interested in following that, or interested in seeing how somebody gets
28:20
to these kinds of equations, I will put a reference to a paper that you can consult to see the full derivation. Long story short, the Kalman filter ends up providing a closed-form or analytical solution to the state space model. We can implement all these computations even on relatively resource-constrained devices, and this is what makes Kalman filters kind of a special thing. Here is, one more time, the final summary or description of the Kalman filter. Essentially it's a two-step algorithm where, in step one, which we call prediction, we are getting a rough estimate of the state, and then we refine that rough estimate with the help of the measurement, which makes it the update step. There is nothing new on this screen as such, just a summary of what the Kalman filter is. That said, the million-dollar question here is: while we have worked around our computational
29:22
challenges in the Bayes rule, or more specifically in our Bayesian filtering equations, are these assumptions really valid? Well, of course they are valid for some scenarios, but not so much for quite a lot of others. There are extensions and remedies to the Kalman filter which work quite well in many cases, and then not so well in more complex scenarios. More importantly, this is why I focused on obtaining a more generic Bayesian filtering framework, so that we can apply other algorithms when we face non-linearity or use distributions other than the normal distribution. I do plan to make a few more tutorials on some of the extensions of Kalman filtering that provide remedies in the light of non-linearity while still respecting the computational properties and benefits of the Kalman filter. And after that, maybe a tutorial on something called particle filtering.
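To tie the predict and update steps together, here is a minimal scalar (1-D) Kalman filter sketch. All the values (A, H, Q, R, the true state, the initial belief) are made up for illustration; this is not the video's code:

```python
import numpy as np

# Scalar linear Gaussian state space model: A, H, Q, R are plain numbers here.
A, H = 1.0, 1.0      # transition and emission "matrices"
Q, R = 0.01, 1.0     # process and measurement noise variances

def kalman_step(s, P, m):
    # Predict step: push the previous posterior through the transition function.
    s_pred = A * s
    P_pred = A * P * A + Q
    # Update step: refine the prediction using the new measurement m.
    K = P_pred * H / (H * P_pred * H + R)  # Kalman gain
    s_new = s_pred + K * (m - H * s_pred)
    P_new = (1.0 - K * H) * P_pred
    return s_new, P_new

# Noisy measurements of a constant true state of 5.0.
rng = np.random.default_rng(0)
measurements = 5.0 + rng.normal(0.0, np.sqrt(R), size=50)

s, P = 0.0, 1.0  # initial belief: prior mean and variance
for m in measurements:
    s, P = kalman_step(s, P, m)

print(s, P)  # estimate near 5.0, with variance far below the measurement noise R
```

Note how the rough predicted estimate is corrected by the measurement through the Kalman gain: exactly the two-step prediction/update summary given above.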
30:26
And this brings us to the end of this tutorial on understanding the Kalman filter as a special case of Bayesian filtering, and also the modeling assumptions it makes. I hope this was useful and helpful. Please let me know if you have questions and comments, and I will try to help clarify. With that, goodbye for now. See you in another tutorial. Bye bye.