### SUBTITLES:

Subtitles prepared by human

00:01

Welcome to this tutorial on the Kalman filter.
Today I am going to explain the Kalman filter from
a Bayesian probabilistic viewpoint.
We will end up formulating a generic framework
for state estimation called Bayesian filtering,
and then from that viewpoint realize that
the Kalman filter is just a special case
of that generic formulation.
But before we embark on this journey, let's
be clear about the mission of Bayesian filters.
A Bayesian filter, and by extension the Kalman filter,
is an algorithm to perform state estimation
using measurements.
These measurements are of course quite often
noisy, that is, they have errors.
I am emphasizing this mission statement
instead of a simpler, working one
that suggests that the goal of the Kalman
filter is merely to reduce the noise in the measurements.
While that is correct in its own way, the formulation
that I have stated enables many other use

01:04

cases.
For example, what about the situation when
measurements go missing?
Maybe having an estimate of the state could
be enough there.
And of course, having both the estimated states
and the measurements is better than having
just one of them.
This also begs the question, what is this
state thingy after all?
Let me start by first going over the notation
and terminology, as well as some essential building
blocks, so we have a good foundation to build
upon.
This foundation should be helpful not only
for this tutorial but for many other concepts
in Bayesian probabilistic modeling.
I start with our first notation, which is
X.
If I put a circle around it, it means that
X is a random variable.
More specifically, it is an unobserved random
variable.
If I fill the circle with some color, then
it means that X is an observed random variable.

02:07

In control theory and many other fields, we
call an observed random variable the measurement.
Essentially, this is what your sensor will
give you as its output or reading.
The other symbol of interest is Z,
again a random variable.
Since the circle is not filled, it is
an unobserved random variable.
In statistics this type of random variable
is called a latent random variable,
and from the control theory viewpoint it is called
the state.
That said, there is one problem here.
The issue is that the notation used by the
people from Bayesian statistics side is opposite
of what you would see in the control theory
literature, where the Kalman filter was invented.
As I am showing you on the screen, in control
theory X is called the state and Z is called the measurement.
So which one should I choose from this point
onwards?

03:07

I firmly believe that paying attention to
notation and being compliant with it
has a lot of virtue, but this time around I
have a dilemma.
Since this tutorial is primarily
about first creating a generic formulation
of filtering from the Bayesian view, I am
not able to bring myself to use the symbol
Z for the measurement,
as is done in control theory.
The reason is that in both machine learning
and Bayesian statistics, the symbol Z signifies
the unknown variable, the latent random variable.
So in order to find some middle ground, I'm
going to use the symbol S to indicate the
state and M for the measurement.
So the next thing to wonder about is the relationship
between these two random variables:
the observed one, that is M, and the unobserved
one, that is S.

04:11

One simple example of a relationship could be
that the measurement M is a distorted reflection
of the state variable.
I say reflection because the sensors we use
to measure physical phenomena do not
always capture them accurately.
The equation relating the two that I
have put on screen may quantify that potential
relationship.
I keep saying potential relationship because
the actual relationship could be far more
sophisticated than what I am showing here.
It also depends on what you are modeling.
This arrangement of latent and observed variables,
with an arrow originating from the state to the measurement,
shows the causal relationship as well, which
means that the state could be more than just
another version of the measurement.
For example, maybe the measurement is the
location of a robot in a room, given by the sensor,
whereas the state is the speed or velocity at which
the robot is moving.
Clearly the location of the robot was caused by
the movement, and one influencing factor could

05:15

have been the velocity of the robot.
In simple words, the state does not need to
be the same information as what you are measuring.
That said, the state here is seen as the cause,
and if we can find the cause, it opens a few
more opportunities to improve both the modeling
of our physical phenomenon and the algorithms
that operate on the model.
If we have a good state estimate, then maybe
we can work around the steps where our measurement
goes missing, as we know that the measurement
was a consequence of the state.
In Bayesian statistics, this type of formulation,
finding unobserved variables using observed
variables, is called a latent variable model,
where the goal is to first estimate S, the
latent variable, and then include it as part
of the full parameter estimation.
As mentioned earlier, the primary task in
a latent variable model or a state variable

06:18

model is to find the state S, and one thought
could be to use the measurement to find it.
In statistics, the process of finding the
unknown variables using the known variables
is called inference.
Unfortunately, in machine learning the words
inference and prediction are used interchangeably,
but they do not really mean the same thing.
You predict the label of a class in a classification
problem, or you predict the output given some
inputs in a regression problem, whereas you
infer the weights or unknown variables that
relate the inputs to the outputs.
So when I say inference, it means that we
are after finding the unknown variables of
our process, which happen to be S in this
case.
Also, please note the dashed arrow.
This is a convention to indicate that
you are doing inference, or that you are in
inference mode.
Another way to look at this is to say
that, given an effect M, you aim to find the

07:20

cause of it.
That is, the state S.
How does one find the cause given an effect?
This is where we will take help from the amazing
Thomas Bayes, as this is the kind of problem
he was chasing.
He was curious about understanding the relationship
between two events and how one could get one
event given the other.
Bayes' rule, or Bayes' theorem, is what will help us
get our state from the measurement, so let's
see how we could formulate our problem using
his rule.
So this is one way that we can use our random
variables, that is S and M, in Bayes' rule.
Now here there are a few more terms to be understood,
as they are not necessarily captured by the
original Bayes' rule that works on events.
The formulation indicates that we first
express our belief about the state,
without seeing any data, and this is called
the prior.

08:25

The likelihood
is the probability of getting the measurement
given the state, and the result is what is
called the posterior.
That is, it is an updated prior,
after taking into consideration the measurements.
Now we need to divide the product of the likelihood
and the prior by a normalizing constant so that
the posterior
is a valid probability distribution.
We will get to revisit this Bayes' rule a few
more times during the tutorial, so more details
will be explained at the appropriate moment.
For now, what is important to understand is
the fact that you start with an assumption
or belief about your state, and this belief
is updated when you observe the measurements.
The updated belief is called the posterior.
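To make this one update concrete, here is a tiny sketch of a single Bayes update in Python. The two states, the measurement, and all the probability values are made up purely for illustration; they are not from the tutorial.

```python
# Toy discrete Bayes update (all numbers are illustrative).
# State S is "moving" or "still"; we observe a "high" sensor reading M.

prior = {"moving": 0.5, "still": 0.5}        # belief before seeing any data
likelihood = {"moving": 0.8, "still": 0.1}   # P(M = "high" | S)

# Unnormalized posterior: likelihood times prior, per state
unnorm = {s: likelihood[s] * prior[s] for s in prior}

# Normalizing constant P(M = "high") makes the posterior a valid distribution
evidence = sum(unnorm.values())
posterior = {s: p / evidence for s, p in unnorm.items()}

print(posterior)  # updated belief after the measurement
```

Note how the normalizing constant is just the sum of the unnormalized terms, which is exactly what makes the posterior sum to one.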
Coming back to our latent variable, or state
estimation, model, let's spend a minute or two
looking at how we can go beyond one random variable.

09:30

That is, to scale our thinking to when we have
more than one random variable.
Here I have picked the measurement variable
as an example, but as such this next concept
can be explained using the state variables as well.
There is nothing specific about the measurement
variable here.
That said, a phenomenon generally consists of many random
variables, and in some cases, while they jointly
result in a phenomenon, they are independent
of each other.
In that case we can express the joint probability
like this, that is, simply as the product of the
individual, or marginal, probabilities of our
random variables.
This makes the computations and the factoring
very easy, but it is not necessarily reflective
of the sophistication and the complexities
that our physical world has.
Here is a description of a phenomenon that
assumes a relationship between random variables.

10:33

For example, when we think about sequential
data, a time series or a signal, for example,
then a measurement or state at a
time step depends on a measurement or state
at previous steps.
In that case we formulate the probability at a time
step as a conditional probability.
For example, here the graph is showing that the measurement
at step 3 depends on the measurement at step
2, so I have accordingly specified its conditional
probability as M3 conditioned on M2.
There is a name for this type of conditional
formulation, where the conditioning is only on
the immediately previous step: it is called a
first-order Markov chain.
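As a quick sketch of how the first-order Markov assumption factorizes a joint probability, here is a small Python example; the two-valued chain and its transition probabilities are hypothetical numbers of my own choosing.

```python
import numpy as np

# Hypothetical first-order Markov chain over two measurement values (0 and 1).
# T[i, j] = P(next = j | current = i)
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
p0 = np.array([0.5, 0.5])  # P(M1)

# Under the first-order Markov assumption, the joint probability of the
# sequence M1=0, M2=0, M3=1 factorizes as P(M1) * P(M2 | M1) * P(M3 | M2).
seq = [0, 0, 1]
joint = p0[seq[0]] * T[seq[0], seq[1]] * T[seq[1], seq[2]]
print(joint)  # 0.5 * 0.9 * 0.1
```

A second-order chain would instead need a table indexed by the two previous values, which is exactly where the extra computational and memory cost mentioned below comes from.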
In reality there is nothing that prevents
considering a situation where there is a dependence
on the two previous steps, as shown here.
This is called a second-order Markov chain, and
you can see that the conditional probability

11:33

at step 3 is now conditioned on both M2
and M1.
That said, higher-order Markov chains, while maybe
capturing the causal relationship a bit more
accurately than the first-order Markov chain,
do have higher computational and memory
costs.
You should appreciate that both modeling and
computational aspects are important in delivering
a solution that is usable.
In this tutorial, I am only considering the
first-order Markov chain for both the measurement
and the state sequence variables.
Time to bring back our state random variables,
and this time around they will appear as a
Markov chain, a first-order Markov chain.
That is, our hypothesis is that the state at a
given step depends only on the state immediately
before it.
As in the earlier graph, where we had a causal
relationship between the state S and the measurement
M, let's form that connection as well.
Now, given this formulation, the next hypothesis
or claim we can make is that maybe it is sufficient

12:41

to only model the causal relationship between
the states and assume that the measurements at
various time steps are independent,
which is reflected by the removal
of the arrows between the measurement steps.
Again, this is done to further reduce the
computational needs, and the hypothesis kind
of makes sense: if we can successfully
model the transition between the states, the
transition between the measurements may be
redundant.
This type of model, where the states form a Markov
chain and the measurements are considered independent,
is called a state space model.
The arrows between the random variables also
indicate that there is a function that is
evolving, or changing, the random variable from
one time step to the next.
In state space modeling
there are names for these functions.
The function between the state random variables
is called the transition

13:43

function, and the one between the state and
the measurement is called the emission function.
I like both terms as they really capture the
essence of what is happening here.
Now let's put our state space model in
inference mode.
You can see that the arrows are now dashed arrows,
indicating that we are trying to figure out
the cause given an effect, and in order to
develop a more general algorithm or general
formulation, let me pick one time step and
call it K.
This way we can start with the simple Bayes' rule
that we have already seen and then generalize
it by including a few more steps.
Here is the Bayes' rule, written without any symbols;
you should be familiar with this by now.
From the Bayesian viewpoint, or the latent variable
viewpoint, the state at time step K is considered
our posterior.
The likelihood is easy to understand,
and the normalizing constant as well.

14:45

That said, the constant will be different
at different time steps and will need to be
recomputed.
Now, one of the conceptually difficult parts
of Bayesian statistics is related to dealing
with the prior.
The prior is generally explained as a belief about
a random variable before you see the data.
Now in this case, where we are modeling a sequential
phenomenon, we can be more specific and say
that the prior is a belief before seeing the data
at step K.
That said, the notion of belief could include
the measurements and the state evolution that
has been observed or computed up until step
K-1.
In terms of symbols, here is how we could indicate it:
our prior is the conditional probability
of the state at step K given the measurements
until K-1.
This means that before we can compute our
posterior, we should be able to figure out
our prior.
Now, at the very, very first step, we have

15:50

no measurement.
That first prior will indeed be an assumption about our
state, and you could use domain knowledge
and other empirical observations to set it.
But after that we should be able to have an
updated belief,
given the state evolution and the measurements.
This is why the computation of this so-called
prior is also called the prediction step, and
the computation of the posterior using it is called
the update step, as given
the new measurements at time step K, we are
now updating our belief in the state at step K.
So the problem has now moved to:
how do we compute this prior at each step?
Let me update the diagram of the model to
see which parts we can
make use of from the previous step.
You can see that we do not see any direct
connection between the measurement

16:51

at step K-1 and the state at step K.
However, we do see that, following the arrows
from the measurement at K-1, we go via the state at
K-1.
So if we can express our belief using this
pathway, that is, using this one level of indirection,
then we should be able
to further refine its formulation and, more
importantly, compute it.
This is where we make use of an equation called
the Chapman-Kolmogorov equation,
I hope I am saying it right, which helps in
representing our conditional probability of
the state at K given the measurements up to K-1
using a marginal representation.
Explaining this equation is out of scope for
this tutorial.
For now, you can consider this to be a trick
to reformulate our problem in terms of the

17:52

components that we have access to.
The only thing I would say, or recommend, is
to please not be intimidated by the name
of this equation or get disheartened:
this one is really a simple trick
that can be easily understood, provided you
understand the notion of marginalization
in probability theory.
Given that we have a way to compute the prior,
or perform the predict step, using this new
formulation, thanks to the Chapman-Kolmogorov
equation, we can move on and
list the predict step and the update step like
this.
And this, my friend, is called Bayesian filtering.
These two Bayesian filtering equations can
now be solved, or computed, using a variety
of algorithms and techniques.
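To show the predict and update steps in their simplest computable form, here is a sketch of a discrete (histogram) Bayesian filter. The two-state transition and emission tables, and the measurement sequence, are invented for illustration only.

```python
import numpy as np

# Minimal discrete (histogram) Bayesian filter; all numbers are illustrative.
# transition[i, j] = P(S_k = j | S_{k-1} = i)
# emission[i, m]   = P(M_k = m | S_k = i)

def predict(belief, transition):
    # Prediction step: marginalize over the previous state
    # (the discrete form of the Chapman-Kolmogorov equation)
    return belief @ transition

def update(prior, emission, m):
    # Update step: likelihood times prior, then normalize
    unnorm = emission[:, m] * prior
    return unnorm / unnorm.sum()

transition = np.array([[0.8, 0.2],
                       [0.2, 0.8]])
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

belief = np.array([0.5, 0.5])      # initial prior over the two states
for m in [0, 0, 1]:                # a short measurement sequence
    belief = update(predict(belief, transition), emission, m)
print(belief)                      # posterior after three measurements
```

With continuous states the sums in `predict` and in the normalization become the integrals discussed later, which is exactly where the computational trouble starts.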
You should notice that I have kept
the notation very generic.
I have not said anything about the kind of
probability distributions that your state
or measurement random variables follow, or
for that matter what the transition function

18:55

and emission function would look like.
So far this is a Bayesian treatment of state
space models, with the primary goal of estimating
the state, that is, the posterior, at various time steps.
Now we will talk about the Kalman filter, and
to help you understand it I need to make a few
minor notational adjustments to our state
space model, so let's look at it one more time.
So this is our state space model, showing the
state variables, the measurement variables, the
transition function between the state variables,
and the emission function between the state variable
and the measurement variable for each time
step.
The first modification I am making here is
to say that both the transition and emission functions
are linear, and the second assumption is that
the state random variable as well as the measurement
19:59

random variable follow the Gaussian distribution.
And a state space model with these two assumptions
is called linear Gaussian state space model.
Please appreciate that nothing has changed
in terms of the connectivity between the random
variables or the structure of the model as a
whole.
It is just that I have put some assumptions,
or rather constraints, on the functions, that
is, the transition function and the emission
function, and on the choice of distribution for
our random variables: both the state and
measurement variables should follow the Gaussian
distribution.
Now, the Kalman filter is a Bayesian filtering
algorithm for state estimation that assumes
that your state space model is a linear Gaussian
state space model.
And the next obvious thing to ask is why the Kalman
filter algorithm makes these assumptions.

21:04

So let's look again at the Bayesian filtering
equations to get our answers.
The trouble here is that even though the formulation
of state estimation from the Bayesian perspective
is very intuitive and simple, it is not
computationally friendly.
The integrals during the predict step and
the normalizing constant, which also, by the
way, expands to an integral, are difficult
to compute, especially when you are dealing
with multivariate, or multi-dimensional, problems.
Here is the same point, just putting the Bayes
formulation on the spot.
There is a scenario where we can work around
this devil, that is, the computation of high-dimensional
integrals, and that is when we
assume that our prior and the likelihood follow
the Gaussian distribution.
In that case we know that the posterior is
going to be Gaussian and we

22:09

have a functional form for computing the normalizing
constant for Gaussian functions.
That is, there is no need to compute the integral.
Well, we do still have a normalizing constant;
it is just that we will have an analytical way
of obtaining it instead of computing the integrals.
So if our likelihood and prior are Gaussian,
we are good.
Now, while we are on the topic of the Gaussian
distribution, here are a few more things that
you should know, if you do not already.
This is how the Gaussian, or normal, distribution
looks.
When you add two normally distributed random
variables, you get a normal distribution as an output.
When you multiply two normal distributions,
you get a normal distribution as an output.
And when you have a random variable that
follows a normal distribution and it is conditioned
on another random variable that also follows
a normal distribution, then you again get
a normal distribution.
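These closure properties can be checked numerically; below is a small sketch using arbitrary illustrative parameters of my own choosing.

```python
import numpy as np

# Check two closure properties of the normal distribution (illustrative values).
rng = np.random.default_rng(0)
a = rng.normal(1.0, 2.0, 200_000)   # samples from N(1, 2^2)
b = rng.normal(3.0, 1.0, 200_000)   # samples from N(3, 1^2)

# Sum of independent normal random variables: means add, variances add.
s = a + b
print(s.mean(), s.var())            # close to 4 and 5

# Product of two normal *densities* is proportional to a normal density
# with combined precision and precision-weighted mean.
mu1, var1, mu2, var2 = 1.0, 4.0, 3.0, 1.0
var_post = 1.0 / (1.0 / var1 + 1.0 / var2)
mu_post = var_post * (mu1 / var1 + mu2 / var2)
print(mu_post, var_post)            # 2.6 and 0.8
```

The second half is the pattern that will reappear in the Kalman update: multiplying a Gaussian prior by a Gaussian likelihood gives another Gaussian whose mean and variance have a closed form.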

23:11

These properties are important to know, as
you saw that in our Bayesian filtering equations
we have conditional distributions, and
we were multiplying them as well; as long as
we can have normal distributions as the components
in those equations, we have an analytical solution.
That is, we have a workaround for our difficult-to-compute
integrals.
So the reason for the Gaussian assumption should
be clear, but this does not explain why linear.
Now, the assumption of linearity is kind of not
comforting, as the world we live in and its
physical phenomena are highly non-linear.
But let's ignore this for a minute
and figure out why the linear assumption.
As a matter of fact, the reason for the linear
function assumption is also somewhat related
to the requirement of using the Gaussian distribution.

24:14

Here I am demonstrating that if you have data
that follows a Gaussian distribution, and
this data is passed through a linear function,
then the output is still normally
distributed.
You can see from these histograms that both
the inputs and the outputs
are Gaussian, as long as the function is linear.
On the contrary, if the function is not linear,
then we do not have a Gaussian output,
as you can see from this demonstration.
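This demonstration is easy to reproduce yourself; here is a sketch that pushes Gaussian samples through a linear and a non-linear function and compares skewness, which is zero for a Gaussian. The specific functions and sample size are my own illustrative choices.

```python
import numpy as np

# Push Gaussian samples through a linear and a non-linear function and
# compare skewness; a Gaussian has skewness 0. Functions are illustrative.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500_000)

def skewness(v):
    v = v - v.mean()
    return (v ** 3).mean() / v.std() ** 3

linear_out = 2.0 * x + 3.0      # linear map: output is still Gaussian
nonlinear_out = np.exp(x)       # non-linear map: output is log-normal, skewed

print(skewness(linear_out), skewness(nonlinear_out))
```

The linear output's skewness stays near zero, while the non-linear output is strongly skewed, which is the histogram picture described above in numbers.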
Let me show you, in the context of our state
space model, what is happening.
As described earlier, the state evolves
across time steps via the transition function.
If this transition function is not linear,
then the distribution at time step 3, in
this particular figure, will not be Gaussian,
and then we will not be able to apply our

25:14

Bayesian simplification that was happening
thanks to the assumption of a Gaussian distribution.
Well, I should not say that we will not be
able to apply it; rather, it will be computationally
inefficient.
And this is why the Kalman filter, which is
nothing but a Bayesian filter, has these
assumptions on our state space model.
That said, let's talk about these linear transition
functions and the emission function as well.
Here, the matrix A is the transition matrix,
and there is a process noise term, which is
assumed to be normally distributed with a mean
of zero and a covariance indicated using the symbol
Q.
The emission function is also linear, and
the matrix there is called H.
It has a measurement noise term, which is also assumed
to be normally distributed with a mean of zero
and a covariance indicated using the symbol R.
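As one way to see these equations in action, here is a sketch that simulates a hypothetical linear Gaussian state space model. The constant-velocity setup and all the matrix values for A, H, Q, and R are assumptions chosen for illustration, not values from the tutorial.

```python
import numpy as np

# Simulate a linear Gaussian state space model:
#   s_k = A s_{k-1} + process noise,     noise ~ N(0, Q)
#   m_k = H s_k     + measurement noise, noise ~ N(0, R)
# 1-D constant-velocity model: state = [position, velocity]; values illustrative.
rng = np.random.default_rng(2)
dt = 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])       # transition matrix
H = np.array([[1.0, 0.0]])       # emission matrix: we only measure position
Q = 0.01 * np.eye(2)             # process noise covariance
R = np.array([[0.25]])           # measurement noise covariance

s = np.array([0.0, 1.0])         # initial state: position 0, velocity 1
states, measurements = [], []
for _ in range(20):
    s = A @ s + rng.multivariate_normal(np.zeros(2), Q)
    m = H @ s + rng.multivariate_normal(np.zeros(1), R)
    states.append(s)
    measurements.append(m)
print(np.array(measurements).shape)  # (20, 1)
```

Notice that the state carries the velocity even though the sensor only reports position, which echoes the robot example from earlier: the state need not be the same information as the measurement.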

26:15

Now we can represent these using the probabilistic
model, like this.
You can see that the conditional probability
of the state at step K, when it is conditioned
on the state at step K-1, is a normal distribution
parameterized with a mean and a covariance.
And the same goes for the measurement
given the state, which is, by the way, our likelihood.
We can now put these into our prediction
and update Bayesian filtering equations, which
means that we essentially have an analytical
form for the various components that the predict
and update steps require.
Take a moment to see that the components marked
in orange are normally distributed, and we
know the mean and covariance of these distributions.
By the way, you can also notice that there
is a recursion element here.

27:17

The conditional probability of the state at step
K-1 given the measurement at step K-1
can be taken directly from the previous update
step.
So all the components are covered;
that is essentially what I am trying to convey
here.
The predict and update steps can then
be further simplified, resulting in a normal
distribution, again with their respective
mean and covariance.
One more time: remember that products and additions
of normal distributions result in another
normal distribution, and that is what we are
observing here for our predict and update steps.
The new mean and covariance equations for
both steps are shown here.
So how did I get to them?
I am not showing you the proof, which is nothing
but the simplification of products of normal
distributions that leads to these, as it takes
a few pages.
But if you are interested in following that
or interested in seeing how somebody gets

28:20

to these kinds of equations, I will put a reference
to a paper that you can consult to see the
final derivation.
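For completeness, here is a minimal sketch of the resulting predict and update computations in code. The structure follows the standard Kalman filter equations; the constant-velocity model matrices and the measurement values below are hypothetical, chosen only to exercise the steps.

```python
import numpy as np

# Minimal Kalman filter for the linear Gaussian model
#   s_k = A s_{k-1} + q, q ~ N(0, Q);  m_k = H s_k + r, r ~ N(0, R)
# Matrices and measurements are illustrative (1-D constant-velocity tracking).

def kalman_step(mu, P, m, A, Q, H, R):
    # Predict: propagate the belief mean and covariance through the transition
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new measurement
    y = m - H @ mu_pred                    # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    mu_new = mu_pred + K @ y
    P_new = (np.eye(len(mu)) - K @ H) @ P_pred
    return mu_new, P_new

dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

mu, P = np.array([0.0, 0.0]), np.eye(2)    # initial belief (mean, covariance)
for m in [np.array([1.1]), np.array([1.9]), np.array([3.2])]:
    mu, P = kalman_step(mu, P, m, A, Q, H, R)
print(mu)  # estimated [position, velocity]
```

Every quantity here is just matrix arithmetic on means and covariances, with no integrals anywhere, which is the closed-form payoff described next.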
Long story short, the Kalman filter ends up providing
a closed-form, or analytical, solution
to the state space model.
Now we can implement all these computations
even on relatively resource-constrained
devices, and this is what makes Kalman filters
kind of a special thing.
Here, one more time, is the final summary, or
description, of a Kalman filter.
Essentially it is a two-step algorithm where
in step one, which we call prediction,
we get a rough estimate of the state, and
then we refine that rough estimate with the
help of the measurement, which makes it the update
step.
There is nothing new on this screen as such,
just a summary of what the Kalman filter is.
That said, the million-dollar question here is:
while we have worked around our computational
29:22

challenges in the Bayes rule, or more
specifically in our Bayesian filtering
equations, are these assumptions really valid?
Well, of course they are valid for some scenarios,
but not so much for quite a lot of others.
And there are extensions and remedies to the Kalman
filter which work quite well in many cases
and then not so well in more complex scenarios.
More importantly, this is why I focused on
obtaining a more generic Bayesian filtering
framework, so that we can apply other algorithms
when we face non-linearity and use distributions
other than the normal distribution.
I do plan to make a few more tutorials on some
of the extensions of Kalman filtering that provide
remedies in the light of non-linearity while still
respecting the computational properties and
benefits of the Kalman filter.
And after that, maybe a tutorial on something
called particle filtering.

30:26

And this brings us to the end of this tutorial
on understanding the Kalman filter as a special
case of Bayesian filtering, and also the modeling
assumptions it makes.
I hope this was useful and helpful.
Please let me know if you have questions or
comments, and I will try to help clarify.
With that, goodbye for now.
See you in another tutorial.
Bye bye.
