Shae's Ramblings

Stuff I find slightly meaningful and close to achievable

In this post, we aim to prove that any continuous density \(x\mapsto p(x)\) over a bounded subset \(\mathcal{X} \subset \mathbb{R}\) can be approximated by a convex combination of deltas \(\delta_{x_i}\): for any accuracy \(\varepsilon > 0\) there is an \(N\) for which \(p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)\) with \(\varepsilon\)-accuracy for all \(n > N\), where the approximation is understood in the weak sense defined below.

Because any bounded subset of \(\mathbb{R}\) can be covered by finitely many disjoint closed intervals \([a_i, b_i]\), we can prove our result on each interval and use the fact that for any \(c,d\in\mathcal{X}\) the integral splits as

$$ \int_c^dp(x)\mathrm{d}x = \sum_{i=1}^n\int_{a_i}^{b_i}p(x)\mathrm{d}x $$

Thus, to simplify the problem, we can assume that \(\mathcal{X} = [a,b]\) is a single interval.

Weak convergence

For continuous distributions \(f,g\) on \(\mathcal{X}\) define the inner product

$$ \langle f,g\rangle \triangleq \int_\mathcal{X}f(x)g(x)\,\mathrm{d}x $$ We say a sequence \(f_n\) converges weakly to a limit \(f_\star\) if for every bounded continuous test function \(g\) on \(\mathcal{X}\) we have \(\langle f_n, g\rangle \to \langle f_\star, g\rangle\) as \(n\to\infty\).

The proof

We can use Riemann sums to prove the statement. First, notice that because \(\int p(x)\,\mathrm{d}x = 1\), for any \(n\) we can find a partition \(a = x_0 < x_1 < \dots < x_n = b\) with gaps \(\Delta x_i = x_{i+1} - x_i\) so that

$$ \sum_{i=0}^{n-1}p(x_i^\star)\Delta x_i = 1 + u_n $$ with \(u_n \to 0\) as \(n\) tends to infinity. We can assume that \(u_n \leq 0\) by taking \(p(x_i^\star) = \inf\{p(x)\mid x_i \leq x \leq x_{i+1}\}\) which is attained on a closed interval by the continuity of \(p\). We can therefore for any \(n\) construct the following \(\gamma_i\):

$$ \gamma_i \triangleq p(x_i^\star)\Delta x_i - u_n/n $$ so that we always have \(\gamma_i \geq 0\) and \(\sum_{i=0}^{n-1}\gamma_i = 1 + u_n - u_n = 1\).

Finally, consider the convex sum \(p_n = \sum_{i=0}^{n-1} \gamma_i \delta_{x_i^\star}\). For any bounded continuous \(f\) we have

$$ \langle p_n , f\rangle = \sum_{i=0}^{n-1} \gamma_i f(x_i^\star) = \sum_{i=0}^{n-1}\left( p(x_i^\star)\Delta x_i - u_n/n \right) f(x_i^\star) $$ which expands into

$$ \langle p_n , f\rangle = \sum_{i=0}^{n-1}p(x_i^\star)f(x_i^\star)\Delta x_i - \dfrac{u_n}{n}\sum_{i=0}^{n-1}f(x_i^\star) $$ which converges to \(\langle p, f \rangle\): the first term is a Riemann sum for \(\int_\mathcal{X}p(x)f(x)\,\mathrm{d}x\), and \(\dfrac{u_n}{n}\sum_{i=0}^{n-1} f(x_i^\star) \to 0\) because \(f\) is bounded and \(u_n \to 0\).
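As a quick sanity check, here is a small numerical sketch of the construction (the density, the test function and the grid are arbitrary choices made for illustration, not part of the proof):

```python
import numpy as np

# Density p on [0, 1] (a Beta(2, 2) density, chosen arbitrarily) and a bounded
# continuous test function f.
a, b = 0.0, 1.0
p = lambda x: 6.0 * x * (1.0 - x)
f = lambda x: np.cos(3.0 * x)

def delta_approximation(n):
    """Weights gamma_i and atoms x_i^* from the Riemann-sum construction."""
    edges = np.linspace(a, b, n + 1)
    # p(x_i^*) is the infimum of p over each sub-interval; for this concave p the
    # infimum sits at one of the endpoints (a fine grid search works in general)
    x_star = np.where(p(edges[:-1]) < p(edges[1:]), edges[:-1], edges[1:])
    dx = np.diff(edges)
    u_n = np.sum(p(x_star) * dx) - 1.0      # u_n <= 0 for a lower Riemann sum
    gamma = p(x_star) * dx - u_n / n        # gamma_i >= 0 and sums to one
    return gamma, x_star

grid = np.linspace(a, b, 200_001)
exact = np.mean(p(grid) * f(grid))          # reference value of <p, f> (since b - a = 1)
for n in (10, 100, 1000):
    gamma, x_star = delta_approximation(n)
    approx = np.sum(gamma * f(x_star))      # <p_n, f> = sum_i gamma_i f(x_i^*)
    print(n, approx, abs(approx - exact))
```

The gap shrinks as \(n\) grows, as the proof predicts.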

This post is about a proof of Jensen's Inequality, and more specifically a way to go from finite spaces to countable and uncountable spaces.

Finite case

Recall that a function \(f\) is convex on an interval iff for any two points \(a, b\) of that interval the inequality

$$ f(ta + (1-t)b) \leq tf(a) + (1-t)f(b) $$

holds for any real \(t\) between \(0\) and \(1\). We can easily generalize this inequality to any convex sum of points (a convex sum is a weighted sum \(\sum_{i=1}^n \gamma_ia_i\) for which \(\gamma_i \geq 0\) and \(\sum_{i=1}^n \gamma_i = 1\)) using induction on \(n\). The above inequality serves as the base case (\(n\) equal to \(1\) and \(2\)). Assuming the inequality holds for \(n-1\) points, we can notice that

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) = f\left(\gamma_na_n + \sum_{i=1}^{n-1} \gamma_ia_i\right) = f( ta + (1-t)b) $$

if we set \(t = \gamma_n\), \(a = a_n\), \(b = \sum_{i=1}^{n-1} \gamma_i'a_i = \sum_{i=1}^{n-1} \frac{\gamma_i}{1-\gamma_n} a_i \) (assuming \(\gamma_n < 1\); if \(\gamma_n = 1\) the inequality is trivial)

thus, by the convexity of \(f\), we have

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \gamma_n f(a_n) + (1 - \gamma_n)f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) $$

However, since the \(\gamma_i'\) are themselves the weights of a convex combination of \(n-1\) points, we can apply the induction hypothesis: \(f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) \leq \sum_{i=1}^{n-1} \gamma_i'f(a_i)\). Thus we obtain that \(f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \sum_{i=1}^n \gamma_if(a_i)\) for any finite convex combination \(\{ (\gamma_i,a_i) \}_{i=1}^n\).
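For concreteness, a tiny numerical illustration of the finite inequality (the convex function, the points and the weights are arbitrary choices, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp                          # a convex function
a = rng.normal(size=10)             # arbitrary points a_1, ..., a_n
gamma = rng.random(10)
gamma /= gamma.sum()                # convex weights: gamma_i >= 0 and sum to 1

lhs = f(np.dot(gamma, a))           # f(sum_i gamma_i a_i)
rhs = np.dot(gamma, f(a))           # sum_i gamma_i f(a_i)
assert lhs <= rhs                   # finite Jensen's inequality
print(lhs, rhs)
```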

Countable case

A countable collection of points \((a_n)_{n\in\mathbb{N}}\) can be combined in a convex fashion using any positive sequence \((\gamma_n)_{n\in\mathbb{N}}\) such that \(\sum_{i=1}^\infty \gamma_i= \lim_{n\to\infty}\sum_{i=1}^n \gamma_i = 1\). This implies that the coefficients vanish at infinity: \(\gamma_i \to 0\) as \(i \to \infty\).

Assuming both \(\sum_{i=1}^\infty \gamma_i a_i\) and \(\sum_{i=1}^\infty \gamma_i f(a_i)\) converge, and assuming \(f\) is continuous, we can apply the finite inequality to the normalized weights \(\gamma_i / s_n\) with \(s_n = \sum_{i=1}^n \gamma_i\), then let \(n\) tend to infinity (using \(s_n \to 1\)) to yield:

$$ f\left(\sum_{i=1}^\infty \gamma_ia_i\right) \leq \sum_{i=1}^\infty \gamma_if(a_i) $$

Continuous case

Assuming we have a distribution \(p\) defined on the real line \(\mathbb{R}\), the integral $$\mathbb{E}_p[f(x)] = \int_\mathcal{X}f(x)p(x)\,\mathrm{d}x$$ can be approximated using countable sums. Indeed, the function \(p(x)\) can be approximated using a convex combination of dirac deltas: \(p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)\) (formally, convex combinations of diracs are weakly dense in the space of distributions). We have that \(\mathbb{E}_{p_n}[f(x)] \to \mathbb{E}_{p}[f(x)]\) as \(n \to \infty\) and

$$ f\left(\mathbb{E}_{p_n}[x]\right) = f\left(\sum_{i=1}^n\gamma_i x_i\right) \leq \sum_{i=1}^n\gamma_i f(x_i) = \mathbb{E}_{p_n}[f(x)] $$ and we can let \(n\) tend to \(\infty\) on each side of the inequality to conclude.

This post is about a fundamental inequality in variational inference.

Intuition

In order to fit a model \(p_\theta(\cdot)\) to datapoints \(\mathrm{x}_1, \dots, \mathrm{x}_n\) one has to compute the marginal likelihood \(p_\theta(\mathrm{x})\) of the data (also called the evidence) under the parameter \(\theta\). In variational methods, one usually models a generative process \(p_\theta(\mathrm{x}\mid \mathrm{z})\) which maps a latent \(\mathrm{z}\) into an estimate of the data \(\mathrm{x}\). If one has a simple distribution \(p(\mathrm{z})\) over the latents, then one has the equality:

$$ p_\theta(\mathrm{x}) = \int_\mathcal{Z}p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} $$

therefore, one can approximate the marginal likelihood by sampling points \(\mathrm{z}_1, \dots, \mathrm{z}_m\) from \(p(\mathrm{z})\) and using Monte Carlo integration:

$$p_\theta(\mathrm{x}) \approx \dfrac1m \sum_{i=1}^mp_\theta(\mathrm{x}\mid \mathrm{z}_i)$$
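As an illustration, here is what the naive estimator looks like on a toy one-dimensional model (a Gaussian prior and a Gaussian decoder, chosen only because the true evidence is then known in closed form; none of these choices come from the post):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.1          # decoder noise: p(x | z) = N(x; z, sigma^2), prior p(z) = N(0, 1)
x_obs = 2.0          # a datapoint fairly far from the prior mean

def naive_evidence(m):
    """Estimate p(x) = E_{z ~ p(z)}[ p(x | z) ] with m samples from the prior."""
    z = rng.standard_normal(m)
    return norm.pdf(x_obs, loc=z, scale=sigma).mean()

true_evidence = norm.pdf(x_obs, loc=0.0, scale=np.sqrt(1.0 + sigma**2))
for m in (100, 10_000, 1_000_000):
    print(m, naive_evidence(m), true_evidence)
```

With a small number of samples the estimate is very noisy, which is exactly the problem described next.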

However, picking the \(\mathrm{z}_i\) at random from the prior will result in poor estimates, as a random \(\mathrm{z}_i\) is unlikely to have generated \(\mathrm{x}\) (that is, the value \(p_\theta(\mathrm{x}\mid \mathrm{z}_i)\) will be low). To approximate the integral well with few samples, we should instead draw latents in the regions where \(p_\theta(\mathrm{x}\mid \mathrm{z})\) is large.

Thus, instead of sampling from \(p(\mathrm{z})\) which is a fixed, simple and suboptimal distribution, we wish to sample from a parametric distribution \(\mathrm{z} \sim q_\phi(\mathrm{z}\mid \mathrm{x})\) called an encoder. This name comes from the fact that the datapoints \(\mathrm{x}\) are often high-dimensional, and the latent variables (called codes) are much simpler: \(\dim(\mathrm{x}) \gg \dim(\mathrm{z})\). The reason we wish to sample from \(q_\phi\) is that by optimizing over the parameters \(\phi\), we can put more mass on codes \(\mathrm{z}^\star\) that result in high likelihoods \(p_\theta(\mathrm{x}\mid \mathrm{z}^\star)\).

The variational equality

The Kullback-Leibler divergence between two distributions \(q\) and \(p\), defined as \(\mathrm{KL}(q\parallel p) \triangleq \mathbb{E}_{q}[\log q - \log p]\), is nonnegative and vanishes exactly when \(q = p\).

Since we sample from \(q_\phi\) rather than from the true posterior \(p_\theta(\mathrm{z}\mid \mathrm{x})\), we should compute the divergence between the two:

\begin{align} \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) &= \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) - \log p_\theta(\mathrm{z}\mid \mathrm{x})] \\ & = \log p_\theta(\mathrm{x}) + \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) - \log p_\theta(\mathrm{x}\mid \mathrm{z}) - \log p(\mathrm{z})] \\ & = \log p_\theta(\mathrm{x}) + \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \; - \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] \end{align}

Therefore, denoting

\begin{align} E(\theta, \phi, \mathrm{x}) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) \\ \Omega(\phi) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

we have the equality \begin{align} \log p_\theta(\mathrm{x})\; - E(\theta, \phi, \mathrm{x}) = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] - \Omega(\phi) \end{align}

which reads: “Evidence \(-\) error \(=\) approximation \(-\) regularizer”. The regularizer term \(\Omega(\phi)\) makes sure our new distribution \(q_\phi\) is not too far from the simple distribution \(p(\mathrm{z})\), and the \(E(\cdot)\) term quantifies the amount of error we incur by sampling from \(q_\phi\) instead of the true posterior \(p_\theta(\mathrm{z}\mid \mathrm{x})\) to estimate the likelihood \(\log p_\theta(\mathrm{x}\mid \mathrm{z})\). The regularizer enters with a minus sign because we wish to minimize \(\Omega(\phi)\) while maximizing the approximate likelihood.

A first derivation of the inequality

Because the KL-divergence is always nonnegative, \(E(\cdot)\) is nonnegative, and thus

$$\log p_\theta(\mathrm{x}) \geq \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] - \Omega(\phi) \triangleq \mathcal{L}(\theta, \phi, \mathrm{x}) $$

Thus, to estimate an intractable evidence \(\log p_\theta(\mathrm{x})\), we can optimize the evidence lower bound \(\mathcal{L}(\theta, \phi, \mathrm{x})\) over \(\phi\) to tighten the bound so that \(\mathcal{L}(\theta, \phi, \mathrm{x}) \approx \log p_\theta(\mathrm{x})\).

The distributions \(p(\mathrm{z})\) and \(q_\phi(\mathrm{z}\mid\mathrm{x})\) are often chosen to be multivariate Gaussians to obtain a closed form for \(\Omega(\phi)\). The expectation term \(\mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]\) is often estimated using Monte Carlo methods.
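Continuing the toy Gaussian model from above (one-dimensional, all distributions Gaussian, chosen purely for illustration), both terms of the bound are easy to evaluate: the KL term has a closed form and the reconstruction term is estimated by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x_obs = 0.1, 2.0                                  # same toy model as above

def elbo(mu_q, s_q, n_samples=100_000):
    """L(theta, phi, x) = E_q[log p(x|z)] - KL(q(z|x) || p(z)) for a Gaussian q."""
    z = mu_q + s_q * rng.standard_normal(n_samples)              # z ~ q_phi(z | x)
    recon = norm.logpdf(x_obs, loc=z, scale=sigma).mean()        # E_q[log p(x | z)]
    kl = 0.5 * (s_q**2 + mu_q**2 - 1.0 - 2.0 * np.log(s_q))      # KL(N(mu, s^2) || N(0, 1))
    return recon - kl

log_evidence = norm.logpdf(x_obs, scale=np.sqrt(1.0 + sigma**2))
post_mean = x_obs / (1.0 + sigma**2)                     # true posterior p(z | x) is Gaussian
post_std = np.sqrt(sigma**2 / (1.0 + sigma**2))
print("log p(x)                :", log_evidence)
print("ELBO, q = prior         :", elbo(0.0, 1.0))              # loose: E(.) is large
print("ELBO, q = true posterior:", elbo(post_mean, post_std))   # tight: equals log p(x)
```

With the exact posterior as encoder the gap \(E(\cdot)\) vanishes and the bound is tight, which is the sense in which optimizing over \(\phi\) tightens it.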

A second derivation

A second derivation, found in this famous paper on normalizing flows, uses Jensen's inequality. Indeed, by the concavity of the logarithm, we have for any distribution \(p\) and nonnegative random variable \(Y\) the inequality \(\log(\mathbb{E}_p[Y]) \geq \mathbb{E}_p[\log(Y)]\). In particular, we can apply it to recover the bound through the steps

\begin{align} \log p_\theta(\mathrm{x}) &= \log\int_\mathcal{Z} p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &= \log\int_\mathcal{Z} \dfrac{q_\phi(\mathrm{z}\mid \mathrm{x})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\, p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &\geq \int_\mathcal{Z} q_\phi(\mathrm{z}\mid \mathrm{x})\log\dfrac{p_\theta(\mathrm{x}\mid \mathrm{z}) p(\mathrm{z})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\,\mathrm{dz} \\ & = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]\; - \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

This proof is elegant but does not give an analytic expression for the Jensen gap \(\log p_\theta(\mathrm{x}) - \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] + \Omega(\phi)\). This gap is exactly the term \(E(\cdot)\) derived in the previous section.

Suppose one wants to compute the expected value \(\mathbb{E}_X[f(X)]\) of a quadratic form \(f(x) \triangleq x^\mathsf{T}\mathrm{A} x\). Here \(x\) is an \(n\)-dimensional vector and \(\mathrm{A}\) an \(n\times n\) matrix. This post introduces a popular trick to compute the expectation of the quadratic form \(f\).

Trace definition and property

Recall the definition of the trace for a matrix \(\mathrm{X} = [\mathrm{X}_{ij}]_{i,j = 1}^n\): the trace is the sum of the diagonal elements $$ \mathrm{Tr}[\mathrm{X}] = \sum_{i=1}^n \mathrm{X}_{ii} $$ from which it is easy to derive that \(\mathrm{Tr}[\cdot]\) is linear. The trace of a \(1\times 1\) matrix is simply the value of its only element \(\mathrm{X}_{11}\). One easily finds, using any usual definition for integrals, that as a linear map one has

$$\int \mathrm{Tr}[g(\mathrm{X})]\,\mathrm{dX} = \mathrm{Tr}\left[ \int g(\mathrm{X})\,\mathrm{dX}\right] $$

Another important property is the cyclic nature of the trace: \(\mathrm{Tr}[\mathrm{XY}] = \mathrm{Tr}[\mathrm{YX}]\) which can be generalized to three matrices $$ \mathrm{Tr}[\mathrm{XYZ}] = \mathrm{Tr}[\mathrm{ZXY}] = \mathrm{Tr}[\mathrm{YZX}] $$ or any finite product of \(m\) matrices.

The trace trick

Suppose that \(X\) is a random variable distributed according to the density \(p\). Suppose the second-moment matrix \(\Sigma = \mathbb{E}[XX^\mathsf{T}]\) exists (it equals the covariance \(\mathrm{Cov}[X,X]\) when \(X\) has zero mean). The expectation of the quadratic \(f\) is then equal to \(\mathrm{Tr}[\mathrm{A}\Sigma]\).

To prove this, we first use the fact that \(f(X) = X^\mathsf{T}\mathrm{A}X\) is a scalar, and is therefore equal to \(\mathrm{Tr}[X^\mathsf{T}\mathrm{A}X]\). Then, by the cyclic property of the trace, \(f(X) = \mathrm{Tr}[\mathrm{A}XX^\mathsf{T}]\). Finally, as the trace is linear, the expectation of the trace is equal to the trace of the expectation: $$ \mathbb{E}_X[\mathrm{Tr}[\mathrm{A}XX^\mathsf{T}]] = \mathrm{Tr}[\mathrm{A}\mathbb{E}_X[XX^\mathsf{T}]] = \mathrm{Tr}[\mathrm{A}\Sigma] $$

which concludes the proof.
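A quick Monte Carlo check of the identity (a random matrix \(\mathrm{A}\) and a zero-mean Gaussian \(X\), both chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))                  # arbitrary n x n matrix
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)              # a valid covariance E[X X^T] (zero mean)

X = rng.multivariate_normal(np.zeros(n), Sigma, size=500_000)
mc = np.mean(np.einsum('ij,jk,ik->i', X, A, X))   # Monte Carlo estimate of E[x^T A x]
exact = np.trace(A @ Sigma)                       # Tr[A Sigma]
print(mc, exact)
```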

This post summarizes the main ideas of a method to approximate the entropy of a continuous random variable \(X\). Recall the definition of \(h(X)\):

$$ h(X) \triangleq \;-\int_\mathcal{X} p(x)\log p(x)\,\mathrm{d}x $$

Maximum entropy

If we have information about the distribution \(p\) in the form of equality constraints

$$ \mathrm{E}_p[G_i(X)] = \int_\mathcal{X} p(x)G_i(x)\,\mathrm{d}x = c_i\,,\qquad i\in \{1, \dots, n \} $$

then the distribution \(p_0\) which maximizes the entropy \(h\) while respecting the above constraints must be of the form:

$$ p_0(x) = A\,\exp\left(\sum_{i=1}^n a_iG_i(x)\right) $$

where \(A\) and the \(a_i\) are constants determined by the \(c_i\); they are hard to compute in general.

Simplifying the candidate distribution

Let \(\phi\) be the standard normal distribution. The authors of the paper assume the candidate distribution \(p\) is not far from \(\phi\). Thus, they normalize the data and put extra constraints \(G_{n+1}(x) = x\) with \(c_{n+1} = 0\) and \(G_{n+2}(x) = x^2\) with \(c_{n+2} = 1\). They further assume the constraint functions \(G_j\) are orthonormal with respect to the inner product \(\langle f,g \rangle \triangleq \mathrm{E}_{\phi}[fg]\) which can be obtained through the Gram-Schmidt algorithm.

Near-Gaussianity implies that the coefficients \(a_i\) in the above expression for \(p_0\) are near zero for \(i \leq n+1\) compared to \(a_{n+2} \approx -\frac12\), since

\begin{align} p_0(x) &= A\,\exp\left(\sum_{i=1}^{n+2} a_iG_i(x)\right) \\ &= A\,\exp\left(a_{n+1}x + a_{n+2}x^2 + \sum_{i=1}^n a_iG_i(x)\right) \\ & \approx \frac{1}{\sqrt{2\pi}}\exp\left( -\frac12x^2 \right) \end{align}

Therefore, as \(\delta = \sum_{i=1}^n a_iG_i(x) \approx 0\), we can expand \(p_0\) to first order in \(\delta\) as (this is related to Edgeworth expansions, more on this later)

$$ p_0(x) \approx \phi(x)\left(1 + \sum_{i=1}^nc_iG_i(x)\right) $$

To be added.

This post is about a derivation of the Singular Value Decomposition of a matrix \(\mathrm{A}\) when several conditions are met. These conditions are quite restrictive, but make derivations very easy.

Symmetric PD matrices

Let us begin with a symmetric matrix \(\mathrm{S}\in\mathbb{R}^{n\times n}\) which is assumed positive-definite with \(n\) distinct eigenvalues \(\{\lambda_i\}_i\). It is easy to prove that the eigenvalues are strictly positive and that the eigenvectors \(v_i\) are orthogonal:

$$0 < v_i^\mathsf{T}\mathrm{S}\,v_i = \lambda_i\lVert v_i \rVert^2 $$

gives that all eigenvalues are strictly positive, and for \(i \neq j\) we have

$$ \lambda_i\,v_i^\mathsf{T}v_j = (\mathrm{S}v_i)^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}v_j = \lambda_j\,v_i^\mathsf{T}v_j$$

because \(\mathrm{S} = \mathrm{S}^\mathsf{T}\). As \(\lambda_i \neq \lambda_j\) we must have \(v_i^\mathsf{T}v_j = 0\). A symmetric PD matrix \(\mathrm{S}\) can therefore be decomposed as \(\mathrm{S} = \mathrm{Q}\mathrm{\Lambda}\mathrm{Q}^\mathsf{T}\) with \(\mathrm{Q}\) orthogonal and \(\mathrm{\Lambda}\) diagonal with positive entries.
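In numpy this is exactly what np.linalg.eigh computes; a quick check on a randomly generated symmetric PD matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
S = B @ B.T + n * np.eye(n)                    # symmetric positive-definite by construction

lam, Q = np.linalg.eigh(S)                     # eigenvalues (ascending) and orthonormal eigenvectors
print(np.all(lam > 0))                         # strictly positive spectrum
print(np.allclose(Q.T @ Q, np.eye(n)))         # Q is orthogonal
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))  # S = Q Lambda Q^T
```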

Eigendecomposition of a product

Suppose \(\mathrm{A}\) is square and invertible, and further that the matrices \(\mathrm{A}^\mathsf{T} \mathrm{A}\) and \(\mathrm{A}\mathrm{A}^\mathsf{T}\) have distinct eigenvalues. Then, by our derivation in the previous section, we have that each of the two matrices can be decomposed as

$$ \mathrm{A}^\mathsf{T} \mathrm{A} = \mathrm{U}\mathrm{\Sigma}_1\mathrm{U}^\mathsf{T} \qquad\mathrm{and}\qquad \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} $$

because these two matrices are symmetric and (since \(\mathrm{A}\) is invertible) positive-definite. Further, the eigenvalues \(\sigma^1_i\) and \(\sigma^2_i\) stored in \(\mathrm{\Sigma}_1\) and \(\mathrm{\Sigma}_2\) are equal: if \(\mathrm{A}^\mathsf{T} \mathrm{A}u_i = \sigma^1_i u_i\) then

$$\mathrm{A}\mathrm{A}^\mathsf{T} (\mathrm{A}u_i) = \sigma^1_i (\mathrm{A}u_i) $$ so \(\mathrm{A}u_i\) is an eigenvector of \(\mathrm{A}\mathrm{A}^\mathsf{T}\), i.e. \(\mathrm{A}u_i\) is proportional to some \(v_j\) and \(\sigma^1_i = \sigma^2_j\) for that \(j\).

We can therefore always rearrange \(\mathrm{\Sigma}_1\) into \(\mathrm{\Sigma}_2\) by pre and post multiplying by a permutation: \(\mathrm{\Sigma}_1 = \mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}\). By applying the same permutation to the matrix of eigenvectors we keep the decomposition

$$ \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} = (\mathrm{V}\mathrm{P}^\mathsf{T})\mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}(\mathrm{V}\mathrm{P}^\mathsf{T})^\mathsf{T} = \tilde{\mathrm{V}}\mathrm{\Sigma}_1\tilde{\mathrm{V}}^\mathsf{T}$$

we can therefore assume \(\mathrm{\Sigma}_1 = \mathrm{\Sigma}_2 = \mathrm{\Sigma}\) for some appropriate orthogonal matrices \(\mathrm{U}, \mathrm{V}\).

Singular Value Decomposition

Because the eigenvalues stored in \(\mathrm{\Sigma}\) are strictly positive, we can take their square root \(\sqrt{\sigma_i}\) and store them in a new matrix \(\tilde{\mathrm{\Sigma}} = \mathrm{\Sigma}^{1 / 2}\) such that $$\tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}^\mathsf{T}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}}^\mathsf{T} = \mathrm{\Sigma} $$

The trick behind the SVD is to construct a specific \(\mathrm{V}\) from the \(\mathrm{U}\)-decomposition. Consider the equalities

\begin{align} \mathrm{A}\mathrm{A}^\mathsf{T} &= \mathrm{V}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\tilde{\mathrm{\Sigma}}^{-1}\mathrm{U}^\mathsf{T} \mathrm{A}^\mathsf{T}\right) \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)^\mathsf{T} \end{align}

therefore, by setting \(\mathrm{V} \triangleq \mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\) we can do two things: orthogonally diagonalize the matrix \(\mathrm{A}\mathrm{A}^\mathsf{T}\) (one checks that \(\mathrm{V}^\mathsf{T}\mathrm{V} = \tilde{\mathrm{\Sigma}}^{-1}\mathrm{U}^\mathsf{T}\mathrm{A}^\mathsf{T}\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1} = \tilde{\mathrm{\Sigma}}^{-1}\mathrm{\Sigma}\tilde{\mathrm{\Sigma}}^{-1} = \mathrm{I}\), so \(\mathrm{V}\) is indeed orthogonal) and ensure that \(\mathrm{A} = \mathrm{V}\tilde{\mathrm{\Sigma}}\mathrm{U}^\mathsf{T}\). The matrices \(\mathrm{U}, \mathrm{V}\) are often named differently, so that the SVD decomposition of \(\mathrm{A}\) is

$$ \mathrm{A} = \mathrm{U}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} $$ for some orthogonal matrices \(\mathrm{U}, \mathrm{V}\) and a diagonal matrix \(\mathrm{\Sigma}\) with positive entries (the singular values, i.e. the square roots \(\sqrt{\sigma_i}\) constructed above).
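Here is a sketch of the construction in numpy (a random square matrix, which is almost surely invertible with distinct eigenvalues, so the restrictive conditions above hold):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))                  # square and (almost surely) invertible

sigma1, U = np.linalg.eigh(A.T @ A)          # A^T A = U Sigma_1 U^T
S_tilde = np.diag(np.sqrt(sigma1))           # square roots of the eigenvalues
V = A @ U @ np.linalg.inv(S_tilde)           # the trick: V = A U S_tilde^{-1}

print(np.allclose(V.T @ V, np.eye(n)))                   # V is orthogonal
print(np.allclose(A @ A.T, V @ np.diag(sigma1) @ V.T))   # A A^T = V Sigma V^T
print(np.allclose(A, V @ S_tilde @ U.T))                 # A = V S_tilde U^T
```

np.linalg.svd would list the same singular values in the opposite order, possibly with different signs on the singular vectors.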

A special kind of norm for \(n\times n\) square matrices \(\mathrm{A}\) is called the subordinate (or induced) norm. If we have a norm \(\lVert\cdot\rVert_\alpha\) on the input space, and a second norm \(\lVert\cdot\rVert_\beta\) on the output space, the subordinate norm is

$$ \lVert\mathrm{A}\rVert_{\alpha, \beta} = \sup_{x \neq 0}\dfrac{\;\lVert\mathrm{A}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} $$

Submultiplicative property

An important property of these induced norms is that for any third norm \(\lVert\cdot\rVert_{\gamma}\), we have:

$$ \lVert\mathrm{AB}\rVert_{\alpha, \beta} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

This is easily proven via the computations

$$ \dfrac{\;\lVert\mathrm{AB}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} = \dfrac{\;\lVert\mathrm{A}y\rVert_{\beta}}{\;\lVert y\rVert_\gamma} \dfrac{\;\lVert\mathrm{B}x\rVert_{\gamma}}{\;\lVert x\rVert_\alpha} $$ for \(y = \mathrm{B}x\) (assuming \(\mathrm{B}x \neq 0\); the bound is immediate otherwise). As each ratio on the right is bounded by the corresponding supremum, we get the inequality

$$ \dfrac{\;\lVert\mathrm{ABx}\rVert_{\beta}}{\;\lVert x\rVert_\alpha} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

we get the final inequality by taking the supremum over \(x\) on the left-hand side, because the supremum is the least upper bound. A special case is obtained by taking \(\alpha = \beta = \gamma\), which yields that $$\lVert\mathrm{AB}\rVert_{\alpha, \alpha} \leq \lVert\mathrm{A}\rVert_{\alpha, \alpha} \lVert\mathrm{B}\rVert_{\alpha, \alpha} $$ we denote this norm \(\lVert\cdot\rVert_\alpha\).

Condition number

The condition number \(\kappa_\alpha(\mathrm{A})\) of a matrix is the quantity $$\kappa_\alpha(\mathrm{A}) = \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ which is at least \(1\) because $$ 1 = \lVert\mathrm{I}\rVert_\alpha = \lVert\mathrm{A}\mathrm{A}^{-1}\rVert_\alpha \leq \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ this number is large when either \(\mathrm{A}\) or its inverse creates large distortions as the input varies on the unit sphere.
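For the \(l_2\)-induced norm this is easy to check numerically; np.linalg.cond(A, 2) computes exactly the product \(\lVert\mathrm{A}\rVert_2\lVert\mathrm{A}^{-1}\rVert_2\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))

norm_A = np.linalg.norm(A, 2)                    # induced l2 norm (largest singular value)
norm_Ainv = np.linalg.norm(np.linalg.inv(A), 2)
kappa = norm_A * norm_Ainv
print(kappa, np.linalg.cond(A, 2))               # the two values agree
print(kappa >= 1.0)                              # kappa is always at least 1
```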

Stability of a similarity

In linear algebra, a matrix \(\mathrm{B}\) is said to be similar to \(\mathrm{A}\) when there is an invertible transformation \(\mathrm{X}\) such that \(\mathrm{B} = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}\). In numerical analysis, we often wish to know the robustness of an operation \(f\colon x \mapsto f(x)\) with respect to a small perturbation of its input: if \(y = f(x)\), what is \(f(x+\varepsilon)\)? For real functions, continuity tells us that as long as \(\lvert \varepsilon \rvert < \delta\), the perturbed output \(f(x+\varepsilon)\) is \(\tilde{\varepsilon}\)-close to \(y\) for some small \(\tilde{\varepsilon} > 0\). What is the equivalent bound on the error of the matrix function \(f_{\mathrm{X}}(\mathrm{A}) = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}\)? Suppose \(\mathrm{A}\) is perturbed by a matrix \(\mathrm{E}\) with small magnitude: \(\lVert\mathrm{E}\rVert_\alpha < \delta\). Then, the value of the function at this perturbed input is

$$ f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E}) = \mathrm{X}(\mathrm{A}+ \mathrm{E})\mathrm{X}^{-1} = f_{\mathrm{X}}(\mathrm{A}) + f_{\mathrm{X}}(\mathrm{E}) $$

thus, the norm of the error \(\mathrm{B}\, - f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})\) is given by

$$ \lVert\mathrm{B}\, - f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})\rVert_\alpha = \lVert\mathrm{X}\mathrm{E}\mathrm{X}^{-1}\rVert_\alpha \leq \kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha $$

therefore, the condition number \(\kappa_\alpha\) serves as a measure of continuity for the similarity operation \(f_\mathrm{X}(\cdot)\). If the condition number is small, the matrix is said to be “well-behaved”, so that for small input errors \(\mathrm{E}\) the output is only perturbed by at most the small amount \(\kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha\). If the condition number is large, even small input errors can cause large deviations in the output.

Orthogonal matrices are stable

The norm induced by setting \(\alpha = 2\) reveals that orthogonal matrices (the matrices \(\mathrm{Q}\) for which \(\mathrm{Q}^\mathsf{T}\mathrm{Q} = \mathrm{I}\)) are computationally very stable. Indeed, since \(\lVert\mathrm{Q}x\rVert_2 = \lVert x\rVert_2\) for any \(x\), we have \(\kappa_2(\mathrm{Q}) = 1\), and the perturbation created by the similarity \(\mathrm{A}\mapsto \mathrm{Q}\mathrm{A}\mathrm{Q}^\mathsf{T}\) is equal to \(\lVert\mathrm{Q}\mathrm{E}\mathrm{Q}^\mathsf{T}\rVert_2 = \lVert\mathrm{E}\rVert_2\).

These transformations introduce no extra perturbations!
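A one-line check of this claim (a random orthogonal \(\mathrm{Q}\) obtained from a QR factorization and a random small perturbation \(\mathrm{E}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # a random orthogonal matrix
E = 1e-3 * rng.normal(size=(5, 5))               # a small perturbation

print(np.linalg.norm(Q @ E @ Q.T, 2), np.linalg.norm(E, 2))   # equal: kappa_2(Q) = 1
```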

Stability of a time-varying linear system

Consider the time-varying \(n\times n\) system \(\mathrm{A}(t)\, x(t) = b(t)\). Here, the notion of “stable” is different: a time-varying linear system is unstable if short time intervals cause large variations in the state \(x\). A stable system therefore has a small relative variation \(\lVert \dot{x} \rVert / \lVert x \rVert\).

Differentiating this equality yields

$$ \dot{x}(t) =\mathrm{A}(t)^{-1}\dot{b}(t)\, - \mathrm{A}(t)^{-1}\dot{\mathrm{A}}(t)\,x(t) $$

Therefore the stability of the system can be upper-bounded (dropping the \(t\) for simplicity)

$$ \dfrac{\lVert\dot{x}\rVert}{\lVert x \rVert} \leq \kappa(\mathrm{A}) \left(\dfrac{\lVert \dot{b} \rVert}{\lVert \mathrm{A}\rVert \lVert x\rVert } + \dfrac{\lVert \dot{\mathrm{A}}\rVert}{\lVert \mathrm{A}\rVert}\right) $$

therefore, a system whose matrix has a small condition number at time \(t\) (and whose \(\dot{b}\) and \(\dot{\mathrm{A}}\) remain moderate) will stay relatively close to its current position in the near future.

This post is about the Gram-Schmidt procedure which transforms a set of vectors \(u_1, \dots, u_n\) into a new set of vectors \(v_1, \dots, v_n\) which are orthonormal (ON) for some inner product \(\langle\cdot, \cdot \rangle\). This process is purely algebraic and can be performed over any Hilbert space, no matter its complexity: vectors, matrices, and functions alike can be orthonormalized.

Deriving the algorithm

The first step

Suppose we want to achieve two things: we should have \(v_i \perp v_j\) for \(i \neq j\) and \(\lVert v_i\rVert = 1\) for every \(v_i\) in the new collection. Set \(v_1 = u_1 / \lVert u_1\rVert\) so that the second criterion is met. The simplest and most intuitive way to transform a set of vectors \(\{u_i\}_i\) into another \(\{v_i\}_i\) is by a linear map. Let us try to construct \(v_2\) as a linear combination of \(u_2\) and \(v_1\): \(v_2 = a\,u_2 + bv_1\). We must have

\begin{align} \langle v_1, v_2\rangle & = a\langle v_1, u_2\rangle + b\langle v_1, v_1\rangle \\ & = a\langle v_1, u_2\rangle + b = 0 \end{align}

So \(b\) must be proportional to \(-\langle v_1, u_2\rangle\), the projection of the new vector \(u_2\) onto \(v_1\): \(v_2 = a\left(u_2 - \langle v_1, u_2\rangle v_1\right)\). Further, if we want \(\langle v_2, v_2\rangle = 1\), expanding gives \(\langle v_2, v_2\rangle = a^2\left(\lVert u_2\rVert^2 - \langle v_1, u_2\rangle^2\right)\), so we must set \(a = 1/\lVert w_2\rVert\) with \(w_2 \triangleq u_2 - \langle v_1, u_2\rangle v_1\), and thus $$b = -a\langle v_1, u_2\rangle =\; - \frac{\langle v_1, u_2\rangle}{\lVert w_2 \rVert}$$

Iterating the process

Suppose we have applied the process \(i-1\) times, so that we have an orthonormal family of \(i-1\) vectors \(\{v_j\}_j\) and a new vector \(u_i\). We wish to find a linear combination again:

$$ v_i = \gamma_0u_i + \sum_{j=1}^{i-1}\gamma_{j}v_j $$

If we solve for \(1 \leq k \leq i-1\) the equation \(\langle v_i, v_k\rangle = 0\), because we have \(\langle v_j, v_k\rangle = 0\) for any \(j\neq k\), we obtain that

$$ \langle v_i, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k\langle v_k, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k = 0 $$

this condition, together with the unit norm constraint, gives solutions

$$ \begin{cases} & \gamma_0 = \dfrac{1}{\lVert w_i \rVert}\,,\qquad w_i \triangleq u_i - \sum_{j=1}^{i-1}\langle u_i, v_j\rangle v_j \\ & \\ & \gamma_k =\; -\gamma_0\langle u_i, v_k\rangle =\; -\dfrac{\langle u_i, v_k\rangle}{\lVert w_i \rVert} \end{cases} $$ we thus obtain the Gram-Schmidt algorithm at step \(i\):

$$ v_i \gets \dfrac{w_i}{\lVert w_i \rVert} = \dfrac{1}{\lVert w_i \rVert}\left(u_i -\,\sum_{j=1}^{i-1}\langle u_i, v_j\rangle\,v_j\right) $$
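A direct implementation of this update in plain numpy (Euclidean inner product; the handling of dependent columns, which produce a zero residual, is an assumption of this sketch and is discussed in the last section):

```python
import numpy as np

def gram_schmidt(U, eps=1e-12):
    """Orthonormalize the columns of U via v_i = w_i / ||w_i||,
    with w_i = u_i - sum_j <u_i, v_j> v_j. Dependent columns give w_i = 0."""
    U = np.asarray(U, dtype=float)
    V = np.zeros_like(U)
    for i in range(U.shape[1]):
        w = U[:, i].copy()
        for j in range(i):
            w -= np.dot(U[:, i], V[:, j]) * V[:, j]   # subtract projections onto earlier v_j
        norm = np.linalg.norm(w)
        V[:, i] = w / norm if norm > eps else 0.0     # zero vector once dependence appears
    return V

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Q = gram_schmidt(A)
print(np.allclose(Q.T @ Q, np.eye(5)))    # the columns are orthonormal
R = Q.T @ A                               # upper triangular, anticipating the next section
print(np.allclose(Q @ R, A), np.allclose(R, np.triu(R)))
```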

The QR decomposition

Assume the vectors \(u_1, \dots, u_n\) are the columns \(a_j\in\mathbb{R}^n\) of a matrix \(\mathrm{A} = \left[ a_{ij} \right]_{i,j = 1}^n\). We can apply the algorithm to the vectors \(a_j\) with the inner product \(\langle x, y \rangle \triangleq x^\mathsf{T} y\). Denote \(q_1, \dots, q_n\) the ON family obtained. We can then invert the Gram-Schmidt equations to get

\begin{align} a_i = \sum_{j=1}^i \langle a_i, q_j \rangle q_j = \sum_{j=1}^i r_{ji}q_j = \left[\mathrm{QR}\right]_i \end{align} with \(\mathrm{Q}\) the orthogonal matrix whose columns are the \(q_j\) and \(\mathrm{R} = [r_{ji}] = [\langle a_i, q_j \rangle]\) upper triangular, as \(r_{ji} = 0\) for \(j > i\).

Linear dependence

Finally, notice that if we have linearly dependent vectors in the original family \(\{u_i\}_i\), the residual \(w_i\) vanishes for every \(u_i\) that is linearly dependent on its predecessors, and the corresponding \(v_i\) can be set to \(0\) (or simply discarded). If the first \(p\) vectors are independent and \(\{u_i\}_{i=1}^n\) has rank \(p\), then the ON family is

$$\{v_j\}_{j=1}^n = \{ v_1, \dots, v_p, 0, \dots, 0 \}$$

Suppose \(\langle\cdot, \cdot\rangle\) is any real-valued inner product, be it on a space \(V\) of vectors, matrices, continuous functions, or even Sobolev spaces. We can prove using simple algebra that we always have

$$ \lvert \langle\phi, \psi\rangle \rvert \leq \lVert \phi \rVert\,\lVert \psi \rVert$$

for any two elements \(\phi, \psi\) in \(V\). The norm \(\lVert \cdot \rVert\) is the usual inner-product associated norm \(\lVert \phi \rVert = {\langle\phi,\phi\rangle}^{1 / 2}\).

The algebraic lemma

It is well-known that the quadratic \(q(x) \triangleq ax^2 + bx + c\) with \(a \neq 0\) can be factorized as

$$ q(x) = a\left(x + \dfrac{b}{2a}\right)^2 - \dfrac{b^2 - 4ac}{4a}$$

therefore, the equation \(q(x) = 0\) has no real solution if and only if \(b^2 - 4ac < 0\).

The proof

Let us assume \(\psi \neq 0\) and \(\phi \neq \lambda\psi\) for every \(\lambda\), and expand \(\langle\phi - \lambda\psi, \phi - \lambda\psi\rangle > 0\); we get the inequality

$$ \lambda^2\langle\psi, \psi\rangle - 2\lambda\langle\phi, \psi\rangle + \langle\phi, \phi\rangle > 0 $$ which is a strictly positive quadratic in the variable \(\lambda\). Using our lemma (with \(a = \langle\psi, \psi\rangle > 0\)), we therefore must have $$4\langle\phi, \psi\rangle^2 < 4\langle\psi, \psi\rangle\langle\phi, \phi\rangle$$ this yields the Cauchy-Schwarz inequality, which turns into an equality when \(\phi = \lambda\psi\).
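A quick numerical check with the Frobenius inner product on matrices (an arbitrary choice of inner-product space, just to stress that the proof never used coordinates):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi, Psi = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

inner = np.sum(Phi * Psi)                      # Frobenius inner product <Phi, Psi>
norm = lambda M: np.sqrt(np.sum(M * M))        # the associated norm
print(abs(inner) <= norm(Phi) * norm(Psi))     # Cauchy-Schwarz holds
print(np.isclose(abs(np.sum(Phi * (2.0 * Phi))), norm(Phi) * norm(2.0 * Phi)))  # equality when Psi = lambda * Phi
```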

This post aims at gathering different inequalities pertaining to the product \(\mathrm{A}x\) where \(\mathrm{A}\) is an \(m\times n\) matrix, and \(x\in\mathbb{R}^n\) is a vector. We will decompose the matrix \(\mathrm{A}\) in terms of its column vectors \(a_j\) and its row vectors \(\tilde{a}_i^\mathsf{T}\):

$$ \mathrm{A} = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix} = \begin{bmatrix} \tilde{a}_1^\mathsf{T} \\ \vdots \\ \tilde{a}_m^\mathsf{T} \end{bmatrix} $$

Euclidean norm

The quantity $$\lVert\mathrm{A}x \rVert_2^2 = \sum_{i=1}^m \left(\tilde{a}_i^\mathsf{T}x\right)^2$$ can be bounded by applying the Cauchy-Schwarz inequality to each term \(\tilde{a}_i^\mathsf{T}x\), obtaining

$$\lVert\mathrm{A}x \rVert_2^2 \leq \left(\sum_{i=1}^m \lVert\tilde{a}_i\rVert_2^2\right)\, \lVert x\rVert_2^2 = \lVert \mathrm{A} \rVert_{\mathrm{F}}^2 \, \lVert x\rVert_2^2 $$

thus there is an \(M = \lVert \mathrm{A} \rVert_{\mathrm{F}}\) such that \( \lVert\mathrm{A}x \rVert_2 \leq M\lVert x\rVert_2\).

This means all matrices are continuous & bounded linear maps for the \(l_2\) norm.

Manhattan norm

Suppose we now measure both the input and output spaces using the \(l_1\) norm. The quantity $$\lVert\mathrm{A}x \rVert_1 = \sum_{i=1}^m \lvert\tilde{a}_i^\mathsf{T}x\rvert = \sum_{i=1}^m \lvert\sum_{j=1}^n a_{ij}x_j\rvert $$ can be upper-bounded using the finite-sum generalization of the scalar inequality \(\lvert a + b \rvert \leq \lvert a\rvert + \lvert b \rvert\). Denoting by \(\lVert \mathrm{A} \rVert_1\) the sum of the unsigned entries \(\lvert a_{ij} \rvert\) (an entrywise quantity, not the induced \(l_1\) norm), we have the inequality:

$$ \lVert\mathrm{A}x \rVert_1 \leq \lVert\mathrm{A} \rVert_1 \lVert x \rVert_1$$

we can conclude similarly to the first section.

Infinity norm

Assume we now use the infinity norm \(\lVert \cdot \rVert_\infty\); we can upper bound the output \(\mathrm{A}x\) as follows:

\begin{align} \lVert \mathrm{A}x \rVert_\infty = \max_i\lvert \tilde{a}_i^\mathsf{T} x\rvert &= \max_i \left\lvert \sum_{j=1}^n a_{ij}x_j\right\rvert \\ & \leq \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert \left\lvert x_j\right\rvert \\ & \leq \lVert x\rVert_\infty \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert = M\,\lVert x\rVert_\infty \end{align}

we thus obtain that any linear operator is bounded and continuous for \(l_\infty\) in finite dimensions.
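The three bounds are easy to check numerically (random matrix and vector; \(\lVert\mathrm{A}\rVert_1\) below is the entrywise sum used above, not numpy's induced 1-norm):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
x = rng.normal(size=5)

# l2: ||Ax||_2 <= ||A||_F ||x||_2
print(np.linalg.norm(A @ x) <= np.linalg.norm(A, 'fro') * np.linalg.norm(x))

# l1: ||Ax||_1 <= (sum of |a_ij|) ||x||_1, with the entrywise sum defined above
print(np.linalg.norm(A @ x, 1) <= np.abs(A).sum() * np.linalg.norm(x, 1))

# l_infinity: ||Ax||_inf <= (max row sum of |a_ij|) ||x||_inf
print(np.linalg.norm(A @ x, np.inf) <= np.abs(A).sum(axis=1).max() * np.linalg.norm(x, np.inf))
```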

Duality

To be added.
