Shae's Ramblings

Convex sum of diracs on a bounded subset of the real line

April 4, 2021

In this post, we aim to prove that any continuous density $x\mapsto p(x)$ over a bounded subset $\mathcal{X} \subset [a,b]$ can be approximated by a convex combination of deltas $\delta_{x_i}$: there is always a $N$ for which we have with $\varepsilon$-accuracy that $p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)$ for all $n > N$.

Because any bounded subset in $\mathbb{R}$ can be finitely covered by closed intervals $[a_i, b_i]$, we can prove our result for each interval and use the fact that for any $c,d\in\mathcal{X}$ bounded, the integral is equal to

$$ \int_c^dp(x)\mathrm{d}x = \sum_{i=1}^n\int_{a_i}^{b_i}p(x)\mathrm{d}x $$

Thus to simplify the problem, we can assume $\mathcal{X} = [a,b]$ some interval.

Weak convergence

For continuous distributions $f,g$ on $\mathcal{X}$ define the inner product

$$ \langle f,g\rangle \triangleq \int_\mathcal{X}f(x)g(x)\,\mathrm{d}x $$ We say a sequence $f_n$ converges weakly to a limit $f_\star$ if for every distribution $g$ on $\mathcal{X}$ we have that $\langle f_n, g\rangle \to \langle f_\star, g\rangle$ as $n\to\infty$.

The proof

We can use Riemann sums to prove the statement. First, notice that because $\int p(x)\mathrm{d}x = 1$, we can for any $n$ find a partition $a = x_0 < x_1 < \dots < x_n$ with gaps $\Delta x_i = x_{i+1} – x_i$ so that

$$ \sum_{i=0}^{n-1}p(x_i^\star)\Delta x_i = 1 + u_n $$ with $u_n \to 0$ as $n$ tends to infinity. We can assume that $u_n \leq 0$ by taking $p(x_i^\star) = \inf\{p(x)\mid x_i \leq x \leq x_{i+1}\}$ which is attained on a closed interval by the continuity of $p$. We can therefore for any $n$ construct the following $\gamma_i$:

$$ \gamma_i \triangleq p(x_i^\star)\Delta x_i – u_n/n $$ so that we always have $\gamma_i \geq 0$ and $\sum_{i=0}^{n-1}\gamma_i = 1 + u_n – u_n = 1$.

Finally, consider the convex sum $p_n = \sum_{i=0}^{n-1} \gamma_i \delta_{x_i^\star}$, we have

$$ \langle p_n , f\rangle = \sum_{i=0}^{n-1} \gamma_i f(x_i^\star) = \sum_{i=0}^{n-1}\left( p(x_i^\star)\Delta x_i – u_n/n \right) f(x_i^\star) $$ which expands into

$$ \sum_{i=0}^{n-1} \gamma_i g(x_i^\star) = \sum_{i=0}^{n-1}p(x_i^\star)f(x_i^\star)\Delta x_i – \dfrac{u_n}{n}\sum_{i=0}^{n-1}f(x_i^\star) $$ which converges to $\langle p, f \rangle$ because $\dfrac{u_n}{n}\sum_{i=0}^{n-1} f(x_i^\star) \to 0$ from the bounded nature of $f$.

Jensen's inequality: going to the limit

April 4, 2021

This post is about a proof of Jensen's Inequality, and more specifically a way to go from finite spaces to countable and uncountable spaces.

Finite case

Recall that a function $f$ is convex on $[a,b]$ iff the inequality

$$ f(ta + (1-t)b) \leq tf(a) + (1-t)f(b) $$

holds for any real $t$ between $0$ and $1$. We can easily generalize this relationship to any convex sum of points (a convex sum is a weighted sum $\sum_{i=1}^n \gamma_ia_i$ for which $\gamma_i \geq 0$ and $\sum_{i=1}^n \gamma_i = 1$) using induction on $n$. The above inequality serves as the initialization($n$ equal to $1$ and $2$). Assuming the inequality holds for $n-1$ points we can notice that

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) = f\left(\gamma_na_n + \sum_{i=1}^{n-1} \gamma_ia_i\right) = f( ta + (1-t)b) $$

if we set $t = \gamma_n$, $a = a_n$, $b = \sum_{i=1}^{n-1} \gamma_i'a_i = \sum_{i=1}^{n-1} \frac{\gamma_i}{1-\gamma_n} a_i $

thus, by the convexity of $f$, we have

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \gamma_n f(a_n) + (1 – \gamma_n)f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) $$

however, since the $\gamma_i'$ form a convex combination of $n-1$ points, we can perform the induction step: $f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) \leq \sum_{i=1}^{n-1} \gamma_i'f(a_i)$. Thus we obtain that $f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \sum_{i=1}^n \gamma_if(a_i)$ for any finite convex combination $\{ (\gamma_i,a_i) \}_{i=1}^n$.

Countable case

A countable collection of points $(a_n)_{n\in\mathbb{N}}$ can be combined in a convex fashion using any positive sequence $(\gamma_n)_{n\in\mathbb{N}}$ such that $\sum_{i=1}^\infty \gamma_i= \lim_{n\to\infty}\sum_{i=1}^n \gamma_i = 1$. This implies that the coefficients vanish at infinity: $\gamma_i \to 0$ as $i \to \infty$.

Assuming both $\sum_{i=1}^n \gamma_i a_i$ and $\sum_{i=1}^n \gamma_i f(a_i)$ converge, and assuming $f$ is continuous, we can take limits on both sides of our inequality to yield:

$$ f\left(\sum_{i=1}^\infty \gamma_ia_i\right) \leq \sum_{i=1}^\infty \gamma_if(a_i) $$

Continuous case

Assuming we have a distribution $p$ defined on the real line $\mathbb{R}$, the integral $$\mathbb{E}_p[f(x)] = \int_\mathcal{X}f(x)p(x)\,\mathrm{d}x$$ can be approximated using countable sums. Indeed, the function $p(x)$ can be approximated using a convex combination of dirac deltas: $p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)$ (formally, convex combinations of diracs are weakly dense in the space of distributions). We have that $\mathbb{E}_{p_n}[f(x)] \to \mathbb{E}_{p}[f(x)]$ as $n \to \infty$ and

$$ f\left(\mathbb{E}_{p_n}[x]\right) = f\left(\sum_{i=1}^n\gamma_i x_i\right) \leq \sum_{i=1}^n\gamma_i f(x_i) = \mathbb{E}_{p_n}[f(x)] $$ and we can let $n$ tend to $\infty$ on each side of the inequality to conclude.

Evidence lower-bound: two derivations

April 4, 2021

This post is about a fundamental inequality in variational inference.

Intuition

In order to fit a model $p_\theta(\cdot)$ to datapoints $\mathrm{x}_1, \dots \mathrm{x}_n$ one has to compute the marginal likelihood $p_\theta(\mathrm{x})$ of the data (also called the evidence ) under the parameter $\theta$. In variational methods, one usually models a generative process $p_\theta(\mathrm{x}\mid \mathrm{z})$ which maps a latent $\mathrm{z}$ into an estimate of the data $\mathrm{x}$. If one has a simple distribution $p(\mathrm{z})$ over the latents, then one has the equality:

$$ p_\theta(\mathrm{x}) = \int_\mathcal{Z}p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} $$

therefore, one can approximate the marginal likelihood by sampling points $\mathrm{z}_1, \dots, \mathrm{z}_m$ from $p(\mathrm{z})$ and using monte-carlo integration:

$$p_\theta(\mathrm{x}) \approx \dfrac1m \sum_{i=1}^mp_\theta(\mathrm{x}\mid \mathrm{z}_i)$$

However, picking the $\mathrm{z}_i$ at random will result in poor estimates, as a random $\mathrm{z}_i$ is unlikely to have generated $\mathrm{x}$ (that is, the value $p_\theta(\mathrm{x}\mid \mathrm{z}_i)$ will be low). The best way to approximate the integral is to pick $m$ latents which maximise the above sum.

Thus, instead of sampling from $p(\mathrm{z})$ which is a fixed, simple and suboptimal distribution, we wish to sample from a parametric distribution $\mathrm{z} \sim q_\phi(\mathrm{z}\mid \mathrm{x})$ called an encoder. This name comes from the fact that the datapoints $\mathrm{x}$ are often high-dimensional, and the latent variables (called codes) are much simpler: $\dim(\mathrm{x}) \gg \dim(\mathrm{z})$. The reason we wish to sample from $q_\phi$ is that by optimizing over the parameters $\phi$, we can put more mass on codes $\mathrm{z}^\star$ that result in high likelihoods $p_\theta(\mathrm{x}\mid \mathrm{z}^\star)$.

The variational equality

The Kullback-Leibler divergence\).

As we sample from $q_\phi$, we should compute

\begin{align} \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) &= \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) – \log p_\theta(\mathrm{z}\mid \mathrm{x})] \\ & = \log p_\theta(\mathrm{x}) + \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) – \log p_\theta(\mathrm{x}\mid \mathrm{z}) – \log p(\mathrm{z})] \\ & = \log p_\theta(\mathrm{x}) + \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \; – \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] \end{align}

Therefore, denoting

\begin{align} E(\theta, \phi, \mathrm{x}) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) \\ \Omega(\phi) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

we have the equality \begin{align} \log p_\theta(\mathrm{x})\; – E(\theta, \phi, \mathrm{x}) = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] – \Omega(\phi) \end{align}

which reads: “ Evidence $-$ error $=$ approximation $+$ regularizer “. The regularizer term $\Omega(\phi)$ makes sure our new distribution $q_\phi$ is not too far from a simple distribution $p(\mathrm{z})$, and the $E(\cdot)$ term quantifies the amount of error we incurred when sampling from $q_\phi$ instead of $p$ to estimate the likelihood $\log p_\theta(\mathrm{x}\mid \mathrm{z})$. The regularizer includes a minus sign because we wish to minimize $\Omega(\phi)$ while maximizing the approximate likelihood.

A first derivation of the inequality

Because the KL-divergence is always positive, we must that that $E(\cdot)$ is positive, and thus

$$\log p_\theta(\mathrm{x}) \geq \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] + \Omega(\phi) \triangleq \mathcal{L}(\theta, \phi, \mathrm{x}) $$

Thus, to estimate an intractable evidence $\log p_\theta(\mathrm{x})$, we can optimize the evidence lower bound $\mathcal{L}(\theta, \phi, \mathrm{x})$ over $\phi$ to tighten the bound so that $\mathcal{L}(\theta, \phi, \mathrm{x}) \approx \log p_\theta(\mathrm{x})$.

The distributions $p(\mathrm{z})$ and $p_\theta(\mathrm{x}\mid\mathrm{z})$ are often chosen to be multivariate gaussians to obtain a closed form for $\Omega(\phi)$. The expectation term $\mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]$ is often estimated using monte-carlo methods.

A second derivation

A second derivation, found in this famous paper on normalizing flows uses Jensen's inequality. Indeed, by the concavity of the logarithm, we have for any distribution $p$ the inequality $\log(\mathbb{E}_p[Y]) \geq \mathbb{E}_p[\log(Y)]$. In particular, we can apply this inequality to derive the inequality through the steps

\begin{align} \log p_\theta(\mathrm{x}) &= \log\int_\mathcal{Z} p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &= \log\int_\mathcal{Z} \dfrac{q_\phi(\mathrm{z}\mid \mathrm{x})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\, p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &\geq \int_\mathcal{Z} q_\phi(\mathrm{z}\mid \mathrm{x})\log\dfrac{p_\theta(\mathrm{x}\mid \mathrm{z}) p(\mathrm{z})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\, \,\mathrm{dz} \\ & = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]\; – \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

This proof is elegant but does not give an analytic expression for the jensen gap $\log p_\theta(\mathrm{x}) – \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] + \Omega(\phi)$. This gap has the exact form $E(\cdot)$ derived in the previous section.

Expectation of a quadratic form: the trace trick

March 31, 2021

Suppose one wants to compute the expected value $\mathbb{E}_X[X]$ of a quadratic form $f(x) \triangleq x^\mathsf{T}\mathrm{A} x$. Here $x$ is a $n$ dimensional vector and $\mathrm{A}$ a $n\times n$ matrix. This post introduces a popular trick to compute the expectation of the quadratic form $f$.

Trace definition and property

Recall the definition of the trace for a matrix $\mathrm{X} = [\mathrm{X}_{ij}]_{i,j = 1}^n$: the trace is the sum of the diagonal elements $$ \mathrm{Tr}[\mathrm{X}] = \sum_{i=1}^n \mathrm{X}_{ii} $$ from which it is easy to derive that $\mathrm{Tr}[\cdot]$ is linear. The trace of a $1\times 1$ matrix is simply the value of its only element $\mathrm{X}_{11}$. One easily finds, using any usual definition for integrals, that as a linear map one has

$$\int \mathrm{Tr}[g(\mathrm{X})]\,\mathrm{dX} = \mathrm{Tr}\left[ \int g(\mathrm{X})\,\mathrm{dX}\right] $$

Another important property is the cyclic nature of the trace: $\mathrm{Tr}[\mathrm{XY}] = \mathrm{Tr}[\mathrm{YX}]$ which can be generalized to three matrices $$ \mathrm{Tr}[\mathrm{XYZ}] = \mathrm{Tr}[\mathrm{ZXY}] = \mathrm{Tr}[\mathrm{YZX}] $$ or any finite product of $m$ matrices.

The trace trick

Suppose that $X$ is a random variable distributed according to the density $p$. Suppose the integral $\Sigma = \mathrm{Cov}[X,X] = \mathbb{E}[XX^\mathsf{T}]$ exists. The expectation of the quadratic $f$ is then equal to $\mathrm{Tr}[\mathrm{A}\Sigma]$.

To prove this, we first use the fact that $f(x)$ is a scalar, and therefore is equal to $\mathrm{Tr}[f(x)]$. Then, by the cyclic property of the trace $f(x) = \mathrm{Tr}[\mathrm{A}XX^\mathsf{T}]$. Finally, as the trace is linear, the trace of the expectation is equal to the expectation of the trace: $$ \mathbb{E}_X[\mathrm{Tr}[\mathrm{A}XX^\mathsf{T}]] = \mathrm{Tr}[\mathrm{A}\mathbb{E}_X[XX^\mathsf{T}]] $$

which concludes the proof.

Approximating the differential entropy

March 26, 2021

This post summarizes the main ideas of a method to approximate the entropy of a continuous random variable $X$. Recall the definition of $h(X)$:

$$ h(X) \triangleq \;-\int_\mathcal{X} p(x)\log p(x)\,\mathrm{d}x $$

Maximum entropy

If we have information about the distribution $p$ in the form of equality contraints

$$ \mathrm{E}_p[G_i(X)] = \int_\mathcal{X} p(x)G_i(x)\,\mathrm{d}x = c_i\,,\qquad i\in \{1, \dots, n \} $$

then the distribution $\hat{p}$ which maximizes the entropy $h$ while respecting the above constraints must be of the form:

$$ p_0(x) = A\,\exp\left(\sum_{i=1}^n a_iG_i(x)\right) $$

where $A, a_i$ are constants determined by the $c_i$ which are hard to compute.

Simplifying the candidate distribution

Let $\phi$ be the standard normal distribution. The authors of the paper assume the candidate distribution $p$ is not far from $\phi$. Thus, they normalize the data and put extra constraints $G_{n+1}(x) = x$ with $c_{n+1} = 0$ and $G_{n+2}(x) = x^2$ with $c_{n+2} = 1$. They further assume the constraint functions $G_j$ are orthonormal with respect to the inner product $\langle f,g \rangle \triangleq \mathrm{E}_{\phi}[fg]$ which can be obtained through the Gram-Schmidt algorithm.

Near gaussianity implies that the coefficients $a_i$ in the above expression for $f_0$ are near zero for $i \leq n$ compared to $a_{n+2} \approx -\frac12$ since

\begin{align} p_0(x) &= A\,\exp\left(\sum_{i=1}^{n+2} a_iG_i(x)\right) \\ &= A\,\exp\left(a_{n+2}x^2 + \sum_{i=1}^n a_iG_i(x)\right) \\ & \approx \frac{1}{\sqrt{2\pi}}\exp( -\frac12x^2 ) \end{align}

therefore as $\delta = \sum_{i=1}^n a_iG_i(x) \approx 0$ we can expand $p_0$ as ( related to Edgeworth expansions, more on this later)

$$ p_0(x) \approx \phi(x)\left(1 + \sum_{i=1}^nc_iG_i(x)\right) $$

To be added.

SVD when everything goes right

March 25, 2021

This post is about a derivation of the Singular Value Decomposition of a matrix $\mathrm{A}$ when several conditions are met. These conditions are quite restrictive, but make derivations very easy.

Symmetric PD matrices

Let us begin with a symmetric matrix $\mathrm{S}\in\mathbb{R}^{n\times n}$ which is assumed positive-definite with $n$ distinct eigenvalues $\{\lambda_i\}_i$. It is easy to prove that the eigenvectors $v_i$ are orthogonal :

$$0 < v_i^\mathsf{T}\mathrm{S}\,v_i = \lambda_i\lVert v_i \rVert^2 $$

gives that all eigenvalues are strictly positive, but we have for $i \neq j$

$$ \lambda_i\,v_i^\mathsf{T}v_j = (\mathrm{S}v_i)^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}v_j = \lambda_j\,v_i^\mathsf{T}v_j$$

because $\mathrm{S} = \mathrm{S}^\mathsf{T}$. As $\lambda_i \neq \lambda_j$ we must have $v_i^\mathsf{T}v_j = 0$. A symmetric PD matrix $\mathrm{S}$ can therefore be decomposed as $\mathrm{S} = \mathrm{Q}\mathrm{\Lambda}\mathrm{Q}^\mathsf{T}$ with $\mathrm{Q}$ orthogonal and $\mathrm{\Lambda}$ diagonal with positive entries.

Eigendecomposition of a product

Suppose $\mathrm{A}$ is square, and further that the matrices $\mathrm{A}^\mathsf{T} \mathrm{A}$ and $\mathrm{A}\mathrm{A}^\mathsf{T}$ have distinct eigenvalues. Then, by our derivation in the previous section, we have that each of the two matrices can be decomposed as

$$ \mathrm{A}^\mathsf{T} \mathrm{A} = \mathrm{U}\mathrm{\Sigma}_1\mathrm{U}^\mathsf{T} \qquad\mathrm{and}\qquad \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} $$

because these two matrices are always symmetric and positive-definite. Further, the eigenvalues $\sigma^1_i$ and $\sigma^2_i$ stored in $\mathrm{\Sigma}_1$ and $\mathrm{\Sigma}_2$ are equal: if $\mathrm{A}^\mathsf{T} \mathrm{A}u_i = \sigma^1_i u_i$ then

$$\mathrm{A}\mathrm{A}^\mathsf{T} (\mathrm{A}u_i) = \sigma^1_i (\mathrm{A}u_i) = \sigma^2_j v_j\qquad\text{for some}\;j $$

We can therefore always rearrange $\mathrm{\Sigma}_1$ into $\mathrm{\Sigma}_2$ by pre and post multiplying by a permutation: $\mathrm{\Sigma}_1 = \mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}$. By applying the same permutation to the matrix of eigenvectors we keep the decomposition

$$ \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} = (\mathrm{V}\mathrm{P}^\mathsf{T})\mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}(\mathrm{V}\mathrm{P}^\mathsf{T})^\mathsf{T} = \tilde{\mathrm{V}}\mathrm{\Sigma}_1\tilde{\mathrm{V}}^\mathsf{T}$$

we can therefore assume $\mathrm{\Sigma}_1 = \mathrm{\Sigma}_2 = \mathrm{\Sigma}$ for some appropriate orthogonal matrices $\mathrm{U}, \mathrm{V}$.

Singular Value Decomposition

Because the eigenvalues stored in $\mathrm{\Sigma}$ are strictly positive, we can take their square root $\sqrt{\sigma_i}$ and store them in a new matrix $\tilde{\mathrm{\Sigma}} = \mathrm{\Sigma}^{1 / 2}$ such that $$\tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}^\mathsf{T}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}}^\mathsf{T} = \mathrm{\Sigma} $$

the trick behind the SVD is to construct a specififc $\mathrm{V}$ from the $\mathrm{U}$-decomposition. Consider the equalities

\begin{align} \mathrm{A}\mathrm{A}^\mathsf{T} &= \mathrm{V}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\tilde{\mathrm{\Sigma}}^{-1}\mathrm{U}^\mathsf{T} \mathrm{A}^\mathsf{T}\right) \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)^\mathsf{T} \end{align}

therefore, by setting $\mathrm{V} \triangleq \mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}$ we can do two things: orthogonally diagonalize the matrix $\mathrm{A}\mathrm{A}^\mathsf{T}$ and ensure that $\mathrm{A} = \mathrm{U}^\mathsf{T}\tilde{\mathrm{\Sigma}}\mathrm{V}$. The matrices $\mathrm{U}, \mathrm{V}$ are often named differently, so that the SVD decomposition of $\mathrm{A}$ is

$$ \mathrm{A} = \mathrm{U}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} $$ for some unitary matrices $\mathrm{U}, \mathrm{V}$ and a positive diagonal matrix $\mathrm{\Sigma}$.

Condition number of a matrix: basics

March 24, 2021

A special kind of norm for $n\times n$ square matrices $\mathrm{A}$ is called the subordinate (or induced) norm. If we have a norm $\lVert\cdot\rVert_\alpha$ on the input space, and a second norm $\lVert\cdot\rVert_\beta$ on the output space, the subordinate norm is

$$ \lVert\mathrm{A}\rVert_{\alpha, \beta} = \sup_{x \neq 0}\dfrac{\;\lVert\mathrm{A}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} $$

Submultiplicative property

An important property of these induced norms is that for any third norm $\lVert\cdot\rVert_{\gamma}$, we have:

$$ \lVert\mathrm{AB}\rVert_{\alpha, \beta} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

This is easily proven via the computations

$$ \dfrac{\;\lVert\mathrm{ABx}\rVert_{\beta}}{\;\lVert x\rVert_\alpha} = \dfrac{\;\lVert\mathrm{A}y\rVert_{\beta}}{\;\lVert y\rVert_\gamma} \dfrac{\;\lVert\mathrm{B}x\rVert_{\gamma}}{\;\lVert x\rVert_\gamma} $$ for $y = \mathrm{B}x$. As each supremum on the right is an upper bound, we get the inequality

$$ \dfrac{\;\lVert\mathrm{ABx}\rVert_{\beta}}{\;\lVert x\rVert_\alpha} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

we get the final equality because the supremum is the least upper bound. A special case is when taking $\alpha = \beta$, which yields that $$\lVert\mathrm{AB}\rVert_{\alpha, \alpha} \leq \lVert\mathrm{A}\rVert_{\alpha, \alpha} \lVert\mathrm{B}\rVert_{\alpha, \alpha} $$ we denote this norm $\lVert\cdot\rVert_\alpha$.

Condition number

The condition number $\kappa_\alpha(\mathrm{A})$ of a matrix is the quantity $$\kappa_\alpha(\mathrm{A}) = \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ which is larger than $1$ because $$ 1 = \lVert\mathrm{I}\rVert_\alpha = \lVert\mathrm{A}\mathrm{A}^{-1}\rVert_\alpha \leq \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ this number is large when either $\mathrm{A}$ or its inverse creates large distortions as the input varies on the unit circle.

Stability of a similarity

In linear algebra, a matrix $\mathrm{B}$ is said to be similar to $\mathrm{A}$ when there is an invertible transformation $\mathrm{X}$ such that $\mathrm{B} = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}$. In numerical analyses, it is often that we wish to know the robustness of an operation $f\colon x \mapsto f(x)$ with respect to a small perturbation of its input: if $y = f(x)$, what is $f(x+\varepsilon)$ ? For real functions, continuity tells us that as long as $\lvert \varepsilon \rvert < \delta$, the perturbed output $f(x+\varepsilon)$ is $\tilde{\varepsilon}$-close to $y$ for some small $\tilde{\varepsilon} > 0$. What is the equivalent bound on the error of the matrix function $f_{\mathrm{X}}(\mathrm{A}) = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}$ ? Suppose $\mathrm{A}$ is perturbed by a matrix $\mathrm{E}$ with small magnitude: $\lVert\mathrm{E}\rVert_\alpha < \delta$ Then, the value of the function at this perturbed output is

$$ f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E}) = \mathrm{X}(\mathrm{A}+ \mathrm{E})\mathrm{X}^{-1} = f_{\mathrm{X}}(\mathrm{A}) + f_{\mathrm{X}}(\mathrm{E}) $$

thus, the norm of the error $\mathrm{B}\, – f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})$ is given by

$$ \lVert\mathrm{B}\,– f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})\rVert_\alpha = \lVert\mathrm{X}\mathrm{E}\mathrm{X}^{-1}\rVert_\alpha \leq \kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha $$

therefore, the condition number $\kappa_\alpha$ serves as a measure of continuity for the similarity operation $f_\mathrm{X}(\cdot)$. If the condition number is small, the matrix is said to be “well-behaved”, so that for small input errors $\mathrm{E}$ the output is only perturbed at most by the small amout $\kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha$. If the condition number is large, even small input errors cause large deviations in the output.

Orthogonal matrices are stable

The norm induced by setting $\alpha = 2$ reveals that orthogonal matrices (the matrices $\mathrm{Q}$ for which $\mathrm{Q}^\mathsf{T}\mathrm{Q} = \mathrm{I}$) are computationally very stable. Indeed, the perturbation created by the similarity $\mathrm{A}\mapsto \mathrm{Q}\mathrm{A}\mathrm{Q}^\mathsf{T}$ is equal to $\lVert\mathrm{Q}\mathrm{E}\mathrm{Q}^\mathsf{T}\rVert_2 = \lVert\mathrm{E}\rVert_2$

These transformations introduce no extra perturbations !

Stability of a time-varying linear system

Consider the time-varying $n\times n$ system $\mathrm{A}(t)\, x(t) = b(t)$. Here, the notion of “stable” is different: a time-varying linear system is unstable if small durations cause important variations in the state $x$. A stable system has therefore a small relative variation $\lVert \dot{x} \rVert / \lVert x \rVert$.

Differentiating this equality yields

$$ \dot{x}(t) =\mathrm{A}(t)^{-1}\dot{b}(t)\, – \mathrm{A}(t)^{-1}\dot{\mathrm{A}}(t)\,x(t) $$

Therefore the stability of the system can be upper-bounded (dropping the $t$ for simplicity)

$$ \dfrac{\lVert\dot{x}\rVert}{\lVert x \rVert} \leq \kappa(\mathrm{A}) \left(\dfrac{\lVert \dot{b} \rVert}{\lVert \mathrm{A}\rVert \lVert x\rVert } + \dfrac{\lVert \dot{\mathrm{A}}\rVert}{\lVert \mathrm{A}\rVert}\right) $$

therefore, a system whose matrix has a small condition number at time $t$ will stay relatively close to its current position in the near future.

The Gram-Schmidt algorithm

March 24, 2021

This post is about the Gram-Schmidt procedure which transforms a set of vectors $u_1, \dots u_n$ into a new set of vectors $v_1, \dots, v_n$ which are orthonormal (ON) for some inner product $\langle\cdot, \cdot \rangle$. This process is purely algebraïc and can be performed over any hilbert space, no matter its complexity: vectors, matrices, functions are vectors that can be ortho-normalized.

Deriving the algorithm

The first step

Suppose we want to achieve two things: we should have $v_i \perp v_j$ and $\lVert v_i\rVert = 1$ for any $v_i, v_j$ in the new collection. Set $v_1 = u_1 / \lVert u_1\rVert$ so that the second criterion is met. The simplest and most intuitive way to transform a set of vectors $\{u_i\}_i$ into another $\{v_i\}_i$ is by a linear map. Let us try to construct $v_2$ as a linear combination of $u_2$ and $v_1$: $v_2 = a\,u_2 + bv_1$. We must have

\begin{align} \langle v_1, v_2\rangle & = a\langle v_1, u_2\rangle + b\langle v_2, v_2\rangle \\ & = a\langle v_1, u_2\rangle + b = 0 \end{align}

So $b$ must be proportional to $-\langle v_1, u_2\rangle$ the projection of the new vector $u_2$ onto $v_1$. Further, if we want that $\langle v_2, v_2\rangle = 1$ we must set $a= 1/\lVert u_2 \rVert$ and thus $$b = -a\langle v_1, u_2\rangle =\; – \frac{\langle v_1, u_2\rangle}{\lVert u_2 \rVert}$$

Iterating the process

Suppose we have applied the process $i-1$ times, so that we have an orthonormal family of $i-1$ vectors $\{v_j\}_j$ and a new vector $u_i$. We wish to find a linear combination again:

$$ v_i = \gamma_0u_i + \sum_{j=1}^{i-1}\gamma_{j}v_j $$

If we solve for $1 \leq k \leq i-1$ the equation $\langle v_i, v_k\rangle = 0$, because we have $\langle v_j, v_k\rangle = 0$ for any $j\neq k$, we obtain that

$$ \langle v_i, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k\langle v_k, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k = 0 $$

this condition, together with the unit norm constraint, gives solutions

$$ \begin{cases} & \gamma_0 = \dfrac{1}{\lVert u_i \rVert} \\ & \\ & \gamma_k =\; -\gamma_0\langle u_i, v_k\rangle =\; -\dfrac{\langle u_i, v_k\rangle}{\lVert u_i \rVert} \end{cases} $$ we thus obtain the Gram-Schmidt algorithm at step $i$:

$$ v_i \gets \dfrac{1}{\lVert u_i \rVert}\,u_i -\,\sum_{j=1}^{i-1}\dfrac{\langle u_i, v_j\rangle}{\lVert u_i \rVert}\,v_j $$

The QR decomposition

Assume the vectors $u_1, \dots u_n$ are the columns $a_j\in\mathbb{R}^n$ of a matrix $\mathrm{A} = \left[ a_{ij} \right]_{i,j = 1}^n$. We can apply the algorithm to the vectors $a_j$ with the inner product $\langle x, y \rangle \triangleq x^\mathsf{T} y$. Denote $q_1, \dots, q_n$ the ON family obtained. We can then revert the Gram-Schmidt equations to get

\begin{align} a_i = \lVert a_i\rVert\cdot\sum_{j=1}^i \langle a_i, q_j \rangle q_j &= \sum_{j=1}^i \lVert a_i\rVert\cdot\langle a_i, q_j \rangle q_j \\ &= \sum_{j=1}^i r_{ji}q_j = \left[\mathrm{QR}\right]_i \end{align} With $\mathrm{Q}$ a unitary matrix and $\mathrm{R}$ upper triangular, as $r_{ji} = 0$ for $j > i$.

Linear dependence

Finally, notice that if we have linearly dependent vectors in the original family $\{u_i\}_i$, the Gram-Schmidt update becomes $0$ after the rank of the original family is reached. If $\{u_i\}_{i=1}^n$ has rank $p$, then the ON family is

$$\{v_j\}_{j=1}^n = \{ v_1, \dots, v_p, 0, \dots, 0 \}$$

A well-known proof of cauchy-schwartz

March 23, 2021

Suppose $\langle\cdot, \cdot\rangle$ is any real-valued inner product, be it on a space $V$ of vectors, matrices, continuous functions, or even Sobolev spaces. We can prove using simple algebra that we always have

$$ \lvert \langle\phi, \psi\rangle \vert \leq \lVert \phi \rVert\,\lVert \psi \rVert$$

for any two elements $\phi, \psi$ in $V$. The norm $\lVert \cdot \rVert$ is the usual inner-product associated norm $\lVert \phi \rVert = {\langle\phi,\phi\rangle}^{1 / 2}$

The algebraïc lemma

It is well-known that the quadratic $q(x) \triangleq ax^2 + bx + c$ with can be factorized as

$$ q(x) = a\left(x + \dfrac{b}{2a}\right)^2 – \dfrac{b^2 – 4ac}{4a}$$

therefore, the equation $q(x) = 0$ has no solution if and only if $b^2 – 4ac < 0$.

The proof

Let us assume $\phi \neq \lambda\psi$ and compute $\langle\phi – \lambda\psi, \phi – \lambda\psi\rangle > 0$, we get the inequality

$$ \lambda^2\langle\psi, \psi\rangle – 2\lambda\langle\phi, \psi\rangle + \langle\phi, \phi\rangle $$ which is a strictly positive quadratic form in the variable $\lambda$. Using our lemma, we therefore must have $$4\langle\phi, \psi\rangle^2 < 4\langle\psi, \psi\rangle\langle\phi, \phi\rangle$$ this yields the cauchy-schwartz inequality which turns into an equality when $\phi = \lambda\psi$.

Matrix-vector product inequalities

March 22, 2021

This post aims at gathering different inequalities pertaining to the product $\mathrm{A}x$ where $\mathrm{A}$ is a $m\times n$ matrix, and $v\in\mathbb{R}^n$ is a vector. We will decompose the matrix $\mathrm{A}$ in terms of its column vectors $a_j$ and its row vectors $\tilde{a}_i^\mathsf{T}$:

$$ \mathrm{A} = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix} = \begin{bmatrix} \tilde{a}_1^\mathsf{T} \\ \vdots \\ \tilde{a}_m^\mathsf{T} \end{bmatrix} $$

Euclidean norm

The quantity $$\lVert\mathrm{A}x \rVert_2^2 = \sum_{i=1}^m \left(\tilde{a}_i^\mathsf{T}x\right)^2$$ can be decomposed using the cauchy-schwartz inequality to obtain

$$\lVert\mathrm{A}x \rVert_2^2 \leq \left(\sum_{i=1}^m \lVert\tilde{a}_i\rVert_2^2\right)\, \lVert x\rVert_2^2 = \lVert \mathrm{A} \rVert_{\mathrm{F}}^2 \, \lVert x\rVert_2^2 $$

thus there is a $M = \lVert \mathrm{A} \rVert_{\mathrm{F}}$ such that $ \lVert\mathrm{A}x \rVert_2 \leq M\lVert x\rVert_2^2$.

This means all matrices are continuous & bounded linear maps for the $l_2$ norm.

Manhattan norm

Suppose we wish to measure with input and output spaces using the $l_1$ norm. The quantity $$\lVert\mathrm{A}x \rVert_1 = \sum_{i=1}^m \lvert\tilde{a}_i^\mathsf{T}x\rvert = \sum_{i=1}^m \lvert\sum_{j=1}^n a_{ij}x_j\rvert $$ can be upper-bounded using the finite-sum generalization of the scalar inequality $\lvert a + b \rvert \leq \lvert a\rvert + \lvert b \rvert$. Denoting $\lVert \mathrm{A} \rVert_1$ the sum of the unsigned entries $\lvert a_{ij} \rvert$ we have the inequality:

$$ \lVert\mathrm{A}x \rVert_1 \leq \lVert\mathrm{A} \rVert_1 \lVert x \rVert_1$$

we can conclude similarly to the first section.

Infinity norm

Assume we now use the infinity norm $\lVert \cdot \rVert_\infty$, we can upper bound the output $\mathrm{A}x$ as follows:

\begin{align} \lVert \mathrm{A}x \rVert_\infty = \max_i\lvert \tilde{a}_i^\mathsf{T} x\rvert &= \max_i \left\lvert \sum_{j=1}^n a_{ij}x_j\right\rvert \\ & \leq \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert \left\lvert x_j\right\rvert \\ & \leq \lVert x\rVert_\infty \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert = M\,\lVert x\rVert_\infty \end{align}

we thus obtain that any linear operator is bounded and continuous for $l_\infty$ in finite dimensions.

Duality

To be added.