# Entropy, Information, Dynamical System and Non-Standard Analysis

I'm just starting to probe the field. For the time being things I'd like to understand are

- How is the contact geometry of thermodynamics related to the information theory of dynamical systems?
- More notions of entropy, e.g. measure-theoretic entropy and topological entropy, and how the various entropies are related.
- Loosely speaking, how complex information, such as that underlying chaotic systems, can be described compactly; the *nature* of randomness.
- Duistermaat–Heckman, the Heisenberg group, and the semiclassical/microlocal-analysis phenomena in symplectic geometry, in particular *geometric quantization*, through the lens of dynamical systems and information theory.
- A particularly interesting one: the *informational nature of entropy* and *non-standard analysis*, *chaos* and the *intuitionistic continuum*, etc., in particular in light of the "correspondence" between higher category theory and operator algebras, and Dirac geometry.

First, a connection table between dynamical systems and algorithmic complexity:

| Dynamical Systems | Complexity |
|---|---|
| Incompressibility of the source | Algorithmic complexity of $\mu$-almost all trajectories |
| Kolmogorov–Sinai entropy of trajectories | Kolmogorov complexity of trajectories |

- For the literature behind the above table, see Lesne, A. (2014), "Shannon entropy: A rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics", Mathematical Structures in Computer Science, 24(3), E240311.
- For non-standard analysis and entropy, see Mikhail Gromov, "In a Search for a Structure, Part I: On Entropy".

## Source Coding (Data Compression)

### Optimal vs. optimal on a finite alphabet

Usually, when talking about the *limit of lossless compression*, the *source coding theorem* of Shannon is invoked: for a uniquely decodable code $f:\mathcal{A}\to\{0,1\}^\ast$ on a finite alphabet $\mathcal{A}$, every such code satisfies $H(X)\leq \mathbb{E}[l(f(X))]$, and an optimal one (e.g. a Huffman code) moreover satisfies $\mathbb{E}[l(f(X))]\leq H(X)+1$.

More generally, one can take $f$ to be a uniquely decodable code $f:\mathcal{A}\to \mathcal{B}^\ast$ where $\mathcal{B}$ is also a finite alphabet (the above case is $\mathcal{B} = \{0,1\}$); then $$ \frac{H(X)}{\log_2 |\mathcal{B}|}\leq \mathbb{E}[l(f(X))] \leq \frac{H(X)}{\log_2 |\mathcal{B}|} + 1, $$ where again the lower bound holds for every uniquely decodable code and the upper bound is achievable by an optimal one. This is easily seen by changing the base in the definition of Shannon entropy.
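As a quick sanity check on this bound, here is a minimal Python sketch (the function name, tolerances, and example distribution are my own, not from any source): it assigns Shannon code lengths $\lceil \log_{|\mathcal{B}|}(1/p)\rceil$, verifies the Kraft inequality so that a prefix (hence uniquely decodable) code with those lengths exists, and checks that the expected length lies between $H(X)/\log_2|\mathcal{B}|$ and $H(X)/\log_2|\mathcal{B}| + 1$.

```python
import math

def shannon_code_check(pmf, B=2):
    """Check H/log2(B) <= E[l] <= H/log2(B) + 1 using Shannon code lengths.

    pmf : dict mapping symbols of a finite alphabet to probabilities.
    B   : size of the code alphabet (B = 2 is the binary case above).
    """
    H = -sum(p * math.log2(p) for p in pmf.values() if p > 0)   # entropy in bits
    H_B = H / math.log2(B)                                       # entropy in base-B units
    lengths = {a: math.ceil(-math.log(p, B)) for a, p in pmf.items() if p > 0}
    kraft = sum(B ** -l for l in lengths.values())               # <= 1: a prefix code with these lengths exists
    avg_len = sum(pmf[a] * lengths[a] for a in lengths)
    assert kraft <= 1 + 1e-12
    assert H_B - 1e-9 <= avg_len <= H_B + 1 + 1e-9
    return H_B, avg_len

# A skewed 4-letter source with a binary code alphabet; dyadic probabilities
# make the Shannon code meet the entropy exactly.
print(shannon_code_check({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))   # -> (1.75, 1.75)
```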

However, it needs to be stressed that a code $f$ is *not necessarily defined on a finite alphabet*, and that an actual optimal code can have expected code length *smaller* than the Shannon entropy.

First note that lossless compression is only possible for discrete random variables, since the target $\{0,1\}^\ast$ is countable and the coding $f:\mathcal{X} \to \{0,1\}^\ast$ must be injective for $f$ to be a lossless compression, where $\mathcal{X}$ is the set of values of $X$. Thus $\mathcal{X}$ is at most countable; assume $\mathcal{X}\cong\mathbb{N}$. Now relabel $\mathcal{X}$, i.e. write $\mathcal{X}=\{x_1, x_2, \dots\}$ such that $P_X(x_i) \geq P_X(x_{i+1})$. An optimal coding of $\mathcal{X}$ is then the coding $f_{op}$ that

- maps $x_1$ to the empty string $\emptyset$,
- maps $x_i$ to the $(i-1)$-th binary string in the length-lexicographic enumeration $\emptyset, 0, 1, 00, 01, 10, 11, 000, \dots$ (equivalently, the binary expansion of $i$ with the leading $1$ removed).

The length of the codeword is given by $l(f_{op}(x_n)) = \lfloor \log_2 n \rfloor$ (obvious). The coding $f_{op}$ is *optimal*, in that for any lossless (injective) coding $f$ the length $l(f(X))$ *stochastically dominates* $l(f_{op}(X))$: the cumulative distribution function of $l(f(X))$ is smaller than or equal to that of $l(f_{op}(X))$ pointwise, or equivalently
$$
\mathbb{P}[l(f(X)) \leq k] \leq \mathbb{P}[l(f_{op}(X))\leq k]
$$
for any $k\in \mathbb{R}$ and any random variable $X$ taking values in $\mathcal{X}$. The proof is easy. For the coding $f_{op}$, we have
$$
H(X) - \log_2\bigl[e\,(H(X)+1)\bigr] \leq \mathbb{E}[l(f_{op}(X))]\leq H(X),
$$
so while the bound is again given by $H(X)$, the Shannon entropy $H(X)$ is not really the "limit" of compression of $X$.
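To make the construction concrete, here is a minimal Python sketch of $f_{op}$ (the helper names and the test distribution are my own, not from the source): it sorts the values by decreasing probability, assigns binary strings in length-lexicographic order, and numerically checks $H(X) - \log_2[e(H(X)+1)] \leq \mathbb{E}[l(f_{op}(X))] \leq H(X)$.

```python
import math

def binary_string(i):
    """The (i-1)-th binary string in length-lexicographic order
    (empty string, 0, 1, 00, 01, ...): drop the leading 1 of bin(i)."""
    return bin(i)[3:]   # bin(i) looks like '0b1...'; slicing off '0b1' leaves the tail

def f_op(pmf):
    """One-to-one (injective, not prefix-free) code: the most probable value
    gets the shortest string. Returns a dict value -> codeword."""
    ranked = sorted(pmf, key=pmf.get, reverse=True)
    return {x: binary_string(i) for i, x in enumerate(ranked, start=1)}

def check_bounds(pmf):
    """Compare E[l(f_op(X))] with H(X) - log2(e (H(X)+1)) and H(X)."""
    H = -sum(p * math.log2(p) for p in pmf.values() if p > 0)
    code = f_op(pmf)
    avg_len = sum(pmf[x] * len(code[x]) for x in pmf)
    lower = H - math.log2(math.e * (H + 1))
    assert lower - 1e-9 <= avg_len <= H + 1e-9
    return lower, avg_len, H

# Truncated dyadic distribution on 8 values, renormalized.
raw = {i: 2.0 ** -(i + 1) for i in range(8)}
Z = sum(raw.values())
pmf = {k: v / Z for k, v in raw.items()}
print(f_op(pmf))          # value 0 gets the empty string, then '0', '1', '00', ...
print(check_bounds(pmf))  # lower bound < E[l(f_op(X))] <= H(X)
```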

Of course the problem with $f_{op}$ is that one needs to order the values of $X$ by their probability mass first, and the domain of $f_{op}$ is not a finite alphabet. By contrast, the *uniquely decodable code* $f:\mathcal{A}\to \{0,1\}^*$ that is the subject of the source coding theorem is defined on a finite alphabet $\mathcal{A}$ and then extended to $\mathcal{A}^+$, so that the coding can be defined on a finite alphabet and yet applied to arbitrary words over that alphabet. Also, something about $\mathcal{X}$ is hidden in the definition of $f_{op}$, namely the ordering given by the pmf, whereas in a uniquely decodable code only the ordering on $\mathcal{A}$ can be "hidden". Nevertheless this is still only a bound that can be said to be "the limit of compression"; properly one should consider Kolmogorov complexity, where further complications arise from various non-invariant constructions in computability theory, e.g. the choice of the universal Turing machine.