# Entropy, Information, Dynamical System and Non-Standard Analysis

I'm just starting to probe the field. For the time being things I'd like to understand are

- How is the contact geometry of thermodynamics related to the information theory of dynamical systems?
- More notions of entropy, e.g. measure-theoretic entropy and topological entropy, and how the various entropies are related.
- Loosely speaking, how complex information, such as that underlying chaotic systems, can be described compactly; the *nature* of randomness.
- Duistermaat–Heckman, the Heisenberg group, and the semiclassical/microlocal-analysis phenomena in symplectic geometry, in particular *geometric quantization*, through the lens of dynamical systems and information theory.
- A particularly interesting one: the *informational nature of entropy* and *non-standard analysis*, *chaos* and the *intuitionistic continuum*, etc., in particular in light of the "correspondence" between higher category theory and operator algebras, and Dirac geometry.

First, a connection table between dynamical systems and algorithmic complexity:

| Dynamical Systems | Complexity |
|---|---|
| Incompressibility of the source | Algorithmic complexity of $\mu$-almost all trajectories |
| Kolmogorov–Sinai entropy of trajectories | Kolmogorov complexity of trajectories |

- For the literature behind the above table, see Lesne, A. (2014), "Shannon entropy: A rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics", Mathematical Structures in Computer Science, 24(3), E240311.
- For non-standard analysis and entropy, see Mikhail Gromov, "In a Search for a Structure, Part I: On Entropy".

## Source Coding (Data Compression)

### Optimal vs. optimal on a finite alphabet

Usually, when talking about the *limit of lossless compression*, the *source coding theorem* of Shannon is invoked: for a uniquely decodable code $f:\mathcal{A}\to\{0,1\}^\ast$ on a finite alphabet $\mathcal{A}$, every such code satisfies $H(X)\leq \mathbb{E}[l(f(X))]$, and an optimal one (e.g. a Huffman code) moreover satisfies $\mathbb{E}[l(f(X))]\leq H(X)+1$.

More generally, one can take $f$ to be a uniquely decodable code $f:\mathcal{A}\to \mathcal{B}^\ast$ where $\mathcal{B}$ is also a finite alphabet (the above case is $\mathcal{B} = \{0,1\}$); then $$ \frac{H(X)}{\log_2 |\mathcal{B}|}\leq \mathbb{E}[l(f(X))] \leq \frac{H(X)}{\log_2 |\mathcal{B}|} + 1, $$ where again the lower bound holds for every uniquely decodable code and the upper bound is achievable by an optimal one. This is easily seen by changing the base in the definition of Shannon entropy.
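As a quick sanity check on this bound, here is a minimal Python sketch (the function name, tolerances, and example distribution are my own, not from any source): it assigns Shannon code lengths $\lceil \log_{|\mathcal{B}|}(1/p)\rceil$, verifies the Kraft inequality so that a prefix (hence uniquely decodable) code with those lengths exists, and checks that the expected length lies between $H(X)/\log_2|\mathcal{B}|$ and $H(X)/\log_2|\mathcal{B}| + 1$.

```python
import math

def shannon_code_check(pmf, B=2):
    """Check H/log2(B) <= E[l] <= H/log2(B) + 1 using Shannon code lengths.

    pmf : dict mapping symbols of a finite alphabet to probabilities.
    B   : size of the code alphabet (B = 2 is the binary case above).
    """
    H = -sum(p * math.log2(p) for p in pmf.values() if p > 0)   # entropy in bits
    H_B = H / math.log2(B)                                       # entropy in base-B units
    lengths = {a: math.ceil(-math.log(p, B)) for a, p in pmf.items() if p > 0}
    kraft = sum(B ** -l for l in lengths.values())               # <= 1: a prefix code with these lengths exists
    avg_len = sum(pmf[a] * lengths[a] for a in lengths)
    assert kraft <= 1 + 1e-12
    assert H_B - 1e-9 <= avg_len <= H_B + 1 + 1e-9
    return H_B, avg_len

# A skewed 4-letter source with a binary code alphabet; dyadic probabilities
# make the Shannon code meet the entropy exactly.
print(shannon_code_check({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))   # -> (1.75, 1.75)
```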

However, it needs to be stressed that a code $f$ is *not necessarily defined on a finite alphabet*, and that an actual optimal code can have expected code length *smaller* than the Shannon entropy.

First note that lossless compression is only possible for discrete random variables, since the target $\{0,1\}^\ast$ is countable and the coding $f:\mathcal{X} \to \{0,1\}^\ast$ must be injective for $f$ to be a lossless compression, where $\mathcal{X}$ is the set of values of $X$. Thus $\mathcal{X}$ is at most countable; assume $\mathcal{X}\cong\mathbb{N}$. Now relabel $\mathcal{X}$, i.e. write $\mathcal{X}=\{x_1, x_2, \dots\}$ such that $P_X(x_i) \geq P_X(x_{i+1})$. An optimal coding of $\mathcal{X}$ is then the coding $f_{op}$ that

- maps $x_1$ to the empty string $\emptyset$,
- maps $x_i$ to the $(i-1)$-th binary string in the length-lexicographic enumeration $\emptyset, 0, 1, 00, 01, 10, 11, 000, \dots$ (equivalently, the binary expansion of $i$ with the leading $1$ removed).

The length of the codeword is given by $l(f_{op}(x_n)) = \lfloor \log_2 n \rfloor$ (obvious). The coding $f_{op}$ is *optimal*, in that for any lossless (injective) coding $f$ the length $l(f(X))$ *stochastically dominates* $l(f_{op}(X))$: the cumulative distribution function of $l(f(X))$ is smaller than or equal to that of $l(f_{op}(X))$ pointwise, or equivalently
$$
\mathbb{P}[l(f(X)) \leq k] \leq \mathbb{P}[l(f_{op}(X))\leq k]
$$
for any $k\in \mathbb{R}$ and any random variable $X$ taking values in $\mathcal{X}$. The proof is easy. For the coding $f_{op}$, we have
$$
H(X) - \log_2\bigl[e\,(H(X)+1)\bigr] \leq \mathbb{E}[l(f_{op}(X))]\leq H(X),
$$
so while the bound is again given by $H(X)$, the Shannon entropy $H(X)$ is not really the "limit" of compression of $X$.
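To make the construction concrete, here is a minimal Python sketch of $f_{op}$ (the helper names and the test distribution are my own, not from the source): it sorts the values by decreasing probability, assigns binary strings in length-lexicographic order, and numerically checks $H(X) - \log_2[e(H(X)+1)] \leq \mathbb{E}[l(f_{op}(X))] \leq H(X)$.

```python
import math

def binary_string(i):
    """The (i-1)-th binary string in length-lexicographic order
    (empty string, 0, 1, 00, 01, ...): drop the leading 1 of bin(i)."""
    return bin(i)[3:]   # bin(i) looks like '0b1...'; slicing off '0b1' leaves the tail

def f_op(pmf):
    """One-to-one (injective, not prefix-free) code: the most probable value
    gets the shortest string. Returns a dict value -> codeword."""
    ranked = sorted(pmf, key=pmf.get, reverse=True)
    return {x: binary_string(i) for i, x in enumerate(ranked, start=1)}

def check_bounds(pmf):
    """Compare E[l(f_op(X))] with H(X) - log2(e (H(X)+1)) and H(X)."""
    H = -sum(p * math.log2(p) for p in pmf.values() if p > 0)
    code = f_op(pmf)
    avg_len = sum(pmf[x] * len(code[x]) for x in pmf)
    lower = H - math.log2(math.e * (H + 1))
    assert lower - 1e-9 <= avg_len <= H + 1e-9
    return lower, avg_len, H

# Truncated dyadic distribution on 8 values, renormalized.
raw = {i: 2.0 ** -(i + 1) for i in range(8)}
Z = sum(raw.values())
pmf = {k: v / Z for k, v in raw.items()}
print(f_op(pmf))          # value 0 gets the empty string, then '0', '1', '00', ...
print(check_bounds(pmf))  # lower bound < E[l(f_op(X))] <= H(X)
```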

Of course the problem with $f_{op}$ is that one needs to order the values of $X$ by their probability mass first, and the domain of $f_{op}$ is not a finite alphabet. By contrast, the *uniquely decodable code* $f:\mathcal{A}\to \{0,1\}^*$ that is the subject of the source coding theorem is defined on a finite alphabet $\mathcal{A}$ and then extended to $\mathcal{A}^+$, so that the coding can be defined on a finite alphabet and yet applied to arbitrary words over that alphabet. Also, something about $\mathcal{X}$ is hidden in the definition of $f_{op}$, namely the ordering given by the pmf, whereas in a uniquely decodable code only the ordering on $\mathcal{A}$ can be "hidden". Nevertheless this is still only a bound that can be said to be "the limit of compression"; properly one should consider Kolmogorov complexity, where further complications arise from various non-invariant constructions in computability theory, e.g. the choice of the universal Turing machine.