Thermodynamics and Information Theory
Not strictly a note, since I learned the subject years ago. Revised and still being revised, owing to my rusty memory of thermodynamics.
Basics in Thermodynamics
Equilibrium states
Found some interesting passages in Callen's book. First of all, in Callen's formulation, equilibrium states are those states of simple systems that are (macroscopically) characterized completely by $U,V,N_i$. Now he states that,
The quasi-static process simply is an ordered succession of equilibrium states, whereas a real process is a temporal succession of equilibrium and nonequilibrium states.
the reason being that
A real process always involves *nonequilibrium intermediate states* that have no representation in the thermodynamic configuration space.
The emphasis is my own. Now what is a nonequilibrium intermediate state? It is not representable, but does it exist? What does it mean that some nonrepresentable states exist? Why do we need to postulate this sort of entity? From my experience, physicists will tell you that this is because thermodynamic quantities are only defined when in equilibrium (but it seems that in his formulation the equilibrium is defined by $U,V,N_i$). Now Callen rightly points out that, during a physical process, the system disappears from the thermodynamic configuration space that it inhabited when in its initial state, and reappears in the configuration space when the final equilibrium state is reached. There is a strange analogy with quantum mechanics: equilibrium states correspond to eigenstates.
Why not ditch those metaphysical constructs such as time and "intermediate states" and rethink what the thermodynamic quantities "really" are?
Thermodynamic potentials
I never quite understood when and why to choose a specific thermodynamic potential over the others, or rather never tried to understand it. It seems that now it is unavoidable.
The internal energy $U(S,V,N)$ is a function of the extensive properties $S,V,N$. The first law of thermodynamics states that for quasi-static processes, $$ d U =\delta Q - \delta W + \mu\, d N = \delta Q - PdV + \mu\, d N, $$ whereas the second law asserts $$ \delta Q = T dS, $$ again for quasi-static processes. Hence the fundamental equation $U=U(S,V,N)$ has the following differential, $$ dU = TdS - PdV + \mu\, d N. $$ Keep in mind that $\delta Q = T dS$ and $\delta W = PdV$ hold for reversible, quasi-static processes, and usually one is interested in these processes.
In the following we will take the non-mechanical work received, $\delta W_{ex}$, to be measured by $\mu\, dN$, where $\mu$ is the chemical potential and $N$ the number of particles, and call it intrinsic energy; but $\mu\, dN$ can be understood as (the infinitesimal of) any kind of non-mechanical work received.
- For constant $V$ and $N$, $dU$ measures the heat flow into the system.
- For constant $S$ and $N$, whatever that means, $dU$ measures (minus) the work done on the surroundings.
- For constant $S$ and $V$, $dU$ measures the exchange of intrinsic energy between the system and its surroundings.
Here the second and third points are not that helpful. The internal energy can be said to measure the heat that can be transferred to the surroundings when there is no intrinsic change and the volume is constant (so no work is done). In other words, fixing $S$ and $V$, the system will evolve to a state that minimizes $U(S,V)$.
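As a sanity check on this formalism, here is a minimal sketch (my own illustration, not from Callen's text) using the standard monatomic-ideal-gas fundamental relation $U(S,V,N) = A\, N^{5/3} V^{-2/3} e^{2S/(3NR)}$, where the constant $A$ absorbs the reference-state data: the partial derivatives of $U$ reproduce the familiar equations of state.

```python
import sympy as sp

S, V, N, A, R = sp.symbols('S V N A R', positive=True)

# Monatomic ideal gas fundamental relation (up to the constant A),
# in the energy representation U = U(S, V, N).
U = A * N**sp.Rational(5, 3) * V**sp.Rational(-2, 3) * sp.exp(2*S/(3*N*R))

T = sp.diff(U, S)        # temperature:        T  =  (dU/dS)_{V,N}
P = -sp.diff(U, V)       # pressure:           P  = -(dU/dV)_{S,N}
mu = sp.diff(U, N)       # chemical potential: mu =  (dU/dN)_{S,V}

# The ideal-gas equations of state follow from the derivatives alone.
print(sp.simplify(P*V - N*R*T))                    # -> 0, i.e. PV = NRT
print(sp.simplify(U - sp.Rational(3, 2)*N*R*T))    # -> 0, i.e. U = (3/2) NRT
```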
One would also like to describe the system with $P$, $T$, or $N$ as the controlled variables. This is the reason other thermodynamic potentials are introduced.
Enthalpy
The enthalpy $H = H(S,P,N)$ is defined by $H = U + PV$. The differential is $$ dH = TdS + VdP + \mu dN, $$ then
- For constant $N$ and $S$, whatever that means, $dH = V\,dP$, which measures… something.
- For constant $N$ and $P$, $dH$ measures the heat flow into the system (hence the name enthalpy, "internal heat").
- For constant $S$ and $P$, $dH$ measures the change of intrinsic energy.
So enthalpy is basically the measure of the heat that can be transferred to the surroundings when the pressure is constant and no intrinsic change is involved. In other words, leaving $S,P$ fixed, the system evolves to a state minimizing $H(S,P)$.
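To make the change of natural variables concrete, here is a small sketch (same assumed ideal-gas fundamental relation as above): eliminate $V$ in favour of $P$ using $P = -(\partial U/\partial V)_{S,N}$ and check that $(\partial H/\partial P)_{S,N} = V$.

```python
import sympy as sp

S, P, N, A, R, V = sp.symbols('S P N A R V', positive=True)

# Same assumed fundamental relation, in the energy representation.
U = A * N**sp.Rational(5, 3) * V**sp.Rational(-2, 3) * sp.exp(2*S/(3*N*R))

# Invert P = -(dU/dV)_{S,N} = (2/3) A N^(5/3) V^(-5/3) exp(2S/(3NR)) for V:
V_of_P = (sp.Rational(2, 3) * A * N**sp.Rational(5, 3)
          * sp.exp(2*S/(3*N*R)) / P)**sp.Rational(3, 5)

H = (U + P*V).subs(V, V_of_P)               # enthalpy H(S, P, N) = U + PV

print(sp.simplify(sp.diff(H, P) - V_of_P))  # -> 0, i.e. (dH/dP)_{S,N} = V
```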
Helmholtz free energy
The Helmholtz free energy $F= F(T,V,N)$ is defined by $F = U-TS$. The differential gives $$ dF = -SdT - PdV + \mu dN. $$
Notice that $T$ is far easier to observe than $S$.
- For constant $T$ and $V$, $dF$ measures the change of intrinsic energy.
- For constant $T$ and $N$, $dF$ measures (minus) the work done on outer systems (hence the name free energy).
- For constant $V$ and $N$, unclear.
Then $F$ is the measure of the work that can be done on outer systems when the system is in quasi-static (thermal) equilibrium with the outer system and no intrinsic change is involved. In other words, fixing $T$ and $V$ (and $N$), the system evolves to a state that minimizes $F(T,V,N)$.
One can argue that it is also the measure of the exchange of constituents when the system is in quasi-static equilibrium with the surroundings and has constant volume, but it is rare that an interchange of constituents leaves $V$ invariant.
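A sketch of the same kind for the Legendre transform $F = U - TS$ (again with the ideal-gas relation assumed above, now rewritten in the $(T,V,N)$ variables): the derivatives of $F$ give back $P$ and $S$.

```python
import sympy as sp

T, V, N, A, R = sp.symbols('T V N A R', positive=True)

# The same assumed monatomic-ideal-gas relation, with T as a natural variable:
# U = (3/2) N R T, and S(T, V, N) obtained by inverting T = (dU/dS)_{V,N}.
U = sp.Rational(3, 2) * N * R * T
S = sp.Rational(3, 2) * N * R * sp.log(
    sp.Rational(3, 2) * R * T * V**sp.Rational(2, 3) * N**sp.Rational(-2, 3) / A)

F = U - T * S                                  # Helmholtz free energy F(T, V, N)

print(sp.simplify(-sp.diff(F, V) - N*R*T/V))   # -> 0: P = -(dF/dV)_{T,N} = NRT/V
print(sp.simplify(-sp.diff(F, T) - S))         # -> 0: S = -(dF/dT)_{V,N}
```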
Gibbs free energy
Also called "free enthalpy" since the Gibbs free energy $G=G(T,P,N)$ is defined by $G = H - TS$. The differential is $$ dG = -SdT + VdP + \mu dN. $$
- For constant $T$ and $P$, $dG$ measures the change of intrinsic energy.
- For constant $P$ and $N$, unclear.
- For constant $T$ and $N$, unclear.
Then $G$ is the measure of how much non-mechanical work (work that comes from some intrinsic change, i.e., not from volume expansion) can be done on outer systems when in equilibrium with the surroundings and with the pressure remaining constant. Fixing $T,P$, the system chooses a state that minimizes $G(T,P,N)$.
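One more sketch with the same assumed fundamental relation: forming $G = U + PV - TS$ directly in the $(S,V,N)$ variables and comparing with $\mu N$ recovers the standard one-component identity $G = \mu N$ (this Euler-relation fact is not discussed above; it is included here only as a check).

```python
import sympy as sp

S, V, N, A, R = sp.symbols('S V N A R', positive=True)

# Same assumed monatomic-ideal-gas fundamental relation as before.
U = A * N**sp.Rational(5, 3) * V**sp.Rational(-2, 3) * sp.exp(2*S/(3*N*R))

T = sp.diff(U, S)
P = -sp.diff(U, V)
mu = sp.diff(U, N)

G = U + P*V - T*S                  # Gibbs free energy G = H - TS = U + PV - TS
print(sp.simplify(G - mu*N))       # -> 0: for a one-component system, G = mu * N
```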
Information-theoretic interpretation of thermodynamics
Gibbs equilibrium distributions
Consider a system whose macroscopic state is characterized by a set of $n$ observables $O_a$. These observables are assumed to be the average values of some functions over the microscopic space $\Gamma$, w.r.t. a measure $\mu$ on $\Gamma$: $$ O_a := \langle F_a \rangle = \frac{\int_\Gamma F_a \mu}{\int_\Gamma \mu} $$ where $\{F_a\}^n_{a=1}$ is some set of functions $F_a: \Gamma \to \mathbb{R}$.
A crucial point is that a choice of the measure, i.e. of a density $\rho$ with $\mu = \rho\, dx$, needs to be carried out.
The choice of $\rho$ is made by considering the microscopic entropy of a distribution and extremizing the macroscopic, i.e. thermodynamic, entropy, obtained by treating the distribution as a probability distribution $\rho$ and taking its Shannon entropy.
Consider, for a generic normalized (i.e. $\int_\Gamma \rho\, dx = 1$) distribution $\rho$, the microscopic entropy (see What is Information Entropy for what $s(\rho) = - \ln \rho$ means) $$ s(\rho) = -\ln \rho $$ and impose that its average, or equivalently the Shannon entropy of $\rho$, coincides at equilibrium with the thermodynamic entropy of the system, $$ S_\rho = \langle s(\rho)\rangle = -\int_\Gamma \rho\ln \rho\, dx. $$ By the Second Law of Thermodynamics, subject to the constraints
\begin{gather*} O_a = \int_\Gamma F_a \rho\, dx\\ \int_\Gamma \rho\, dx = 1 \end{gather*} the entropy $S_\rho$ should be extremized. Thus the following functional is extremized: $$ \tilde{S}_\rho = -\int_\Gamma \rho\ln\rho\, dx - \tilde{w}\Big( \int_\Gamma \rho\, dx -1\Big) + \tilde{q}^a \Big( \int_\Gamma F_a(x) \rho\, dx - O_a \Big). $$ There is a summation à la Einstein over the index $a$.
The solution is the family of Gibbs' equilibrium distributions $$ \rho_0(\Gamma;\tilde{w},\tilde{q}) = \exp(-\tilde{w}+ F_a(x) \tilde{q}^a). $$ From the normalization condition of $\rho$, $$ \tilde{w} = \ln \int_\Gamma \exp(F_a(x) \tilde{q}^a) dx = \ln\mathcal{Z} $$ where $\mathcal{Z}$ is the partition function. Thus $\tilde{w} = \tilde{w}(\tilde{q}^a)$ is the thermodynamic potential for the system in the ensemble characterized by the constraints given by $O_a$. The derivatives of $\tilde{w}$ w.r.t. $\tilde{q}^a$ give the equations of state $$ O_a = \frac{\partial \tilde{w}}{\partial \tilde{q}^a}. $$ The entropy reads $$ S_{\rho_0} = \tilde{w} - O_a \tilde{q}^a. $$
Hence at equilibrium the thermodynamic potentials are Legendre transforms of the entropy.
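A minimal numerical illustration of the construction (a toy example, not from the text): take $\Gamma = \{1,\dots,6\}$, one observable $F(x) = x$ with prescribed average $O = 4.5$, solve for the multiplier $\tilde q$, and check that $S_{\rho_0} = \tilde w - O\tilde q$ equals the Shannon entropy of $\rho_0$.

```python
import numpy as np
from scipy.optimize import brentq

# Toy microscopic space Gamma = {1,...,6} with one observable F(x) = x whose
# average is constrained to O = 4.5 (the classic "Brandeis dice" illustration).
F = np.arange(1, 7, dtype=float)
O_target = 4.5

def w(q):
    """w(q) = ln Z(q) = ln sum_x exp(q F(x)) -- the thermodynamic potential."""
    return np.log(np.sum(np.exp(q * F)))

def mean_F(q):
    """<F> under the Gibbs distribution rho_0 = exp(-w + q F)."""
    rho = np.exp(q * F - w(q))
    return np.sum(rho * F)

q_star = brentq(lambda q: mean_F(q) - O_target, -5.0, 5.0)  # solve the equation of state
rho0 = np.exp(q_star * F - w(q_star))                       # Gibbs equilibrium distribution
S = w(q_star) - O_target * q_star                           # S = w - O q
print(rho0)
print(S, -np.sum(rho0 * np.log(rho0)))                      # the two entropies agree
```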
The First Law of Thermodynamics: information-theoretic derivation
We write $p_a$ for $O_a$ in the following and drop the tilde over $\tilde{w},\tilde{q}^a$. The First Law of Thermodynamics is written as $$ d w - p_a d q^a = 0 $$ since the First Law is invariant under a Legendre transformation. There is an alternative "derivation" of the First Law. Consider the microscopic entropy $s(\rho) = s(w,q^a)$ and write $\lambda^i = (w, q^a)$; then its differential is $$ ds = -\frac{\partial \ln\rho}{\partial \lambda^i} d\lambda^i. $$ Its expectation value, or the first moment of $ds$, is $$ \langle ds \rangle = -\int_\Gamma \frac{\partial \ln\rho}{\partial \lambda^i} d\lambda^i \rho dx = - \langle \frac{\partial \ln \rho}{\partial \lambda^i}\rangle d \lambda^i. $$ Now, with the Second Law, the distribution $\rho_0$ is given, which yields $$ \langle ds \rangle = \langle dw - F_a(x) dq^a \rangle = dw - p_a d q^a. $$ Thus the First Law just says that the first moment of $ds$ vanishes. This is a statement about the quality of the estimators $F_a$. The equations of state, being consequences of the First Law, $$ p_a = \langle F_a \rangle = \partial w/\partial q^a $$ can be read as meaning that, if for some $a$ the equation of state holds, then $F_a$ is an unbiased estimator for the value of the variable $\partial w/\partial q^a$. Thus, whenever the First Law is not satisfied, there is an estimator $F_a$ that is biased.
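Continuing the toy example above, a quick numerical check that the equation of state holds identically along the Gibbs family, i.e. that $\langle F\rangle = \partial w/\partial q$ (the estimator is unbiased and the first moment of $ds$ vanishes):

```python
import numpy as np

# Check that p = <F> equals dw/dq for the Gibbs family of the previous sketch.
F = np.arange(1, 7, dtype=float)

def w(q):
    return np.log(np.sum(np.exp(q * F)))

q = 0.37                                        # an arbitrary value of the multiplier
rho0 = np.exp(q * F - w(q))
p = np.sum(rho0 * F)                            # <F> under rho_0
dw_dq = (w(q + 1e-6) - w(q - 1e-6)) / 2e-6      # numerical derivative of w
print(p, dw_dq, abs(p - dw_dq) < 1e-6)          # the two agree
```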
[…]
What is Information Entropy?
A quantity that was somewhat arbitrarily introduced in Information-theoretic interpretation of thermodynamics was the microscopic entropy $s(\rho) = -\ln \rho$. Why the logarithm? Since $s$ appears in $S := - \int \rho \ln\rho\, dx$, what really needs to be explained is the meaning of the Shannon entropy formula $$ S_I = -\sum_i p_i \log p_i. $$ For simplicity just consider the discrete form. The choice of base ($\log$ here is base 2) doesn't matter much, since changing the base only introduces a constant factor. $S_I$ can be seen as the expectation of the function $-\log p_i$, if there is a notion of probability.
Operational Interpretation
Suppose that there are $N$ messages. One wants to pick out an arbitrary message from these $N$ messages by asking questions whose answers are yes or no (and which will be answered). If one picks an optimal set of questions, one needs at most about $\log N$ questions: each answer halves the set of messages, and solving $2^x = N$ for $x$ gives $x=\log N$. This gives an intuition for the rough meaning of the logarithm.
Now if the messages fall into classes of sizes $\{N_i\}_i$ with $N = \sum_i N_i$, let $p_i = N_i / N$; then $\log p_i = \log N_i - \log N$, and $$ \sum_i p_i \log p_i = \frac{\sum_i N_i \log N_i}{N} - \log N. $$ Thus what matters is $\sum_i N_i \log N_i$: the sum, weighted by the class sizes, of the number of bits $\log N_i$ needed to specify one particular message within each class.
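A two-line numerical check of this identity, with made-up class sizes:

```python
from math import log2

# Toy class sizes checking: sum_i p_i log p_i = (sum_i N_i log N_i)/N - log N.
Ns = [3, 5, 8, 16]
N = sum(Ns)
lhs = sum((Ni / N) * log2(Ni / N) for Ni in Ns)
rhs = sum(Ni * log2(Ni) for Ni in Ns) / N - log2(N)
print(lhs, rhs, abs(lhs - rhs) < 1e-12)
```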
Intuitively, for a probability density $\rho$, $\int \rho \log \rho$ is then the weighted integral of $\log \rho$: $\log \rho(x)$ is the number of "bits" needed to describe a specific message sitting over the point $x$. Thus $\rho$ itself "classifies" the points of a space $X$ into neighborhoods, over each of which sit $\rho$ "messages", and in each infinitesimal neighborhood $\log \rho (x)$ bits are needed to specify a particular message.
Why weight by the number of messages? Let's say that we know that there are $N$ messages, and know that these are partitioned into classes $1, 2, \ldots$ with $N_i$ messages each. The weighted sum is then the expected number of bits needed to specify a particular message among these $N$ messages, given that one knows the partition into classes, and that in the process of specifying one is really specifying one message among $N_i$ messages, with the particular class labelled by $i$ not yet determined. The explanation carries over verbatim to the continuous case.
In the above, $S_I = - \int_\Gamma \rho \log \rho\, dx $ is extremized, in fact maximized, i.e. $\int_\Gamma \rho \log \rho\, dx$ is minimized. This means that some set of messages is partitioned in such a way (the partition given by $\rho$) that to specify a particular message, provided that the partition $\rho$ is known, one needs the fewest bits. In turn, this means that the partition $\rho$ helps a lot, since the total number of messages is independent of the partition $\rho$.
Maximal Shannon entropy then means that the partition $\rho$ "helps a lot" in reducing the number of bits needed to specify a particular message: more information is contained in the partition $\rho$. But the meaning of "helps a lot" should be taken with a grain of salt, since one can also regard $\rho$ as representing what is unknown about the messages: it is unsure in which class the specification process will occur.
A particularly interesting "bijection" $$ \text{Prefix codes} \leftrightarrow \text{Binary tree (codewords are leaves)} \leftrightarrow \text{Yes/No question strategy}. $$
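This correspondence can be made concrete with a small sketch (toy probabilities of my own choosing): build a Huffman prefix code, i.e. a binary tree whose leaves are the codewords, i.e. an optimal yes/no question strategy, and compare the expected number of questions with the Shannon entropy.

```python
import heapq
from math import log2

def huffman_lengths(probs):
    """Codeword lengths of an optimal (Huffman) prefix code for the given probabilities."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, ids1 = heapq.heappop(heap)
        p2, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:          # every symbol in a merged subtree moves one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, ids1 + ids2))
    return lengths

p = [0.5, 0.25, 0.125, 0.125]          # toy distribution
L = huffman_lengths(p)
avg_len = sum(pi * li for pi, li in zip(p, L))
H = -sum(pi * log2(pi) for pi in p)
print(L, avg_len, H)   # expected number of questions equals the entropy here (dyadic case)
```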
Axiomatic framework
Let $P = (p_1,\ldots,p_m)$ be a probability distribution on $m$ letters, and consider a functional $H_m(p_1,\ldots,p_m)$ that satisfies the following axioms:
- Permutation invariance: $H_m(\sigma_m(p_1,\ldots,p_m)) = H_m(p_1,\ldots,p_m)$ for any permutation $\sigma_m$.
- Expansibility: $H_m(p_1,\ldots,p_{m-1},0) = H_{m-1}(p_1,\ldots,p_{m-1})$.
- Normalization: $H_2(\frac{1}{2},\frac{1}{2}) = \log 2$.
- Continuity: $\lim_{p\to 0} H_2(p,1-p) = 0$.
- Subadditivity: $H(X,Y)\leq H(X) + H(Y)$ for jointly distributed random variables $X,Y$. Equivalently, $H_{mn}(r_{11},\ldots,r_{mn})\leq H_m\big(\sum^n_{j=1} r_{1j},\ldots,\sum^n_{j=1} r_{mj}\big) + H_n\big(\sum^m_{i=1}r_{i1},\ldots,\sum^m_{i=1}r_{in}\big)$.
- Additivity: $H(X,Y) = H(X) + H(Y)$ if $X,Y$ are independent; equivalently, $H_{mn}(p_1 q_1,\ldots,p_m q_n) = H_m(p_1,\ldots,p_m) + H_n(q_1,\ldots,q_n)$.
Then $$ H_m(p_1,\ldots,p_m) = -\sum_{i=1}^m p_i \log p_i $$ is the only possibility.
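The additivity and subadditivity axioms are easy to check numerically for this formula; a quick sketch with a randomly generated joint distribution (illustrative only):

```python
import numpy as np

# Check subadditivity and additivity numerically on a random joint distribution.
rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy (base 2) of a flat array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

r = rng.random((3, 4))
r /= r.sum()                              # a generic joint distribution r_ij
p, q = r.sum(axis=1), r.sum(axis=0)       # its marginals

print(H(r.ravel()) <= H(p) + H(q) + 1e-12)                 # subadditivity
print(np.isclose(H(np.outer(p, q).ravel()), H(p) + H(q)))  # additivity (independent case)
```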
Thermodynamics and Contact Geometry
Misc.
A complaint on Callen's problem 8.1-1
Callen's thermodynamics book, p. 207. This is a pointless problem. The reason that $$ \frac{\partial^2 S}{\partial U^2}\frac{\partial^2 S}{\partial V^2} - \left(\frac{\partial^2 S}{\partial U \partial V}\right)^2 \geq 0 $$ holds is that for a concave function of two variables the Hessian must be negative semi-definite (i.e. all eigenvalues are non-positive), and in two dimensions the determinant is the product of the two eigenvalues. Conversely, $\partial^2 S/\partial U^2$ and $\partial^2 S/\partial V^2$ are both non-positive, so if the determinant of the Hessian is non-negative, then the Hessian itself is automatically negative semi-definite.
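A numerical illustration of the $2\times 2$ criterion used here, with an arbitrary example matrix (not from Callen):

```python
import numpy as np

# A symmetric 2x2 example with non-positive diagonal entries and non-negative
# determinant; both eigenvalues then come out non-positive, i.e. the matrix is
# negative semi-definite.
H = np.array([[-2.0, 1.0],
              [1.0, -1.0]])
print(np.linalg.det(H) >= 0)                    # True (det = 1)
print(np.all(np.linalg.eigvalsh(H) <= 1e-12))   # True (eigenvalues ~ -0.38, -2.62)
```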