Warning: this blog is not on teaching you how to diagnose fever, or Advil.
I was recently working with Aki, Dan and Andrew on a paper with a fun name “Yes, but Did It Work?: Evaluating Variational Inference” (Yes, name credited to Dan, of course). The motivation is straightforward: the Stan team and some users are not extremely satisfied with the current version of Automatic Differentiation Variational Inference (ADVI). It is fast and convenient, and I actually use ADVI a lot for some project on some applied classes — what else I could do if I were to submit a project the next day and found Stan sampling took 3.7 hours for for 10% of iterations?
There is some convergence assessment in ADVI, or whatever VI. For example, the software can monitor the running average of objective function or even a held-our test dataset. But that is not enough. Finite-iteration VI is an approximation (local /global optima) to the approximation (true posterior).
A funny but unfortunate story is I caught cold twice during writing, both coming with painful and dangerous (not really) fever. When I pick up my thermometer, I suddenly realize it is exactly what we are doing: the finite-time of a contact-thermometer cannot be accurate. The read temperature is an approximation.— By the way, I did study how to derive the heat kernel diffusion equation in my undergraduate first year, while I now only remember its name.—- OK, but so what, even if one day we have a super-accurate thermometer, or we are patient enough hold a thermometer under tongue for 25 days so as to ensure the thermometer reads exact body temperature, we still need some diagnostics for whether I am running fever. Yes, some subjective measurement can say that, like feeling headache, feeling cold, or whatever, just like ADVI will warn users if it drops too many samples. But those measurements is hard to compare across patients, just like ELBO, or test predictive density.
Therefore the medical community sets such threshold, based on some empirical measurement, such as armpit temperature being above 99 F (37.2 C).
Similarly, we introduce a diagnostic termed as “k hat”, which is typically a number between 0 and 1 (though it can be larger than 1 in practice). K hat is invariant under reparameterization, or log additive constant. Most important, it determines the finite sample convergence rate and Renyi divergence of VI approximation, any approximation with a k>0.7 is not reliable.
|Figure 1: A fast way to treat fever, given the correct diagnostic|