Does the soft-constraint converge to rigid-constraint?

```yaml
title: "Does the soft-constraint converge to rigid-constraint?"
author: "Yuling Yao"
date: "10/21/2018"
output: html_document
```

tl;dr:
No.

Motivations

I have often seen the recommendation of soft constraints. The main reason is that a Bayesian will rarely prefer a point mass. Apart from that, I am now curious whether the current implementations of these two approaches in Stan are essentially equivalent in the limiting case.
Let me frame the problem more rigorously. Consider the parameter space $\theta \in \Theta = R^d$, and transformed parameters
$$z=\xi(\theta),\quad z \in R^m,$$
where $\xi : R^d \to R^m$, $d>m$, is a smooth function. We also need some regularity condition on $\xi$ to make it full rank. Denote by $\nabla \xi$ the $d\times m$ gradient matrix, and let $G$ be the Gram matrix $G=\nabla \xi ^T\nabla \xi$; we require
$$\mathrm{rank} (\nabla \xi)=m.$$
When there is no constraint at all, the posterior distribution $\nu(\theta)$ is proportional to $\exp(-V(\theta))$.
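To fix ideas, here is a minimal toy example of this setup (my own, not from the references): $d=2$, $m=1$, and $\xi$ the coordinate sum,
$$\xi(\theta)=\theta_1+\theta_2, \qquad \nabla\xi = \begin{pmatrix}1\\1\end{pmatrix}, \qquad G=\nabla\xi^T\nabla\xi = 2, \qquad \mathrm{rank}(\nabla\xi)=1=m.$$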
So far so good, until I start writing
```stan
transformed parameters {
  vector[m] z = xi(theta);   // xi() stands for the user's transformation
}
model {
  z ~ distribution(...);     // some prior / soft constraint on z
}
```

Rigid constraint

Suppose in some situation we need a constraint on $\xi(\theta)$. A non-Bayesian wants $\xi(\theta) \approx z_0$, $z_0 \in R^m$. $z$ is sometimes termed the reaction coordinate in physics.
Firstly, with a rigid constraint, we need to sample from the submanifold
$$\Sigma(z_0)= \{\theta\in\Theta \mid \xi(\theta) =z_0 \}$$
and the conditional distribution
$$\pi^{\xi}(d\theta \mid z_0 ) \propto \exp(-V(\theta))\,\delta_{\xi(\theta)-z_0}\, d\theta.$$
That is the probability measure $\nu$ conditioned on a fixed value $z_0$ of the reaction coordinate.
To clarify, we now also have the surface measure $d\sigma_\Sigma$ on $\Sigma(z)$: the Lebesgue measure on $\Sigma(z)$ induced by the ambient Euclidean space.
The co-area formula gives
$$\delta_{\xi(\theta)-z_0}\, d\theta= (\det G)^{-1/2}\, d\sigma_\Sigma.$$
Later we will see the co-area formula is intrinsic, and it is not equivalent to the change-of-variables formula.
The latter reads $dy = J_x(y)\, dx$.
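A quick sanity check on the co-area formula (my own toy example): take $\xi(\theta)=\lVert\theta\rVert$ in $R^2$, so
$$\nabla\xi = \theta/\lVert\theta\rVert, \qquad G = \nabla\xi^T\nabla\xi = 1, \qquad \delta_{\xi(\theta)-r_0}\, d\theta = d\sigma_\Sigma,$$
i.e., the constrained measure is exactly the arc-length measure on the circle of radius $r_0$. The factor $r$ that appears in the polar change of variables $d\theta = r\, dr\, d\varphi$ never shows up here, because we never change variables.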
The first question is: which measure should we sample from? If you think it is a trivial question, think about those transformed parameters that do not have an explicit prior in the model block: they enter the log density as if they had a uniform prior, but why do we never add a Jacobian for them?
Short answer: here we are NOT doing a variable transformation at all; we always stay in the $\theta$ space, but we do have the freedom to choose whether to stand in the Euclidean space or on the surface. Of course, the typical choice is the conditional distribution $\pi^{\xi}(d\theta \mid z_0)$.
Sampling from a constrained dynamic is not supported in Stan, unless it is a simple case where we can easily reparametrize (e.g., simplex, sum-to-zero), which is equivalent to explicitly solving the inverse $\theta= \xi^{-1}(z_0)$. In general this can be done by adding a Lagrange multiplier $\lambda$ to the Hamiltonian system (e.g., Hartmann and Schütte, 2005). The main idea is to modify the potential in the Hamiltonian to $V^{\mathrm{constraint}}(\theta) = V(\theta)+ \lambda^T (\xi(\theta)- z_0)$. For the very first reason I mentioned in the beginning, I know we probably do not have strong motivation to implement it.
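For completeness, a sketch of what such constrained dynamics look like (the standard constrained Hamiltonian equations; the notation and the mass matrix $M$ are mine, not from the post):
$$\dot\theta = M^{-1} p, \qquad \dot p = -\nabla V(\theta) - \nabla\xi(\theta)\,\lambda, \qquad \lambda \in R^m \text{ chosen at each step so that } \xi(\theta_t)=z_0,$$
which is what SHAKE/RATTLE-type integrators implement in molecular dynamics.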

Soft constraint

Another strategy is to put a soft constraint. For example, we put a strong prior on $\xi(\theta)$: $\xi(\theta) \sim N (z_0, \tau ^2 \mathrm{Id}_{m\times m})$. That is, target += $-\frac{1}{2\tau^2} \lVert\xi(\theta)-z_0\rVert^2$.
In the limiting case $\tau\to 0$, the marginal density of $\xi(\theta)$ (the image of the measure $\nu$ under $\xi$) is guaranteed to become a Dirac measure at $z_0$.
OK, but having only target += $-\frac{1}{2\tau^2} \lVert\xi(\theta)-z_0\rVert^2$ is wrong, because we have not taken the transformation into account.
Intuitively, no matter how small $\tau$ is, the dynamics still sees the variation around the surface $\Sigma$. Indeed, in the limiting case $\tau \to 0$, it converges to
$$\exp(- V(\theta) )\, d\sigma_\Sigma,$$
which can also be rewritten as
$$\exp\Big(- V(\theta) + \frac{1}{2} \log \det G \Big)\, \delta_{\xi(\theta)-z_0}\, d\theta,$$
where $G$ is the Gram matrix defined earlier (for a proof see, e.g., Ciccotti et al. 2008).
The term $\frac{1}{2} \log \det G$ is known as the Fixman correction (Fixman 1974).
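As a concrete (and entirely my own) Stan sketch of a soft constraint plus a Fixman-style correction, take the toy reaction coordinate $\xi(\theta)=\sum_i\theta_i^2$, so that $\nabla\xi = 2\theta$ and $\frac{1}{2}\log\det G = \frac{1}{2}\log(4\,\theta^T\theta)$. The sign of the adjustment below follows from the limit displayed above: subtracting $\frac{1}{2}\log\det G$ from the log density makes the $\tau\to 0$ limit the conditional measure $\pi^{\xi}(d\theta \mid z_0)$ rather than the surface measure.

```stan
data {
  int<lower=1> d;
  real z0;             // constraint value for the reaction coordinate
  real<lower=0> tau;   // softness of the constraint
}
parameters {
  vector[d] theta;
}
transformed parameters {
  real z = dot_self(theta);     // xi(theta) = sum(theta^2)
}
model {
  theta ~ normal(0, 1);         // base density exp(-V(theta))
  z ~ normal(z0, tau);          // soft constraint on the reaction coordinate
  // Fixman-style adjustment: subtract 0.5 * log det G = 0.5 * log(4 * theta'theta)
  target += -0.5 * log(4 * dot_self(theta));
}
```

Whether one wants that last line at all is exactly the question of this post: without it, the $\tau\to 0$ limit is the surface measure $\exp(-V)\, d\sigma_\Sigma$ instead.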

Adding a Jacobian?

From my understanding, Stan recommends adding the log absolute determinant of the Jacobian to the log density whenever we put a model on the transformed parameters $z$.
It is somewhat vague what that means; to most users the Jacobian just means the derivative. Sure, when $\xi(\theta)$ is a bijection and $\nabla \xi(\theta)$ is a square matrix,
$$\frac{1}{2} \log \det G = \log |\det \nabla \xi |$$
is an exact identity, since $G=\nabla\xi^T\nabla\xi$ implies $\det G = (\det \nabla \xi)^2$. However, such a bijection is impossible here, because it would mean we put a soft constraint on all parameters, and in the limiting case the joint posterior degenerates to a Dirac measure purely due to the prior.
Otherwise, $\xi$ is a map from $R^d$ to $R^m$ with $d>m$, if not $d\gg m$. From old post 1 and old post 2:
  1. If you have a one-to-many transformation then you do the above but sum over all of the possible values of the transform inverse. This behaves well except in cases where there are infinite number of values, for example when you try to map a circle into the real line by transforming (0, 2pi) to (-\infty, \infty), in which case the transform becomes ill-posed.
  2. If you have a many-to-one transformation then you have to be very careful to ensure that you can self-consistently define a probability distribution on the transformed space. There is no “implicit distribution theorem” – the only way to do this is to “buffer” the output with additional variables so that you can define a one-to-one transformation, solve for the appropriate density on the buffered space, and then marginalize out the buffer variables. This is, for example, how one goes from a 2D Gaussian to a Rayleigh distribution or a 3D to a Maxwell, or the N simplex to the N-1 dimensional unconstrained real space we use in Stan (as @Bob_Carpenter notes below) . Typically this approach is limited to cases where the marginalization can be done analytically so it’s hard to apply to complex problems like this.
That is, augment the transformed parameters as $z^{*}=(z, z^{aug})$ so that the map $\theta \to z^{*}$ is one-to-one, calculate the Jacobian, and marginalize out the auxiliary variables.
The augmented Jacobian reads
$$J^{aug}=\begin{bmatrix} \nabla \xi(\theta)_{1:m} & 0 \\ \nabla \xi(\theta)_{m+1:d} & I_{(d-m)\times (d-m)} \end{bmatrix}.$$
You should be convinced that $\log |\det J^{aug}| \neq \frac{1}{2} \log \det G$ in general, no matter how similar they look.
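A concrete check (my own toy example, using the augmentation above with $z^{aug}=\theta_2$): take $d=2$, $m=1$, $\xi(\theta)=\theta_1+\theta_2^2$, so
$$\nabla\xi = \begin{pmatrix}1 \\ 2\theta_2\end{pmatrix}, \qquad J^{aug} = \begin{pmatrix}1 & 0 \\ 2\theta_2 & 1\end{pmatrix}, \qquad \log|\det J^{aug}| = 0, \qquad \frac{1}{2}\log\det G = \frac{1}{2}\log(1+4\theta_2^2),$$
and the two agree only on the set $\theta_2=0$.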
If we are talking about a variable transformation, then I totally agree. This is like the log-normal or inverse-gamma case: we put a prior on the transformed parameters, as if we generate those transformed parameters (e.g., a normal), and then transform back to the original space (a log-normal).

Generative vs Embedding, or change-of-variable vs conditioning

They are different.
The Stan manual writes:
A transformation samples a parameter, then transforms it, whereas a change of variables transforms a parameter, then samples it. Only the latter requires a Jacobian adjustment.
Agreed: we need a Jacobian when we know the generative distribution in the transformed-parameter space and want to transform it back to the original space. In principle, we do not even need any extra prior on $\theta$.
But there is another situation, which I will call embedding. We know the distribution $\nu$ in the $\theta$ space, but we want to find its embedded/conditional distribution on the submanifold $\Sigma_{z_0}$ induced by the transformed parameters. The section above shows that in order to keep invariance, we need an adjustment by $\frac{1}{2} \log \det G$, not a Jacobian. Notice that we do not perform any change of variables.
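For contrast, here is a minimal sketch of the generative / change-of-variables case (the standard log-normal construction; the particular prior scales are mine):

```stan
parameters {
  real<lower=0> sigma;
}
transformed parameters {
  real log_sigma = log(sigma);
}
model {
  // Generative statement on the transformed parameter ...
  log_sigma ~ normal(0, 1);
  // ... plus the change-of-variables adjustment log|d log(sigma)/d sigma| = -log(sigma),
  // so that sigma itself gets the implied lognormal(0, 1) density.
  target += -log_sigma;
}
```

Here the Jacobian is the right adjustment because we genuinely change variables; in the embedding case the $\frac{1}{2}\log\det G$ term plays the analogous role without any change of variables.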

Prior-Prior conflict?

I am not sure I have made this clear enough. Though conceptually easy, the confusion between the Gram matrix and the Jacobian seems common, even among experts in molecular simulation (see Otter 2013 and Wong and York 2012 for a very similar argument).
I have not discussed the one-to-many direction, where $m>d$. In the limiting case it is over-identified. But there is nothing preventing me from defining a regularization on all of $\theta$, $\theta^2$, …, with $\theta^m \sim N(0,\alpha^m)$. In that case I know the distribution of $\theta$, and I am loosely regularizing all the moments so that they are not too wild. I do not even ask for consistency, as $\tau$ is fixed and not close to 0. I think it is fine to leave all the constraints as they are, without further adjustment. This echoes what I pointed out earlier: we do have the freedom to choose whether to stand in the Euclidean space or on the surface.
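As a loose-regularization sketch in Stan (my own illustration of this paragraph; the particular moments and scales are arbitrary, and no $\frac{1}{2}\log\det G$ or Jacobian adjustment is attempted because the constraints are deliberately loose):

```stan
parameters {
  real theta;
}
model {
  theta ~ normal(0, 1);            // regularization on theta itself
  square(theta) ~ normal(0, 2);    // loose soft constraint on the second moment
  pow(theta, 3) ~ normal(0, 4);    // ... and on the third; tau is fixed, not -> 0
}
```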
We may also wonder whether there is a prior-prior conflict if we define a prior on the $\theta$ space and then define another prior/embedding on the transformed parameters. Typically it is hard to detect, as the sum of multiple convex functions is still convex. But if $m>d$, the submanifold can be empty in the limit.

Conclusion

  1. For a variable transformation (e.g., constrained parameters, simplexes, …) and a generative prior defined through transformed parameters (e.g., log-normal, inverse-Cauchy), the Jacobian is always correct, and the ONLY correct option. There is a natural bijection, so the Jacobian is well-defined.
  2. The purpose of embedding is to put a soft constraint on some reaction coordinate, and it does not change variables. In that case, in order to keep asymptotic consistency, the log density should be adjusted by $\frac{1}{2} \log \det G$, not a Jacobian.
  3. A hard constraint is another less recommended, but possible, option for general transformations.
  4. When the soft constraint itself is loose (e.g., z ~ normal(0, 2)), it only represents some approximate regularization. Omitting the $\frac{1}{2} \log \det G$ adjustment should then also be reasonable.
Some questions I am still not clear about:
the embedding interpretation and importance sampling; and when we should include either adjustment in optimization for the MAP.
Main reference: Stoltz, G., & Rousset, M. (2010). Free Energy Computations: A Mathematical Perspective. World Scientific.
