A Two-Stage Optimal Transport Based System Identification Method

We present a novel system identification method that overcomes the local minima problem yet converges to accurate system parameters. In the first stage, the algorithm minimizes the Wasserstein distance by means of an optimal transport solver, without resorting to a gradient of the Wasserstein cost. In the second stage, the algorithm achieves fast convergence using conventional gradient descent methods on the Euclidean distance cost. The method is simple to implement using off-the-shelf software packages for optimal transport and for gradient descent optimization.


INTRODUCTION
A system in the large class of linear time-invariant (LTI) systems is completely defined by its impulse response.
For discrete-time signals, the output of the system y(n) is entirely defined by the input x(n) and the impulse response h(n) through the convolution operator:

y(n) = ∑ₖ h(k) x(n−k)

System identification is the inverse problem that consists in inferring h(n) given the system input and output signals, denoted by x(n) and y(n), respectively. Here we concern ourselves with stochastic systems, or at least positive signals and positive impulse responses, scaled so that all samples sum up to 1. That is:

∑ₙ x(n) = 1, x(n) ≥ 0 ∀n
∑ₙ h(n) = 1, h(n) ≥ 0 ∀n

The acclaimed Wiener deconvolution suits frequency-domain applications best and is even robust to moderate additive noise [Wiener]. In the time domain, gradient-based methods are commonly used for this problem. For example, ADMM methods [Gray'2000] allow including a regularization term for robustness to noise. The basic gradient descent method seeks a stationary point of the Euclidean cost function given the impulse response:

J(h) = ‖y − h ∗ x‖²

When the transform function is a convolution, it can be represented with a matrix multiplication. The system identification problem becomes:

Yₘₓ₁ = Hₘₓₘ × Xₘₓ₁
The matrix H is a convolution matrix or a circulant matrix [Gray'2000 chap 1.1]. Gradient descent on convex and constrained problems using the ADMM method converges rapidly, in O(1/k) in the number of steps k [Goldstein'2014]. However, if the system's impulse response or transform function involves a translation, the problem becomes non-convex in the Euclidean cost. This issue is well documented in the context of image registration (see for example [Taylor'2008]).
Thus, the search algorithm might end in a local minimum and never reach the global minimum.
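The matrix formulation above can be checked with a few lines of numpy. The following is our own illustration (the `circulant` helper and the signal values are made up for the example): building the circulant matrix H from a normalized impulse response and verifying that the matrix product equals the circular convolution.

```python
import numpy as np

def circulant(h):
    """Build the m-by-m circulant matrix H whose columns are circular shifts of h."""
    m = len(h)
    return np.array([[h[(i - j) % m] for j in range(m)] for i in range(m)])

# A normalized impulse response and input signal (each sums to 1, all samples >= 0).
h = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

H = circulant(h)
y = H @ x  # Y_{m x 1} = H_{m x m} x X_{m x 1}

# The matrix product matches circular convolution of h and x,
# and y is again a valid pdf (sums to 1).
m = len(x)
y_conv = np.array([sum(h[k] * x[(n - k) % m] for k in range(m)) for n in range(m)])
assert np.allclose(y, y_conv)
```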

PRIOR WORK
Previously, the Gaussian clustering problem, which also faces the local minima issue, was solved using the Wasserstein distance (wd) [Kolouri'2017]. We take a similar approach for the system identification problem, but our method is based on the transport map. [El Moselhy'2012] lays very relevant theoretical foundations for this work and, to our knowledge, their method is the closest prior art to ours. We also aim at finding the transform function that minimizes the Wasserstein distance in an iterative fashion, and subsequently build h as a composition of partial transform functions. We propose a simpler algorithm with explicit and separated optimal transport and regression steps that does not require any gradient. From the computer vision community, the registration task aims at finding the transform that relates two images, given a translation and a 2D (or 3D) rotation or other affine transformations. The registration transform is commonly found using a randomized search and a cost function that involves both topological and Euclidean distance terms [Bleyer'2011].

Optimal transport
The computation of the Wasserstein distance wd amounts to minimizing the total cost of transforming an input probability density function (pdf) X₁ₓₘ into an output pdf Yₘₓ₁, by means of a transportation plan matrix Pₘₓₘ, subject to a cost matrix Cₘₓₘ, such that:

wd = min_P ∑ᵢⱼ P(i, j) C(i, j)

subject to the marginal constraints that the columns of P sum to X and the rows of P sum to Y. Recent Optimal Transport (OT) solvers effectively compute the Wasserstein distance wd and the transport plan P [Flamary'2017]. In the remainder, wd refers to the Wasserstein distance where the cost in C increases as dᵗ, where d is the Euclidean distance in the coordinates (of X and Y) and t is larger than or equal to 1.
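In practice one would call a packaged OT solver (e.g. `ot.emd` in the POT library [Flamary'2017]) to obtain wd and P. As a self-contained sketch of what such a solver returns, the toy code below (our own, not from any package) computes the plan for 1D marginals via the monotone (northwest-corner) coupling, which is optimal for costs of the form dᵗ with t ≥ 1:

```python
import numpy as np

def monotone_plan(a, b):
    """Northwest-corner coupling of two 1D pdfs; optimal for convex costs d**t, t >= 1."""
    m, k = len(a), len(b)
    P = np.zeros((m, k))
    ra, rb = a.astype(float).copy(), b.astype(float).copy()
    i = j = 0
    while i < m and j < k:
        mass = min(ra[i], rb[j])   # move as much mass as both marginals allow
        P[i, j] = mass
        ra[i] -= mass
        rb[j] -= mass
        if ra[i] <= 1e-15:
            i += 1
        if rb[j] <= 1e-15:
            j += 1
    return P

a = np.array([0.5, 0.5, 0.0])            # input pdf
b = np.array([0.0, 0.5, 0.5])            # output pdf (input shifted right by one)
P = monotone_plan(a, b)
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))  # cost grows as d**1
wd = np.sum(P * C)                        # total transport cost: each half-unit moves one step
```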

Transport plan and convolution
First, we observe that if h is known, the convolution matrix corresponds to a valid transportation plan, given that the transport plan is a nonnegative matrix whose columns sum to the input pdf X and whose rows sum to the output pdf Y. Let us set the i-th column ci of the transport plan to the impulse response h(n), centered with h(0) on the diagonal, i.e. ci = h(n−i), times the amplitude of the corresponding input sample x[i]. We have a valid column for the transport plan. Doing the same for all m columns results in a valid transport plan that satisfies the output signal requirements: Yₘₓ₁ = P × 1ₘₓ₁, since Yₘₓ₁ = Hₘₓₘ × Xₘₓ₁.
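This observation can be verified numerically. Below is our own small check (signal values made up for the example): build P column by column as ci = h(n−i) · x[i] and confirm that its column sums give X and its row sums give Y:

```python
import numpy as np

h = np.array([0.6, 0.4, 0.0, 0.0])   # impulse response, sums to 1
x = np.array([0.1, 0.3, 0.4, 0.2])   # input pdf, sums to 1
m = len(x)

# Column i of the plan is the (circularly) shifted impulse response scaled by x[i].
P = np.zeros((m, m))
for i in range(m):
    for n in range(m):
        P[n, i] = h[(n - i) % m] * x[i]

# Output signal y = circular convolution of h and x.
y = np.array([sum(h[k] * x[(n - k) % m] for k in range(m)) for n in range(m)])

# Columns sum to the input pdf X; rows sum to the output pdf Y.
assert np.allclose(P.sum(axis=0), x)
assert np.allclose(P.sum(axis=1), y)
```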
Using an OT solver, we obtain the distance wd and the transport map. The transport map represents the most parsimonious (w.r.t. the cost matrix C) warp of the input signal into the output signal. We now turn our attention to recovering h from P.

Transport plan and regression
The warp for a given input sample defines its reallocated weights in the output signal. For 1D signals, the warp for a given sample is entirely defined by its column in the transport map. The OT solver does not require the warp to be independent of the location; however, that is a requirement for a time-invariant system.
To recover the transform from the transport plan, and to enforce the time-invariance constraint, we regress the allocations at each offset for each sample into a candidate filter, the new h, as shown in Algorithm I. Note that we could allow the regression parameters to be adaptive with respect to location (n), or with respect to other conditionals. For the regression step (regress(.) in step 2.), a simple average is sufficient in the case of a Toeplitz cost matrix; in general, however, the regression aims to minimize the resulting wd, which depends on the cost matrix C.

Regression: Linear time invariant system
The regression for h[τ] at offset τ is estimated as follows for a Toeplitz cost matrix:

h[τ] = ∑_ci P(ci, ci+τ)

For a general (non-Toeplitz) cost matrix, the allocations are instead combined in a weighted sum driven by the cost entries C1(ci, ci+τ). The cost matrix drives the wd distance and indicates where h should make the input signal more similar to the output signal. In practice we use a modified cost matrix C1 = C + ε, where ε is a small constant, so that C1 has no zero entries and the weighted sum in C1(ci, ci+τ) is always defined.
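For the Toeplitz-cost case, the regression reduces to summing the allocations along each circular offset of the plan. A minimal sketch of this step in numpy (our own code, assuming the plan orientation where P[n, i] holds the mass sent from input sample i to output sample n):

```python
import numpy as np

def regress_filter(P):
    """Recover h[tau] by summing the allocations at each circular offset tau."""
    m = P.shape[0]
    h = np.zeros(m)
    for tau in range(m):
        # P[i + tau, i] ~ h[tau] * x[i]; summing over i and using sum(x) = 1
        # yields h[tau] directly.
        h[tau] = sum(P[(i + tau) % m, i] for i in range(m))
    return h

# Build a plan from a known filter and input, then recover the filter.
h_true = np.array([0.6, 0.4, 0.0, 0.0])
x = np.array([0.1, 0.3, 0.4, 0.2])
m = len(x)
P = np.array([[h_true[(n - i) % m] * x[i] for i in range(m)] for n in range(m)])
h_est = regress_filter(P)
```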
Looking at the update step 3., we are identifying the system as a sequence of convolved filters h. The composition of these filters results in the optimal transform function. When the optimal transform function is found, wd is stationary at the global minimum.

Regression: Translation
We have the a priori model y(n) = x(n+T), where T is an integer, and we want to identify T.
The linear time-invariant or filtering approach is suitable for small translations, within the support of the filter. For the more general case, it is possible to resort to calculating the cross-correlation and picking the maximal point. Using a regression on the optimal transport plan, we instead regress over all the allocations to extract the average transportation vector as the translation vector T⃗.
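The translation regression can be sketched as a weighted average of the signed offsets over all allocations of the plan. The code below is our own toy illustration (the plan entries are made up; the offset sign convention depends on the plan orientation):

```python
import numpy as np

def translation_from_plan(P):
    """Average transportation vector: sum of P[i, j] * (j - i) over all allocations."""
    m, k = P.shape
    i, j = np.meshgrid(np.arange(m), np.arange(k), indexing="ij")
    return np.sum(P * (j - i))

# Toy plan: all mass moves two positions to the right.
P = np.zeros((5, 5))
P[0, 2] = 0.5
P[1, 3] = 0.5
T = translation_from_plan(P)  # -> 2.0
```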

Regression: Centered affine transform
We have the a priori model y(n) = x(a(n − b) + b), where a is a positive real number and b is an integer. This transform is morphological in nature, and gradient descent on a cost based on the Euclidean distance is not convex. For this type of problem, researchers might resort to exhaustive search, randomized algorithms [Bleyer'2011], or use their domain expertise to devise effective heuristics.
We observe that the transportation cost should be close to zero at the center of the affine transform and grow linearly as the distance from the center grows. Say the center is at c*; the displacement of the sample at position n is then (a − 1)(n − c*),

which is formulated as a linear regression of the per-sample displacements over the sample positions n.
The solution of this regression gives the scale a (from the slope) and the center b ← c*. In all three regression cases, this method empirically converges on synthetic 1D data. The exact conditions for convergence remain to be defined. The only definitive requirement so far is for the OT solver to achieve convergence to a near-optimal transport plan P with ‖X₁ₓₘ − 1ᵀ × P‖² < ε and ‖Yₘₓ₁ − P × 1‖² < ε, with ε a small constant [see Cuturi'2013, Xie'2018].
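The affine regression step can be sketched as an ordinary least-squares line fit on the per-sample displacements. This is our own reconstruction (the center and scale values are made up for the example; `np.polyfit` performs the fit): for displacements d(i) = (a − 1)(i − c*), the slope gives a − 1 and the root of the line gives the center c*.

```python
import numpy as np

# Per-sample displacements as extracted from a transport plan for an affine warp
# centered at c* = 3 with scale a = 1.5: d(i) = (a - 1) * (i - c*).
i = np.arange(7)
d = (1.5 - 1.0) * (i - 3)

slope, intercept = np.polyfit(i, d, 1)  # least-squares line fit
a_est = slope + 1.0                     # -> 1.5 (the scale a)
c_est = -intercept / slope              # -> 3.0 (the center c*)
```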

Two Stage Optimal Transport System Identification
We observed, using globally convex toy examples, that where Algorithm I would obtain an optimal solution with a Euclidean error of the order of 1e-2, a straightforward gradient descent would deliver a Euclidean error of the order of 1e-5 with nearly the same CPU time.
For applications where high accuracy is required, we propose a two-phase algorithm in the same spirit as Brent's method for root finding [Brent'1973]. The first stage uses OT to avoid the local minima trap and come into the vicinity of the global minimum. The second stage, using gradient descent, guarantees fast convergence in the Euclidean sense, while staying within a bounded margin from the optimal point from the Wasserstein distance point of view. The second stage is shown in Algorithm II.
Here, wd_min is the optimal wd distance obtained from Algorithm I.
Algorithm III
1. Find the optimal h in the wd sense using Algorithm I.
2. Refine h using gradient descent using Algorithm II.
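A minimal, self-contained sketch of the refinement stage (our own illustration of its role, not the paper's exact Algorithm II; the signals, step size, and iteration count are made up): plain gradient descent on the Euclidean cost ‖y − Xh‖², warm-started from a slightly perturbed estimate standing in for the stage-one output.

```python
import numpy as np

def refine(h0, x, y, lr=0.5, steps=500):
    """Gradient descent on ||y - X h||^2, X being the circulant matrix of the input x."""
    m = len(x)
    X = np.array([[x[(n - k) % m] for k in range(m)] for n in range(m)])  # y = X h
    h = h0.copy()
    for _ in range(steps):
        h -= lr * 2.0 * X.T @ (X @ h - y)  # gradient of the Euclidean cost
    return h

x = np.array([0.4, 0.3, 0.2, 0.1])
h_true = np.array([0.6, 0.4, 0.0, 0.0])
m = len(x)
y = np.array([sum(h_true[k] * x[(n - k) % m] for k in range(m)) for n in range(m)])

# Stand-in for the stage-one estimate: near the global minimum but not exact.
h0 = h_true + np.array([0.05, -0.05, 0.02, -0.02])
h_ref = refine(h0, x, y)
```

Warm-starting near the global minimum is the point of the two-stage design: the quadratic cost is locally well conditioned there, so plain gradient descent converges quickly.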

EXTENSIONS
The extension to multi-dimensional signals is straightforward using the appropriate circulant matrix [Gray'2000].
Because of the commutativity of the convolution, our method also applies to deconvolution. We still need to assess the robustness of our method in the presence of noise.
During the first stage of the proposed method, the Wasserstein distance can be plugged in as a cost in a multi-criterion problem using an Alternating Direction Optimization method such as ADMM [Goldstein'2014]. Within the ADMM framework, additional constraints or projections can be enforced, for example restricting h to a subset of parametric transform functions. The same additional constraints can also be applied during the second stage, again using ADMM.

SUMMARY
We presented a novel system identification method that overcomes the local minima problem in cases where the transform function involves a translation or is otherwise not globally convex in the Euclidean cost. The method is simple to implement using the transport map obtained through available OT solvers [Justin'2015, Xie'2018].
In practice, it shows strong convergence, which can be monitored using the Wasserstein distance at each iteration. While a strong theoretical basis is provided in [El Moselhy'2012], the exact conditions for convergence remain to be defined.
As exemplified in [El Moselhy'2012], the TSOTSI method overcomes both the limitations of Euclidean-distance-based gradient descent (local minima) and the convergence issues of Markov Chain Monte Carlo (sample generation, convergence monitoring) in system identification.
Algorithm I
1. wd, P = OT.solve(X, Y, C); stop if wd has converged.
2. h = regress(P); convert h to the circulant matrix H. (regress: weighted average over all samples of the allocation h[τ] for each offset τ.)
3. X := H × X; go to step 1.
In Section 3.2 we stated that P(ci, ci+τ) ≈ h[τ] · x[ci], where ci is the column index corresponding to the sample x[ci] of the input signal. Therefore we use the regression h[τ] = ∑_ci P(ci, ci+τ). For the translation case, the average transportation vector is T⃗ = (∑_{ci,τ} P(ci, ci+τ) · τ) · u⃗, where u⃗ is the unit vector for transportation.
The Two-Stage Optimal Transport System Identification method, or TSOTSI, reaches a guaranteed near-global-optimum solution. It is shown in Algorithm III. Note that in Algorithm II, step 1, automatic differentiation may be used [AD]. Alternatively, we may use other gradient-descent-based methods, including ADMM.