## What is *Mathematical Methods for Arbitrary Data Sources*?

The lecture series will collect talks on mathematical disciplines related to all kind of data, ranging from statistics and machine learning to model-based approaches and inverse problems. Each pair of talks will address a specific direction, e.g., a NoMADS session related to nonlocal approaches or a DeepMADS session related to deep learning.

The series is created in the spirit of the One World Series pioneered by the seminars in probability and PDE.

## Using Zoom

For this online seminar we will use zoom as video service. Approximately 15 minutes prior to the beginning of the lecture, a zoom link will be provided on this website and via mailing list.

## Mailing list

Please subscribe to our mailing list by filling this form.

## Materials and Video Recordings

Additional material, slides and video recordings are attached to the respective session in our session archive. Additionally please visit our YouTube channel for video recordings of past sessions.

## Past Sessions

### Apr 20, 2020: Session I

Speaker Title tap/hover for abstract Materials Gabriel PeyréCNRS, FR

Ecole Normale Supérieure, FRScaling Optimal Transport for High dimensional Learning

Scaling Optimal Transport for High dimensional Learning

Optimal transport (OT) has recently gained lot of interest in machine learning. It is a natural tool to compare in a geometrically faithful way probability distributions. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is however plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will review entropic regularization methods which define geometric loss functions approximating OT with a better sample complexity. More information and references can be found on the website of our book Computational Optimal Transport.video

slidesMarie-Therese WolframWarwick University, UK

joint work with:Andrew StuartCaltech, USInverse Optimal Transport

Inverse Optimal Transport

Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Examples include the marriage market in economics or international migration flows in demographics. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this talk we discuss a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The proposed methodologies are developed within the Bayesian framework for inverse problems and require only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. We illustrate our approach using the example of international migration flows. Here reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of an optimal transportation map, with costs related to the geographical position of countries. We use a graph-based formulation of the problem, with countries at the nodes of graphs and non-zero weighted adjacencies only on edges between countries which share a border. We use the proposed algorithm to estimate the weights, which represent cost of transition, and to quantify uncertainty in these weights.video

slides### May 4, 2020: Session II

Speaker Title tap/hover for abstract Materials Lorenzo RosascoUniversitá di Genova, IT

MIT, USEfficient learning with random projections

Efficient learning with random projections

Despite stunning performances, state of the art machine learning approaches are often computational intensive and efficiency remains a challenge. Dimensionality reduction, if performed efficiently, provides a way to reduce the computational requirements of downstream tasks, but possibly at the expanses of the obtained accuracy. In this talk, we discuss the interplay between accuracy and efficiency when dimensionality reduction is performed by means of, possibly data dependent, random projections. The latter are related to discretization methods for integral operators, to sampling methods in randomized numerical linear algebra and to sketching methods. Our results show that there are number of different tasks and regimes where, using random projections and regularization, efficiency can be improved with no costs in accuracy. Theoretical results are used to derive scalable and fast kernel methods for datasets with millions of points.video

slidesMichaël FanuelKU Leuven, BE

joint work with:Joachim Schreurs, Johan SuykensDiversity sampling in kernel method

Diversity sampling in kernel method

A well-known technique for large scale kernel methods is the Nyström approximation. Based on a subset of landmarks, it gives a low rank approximation of the kernel matrix, and is known to provide a form of implicit regularization. We will discuss the impact of sampling diverse landmarks for constructing the Nyström approximation in supervised and unsupervised problems. In particular, three methods will be considered: uniform sampling, leverage score sampling and Determinantal Point Processes (DPP). The implicit regularization due the diversity of the landmarks will be made explicit by numerical simulations and analysed further in the case of DPP sampling by some theoretical results.video

slides### May 18, 2020: Session III

Please note: The second talk by Fracis Bach was jointly organized with the One World Optimization Seminar.

Speaker Title tap/hover for abstract Materials Lars RuthottoEmory University, US Machine Learning meets Optimal Transport: Old solutions for new problems and vice versa

Machine Learning meets Optimal Transport: Old solutions for new problems and vice versa

This talk presents new connections between optimal transport (OT), which has been a critical problem in applied mathematics for centuries, and machine learning (ML), which has been receiving enormous attention in the past decades. In recent years, OT and ML have become increasingly intertwined. This talk contributes to this booming intersection by providing efficient and scalable computational methods for OT and ML.

The first part of the talk shows how neural networks can be used to efficiently approximate the optimal transport map between two densities in high dimensions. To avoid the curse-of-dimensionality, we combine Lagrangian and Eulerian viewpoints and employ neural networks to solve the underlying Hamilton-Jacobi-Bellman equation. Our approach avoids any space discretization and can be implemented in existing machine learning frameworks. We present numerical results for OT in up to 100 dimensions and validate our solver in a two-dimensional setting.

The second part of the talk shows how optimal transport theory can improve the efficiency of training generative models and density estimators, which are critical in machine learning. We consider continuous normalizing flows (CNF) that have emerged as one of the most promising approaches for variational inference in the ML community. Our numerical implementation is a discretize-optimize method whose forward problem relies on manually derived gradients and Laplacian of the neural network and uses automatic differentiation in the optimization. In common benchmark challenges, our method outperforms state-of-the-art CNF approaches by reducing the network size by 8x, accelerate the training by 10x- 40x and allow 30x-50x faster inference.video

slidesFrancis BachUniversité PSL, FR

joint work with:Lénaïc ChizatOn the convergence of gradient descent for wide two-layer neural networks

On the convergence of gradient descent for wide two-layer neural networks

Many supervised learning methods are naturally cast as optimization problems. For prediction models which are linear in their parameters, this often leads to convex problems for which many guarantees exist. Models which are non-linear in their parameters such as neural networks lead to non-convex optimization problems for which guarantees are harder to obtain. In this talk, I will consider two-layer neural networks with homogeneous activation functions where the number of hidden neurons tends to infinity, and show how qualitative convergence guarantees may be derived. I will also highlight open problems related to the quantitative behavior of gradient descent for such models.video

slides### Jun 8, 2020: Session IV

Speaker Title tap/hover for abstract Materials Michael UnserBiomedical Imaging Group EPFL, CH Representer theorems for machine learning and inverse problems

Representer theorems for machine learning and inverse problems

Regularization addresses the ill-posedness of the training problem in machine learning or the reconstruction of a signal from a limited number of measurements. The standard strategy consists in augmenting the original cost functional by an energy that penalizes solutions with undesirable behaviour. In this presentation, I will present a general representer theorem that characterizes the solutions of a remarkably broad class of optimization problems in Banach spaces and helps us understand the effect of regularization. I will then use the theorem to retrieve some classical characterizations such as the celebrated representer theorem of machine leaning for RKHS, Tikhonov regularization, representer theorems for sparsity promoting functionals, as well as a few new ones, including a result for deep neural networks.video

slidesVincent DuvalInria, FR Representing the solutions of total variation regularized problems

Representing the solutions of total variation regularized problems

The total (gradient) variation is a regularizer which has been widely used in inverse problems arising in image processing, following the pioneering work of Rudin, Osher and Fatemi. In this talk, I will describe the structure the solutions to the total variation regularized variational problems when one has a finite number of measurements.

First, I will present a general representation principle for the solutions of convex problems, then I will apply it to the total variation by describing the faces of its unit ball.

It is a joint work with Claire Boyer, Antonin Chambolle, Yohann De Castro, Frédéric de Gournay and Pierre Weiss.video

slides### Jun 15, 2020: Session V

Speaker Title tap/hover for abstract Materials Andrea BraidesUniversità Roma Tor Vergata, IT Continuum limits of interfacial energies on (sparse and) dense graphs

Continuum limits of interfacial energies on (sparse and) dense graphs

I review some results on the convergence of energies defined on graphs. My interest in such energies comes from models in Solid Mechanics (where the bonds in the graph represent the relevant atomistic interactions) or Statistical Physics (Ising systems), but the nodes of the graph can also be thought as a collection of data on which the bonds describe some relation between the data.

The typical objective is an approximate (simplified) continuum description of problems of minimal cut as the number N of the nodes of the graphs diverges.

If the graphs are sparse (i.e. the number of bonds is much less than the total number of pairs of nodes as N goes to infinity), often (more precisely when we have some control on the range or on the decay of the interactions) such minimal-cut problems translate into minimal-perimeter problems for sets or partitions on the continuum. This description is easily understood for periodic lattice systems, but carries on also for random distributions of nodes. In the case of a (locally) uniform Poisson distribution, actually the limit minimal-cut problems are described by more regular energies than in the periodic-lattice case.

When we relax the hypothesis on the range of interactions, the description of the limit of sparse graphs becomes more complex, as it depends subtly on geometric characteristics of the graph, and is partially understood. Some easy examples show that, even though for the continuum limit we still remain in a similar analytical environment, the description as (sharp) interfacial energies can be lost in this case, and more “diffuse” interfaces must be taken into account.

If instead we consider dense sequences of graphs (i.e., the number of bonds is of the same order as the total number of pairs as N goes to infinity) then a completely different limit environment must be used, that of graphons (which are abstract limits of graphs), for which sophisticated combinatoric results can be used. We can re-read the existing notion of convergence of graphs to graphons as a convergence of the related cut functionals to non-local energies on a simple reference parameter set. This convergence provides an approximate description of the corresponding minimal-cup problems.

Works in collaboration with Alicandro, Cicalese, Piatnitski and Solci (sparse graphs) and Cermelli and Dovetta (dense graphs).video

slidesNicolás García TrillosUniversity of Wisconsin-Madison, US Regularity theory and uniform convergence in the large data limit of graph Laplacian eigenvectors on random data clouds

Regularity theory and uniform convergence in the large data limit of graph Laplacian eigenvectors on random data clouds

Graph Laplacians are omnipresent objects in machine learning that have been used in supervised, unsupervised and semi supervised settings due to their versatility in extracting local and global geometric information from data clouds. In this talk I will present an overview of how the mathematical theory built around them has gotten deeper and deeper, layer by layer, since the appearance of the first results on pointwise consistency in the 2000’s, until the most recent developments; this line of research has found strong connections between PDEs built on proximity graphs on data clouds and PDEs on manifolds, and has given a more precise mathematical meaning to the task of “manifold learning”. In the first part of the talk I will highlight how ideas from optimal transport made some of the initial steps, which provided L2 type error estimates between the spectra of graph Laplacians and Laplace-Beltrami operators, possible. In the second part of the talk, which is based on recent work with Jeff Calder and Marta Lewicka, I will present a newly developed regularity theory for graph Laplacians which among other things allow us to bootstrap the L2 error estimates developed through optimal transport and upgrade them to uniform convergence and almost C^{0,1} convergence rates. The talk can be seen as a tale of how a flow of ideas from optimal transport, PDEs, and in general, analysis, has made possible a finer understanding of concrete objects popular in data analysis and machine learning.video

slides### Jun 29, 2020: Session VI

Speaker Title tap/hover for abstract Materials Jana de WiljesUniversität Potsdam, DE Sequential learning for decision support under uncertainty

Sequential learning for decision support under uncertainty

In many applicational areas there is a need to determine a control variable that optimizes a pre-specified objective. This problem is particularly challenging when knowledge on the underlying dynamics is subject to various sources of uncertainty. A scenario such as that arises for instance in the context of therapy individualization to improve the efficacy and safety of medical treatment. Mathematical models describing the pharmacokinetics and pharmacodynamics of a drug together with data on associated biomarkers can be leveraged to support decision-making by predicting therapy outcomes. We present a continuous learning strategy which follows a novel sequential Monte Carlo tree search approach and explore how the underlying uncertainties reflect in the approximated control variable.video

slidesBjörn SprungkTU Freiburg, DE Noise-level robust Monte Carlo methods for Bayesian inference with infomative data

Noise-level robust Monte Carlo methods for Bayesian inference with infomative data

The Bayesian approach to inverse problems provides a rigorous framework for the incor-poration and quantification of uncertainties in measurements, parameters and models. However, sampling from or integrating w.r.t. the resultung posterior measure can become computationally challenging. In recent years, a lot of effort has been spent on deriving dimension-independent methods and to combine efficient sampling strategies with multilevel or surrogate methods in order to reduce the computational burden of Bayesian inverse problems.

In this talk, we are interested in designing numerical methods which are robust w.r.t. the size of the observational noise, i.e., methods which behave well in case of concentrated posterior measures. The concentration of the posterior is a highly desirable situation in practice, since it relates to informative or large data. However, it can pose as well a significant computational challenge for numerical methods based on the prior or reference measure. We propose to employ the Laplace approximation of the posterior as the base measure for numerical integration in this context. The Laplace approximation is a Gaussian measure centered at the maximum a-posteriori estimate (MAPE) and with covariance matrix depending on the Hessian of the log posterior density at the MAPE. We discuss convergence results of the Laplace approximation in terms of the Hellinger distance and analyze the efficiency of Monte Carlo methods based on it. In particular, we show that Laplace-based importance sampling and quasi-Monte-Carlo as well as Laplace-based Metropolis-Hastings algorithms are robust w.r.t. the concentration of the posterior for large classes of posterior distributions and integrands whereas prior-based Monte Carlo sampling methods are not.video

slides

## Organizers

Leon Bungert

Martin Burger

Antonio Esposito

Janic Föcke

Daniel Tenbrinck

Philipp Wacker