Loucas Pillaud-Vivien

Briefly
I am an assistant professor at École des Ponts ParisTech and a researcher at CERMICS, in the Applied Probability team. Prior to this, I was a Courant Instructor/Flatiron Fellow, jointly at the Courant Institute of NYU and the Flatiron Institute, where I worked mainly with Joan Bruna. Before that, I was a postdoc in the Theory of Machine Learning group of Nicolas Flammarion at EPFL. I did my Ph.D. in the SIERRA Team, under the supervision of Francis Bach and Alessandro Rudi, on stochastic approximation for high-dimensional learning problems.
Contact
Physical address: 6 and 8 avenue Blaise Pascal, Cité Descartes, Champs-sur-Marne. [Cermics].
Email: loucas.pillaud-vivien [at] enpc [dot] fr

Research interests
My main research interests are centred around optimization, statistics and stochastic models. More precisely, here is a selection of research topics I am interested in:
Gradient flow for nonconvex learning problems (as we barely understand them, why bother with discrete updates?)
Implicit bias of optimization in overparametrized architectures
Stochastic Differential Equations (and PDEs) and how they can model and help us understand machine learning problems
Stochastic approximations in Hilbert spaces
Kernel methods
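As a toy illustration of the stochastic models mentioned above (a minimal sketch of my own, not taken from any of the papers below; all parameter values are arbitrary choices for the example), here is an Euler-Maruyama discretisation of an Ornstein-Uhlenbeck process, one of the simplest stochastic differential equations used to model noisy dynamics:

```python
import math
import random

# Illustrative sketch only: Euler-Maruyama discretisation of the
# Ornstein-Uhlenbeck SDE  dX_t = -theta * X_t dt + sigma * dW_t.
# All parameter values below are arbitrary choices for the example.
def euler_maruyama(x0=1.0, theta=1.0, sigma=0.3, dt=1e-3, steps=5_000, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        # Deterministic drift pulls x toward 0; the Gaussian increment
        # of standard deviation sqrt(dt) discretises the Brownian motion.
        x += -theta * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

# After time T = steps * dt = 5, the process has essentially forgotten x0
# and fluctuates around 0 with standard deviation sigma / sqrt(2 * theta).
print(euler_maruyama())
```

The same discretisation template applies to the stochastic flows appearing in the papers below, with the drift replaced by a (negative) loss gradient.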
Selected Publications
For a complete list of my publications, rendezvous to my Google Scholar page. Do not worry, you'll find the same turtle photo there.
A. Bietti, J. Bruna, L. Pillaud-Vivien. On Learning Gaussian Multi-index Models with Gradient Flow. [arXiv:2310.19793], Submitted, 2023. [Show Abstract]
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks.
We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a nonparametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated 'saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related planted problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
M. Andriushchenko, A. V. Varre, L. Pillaud-Vivien, N. Flammarion. SGD with large step sizes learns sparse features. [arXiv:2210.05337], ICML, 2023. [Show Abstract]
We showcase important features of the dynamics of the Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that biases it implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step sizes schedule. Therefore, these observations unveil how, through the step size schedules, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired from stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.
L. Pillaud-Vivien, F. Bach. Kernelized Diffusion maps. [arXiv:2302.06757], COLT, 2023. [Show Abstract]
Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigenelements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach; however, this local average construction is known to be cursed by the high dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally, we discuss techniques (Nyström subsampling, Fourier features) that make it possible to reduce the computational cost of the estimator while not degrading its overall performance.
E. Boursier, L. Pillaud-Vivien, N. Flammarion. Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. [arXiv:2206.00939], NeurIPS, 2022. [Show Abstract]
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite nonconvexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle-to-saddle dynamics.
L. Pillaud-Vivien, J. Reygner, N. Flammarion. Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation. [arXiv:2206.09841], COLT, 2022. [Show Abstract]
Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous time version. We explicitly characterise the solution chosen by the stochastic flow and prove that it implicitly solves a Lasso program. To fully complete our analysis, we provide non-asymptotic convergence guarantees for the dynamics as well as conditions for support recovery. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and help explain the better performance of stochastic dynamics observed in practice.
L. Pillaud-Vivien, A. Rudi, F. Bach. Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes. [arXiv:1805.10074], NeurIPS, 2018. [Show Abstract]
We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
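To make the mechanics of multiple passes concrete, here is a minimal, hypothetical sketch (not the paper's actual experiments or its hard/easy regimes; the model, constants and function names are illustrative) of single-sample SGD run for several epochs on a scalar least-squares problem:

```python
import random

# Toy setup (illustrative, unrelated to the paper's experiments):
# observations y = w_star * x + noise for a scalar least-squares problem.
rng = random.Random(0)
w_star = 2.0
data = [(x, w_star * x + rng.gauss(0.0, 0.1))
        for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

def sgd(passes, step=0.1):
    """Single-sample SGD on the squared loss, repeated for `passes` epochs."""
    w = 0.0
    for _ in range(passes):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # gradient of (w * x - y)**2 in w
            w -= step * grad
    return w

print(sgd(1), sgd(5))  # both estimates end up close to w_star = 2.0
```

In this easy one-dimensional problem a single pass already suffices; the paper's point is precisely that on hard (infinite-dimensional) problems the optimal number of passes grows with the sample size.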
PhD Thesis
I defended my thesis in October 2020.
You can download the final version of the manuscript via this link [Thesis].
You can also have a look at the slides. [Slides]
Some Presentations
Review
Reviewer for Journals:
Reviewer for Conferences:
