Loucas Pillaud-Vivien
Briefly
Since September 2022, I have been a Courant Instructor/Flatiron Fellow, jointly at the Courant Institute of NYU and the Flatiron Institute, where I work mainly with the amazing Joan Bruna, Professor at the Courant Institute and the Center for Data Science, New York University. Prior to this, I was a postdoc in the Theory of Machine Learning group of Nicolas Flammarion at EPFL. I did my Ph.D. in the SIERRA team, which is part of the Computer Science Department of École Normale Supérieure (Ulm) and is also a joint team between CNRS and INRIA. There, I was advised by Francis Bach and Alessandro Rudi and worked on stochastic approximation for high-dimensional learning problems.
Before that, I graduated from École Polytechnique in 2016 and obtained a master's degree in PDEs (ANEDP) from Paris VI and École des Ponts ParisTech. I did my master's thesis on molecular dynamics at CERMICS, where I was advised by Julien Reygner and Tony Lelievre.
Contact
Research interests
My main research interests are centred around optimization, statistics and stochastic models. More precisely, here is a selection of research topics I am interested in:
Gradient flow for non-convex learning problems (as we barely understand them, why bother with discrete updates?)
Implicit bias of optimization in overparametrized architectures
Stochastic Differential Equations (and PDEs) and how they can model and help us understand machine learning problems
Stochastic approximations in Hilbert spaces
Kernel methods
Publications
You can also consult them on my Google Scholar page. Do not worry, you'll find the same turtle photo there.
A. Bietti, J. Bruna, L. Pillaud-Vivien. On Learning Gaussian Multi-index Models with Gradient Flow. [arxiv:2310.19793], Preprint, 2023.
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks.
We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated ‘saddle-to-saddle’ dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related planted problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
J. Bruna, L. Pillaud-Vivien, A. Zweig. On Single Index Models beyond Gaussian Data. [arxiv:2307.15804], NeurIPS, 2023.
Sparse high-dimensional functions have arisen as a rich framework to study the behavior of gradient-descent methods using shallow neural networks, showcasing their ability to perform feature learning beyond linear models. Amongst those functions, the simplest are single-index models f(x)=ϕ(x⋅θ∗), where the labels are generated by an arbitrary non-linear scalar link function ϕ applied to an unknown one-dimensional projection θ∗ of the input data. By focusing on Gaussian data, several recent works have built a remarkable picture, where the so-called information exponent (related to the regularity of the link function) controls the required sample complexity. In essence, these tools exploit the stability and spherical symmetry of Gaussian distributions. In this work, building on the framework of Ben Arous et al. (2020), we explore extensions of this picture beyond the Gaussian setting, where stability or symmetry might be violated. Focusing on the planted setting where ϕ is known, our main results establish that Stochastic Gradient Descent can efficiently recover the unknown direction θ∗ in the high-dimensional regime, under assumptions that extend previous works (Yehudai et al., 2020; Wu, 2022).
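As a toy illustration of the planted setting above (not the algorithm or constants analysed in the paper), here is a minimal numpy sketch of online spherical SGD with fresh Gaussian samples and a known link function; the dimension, link, step size and horizon are arbitrary choices.

```python
import numpy as np

# Toy planted single-index model y = phi(<x, theta_star>) with a known link phi,
# recovered by online (one fresh sample per step) spherical SGD on the square loss.
# All constants are illustrative, not those analysed in the paper.
rng = np.random.default_rng(0)
d = 100
theta_star = np.zeros(d)
theta_star[0] = 1.0                                  # hidden direction to recover
phi = np.tanh                                        # known link (odd, so information exponent 1)
dphi = lambda u: 1.0 - np.tanh(u) ** 2               # its derivative

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)                       # random start on the sphere
lr = 1.0 / d                                         # small dimension-dependent step size

for _ in range(50_000):
    x = rng.standard_normal(d)                       # fresh Gaussian sample
    y = phi(x @ theta_star)                          # noiseless planted label
    pred = x @ theta
    theta -= lr * (phi(pred) - y) * dphi(pred) * x   # SGD step on the square loss
    theta /= np.linalg.norm(theta)                   # project back onto the sphere

print("final overlap <theta, theta_star> =", theta @ theta_star)
```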
M. Andriushchenko, A.V. Varre, L. Pillaud-Vivien, N. Flammarion. SGD with large step sizes learns sparse features. [arxiv:2210.05337], ICML, 2023.
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that implicitly biases the iterates toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step size schedule. Therefore, these observations unveil how, through the step size schedule, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks.
L. Pillaud-Vivien, F. Bach. Kernelized Diffusion maps. [arxiv:2302.06757], COLT, 2023.
Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach; however, this local averaging construction is known to be cursed by the high dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally, we discuss techniques (Nyström subsampling, Fourier features) that reduce the computational cost of the estimator without degrading its overall performance.
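For contrast with the paper's reproducing-kernel estimator, here is a short sketch of the classical graph-kernel construction referred to in the abstract (Gaussian affinities, row normalisation, eigen-decomposition); the data and bandwidth are arbitrary, and this is the local-averaging baseline, not the proposed method.

```python
import numpy as np

# Classical diffusion maps: the local-averaging graph-kernel construction that
# the abstract refers to (not the paper's reproducing-kernel Laplacian estimator).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))                  # toy point cloud
eps = 0.5                                          # arbitrary kernel bandwidth

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * eps))                  # Gaussian affinity matrix
P = K / K.sum(axis=1, keepdims=True)               # row-normalised diffusion operator

eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)                  # sort eigenvalues in decreasing order
# Embed with the leading non-trivial eigenvectors, scaled by their eigenvalues.
embedding = eigvecs.real[:, order[1:3]] * eigvals.real[order[1:3]]
print("diffusion-map embedding shape:", embedding.shape)
```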
E. Boursier, L. Pillaud-Vivien, N. Flammarion. Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. [arxiv:2206.00939, pdf], NeurIPS, 2022.
Abstract: The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle-to-saddle dynamics.
L. Pillaud-Vivien, J. Reygner, N. Flammarion. Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation. [arxiv:2206.09841, pdf], COLT, 2022.
Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the role of the label noise in the training dynamics of a quadratically parametrised model through its continuous-time version. We explicitly characterise the solution chosen by the stochastic flow and prove that it implicitly solves a Lasso program. To fully complete our analysis, we provide non-asymptotic convergence guarantees for the dynamics as well as conditions for support recovery. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and help explain the better performance of stochastic dynamics observed in practice.
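A toy simulation of this mechanism fits in a few lines: gradient descent on a quadratically parametrised linear model where fresh label noise is injected at every step. This is only an illustrative sketch under a nonnegative simplification (beta = u * u), with arbitrary constants, not the exact setting or constants analysed in the paper.

```python
import numpy as np

# Label-noise gradient descent on a quadratically parametrised linear model,
# beta = u * u (nonnegative simplification). Fresh label noise is injected at
# every step; the theory predicts a slow drift towards sparse interpolators.
# All constants are illustrative.
rng = np.random.default_rng(0)
n, d = 20, 40
beta_star = np.zeros(d)
beta_star[:3] = 1.0                                # sparse nonnegative ground truth
X = rng.standard_normal((n, d)) / np.sqrt(n)
y = X @ beta_star                                  # noiseless, overparametrised problem

u = 0.1 * np.ones(d)                               # small balanced initialisation
lr, sigma = 0.01, 0.5                              # step size and label-noise level

for _ in range(300_000):
    noise = sigma * rng.standard_normal(n)         # fresh label noise at each step
    residual = X @ (u * u) - (y + noise)
    u -= lr * 2 * u * (X.T @ residual)             # chain rule through beta = u * u

beta = u * u
print("five largest coordinates of beta:", np.round(np.sort(beta)[-5:], 3))
print("l1 norm of beta:", round(float(beta.sum()), 3))
```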
S. Pesme, L. Pillaud-Vivien, N. Flammarion. Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity. [arxiv:2106.09524, pdf], NeurIPS, 2021.
Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous-time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation, and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.
V. Cabannes, L. Pillaud-Vivien, F. Bach, A. Rudi. Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning. [arxiv:2009.04324, pdf], NeurIPS, 2021.
Abstract: As annotations of data can be scarce in large-scale practical problems, leveraging unlabelled examples is one of the most important aspects of machine learning. This is the aim of semi-supervised learning. To benefit from the access to unlabelled data, it is natural to diffuse knowledge smoothly from labelled data to unlabelled data. This leads to the use of Laplacian regularization. Yet, current implementations of Laplacian regularization suffer from several drawbacks, notably the well-known curse of dimensionality. In this paper, we provide a statistical analysis to overcome those issues, and unveil a large body of spectral filtering methods that exhibit desirable behaviors. They are implemented through (reproducing) kernel methods, for which we provide realistic computational guidelines in order to make our method usable with large amounts of data.
A. Varre, L. Pillaud-Vivien, N. Flammarion. Last iterate convergence of SGD for Least-Squares in the Interpolation regime. [arxiv:2102.03183, pdf], NeurIPS, 2021.
Abstract: Motivated by the recent successes of neural networks that have the ability to fit the data perfectly and generalize well, we study the noiseless model in the fundamental least-squares setup. We assume that an optimum predictor fits perfectly inputs and outputs ⟨θ∗,ϕ(X)⟩=Y, where ϕ(X) stands for a possibly infinite-dimensional non-linear feature map. To solve this problem, we consider the estimator given by the last iterate of stochastic gradient descent (SGD) with constant step-size. In this context, our contribution is twofold: (i) from a (stochastic) optimization perspective, we exhibit an archetypal problem where we can show explicitly the convergence of the final SGD iterate for a non-strongly convex problem with constant step-size, whereas usual results require some form of averaging, and (ii) from a statistical perspective, we give explicit non-asymptotic convergence rates in the over-parameterized setting and leverage a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than O(1/T). The link with reproducing kernel Hilbert spaces is established.
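The interpolation setting is easy to illustrate: in a noiseless linear model ⟨θ∗, ϕ(X)⟩ = Y, the last iterate of constant step-size SGD converges without any averaging. Below is a minimal sketch with an identity feature map and arbitrary constants (the paper's analysis covers far more general, possibly infinite-dimensional, feature maps).

```python
import numpy as np

# Last iterate of constant step-size SGD on noiseless least squares,
# <theta_star, x> = y, with an identity feature map. Illustrative constants only.
rng = np.random.default_rng(0)
d = 50
theta_star = rng.standard_normal(d)
theta = np.zeros(d)
lr = 0.5 / d                                       # constant step size

for _ in range(100_000):
    x = rng.standard_normal(d)                     # fresh sample (streaming, single pass)
    y = x @ theta_star                             # noiseless label: interpolation holds
    theta -= lr * (x @ theta - y) * x              # plain SGD step, no averaging

print("squared distance to theta_star:", float(np.sum((theta - theta_star) ** 2)))
```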
L. Pillaud-Vivien, F. Bach, T. Lelievre, A. Rudi, G. Stoltz. Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal Distributions. [arxiv:1910.14564, pdf], AISTATS, 2019.
Abstract: Poincaré inequalities are ubiquitous in probability and analysis and have various applications in statistics (concentration of measure, rate of convergence of Markov chains). The Poincaré constant, for which the inequality is tight, is related to the typical convergence rate of diffusions to their equilibrium measure. In this paper, we show both theoretically and experimentally that, given sufficiently many samples of a measure, we can estimate its Poincaré constant. As a by-product of the estimation of the Poincaré constant, we derive an algorithm that captures a low-dimensional representation of the data by finding directions which are difficult to sample. These directions are of crucial importance for sampling or in fields like molecular dynamics, where they are called reaction coordinates. Knowledge of these directions can, with a simple conditioning step, alleviate computational bottlenecks through importance sampling techniques.
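To make the estimation problem concrete, here is a small finite-dictionary toy version (a simplification, not the paper's kernel-based estimator): over the span of a few test functions, the Poincaré constant is the largest generalised eigenvalue of the empirical covariance against the empirical Dirichlet energy; for the standard Gaussian the true constant is 1.

```python
import numpy as np
from scipy.linalg import eigh

# Estimate the Poincare constant of a measure from samples, restricted to the
# span of a small dictionary of test functions (a toy simplification of the
# kernel estimator studied in the paper). For N(0, 1) the true constant is 1.
rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)                    # samples from the standard Gaussian

# Dictionary of test functions and their derivatives (monomials, arbitrary choice).
feats = np.stack([x, x ** 2, x ** 3], axis=1)
dfeats = np.stack([np.ones_like(x), 2 * x, 3 * x ** 2], axis=1)

C = np.cov(feats, rowvar=False)                    # empirical covariances Var(f)
D = (dfeats.T @ dfeats) / len(x)                   # empirical Dirichlet energies E[f'^2]

# Poincare constant on the span: largest lambda such that C v = lambda D v.
eigvals = eigh(C, D, eigvals_only=True)
print("estimated Poincare constant on the span:", round(float(eigvals.max()), 3))
```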
T. Lelievre, L. Pillaud-Vivien, J. Reygner. Central Limit Theorem for stationary Fleming-Viot particle systems in finite spaces. [arXiv:1806.04490, pdf], ALEA Latin American Journal of Probability and Mathematical Statistics, 2018.
Abstract: We consider the Fleming-Viot particle system associated with a continuous-time Markov chain in a finite space. Assuming irreducibility, it is known that the particle system possesses a unique stationary distribution, under which its empirical measure converges to the quasistationary distribution of the Markov chain. We complement this Law of Large Numbers with a Central Limit Theorem. Our proof essentially relies on elementary computations on the infinitesimal generator of the Fleming-Viot particle system, and involves the so-called π-return process in the expression of the asymptotic variance. Our work can be seen as an infinite-time version, in the setting of finite space Markov chains, of recent results by Cérou, Delyon, Guyader and Rousset [arXiv:1611.00515, arXiv:1709.06771].
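The particle system itself is straightforward to simulate. Here is an illustrative sketch on a small chain with an absorbing state, in discrete time for simplicity (the paper treats continuous-time chains) and with arbitrary transition probabilities: whenever a particle is absorbed, it restarts from the position of a uniformly chosen other particle, and the empirical measure approximates the quasi-stationary distribution.

```python
import numpy as np

# Fleming-Viot particle system on a finite chain with absorbing state 0
# (discrete-time variant for simplicity; the paper treats continuous time).
# Upon absorption, a particle jumps onto a uniformly chosen other particle.
rng = np.random.default_rng(0)
P = np.array([                                     # arbitrary transition matrix on {0, 1, 2, 3}
    [1.0, 0.0, 0.0, 0.0],                          # state 0 is absorbing
    [0.2, 0.5, 0.3, 0.0],
    [0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.4, 0.6],
])
n_particles, n_steps, burn_in = 100, 2_000, 200
pos = rng.integers(1, 4, size=n_particles)         # start inside the non-absorbed states

counts = np.zeros(4)
for step in range(n_steps):
    for i in range(n_particles):
        pos[i] = rng.choice(4, p=P[pos[i]])
        if pos[i] == 0:                            # absorbed: branch onto another particle
            j = rng.integers(n_particles - 1)
            pos[i] = pos[j if j < i else j + 1]    # uniform over the other particles
    if step >= burn_in:
        counts += np.bincount(pos, minlength=4)

print("empirical quasi-stationary distribution on {1, 2, 3}:", counts[1:] / counts[1:].sum())
```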
L. Pillaud-Vivien, A. Rudi, F. Bach. Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes. [arXiv:1805.10074, pdf, poster], NeurIPS, 2018.
Abstract: We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that for these hard models, the optimal number of passes over the data increases with the sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
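As a minimal illustration of the single-pass versus multi-pass protocol (with easy toy data and arbitrary constants; the paper's statistical comparison concerns hard, possibly infinite-dimensional problems): the same SGD update applied for several shuffled epochs over a fixed dataset.

```python
import numpy as np

# Single pass versus multiple passes of SGD on a toy least-squares problem.
# Purely illustrative of the protocol; the paper's hard problems are different.
rng = np.random.default_rng(0)
n, d = 200, 20
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)  # noisy labels

def sgd(n_passes, lr=0.01):
    theta = np.zeros(d)
    for _ in range(n_passes):
        for i in rng.permutation(n):               # one pass = one shuffled epoch
            theta -= lr * (X[i] @ theta - y[i]) * X[i]
    return theta

for n_passes in (1, 5, 20):
    theta = sgd(n_passes)
    print(f"{n_passes:2d} pass(es): distance to theta_star =",
          round(float(np.linalg.norm(theta - theta_star)), 3))
```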
L. Pillaud-Vivien, A. Rudi, F. Bach. Exponential convergence of testing error for stochastic gradient methods. [arXiv:1712.04755, pdf, video, poster], COLT, 2018.
Abstract: We consider binary classification problems with positive definite kernels and square loss, and study the convergence rates of stochastic gradient methods. We show that while the excess testing loss (squared loss) converges slowly to zero as the number of observations (and thus iterations) goes to infinity, the testing error (classification error) converges exponentially fast if low-noise conditions are assumed.
PhD Thesis
I defended my thesis in October 2020.
You can download the final version of the manuscript via this link [Thesis].
You can also have a look at the slides [Slides].
Some Presentations
Review
Reviewer for Journals:
Reviewer for Conferences: