#toxiv_bot_toot

2025-06-09

[2025-06-09 Mon (UTC), no new articles found for q-fin.PM Portfolio Management]

#toXiv_bot_toot #toXiv_bot_new_article_summary_toot

2025-06-09

Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias

Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang
arxiv.org/abs/2506.06280 arxiv.org/pdf/2506.06280 arxiv.org/html/2506.06280

arXiv:2506.06280v1 Announce Type: new
Abstract: Diagnosing deep neural networks (DNNs) through the eigenspectrum of weight matrices has been an active area of research in recent years. At a high level, eigenspectrum analysis of DNNs involves measuring the heavytailness of the empirical spectral densities (ESD) of weight matrices. It provides insight into how well a model is trained and can guide decisions on assigning better layer-wise training hyperparameters. In this paper, we address a challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavytailness metrics. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating heavytailness metrics, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavytailness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that measuring the heavytailness of these submatrices with the fixed aspect ratio can effectively mitigate the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment in these application domains. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with the state-of-the-art method.
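A rough sketch of the subsampling step described above, in numpy: draw several submatrices with a fixed aspect ratio, compute the eigenvalues of each, and average the spectra. The window-sizing rule, number of samples, and use of a plain eigendecomposition are illustrative assumptions, not the paper's exact choices; the heavy-tail metric itself (which would be fit on the averaged ESD) is not reproduced.

```python
import numpy as np

def esd(W):
    """Eigenvalues of the correlation matrix W^T W / n (empirical spectral density)."""
    return np.linalg.eigvalsh(W.T @ W / W.shape[0])

def farms_esd(W, aspect_ratio=1.0, n_samples=16, seed=0):
    """Average ESD over randomly subsampled submatrices sharing a fixed aspect ratio."""
    rng = np.random.default_rng(seed)
    rows, cols = W.shape
    sub_cols = min(rows, cols)                   # assumption: window set by the smaller side
    sub_rows = int(round(sub_cols * aspect_ratio))
    spectra = []
    for _ in range(n_samples):
        r = rng.choice(rows, size=sub_rows, replace=False)
        c = rng.choice(cols, size=sub_cols, replace=False)
        spectra.append(esd(W[np.ix_(r, c)]))
    return np.mean(spectra, axis=0)              # a heavy-tail metric would be fit on this

# Two layers with very different aspect ratios now yield comparable spectra.
print(farms_esd(np.random.randn(256, 1024))[-3:])
print(farms_esd(np.random.randn(1024, 256))[-3:])
```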

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
arxiv.org/abs/2506.06278 arxiv.org/pdf/2506.06278 arxiv.org/html/2506.06278

arXiv:2506.06278v1 Announce Type: new
Abstract: Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
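A minimal sketch of the Unlearn-Noise-Distill-on-Outputs recipe as described in the abstract: copy the unlearned model, partially noise the copy's weights, then distill the teacher's output distribution into it. The convex weight/noise mix and the KL distillation loss are assumptions for illustration, not the paper's exact noising schedule or training setup.

```python
import copy
import torch
import torch.nn.functional as F

def partially_noise(model, alpha=0.5):
    """Copy the unlearned model and mix each weight with Gaussian noise of matching scale."""
    student = copy.deepcopy(model)
    with torch.no_grad():
        for p in student.parameters():
            noise = torch.randn_like(p) * p.std()
            p.mul_(1 - alpha).add_(alpha * noise)
    return student

def distill_step(teacher, student, optimizer, batch, temperature=1.0):
    """One step: the student matches the (unlearned) teacher's soft output distribution."""
    with torch.no_grad():
        t_logits = teacher(batch)
    s_logits = student(batch)
    loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                    F.softmax(t_logits / temperature, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a stand-in "unlearned" teacher network.
teacher = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
student = partially_noise(teacher, alpha=0.5)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
print(distill_step(teacher, student, opt, torch.randn(4, 16)))
```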

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Lagrangian-based Equilibrium Propagation: generalisation to arbitrary boundary conditions & equivalence with Hamiltonian Echo Learning

Guillaume Pourcel, Debabrota Basu, Maxence Ernoult, Aditya Gilra
arxiv.org/abs/2506.06248 arxiv.org/pdf/2506.06248 arxiv.org/html/2506.06248

arXiv:2506.06248v1 Announce Type: new
Abstract: Equilibrium Propagation (EP) is a learning algorithm for training Energy-based Models (EBMs) on static inputs which leverages the variational description of their fixed points. Extending EP to time-varying inputs is a challenging problem, as the variational description must apply to the entire system trajectory rather than just fixed points, and careful consideration of boundary conditions becomes essential. In this work, we present Generalized Lagrangian Equilibrium Propagation (GLEP), which extends the variational formulation of EP to time-varying inputs. We demonstrate that GLEP yields different learning algorithms depending on the boundary conditions of the system, many of which are impractical for implementation. We then show that Hamiltonian Echo Learning (HEL) -- which includes the recently proposed Recurrent HEL (RHEL) and the earlier known Hamiltonian Echo Backpropagation (HEB) algorithms -- can be derived as a special case of GLEP. Notably, HEL is the only instance of GLEP we found that inherits the properties that make EP a desirable alternative to backpropagation for hardware implementations: it operates in a "forward-only" manner (i.e. using the same system for both inference and learning), it scales efficiently (requiring only two or more passes through the system regardless of model size), and enables local learning.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Neural Responses to Affective Sentences Reveal Signatures of Depression

Aditya Kommineni, Woojae Jeong, Kleanthis Avramidis, Colin McDaniel, Myzelle Hughes, Thomas McGee, Elsi Kaiser, Kristina Lerman, Idan A. Blank, Dani Byrd, Assal Habibi, B. Rael Cahn, Sudarsana Kadiri, Takfarinas Medani, Richard M. Leahy, Shrikanth Narayanan
arxiv.org/abs/2506.06244 arxiv.org/pdf/2506.06244 arxiv.org/html/2506.06244

arXiv:2506.06244v1 Announce Type: new
Abstract: Major Depressive Disorder (MDD) is a highly prevalent mental health condition, and a deeper understanding of its neurocognitive foundations is essential for identifying how core functions such as emotional and self-referential processing are affected. We investigate how depression alters the temporal dynamics of emotional processing by measuring neural responses to self-referential affective sentences using surface electroencephalography (EEG) in healthy and depressed individuals. Our results reveal significant group-level differences in neural activity during sentence viewing, suggesting disrupted integration of emotional and self-referential information in depression. A deep learning model trained on these responses achieves an area under the receiver operating characteristic curve (AUC) of 0.707 in distinguishing healthy from depressed participants, and 0.624 in differentiating depressed subgroups with and without suicidal ideation. Spatial ablations highlight anterior electrodes associated with semantic and affective processing as key contributors. These findings suggest stable, stimulus-driven neural signatures of depression that may inform future diagnostic tools.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Towards an Explainable Comparison and Alignment of Feature Embeddings

Mohammad Jalali, Bahar Dibaei Nia, Farzan Farnia
arxiv.org/abs/2506.06231 arxiv.org/pdf/2506.06231 arxiv.org/html/2506.06231

arXiv:2506.06231v1 Announce Type: new
Abstract: While several feature embedding models have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classification-related downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between sample groups clustered within the embedding spaces. In this work, we propose the \emph{Spectral Pairwise Embedding Comparison (SPEC)} framework to compare embeddings and identify their differences in clustering a reference dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the difference kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters identified in one embedding are also captured in the other model. We provide numerical results demonstrating the SPEC's application to compare and align embeddings on large-scale datasets such as ImageNet and MS-COCO. The code is available at github.com/mjalali/embedding-c.
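A toy sketch of the difference-kernel idea: build a kernel matrix per embedding, eigendecompose K_A - K_B, and use the eigenvectors with the largest-magnitude eigenvalues to flag samples clustered by one embedding but not the other. The RBF kernel and its bandwidth are assumptions; the paper's linear-complexity implementation and alignment step are not reproduced.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def spec_differences(emb_a, emb_b, top_k=2):
    """Eigenvectors of K_A - K_B with largest |eigenvalue| highlight mismatched clusters."""
    diff = rbf_kernel(emb_a) - rbf_kernel(emb_b)
    vals, vecs = np.linalg.eigh(diff)
    order = np.argsort(-np.abs(vals))[:top_k]
    return vals[order], vecs[:, order]

# Toy data: embedding A separates two groups, embedding B collapses them.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
emb_a = rng.normal(size=(100, 8)) + 3.0 * labels[:, None]   # clustered
emb_b = rng.normal(size=(100, 8))                           # unclustered
vals, vecs = spec_differences(emb_a, emb_b)
print(vals)  # large-magnitude eigenvalues flag clusters captured only by embedding A
```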

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Corrector Sampling in Language Models

Itai Gat, Neta Shaul, Uriel Singer, Yaron Lipman
arxiv.org/abs/2506.06215 arxiv.org/pdf/2506.06215 arxiv.org/html/2506.06215

arXiv:2506.06215v1 Announce Type: new
Abstract: Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. This method can be integrated into existing autoregressive models, preserving their next-token-prediction quality and speed. Fine-tuning a pretrained 8B-parameter model with RPT for only 100B tokens resulted in ~10% relative improvements on reasoning and coding benchmarks compared to standard sampling.
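A hedged sketch of the sampling loop described above: after each next-token step, revisit a small window of previously generated tokens and resample each one. The stand-in model, the prefix-only conditioning in logits_at, the window size, and the always-accept rule are illustrative assumptions (the paper fine-tunes the model so that revisited tokens can be resampled properly).

```python
import torch

def logits_at(model, tokens, pos):
    """Stand-in: condition on the prefix before `pos` only; the fine-tuning that makes
    resampling well-founded in the paper is omitted here."""
    ids = torch.tensor(tokens[:pos]).unsqueeze(0)
    return model(ids)[0, -1]

def rpt_generate(model, prompt_ids, max_new_tokens=8, window=4):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Standard next-token step.
        next_logits = logits_at(model, tokens, pos=len(tokens))
        tokens.append(torch.multinomial(torch.softmax(next_logits, -1), 1).item())
        # Revisit a window of previously generated tokens and resample each one.
        start = max(len(prompt_ids), len(tokens) - 1 - window)
        for pos in range(start, len(tokens) - 1):
            logits = logits_at(model, tokens, pos=pos)
            tokens[pos] = torch.multinomial(torch.softmax(logits, -1), 1).item()
    return tokens

# Toy usage with a random-logits "model" over a 100-token vocabulary.
toy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 100)
print(rpt_generate(toy_model, prompt_ids=[1, 2, 3]))
```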

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Model-Driven Graph Contrastive Learning

Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
arxiv.org/abs/2506.06212 arxiv.org/pdf/2506.06212 arxiv.org/html/2506.06212

arXiv:2506.06212v1 Announce Type: new
Abstract: We propose $\textbf{MGCL}$, a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data's underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has demonstrated strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and operate at the individual graph level, ignoring similarities among graphs generated from the same model. Conversely, in our proposed approach, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. Additionally, for graph-level tasks, MGCL clusters the dataset and estimates a graphon per group, enabling contrastive pairs to reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.
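A simplified sketch of graphon-informed augmentation: estimate edge probabilities from an observed graph with a sort-and-smooth block-averaging estimator, then sample augmented views from those probabilities. The estimator, block count, and toy graph are assumptions, not MGCL's exact procedure (which also clusters the dataset and fits one graphon per group for graph-level tasks).

```python
import numpy as np

def estimate_graphon(adj, n_blocks=4):
    """Sort nodes by degree, then block-average the adjacency into an edge-probability matrix."""
    order = np.argsort(-adj.sum(1))
    A = adj[np.ix_(order, order)]
    bounds = np.linspace(0, A.shape[0], n_blocks + 1).astype(int)
    probs = np.zeros(A.shape)
    for i in range(n_blocks):
        for j in range(n_blocks):
            blk = A[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]]
            probs[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]] = blk.mean()
    return probs

def graphon_augment(probs, seed=None):
    """Sample a new symmetric graph (an augmented view) from the estimated edge probabilities."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(int)

# Toy usage: two graphon-informed views of a small random graph as a contrastive pair.
rng = np.random.default_rng(0)
adj = np.triu((rng.random((20, 20)) < 0.3).astype(int), 1)
adj = adj + adj.T
probs = estimate_graphon(adj)
view1, view2 = graphon_augment(probs, seed=1), graphon_augment(probs, seed=2)
print(view1.sum() // 2, view2.sum() // 2)  # edge counts of the two augmented views
```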

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

How to craft a deep reinforcement learning policy for wind farm flow control

Elie Kadoche, Pascal Bianchi, Florence Carton, Philippe Ciblat, Damien Ernst
arxiv.org/abs/2506.06204 arxiv.org/pdf/2506.06204 arxiv.org/html/2506.06204

arXiv:2506.06204v1 Announce Type: new
Abstract: Within wind farms, wake effects between turbines can significantly reduce overall energy production. Wind farm flow control encompasses methods designed to mitigate these effects through coordinated turbine control. Wake steering, for example, consists in intentionally misaligning certain turbines with the wind to optimize airflow and increase power output. However, designing a robust wake steering controller remains challenging, and existing machine learning approaches are limited to quasi-static wind conditions or small wind farms. This work presents a new deep reinforcement learning methodology to develop a wake steering policy that overcomes these limitations. Our approach introduces a novel architecture that combines graph attention networks and multi-head self-attention blocks, alongside a novel reward function and training strategy. The resulting model computes the yaw angles of each turbine, optimizing energy production in time-varying wind conditions. An empirical study conducted on a steady-state, low-fidelity simulation shows that our model requires approximately 10 times fewer training steps than a fully connected neural network and achieves more robust performance compared to a strong optimization baseline, increasing energy production by up to 14%. To the best of our knowledge, this is the first deep reinforcement learning-based wake steering controller to generalize effectively across any time-varying wind conditions in a low-fidelity, steady-state numerical simulation setting.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Transformative or Conservative? Conservation laws for ResNets and Transformers

Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré
arxiv.org/abs/2506.06194 arxiv.org/pdf/2506.06194 arxiv.org/html/2506.06194

arXiv:2506.06194v1 Announce Type: new
Abstract: While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. We then introduce the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).
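A small numeric illustration of the kind of conservation law the abstract builds on, using the simplest case it mentions (a shallow linear network): under gradient flow on a loss of W2 W1, the matrix W1 W1^T - W2^T W2 is invariant, and small-step gradient descent approximately preserves it. This classical example is a warm-up under that assumption, not the paper's new ResNet/Transformer laws.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(5, 3, requires_grad=True)
W2 = torch.randn(2, 5, requires_grad=True)
X, Y = torch.randn(3, 64), torch.randn(2, 64)

def invariant():
    # Conserved under gradient flow for the two-layer linear network x -> W2 W1 x.
    return (W1 @ W1.T - W2.T @ W2).detach().clone()

before = invariant()
lr = 1e-3
for _ in range(2000):
    loss = ((W2 @ W1 @ X - Y) ** 2).mean()
    g1, g2 = torch.autograd.grad(loss, (W1, W2))
    with torch.no_grad():
        W1 -= lr * g1
        W2 -= lr * g2
print((invariant() - before).abs().max())  # drift stays small for small step sizes
```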

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

ICU-TSB: A Benchmark for Temporal Patient Representation Learning for Unsupervised Stratification into Patient Cohorts

Dimitrios Proios, Alban Bornet, Anthony Yazdani, Jose F Rodrigues Jr, Douglas Teodoro
arxiv.org/abs/2506.06192 arxiv.org/pdf/2506.06192 arxiv.org/html/2506.06192

arXiv:2506.06192v1 Announce Type: new
Abstract: Patient stratification, the identification of clinically meaningful subgroups, is essential for advancing personalized medicine through improved diagnostics and treatment strategies. Electronic health records (EHRs), particularly those from intensive care units (ICUs), contain rich temporal clinical data that can be leveraged for this purpose. In this work, we introduce ICU-TSB (Temporal Stratification Benchmark), the first comprehensive benchmark for evaluating patient stratification based on temporal patient representation learning using three publicly available ICU EHR datasets. A key contribution of our benchmark is a novel hierarchical evaluation framework utilizing disease taxonomies to measure the alignment of discovered clusters with clinically validated disease groupings. In our experiments with ICU-TSB, we compared statistical methods and several recurrent neural networks, including LSTM and GRU, for their ability to generate effective patient representations for subsequent clustering of patient trajectories. Our results demonstrate that temporal representation learning can rediscover clinically meaningful patient cohorts; nevertheless, it remains a challenging task, with V-measure varying from up to 0.46 at the top level of the taxonomy to up to 0.40 at the lowest level. To further enhance the practical utility of our findings, we also evaluate multiple strategies for assigning interpretable labels to the identified clusters. The experiments and benchmark are fully reproducible and available at github.com/ds4dh/CBMS2025strat.
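A minimal sketch of the hierarchical evaluation idea: cluster learned patient representations, then score the clusters with V-measure against ground-truth groupings at each level of a disease taxonomy. The synthetic representations and the two-level toy taxonomy below are placeholders, not ICU-TSB data or its exact protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
# Placeholder "patient representations": 200 patients, 32-d vectors, 4 true fine-grained cohorts.
fine_labels = rng.integers(0, 4, size=200)      # lowest taxonomy level
coarse_labels = fine_labels // 2                # top taxonomy level (2 broader groups)
reps = rng.normal(size=(200, 32)) + 1.5 * fine_labels[:, None]

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reps)
for name, labels in [("top level", coarse_labels), ("lowest level", fine_labels)]:
    print(name, round(v_measure_score(labels, clusters), 3))
```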

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Physics-Informed Neural Networks for Control of Single-Phase Flow Systems Governed by Partial Differential Equations

Luis Kin Miyatake, Eduardo Camponogara, Eric Aislan Antonelo, Alexey Pavlov
arxiv.org/abs/2506.06188 arxiv.org/pdf/2506.06188 arxiv.org/html/2506.06188

arXiv:2506.06188v1 Announce Type: new
Abstract: The modeling and control of single-phase flow systems governed by Partial Differential Equations (PDEs) present challenges, especially under transient conditions. In this work, we extend the Physics-Informed Neural Nets for Control (PINC) framework, originally proposed for the modeling and control of systems governed by Ordinary Differential Equations (ODEs) without the need for any labeled data, to the PDE case, particularly to single-phase incompressible and compressible flows, integrating neural networks with physical conservation laws. The PINC model for PDEs is structured into two stages: a steady-state network, which learns equilibrium solutions for a wide range of control inputs, and a transient network, which captures dynamic responses under time-varying boundary conditions. We propose a simplifying assumption that reduces the dimensionality of the spatial coordinate regarding the initial condition, allowing the efficient training of the PINC network. This simplification enables the derivation of optimal control policies using Model Predictive Control (MPC). We validate our approach through numerical experiments, demonstrating that the PINC model, which is trained exclusively using physical laws, i.e., without labeled data, accurately represents flow dynamics and enables real-time control applications. The results highlight the PINC's capability to efficiently approximate PDE solutions without requiring iterative solvers, making it a promising alternative for fluid flow monitoring and optimization in engineering applications.
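A generic sketch of the physics-informed training signal such a model relies on: a network taking (t, x, control) is penalised on a PDE residual computed with autograd, with no labeled data. The 1-D advection equation, network size, and sampling scheme here are illustrative assumptions, not the paper's single-phase flow equations or its two-stage PINC architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def physics_loss(n_points=256):
    """Residual of a toy 1-D advection equation p_t + u * p_x = 0; no labeled data used."""
    t = torch.rand(n_points, 1, requires_grad=True)
    x = torch.rand(n_points, 1, requires_grad=True)
    u = torch.rand(n_points, 1)                       # control input, sampled per collocation point
    p = net(torch.cat([t, x, u], dim=1))
    p_t = torch.autograd.grad(p.sum(), t, create_graph=True)[0]
    p_x = torch.autograd.grad(p.sum(), x, create_graph=True)[0]
    return ((p_t + u * p_x) ** 2).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = physics_loss()
    loss.backward()
    opt.step()
print(loss.item())
```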

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Antithetic Noise in Diffusion Models

Jing Jia, Sifan Liu, Bowen Song, Wei Yuan, Liyue Shen, Guanyang Wang
arxiv.org/abs/2506.06185 arxiv.org/pdf/2506.06185 arxiv.org/html/2506.06185

arXiv:2506.06185v1 Announce Type: new
Abstract: We initiate a systematic study of antithetic initial noise in diffusion models. Across unconditional models trained on diverse datasets, text-conditioned latent-diffusion models, and diffusion-posterior samplers, we find that pairing each initial noise with its negation consistently yields strongly negatively correlated samples. To explain this phenomenon, we combine experiments and theoretical analysis, leading to a symmetry conjecture that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), and provide evidence supporting it. Leveraging this negative correlation, we enable two applications: (1) enhancing image diversity in models like Stable Diffusion without quality loss, and (2) sharpening uncertainty quantification (e.g., up to 90% narrower confidence intervals) when estimating downstream statistics. Building on these gains, we extend the two-point pairing to a randomized quasi-Monte Carlo estimator, which further improves estimation accuracy. Our framework is training-free, model-agnostic, and adds no runtime overhead.
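A toy illustration of antithetic pairing: evaluate a downstream statistic on pairs (z, -z) instead of independent draws; when the mapping is close to odd-symmetric up to a shift, the odd part cancels within each pair and the paired estimator has much lower variance. The scalar stand-in statistic below is an assumption, not a diffusion model.

```python
import numpy as np

def statistic(z):
    # Stand-in for "run the sampler from noise z and measure a downstream statistic":
    # an odd part (tanh) plus a small even part (z**2), so the symmetry is only approximate.
    return np.tanh(z).mean(axis=-1) + 0.1 * (z ** 2).mean(axis=-1)

rng = np.random.default_rng(0)
n, d = 2000, 16
z = rng.standard_normal((n, d))

iid_pairs = 0.5 * (statistic(z[: n // 2]) + statistic(z[n // 2:]))           # independent draws
antithetic_pairs = 0.5 * (statistic(z[: n // 2]) + statistic(-z[: n // 2]))  # paired with negation

print("variance, independent pairs:", iid_pairs.var())
print("variance, antithetic pairs: ", antithetic_pairs.var())  # markedly smaller
```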

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization

Muhammed Ustaomeroglu, Guannan Qu
arxiv.org/abs/2506.06179 arxiv.org/pdf/2506.06179 arxiv.org/html/2506.06179

arXiv:2506.06179v1 Announce Type: new
Abstract: Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of interacting entities, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single linear self-attention layer can efficiently represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a mutual interaction learner under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Building on our theories, we introduce HyperFeatureAttention, a novel neural network module designed to learn couplings of different feature-level interactions between entities. Furthermore, we propose HyperAttention, a new module that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general n-way interactions.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Reusing Trajectories in Policy Gradients Enables Fast Convergence

Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli
arxiv.org/abs/2506.06178 arxiv.org/pdf/2506.06178 arxiv.org/html/2506.06178

arXiv:2506.06178v1 Announce Type: new
Abstract: Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, such reliance on fresh data makes them sample-inefficient. Indeed, vanilla PG methods require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(\epsilon^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature. We further empirically validate our approach against PG methods with state-of-the-art rates.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

The Lock-in Hypothesis: Stagnation by Algorithm

Tianyi Alex Qiu, Zhonghao He, Tejasveer Chugh, Max Kleiman-Weiner
arxiv.org/abs/2506.06166 arxiv.org/pdf/2506.06166 arxiv.org/html/2506.06166

arXiv:2506.06166v1 Announce Type: new
Abstract: The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at thelockinhypothesis.com

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Armand Kassa\"i Koupa\"i, Lise Le Boudec, Louis Serrano, Patrick Gallinari
arxiv.org/abs/2506.06158 arxiv.org/pdf/2506.06158 arxiv.org/html/2506.06158

arXiv:2506.06158v1 Announce Type: new
Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete, as is often the case, a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks

Carolin Benjamins, Helena Graf, Sarah Segel, Difan Deng, Tim Ruhkopf, Leona Hennig, Soham Basu, Neeratyoy Mallik, Edward Bergman, Deyao Chen, François Clément, Matthias Feurer, Katharina Eggensperger, Frank Hutter, Carola Doerr, Marius Lindauer
arxiv.org/abs/2506.06143 arxiv.org/pdf/2506.06143 arxiv.org/html/2506.06143

arXiv:2506.06143v1 Announce Type: new
Abstract: Hyperparameter Optimization (HPO) is crucial for developing well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies that allows evaluating N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important types of HPO tasks: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3,336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (github.com/automl/CARP-S), we make an important step in the standardization of HPO evaluation.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models

Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
arxiv.org/abs/2506.06137 arxiv.org/pdf/2506.06137 arxiv.org/html/2506.06137

arXiv:2506.06137v1 Announce Type: new
Abstract: Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.

#toXiv_bot_toot #toXiv_bot_new_article_toot

2025-06-09

Gradient Similarity Surgery in Multi-Task Deep Learning

Thomas Borsani, Andrea Rosani, Giuseppe Nicosia, Giuseppe Di Fatta
arxiv.org/abs/2506.06130 arxiv.org/pdf/2506.06130 arxiv.org/html/2506.06130

arXiv:2506.06130v1 Announce Type: new
Abstract: The multi-task learning ($MTL$) paradigm aims to simultaneously learn multiple tasks within a single model capturing higher-level, more general hidden patterns that are shared by the tasks. In deep learning, a significant challenge in the backpropagation training process is the design of advanced optimisers to improve the convergence speed and stability of the gradient descent learning rule. In particular, in multi-task deep learning ($MTDL$), the multitude of tasks may generate potentially conflicting gradients that would hinder the concurrent convergence of the diverse loss functions. This challenge arises when the gradients of the task objectives have either different magnitudes or opposite directions, causing one or a few to dominate or to interfere with each other, thus degrading the training process. Gradient surgery methods address the problem by explicitly dealing with conflicting gradients, adjusting the overall gradient trajectory. This work introduces a novel gradient surgery method, the Similarity-Aware Momentum Gradient Surgery (SAM-GS), which provides an effective and scalable approach based on a gradient magnitude similarity measure to guide the optimisation process. The SAM-GS surgery adopts gradient equalisation and modulation of the first-order momentum. A series of experimental tests have shown the effectiveness of SAM-GS on synthetic problems and $MTL$ benchmarks. Gradient magnitude similarity plays a crucial role in regularising gradient aggregation in $MTDL$ for the optimisation of the learning process.
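A hedged sketch of gradient-magnitude-similarity surgery in the spirit of the abstract: measure how similar two task gradients are in magnitude and, when similarity is low, equalise their norms before aggregating so neither task dominates. The similarity formula, threshold, and simple averaging are assumptions; the paper's modulation of the first-order momentum is not reproduced.

```python
import torch

def magnitude_similarity(g1, g2, eps=1e-12):
    """Similarity of gradient magnitudes, in [0, 1]; 1 means equal norms."""
    n1, n2 = g1.norm(), g2.norm()
    return 2 * n1 * n2 / (n1 ** 2 + n2 ** 2 + eps)

def surgery(g1, g2, sim_threshold=0.75):
    """If magnitudes are dissimilar, rescale both gradients to their mean norm before summing."""
    if magnitude_similarity(g1, g2) < sim_threshold:
        target = 0.5 * (g1.norm() + g2.norm())
        g1 = g1 * target / (g1.norm() + 1e-12)
        g2 = g2 * target / (g2.norm() + 1e-12)
    return g1 + g2

# Toy usage: one task gradient dominates the other by magnitude.
g_a, g_b = torch.randn(10), 50.0 * torch.randn(10)
print(float(magnitude_similarity(g_a, g_b)))  # low similarity -> norms get equalised
print(float(surgery(g_a, g_b).norm()))
```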

#toXiv_bot_toot #toXiv_bot_new_article_toot
