Mar 06, 2025; Lecture
Living Lab Lecture Series
Living Lab Lecture No. 30: Compressing Large Language Models
Large Language Models (LLMs) mark a new era in Artificial Intelligence. However, their sheer size makes inference in real-world applications challenging, owing to substantial GPU memory requirements and high latency.
In this talk, we discuss techniques to compress pre-trained LLMs, reducing their resource consumption during inference while maintaining their performance. More specifically, we approach the problem from a multi-objective Neural Architecture Search (NAS) perspective to jointly optimize performance and efficiency.
By considering the LLM as a super-network consisting of a large but finite number of sub-networks, we can identify a set of Pareto-optimal sub-networks that balance parameter count and validation performance. We empirically demonstrate that using NAS techniques for fine-tuning enhances the prunability of pre-trained LLMs and explore how this impacts real-world applications.
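To make the super-network view concrete, the sketch below (not the speakers' implementation; all names and numbers are hypothetical) shows how a set of evaluated sub-networks can be filtered down to the Pareto-optimal ones, i.e. those where no other sub-network has both fewer parameters and better validation loss.

```python
# Minimal sketch: selecting Pareto-optimal sub-networks of a super-network
# by two objectives, parameter count and validation loss.
# `candidates` is a hypothetical list of (config_name, num_params, val_loss)
# tuples obtained by evaluating sampled sub-networks of a fine-tuned LLM.

from typing import Any, List, Tuple


def pareto_front(
    candidates: List[Tuple[Any, int, float]]
) -> List[Tuple[Any, int, float]]:
    """Return candidates not dominated in both parameter count and validation loss."""
    front = []
    for cfg, params, loss in candidates:
        dominated = any(
            other_params <= params
            and other_loss <= loss
            and (other_params < params or other_loss < loss)
            for _, other_params, other_loss in candidates
        )
        if not dominated:
            front.append((cfg, params, loss))
    # Sort by size so the trade-off curve reads from smallest to largest model.
    return sorted(front, key=lambda c: c[1])


# Example with made-up numbers: sub-networks of a 7B-parameter model.
candidates = [
    ("prune_50pct",     3_500_000_000, 2.31),
    ("prune_25pct",     5_250_000_000, 2.05),
    ("prune_25pct_alt", 5_500_000_000, 2.10),  # dominated by prune_25pct
    ("full_model",      7_000_000_000, 2.01),
]
for cfg, params, loss in pareto_front(candidates):
    print(f"{cfg}: {params / 1e9:.2f}B params, val loss {loss:.2f}")
```

Each point on the resulting front represents a different trade-off between model size and quality, so a practitioner can pick the sub-network that fits a given memory or latency budget without re-running the search.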