One challenge organizations face when customizing large language models (LLMs) is the need to run multiple experiments, of which only one produces the model that is ultimately used. While the cost of experimentation is typically low, and the results well worth the effort, this process does involve “wasted” resources, such as compute spent on models that are never used, dedicated developer time, and more.
Model merging combines the weights of multiple customized LLMs, increasing resource utilization and adding value to successful models. This approach provides two key solutions:
- Reduces experimentation waste by repurposing “failed experiments”
- Offers a cost-effective alternative to joint training
This post explores how models are customized, how model merging works, different types of model merging, and how model merging is iterating and evolving.
Revisiting model customization
This section provides a brief overview of how models are customized and how this process can be leveraged to help build an intuitive understanding of model merging.
Note that some of the concepts discussed are oversimplified for the purpose of building this intuitive understanding of model merging. It is suggested that you familiarize yourself with customization techniques, transformer architecture, and training separately before diving into model merging. See, for example, Mastering LLM Techniques: Customization.
The role of weight matrices in models
Weight matrices are essential components in many popular model architectures, serving as large grids of numbers (weights, or parameters) that store the information necessary for the model to make predictions.
As data flows through a model, it passes through multiple layers, each containing its own weight matrix. These matrices transform the input data through mathematical operations, enabling the model to learn from and adapt to the data.
To modify a model’s behavior, the weights within these matrices must be updated. Although the specifics of weight modification are not essential, it’s crucial to understand that each customization of a base model results in a unique set of updated weights.
Task customization
When fine-tuning an LLM for a specific task, such as summarization or math, the updates made to the weight matrices are targeted towards improving performance on that particular task. This implies that the modifications to the weight matrices are localized to specific regions, rather than being uniformly distributed.
To illustrate this concept, consider a simple analogy where the weight matrices are represented as a sports field that is 100 yards in length. When customizing the model for summarization, the updates to the weight matrices might concentrate on specific areas, such as the 10-to-30 yard lines. In contrast, customizing the model for math might focus updates on a different region, like the 70-to-80 yard lines.
Interestingly, when customizing the model for a related task, such as summarization in the French language, the updates might overlap with the original summarization task, affecting the same regions of the weight matrices (the 25-to-35 yard lines, for example). This overlap suggests an important insight: different task customizations can significantly impact the same areas of the weight matrices.
While the previous example is purposefully oversimplified, the intuition is accurate. Different task customizations will lead to different parts of the weight matrices being updated, and customization for similar tasks might lead to changing the same parts of their respective weight matrices.
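The sports field analogy can be made concrete with a toy sketch. Everything here is illustrative: the 100-entry array stands in for the weight matrices, and `localized_update` and `overlap` are hypothetical helper names, not part of any library.

```python
import numpy as np

FIELD_LENGTH = 100  # one entry per "yard" of the field


def localized_update(start, end, size=FIELD_LENGTH):
    """Return a toy update that only touches weights in [start, end)."""
    delta = np.zeros(size)
    delta[start:end] = 1.0
    return delta


def overlap(delta_a, delta_b):
    """Count the positions that both updates modify."""
    return int(np.sum((delta_a != 0) & (delta_b != 0)))


summarization = localized_update(10, 30)
math_task = localized_update(70, 80)
french_summarization = localized_update(25, 35)

print(overlap(summarization, math_task))             # unrelated tasks
print(overlap(summarization, french_summarization))  # related tasks
```

In this toy setup, the unrelated tasks touch disjoint regions, while the two summarization variants share the 25-to-30 yard stretch.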
This understanding can inform strategies for customizing LLMs and leveraging knowledge across tasks.
Model merging
Model merging is a loose grouping of strategies that relates to combining two or more models, or model updates, into a single model for the purpose of saving resources or improving task-specific performance.
This discussion focuses primarily on the implementation of these techniques through an open-source library developed by Arcee AI called mergekit. This library simplifies the implementation of various merging strategies.
Many methods are used to merge models, at varying levels of complexity. Here, we’ll focus on four main merging methods:
- Model Soup
- Spherical Linear Interpolation (SLERP)
- Task Arithmetic (using Task Vectors)
- TIES leveraging DARE
Model Soup
The Model Soup method involves averaging the resultant model weights created by hyperparameter optimization experiments, as explained in Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time.
Originally tested and verified through computer vision models, this method has shown promising results for LLMs as well. In addition to generating some additional value out of the experiments, this process is simple and not compute intensive.
There are two ways to create Model Soup: naive and greedy. The naive approach involves merging all models sequentially, regardless of their individual performance. In contrast, the greedy implementation follows a simple algorithm:
- Rank models by performance on the desired task
- Merge the best performing model with the second best performing model
- Evaluate the merged model’s performance on the desired task
- If the merged model performs better, continue with the next model; otherwise, skip the current model and try again with the next best model
Because a model is kept only when the merged result improves, this greedy approach ensures that the resulting Model Soup is at least as good as the best individual model on the data used for evaluation.
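The greedy procedure can be sketched roughly as follows. This is a minimal illustration rather than the mergekit implementation: `greedy_soup`, `average`, and the `evaluate` callback are hypothetical names, and each model is assumed to be a single NumPy array of weights.

```python
import numpy as np


def average(models):
    """Uniformly average a list of same-shaped weight arrays."""
    return np.mean(np.stack(models), axis=0)


def greedy_soup(models, evaluate):
    """Greedy Model Soup sketch.

    models:   list of same-shaped weight arrays (hypothetical stand-ins)
    evaluate: callable that scores a weight array (higher is better)
    """
    # Rank candidates from best to worst on the desired task.
    ranked = sorted(models, key=evaluate, reverse=True)

    soup = [ranked[0]]
    best_score = evaluate(ranked[0])

    for candidate in ranked[1:]:
        trial_score = evaluate(average(soup + [candidate]))
        # Keep the candidate only if the merged model performs better.
        if trial_score > best_score:
            soup.append(candidate)
            best_score = trial_score

    return average(soup), best_score
```

Because the soup starts from the best individual model and only grows when the score improves, the returned score can never fall below that model's score on the same evaluation.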
Each step of creating a Model Soup is implemented by simple weighted and normalized linear averaging of two or more model weights. Both the weighting and normalization are optional, though recommended. The implementation of this from the mergekit library is as follows:
res = (weights * tensors).sum(dim=0)
if self.normalize:
    res = res / weights.sum(dim=0)
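For readers who want to run the idea outside of mergekit, here is a self-contained NumPy sketch of the same weighted average. The function name `linear_merge` and its signature are assumptions made for illustration, not mergekit's API.

```python
import numpy as np


def linear_merge(tensors, weights, normalize=True):
    """Weighted linear average of same-shaped tensors (illustrative sketch)."""
    stacked = np.stack([np.asarray(t, dtype=float) for t in tensors])
    w = np.asarray(weights, dtype=float)
    # Reshape so each weight broadcasts across its tensor's dimensions.
    w = w.reshape((-1,) + (1,) * (stacked.ndim - 1))
    merged = (w * stacked).sum(axis=0)
    if normalize:
        # Divide by the total weight so the result stays on the input scale.
        merged = merged / w.sum(axis=0)
    return merged


a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[3.0, 4.0], [5.0, 6.0]])
print(linear_merge([a, b], weights=[1.0, 3.0]))
```

Without normalization, the same call returns `a + 3 * b`, which is why normalizing is recommended when the weights do not already sum to 1.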
While this method has shown promising results in the computer vision and language domains, it faces some serious limitations. Specifically, there is no guarantee that the merged model will be more performant than its inputs. The linear averaging can lead to degraded performance or loss of generalizability.
The next method, SLERP, addresses some of those specific concerns.
SLERP
Spherical Linear Interpolation, or SLERP, is a method introduced in a 1985 paper titled Animating Rotation with Quaternion Curves. It’s a “smarter” way of computing the average between two vectors. In a technical sense, it helps compute the shortest path between two points on a curved surface.
This method excels at combining two models. The classic example is imagining the shortest path between two points on the Earth. Technically, the shortest path would be a straight line that goes through the Earth, but in reality it’s a curved path on the surface of the Earth. SLERP computes this smooth path to use for averaging two models together while maintaining their unique model weight “surfaces.”
The following code snippet is the core of the SLERP algorithm, and is what provides such a good interpolation between the two models:
# Calculate initial angle between v0 and v1
theta_0 = np.arccos(dot)
sin_theta_0 = np.sin(theta_0)
# Angle at timestep t
theta_t = theta_0 * t
sin_theta_t = np.sin(theta_t)
# Finish the slerp algorithm
s0 = np.sin(theta_0 - theta_t) / sin_theta_0
s1 = sin_theta_t / sin_theta_0
res = s0 * v0_copy + s1 * v1_copy
return maybe_torch(res, is_torch)
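The excerpt above relies on variables defined elsewhere in the library. A self-contained NumPy version might look like the following sketch. The function name, the `eps` guard, and the fallback to linear interpolation for nearly parallel vectors are assumptions of this example, not a copy of mergekit's code.

```python
import numpy as np


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0 and t=1 returns v1.
    """
    v0 = np.asarray(v0, dtype=float)
    v1 = np.asarray(v1, dtype=float)

    # Measure the angle using unit-length copies of the inputs.
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.sum(u0 * u1), -1.0, 1.0)

    # Nearly parallel vectors: sin(theta) is close to 0, so interpolate linearly.
    if abs(dot) > 1.0 - 1e-6:
        return (1.0 - t) * v0 + t * v1

    theta_0 = np.arccos(dot)   # angle between v0 and v1
    theta_t = theta_0 * t      # angle at interpolation point t
    s0 = np.sin(theta_0 - theta_t) / np.sin(theta_0)
    s1 = np.sin(theta_t) / np.sin(theta_0)
    return s0 * v0 + s1 * v1


midpoint = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(midpoint, np.linalg.norm(midpoint))
```

For these two unit vectors, the SLERP midpoint keeps unit length, whereas the plain average `[0.5, 0.5]` is shorter than either input. This is the sense in which SLERP follows the curved surface rather than cutting through it.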
Task Arithmetic (using Task Vectors)
This group of model merging methods utilizes Task Vectors to combine models in various ways, increasing in complexity.
Task Vectors: Capturing customization updates
Recalling how models are customized, updates are made to the model’s weights, and those updates are captured in the base model matrices. Instead of considering the final matrices as a brand new model, they can be viewed as the difference (or delta) between the base weights and the customized weights. This introduces the concept of a task vector, a structure containing the delta between the base and customized weights.
This is the same intuition behind Low Rank Adaptation (LoRA), but without the further step of factoring the matrices representing the weight updates.
Task Vectors can be obtained simply by subtracting the base model weights from the customized weights.
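As a minimal sketch of that subtraction, assume each model is represented as a dictionary mapping parameter names to NumPy arrays (similar in spirit to a state dict). The helper names and the `layer.weight` key are hypothetical.

```python
import numpy as np


def task_vector(base, customized):
    """Delta between customized and base weights, per parameter name."""
    return {name: customized[name] - base[name] for name in base}


def apply_task_vector(base, delta, scale=1.0):
    """Add a (optionally scaled) task vector back onto the base weights."""
    return {name: base[name] + scale * delta[name] for name in base}


base = {"layer.weight": np.array([1.0, 2.0, 3.0])}
customized = {"layer.weight": np.array([1.5, 2.0, 2.0])}

delta = task_vector(base, customized)
print(delta["layer.weight"])
```

Adding the task vector back onto the base weights with `scale=1.0` recovers the customized model, which is what makes task vectors convenient building blocks for merging.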
Task Interference: Conflicting updates
Recalling the sports field example, there is a potential for overlap in the updated weights between different customizations. Intuitively, customizations done for the same or similar tasks are more likely to produce conflicting updates than customizations done for two or more separate tasks.
This “conflicting update” idea is more formally defined as Task Interference and it relates to the potential collision of important updates between two, or more, Task Vectors.
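One simple way to get a feel for interference is to count how often two Task Vectors push the same weight in opposite directions. The metric below is an illustrative proxy chosen for this sketch, not a formal definition from the literature, and `sign_conflict_rate` is a hypothetical name.

```python
import numpy as np


def sign_conflict_rate(tv_a, tv_b):
    """Fraction of jointly updated weights whose updates disagree in sign."""
    a = np.asarray(tv_a, dtype=float)
    b = np.asarray(tv_b, dtype=float)

    both_updated = (a != 0) & (b != 0)
    if not both_updated.any():
        return 0.0  # no shared weights, so nothing can conflict

    conflicts = both_updated & (np.sign(a) != np.sign(b))
    return float(conflicts.sum() / both_updated.sum())


tv_a = [0.5, -0.2, 0.0, 0.3]
tv_b = [0.4, 0.1, 0.7, -0.3]
print(sign_conflict_rate(tv_a, tv_b))
```

Here three weights are updated by both Task Vectors, and two of those updates point in opposite directions.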
Task Arithmetic
As introduced in the paper Editing Models with Task Arithmetic, Task Arithmetic represents the simplest implementation of a task vector approach. The process is as follows:
- Obtain two or more task vectors and merge them linearly as seen in Model Soup.
- After the resultant merged task vector is obtained, it is added into the base model.
This process is simple and effective, but has a key weakness: no attention is paid to the potential interference between the task vectors intended to be merged.
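The two steps can be sketched as follows, assuming each model is a single NumPy array of weights. The function name and the optional per-task `weights` argument are assumptions of this example. By default every Task Vector is added with a coefficient of 1.

```python
import numpy as np


def task_arithmetic(base, customized_models, weights=None):
    """Merge task vectors linearly and add the result to the base weights."""
    base = np.asarray(base, dtype=float)

    # Step 1: obtain one task vector per customized model.
    deltas = np.stack([np.asarray(m, dtype=float) - base for m in customized_models])

    if weights is None:
        weights = np.ones(len(customized_models))
    w = np.asarray(weights, dtype=float).reshape((-1,) + (1,) * (deltas.ndim - 1))

    # Step 2: combine the task vectors linearly and add them to the base.
    merged_delta = (w * deltas).sum(axis=0)
    return base + merged_delta


base = np.array([1.0, 1.0, 1.0])
model_a = np.array([2.0, 1.0, 1.0])  # only changed the first weight
model_b = np.array([1.0, 1.0, 3.0])  # only changed the last weight
print(task_arithmetic(base, [model_a, model_b]))
```

When the Task Vectors touch different weights, as in this toy case, both customizations carry over cleanly. Nothing in the function checks for or handles the case where they collide.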
TIES-Merging
As introduced in the paper TIES-Merging: Resolving Interference When Merging Models, TIES (TrIm, Elect Sign, and Merge) is a method that takes the core ideas of Task Arithmetic and combines them with heuristics for resolving potential interference between the Task Vectors.
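The general procedure has three steps, applied to the Task Vectors being merged: trim each Task Vector by keeping only its largest-magnitude updates, elect a sign for each weight based on which direction carries the most total magnitude across the Task Vectors, and then average only the remaining updates that agree with the elected sign.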
This method seeks to resolve interference by enabling the models that had the most significant updates for any given weight to take precedence during the merging process. In essence, the models that “cared” more about that weight would be prioritized over the models that did not.
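A compact sketch of the trim, elect-sign, and merge steps is shown below. It assumes each model is a single NumPy array, uses a hypothetical `density` parameter for the fraction of updates kept during trimming, and omits the final scaling coefficient that implementations typically expose. It is meant to convey the idea, not to reproduce mergekit's implementation.

```python
import numpy as np


def ties_merge(base, customized_models, density=0.2):
    """TIES-style merge sketch: trim, elect sign, then merge."""
    base = np.asarray(base, dtype=float)
    deltas = np.stack([np.asarray(m, dtype=float) - base for m in customized_models])

    # 1. Trim: keep only the largest-magnitude updates of each task vector.
    trimmed = np.zeros_like(deltas)
    for i, delta in enumerate(deltas):
        k = max(1, int(round(density * delta.size)))
        threshold = np.sort(np.abs(delta).ravel())[-k]
        trimmed[i] = np.where(np.abs(delta) >= threshold, delta, 0.0)

    # 2. Elect sign: the direction with the larger total magnitude wins.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Merge: average only the nonzero updates that agree with that sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = (trimmed * agree).sum(axis=0) / counts

    return base + merged_delta


base = np.zeros(4)
model_a = np.array([1.0, -0.5, 0.1, 0.0])
model_b = np.array([0.8, 0.9, 0.0, -0.05])
print(ties_merge(base, [model_a, model_b], density=0.5))
```

In this toy example, the two models disagree on the second weight. The larger update (0.9) wins the sign election, so the conflicting -0.5 is excluded from the average instead of cancelling it out.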
DARE
Introduced in the paper Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch, DARE isn’t directly a model merging technique. Rather, it’s an augment that can be considered alongside other approaches. DARE derives from the following:
Drops delta parameters with a ratio p And REscales the remaining ones by 1/(1 – p) to approximate the original embeddings.
Instead of trying to address the problem of interference through heuristics, DARE approaches it from a different perspective. In essence, it randomly drops a large number of the updates found in a specific task vector by setting them to 0, and then rescales the remaining weights by 1/(1 – p), where p is the drop ratio.
DARE has been shown to be effective even when dropping upwards of 90%, or even 99%, of the task vector weights.
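The drop-and-rescale step is short enough to sketch directly. The function name, the `seed` argument, and the use of a single NumPy array for the task vector are assumptions made for this illustration.

```python
import numpy as np


def dare(delta, drop_rate=0.9, seed=0):
    """Randomly drop task vector entries and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    rng = np.random.default_rng(seed)
    delta = np.asarray(delta, dtype=float)

    # Each entry is kept independently with probability (1 - drop_rate).
    keep_mask = rng.random(delta.shape) >= drop_rate

    # Rescale the kept entries by 1 / (1 - p) to preserve the expected value.
    return delta * keep_mask / (1.0 - drop_rate)


delta = np.ones(100_000)
sparse_delta = dare(delta, drop_rate=0.9)
print(np.count_nonzero(sparse_delta) / delta.size, sparse_delta.mean())
```

Roughly 10% of the entries survive, yet the mean of the sparsified task vector stays close to the original mean of 1. That is the intuition for why the rescaling helps preserve the model's behavior. The sparsified task vectors can then be passed to a merging method such as Task Arithmetic or TIES.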
Increase model utility with model merging
The concept of model merging offers a practical way to maximize the utility of multiple LLMs, including task-specific fine-tuning done by a larger community. Through techniques like Model Soup, SLERP, Task Arithmetic, TIES-Merging, and DARE, organizations can effectively merge multiple models in the same family in order to reuse experimentation and cross-organizational efforts.
As the techniques behind model merging are better understood and further developed, they are poised to become a cornerstone of the development of performant LLMs. While this post has only scratched the surface, more techniques are constantly under development, including some evolution-based methods. Model merging is a budding field in the generative AI landscape, as more applications are being tested and proven.