Samhita Alla

Fine-Tuning Insights: Lessons from Experimenting with RedPajama Large Language Model on Flyte Slack Data

Large language models (LLMs) have taken the world by storm, revolutionizing our understanding and generation of human-like text. These models have showcased remarkable capabilities across a range of tasks, including question-answering, chatbots and even creative writing. Naturally, like many others, I was filled with excitement to explore and experience the potential of these models firsthand.

As an open-source contributor at Union, my aim was to enhance users’ ability to find solutions to their queries independently on Slack.

Building upon this goal, I had the Flyte Slack data on hand, and I set out to fine-tune an LLM using that data. My ultimate objective was to develop a Slack bot that could run within the Flyte Slack workspace.

Instead of relying on embeddings, which have already been extensively explored, I opted for (supervised) fine-tuning to examine its performance and potential benefits. This decision was driven by my desire not only to incorporate knowledge into the model but also to allow it to learn the subtleties and nuances of Slack messages. Fine-tuning appeared to be the better option for achieving this objective.

While fine-tuning itself proved to be a straightforward process, I was surprised to discover that the surrounding factors and implications turned out to be much more complex than I had anticipated …

Slack scraper

Goal: Create a scraper that extracts data from Slack channels and saves it on HuggingFace. The extracted data will be used later for fine-tuning an LLM.

Data is vital for fine-tuning LLMs — or any other model, for that matter. Therefore, the first crucial step is to extract the Slack data, generate a dataset from it, and store it in a suitable format.

I exported all the Slack data into a folder that contains one directory per Slack channel. Within each channel directory are neatly arranged JSON files, each of which holds the messages for a specific day.

[
   {
   	"client_msg_id": "530038bc-718a-4708-9d05-0e2811795ffe",
   	"type": "message",
   	"text": "Are you trying to deploy Flyte on EKS?",
   	"user": "U01J90KBSU9",
   	"ts": "1684221986.531759",
   	"blocks": [
       	{
           	"type": "rich_text",
           	"block_id": "79X",
           	"elements": [
               	{
                   	"type": "rich_text_section",
                   	"elements": [
                       	{
                           	"type": "text",
                           	"text": "Are you trying to deploy Flyte on EKS?"
                       	}
                   	]
               	}
           	]
       	}
   	],
   	"team": "TN89P6GGK",
   	"user_team": "TN89P6GGK",
   	"source_team": "TN89P6GGK",
   	"user_profile": {
       	"avatar_hash": "4f24b225c547",
       	"image_72": "https:\/\/avatars.slack-edge.com\/2023-04-28\/5184926466722_4f24b225c547392670bd_72.jpg",
       	"first_name": "Samhita",
       	"real_name": "Samhita Alla",
       	"display_name": "Samhita Alla",
       	"team": "TN89P6GGK",
       	"name": "aallasamhita",
       	"is_restricted": false,
       	"is_ultra_restricted": false
   	},
   	"thread_ts": "1684211605.360509",
   	"parent_user_id": "U057SAH5KA7"
   },
   ...
]

Sample JSON data

I proceeded to write a Flyte task that retrieves the names of the directories, or channels, within the parent directory.

@task(cache=True, cache_version="2.0")
def get_channel_dirs(parent_dir: str) -> List[str]:
    return glob(parent_dir + "/*/")

Next, I developed a task that extracts question-response pairs from the Slack data, which involves organizing the messages into coherent threads. Within each Slack thread, I extracted question-response pairs, ensuring that a question comes from one user and the response from a different user. In cases where the same user posted multiple consecutive messages, I combined them into a single question or response.

This approach allowed me to retrieve multiple question-response pairs from each Slack thread, which made more sense to me than treating the first message as the question and everything that followed as the response. Because multiple users can contribute to a single thread, it was important to capture the various contexts within the conversation.

Finally, the resulting question-response pairs were stored in a JSON file.

@task(cache=True, cache_version="2.0")
def question_response_pairs(channel_dir: str) -> Optional[FlyteFile]:
    threads = []
    thread_ts_list_index_pairs = {}
    sorted_list_of_files = sorted(glob(channel_dir + "*.json"), key=os.path.getctime)

    ...
    pairs = []
    for ts in threads:
        ...
        if input_messages and output_messages:
            pairs.append(
                {
                    "input": list(input_messages.values())[0],
                    "output": list(output_messages.values())[0],
                }
            )
    if pairs:
        json_file_name = os.path.join(
            flytekit.current_context().working_directory,
            f"flyte_slack_data_{Path(channel_dir).parts[1]}.json",
        )
        with open(json_file_name, "w") as f:
            json.dump(pairs, f)
        return FlyteFile(json_file_name)
    return None
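
The elided portion of the task above groups the messages into threads and merges consecutive messages from the same user before pairing them up. Here is a minimal sketch of that pairing logic, assuming each message dict exposes the `user` and `text` keys seen in the exported JSON (the helper name is illustrative, not the exact implementation):

def extract_pairs(thread_messages: List[dict]) -> List[dict]:
    # Merge consecutive messages from the same user into a single turn
    turns = []
    for message in thread_messages:
        if turns and turns[-1]["user"] == message["user"]:
            turns[-1]["text"] += "\n" + message["text"]
        else:
            turns.append({"user": message["user"], "text": message["text"]})

    # Each adjacent pair of turns (by construction, from different users)
    # becomes an input/output example; outputs double as inputs for later pairs
    return [
        {"input": question["text"], "output": response["text"]}
        for question, response in zip(turns, turns[1:])
    ]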

This task is applied to all the channels, and the resulting question-response pairs are stored in individual JSON files. Next, these files are merged into a comprehensive dataset, which is then saved on HuggingFace for further use and analysis.

def merge_json_files(json_files):
    json_result_file = os.path.join(
        flytekit.current_context().working_directory, "flyte_slack_data.json"
    )
    result = []
    for json_file in json_files:
        if json_file:
            with open(json_file, "r") as infile:
                result.extend(json.load(infile))
    with open(
        json_result_file,
        "w",
    ) as f:
        json.dump(result, f)
    return json_result_file


@task(
    secret_requests=[
        Secret(
            group=SECRET_GROUP, key=SECRET_NAME, mount_requirement=Secret.MountType.FILE
        )
    ],
)
def push_to_hub(json_files: List[Optional[FlyteFile]]):
    HF_TOKEN = flytekit.current_context().secrets.get(SECRET_GROUP, SECRET_NAME)
    dataset = load_dataset("json", data_files=merge_json_files(json_files))
    dataset.push_to_hub("unionai/flyte-slack-data", token=HF_TOKEN)

I have accumulated a substantial amount of data, totaling approximately 28.2k rows! You can analyze and use this dataset as needed.

The question-response pairs extracted are stored in HuggingFace Datasets.
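
If you'd like to poke around the data yourself, it can be pulled straight from the Hub with the `datasets` library (a minimal sketch):

from datasets import load_dataset

# Load the question-response pairs pushed in the previous step
dataset = load_dataset("unionai/flyte-slack-data")
print(dataset["train"][0])  # {'input': '...', 'output': '...'}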

At first glance, the large size of this dataset instilled a sense of excitement, raising expectations for the LLM to deliver impressive performance. However, there are a few gotchas to keep in mind:

  • Outputs as inputs: It's crucial to note that the outputs from previous question-response pairs are treated as inputs for subsequent pairs. This chaining of information can impact the model's understanding and generation of responses.
  • Terse responses: Responses on Slack can often be brief and lacking in detail. This brevity can pose a challenge for the LLM, as it may struggle to provide comprehensive or informative answers.
  • Inaccurate responses: Not every response in the dataset can be guaranteed to be an accurate answer to the corresponding question. Some responses may diverge from the intended answer or may not fully address the question's intent.
  • Non-question inputs: It's important to acknowledge that not all inputs labeled as questions within the dataset are necessarily genuine queries. Some labeled questions might be statements, comments, or incomplete phrases, which may impact the model's performance.

I proceeded with the fine-tuning of the LLM regardless.

Fine-Tuning RedPajama LLM

Goal: Fine-tune the RedPajama LLM using the collected dataset, refining its ability to provide responses to queries within the Flyte Slack platform. 

I chose the RedPajama 7B chat model for fine-tuning. As for the prompt, I decided to go with the following: 

def generate_prompt(input, output=""):
    return f"""As an advanced chatbot, you enjoy assisting users on a community Slack platform. Write a detailed response that appropriately answers the query.

### Query:
{input}

### Response:
{output}"""

To carry out the fine-tuning, I used the PyTorch elastic training integration, running the code on a single node equipped with five T4 GPUs. The integration makes it possible to run any PyTorch elastic training job on Union Cloud and offers a convenient way to train any LLM: the `torchrun --nproc-per-node=1 --nnodes=1 …` invocation is replaced with a simple `@task(task_config=Elastic(nnodes=1))` decorator.

@task(
    task_config=Elastic(nnodes=1),
    requests=Resources(mem="100Gi", cpu="50", gpu="5", ephemeral_storage="100Gi"),
    pod_template=PodTemplate(
        primary_container_name="llm-fine-tuning",
        pod_spec=V1PodSpec(
            containers=[
                V1Container(
                    name="llm-fine-tuning",
                    image="ghcr.io/samhita-alla/redpajama-finetune:0.0.8",
                    volume_mounts=[V1VolumeMount(mount_path="/dev/shm", name="dshm")],
                )
            ],
            volumes=[
                V1Volume(
                    name="dshm",
                    empty_dir=V1EmptyDirVolumeSource(
                    	medium="Memory", size_limit="60Gi"
                    ),
                )
            ],
        ),
    ),
    environment={
        "TRANSFORMERS_CACHE": "/tmp",
        "CUDA_LAUNCH_BLOCKING": "1",
    },
    secret_requests=[
        Secret(
            group=SECRET_GROUP, key=SECRET_NAME, mount_requirement=Secret.MountType.FILE
        )
    ],
)
def redpajama_finetune(config: TrainerConfig) -> FlyteDirectory:
    ...

Flyte task configuration

The next step is to load the tokenizer and the LLM into memory to proceed with fine-tuning.

tokenizer = AutoTokenizer.from_pretrained(config.base_model)

tokenizer.pad_token_id = 0  # unk. we want this to be different from the eos token
tokenizer.padding_side = "left"  # Allow batched inference

model = AutoModelForCausalLM.from_pretrained(
    config.base_model, # "togethercomputer/RedPajama-INCITE-7B-Chat"
    trust_remote_code=True,
    device_map=device_map, # "auto"
)

By utilizing the `device_map="auto"` parameter, I can leverage Hugging Face Accelerate to automatically determine the optimal placement of each layer in the model based on the available resources. This approach maximizes the utilization of GPU memory space by initially storing the model's weights on the GPU(s). If additional space is required, the remaining weights are stored on the CPU. In cases where there is insufficient RAM, the excess weights are stored on the hard drive as memory-mapped tensors.
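
If you're curious where each module ended up, Accelerate records the placement in the model's `hf_device_map` attribute (the module names below are illustrative for the GPT-NeoX-based RedPajama architecture):

# Inspect the per-module device placement chosen by Accelerate
print(model.hf_device_map)
# e.g. {'gpt_neox.embed_in': 0, 'gpt_neox.layers.0': 0, ..., 'embed_out': 1}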

Running the 7B model in full precision requires a total of 28 GB (7 * 4) GPU RAM, considering that each single precision (float32) floating-point number occupies 4 bytes of memory. However, the T4 instance I'll use for fine-tuning has only 16 GB of GPU memory available. To ensure that the model fits within the available memory, I set `torch_dtype` to `torch.float16`. By using half precision, the memory footprint is reduced to approximately 14 GB, allowing the model to fit within the available 16 GB of GPU memory on the T4 instance.
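
The arithmetic is easy to sanity-check:

# Back-of-the-envelope memory estimate for the weights of a 7B-parameter model
num_params = 7e9
for dtype, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{num_params * bytes_per_param / 1e9:.0f} GB")
# float32: ~28 GB, float16: ~14 GB, int8: ~7 GB

With `torch_dtype=torch.float16` set, the loading call becomes: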

model = AutoModelForCausalLM.from_pretrained(
    config.base_model,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

To further minimize memory consumption, which may not be necessary in this particular case but can be beneficial for larger models, you can employ 8-bit or 4-bit quantization offered by the bitsandbytes library. This approach allows for even more efficient utilization of memory resources. By utilizing the `load_in_8bit` functionality, I can convert the loaded model into a mixed 8-bit quantized model. This feature enables the model to be loaded and operated with reduced memory requirements while maintaining acceptable performance.

model = AutoModelForCausalLM.from_pretrained(
    config.base_model,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

After loading the model in 8-bit precision, the resulting memory footprint is as follows:

model.get_memory_footprint()
# Output: 7406375040

The model occupies only about 7.4 GB of memory!

Note: If the `torch_dtype` is not specified when loading the model in 8-bit, you will encounter the following warning:

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.

You can also use Parameter-Efficient Fine-Tuning (PEFT) techniques to address the challenges posed by computational and storage requirements. PEFT focuses on fine-tuning only a small subset of additional model parameters, resulting in substantial reductions in both computational and storage costs. Remarkably, these techniques achieve performance levels that are comparable to full fine-tuning approaches.

I will be utilizing a PEFT method called Low-Rank Adaptation (LoRA) from the PEFT library. Instead of fine-tuning the entire model, LoRA trains a small set of adapter weights, which are then loaded alongside the frozen base model.

Before training the int8 model using PEFT, there are some pre-processing steps that need to be performed. To help with this, I will incorporate a utility function called `prepare_model_for_int8_training` that performs the following tasks:

  1. It casts all the non-`int8` modules to full precision (`fp32`) to ensure stability during training.
  2. It adds a `forward_hook` to the input embedding layer, enabling gradient computation of the input hidden states. This is important for accurate gradient calculations.
  3. It enables gradient checkpointing, which optimizes memory usage during training by selectively storing and recomputing intermediate activations.
# 8-bit training
model = prepare_model_for_int8_training(model)

# LoRA
lora_config = LoraConfig(
    r=config.lora_r, # 8
    lora_alpha=config.lora_alpha, # 16
    target_modules=config.lora_target_modules, # ["query_key_value"]
    lora_dropout=config.lora_dropout, # 0.05
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

LoRA and PEFT
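
To verify how few parameters LoRA actually trains, the wrapped PEFT model can report its trainable-parameter count (the exact figures depend on the LoRA rank and target modules):

# Report trainable vs. total parameters after applying LoRA
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: well under 1%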

The next step in the process is to define a `Trainer` that will handle the training loop.

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=TrainingArguments(
        per_device_train_batch_size=config.micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        num_train_epochs=config.num_epochs,
        learning_rate=config.learning_rate,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps" if config.val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=eval_steps if config.val_set_size > 0 else None,
        save_steps=save_steps,
        output_dir=config.output_dir,
        save_total_limit=3,
        load_best_model_at_end=True if config.val_set_size > 0 else False,
        ddp_find_unused_parameters=False if ddp else None,
        group_by_length=config.group_by_length,
        report_to="wandb" if use_wandb else None,
        run_name=wandb_run_name if use_wandb else None,
    ),
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
    callbacks=[SavePeftModelCallback],
)

HuggingFace Trainer to fine-tune RedPajama using Slack data.

I stored the resulting fine-tuned LoRA weights on the Hugging Face Hub for easy accessibility and future use.
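
Saving and pushing only the adapter weights takes a couple of calls on the PEFT model (a sketch; the repository name and `HF_TOKEN` below are placeholders):

# Save only the LoRA adapter weights (a few megabytes, not the full 7B model)
model.save_pretrained(config.output_dir)

# Push the adapters to the Hugging Face Hub
model.push_to_hub("your-username/redpajama-flyte-slack-lora", token=HF_TOKEN)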

You can access the complete fine-tuning code on our GitHub repository. The code includes various libraries and techniques, such as PyTorch elastic training for native `torchrun` usage, 8-bit quantization, PEFT and LoRA. By integrating these techniques into a Flyte task, I successfully conducted the fine-tuning process on T4 GPUs, leveraging the capabilities of Union Cloud as a unified platform. Union Cloud offers a secure environment, deployed within your virtual private cloud (VPC), to ensure you have full control over data access. It lets you reproduce your fine-tuned model artifacts and facilitates easy retrieval whenever needed.

Inference

Goal: Implement an inference pipeline using the fine-tuned RedPajama LLM to generate accurate predictions on new user queries. 

During the inference phase, I loaded the pre-trained base model and instantiated a LoRA model on top of it using the fine-tuned LoRA configuration and weights.

@task(requests=Resources(gpu="1", mem="50Gi", cpu="10"))
def generate_output(
    input: str,
    temperature: float,
    top_p: float,
    top_k: int,
    num_beams: int,
    max_new_tokens: int,
    load_8bit: bool,
    base_model: str,
    lora_weights: str,
) -> str:
    base_model = base_model or os.environ.get("BASE_MODEL", "")
    assert (
        base_model
    ), "Please specify a --base_model, e.g. --base_model='huggyllama/llama-7b'"

    # Define the device used below when moving input tensors to the GPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=load_8bit,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(
        model,
        lora_weights,
        torch_dtype=torch.float16,
    )

    # unwind broken decapoda-research config
    model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
    model.config.bos_token_id = 1
    model.config.eos_token_id = 2

    if not load_8bit:
        model.half()  # seems to fix bugs for some users.

    model.eval()
    if torch.__version__ >= "2" and sys.platform != "win32":
        model = torch.compile(model)

    prompt = generate_prompt(input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
    )

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
            repetition_penalty=1.3,
         )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)

    return output.split("### Response:")[1].strip()

The predictions I generated were suboptimal.

It seems that the prompt I used could have been more effective. RedPajama recommends an alternative prompt, but unfortunately, it did not yield improved results. Interestingly, I also noticed a significant number of empty outputs for reasons I couldn't pin down.

<human>: [Instruction]
<bot>:

Key takeaways

I observed that the performance of the fine-tuned model did not meet my initial expectations. The responses generated by the fine-tuned model tended to be concise, and there were occasional repetitions in the output.

The brevity of responses can be attributed to the characteristics of the Slack data used for training. To address this, one possible approach could be to filter and consider only longer responses or involve human intervention to select and provide appropriate responses.
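
As a concrete example, a crude filter over the existing dataset could keep only the pairs whose responses clear a word-count threshold (the threshold here is arbitrary):

from datasets import load_dataset

dataset = load_dataset("unionai/flyte-slack-data")

# Keep only question-response pairs whose answers are reasonably detailed
long_responses = dataset.filter(lambda example: len(example["output"].split()) >= 50)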

Regarding the issue of repetitions, it seemed to stem from the model being undertrained, as the training loss plateaued while the evaluation loss continued to decrease. Although I attempted to mitigate this problem by applying a repetition penalty, it did not completely eliminate the repetitions.

The training loss reached a plateau, while the evaluation loss continued to decrease.

Interestingly, when I fine-tuned the model on a smaller subset of 1,000 data samples instead of the entire dataset, the responses became longer, although the model occasionally generated hallucinated content. Nevertheless, the responses remained polite and detailed.

Fine-tuning seems like a promising approach, but its effectiveness heavily relies on the availability and quality of the data. And given that it requires high-end GPUs solely for training, it may not be the most efficient use of resources. When fine-tuning, it is important to anticipate that the response style will mirror the output format of the training dataset. I noticed that the fine-tuned model often relied heavily on the specific characteristics of the training data, sometimes disregarding the desired response style (even after I modified the prompt to guide it toward generating detailed explanations).

In contrast, generating semantic embeddings proved to be a quick, straightforward and effective alternative in most use cases. This cost-effective solution allows you to harness the power of larger and more advanced models like GPT to generate embeddings. Since embeddings operate as distinct modules, they retain the pre-trained knowledge and are less prone to catastrophic forgetting, unlike fine-tuning, which can lead to the loss of previously learned information while adapting to new tasks. When the objective is to impart knowledge to the model, embeddings provide a more consistent response style, primarily focusing on delivering factual information in most cases.

To summarize my experimentation journey and key takeaways: I initially began fine-tuning the model with a limited understanding of dataset curation and prompt engineering. I quickly learned that the quality of the dataset plays a vital role in the fine-tuning process. On the other hand, dealing with embeddings proved to be relatively simpler and didn't require extensive dataset cleanup, although it never hurts to ensure cleanliness. 

Moving forward, the next steps involve deciding between fine-tuning and utilizing semantic embeddings. If fine-tuning appears to be a suitable approach, I will proceed with the creation of a high-quality Slack dataset. I would greatly appreciate your feedback and any ideas you may have regarding improvements or further steps that I could consider. 

Originally published at https://allasamhita.medium.com.
