📜 Use datasets
When using an existing dataset for tasks like fine-tuning or model testing, it is crucial that the dataset is in a format compatible with xTuring. However, there might be instances where the chosen dataset isn't in the format that xTuring expects. In such cases, we can fix this with the following process:
1. Select the dataset: choose a dataset that suits our requirements for fine-tuning or testing the model.
2. Check the format: verify whether the chosen dataset is in the format that xTuring accepts. If it is not, we need to proceed with some adjustments.
3. Adjust the format: reformat the dataset according to the structure xTuring accepts, so the platform can seamlessly work with the data.
4. Keep the data coherent: the reformatted data should remain clear and well organized.
By following these steps, we transform the chosen dataset into a format compatible with xTuring, enabling efficient usage and optimal results.
Now that we know what needs to be done to format the dataset, let's look at how to do it!
Instruction dataset format
For this tutorial, we need to prepare a dataset with 3 columns (instruction, text, target) for instruction fine-tuning, or 2 columns (text, target) for text fine-tuning. Here, we will see how to convert the Alpaca dataset for instruction fine-tuning. Before starting, make sure you have downloaded the Alpaca dataset to your working directory.
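To make the column mapping concrete, here is a minimal sketch of how a single Alpaca record maps onto the three columns xTuring expects (the record's values are hypothetical):

```python
# A hypothetical Alpaca-style record. Alpaca's "input" becomes xTuring's
# "text" column, and "output" becomes "target"; "instruction" keeps its name.
alpaca_record = {
    "instruction": "Summarise the text.",
    "input": "xTuring fine-tunes large language models.",
    "output": "A short summary.",
}

converted = {
    "instruction": alpaca_record["instruction"],
    "text": alpaca_record["input"],
    "target": alpaca_record["output"],
}

print(sorted(converted))  # ['instruction', 'target', 'text']
```

The same renaming is applied to every record in the dataset in the conversion script below.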
Convert the dataset to Instruction Dataset format
This is the main step: using our knowledge of the existing dataset, we convert it to a format understood by xTuring's InstructionDataset class.
```python
import json

from datasets import Dataset, DatasetDict

# Load the raw Alpaca JSON file
alpaca_data = json.load(open('/path/to/alpaca_dataset'))

instructions = []
inputs = []
outputs = []

# Collect each Alpaca field into its own column
for data in alpaca_data:
    instructions.append(data["instruction"])
    inputs.append(data["input"])
    outputs.append(data["output"])

# Map Alpaca's (instruction, input, output) onto
# xTuring's (instruction, text, target) columns
data_dict = {
    "train": {"instruction": instructions, "text": inputs, "target": outputs}
}

dataset = DatasetDict()
for k, v in data_dict.items():
    dataset[k] = Dataset.from_dict(v)

# Save in the Hugging Face `datasets` on-disk format
dataset.save_to_disk(str("./alpaca_data"))
```
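Before saving, it is worth sanity-checking that the three column lists stay aligned, with one entry per source record. A minimal stdlib sketch, using hypothetical in-memory records in place of the JSON file:

```python
# Hypothetical Alpaca-style records standing in for the downloaded JSON file.
alpaca_data = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"},
]

instructions = [d["instruction"] for d in alpaca_data]
inputs = [d["input"] for d in alpaca_data]
outputs = [d["output"] for d in alpaca_data]

# Every column must have one entry per source record; otherwise the rows of
# the resulting Dataset would be misaligned.
assert len(instructions) == len(inputs) == len(outputs) == len(alpaca_data)
print(len(instructions))  # 2
```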
Load the prepared dataset
After preparing the dataset in the correct format, you can use it for instruction fine-tuning.
To load the instruction dataset:
```python
from xturing.datasets.instruction_dataset import InstructionDataset

instruction_dataset = InstructionDataset('/path/to/instruction_converted_alpaca_dataset')
```
Text dataset format
Most text datasets found on the internet are already in a format accepted by xTuring's TextDataset class, so for text fine-tuning we need not worry and can use those datasets as is.
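For reference, the two-column structure used for text fine-tuning can be sketched as a plain dictionary (the values here are hypothetical):

```python
# Hypothetical example of the two-column (text, target) structure used for
# text fine-tuning; no "instruction" column is required.
text_data = {
    "text": ["What is xTuring?"],
    "target": ["A library for fine-tuning large language models."],
}

print(sorted(text_data))  # ['target', 'text']
```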