📚 Generate a dataset
To generate a dataset, we will make use of engines that wrap third-party APIs. The following engines are currently supported:
- OpenAI
- Cohere
- AI21
An OpenAI API key can be obtained here.
```python
# OpenAI
from xturing.model_apis.openai import ChatGPT, Davinci

engine = ChatGPT("your-api-key")
# or
engine = Davinci("your-api-key")
```

```python
# Cohere
from xturing.model_apis.cohere import Medium

engine = Medium("your-api-key")
```

```python
# AI21
from xturing.model_apis.ai21 import J2Grande

engine = J2Grande("your-api-key")
```
From no data
Even if we have no data, we can write a .jsonl
file that contains the tasks/use cases we would like our model to perform well on. Continue reading to learn this file's structure.
Write your tasks.jsonl
Each line of this file needs to be a JSON object with the following fields:
Name | Type | Description |
---|---|---|
id | string | A unique identifier for the seed task. This can be any string that is unique within the set of seed tasks you are generating a dataset for. |
name | string | A name for the seed task that describes what it is. This can be any string that helps you identify the task. |
instruction | string | A natural language instruction or question that defines the task. This should be a clear and unambiguous description of what the task is asking the model to do. |
instances | List[Dict[str,str]] | A list of input-output pairs that provide examples of what the model should output for this task. Each input-output pair is an object with two fields: input and output. |
is_classification | boolean | A flag that indicates whether this is a classification task or not. If this flag is set to true, the output should be a single label (e.g. a category or class), otherwise the output can be any text. The default value is false. |
Here's an example of a task in the above-mentioned format:

```json
{
    "id": "seed_task_0",
    "name": "addition",
    "instruction": "Add the two numbers together",
    "instances": [
        {
            "input": "2 + 2",
            "output": "4"
        },
        {
            "input": "3 + 7",
            "output": "10"
        }
    ],
    "is_classification": false
}
```
Here is what a sample tasks.jsonl
file should look like:

```json
{
    "id": "seed_task_0",
    "name": "breakfast_suggestion",
    "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?",
    "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup water, totalling about 550 calories. The 4 strips of bacon contain about 200 calories."}],
    "is_classification": false
}
{
    "id": "seed_task_1",
    "name": "antonym_relation",
    "instruction": "What is the relation between the given pairs?",
    "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}],
    "is_classification": false
}
```
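Before spending engine credits, it can be worth checking that every line of tasks.jsonl parses and carries the required fields (note that in an actual .jsonl file each JSON object occupies a single line). The `validate_tasks` helper below is a minimal sketch for this, not part of xturing's API:

```python
import json

# Fields every seed task must carry (see the table above)
REQUIRED_FIELDS = {"id", "name", "instruction", "instances", "is_classification"}

def validate_tasks(path):
    """Parse a tasks.jsonl file (one JSON object per line) and check each
    task against the seed-task schema."""
    tasks = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            task = json.loads(line)  # raises json.JSONDecodeError if malformed
            missing = REQUIRED_FIELDS - task.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            for instance in task["instances"]:
                if not {"input", "output"} <= instance.keys():
                    raise ValueError(f"line {line_no}: each instance needs 'input' and 'output'")
            tasks.append(task)
    return tasks
```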
Save the dataset
In order to reuse the dataset we just generated without regenerating it next time we need it, we can simply save our instance as shown here.
Example
Using the .generate_dataset()
method, we can generate a dataset from a list of tasks/use cases. If generation is interrupted, we can resume it by passing the same list of tasks, since intermediate results are cached. If we don't want to load the cached results, we can simply delete the created folder from our working directory.
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import Davinci

# Load the required engine
engine = Davinci("your-api-key")

# Generate the dataset
dataset = InstructionDataset.generate_dataset(path="./tasks.jsonl", engine=engine)

# Save the dataset instance
dataset.save('/path/to/directory')
```
The following parameters can be used to control the extent of generation:
Name | Type | Default | Description |
---|---|---|---|
num_instructions_for_finetuning | int | 5 | The size of the generated dataset. If this number is much bigger than the number of lines in tasks.jsonl, we can expect a more diverse dataset. Keep in mind that the larger this number, the more credits will be consumed from your engine. |
num_instructions | int | 10 | A cap on the size of the dataset; this can help to create a more diverse dataset. If you don't want to apply a cap, set this to the same value as num_instructions_for_finetuning. |
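The interaction between the two parameters can be pictured with a toy function. This is only an illustration of the capping behaviour described in the table, under the assumption that generation stops once num_instructions samples exist and the fine-tuning set keeps num_instructions_for_finetuning of them; it is not xturing's actual implementation:

```python
def cap_generation(candidates, num_instructions=10, num_instructions_for_finetuning=5):
    """Illustrative only: cap the candidate pool at num_instructions, then
    keep num_instructions_for_finetuning samples for the final dataset."""
    generated = candidates[:num_instructions]                  # overall cap
    finetuning = generated[:num_instructions_for_finetuning]   # dataset size
    return generated, finetuning
```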
From custom data
We can also generate a dataset from our own files.
The files can be of one of the following formats:
`.csv`, `.doc`, `.docx`, `.eml`, `.epub`, `.gif`, `.jpg`, `.jpeg`, `.json`, `.html`, `.htm`, `.mp3`, `.msg`, `.odt`, `.ogg`, `.pdf`, `.png`, `.pptx`, `.rtf`, `.tiff`, `.tif`, `.txt`, `.wav`, `.xlsx`, `.xls`
Set up your environment
First, we need to make sure that all the necessary libraries are installed on our system. For this, we need to run the commands below:
- OSX (this relies on you having Homebrew installed)

```shell
$ brew install caskroom/cask/brew-cask
$ brew cask install xquartz
$ brew install poppler antiword unrtf tesseract swig
```

- Ubuntu/Debian

```shell
$ apt-get update
$ apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
```
Prepare the files
Next, we just need to provide the directory path where our files are located. Files from sub-directories will also be discovered automatically.
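The discovery step can be sketched with the standard library. The `discover_files` helper and its `SUPPORTED` set are hypothetical, written here only to illustrate how recursive discovery of the supported formats might work:

```python
import os

# Extensions from the supported-formats list above
SUPPORTED = {
    ".csv", ".doc", ".docx", ".eml", ".epub", ".gif", ".jpg", ".jpeg",
    ".json", ".html", ".htm", ".mp3", ".msg", ".odt", ".ogg", ".pdf",
    ".png", ".pptx", ".rtf", ".tiff", ".tif", ".txt", ".wav", ".xlsx", ".xls",
}

def discover_files(root):
    """Walk root (including sub-directories) and collect supported files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```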
Save the dataset
In order to reuse the dataset we just generated without regenerating it next time we need it, we can simply save our instance as shown here.
Example
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import ChatGPT

# Load the required engine
engine = ChatGPT("your-api-key")

# Generate the dataset
dataset = InstructionDataset.generate_dataset_from_dir(path="/path/to/directory", engine=engine)

# Save the dataset instance
dataset.save("./my_generated_dataset")
```
The following parameters can be used to customise data generation:
Name | Type | Default | Description |
---|---|---|---|
use_self_instruct | bool | False | When True, the dataset will be augmented with self-instructions (more samples, more diverse). In this case, you also have control over the same parameters as the generate_dataset() method: num_instructions, num_instructions_for_finetuning. |
chunk_size | int | 8000 | The size of the chunk of text (in chars) that will be used to generate the instructions. We recommend values below 10000, but it depends on the model (engine) you are using. |
num_samples_per_chunk | int | 5 | The number of samples that will be generated for each chunk. |
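The chunking that chunk_size controls can be pictured as a simple character-based split; this sketch is an assumption about the mechanism for illustration purposes, not xturing's actual implementation:

```python
def chunk_text(text, chunk_size=8000):
    """Split text into consecutive chunks of at most chunk_size characters.
    Each chunk would then yield up to num_samples_per_chunk instructions."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```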