
📚 Generate a dataset

To generate a dataset, we make use of engines that wrap third-party APIs. The following engines are currently supported:

An OpenAI API key can be obtained here

from xturing.model_apis.openai import ChatGPT, Davinci

# Instantiate an engine with your OpenAI API key
engine = ChatGPT("your-api-key")
# or
engine = Davinci("your-api-key")

From no data

Even if we have no data, we can write a .jsonl file containing the tasks/use cases we would like our model to perform well on. Continue reading to learn the structure of this file.

Write your tasks.jsonl

Each line of this file needs to be a JSON object with the following fields:

| Name | Type | Description |
| --- | --- | --- |
| `id` | string | A unique identifier for the seed task. This can be any string that is unique within the set of seed tasks you are generating a dataset for. |
| `name` | string | A name for the seed task that describes what it is. This can be any string that helps you identify the task. |
| `instruction` | string | A natural language instruction or question that defines the task. This should be a clear and unambiguous description of what the task is asking the model to do. |
| `instances` | List[Dict[str,str]] | A list of input-output pairs that provide examples of what the model should output for this task. Each input-output pair is an object with two fields: input and output. |
| `is_classification` | boolean | A flag that indicates whether this is a classification task. If set to true, the output should be a single label (e.g. a category or class); otherwise the output can be any text. The default value is false. |

Here's an example of a task in the above-mentioned format:

{
  "id": "seed_task_0",
  "name": "addition",
  "instruction": "Add the two numbers together",
  "instances": [
    {
      "input": "2 + 2",
      "output": "4"
    },
    {
      "input": "3 + 7",
      "output": "10"
    }
  ],
  "is_classification": false
}

Here is how a sample tasks.jsonl file should look. Note that each task occupies a single line, as required by the JSONL format:

{"id": "seed_task_0", "name": "breakfast_suggestion", "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?", "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1 tbsp flaxseed oil and 1/2 cup water, totalling about 550 calories. The 4 strips of bacon contain about 200 calories."}], "is_classification": false}
{"id": "seed_task_1", "name": "antonym_relation", "instruction": "What is the relation between the given pairs?", "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}], "is_classification": false}
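
For a classification task, set is_classification to true and keep each output to a single label. The line below is a hypothetical seed task illustrating this:

{"id": "seed_task_2", "name": "sentiment_classification", "instruction": "Classify the sentiment of the given sentence as positive or negative.", "instances": [{"input": "I loved this movie!", "output": "positive"}], "is_classification": true}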

Save the dataset

In order to use the dataset we just generated without wasting time regenerating it the next time we need it, we can simply save our instance as shown below.

Example

Using the .generate_dataset() method, we can generate a dataset from a list of tasks/use cases. If the generation gets interrupted, we can resume it by passing the same list of tasks, since the intermediate results are cached. If we don't want to load the cached results, we simply delete the generated folder from our working directory.

from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import Davinci

# Load the required engine
engine = Davinci("your-api-key")

# Generate the dataset
dataset = InstructionDataset.generate_dataset(path="./tasks.jsonl", engine=engine)

# Save the dataset instance
dataset.save("/path/to/directory")
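
Once saved, the dataset can be reloaded later by pointing InstructionDataset at the saved directory. A minimal sketch, assuming the constructor accepts the path of a saved dataset:

from xturing.datasets import InstructionDataset

# Reload the previously saved dataset from disk
dataset = InstructionDataset("/path/to/directory")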

The following parameters can be used to control the extent of generation:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `num_instructions_for_finetuning` | int | 5 | The size of the generated dataset. If this number is much bigger than the number of lines in tasks.jsonl, we can expect a more diverse dataset. Keep in mind that the larger this number, the more credits your engine will consume. |
| `num_instructions` | int | 10 | A cap on the size of the dataset, which can help create a more diverse dataset. If you don't want to apply a cap, set this to the same value as num_instructions_for_finetuning. |
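
For instance, to generate a larger dataset, these can be passed as keyword arguments. This is a sketch that assumes .generate_dataset() accepts them directly; check the API reference for the exact signature:

dataset = InstructionDataset.generate_dataset(
    path="./tasks.jsonl",
    engine=engine,
    num_instructions_for_finetuning=50,  # target size of the generated dataset
    num_instructions=100,  # cap on the dataset size
)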

From custom data

We can also generate a dataset from our own files.

The files can be of one of the following formats:

.csv, .doc, .docx, .eml, .epub, .gif, .jpg, .jpeg, .json, .html, .htm, .mp3, .msg, .odt, .ogg, .pdf, .png, .pptx, .rtf, .tiff, .tif, .txt, .wav, .xlsx, .xls

Set up your environment

First, we need to make sure that all the necessary system libraries are installed. For this, run the commands below:

These commands rely on you having Homebrew installed, so they apply to macOS.

$ brew install --cask xquartz
$ brew install poppler antiword unrtf tesseract swig
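
On Debian/Ubuntu, roughly equivalent packages can typically be installed with apt; the package names below are assumptions and may vary by distribution:

$ sudo apt-get install poppler-utils antiword unrtf tesseract-ocr swig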

Prepare the files

Next, we just need to provide the directory path where our files are located. Files from sub-directories will also be discovered automatically.
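
For instance, given a hypothetical layout like the one below, all four files would be picked up:

my_files/
├── report.pdf
├── notes.txt
└── archive/
    ├── slides.pptx
    └── scan.png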

Save the dataset

In order to use the dataset we just generated without wasting time regenerating it the next time we need it, we can simply save our instance as shown below.

Example

from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import ChatGPT

# Load the required engine
engine = ChatGPT("your-api-key")

# Generate the dataset from the files in the given directory
dataset = InstructionDataset.generate_dataset_from_dir(path="/path/to/directory", engine=engine)

# Save the dataset instance
dataset.save("./my_generated_dataset")

The following parameters can be used to customise data generation:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `use_self_instruct` | bool | False | When True, the dataset will be augmented with self-instructions (more samples, more diverse). In this case, you also have control over the same parameters as the generate_dataset() method: num_instructions and num_instructions_for_finetuning. |
| `chunk_size` | int | 8000 | The size of the chunk of text (in characters) used to generate the instructions. We recommend values below 10000, but it depends on the model (engine) you are using. |
| `num_samples_per_chunk` | int | 5 | The number of samples that will be generated for each chunk. |
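
For example, to enable self-instruct augmentation on smaller chunks of text, these can be passed as keyword arguments. This is a sketch that assumes .generate_dataset_from_dir() accepts them directly; check the API reference for the exact signature:

dataset = InstructionDataset.generate_dataset_from_dir(
    path="/path/to/directory",
    engine=engine,
    use_self_instruct=True,   # augment with self-instructions
    chunk_size=4000,          # characters per chunk of source text
    num_samples_per_chunk=5,  # samples generated per chunk
)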