📚 Generate a dataset
To generate a dataset, we will make use of engines that wrap third-party APIs. The following engines are currently supported:
- OpenAI
- Cohere
- AI21
An OpenAI API key can be obtained here.
```python
# OpenAI
from xturing.model_apis.openai import ChatGPT, Davinci

engine = ChatGPT("your-api-key")
# or
engine = Davinci("your-api-key")
```

```python
# Cohere
from xturing.model_apis.cohere import Medium

engine = Medium("your-api-key")
```

```python
# AI21
from xturing.model_apis.ai21 import J2Grande

engine = J2Grande("your-api-key")
```
From no data
Even if we have no data, we can write a .jsonl
file that contains the tasks/use cases we would like our model to perform well on. Continue reading to learn this file's structure.
Write your tasks.jsonl
Each line of this file needs to be a JSON object with the following fields:
Name | Type | Description |
---|---|---|
id | string | A unique identifier for the seed task. This can be any string that is unique within the set of seed tasks you are generating a dataset for. |
name | string | A name for the seed task that describes what it is. This can be any string that helps you identify the task. |
instruction | string | A natural language instruction or question that defines the task. This should be a clear and unambiguous description of what the task is asking the model to do. |
instances | List[Dict[str,str]] | A list of input-output pairs that provide examples of what the model should output for this task. Each input-output pair is an object with two fields: input and output. |
is_classification | boolean | A flag that indicates whether this is a classification task or not. If this flag is set to true, the output should be a single label (e.g. a category or class), otherwise the output can be any text. The default value is false. |
Here's an example of a task in the above-mentioned format:

```json
{
    "id": "seed_task_0",
    "name": "addition",
    "instruction": "Add the two numbers together",
    "instances": [
        {
            "input": "2 + 2",
            "output": "4"
        },
        {
            "input": "3 + 7",
            "output": "10"
        }
    ],
    "is_classification": false
}
```
Here is what a sample tasks.jsonl
file should look like:

```json
{
    "id": "seed_task_0",
    "name": "breakfast_suggestion",
    "instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?",
    "instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup water, totalling about 550 calories. The 4 strips of bacon contain about 200 calories."}],
    "is_classification": false
}
{
    "id": "seed_task_1",
    "name": "antonym_relation",
    "instruction": "What is the relation between the given pairs?",
    "instances": [{"input": "Night : Day :: Right : Left", "output": "The relation between the given pairs is that they are opposites."}],
    "is_classification": false
}
```
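Before spending engine credits, it can be worth checking that every line of tasks.jsonl parses and carries the required fields (note that in an actual .jsonl file each JSON object occupies a single line). The `validate_tasks` helper below is a minimal sketch for this, not part of xturing's API:

```python
import json

# Fields every seed task must carry (see the table above)
REQUIRED_FIELDS = {"id", "name", "instruction", "instances", "is_classification"}

def validate_tasks(path):
    """Parse a tasks.jsonl file (one JSON object per line) and check each
    task against the seed-task schema."""
    tasks = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            task = json.loads(line)  # raises json.JSONDecodeError if malformed
            missing = REQUIRED_FIELDS - task.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            for instance in task["instances"]:
                if not {"input", "output"} <= instance.keys():
                    raise ValueError(f"line {line_no}: each instance needs 'input' and 'output'")
            tasks.append(task)
    return tasks
```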
Save the dataset
In order to reuse the dataset we just generated without regenerating it next time we need it, we can simply save our instance as shown here.
Example
Using the .generate_dataset()
method, we can generate a dataset from a list of tasks/use cases. If generation is interrupted, we can resume it by passing the same list of tasks, since intermediate results are cached. If we don't want to load the cached results, we can simply delete the created folder from our working directory.
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import Davinci

# Load the required engine
engine = Davinci("your-api-key")

# Generate the dataset
dataset = InstructionDataset.generate_dataset(path="./tasks.jsonl", engine=engine)

# Save the dataset instance
dataset.save('/path/to/directory')
```
The following parameters can be used to control the extent of generation:
Name | Type | Default | Description |
---|---|---|---|
num_instructions_for_finetuning | int | 5 | The size of the generated dataset. If this number is much bigger than the number of lines in tasks.jsonl, we can expect a more diverse dataset. Keep in mind that the larger this number, the more credits will be consumed from your engine. |
num_instructions | int | 10 | A cap on the size of the dataset; this can help to create a more diverse dataset. If you don't want to apply a cap, set this to the same value as num_instructions_for_finetuning. |
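The interaction between the two parameters can be pictured with a toy function. This is only an illustration of the capping behaviour described in the table, under the assumption that generation stops once num_instructions samples exist and the fine-tuning set keeps num_instructions_for_finetuning of them; it is not xturing's actual implementation:

```python
def cap_generation(candidates, num_instructions=10, num_instructions_for_finetuning=5):
    """Illustrative only: cap the candidate pool at num_instructions, then
    keep num_instructions_for_finetuning samples for the final dataset."""
    generated = candidates[:num_instructions]                  # overall cap
    finetuning = generated[:num_instructions_for_finetuning]   # dataset size
    return generated, finetuning
```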
From custom data
We can also generate a dataset from our own files.
The files can be of one of the following formats:
`.csv`, `.doc`, `.docx`, `.eml`, `.epub`, `.gif`, `.jpg`, `.jpeg`, `.json`, `.html`, `.htm`, `.mp3`, `.msg`, `.odt`, `.ogg`, `.pdf`, `.png`, `.pptx`, `.rtf`, `.tiff`, `.tif`, `.txt`, `.wav`, `.xlsx`, `.xls`
Set up your environment
First, we need to make sure that all the necessary libraries are installed on our system. For this, we need to run the commands below:
- OSX (this relies on you having Homebrew installed)

```shell
$ brew install caskroom/cask/brew-cask
$ brew cask install xquartz
$ brew install poppler antiword unrtf tesseract swig
```

- Ubuntu/Debian

```shell
$ apt-get update
$ apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
```
Prepare the files
Next, we just need to provide the directory path where our files are located. Files from sub-directories will also be discovered automatically.
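The discovery step can be sketched with the standard library. The `discover_files` helper and its `SUPPORTED` set are hypothetical, written here only to illustrate how recursive discovery of the supported formats might work:

```python
import os

# Extensions from the supported-formats list above
SUPPORTED = {
    ".csv", ".doc", ".docx", ".eml", ".epub", ".gif", ".jpg", ".jpeg",
    ".json", ".html", ".htm", ".mp3", ".msg", ".odt", ".ogg", ".pdf",
    ".png", ".pptx", ".rtf", ".tiff", ".tif", ".txt", ".wav", ".xlsx", ".xls",
}

def discover_files(root):
    """Walk root (including sub-directories) and collect supported files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```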
Save the dataset
In order to reuse the dataset we just generated without regenerating it next time we need it, we can simply save our instance as shown here.
Example
```python
from xturing.datasets import InstructionDataset
from xturing.model_apis.openai import ChatGPT

# Load the required engine
engine = ChatGPT("your-api-key")

# Generate the dataset
dataset = InstructionDataset.generate_dataset_from_dir(path="/path/to/directory", engine=engine)

# Save the dataset instance
dataset.save("./my_generated_dataset")
```
The following parameters can be used to customise data generation:
Name | Type | Default | Description |
---|---|---|---|
use_self_instruct | bool | False | When True, the dataset will be augmented with self-instructions (more samples, more diverse). In this case, you also have control over the same parameters as the generate_dataset() method: num_instructions, num_instructions_for_finetuning. |
chunk_size | int | 8000 | The size of the chunk of text (in chars) that will be used to generate the instructions. We recommend values below 10000, but it depends on the model (engine) you are using. |
num_samples_per_chunk | int | 5 | The number of samples that will be generated for each chunk. |
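The chunking that chunk_size controls can be pictured as a simple character-based split; this sketch is an assumption about the mechanism for illustration purposes, not xturing's actual implementation:

```python
def chunk_text(text, chunk_size=8000):
    """Split text into consecutive chunks of at most chunk_size characters.
    Each chunk would then yield up to num_samples_per_chunk instructions."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```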