👨🏻‍🏫 Inference
xTuring is easy to use: the library already loads sensible default generation parameters for each model. For advanced usage, you can customize the generation_config attribute of the model object.
In this tutorial, we will load one of the supported models and customize its generation configuration before running inference.
Load the model
First, we need to load the model we want to use.
from xturing.models import BaseModel
model = BaseModel.create("llama_lora")  # e.g. "llama_lora"; pass the key of any supported model
Load the config object
Next, we fetch the model's generation configuration with the command below.
generation_config = model.generation_config()
We can print the generation_config object to check the default configuration.
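For example, printing the object shows the current value of every generation parameter:
print(generation_config)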
Customize the configuration
Now, we can customize the generation configuration as we wish. All the customizable parameters are listed in the table below.
generation_config.max_new_tokens = 256
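For example, to switch to nucleus sampling, you can set the sampling-related parameters from the table below (the values here are illustrative):
generation_config.do_sample = True
generation_config.top_p = 0.9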
Test the model
Lastly, we can run inference with the command below to see how our configuration behaves.
output = model.generate(texts=["Why are the LLM models important?"])
We can print the output object to see the results.
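print(output)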
Parameters
Name | Type | Range | Default | Description
---|---|---|---|---
max_new_tokens | int | ≥ 1 | 256 | The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
penalty_alpha | float | [0, 1) | 0.6 | For contrastive search decoding. Balances the model's confidence against the degeneration penalty.
top_k | int | ≥ 0 | 4 | For contrastive search and sampling decoding methods. The number of highest-probability vocabulary tokens to keep for top-k filtering.
do_sample | bool | {true, false} | false | Whether or not to use sampling.
top_p | float | [0, 1] | 0 | For sampling decoding. If set to a float < 1, only the smallest set of the most probable tokens whose probabilities add up to top_p or higher is kept for generation.
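Putting it all together, a minimal end-to-end script looks like this (the model key and the sampling values are illustrative; any supported model key works):
from xturing.models import BaseModel

# Load a supported model (illustrative key)
model = BaseModel.create("llama_lora")

# Fetch and customize the generation configuration
generation_config = model.generation_config()
generation_config.max_new_tokens = 256
generation_config.do_sample = True
generation_config.top_p = 0.9

# Run inference and print the results
output = model.generate(texts=["Why are the LLM models important?"])
print(output)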