
👨🏻‍🏫 Inference

xTuring is easy to use. By default, the library loads a preset generation configuration for each model, so inference works without any tuning.

For advanced usage, you can customize the model's generation configuration, which you fetch by calling generation_config() on the model object.

In this tutorial, we will load one of the supported models and customize its generation configuration before running inference.

Load the model

First, we need to load the model we want to use.

from xturing.models import BaseModel

# Pass the key of a supported model; the key is left blank in this page.
model = BaseModel.create("")

Load the config object

Next, we need to fetch the model's generation configuration using the command below.

generation_config = model.generation_config()

We can print the generation_config object to check the default configuration.

Customize the configuration

Now, we can customize the generation configuration as we wish. All the customizable parameters are listed at the end of this page.

generation_config.max_new_tokens = 256
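Several parameters can be changed at once in the same way. The sketch below is only an illustration: `apply_overrides` is a hypothetical helper, not part of xTuring, and a `SimpleNamespace` stands in for the object returned by `model.generation_config()` so the example runs without the library installed. The stand-in is seeded with the defaults listed on this page, and the override values (`top_k=50`, `top_p=0.9`) are arbitrary examples, not recommendations.

```python
from types import SimpleNamespace


def apply_overrides(config, **overrides):
    """Set each override on config, rejecting names the config does not define."""
    for name, value in overrides.items():
        if not hasattr(config, name):
            raise AttributeError(f"unknown generation parameter: {name}")
        setattr(config, name, value)
    return config


# Stand-in (assumption) for the object returned by model.generation_config(),
# filled with the default values listed on this page.
generation_config = SimpleNamespace(
    max_new_tokens=256,
    penalty_alpha=0.6,
    top_k=4,
    do_sample=False,
    top_p=0,
)

# Turn on sampling and shorten the output in one call.
apply_overrides(
    generation_config,
    do_sample=True,
    top_k=50,
    top_p=0.9,
    max_new_tokens=128,
)
print(generation_config)
```

Rejecting unknown names is a design choice of this sketch: a misspelled parameter such as `max_new_token` raises an error instead of silently creating a new attribute that the model never reads.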

Test the model

Lastly, we can run inference using the command below to see how our configuration affects the output.

output = model.generate(texts=["Why are the LLM models important?"])

We can print the output object to see the results.
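The whole flow (fetch the configuration, customize it, generate) can be wrapped in one function. This is a rough sketch under stated assumptions: `run_inference` and `StubModel` are hypothetical names, the stub only mimics the two calls used in this tutorial (`generation_config()` and `generate(texts=...)`), and it assumes that `generate` returns one output per prompt, in order. With a real model, you would pass the object returned by `BaseModel.create` instead of the stub.

```python
from types import SimpleNamespace


def run_inference(model, prompts, **overrides):
    """Apply overrides to the model's generation config, then generate."""
    config = model.generation_config()
    for name, value in overrides.items():
        setattr(config, name, value)
    outputs = model.generate(texts=prompts)
    # Assumption: one output per prompt, returned in the same order.
    return list(zip(prompts, outputs))


class StubModel:
    """Hypothetical stand-in so the sketch runs without xTuring installed."""

    def __init__(self):
        self._config = SimpleNamespace(max_new_tokens=256)

    def generation_config(self):
        return self._config

    def generate(self, texts):
        # Fake "generation": echo each prompt, cut to max_new_tokens words.
        limit = self._config.max_new_tokens
        return [" ".join(text.split()[:limit]) for text in texts]


pairs = run_inference(
    StubModel(),
    ["Why are the LLM models important?"],
    max_new_tokens=3,
)
for prompt, output in pairs:
    print(f"{prompt!r} -> {output!r}")
```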


Parameter | Type | Range | Default | Description
max_new_tokens | int | ≥1 | 256 | The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
penalty_alpha | float | [0,1) | 0.6 | For contrastive search decoding. The value balances the model confidence and the degeneration penalty.
top_k | int | ≥0 | 4 | For the contrastive search and sampling decoding methods. The number of highest-probability vocabulary tokens to keep for top-k filtering.
do_sample | bool | {true, false} | false | Whether or not to use sampling.
top_p | float | ≥0 | 0 | For the sampling decoding method. If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher is kept for generation.
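The ranges listed above can be checked before running inference. The following is a minimal sketch, not an xTuring feature: `check_generation_config` is a hypothetical helper, and a `SimpleNamespace` stands in for the real configuration object. It only encodes the ranges stated on this page.

```python
from types import SimpleNamespace


def check_generation_config(config):
    """Return a list of range violations, based on the ranges listed on this page."""
    problems = []
    if not (isinstance(config.max_new_tokens, int) and config.max_new_tokens >= 1):
        problems.append("max_new_tokens must be an int >= 1")
    if not (0 <= config.penalty_alpha < 1):
        problems.append("penalty_alpha must be in [0, 1)")
    if config.top_k < 0:
        problems.append("top_k must be >= 0")
    if not isinstance(config.do_sample, bool):
        problems.append("do_sample must be a bool")
    if config.top_p < 0:
        problems.append("top_p must be >= 0")
    return problems


# Stand-in (assumption) holding the default values listed on this page.
defaults = SimpleNamespace(
    max_new_tokens=256, penalty_alpha=0.6, top_k=4, do_sample=False, top_p=0
)
print(check_generation_config(defaults))

# Two values outside their listed ranges.
broken = SimpleNamespace(
    max_new_tokens=0, penalty_alpha=1.0, top_k=4, do_sample=False, top_p=0
)
print(check_generation_config(broken))
```

Returning a list of messages instead of raising on the first problem lets a caller report every out-of-range value at once.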