Predictive LLMs: Can GPT-3.5 enhance XGBoost predicitions?

Large Language Models (LLMs), such as GPTs (Generative Pre-trained Transformers), are trained to understand and generate human language. Their applications range from answering questions and creating texts to more complex tasks like translations and automated content generation.

But can we also use LLMs for predicting metric target variables? We asked ourselves this question because a large part of our projects revolves around forecasting metric values or trends. It leads us to take a closer look at the advantages LLMs offer over traditional machine learning methods like XGBoost. LLMs are not only capable of seamlessly processing a wide variety of data formats - from metric to binary, categorical, and free-text data - but they also bring a remarkable ability to recognize complex patterns and relationships because they are pretrained on huge datasets. Compared to standard methods like XGBoost, LLMs thus offer great potential in terms of feature generation and forecast accuracy. Below, we demonstrate how Large Language Models (LLMs), compared to XGBoost, can enhance the forecasting accuracy of a metric target variable.

Use Case

Through our partner LOT Internet, we have gained access to a dataset of vehicles listed on the platform Our main focus is on predicting the vehicle price, while considering various vehicle features such as mileage and age as influencing factors. In addition to these tabular features, the text description in the dataset is of particular interest to us, where sellers provide detailed descriptions of the car's condition and the terms of sale. With the help of an LLM, we can also use the text description as a feature.


In the first step, we convert the tabular data into a text format. An example prompt looks like this:

"Note that the cars we are talking about are offered on the platform and their make is Skoda. Estimate the value of my used 2020 Skoda SCALA, a sedan with a 999cc petrol engine, cloth upholstery, 5 seats, 11801 mileage, and 69 kW power output in euros. Consider in your estimate that the equipment of the Skoda includes -List of Equipment-."

In cases where we also include the text description, we simply add another line to the prompt:

"For your estimate, consider also the following German description of the car: (German description)."

This is the so-called User-Prompt. Preceding this is the system prompt, which controls the general character of the responses from the LLM. In our case, we write:

"I want you to be very short and precise in your answers and only answer with numbers. Remember also that we are talking about the value of used cars for which characteristics such as mileage, construction date, body type, engine displacement, power in kW, fuel use, model, make, equipment on board, and the number of seats provide important information. I want only a number as an answer."

We have carefully chosen the prompts to ensure that the language model provides an estimate that is not too far from the actual car price. Indeed, it is possible to ask the model a specific question for each car and get a price estimate. This approach is called "zero-shot learning", as the model has not seen any training data beforehand. In our case, the forecast accuracy in zero-shot learning is significantly lower than the accuracy of XGBoost.

The crucial step for the success of the prediction by the language model lies in what is called Finetuning. This involves adapting an already pre-trained model specifically for a particular task. In our case, this task is to predict vehicle prices. For this, we need to provide the model with example dialogs with the corresponding answers. During finetuning, each dialog, as in the previous example, along with the correct answer (the vehicle price), is used as a training observation.

We finetuned the openAI LLM gpt-3.5-turbo via the openAI API and then predicted vehicle prices on a test dataset. In XGBoost, we use the same features except for the text description. Who wins the challenge? Here are the results.

ModelMean APEMedian APE
XGBoost (n = 6000)14.1 %7.5 %
XGBoost (n = 5000)14.5 %7.6 %
XGBoost (n = 4000)14.7 %7.5 %
XGBoost (n = 3000)15.1 %7.6 %
XGBoost (n = 2000)16.0 %7.7 %
XGBoost (n = 1000)17.6 %8.5 %
XGBoost (n = 600)20.4 %8.8 %
LLM in zero-shot learning (n=0)27.2 %15.1 %
LLM without text description with finetuning (n = 600)14.0 %7.5 %
LLM with text description with finetuning (n = 600)11.8 %6.6 %

Let's first consider the forecast accuracy of XGBoost with 6000 training observations. In this case, text description is not used as a feature, which would require further preprocessing. We see that the mean absolute percentage error (MAPE) between the predicted and actual vehicle prices is 14.1%. This is our key benchmark. In the case of zero-shot learning, the predictions of the LLM are significantly worse (MAPE 27.2%). However, even without text description, the forecast accuracy achieved through finetuning reaches a similar level to the benchmark (14.0 vs. 14.1%). It is also noteworthy that the LLM achieves this with only 600 training observations, while XGBoost needs 6000 observations for the comparison value. The significance of the number of training observations will be discussed further below. First, it should be highlighted that the text description provides the finetuned LLM with a substantial advantage, improving forecast accuracy over the XGBoost benchmark by 2.3 percentage points to a MAPE of 11.8%.

Discussion of Case Numbers

The LLM was already trained on a huge amount of data before finetuning. Does this affect how many training observations it needs to make good predictions? Our results suggest that it does. The LLM achieves the same level of forecast quality with 600 training observations as XGBoost does with 6000. We attribute this to the effect of pretraining, through which the LLM possesses a broad base of knowledge that is also useful for the specific use case. This advantage underpins the new trend towards so-called foundation models, of which LLMs are a special case. How does forecast quality change when the number of training observations is varied? In our experiments with the openAI LLM, we found that there is no significant difference between using 600 or 6000 observations in finetuning. After the first 600 observations, the LLM in our case does not extract any new information from the data. This is different from XGBoost. Using this algorithm, we observe a continuous improvement in forecast quality as we incrementally increase the training observations from 600 to 6000. If we train XGBoost with the same 600 observations as the LLM, XGBoost does not come close to the quality of the LLM (MAPEs of 20.4% vs. 14.0%). We conclude from this that the use of LLMs can be very worthwhile, especially with a small number of observations.


Should we replace established algorithms like XGBoost with LLMs right now? That would be a premature conclusion. When the number of training observations is large enough and the data is purely tabular, our results show that XGBoost is still just as good as the openAI LLM. Additionally, LLMs have the disadvantage of lacking explainability. Even though text descriptions significantly improve the forecast, it is not immediately clear which aspects of the text description are responsible for this improvement. Overall, it is impressive how well the LLM can predict metric vehicle prices. In our use case, the LLM proves superior in two scenarios. In the first scenario, we also use text descriptions as a feature. The LLM enables a straightforward combination of tabular features with an unstructured data source. The second scenario involves a small number of training observations, in our case 600. Here, the finetuned LLM significantly outperforms XGBoost in terms of forecast accuracy.