Predictive LLMs: Can open source models outperform OpenAI when it comes to price forecasts?
In our last blog article, we showed how the Large Language Model (LLM) GPT-3.5 from OpenAI can significantly improve the accuracy of vehicle price forecasts. It is particularly impressive that, in our use case, the LLM outperforms the high-performance machine learning algorithm XGBoost. The advantage of the LLM is that it processes the combination of text description and tabular data remarkably well. In this article, we repeat the experiment and examine whether open source LLMs are able to keep up with the powerful models from OpenAI. Open source LLMs are often made available on the Hugging Face platform. Among others, we use Llama-3.1-8b-Instruct (Meta) and Mistral-7B-Instruct, which was developed by the French company Mistral AI.
Use Case
In this project, our partner LOT Internet gave us access to a data set of vehicles listed on the mobile.de platform. Our aim is to predict the sales price of the vehicles. To do this, we use various vehicle attributes such as mileage, year of manufacture and engine power. Particularly valuable, however, are the text descriptions in which sellers provide detailed information about the condition and equipment of the vehicles. We use this text data as an additional feature to increase prediction accuracy. In contrast to conventional machine learning methods such as XGBoost, which mainly process tabular data, LLMs can efficiently integrate both structured and unstructured data.
Implementation
The technical implementation with the open-source models poses a number of challenges. These include the selection of a model with multilingual capabilities, the compatibility of the model size with our GPU memory capacity, the technical setup of our GPU and the selection of a suitable Python code base for fine-tuning. First of all, the choice of model was limited by the fact that it had to handle both German and English, because the text descriptions in our use case are written in German. When we first started this analysis, such a model was not easy to find. We came across leo-hessianai-7b, a version of Meta's Llama 2 that was fine-tuned on German texts. Llama-3.1 can now speak German out of the box. Our company's own GPU is an Nvidia L40S with 48 GB of memory; with the help of the CUDA Toolkit, we can fine-tune models with around 7-8 billion parameters on it (a short pytorch check of this setup follows after the model list below). We are focussing on the following open source models:
- Llama-3.1-8b-Instruct (Meta)
- Mistral-7B-Instruct (Mistral AI)
- leo-hessianai-7b (Hessian AI)
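Before starting a fine-tuning run on this hardware, a quick check that pytorch actually sees the GPU and its memory can save time. A minimal sketch, assuming a pytorch build with CUDA support:

```python
# Quick sanity check that pytorch sees the GPU before starting a fine-tuning run.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU visible to pytorch"
props = torch.cuda.get_device_properties(0)
print(props.name)                                     # e.g. "NVIDIA L40S"
print(f"{props.total_memory / 1024**3:.0f} GB VRAM")  # should report roughly 48 GB for the L40S
```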
We use the following Python libraries for implementation:
- pytorch: For training neural networks; it is pulled in as a dependency by all the libraries we use.
- torchtune: pytorch-native library for the fine-tuning of LLMs. We use it for the Llama-3.1.
- mistral-finetune code base: We use it to fine-tune the Mistral model.
Hugging Face libraries:
- transformers: Toolkit for the use of pre-trained models.
- trl: Enables supervised fine-tuning.
- peft: For parameter-efficient fine tuning.
Torchtune scores points with its structured documentation and promises a high level of robustness: ‘The library will never be the first to provide a feature, but available features will be thoroughly tested.’ The strength of the Hugging Face libraries is the wide variety of models that can be used with them; at times, this variety leads to confusion. The mistral-finetune code base is the library with which we could most quickly perform fine-tuning and predictions by following the instructions in the repo. The fact that this code is tailored solely to the Mistral models certainly contributes to its good usability; at the same time, this limits the reusability of the code when switching to other models.
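To give an impression of the Hugging Face route, the following is a minimal sketch of a LoRA-based supervised fine-tune with transformers, peft and trl. It is not the exact pipeline we used for each model (Llama-3.1 was fine-tuned with torchtune, Mistral with mistral-finetune); the model name, data file and hyperparameter values are illustrative assumptions, and argument names can differ slightly between trl versions.

```python
# Minimal sketch of a LoRA supervised fine-tune with the Hugging Face stack.
# Model name, data file and hyperparameters are illustrative, not our exact setup.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# One chat-formatted example per vehicle: attributes + seller description in, price out.
train_data = load_dataset("json", data_files="train_chats.jsonl", split="train")  # hypothetical file

# Parameter-efficient fine-tuning: only small adapter matrices are trained.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="llama31-price-forecast",
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        max_steps=1000,
    ),
)
trainer.train()
```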
Implementation steps
- Bring data into chat format: First, the tabular data was translated into a text format, similar to our first experiment with GPT-3.5 (a sketch of this conversion follows after this list).
- Load model and tokeniser: The models and tokenisers were downloaded from Hugging Face, or in the case of Mistral, directly from their website. The tokeniser is an essential component, as it converts the text into a form that the model can understand. The model itself is pre-trained and is further customised for the specific task of price estimation.
- Training/fine-tuning: The model was fine-tuned to fit our specific task. Various hyperparameter settings were tested in order to achieve the best performance.
- Optimise the hyperparameters of the fine-tuning: Key hyperparameters included the batch size, the learning rate, and the number of training steps. The training algorithm is stochastic gradient descent. The batch size indicates how many training observations are used to calculate one update of the model parameters; such an update is called a training step. The learning rate determines the size of a training step, while its direction is calculated from the gradient of the loss with respect to the model parameters on the training observations.
- Generation of the forecasts: After training, the price forecasts were generated. We investigated different decoding strategies, but decided in favour of a deterministic strategy, as we only need short answers in the form of numerical values and no creative text answers. Deterministic here means that at each step the token with the highest probability is output. An alternative would be, for example, to sample from the tokens with the three highest probabilities (a sketch of deterministic generation also follows after this list).
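To make the first step more concrete, here is a minimal sketch of how a single vehicle record could be turned into a chat example; the field names and the prompt wording are illustrative assumptions, not our exact prompt.

```python
# Sketch: turn one vehicle record (tabular attributes + free-text description) into a chat example.
# Field names and prompt wording are illustrative assumptions.
def record_to_chat(vehicle: dict) -> dict:
    prompt = (
        "Estimate the sales price in EUR for the following vehicle.\n"
        f"Make/model: {vehicle['make']} {vehicle['model']}\n"
        f"Year of manufacture: {vehicle['year']}\n"
        f"Mileage: {vehicle['mileage_km']} km\n"
        f"Engine power: {vehicle['power_kw']} kW\n"
        f"Seller description: {vehicle['description']}"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            # During fine-tuning, the known sales price is the target answer.
            {"role": "assistant", "content": str(vehicle["price_eur"])},
        ]
    }

example = record_to_chat({
    "make": "VW", "model": "Golf", "year": 2018, "mileage_km": 85000,
    "power_kw": 110, "description": "Scheckheftgepflegt, Nichtraucher, neue Bremsen.",
    "price_eur": 14500,
})
```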
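And for the last step, a minimal sketch of deterministic (greedy) generation with transformers; the model path and prompt are hypothetical placeholders.

```python
# Sketch: deterministic (greedy) generation of a price forecast with transformers.
# Model path and prompt are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "llama31-price-forecast"  # hypothetical path to the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

messages = [{"role": "user", "content": "Estimate the sales price in EUR for the following vehicle. ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs,
        max_new_tokens=10,   # the answer is just a number
        do_sample=False,     # greedy decoding: always pick the most probable token
        # do_sample=True, top_k=3 would instead sample from the three most probable tokens
    )

prediction = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
print(prediction)
```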
Results
The table shows the results of our price predictions with the open source models compared to OpenAI's GPT-3.5 Turbo. The results of XGBoost are also shown for comparison. For all LLM results, we used the text descriptions during fine-tuning. We vary the number of training observations N used for fine-tuning and document the Mean Absolute Percentage Error (Mean APE) and the Median APE on the test dataset.
Model | N | Mean APE | Median APE |
---|---|---|---|
GPT-3.5 Turbo | 600 | 11.80 % | 6.60 % |
Meta-Llama-3.1-8B-Instruct | 600 | 14.60 % | 8.60 % |
Mistral-7B-Instruct-v0.3 | 600 | 56.40 % | 21.30 % |
leo-hessianai-7b | 600 | 23.51 % | 11.77 % |
XGBoost (without text description) | 600 | 20.40 % | 8.80 % |
GPT-3.5 Turbo | 6000 | 11.80 % | 6.60 % |
Meta-Llama-3.1-8B-Instruct | 6000 | 13.20 % | 7.40 % |
Mistral-7B-Instruct-v0.3 | 6000 | 12.60 % | 6.30 % |
leo-hessianai-7b-chat-bilingual | 6000 | 15.35 % | 8.60 % |
leo-hessianai-7b | 6000 | 13.08 % | 7.27 % |
XGBoost (without text description) | 6000 | 14.10 % | 7.50 % |
leo-hessianai-7b | 3000 | 15.34 % | 9.09 % |
leo-hessianai-7b | 1500 | 18.46 % | 11.07 % |
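As a reminder of how these metrics are computed: the Mean APE and Median APE are the mean and median of the absolute percentage errors over the test vehicles. A small sketch with made-up numbers, purely to illustrate the calculation:

```python
# Sketch: Mean and Median Absolute Percentage Error (APE) on made-up example values.
import numpy as np

actual = np.array([12000, 25000, 8000, 31000])     # true sales prices in EUR (made up)
predicted = np.array([11500, 27000, 9500, 30000])  # model forecasts (made up)

ape = np.abs(predicted - actual) / actual * 100    # absolute percentage error per vehicle
print(f"Mean APE:   {ape.mean():.2f} %")
print(f"Median APE: {np.median(ape):.2f} %")
```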
Interpretation and conclusion
Our results show that the open source models tested so far only come close to the accuracy of OpenAI's model with a significantly higher number of training observations. An increasing number of training observations has a positive effect on prediction quality; we illustrate this with leo-hessianai-7b, among others, by gradually increasing the number. XGBoost is outperformed, so the text description brings real added value to the forecast. To be fair, it should be noted that the OpenAI model is much larger (approx. 25×) than the open-source models we tested. However, if we set aside the higher number of training observations, Mistral-7B-Instruct even outperforms the OpenAI model in terms of median APE (6.3 % vs. 6.6 %). We conclude that the open-source models can certainly keep up, and we assume that larger open-source models would require fewer training observations to achieve high accuracy. However, fine-tuning larger models ourselves requires a computing infrastructure that we can only access via cloud providers. This could be an exciting continuation of our experiments.