Blog

XGBoost vs. LLMs for Predictive Analytics

Traditionally, predictive analytics has relied on methods such as regression, decision trees, and established machine learning algorithms like XGBoost. However, rapid progress in artificial intelligence may introduce a new tool: Large Language Models (LLMs). They have the potential to dethrone the current gold standard, XGBoost, even though LLMs are not actually designed to output metric target variables, which is very often what predictive analytics requires. In the following, I explain the advantages and disadvantages of XGBoost and LLMs for predictive analytics applications and offer an assessment of what the new state of the art might look like.

The Classic XGBoost

XGBoost (Extreme Gradient Boosting) is a powerful gradient boosting framework that combines decision trees into precise and fast prediction models. It is optimized for efficiency and accuracy and became widely known through Kaggle competitions, where it has powered many winning solutions.

Advantages of XGBoost in Predictive Analytics

  • High Accuracy: XGBoost often provides better predictions than many other ML algorithms.
  • Efficiency: Optimized for speed and low memory consumption.
  • Feature Importance: Offers insights into how much a feature contributes to improving accuracy ("Gain").
  • Scalability: Works well with large datasets and can be distributed across multiple cores and machines.

Disadvantages of XGBoost in Predictive Analytics

  • Complexity of Hyperparameters: Requires careful tuning of many parameters.
  • Computational Intensity: Despite optimizations, it can be resource-intensive with very large datasets.
  • Input Data: Text data cannot be used directly as features in XGBoost; it must first be converted into numerical features.

Large Language Models (LLMs)

LLMs are based on neural networks that are trained with large text datasets. They can solve complex linguistic tasks such as text generation, translation, and answering questions. Models like GPT-4 are capable of understanding and generating human-like text.

Advantages of LLMs in Predictive Analytics

  • Processing Text and Unstructured Data: Can analyze and interpret large amounts of text data, making unstructured data sources usable for forecasts.
  • Pretraining: LLMs have been pretrained on large text corpora and often bring knowledge relevant to the question at hand. However, this is a double-edged sword, as it also introduces the risk of information leakage! (See episode #43 of our German podcast: Damit es im Live-Betrieb nicht kracht: Vermeidung von Overfitting & Data Leakage.)

Disadvantages of LLMs in Predictive Analytics

  • Information Leakage: The LLM may have been trained on data that already contains the outcome of the prediction question.
  • Lack of Transparency: The model's decision-making process is hard to trace. With open-source models, Explainable AI (XAI) techniques can help mitigate this.
  • Cost: Using LLMs requires either paying per token for a hosted API or provisioning a GPU to run open-source models.

The New State of the Art?

To improve prediction accuracy, XGBoost and LLMs can be combined, especially when relevant text or other unstructured data exist for the prediction task. In this setup, the LLM generates additional features from the text or unstructured data, which XGBoost then uses for the prediction.
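A sketch of this hybrid approach: the `llm_sentiment` function below is a placeholder for a real LLM call (e.g. via an API) that scores each text; here it is stubbed with a trivial keyword rule so the sketch runs standalone. The resulting scores are appended to the structured features as extra columns for XGBoost.

```python
# Sketch: LLM-derived features are appended to structured features for XGBoost.
import numpy as np

def llm_sentiment(text: str) -> float:
    """Placeholder for an LLM call that scores sentiment in [0, 1]."""
    return 1.0 if "great" in text else 0.0

tickets = ["great service", "terrible wait times", "great support team"]
structured = np.array([[34, 2], [51, 7], [28, 1]])  # e.g. age, prior contacts

llm_features = np.array([[llm_sentiment(t)] for t in tickets])
X = np.hstack([structured, llm_features])  # design matrix for XGBoost
```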

Alternatively, the structured features can be combined with the unstructured data and directly passed to an LLM via a prompt to generate predictions.
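The direct-prompting variant can be sketched as follows. `call_llm` is a placeholder for a real LLM API call and returns a canned answer here so the sketch runs without an API key; the feature names and the churn-probability task are invented for illustration. The key steps are serializing structured features and text into one prompt and parsing a number out of the free-text reply:

```python
# Sketch: structured features + text serialized into one prompt; a numeric
# prediction is parsed from the LLM's reply.
import re

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return "Estimated churn probability: 0.27"

def predict(price: float, tenure_months: int, last_review: str) -> float:
    prompt = (
        "Predict the churn probability for this customer.\n"
        f"Monthly price: {price}\n"
        f"Tenure (months): {tenure_months}\n"
        f"Last review: {last_review}\n"
        "Answer with a single number between 0 and 1."
    )
    reply = call_llm(prompt)
    match = re.search(r"\d+\.\d+|\d+", reply)
    return float(match.group())

p = predict(9.99, 14, "support was slow lately")
```

In practice, the parsing step needs to be robust, since nothing forces an LLM to reply with only a number; structured-output features of the respective API can help here.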

Although it is difficult to foresee today, it is conceivable that LLMs alone will represent the new state of the art for predictive analytics in the future. Note, however, that in our tests so far, only fine-tuned LLMs have approached the performance of XGBoost; zero-shot approaches have not.

Conclusion

The combination of LLMs and XGBoost offers a promising way to improve prediction accuracy. While XGBoost remains an excellent choice for structured data, LLMs can extract additional insights from unstructured text data. By integrating both methods, we can leverage the strengths of each approach to build even more precise and robust prediction models. It is already evident that LLMs outperform XGBoost in certain scenarios. For more information, check out our (German) podcast episode: #50: Predictive Analytics mit LLMs: ist GPT3.5 besser als XGBoost?