Blog

100 grams of Lego, please.

Everyone who has children or deep down still is one himself might had the pleasure to find himself in a toy store's Lego counter once. Since it is my son's favourite father-son-activity to build Lego stuff I did so several times in the last year. Now as I stood in front of all those Lego boxes studying prices and themes, while my son carefully pondered on which box to buy, I became suspicious because the pricing of Lego Star Wars seemed to be somewhat higher than for other Lego themes (Star Wars is my sons favourite Lego theme so I'm kinda interested in its pricing). One thought further I wondered if there are any usable data in the world wide web on Lego sets to do some crunching.

As you're reading this article, the answer is yes!

The website https://brickset.com has detailed data on Lego sets from the past 60 years that can easily be accessed via its API. Additionally, each sets website offers some more information that can be obtained by web scraping.

Important Note: If you'd like to execute the code you need to get yourself an API key. This code, unfortunately, won't work with the dummy key.

library("httr")
library("XML")
library("rvest")
library("cowplot")
library("Hmisc")
library("lme4")
library("dplyr")
library("tidyr") 
library("ggplot2")

Get Data From Brickset API

You need to get yourself an API key. There you go: https://brickset.com/tools/webservices/requestkey. Copy the key below where the example key is specified. Mind the quotes and the hyphen.

apiKey <- "XXXDUMMYXXX"
years  <- 2010:2016

# dont do linebreaks within url character string

url <- paste0("brickset.com/api/v2.asmx/getSets?&query=&apiKey=",
              apiKey,
              "&userHash=&theme=&subtheme=&setNumber=&year=",
              paste(years, collapse = ","),
              "&Owned=&Wanted=&orderBy=&pageSize=20000&pageNumber=&userName=")
xmlDat <- url %>% GET()


# check if it worked
http_status(xmlDat)


# write xml content to data frame
lego <- xmlDat %>% xmlParse() %>% xmlToDataFrame()

Pick Relevant Variables And Set Correct Class

The dataset has lots of variables that are redundant to this analysis. So let's clean up a little bit.

lego <- lego %>% 
 select(., setID, name, year, theme, pieces, UKRetailPrice, USRetailPrice, 
                                    EURetailPrice, bricksetURL)

num <- c("year", "pieces", "UKRetailPrice", "USRetailPrice", "EURetailPrice")
lego[, num] <- sapply(lego[, num], as.numeric)

Now take a look on which Lego themes we've got. 

lego %>% 
  group_by(theme) %>%
  tally(sort = TRUE)

There are some themes that are no 'real' themes or hard to compare, i.e. "Books" or "Gear". That's why I exclude some of them. Then I'll pick the 4 themes with the most sets (Star Wars is one of them).

lego <- lego %>% 
  filter(!theme %in% c("Gear", "Collectable Minifigures", "Duplo", "Books", "Promotional", "Miscellaneous"))

relevantThemes <- lego %>% 
  group_by(theme) %>% 
  tally(sort = TRUE) %>% 
  slice(1:4) %>% 
  .[["theme"]]

lego <- lego %>% filter(., theme %in% relevantThemes)

Web Scraping For Additional Information

If you take a look at the variables in the dataset and if you take a further look on the brickset website for each set you might notice that there are some information in the web that we don't have in our dataset. Such things as the weight and the dimensions of a set could be quite helpful information here. So what we'll do is scrape each set-website (luckily the brickset URL is one of the variables in the dataset) and get us the data we want.

for (i in 1:length(lego$bricksetURL)) {
  r <- GET(lego$bricksetURL[i])
  
  # get features and infos
  feat <- r %>% read_html %>% html_nodes("dt") %>% xml_text()
  info <- r %>% read_html %>% html_nodes("dd") %>% xml_text()
  names(info) <- feat
  
  # write them into the lego table
  lego$weight[i] <- info["Weight"]
  lego$volume[i]    <- info["Dimensions"]
}

Make Information Gained From Web 'Workable'

Okay, the information is there but if you take a look at 'weight' for example, you'll see that these character strings are far from ready to work with.

lego$weight <- lego$weight %>% 
  gsub("Kg.*", "", .) %>%  # extract the weight only from character strings
  as.numeric()*1000        # get grams instead of kilo

lego$volume <- lego$volume %>% 
  strsplit(., " x | cm") %>%  # cut character string where 'x' and 'cm' is
  lapply(., `[`, 1:3) %>%     # take the three first arguments from each list (length, width, height)
  lapply(., as.numeric) %>%   # convert to numeric
  lapply(., prod ) %>%        # dimension in cubic centimeter 
  unlist() / 1000             # calculate the packaging size in liters (1 liter = 1000 cubic centimeter)
  

# pricePerPiece (€) can be calculated by hand
lego <- mutate(lego, pricePerPiece = lego$EURetailPrice/lego$pieces)

Descriptive Statistics

Before start doing analysis with our nice and clean Lego dataset, we should take a look at some descriptive stats. This task is facilitated tremendously by the describe()-function from the package 'Hmisc'. 

describe(lego)

Use Cases Without NAs Only

As the description of the dataset has shown, there are some variables that have missing values. In this analysis we only want to use sets without any missing value (NA).

# how much sets are without NAs?
sum(complete.cases(lego))

# exclude sets with NAs
lego <- lego[complete.cases(lego), ]

Pricing

# first get rid of the extreme outliers in weight and pieces
lego <- lego %>% filter(!weight/pieces > quantile(weight/pieces, na.rm = T, probs = c(.99))) 

For the reason that I live in Germany I will work with the € price of the Lego sets. The dashed lines refer to the average value of the corresponding variable.

# Price ~ Packaging Size by Theme
p1 <- ggplot(lego, aes(volume, EURetailPrice)) + 
  geom_point(aes(color = theme), size = 2, alpha = .5) + 
  geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") +
  geom_vline(xintercept = mean(lego$volume), lty = "dashed") +
  ylab("Price [€]") + xlab("Packaging Size [liter]")  +
  theme(legend.position = "none") 

# Price ~ Pieces
p2 <- ggplot(lego, aes(pieces, EURetailPrice)) + 
  geom_point(aes(color = theme), size = 2, alpha = .5) + 
  geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") + 
  geom_vline(xintercept = mean(lego$pieces), lty = "dashed") +
  ylab("") + xlab("Pieces") +
  theme(legend.position = "none", axis.text.y = element_blank())

# Price ~ Weight
p3 <- ggplot(lego, aes(weight, EURetailPrice)) + 
  geom_point(aes(color = theme), size = 2, alpha = .5) + 
  geom_hline(yintercept = mean(lego$EURetailPrice), lty = "dashed") + 
  geom_vline(xintercept = mean(lego$weight), lty = "dashed") +
  ylab("") + xlab("Weight [g]") +
  theme(axis.text.y = element_blank())
# combine plots
plot_grid(p1, p2, p3, nrow=1, rel_widths = c(2, 2, 2.5))

p4 <- ggplot(lego, aes(weight/pieces, EURetailPrice/pieces)) + 
  geom_point(aes(color = theme, size = lego$pieces), alpha = .5) +
  xlab("Weight Per Piece [g]") +
  ylab("Price Per Piece [€]") +
  theme(legend.position = c(0.2,0.8)) +
  scale_size(guide = 'none') 

p5 <-   ggplot(lego, aes(x = theme, y = EURetailPrice/pieces)) +
  geom_boxplot(aes(group = theme, color = theme)) +
  ylab("") + xlab("") + theme_classic() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), 
        legend.position = "none") 

p6 <-   ggplot(lego, aes(x = theme, y = weight/pieces)) +
  geom_boxplot(aes(group = theme, color = theme)) +
  ylab("") + xlab("") + theme_classic() + coord_flip() +
  theme(axis.text = element_blank(), axis.ticks = element_blank(), legend.position = "none") 
plot_grid(p6, NULL, p4, p5, ncol = 2, rel_widths = c(1,0.3), rel_heights = c(0.3,1))

The plot shows the relation of the weight per piece and the price per piece, note that the size of the points is proportional to the number of pieces in a set. The boxplots show the distribution of 'EURetailPrice/pieces' and 'weight/pieces'.
It can be seen that on average the Star Wars Lego theme has the lightest but most expensive pieces.

Price of 1g Lego by theme and currency area

Now I wondered if Lego has a different price policy in the other currency areas. For each I regressed the price per gram on the theme. In order to get a regression coefficient for every factor of theme I excluded the intercept. 

# USA
fit <- lm(USRetailPrice/weight ~ 0 + theme, data = lego)

PPG_USA <- coef(fit) %>% 
  setNames(., unlist(fit$x)) %>% 
  round(., 4)*100 
# EU
fit <- lm(EURetailPrice/weight ~ 0 + theme, data = lego)

PPG_EU <- coef(fit) %>% 
  setNames(., unlist(fit$x)) %>% 
  round(., 4)*100 
# UK
fit <- lm(UKRetailPrice/weight ~ 0 + theme, data = lego)

PPG_UK <- coef(fit) %>% 
  setNames(., unlist(fit$x)) %>% 
  round(., 4)*100 

Results

The prices per 1 gram Lego in the table below are given in Cents ($) for the US, Pence (£) for the UK and in Cents (€) for the EU. 

CityFriendsNinjagoStar Wars
USA6.507.377.988.16
UK4.826.206.357.40
EU5.596.917.938.95

Roughly there is the same price relation among the themes in each currency area. But still you pay appr. 37 % more for Lego Star Wars compared to City in the EU. In the UK it is 35 % and in the US 20 %. 

Price Development (per 1g) By Theme And Year

One information we still didn't use is the year, let's change that and take a look at the yearly price development for each theme. 

# regression grouped by year
lego$year <- as.factor(lego$year)
lego$theme <- as.factor(lego$theme)

priceDev <- lego %>% 
              lmList(EURetailPrice/weight ~ 0 + theme | year, data = .)  %>%   # regression grouped by year
              coef() %>%                                                # get regression coefficients
              data.frame() %>%                                          # create a data frame 
              setNames(., gsub('theme', '', names(.))) %>%              # nice colnames
              mutate(., year = as.numeric(rownames(.)))                 # add 'year' variable from rownames
              

# for ggplot we need to have a clean data frame (each variable in one column)
priceDev <- gather(priceDev, theme, pricePerGram, -year)   
priceDev <- lego %>% 
  group_by(year) %>% 
  nest() %>% 
  mutate(model = purrr::map(data, ~ lm(EURetailPrice/weight ~ 0 + theme, data = .))) %>%  
  unnest(model %>% purrr::map(broom::tidy)) %>% 
  select(year, term, estimate, std.error) 


priceDev$term <- gsub('theme', '', priceDev$term) 
ggplot(priceDev, aes(x = year, y = estimate, group = term)) + 
   geom_point(aes(color = term), size = 3) + geom_line(aes(color = term), size = 1) +
   geom_errorbar(aes(ymax = estimate + std.error, ymin = estimate - std.error, width=0.2, color = term), alpha = 0.5)

So the good thing is that I was right with my assumption concerning Lego Star Wars. But unfortunately this fact wont invalidate my son's argument that it is Lego Star Wars he likes to play with most.