$\begingroup$

At an airport there is a company that offers a specific service for travellers (there are no competitors offering the same service at the location). They performed an experiment - they temporarily reduced the price of the service (quite significantly) hoping they might achieve more sales. They have records about all sales for about 3 months before and 3 months after the reduction of the price and in addition they know how many travellers visited the airport every day within those 6 months.

The task is to compare two samples:

Sample 1: total sales $(x_{1,1}, x_{1,2}, ..., x_{1,m})$ per day for $m$ consecutive days before the price was reduced ($m$ is approximately 90, i.e. 3 months)

Sample 2: total sales $(x_{2,1}, x_{2,2}, ..., x_{2,n})$ per day for $n$ consecutive days after the price was reduced ($n$ is also approximately 90, i.e. 3 months)

They would like to know whether the price reduction influenced sales. Judging by plots of the data alone, it appears that it did not, but they would like to support this conclusion with statistical arguments.

The first idea that came to my mind is that a variant of the two-sample $t$-test could be used in this case. The problem is that the number of travellers visiting the airport is known for each day, and it differs from day to day. Obviously, if there are more people at the airport, sales are higher. Let's denote by $(y_{1,1}, y_{1,2}, ..., y_{1,m})$ the numbers of travellers visiting the airport on the days before the price reduction and by $(y_{2,1}, y_{2,2}, ..., y_{2,n})$ the numbers of travellers visiting the airport on the days after the price reduction. My idea is to "standardize" sales by dividing them by the numbers of travellers, i.e. $z_{i,j} = x_{i,j}/y_{i,j}$.
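A minimal sketch of this standardization followed by a two-sample $t$-test in R. The data are simulated placeholders, not the company's records, and the names `sales_before`, `travellers_before`, etc. are made up for illustration. Note that `t.test` uses Welch's unequal-variance variant by default.

```r
set.seed(1)
m <- 90; n <- 90

# Simulated daily traveller counts (placeholder values, an assumption)
travellers_before <- round(runif(m, 10000, 30000))
travellers_after  <- round(runif(n, 10000, 30000))

# Simulated daily sales, roughly proportional to the traveller counts
sales_before <- travellers_before * rlnorm(m, meanlog = log(0.005), sdlog = 0.5)
sales_after  <- travellers_after  * rlnorm(n, meanlog = log(0.005), sdlog = 0.5)

# "Standardized" sales: z_{i,j} = x_{i,j} / y_{i,j}
z_before <- sales_before / travellers_before
z_after  <- sales_after  / travellers_after

# Welch two-sample t-test on the per-traveller values
result <- t.test(z_before, z_after)
print(result$p.value)
```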

Questions I have:

  1. Does it make sense to apply such a standardization to the data, considering I want to compare the resulting "standardized" samples $(z_{1,1}, z_{1,2}, ..., z_{1,m})$ and $(z_{2,1}, z_{2,2}, ..., z_{2,n})$? Would a two-sample $t$-test be suitable for such data? (What worries me is that each observed daily sales value has a different weight $1/y_{i,j}$ applied to it when it is standardized, and I am not sure whether this breaks any assumption of the two-sample $t$-test.)
  2. Could a different type of analysis be applied to the data to decide whether sales were influenced by the price reduction?
$\endgroup$

1 Answer

$\begingroup$

I would use a lognormal regression, since sales are usually quite skewed and also the result of a lognormal regression is easier to interpret.

Create a data matrix with $n+m$ rows. Each row represents one day. The variable $Y$ is created by concatenating the vectors $x_1$ and $x_2$. The variable $X$ comes next in the matrix and contains only zeros and ones: $m$ zeros followed by $n$ ones, indicating days before and after the price reduction. Finally, we add the number of travellers ($N$), which is the concatenation of the vectors $y_1$ and $y_2$ in your notation.
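The construction above can be sketched as follows. The vectors `x1`, `x2`, `y1`, `y2` follow the question's notation but are filled with simulated placeholder values here, so the numbers themselves are assumptions for illustration only.

```r
set.seed(1)
m <- 90; n <- 90

# Placeholder inputs in the question's notation (simulated, not real data)
y1 <- round(runif(m, 10000, 30000))    # travellers per day, before
y2 <- round(runif(n, 10000, 30000))    # travellers per day, after
x1 <- y1 * rlnorm(m, log(0.005), 0.5)  # sales per day, before
x2 <- y2 * rlnorm(n, log(0.005), 0.5)  # sales per day, after

# One row per day: sales, before/after indicator, traveller count
yourData <- data.frame(
  Y = c(x1, x2),                       # daily sales
  X = rep(c(0, 1), times = c(m, n)),   # 0 = before, 1 = after the reduction
  N = c(y1, y2)                        # daily traveller counts
)
str(yourData)
```

Because the model takes `log(Y)`, days with zero sales would need separate handling before fitting.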

If you are using R, then your model will be as follows:

    # Y = daily sales, X = 0 before / 1 after the price reduction, N = daily travellers
    fit <- lm(log(Y) ~ X, offset = log(N), data = yourData)

    # statistical significance of the change in sales per traveller
    coef(summary(fit))[2, ]

    # effect size (by what percentage sales per traveller changed)
    effect_size <- (exp(coef(summary(fit))[2, 1]) - 1) * 100

    paste0("The average sales per traveller changed by ", round(effect_size), " percent.")

Why is there an offset in the regression? Because your dependent variable is not sales ($Y$) but sales per traveller ($Y/N$):

$\log(Y/N) = \beta_0 + \beta_1 X$

$\log(Y) - \log(N) = \beta_0 + \beta_1 X$

$\log(Y) = \beta_0 + \beta_1 X + \log(N)$
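As a quick check of this equivalence, the sketch below fits the model both ways on simulated placeholder data (the data frame `d` and its values are assumptions for illustration) and compares the coefficients. Fitting with the offset and regressing `log(Y / N)` directly should give the same estimates.

```r
set.seed(1)

# Simulated placeholder data: 90 days before (X = 0) and 90 days after (X = 1)
d <- data.frame(
  X = rep(c(0, 1), each = 90),
  N = round(runif(180, 10000, 30000))
)
d$Y <- d$N * rlnorm(180, meanlog = log(0.005), sdlog = 0.5)

# Model with log(N) as an offset (coefficient fixed at 1)
fit_offset <- lm(log(Y) ~ X, offset = log(N), data = d)

# Same model written directly on the log of sales per traveller
fit_ratio <- lm(log(Y / N) ~ X, data = d)

# The two parameterizations give the same coefficients
print(all.equal(coef(fit_offset), coef(fit_ratio)))
```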

$\endgroup$
  • $\begingroup$ Thanks a lot Daniel. I will study your answer in the next few hours and if I find it suitable for my use case I will accept it. $\endgroup$
    – mcihak
    Commented Jan 10, 2023 at 15:19
  • $\begingroup$ Have you tried it? Did it bring the desired result? $\endgroup$ Commented Jan 12, 2023 at 18:31
  • $\begingroup$ Yes, definitely, and I have to say that I like the model. Calculations in R led to $b_1 = 0.17$ approx. That means the effect size is about 18.5 %. The $p$-value is about $0.2$, so I deduced that the sales change is not statistically significant. I would say this is due to the quite high variance of sales. So even though the effect size is pretty high, we can't reject the hypothesis that $\beta_1 = 0$ at a significance level of 5 %. $\endgroup$
    – mcihak
    Commented Jan 12, 2023 at 21:13
  • $\begingroup$ To be honest, there is one thing I have doubts about. In the lognormal model we consider $m+n$ values $z_{ij}$ (in my notation). These $z_{ij}$ values represent sales per traveller on particular days. The issue I can see is that we don't consider the importance (weight) of these values $z_{ij}$ in the model. $\endgroup$
    – mcihak
    Commented Jan 12, 2023 at 21:14
  • $\begingroup$ I mean, let's say that on one day sales per person are 5 currency units while 30 000 travellers visited the airport, and on another day they are 12 currency units while 10 000 travellers visited the airport. In my opinion the values 5 and 12 shouldn't have the same weight in the model. I would say the weight of 5 should be three times bigger because it represents three times as many potential customers. (I had to split the comment into 3 comments because of the character limit :) $\endgroup$
    – mcihak
    Commented Jan 12, 2023 at 21:15
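
One possible way to let busier days count for more is a weighted fit. The sketch below passes the traveller counts as `weights` to `lm`. Weighting proportionally to `N` is an illustrative assumption, not something established in this thread, and the data frame `d` is simulated placeholder data.

```r
set.seed(1)

# Simulated placeholder data: 90 days before (X = 0) and 90 days after (X = 1)
d <- data.frame(
  X = rep(c(0, 1), each = 90),
  N = round(runif(180, 10000, 30000))
)
d$Y <- d$N * rlnorm(180, meanlog = log(0.005), sdlog = 0.5)

# Unweighted fit: every day counts equally
fit_unweighted <- lm(log(Y) ~ X, offset = log(N), data = d)

# Weighted fit: each day's squared residual is weighted by its traveller count
fit_weighted <- lm(log(Y) ~ X, offset = log(N), weights = N, data = d)

# Compare the estimated effect of the price reduction under both fits
print(rbind(
  unweighted = coef(summary(fit_unweighted))["X", ],
  weighted   = coef(summary(fit_weighted))["X", ]
))
```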
