
I am trying to find the earliest use of the term hyperparameter. It is now common in machine learning, but it must have had earlier uses in statistics or optimization theory. Even the multivolume Lexikon der Mathematik (Springer) has no entry for it.

So far, the term can be traced back to 1972: D. V. Lindley and A. F. M. Smith, "Bayes Estimates for the Linear Model", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 34, No. 1 (1972), pp. 1–41.

The authors introduce the term hyperparameter with a footnote:

In the present paper we study situations where we have exchangeable prior knowledge and assume this exchangeability described by a mixture. In the example this implies $E\left(\theta_i\right)=\mu$, say, a common value for each $i$. In other words there is a linear structure to the parameters analogous to the linear structure supposed for the observations $\mathbf{y}$. If we add the premise that the distribution from which the $\theta_i$ appear as a random sample is normal, the parallelism between the two stages, for $\mathbf{y}$ and $\boldsymbol{\theta}$, becomes closer. In this paper we study the situation in which the parameters of the general linear model themselves have a general linear structure in terms of other quantities which we call hyperparameters. $\dagger$ In this simple example there is just one hyperparameter, $\mu$.

Footnote:

$\dagger$ We believe we have borrowed this terminology from I. J. Good but are unable to trace the reference.

I. J. Good was a statistician turned philosopher, but a Google Scholar search of his work from the 1960s onward turns up no evidence that he introduced the term.


1 Answer


In 1996, Irving Good himself recalls:

One of the related problems close to philosophy is the estimation of the probability of one category of a multinomial when the order of the cells is irrelevant. [... This] led me on to the development of a hyperprior for the hyperparameter $k$, which I discussed at some length in my book The Estimation of Probabilities (Good, 1965) and in several later works.


As pointed out by AChem, Good does not use the term hyperparameter in his 1965 book; instead, he speaks of a "flattening constant". In his 1980 paper "Some history of the hierarchical Bayesian methodology", Good explicitly claims to have invented the latter term, without claiming credit for the former.

  • I had checked his book (1965) from the Internet Archive; it is freely available. The term hyperparameter does not appear there at all. Maybe Good's memory was fading when he gave this interview.
    – ACR
    Commented Aug 13, 2023 at 18:55
  • @AChem There's no need to postulate that Good's memory was fading. He claims only that he came up with the term "flattening constant" before Fienberg and Holland did. His use of the term "hyperparameter" in 1996 may have been because he thought that, in the world of 1996, that term was more familiar to his readers, not because he thought he had used the word "hyperparameter" originally. But I agree with you that the 1996 interview does not give us any more information about who coined the term "hyperparameter."
    Commented Aug 17, 2023 at 12:03
