In "Qualitative Descriptors of Strength of Association and Effect Size" (*Journal of Social Service Research*, 1996), James A. Rosenthal indirectly cites Fleiss (1994):
> Fleiss (1994) contends that the odds ratio is the preferred measure of effect size for dichotomous variables. Unlike the phi coefficient, it is not affected by the proportions in the sample that comprise the categories of the independent variable (Fleiss).
Rosenthal then gives the example of two hypothetical interventions tested on young people to prevent delinquent offenses, on two occasions with different sample sizes. Here are two tables visualizing the situations he describes (he does not use tables in the paper; he simply describes the numbers inline):
Table 1

|                | did not commit delinquent offense | committed delinquent offense |
|----------------|----------------------------------:|-----------------------------:|
| intervention A | 90 | 10 |
| intervention B | 50 | 50 |
Table 2

|                | did not commit delinquent offense | committed delinquent offense |
|----------------|----------------------------------:|-----------------------------:|
| intervention A | 180 | 20 |
| intervention B | 10  | 10 |
He explains that the phi coefficient may lead to an incorrect conclusion: the difference in effectiveness between interventions A and B might look larger in the first table than in the second, since the $\phi$ coefficient is 0.436 vs. 0.335. But if you look at the row-wise percentages, that conclusion becomes highly disputable. The problem does not arise with the odds ratio, which is the same (9) in both tables.
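Rosenthal's numbers are easy to verify directly from the standard formulas for $\phi$ and the odds ratio on a 2 × 2 table (a minimal sketch; the function names are mine):

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

# Table 1: intervention A (90, 10) vs. intervention B (50, 50)
print(round(phi(90, 10, 50, 50), 3))    # 0.436
print(odds_ratio(90, 10, 50, 50))       # 9.0

# Table 2: intervention A (180, 20) vs. intervention B (10, 10)
print(round(phi(180, 20, 10, 10), 3))   # 0.335
print(odds_ratio(180, 20, 10, 10))      # 9.0
```

The odds ratio is identical in both tables because the event odds within each row (9:1 for A, 1:1 for B, and 9:1 vs. 1:1 again) are unchanged; only the group sizes differ, which is exactly what $\phi$ is sensitive to.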
In *Effect Sizes for Research*, Robert J. Grissom and John J. Kim, though they raise some criticisms of the odds ratio as well, also argue that (p. 250):
> A phi arising from another study of the same two dichotomous variables, but using a sampling method other than naturalistic sampling, would not be comparable to a phi based on naturalistic sampling; that is, the value of phi can vary across studies using different sampling methods to study the same pair of dichotomous variables.
They also go on to explain the limitations of $\phi$ with respect to its attainable minimum and maximum values (besides Grissom and Kim's book, searching for "phi maximum value" in a specialized search engine should return a couple of relevant papers; otherwise, see the Davenport and El-Sanhurry paper mentioned in the references below).
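The point about attainable values can be demonstrated empirically without quoting the closed-form phi/phimax result: fix the marginal totals of a 2 × 2 table and enumerate every table consistent with them. A brute-force sketch (function names are my own):

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table [[a, b], [c, d]]."""
    den = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / den if den else 0.0

def phi_max(row1_total, row2_total, col1_total):
    """Largest phi attainable under fixed marginals, found by
    enumerating every cell count a consistent with those totals."""
    best = float("-inf")
    for a in range(min(row1_total, col1_total) + 1):
        b = row1_total - a
        c = col1_total - a
        d = row2_total - c
        if b < 0 or c < 0 or d < 0:
            continue
        best = max(best, phi(a, b, c, d))
    return best

# With equal marginals, phi can reach 1 ...
print(phi_max(50, 50, 50))              # 1.0
# ... but with the marginals of Table 2 (rows 200 and 20,
# first column 190) it cannot:
print(round(phi_max(200, 20, 190), 3))  # 0.796
```

So even a "perfect" association, given Table 2's marginals, tops out well below 1, which is one of the comparability problems Grissom and Kim describe.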
Grissom and Kim's recommendation for choosing an effect size for 2 × 2 tables can be found on p. 281 of their book (bold is mine):
> In the case of naturalistic sampling, in which a given number of participants is categorized with respect to two truly dichotomous variables in a 2 × 2 table, possibly appropriate measures of effect size in the population are **the phi coefficient, relative risk, and the odds ratio** [...]
>
> When participants have been randomly assigned into two treatment groups that are to be classified into a 2 × 2 table, appropriate measures of effect size are the population **risk difference, relative risk, and (possibly) odds ratio**.
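For completeness, the three measures recommended for the randomized-assignment case are all simple functions of the cell counts. A short sketch on Rosenthal's Table 1 (function name is mine):

```python
def risk_measures(a, b, c, d):
    """Risk difference, relative risk, and odds ratio for a 2x2 table
    [[a, b], [c, d]] whose columns are (no event, event)."""
    p1 = b / (a + b)   # event rate in group 1
    p2 = d / (c + d)   # event rate in group 2
    return p2 - p1, p2 / p1, (a * d) / (b * c)

# Table 1: delinquency rates are 10% under A and 50% under B
rd, rr, odds = risk_measures(90, 10, 50, 50)
print(round(rd, 3), round(rr, 3), round(odds, 3))  # 0.4 5.0 9.0
```

Note that the three measures answer different questions (absolute difference in rates, ratio of rates, ratio of odds), which is why Grissom and Kim hedge on which is "appropriate" for a given study.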
References
Davenport, E. C., & El-Sanhurry, N. A. (1991). Phi/phimax: Review and synthesis. Educational and Psychological Measurement, 51(4), 821–828. https://doi.org/10.1177/001316449105100403
Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 245–260). Russell Sage Foundation.
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate applications (2nd ed.). Routledge.
Rosenthal, J. A. (1996). Qualitative descriptors of strength of association and effect size. Journal of Social Service Research, 21(4), 37–59. https://doi.org/10.1300/J079v21n04_02