Hypothetical bias: a new meta-analysis


INTRODUCTION
Participants in hypothetical surveys or referenda typically express higher values for goods than do participants faced with similar choices in which the stakes involve real money. Previous meta-analyses have confirmed the widespread presence of hypothetical bias in stated preference studies and have identified certain factors associated with higher or lesser degrees of bias. These studies, and indeed the broader stated preference valuation literature, have not offered any definitive insights that can reliably be used to eliminate these biases. The earlier meta-analyses are now dated and were based on a limited number of studies.
In this chapter we assess the evidence from the literature up to the present time on hypothetical bias. We include many more papers touching on hypothetical bias than were available to or used by the authors of the prior meta-analyses. We also add two variables (not analyzed in the existing literature) to our meta-analysis: one that is designed to capture whether the good in question is likely to be perceived as familiar or unfamiliar to the study's survey participants and a second that indicates whether or not the valuation of the good in question is largely or exclusively generated by non-use considerations.
In the remainder of this chapter, we first discuss how our meta-data were created. We then identify and briefly discuss some of the issues in survey design that have been hypothesized to contribute to the presence or extent of the hypothetical bias exhibited in various studies. We then present results from a regression analysis of the meta-data, and follow with some concluding remarks.

DATA
Our initial sample for our meta-analysis includes all of the relevant comparisons of mean willingness-to-pay values (WTPs) between hypothetical and real survey treatments that we were able to draw from the papers cited in one or more of three previous meta-analyses: those of List and Gallet (2001), Little and Berrens (2004), and Murphy et al. (2005). Similar to the practice adopted in the first two of these studies and in one of the analyses done in Murphy et al. (2005), we analyze only those comparisons from studies that included explicit calculations of mean WTPs across both hypothetical and real treatments. We thus eliminated from further analysis those studies that merely provide percentage yes/no results drawn from dichotomous choice referendum questions and that did not go on to calculate population mean WTPs. To this sample of results drawn from previously cited papers, we added data comparing WTPs drawn from additional papers not cited in prior meta-analyses. The unit of observation is a comparison between a hypothetical WTP and a corresponding real WTP for the same good drawn from the same paper. Any particular paper could contribute one or multiple observations to the dataset, with the number depending on the number of survey variants conducted as a part of the paper's survey design. All told, our sample includes 432 comparisons between hypothetical and real results drawn from 77 studies. These studies are listed in the Bibliography with an asterisk.
For each of the comparisons of inferred WTP from contingent valuation (CV) surveys to inferred WTP from real transactions involving money, we calculate a "bias ratio" (BR) defined as the ratio of the mean WTP drawn from the hypothetical treatment to the mean WTP drawn from the real treatment. The histogram in Figure 1 summarizes the bias ratios we derived from the observations contained in our meta-data. The median value in the distribution is 1.39, while the mean value is 2.33.
The range of BRs exhibited in the distribution is relatively large, with a 5th percentile BR of 0.50 and a 95th percentile value of 8.66. Notably, the shape of the distribution of BRs provided in Figure 1 suggests that the values found in the dataset follow something like a log-normal distribution. Figure 2, which displays a histogram of the BRs arrayed on a logarithmic scale, confirms that the BRs are indeed consistent with a log-normal distribution.
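The bias ratio calculation described above can be sketched in a few lines. This is a minimal illustration only: the WTP pairs are invented, and the function and variable names are hypothetical rather than taken from the study.

```python
import math
import statistics

# Hypothetical (mean hypothetical WTP, mean real WTP) pairs, invented for illustration.
pairs = [(12.0, 8.0), (30.0, 10.0), (5.0, 5.0), (9.0, 12.0), (40.0, 4.0)]

def bias_ratio(hypothetical_wtp, real_wtp):
    """Ratio of mean hypothetical WTP to mean real WTP for one comparison."""
    if real_wtp <= 0:
        raise ValueError("real WTP must be positive for the ratio to be defined")
    return hypothetical_wtp / real_wtp

brs = [bias_ratio(h, r) for h, r in pairs]
log_brs = [math.log(b) for b in brs]

mean_br = statistics.mean(brs)
median_br = statistics.median(brs)
# exp(mean of ln BR) is the geometric mean; it can never exceed the arithmetic
# mean, and for log-normal data it estimates the median.
geometric_mean_br = math.exp(statistics.mean(log_brs))
```

In right-skewed data of this kind the arithmetic mean sits above the median, which is the same pattern as the reported mean of 2.33 against a median of 1.39.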
We calculate a bias ratio (BR) for each observation retained in our data and assign a series of indicator variables to each comparison reflecting factors present or absent in the study's design that have been hypothesized in the literature to influence the extent of hypothetical bias. These factors are:
• whether or not the BR was calculated through use of an ex post certainty correction;
• the presence or absence of a cheap talk script in the survey instrument;
• whether the hypothetical and real observations are drawn from a survey in which a single group of participants is asked to respond to both hypothetical and real treatments (same) or are drawn from two separate survey panels (different);
• whether or not the study uses a conjoint/choice experiment framework rather than any other type of contingent valuation;
• whether or not the survey group consists entirely of students;
• whether or not the survey was administered in a laboratory setting;
• whether the good is a public good or a private good;
• whether the good is likely to be perceived as a familiar or an unfamiliar one by the survey's participants;
• whether the perceived benefits to the survey participants of providing the good are generated primarily by non-use considerations.

Figure 1 Bias ratio frequency distribution

Harry Foster and James Burrows - 9781786434692. Downloaded from Elgar Online at 04/27/2019 07:37:02AM via free access
Each of these nine factors can be used by itself to divide the full dataset into a pair of non-overlapping and fully inclusive subsamples. Table 1 provides summary statistics on the median, mean, and standard deviations for all of the 18 subsamples that can be created in this way. In addition to the median, Table 1 also provides the 5th percentile and 95th percentile BRs for each subsample and p-values associated with an equality of the means test across each relevant pairing.
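As a rough sketch of how one indicator splits the data into two subsamples whose means can then be compared, consider the following. The records are invented, and the chapter does not state which equality-of-means test it uses; Welch's t statistic with a normal approximation to the p-value is an assumption made here for illustration.

```python
import math
import statistics

# Hypothetical records: (bias ratio, cheap talk indicator); invented for illustration.
records = [(1.2, 1), (1.6, 1), (0.9, 1), (2.1, 1),
           (2.8, 0), (1.4, 0), (3.9, 0), (2.2, 0), (5.0, 0)]

def split_by_indicator(rows):
    """Divide the data into the two non-overlapping subsamples defined by a 0/1 indicator."""
    on = [br for br, flag in rows if flag == 1]
    off = [br for br, flag in rows if flag == 0]
    return on, off

def welch_t(a, b):
    """Welch's t statistic for equality of two means (unequal variances allowed)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

with_script, without_script = split_by_indicator(records)
t_stat = welch_t(with_script, without_script)
# Large-sample normal approximation to the two-sided p-value (an assumption;
# with subsamples this small a t distribution would be more appropriate).
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))
```

Repeating the split for each of the nine indicators would yield the 18 subsamples summarized in Table 1.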

Certainty correction
Certainty correction takes a value of one if the observation is derived from a hypothetical treatment employing an ex post certainty correction and is set to zero otherwise. This variable is meant to control for the use in several studies of certainty correction techniques in an attempt to reconcile the differences between mean WTPs exhibited in paired hypothetical and real survey treatments by using data drawn from follow-up questions asking survey recipients to rate how sure they are of the answer they gave to the valuation question. Most commonly, this certainty question asks recipients to rate their degree of certainty in their answers to a hypothetical valuation question on a numerical scale, typically running from one to ten, with one representing "very uncertain" and ten representing "very certain." Alternatively, some studies dispense with creating a numerical scale, instead asking survey participants to indicate the degree of certainty in their responses by choosing the phrase that best describes their level of certainty from a set of qualitative options (for example, "very uncertain," "somewhat uncertain," "somewhat certain" or "highly certain") presented to them in the survey instrument. Such studies typically find that reasonable agreement between hypothetical and real treatments can be obtained if the set of hypothetical responses used to calculate WTPs is limited to the responses given by those who indicate a degree of certainty that meets or exceeds some cut-off value or, alternatively, who opt for qualitative descriptions that indicate a relatively high level of certainty. The researcher chooses the appropriate cut-off point for degree of certainty through an ex post determination of which particular value brings the certainty-corrected hypothetical WTP into closest agreement with the WTP derived from a real treatment. The particular cut-off value that is determined in this manner is survey specific and cannot be predicted a priori. For example, of six studies cited in Morrison and Brown (2009) that employed a ten-point certainty scale, two found closest agreement between real and hypothetical values if the analysis of WTP was limited to responses associated with a degree of certainty of seven or greater, while two other studies found an optimal cut-off at eight, and two found that including only responses equal to ten brought the best fit between hypothetical and real WTP values. Because researchers actively choose, on an ex post basis, the certainty cut-off to be applied to the hypothetical treatment data to mimic the results obtained from an analogous real treatment, calibration factors between certainty-corrected hypothetical WTP results and real treatment WTP values will by design cluster near a value of one. For this reason, our regression models control for observations drawn from "certainty-corrected" or "certainty-calibrated" results. We are unaware of any paper that has analyzed how to set a certainty correction ex ante; that is, there is no procedure available to know what the "right" certainty correction is in advance.
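The ex post search for a cut-off can be sketched as follows. The responses, the real-treatment mean, and the function names are all hypothetical; the sketch only illustrates why a cut-off selected this way pulls the corrected ratio toward one.

```python
import statistics

# Hypothetical responses: (stated WTP, certainty on a 1-10 scale); invented for illustration.
hypothetical_responses = [(20.0, 3), (15.0, 5), (12.0, 7), (10.0, 8),
                          (8.0, 9), (6.0, 10), (25.0, 4), (9.0, 7)]
real_mean_wtp = 8.2  # hypothetical mean WTP from the paired real treatment

def corrected_mean(responses, cutoff):
    """Mean stated WTP using only responses with certainty >= cutoff (None if none qualify)."""
    kept = [wtp for wtp, certainty in responses if certainty >= cutoff]
    return statistics.mean(kept) if kept else None

def best_cutoff(responses, real_mean, scale=range(1, 11)):
    """Ex post search: the cut-off whose corrected mean lies closest to the real mean."""
    candidates = []
    for c in scale:
        m = corrected_mean(responses, c)
        if m is not None:
            candidates.append((abs(m - real_mean), c))
    return min(candidates)[1]

cutoff = best_cutoff(hypothetical_responses, real_mean_wtp)
uncorrected_ratio = corrected_mean(hypothetical_responses, 1) / real_mean_wtp
corrected_ratio = corrected_mean(hypothetical_responses, cutoff) / real_mean_wtp
```

Because the search needs the real-treatment mean as an input, it cannot be run for a survey that has no real counterpart, which is the situation in which a correction would actually be needed.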

Cheap talk
Cheap talk is set to one if the hypothetical treatment used in comparing hypothetical and real responses utilized a "cheap talk" script and is set to zero otherwise. In this approach, survey respondents in the hypothetical treatment group are asked to answer any valuation questions only after they have first been presented with a script informing them of the tendency of participants in prior hypothetical surveys to overstate WTPs and asking them to keep this fact in mind when answering the survey's questions. Most, though not all, studies on the efficacy of cheap talk scripts have found that mean WTPs derived from treatments utilizing cheap talk scripts tend to be lower than those derived from similar hypothetical treatments lacking a cheap talk script, although the differences in the results obtained between the treatments may or may not be statistically significant. Consistent with the expectation that cheap talk scripts should generally dampen the extent of hypothetical bias, we find that the mean BR for study treatments included in our meta-analysis that include a cheap talk script (1.62) is lower than the mean bias ratio for study treatments lacking a cheap talk script (2.42). This difference is significant at any conventional level of statistical significance.

Same respondents vs different respondents
The indicator variable Same is set to one if the survey design has the same person answering the hypothetical and real survey treatments; that is, each participant is first asked to answer hypothetical valuation questions and then is asked to make a real purchase or contribution for the same good. In the alternative "different" sample treatment, participants are divided into two groups, with one group answering only a hypothetical survey and the other group subjected to only the real treatment. Treatments that rely on the same individuals to provide responses for both the hypothetical and then the real valuation exercises are believed to generate smaller bias ratios than treatments that rely on separate and different hypothetical and real treatment groups. When the same participants are asked to respond to a hypothetical treatment and then to a real treatment, their real purchase behavior may be influenced in an upward direction by anchoring or by conscious or unconscious desires to have their real decisions bear a relationship to their answers to the questions asked them in the prior hypothetical treatment. In our sample, the mean bias ratio for within-sample comparisons is slightly lower than that for between-sample comparisons (2.09 vs 2.43), and the difference between these two values is not statistically significant.

Conjoint/choice experiment
The indicator variable Conjoint is set to one if the observation comes from a study using conjoint or choice experiment techniques in which WTPs are derived indirectly from the pattern of choices individual participants express when asked to choose among hypothetical goods that differ in their product attributes. Conjoint is set to zero otherwise, as is the case for studies using some variant of a contingent valuation survey. Some proponents of the choice experiment framework have claimed that it is less susceptible to hypothetical bias than are contingent valuation techniques. In our sample, the mean bias ratio for conjoint/choice experiment elicitation formats is 1.80, versus a mean value of 2.46 for studies using any one of several other elicitation techniques. The difference between these two means is statistically significant at any conventional level.

Student
The indicator variable Student is set to one if a study's participants consist entirely of students and to zero otherwise. It has been hypothesized that survey responses from panels composed of students are likely to reflect a greater degree of hypothetical bias than are those from panels drawn from predominantly non-student populations. In our sample, the mean bias ratio derived from experiments using only student participants is 2.41, while the mean bias ratio derived from studies that used non-student or mixed survey populations is slightly smaller at 2.29. The difference between these two means is not statistically significant.

Lab experiment
The indicator variable Lab is set to one if the hypothetical and real survey instruments were administered in a laboratory setting and to zero otherwise. In our sample, the mean bias ratio derived from experiments conducted in laboratory settings is 1.78, while that derived from studies conducted in other settings is larger, at 2.72. The difference between these two subsample means is statistically significant at all conventional levels.

Private good/public good
The indicator variable Private is set to one if the good that is the subject of the study is a private good and to zero otherwise. Valuations for public goods might be expected to exhibit greater hypothetical bias than those for private goods, given the far greater familiarity survey participants have in engaging in transactions for the purchase of private goods. Contrary to these expectations, in our subsamples the mean value of BR derived from experiments valuing private goods (2.46) is slightly greater than that derived from the public goods subsample (2.24). The difference between these two means is not statistically significant.

Familiar good
Based upon our own best judgment, we have classified observations into those we believe are for goods that are familiar to the population being surveyed and those that are unfamiliar to them. As might be expected, the mean bias ratio derived from experiments valuing familiar goods is lower than that from experiments valuing unfamiliar goods (2.21 vs 2.42, respectively), although the difference in means is not statistically significant. As far as we are aware, this study is the first of its kind to rely upon this distinction to create an explanatory variable for use in a meta-analysis.

Non-use
This variable, the assignment of which is based upon our best judgment, is set to one for goods that we believe generate all or most of their perceived value from non-use considerations. The mean bias ratio in our sample is 2.63 for non-use-value goods and 2.08 for use-value goods. The difference between these two means is statistically significant at the 10% level. As is the case for the familiar/unfamiliar distinction described above, we believe that this study is the first to use this distinction to create an explanatory variable for use in a meta-analysis.

Base Model
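Having assigned the appropriate indicator variable values to each observation retained in our data, we then estimated an equation of the form:

$$
\ln(BR_i) = \beta_0 + \beta_1 \text{Certainty correction}_i + \beta_2 \text{Cheap talk}_i + \beta_3 \text{Same}_i + \beta_4 \text{Conjoint}_i + \beta_5 \text{Student}_i + \beta_6 \text{Lab}_i + \beta_7 \text{Private}_i + \beta_8 \text{Familiar}_i + \beta_9 \text{Non-use}_i + \varepsilon_i
$$

where ln (BR), the natural logarithm of the bias ratio for each observation, is the dependent variable, the explanatory variables are a constant and the indicator variables defined in the previous section, and the final term is an error term.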
Results of this initial OLS regression analysis are displayed in columns 1 and 2 of Table 2. Because some studies contribute multiple observations to the data, we follow the practice of Little and Berrens (2004) and estimate and report clustered standard errors, with each paper represented in the data forming a separate cluster. In addition, we estimate the equation using both an unweighted and a weighted sample. In the unweighted version, each observation carries equal weight in the estimation, no matter how many other observations may be drawn from the same paper. In the weighted version, each observation is weighted by the inverse of the number of comparisons in the dataset derived from the same study; that is, each comparison from a paper contributing n observations to the dataset is weighted by a factor of 1/n.
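The 1/n weighting scheme can be illustrated with a short sketch. The study identifiers are invented and the helper name is hypothetical; only the weighting rule itself comes from the text.

```python
from collections import Counter

# Hypothetical study identifiers, one entry per comparison; invented for illustration.
study_ids = ["A", "A", "A", "A", "B", "C", "C"]

def inverse_count_weights(ids):
    """Weight each comparison by 1/n, where n is the number of comparisons from its study."""
    counts = Counter(ids)
    return [1.0 / counts[s] for s in ids]

weights = inverse_count_weights(study_ids)

# Total weight contributed by each study: every study sums to one,
# so a paper with many survey variants no longer dominates the estimates.
weight_by_study = Counter()
for s, w in zip(study_ids, weights):
    weight_by_study[s] += w
```

Under this scheme the weighted regression treats each paper, rather than each comparison, as contributing one unit of evidence.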
In general, most of the coefficients have the signs previously hypothesized for them in the literature. The coefficient on the Certainty correction

Functional Form
We explore whether the results displayed in columns 1 and 2 of Table 2 are robust with respect to choice of functional form by estimating regressions in which BR replaces ln (BR) as the dependent variable. The original specification is appropriate under an assumption that the effects of the explanatory variables on observed bias ratios are multiplicative in nature, while the choice of BR as dependent variable implicitly assumes that the effects of the same variables in determining observed bias ratios are additive in nature. Results from the regressions in which BR serves as the dependent variable can be found in columns 3 and 4 of Table 2. All of the coefficients in the regressions in which BR is the dependent variable exhibit the same signs as those of their counterparts in the ln (BR) specification. All of the indicator variables that achieve statistical significance in the ln (BR) specifications also achieve some level of statistical significance in the corresponding BR specifications, with only two exceptions (the Cheap talk variable in the unweighted models and the Familiar variable in the weighted models). The same relationship holds in the reverse comparison, as all of the variables that achieve statistical significance in the BR specifications are also statistically significant in the ln (BR) regressions. The consistency in the patterns of coefficient signs and significance across the two sets of specifications provides reassurance that the results produced with ln (BR) as the dependent variable are not driven by this particular choice of functional form.
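The distinction between multiplicative and additive effects can be made concrete with a small numerical sketch. The coefficient and baseline values below are hypothetical and are not taken from Table 2.

```python
import math

beta = -0.30  # hypothetical coefficient on a 0/1 indicator in the ln(BR) specification

def predicted_br_log_spec(baseline_ln_br, indicator):
    """Predicted BR when ln(BR) = baseline + beta * indicator."""
    return math.exp(baseline_ln_br + beta * indicator)

ratios = []
differences = []
for baseline in (0.2, 1.0):  # two hypothetical baseline values of ln(BR)
    off = predicted_br_log_spec(baseline, 0)
    on = predicted_br_log_spec(baseline, 1)
    ratios.append(on / off)        # same proportional effect at every baseline
    differences.append(on - off)   # absolute effect grows with the baseline

multiplier = math.exp(beta)
```

In the ln (BR) specification, switching the indicator on scales the predicted BR by exp(beta) whatever the baseline, whereas a regression with BR as the dependent variable would instead impose the same absolute change at every baseline.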

Time Trend
We also explore whether the results produced by the reference specification are robust to the inclusion of a Time trend variable. Including a time variable in the regression specification controls for the possibility that, after controlling for the effects of the other explanatory variables, at least some of the variation in bias ratios might reflect ongoing refinements in methodology and the gradual adoption of "best practices" in conducting valuation studies. The Time trend variable is based on year of publication and is set at 1 for the year 1972, increasing by one unit with each subsequent year. Table 3 displays a comparison of the results derived from estimating the reference model against those obtained when the Time trend variable is included as an additional explanatory variable. The coefficients on the Time trend variable are positive but relatively small, and they are not statistically significant in either the unweighted or the weighted sample regression. The coefficients on the other explanatory variables are little changed by the inclusion of a Time trend variable, and the pattern of which variables are statistically significant does not change, with the exception of the Lab variable in the weighted models. Any changes in either the R-squared or root mean-squared error (RMSE) statistics produced by the addition of the Time trend variable to the regression equation are sufficiently small as to leave the rounded values reported in Table 3 unchanged. In short, the addition of a Time trend variable adds nothing to improve either the explanatory or predictive powers of the reference regression specification.
The results of this latest meta-analysis are broadly consistent with the findings of previous meta-analyses with respect to the pattern of coefficient signs. The regression equations we estimate explain, in the best of circumstances, only about 23% of the overall variance we observe in BRs (R-squared = 0.2329 for the weighted regression and R-squared = 0.2349 for the unweighted regression). Notable, too, is the low predictive power of these regressions, as evidenced by the relatively large RMSEs in the weighted and unweighted sample regressions, 0.726 and 0.706, respectively. The RMSE represents the standard error of prediction, a measure of the precision with which the actual value of the dependent variable, ln (BR), can be predicted by the regression line. Even the smaller of these figures indicates that 95% of the observations of ln (BR) should fall within an interval of plus or minus 1.38 logarithmic units from their values as predicted by the regression line. Evaluated at their sample means (ln (BR) = 0.452 for the unweighted sample and ln (BR) = 0.616 for the weighted sample) and after having been converted from log form into levels, these relatively large RMSEs establish a 95% prediction interval for BR ranging from 0.394 to 6.27 for the unweighted sample regression and from 0.446 to 7.68 for the weighted sample equation. These wide ranges clearly indicate that the reference model cannot be used to provide a precise prediction of the bias ratio associated with any particular set of study characteristics. The reference model is thus unsuitable as a tool to offset the hypothetical bias that is inherent in all valuation exercises that attempt to value natural resources on the basis of survey respondents' answers to hypothetical questions.
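The interval arithmetic in the preceding paragraph can be reproduced with a short calculation. The sample means and RMSEs are the values reported in the text; the multiplier of 1.96 (a normal approximation) is an assumption, chosen because it matches the stated half-width of about 1.38 log units, and the function name is hypothetical.

```python
import math

def br_prediction_interval(mean_ln_br, rmse, z=1.96):
    """Approximate 95% interval for BR: exponentiate mean ln(BR) plus or minus z * RMSE."""
    half_width = z * rmse
    return math.exp(mean_ln_br - half_width), math.exp(mean_ln_br + half_width)

# Values as reported in the text for the two reference regressions.
unweighted_low, unweighted_high = br_prediction_interval(0.452, 0.706)
weighted_low, weighted_high = br_prediction_interval(0.616, 0.726)

half_width_unweighted = 1.96 * 0.706  # roughly 1.38 logarithmic units
```

Under these assumptions the calculation returns intervals of roughly 0.39 to 6.27 and 0.45 to 7.68, in line with the ranges quoted above. In each case the upper bound is more than fifteen times the lower bound.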

Fixed Effects Regression
As a final robustness check, and to further explore the usefulness of the regression model in making predictions of the bias ratio associated with any particular survey, we re-estimated the reference model in a version that assigned study-specific fixed effect variables to 76 of the 77 studies from which we obtained our data. Table 4 provides results derived from the fixed effects specification alongside results from the reference model.
A fixed effects regression relies solely on variation within groups in determining the coefficients to be placed on the other explanatory variables. This characteristic has several implications for our model. First, it means that studies that supply only one observation to the dataset are effectively ignored in estimating the other coefficients of our models. Second, we cannot estimate coefficients for variables that are held constant within each and every paper in which they appear. As a result, we cannot simultaneously estimate fixed effects coefficients and also estimate coefficients for the Private, Familiar, and Non-use variables.
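Both implications follow from the "within" transformation that underlies a fixed effects estimate, in which each variable is measured as a deviation from its own study's mean. The sketch below uses invented rows and hypothetical names to illustrate the mechanics; it does not reproduce the chapter's estimation.

```python
from collections import defaultdict

# Hypothetical rows: (study id, ln BR, cheap talk, private); invented for illustration.
rows = [
    ("A", 0.9, 0, 1), ("A", 0.5, 1, 1), ("A", 0.7, 0, 1),
    ("B", 1.2, 0, 0), ("B", 0.8, 1, 0),
    ("C", 0.4, 1, 1),  # study contributing a single comparison
]

def demean_within_study(data, column):
    """Subtract each study's own mean from the chosen column (the within transformation)."""
    groups = defaultdict(list)
    for row in data:
        groups[row[0]].append(row[column])
    means = {study: sum(vals) / len(vals) for study, vals in groups.items()}
    return [row[column] - means[row[0]] for row in data]

cheap_talk_within = demean_within_study(rows, 2)
private_within = demean_within_study(rows, 3)
```

Here `private_within` is zero in every row because the indicator never varies within a study, so no coefficient can be estimated for it. The single row from study C is also zero after demeaning, so it contributes nothing to the remaining coefficients.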
The regression coefficients associated with five of the six indicator variables that are common to both the fixed effects and reference regressions are generally larger in magnitude and more likely to achieve statistical significance in the fixed effects specifications than is the case for their reference specification counterparts. This general observation is particularly evident when evaluating the coefficients on the potentially offsetting Student and Lab variables. The exception is the Certainty correction variable, which drops in magnitude between the reference and fixed effects regressions, while remaining highly statistically significant.
Comparing the key regression statistics generated by the fixed effects specifications to those produced by the reference model, it is apparent that the fixed effects specifications do a better job than do their reference model counterparts in explaining the data. The fixed effects regressions generate considerably higher R-squared statistics and lower RMSE statistics than those derived from the estimation of their corresponding reference model equations. This is to be expected, as incorporating a separate fixed effect for every study should improve the goodness of fit of a regression, but the extent of the improvements further illustrates how little of the underlying variation in the data can be attributed to readily observable study characteristics. Our results offer little hope for any efforts to develop "correction

CONCLUSIONS
This study considers whether economists have yet developed any practical and reliable ways to correct for or overcome the well-known phenomenon of hypothetical bias found in survey-based attempts to value environmental or other goods. It does so by updating and extending work done in prior meta-analyses of stated preference methods that has confirmed the widespread presence of hypothetical bias in stated preference studies and that has associated certain factors in survey design with higher or lesser degrees of observable bias. Our meta-analysis, like prior meta-analyses on the same topic, offers no definitive insights that can be used to eliminate or reduce hypothetical bias. We find some, but generally weak, associations between the presence of a limited number of survey design characteristics and the degree of hypothetical bias likely to be exhibited in particular types of survey treatments. However, any insights provided by our analysis cannot reliably be used to control for or eliminate the degree of bias likely to be found in any particular survey. The regression coefficients produced by our analyses are typically associated with relatively wide standard errors, and the equations can explain only a small fraction of the variance exhibited in the degree of hypothetical bias observed across various studies.

Table 1
Summary statistics for bias ratio by observation type

Table 2
Regression coefficients for reference and linear specifications

variables in the unweighted models is relatively large, negative, and statistically significant at the 1% level. The Cheap talk, Same, and Conjoint variables are associated with lower bias ratios, as is the variable meant to indicate whether or not the good being valued is a familiar one. The use of student subjects is associated with higher bias ratios. Potentially offsetting the effects of the Student variable, the coefficient on the Lab variable is negative, indicating that conducting experiments in a lab setting is associated with lowered bias ratios. The coefficient on the Private variable is positive in sign, as is the coefficient on Non-use. The coefficients on the Certainty correction, Cheap talk, Private and Non-use variables are the only ones to achieve statistical significance at conventional levels (p < 0.05) in both the weighted sample and unweighted sample regressions. The coefficient on the Lab variable is of marginal statistical significance (p < 0.10) in only the weighted sample regression. All of the other indicator variable coefficients fail to achieve statistical significance at conventional levels.

Table 3
Time trend regression coefficients

Table 4
Comparison between reference model and fixed effects regression coefficients
Note: Robust standard errors in parentheses; *** p < 0.01, ** p < 0.05, * p < 0.1.

factors" or other tools that would be needed to offset hypothetical bias in any particular instance.