Assessing the validity of stated preference data using follow-up questions



INTRODUCTION
Stated preference (SP) studies such as contingent valuation (CV) and discrete choice experiments (DCEs) are often used to attempt measurement of willingness to pay (WTP) for environmental goods. However, concern exists that these methods do not provide data that can support valid, reliable, and meaningful WTP estimates, especially in the context of estimating non-use values for environmental goods. The foundation of all survey-based exercises is that the questions as asked by the researcher and answered by the respondent share a common understanding. This common understanding is difficult to achieve. In WTP studies, additional criteria must be met if the results are to provide data for estimating Hicksian welfare measures.2 The criteria that must be satisfied if SP data are to be theoretically interpreted via the standard microeconomic rational choice model (RCM) have been widely discussed in the literature (e.g., Mitchell and Carson, 1989; Carson and Groves, 2007; US EPA SAB, 2009; Carson and Louviere, 2011; Bateman, 2011). General consensus exists on these criteria: that the respondents believe the information in the survey and base their responses solely on outcomes described in the survey, they treat the exercise posed in the survey as they would a real decision that affects their budget, and they answer valuation questions as rational economic agents with well-defined preferences who are trading money for economic goods.
1 Respectively, Senior Economist, Cardno, Newark, DE (corresponding author, Kelley Myers@cardno.com); Technical Director, Economics and Decision Sciences, ERM, Raleigh, NC; Vice President, Cardno, Newark, DE; Senior Consultant, Cardno, Santa Barbara, CA.
2 Economists, psychologists and others have developed more behaviorally based theories that depart from the standard microeconomic model of rational choice. While welfare measures may be developed for such theories, we focus here on the standard interpretation of rational choice and Hicksian WTP measures associated with such choice.
One approach for assessing whether respondents satisfy these criteria is to use follow-up, debriefing questions. The earliest and most ubiquitous follow-up questions were "Yes/No" follow-ups based on recommendations from the National Oceanic and Atmospheric Administration (NOAA) Blue Ribbon Panel on contingent valuation. As part of a review of the use of contingent valuation to estimate lost non-use values in the context of natural resource damage assessments (NRDAs), the NOAA panel recommended the use of "Yes/No" follow-ups to determine the type of response (i.e., protest vote, yea-saying, etc.). However, the scope of follow-up questions has expanded over time (Krupnick and Adamowicz, 2006 provide a discussion). These questions may be used to "shore up the credibility of the survey" (ibid.), "to modify the estimate derived from one or more SP questions in some way" (Carson and Louviere, 2011), or to identify "problematic responses" in order to delete some responses or respondents or treat them as zeros for analysis purposes.
Despite their ubiquity, there is little consistency to either the questions posed or their use to modify analyses. First, no consensus exists on what or how many questions to ask in order to identify problematic responses. Second, most studies report results by question rather than by respondent; thus, the literature does not evaluate how many respondents had a general understanding of the tasks asked of them. Third, other than those respondents who protest the SP exercise as a whole and typically are dropped from the analysis sample, no consensus exists on what to do about problematic answers. This lack of consistency in the use of follow-up questions is troubling, as substantial proportions of respondents may give problematic answers to some of the follow-up questions and welfare estimates may be sensitive to decisions made regarding such answers.
This chapter does not solve this problem; we do not propose a theory of "problematic responses" and a practice for what to do about them. We do, however, provide some new insights into the potential prevalence of problematic responses and assess whether respondents are providing valid information. We focus on the pattern of responses by individual respondents to follow-up questions across a suite of debriefing questions. These questions identify whether respondents are failing to meet the criteria for satisfactory SP responses discussed above. This approach allows us to assess whether respondents "fail" on a large number of questions or only one, whether they fail on one or many validity criteria, whether their responses are correlated with observable demographic variables, and whether validity failures are related to answers to valuation questions.3 The subject of our survey is valuing wetland restoration projects to reduce the effects of hypoxia in the Chesapeake Bay. The survey was Internet-based and used a sample of respondents from a web-based panel. The results show that most respondents do not meet the fundamental SP assumption that responses to valuation questions reflect carefully considered, rational economic values for the goods being evaluated in the survey.
In fact, if one uses the answers to our suite of follow-up questions as a whole to identify a "core"4 group of respondents who give unambiguously valid responses, the core would include two respondents out of a total of 1,224, neither of whom was willing to pay for environmental improvements in any of their votes. We also find that people are likely to fail more than one question within a single validity criterion. In other words, when using different types of questions to address the same topic (i.e., various types of questions and response formats that address scenario attendance), people still fail, thus reducing the likelihood that the failing response was due to response error (e.g., misinterpreting questions, marking the wrong response).5 Further, we find little relationship between the tendency to fail the criteria and demographic variables. Hence, applying some sort of weight to the sample to match the population based on census data does not appropriately weight for the proportion of those in the population that would fail to meet the SP validity criteria. This undermines the ability to apply "econometric fixes" to problematic answers.
The rest of the chapter is organized as follows. The next section provides examples of SP studies that use debriefing questions. The third section describes our study design and data. The fourth summarizes the results, while the fifth describes the implications and paths for further research.
3 Of course, asking a follow-up question about what a respondent was thinking when answering the primary valuation question has its methodological deficiencies. An alternative is a "think aloud" protocol in real time (e.g., Schkade and Payne, 1994) as the respondent is answering the question. However, follow-up questions are frequently used to identify "problem responses" and can trigger alternative estimators of welfare. It is this practice we address here.
4 Bishop et al. (2011) refer to those satisfying criteria as being part of a "rational core" of respondents; here, we call those passing all questions as part of the "core."
5 For example, we ask respondents seven different questions to assess whether they attend to the voting scenarios and outcomes described in the survey and not others. The average number of failed questions is three, and 75% of respondents fail between two and five questions.

LITERATURE REVIEW
This section discusses the use of follow-up questions in the SP literature and describes how they map to the basic principles of validity (i.e., respondents take the exercise seriously and treat it as they would a real decision, believe the information in the survey and answer valuation questions as rational economic agents with well-defined preferences). The goal is not to provide a comprehensive assessment of whether or not validity has been found to be a significant problem. Instead, we provide a description of some of the ways that it has been assessed as background information to illustrate how we developed our approach. Table 1 provides examples of the results of follow-up questions reported in the SP literature.
The most common approach to assessing whether respondents take the SP exercise seriously (i.e., view their responses as consequential) is to use questions that ask about response certainty. Using this approach, respondents are asked how certain they are that they would actually pay the amount, or vote as they indicated they would in the survey. Scenario acceptance, or belief in the information provided, requires that respondents value the good described in the survey (and not some other good of their own construction) based on the stated price (and not some other price they believe they would pay). A number of studies ask follow-up questions to test whether respondents believe the survey scenario (Carson et al., 1994, 2003; Krupnick et al., 2002; Banzhaf et al., 2006, 2011; Bishop et al., 2011). These studies ask whether the individual believed the outcomes described would occur, if they believed they would have to pay the amount shown, or if they valued something larger than the good in question.
Finally, the third criterion requires that respondents exhibit utility maximizing behavior and make trade-offs according to standard compensatory methods. Examples of behaviors that violate this criterion include problematic attitudes such as yea-saying or purchasing moral satisfaction, protest responses, using simplifying decision heuristics rather than careful evaluations, and ignoring certain attributes of the SP question. Follow-up questions are often used to identify these types of behaviors and adjust the WTP values accordingly.
Our literature review yields three insights that guided our study design. First, because of the widespread use of follow-up questions in CV surveys, we expected that almost all recent DCE studies would use follow-up questions to test validity comprehensively. However, the proportion of DCE studies using follow-up questions to test validity is smaller than we expected,6 and most studies that do use follow-up questions only focus on one test of validity.
The first section introduces respondents to Chesapeake Bay and describes the causes and impacts of hypoxia and how restoring coastal wetlands can reduce these effects. The first section also asks some general warm-up questions about environmental attitudes. The second section describes a potential program for reducing hypoxia in Chesapeake Bay by restoring coastal wetlands. This section includes a description of the policy change, the institution for providing this change, and the payment mechanism. In our survey, the policy change is a second phase of restoration to build on restoration that has already occurred in Phase 1 (thus mitigating the desire to vote yes to "do something" for the environment, since something already has been done). If approved, Phase 2 would require a one-time payment through increased income taxes for all US households. We select a national income tax as the payment mechanism since the benefits of the restoration are not limited by geographic location.7 The pages that follow describe the attributes affected by the program, which include acres of restored wetlands, bird diversity, days without excess algae, fish and shellfish abundance, public access to wetlands, and a Chesapeake Bay ecosystem health score.8 The attributes were developed over the course of a year through a combination of consultation with ecologists, subject matter experts, and focus group respondents. We also designed several of the ecological attributes, including the Chesapeake Bay ecosystem health score, by following guidelines for ecological indicators in SP valuation developed by Johnston and collaborators (Johnston et al., 2011, 2012).
6 ... state that "[a] surprisingly large number of stated choice surveys do not use debriefing questions ... that ask respondents what they felt or thought as they read text or answered questions." Our review also supports this finding.
7 Using a general tax as a payment mechanism is one of two types of coercive payment mechanisms commonly used in DCEs (see Carson and Louviere, 2011
The third section describes the voting format and includes a reminder about some of the pros and cons of voting for a restoration program. The pros include belief that reducing hypoxia in the Chesapeake Bay is worth the cost and is a good use of tax dollars, and that the cost of the tax increase is within a respondent's budget. The reasons to vote against the program include belief that it is not worth the cost, not a good use of tax dollars, and not within the respondent's budget. After a sample vote, each respondent votes on five different combinations of restoration outcomes. In each vote, respondents have the option to choose the status quo (keep the amount of restoration completed in Phase 1 and pay nothing) or to choose one of two alternative restoration programs at an additional cost to their household. To generate the choice sets, we used SAS market research macros to generate a D-Optimal fractional factorial design out of the full factorial (4^6 × 2^1 = 8,192 alternatives). This generation produced 24 choice pairs that we blocked into six groups of four.9 A sample choice set is shown in Figure 1.
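The size of the full factorial quoted above can be checked by direct enumeration. The following is a minimal sketch only: it assumes six four-level attributes and one two-level attribute (as implied by 4^6 × 2^1), uses placeholder integer levels rather than the survey's actual attribute levels, and does not reproduce the D-optimal selection performed with the SAS macros.

```python
from itertools import product

# Six four-level attributes and one two-level attribute, as implied by 4^6 x 2^1.
# The integer level codes are placeholders, not the survey's actual levels.
levels = [range(4)] * 6 + [range(2)]

# Enumerate every combination of attribute levels (the full factorial).
full_factorial = list(product(*levels))
print(len(full_factorial))

# 24 choice pairs blocked into six groups of four (placeholder pair IDs 0..23).
pairs = list(range(24))
blocks = [pairs[i:i + 4] for i in range(0, len(pairs), 4)]
print(len(blocks), [len(block) for block in blocks])
```

The enumeration only confirms the counts reported in the text; choosing which 24 pairs to keep is the job of the design software.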
The fourth and final section of the survey contains the debriefing questions, followed by standard socioeconomic and demographic questions.
Our data come from 1,224 respondents enrolled in a web-enabled panel maintained by Research Now.10 Of the sampled respondents, 27% said they had visited Chesapeake Bay. The average income of our sample was generally in line with 2010 census data, but the income range from $25,000 to $74,900 was slightly over-represented and the higher ranges were slightly under-represented; 51% were males compared to 49.2% from the census data.11

METHODS AND RESULTS
Our analysis has three basic components. First, we review the frequency distributions of responses to each of the debriefing questions and impose three degrees of rigor for specifying whether a respondent satisfies the validity criteria.12 Second, we examine each respondent's answers to see if a pattern exists by respondent (across the follow-up questions). Third, we use multivariate regression analysis to investigate whether the tendency to meet the validity criteria is associated with a particular type of response to the voting question (i.e., all yes votes, etc.), and to see if respondent characteristics can predict whether a respondent is more or less likely to pass the validity criteria.
11 Our sample was drawn from a nationwide population, stratified by 30% from the Mid-Atlantic region, 30% from the West, and 40% from the rest of the country.
12 The frequency distributions are available upon request from the corresponding author.

Descriptive Statistics
Our analyses are based on a portion of the full sample (960 out of 1,224) as we exclude protest no's (i.e., people who stated they did not trust the government, did not believe in tax increases of any kind, or did not feel they should have to pay for the good). Based on our focus groups and one-on-ones, we use various types of questions and response formats, which include multiple choice, open-ended, and contrasting statements (a question that asks respondents to indicate whether they agree with contrasting statements on either end of a five-point scale), and we use several different question types to address the same topic (i.e., attending to the scenario, believing responses will affect the outcome, etc.).
Clearly our response options give some leeway in determining what constitutes a "problem" in a response. For some of the questions, we construct three classes of "rigor" for specifying when a respondent satisfies the validity criteria: stringent, average, and lenient.13 The stringent approach leads to the smallest number of respondents meeting the validity criteria: respondents meet the criteria only if their answers are unambiguously valid. These respondents are most clearly part of the "core" of respondents. For example, respondents were asked several questions with a contrasting statement response format where an unambiguously valid response was to the far left (response option A) and an invalid response was on the far right (response option E), with five total response categories from A to E. In the stringent approach, respondents who chose A or B are regarded as satisfying the validity criterion for this type of question. The lenient approach accepts more response categories as meeting the criteria and so expands the size of the core. The average approach is in the middle of these two.
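The three degrees of rigor can be thought of as three sets of accepted response options for a contrasting-statement question. The sketch below is illustrative only: the stringent rule (A or B) comes from the text, while the average and lenient cut-offs, the dictionary `VALID_OPTIONS`, and the function `is_valid` are assumptions made for the example.

```python
# Accepted options per degree of rigor for a five-point A-E question.
# Only the stringent set (A or B) is stated in the text; the other two
# cut-offs are hypothetical and chosen so that average lies between them.
VALID_OPTIONS = {
    "stringent": {"A", "B"},
    "average": {"A", "B", "C"},
    "lenient": {"A", "B", "C", "D"},
}


def is_valid(response: str, rigor: str) -> bool:
    """Return True if the response counts as valid under the given rigor."""
    return response.upper() in VALID_OPTIONS[rigor]


# A respondent choosing C passes under the assumed average and lenient
# rules but not under the stringent rule.
for rigor in ("stringent", "average", "lenient"):
    print(rigor, is_valid("C", rigor))
```

Because each stricter set is contained in the more lenient ones, the number of respondents passing a question can only grow as the rule is relaxed.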
In general, for questions that have only two response options (one that is unambiguously valid and one that is not), we do not specify a degree of rigor. A respondent either gives a valid response, or does not. Also, we do not specify a degree of rigor if a question is only shown to a subset of the entire sample (i.e., a follow-up question based on a response to a previous question). In both cases, this is illustrated in Table 2 where the frequency of response is the same across all categories (lenient, average, and stringent).
Table 2 provides a complete list of valid responses to each of the questions and the percentage meeting the validity criterion for each question. Looking at some of the responses that did not vary by the degree of rigor, Table 2 shows that 20% of the sample gives a valid response to a question regarding how they considered costs of the restoration program when making a choice. A valid response to this question includes "I thought only about how I and/or my family would be affected by the cost," whereas an invalid response includes "I thought about an amount that would be fair for most people to pay" and "I thought about an amount that would get a lot of people to vote yes." Additionally, less than 40% valued the program as described (i.e., did not consider health effects when deciding their votes, which the survey explicitly excluded as a benefit of the program), thought they would have to pay the amount shown, did not include other outcomes like reducing toxic chemicals not part of the scenario, and did not consider that voting for the program would increase the chances of the government starting a similar program near them.
Using the stringent approach to identifying valid responses, 28% of respondents saw the results as consequential (i.e., thought that the survey responses would be used to decide whether taxes would be collected). Twenty-one percent of the sample thought program outcomes should be chosen based on people's answers to questions in surveys like this one and 25% thought that survey sponsors want to find out how much the public values the program.

Cumulative Assessment of the Validity of Responses
Next, we examine responses to the follow-up questions at the respondent level. Figure 2 provides the cumulative percentage of the respondents who give invalid responses by degree of rigor. Using the most lenient assessment, 50% of the respondents failed at least six questions. Using the most stringent assessment, 50% of the respondents fail at least nine questions.
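A respondent-level tabulation of this kind can be sketched as follows. This is a toy illustration, not the chapter's data: the matrix `toy`, the function names, and the coding (1 for a valid response, 0 for an invalid one) are assumptions for the example. Respondents with zero failures correspond to what the chapter calls the "core."

```python
def failure_counts(valid_matrix):
    """Count invalid (0) answers for each respondent (one row per respondent)."""
    return [sum(1 for value in row if value == 0) for row in valid_matrix]


def share_failing_at_least(counts, k):
    """Share of respondents who fail at least k questions."""
    return sum(1 for count in counts if count >= k) / len(counts)


def core_members(counts):
    """Indices of respondents with no invalid answers."""
    return [index for index, count in enumerate(counts) if count == 0]


# Hypothetical data: four respondents, four follow-up questions.
toy = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]

counts = failure_counts(toy)
for k in range(1, 5):
    print(k, share_failing_at_least(counts, k))
print("core:", core_members(counts))
```

Repeating the tabulation with the validity coding produced under each degree of rigor would give one cumulative curve per rigor level.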
We next identify respondents who provide valid responses to all questions and, therefore, make up the "core" of respondents. Table 3 shows that only two people are in the core using the lenient approach to inclusion, one person is in the core using the average approach, and the stringent core is empty. Moreover, the two people who do remain in the lenient core (and therefore clearly can be judged to engage in the survey as real, understand the choices being asked of them, and respond in accord with economic rationality) voted against contributing taxes for environmental improvements in all five votes. These results demonstrate that substantial proportions of the respondents do not provide responses to the follow-up questions that comport with the validity criteria. Of course, asking a large set of questions increases the chances that a respondent gives at least one invalid response. Hence, we do not necessarily propose that those not in the core be dropped from the
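The point that a long battery of questions makes a perfect record unlikely can be illustrated with a simple calculation. The sketch assumes, purely for illustration, that response errors are independent across questions and occur at a constant rate; the 5% error rate and the function name `prob_all_valid` are hypothetical and are not estimates from the survey.

```python
def prob_all_valid(n_questions: int, error_rate: float) -> float:
    """Probability of no response error on any of n independent questions."""
    return (1.0 - error_rate) ** n_questions


# Hypothetical 5% chance of a response error on each question.
for n in (1, 5, 10, 20, 40):
    print(n, round(prob_all_valid(n, 0.05), 3))
```

Under these assumptions the probability falls steadily as questions are added, which is why membership in the core is a demanding standard.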

Regression Analysis
This section examines the relationship between the follow-up questions and how people respond to the voting scenarios. It also explores whether respondent characteristics are reasonable predictors as to whether people will give valid or invalid responses to the follow-up questions. Table 4 shows the results of a zero-inflated Poisson model that regresses the number of status quo votes on each individual's responses to the follow-up questions. We use the average approach to identifying responses as valid, and code each variable so that a 1 equals a valid response, 0 otherwise. The top half of the table indicates whether the response to the follow-up question has an influence on the total number of status quo votes, whereas the bottom half of the table is a logit regression that indicates whether a respondent is more likely to be a certain zero (i.e., "all yes" voter) based on their response to the question. The results show that seven questions influence the probability of being a certain zero (i.e., someone who votes yes to all five votes). The results also show that a subset of these questions influences the number of no votes. For example, giving the following valid responses lowers the probability of voting yes to the restoration program in all five votes: considering only how their family would be affected by the taxes when voting, and not how other people would be affected (Family_only); believing that the votes will affect the size and scope of the restoration program (Vote_affectscope); and believing that the survey responses will be used to decide if taxes will be collected for the program (Vote_affecttaxes). In other words, respondents who give valid responses to these questions are less likely to be "yes" voters. Table 4 also indicates that valid responses to Family_only and Vote_affecttaxes increase the number of "no" votes.
The overall conclusion from this analysis is that it takes more than just a single follow-up question or a single type of question (i.e., response certainty, etc.) to evaluate how people respond to the voting question. Our results indicate that many of our questions either affect how people responded to the vote (i.e., were all yes voters) or affect the number of times a person chooses the status quo.
14 With potential for response errors, as the number of questions increases, the probability of answering all correctly goes to zero even if the respondent is in the core. This outcome begs the question of what to do with respondents that fail some significant fraction of follow-up questions assessing validity, which is beyond the scope of this chapter.
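The two parts of a zero-inflated Poisson model described above (a logit equation for being a "certain zero" and a Poisson equation for the count) combine into a single probability for each count. The sketch below shows the standard textbook form of that probability; the function names and the parameter values are illustrative assumptions and are not taken from Table 4.

```python
import math


def logistic(z: float) -> float:
    """Logit link: maps a linear index to the probability of a certain zero."""
    return 1.0 / (1.0 + math.exp(-z))


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson probability of observing count k.

    pi is the probability of a certain zero (here, an "all yes" voter with
    no status quo votes); lam is the mean of the Poisson count component.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# Hypothetical values: a linear index of -1.0 for the inflation equation
# and a mean of 2 status quo votes in the count equation.
pi = logistic(-1.0)
for k in range(6):
    print(k, round(zip_pmf(k, 2.0, pi), 4))
```

In an estimated model, both `pi` and `lam` would depend on the respondent's coded answers to the follow-up questions, which is how a valid response can shift either the chance of being an "all yes" voter or the expected number of "no" votes.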
In a review of alternative methods of valuing environmental goods and services, US EPA SAB (2009) stated that a key criterion for choosing an approach was whether or not the method provides a reliable way to extrapolate from the respondents to the target population. The regression analysis below provides evidence that a reliable extrapolation approach may not exist. First, one needs to be reasonably certain that the respondents are a true random sample of the population. However, if people who fail to satisfy validity criteria are more likely to vote for a tax for an environmental program, it is reasonable to believe they may also be more likely to respond to the survey. Therefore, randomness cannot be assumed. However, if the propensity to pass validity tests is closely linked to demographics, then sampling weights could be used to adjust the results. Unfortunately, no strong link exists between demographics and propensity to satisfy the criteria and, thus, no such simple weighting scheme to match the sample to the population is available.
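The weighting argument can be illustrated with a small post-stratification example. The male shares (51% in the sample, 49.2% in the census) come from the text; the female shares are taken as complements, and the pass rates, function names, and variable names are hypothetical values chosen only to show the mechanics.

```python
def poststrat_weights(sample_share, population_share):
    """Weight for each group: population share divided by sample share."""
    return {g: population_share[g] / sample_share[g] for g in sample_share}


def weighted_rate(sample_share, weights, rate):
    """Weighted share of respondents meeting a validity criterion."""
    num = sum(sample_share[g] * weights[g] * rate[g] for g in sample_share)
    den = sum(sample_share[g] * weights[g] for g in sample_share)
    return num / den


sample_share = {"male": 0.51, "female": 0.49}
population_share = {"male": 0.492, "female": 0.508}
weights = poststrat_weights(sample_share, population_share)

# Hypothetical case 1: the pass rate is unrelated to gender.
same_rate = {"male": 0.20, "female": 0.20}
# Hypothetical case 2: the pass rate differs by gender.
diff_rate = {"male": 0.25, "female": 0.15}

print(weighted_rate(sample_share, weights, same_rate))
print(weighted_rate(sample_share, weights, diff_rate))
```

When the pass rate does not vary with the weighting variable, the weighted and unweighted shares coincide, so demographic weights cannot correct for the share of the population that would fail the validity criteria.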
To explore this, we use a binary logit model to regress the response to each of the follow-up questions listed in Table 4 on demographics (age, income, and gender). All of the adjusted R² values are low and, in most cases, an F-test indicates that the coefficients on at least two of the variables (age, income or gender) are not significantly different from zero in each of the regressions. However, this is not consistent across questions, making it difficult to identify a consistent pattern of respondent characteristics that explains responses to any of the follow-up questions.
DCE studies have used a wide variety of techniques to make "adjustments" for respondents who don't appear to be providing valid responses. Reviewing and evaluating those approaches is beyond the scope of this chapter. Our point here is that the propensity to satisfy validity criteria may be so idiosyncratic that no reliable method may exist for determining the appropriate percentage of results to adjust to extrapolate to the population.

DISCUSSION AND CONCLUSIONS
General consensus exists in the literature that respondents to SP studies should attend to the scenarios and outcomes described in the survey and not others, take the exercise seriously and treat it as they would a real decision that affects their budget, and answer valuation questions as rational economic agents with well-defined preferences who are trading money for specific economic goods. In our survey, a large portion of our sample fails to meet these criteria. In fact, when examining all of each respondent's responses to the entire suite of follow-up questions, our sample yielded no more than two respondents out of 1,224 who answered the follow-up questions in a manner consistent with meeting the validity criteria; the average respondent failed to give a valid response to six or nine questions, depending on the degree of rigor in coding responses.
The majority of respondents (more than 50%) in our survey valued something other than reducing the effects of hypoxia, considered other elements of cost than how their family would be affected by the cost of the program, did not believe they would have to pay the amount shown in the vote, and/or thought that voting for the program would increase the chances of starting a similar program near them. We also find that people who vote yes for a program at least once are less likely to give a valid response, and both of the respondents in the core that do give valid responses voted not to pay for the environmental program in all five votes.
Should policy decisions and legal damages be assessed using information obtained from people who appear to give invalid responses to follow-up questions such as these? What should be done with the results of answers to follow-up questions such as those we obtained? We do not propose answers to these questions, but our analysis suggests the questions are important.
It has been argued that when "state of the art" survey design and administration is employed, the results from SP studies can represent the population's true monetary values in an unbiased fashion (Ryan and Spash, 2011). However, existing literature and the results of this study suggest that inconsistent and invalid responses may be endemic to SP methods and potentially resistant to changes in survey designs.

Figure 2
Cumulative percentage of sample that fails at least one question

Table 1
Summary of SP studies that use debriefing questions
Notes: a. Represents the highest number reported from all questions. b. Only report frequency of responses to one out of 34 questions.
Kelley Myers, Doug MacNair, Ted Tomasi and Jude Schneider -9781786434692 Downloaded from Elgar Online at 04/27/2019 06:35:19AM via free access

Establishing the validity of a stated choice survey is fundamental, but assessing validity does not appear to be a standard practice in DCE studies. Second, numerous studies report results that show significant portions of the population giving a problematic response to the follow-up question, casting doubt on the DCE's validity. Third, these studies tend to report sample proportions answering a question in an invalid fashion for each question separately. The pattern of responses across questions and across question topics by an individual respondent is not investigated.

STUDY DESIGN AND DATA
Our study developed follow-up questions as part of a stated choice survey about reducing hypoxia in Chesapeake Bay. Survey development occurred between August 2010 and September 2011 with the aid of two focus groups and four one-on-one interview sessions. The survey has four sections, which is consistent with current practices in DCEs (see Bateman et al., 2002 for more information).

Table 2
Percentage of sample that gives a valid response by question (n = 960)

Table 3
Number of respondents in core

Table 4
Zero-inflated Poisson regression by valid responses on number of "no" votes