Last week I suggested a few evaluation-related resolutions. One I didn’t mention, which is easily accomplished, is reading and/or contributing to AEA365.  AEA365 is a daily evaluation blog sponsored by the American Evaluation Association.  AEA’s Newsletter says: “The aea365 Tip-a-Day Alerts are dedicated to highlighting Hot Tips, Cool Tricks, Rad Resources, and Lessons Learned by and for evaluators (see the aea365 site here). Begun on January 1, 2010, we’re kicking off our second year and hoping to expand the diversity of voices, perspectives, and content shared during the coming year. We’re seeking colleagues to write one-time contributions of 250-400 words from their own experience. No online writing experience is necessary – you simply review examples on the aea365 Tip-a-Day Alerts site, craft your entry according to the contribution guidelines, and send it to Michelle Baron, our blog coordinator. She’ll do a final edit and upload. If you have questions, or want to learn more, please review the site and then contact Michelle at aea365@eval.org. (updated December 2011)”

AEA365 is a valuable site.  I commend it to you.

Now the topic for today: data sources–the why and the why not (that is, the advantages and disadvantages of each source of information).

Ellen Taylor-Powell, Evaluation Specialist at UWEX, has a handout that identifies sources of evaluation data.  These sources are existing information, people, and pictorial records and observations.  Each source has advantages and disadvantages.

The source for the information below is the United Way publication, Measuring Program Outcomes (p. 86).

1.  Existing information, such as program records, is

  • Available
  • Accessible
  • Collected through known sources and methods

Program records may also

  • Be corrupted by the data collection methods used
  • Have missing data
  • Omit post-intervention impact data

2.  Another form of existing information is other agency records, which

  • Offer a different perspective
  • May contain impact data

Other agency records may also

  • Be corrupted by the data collection methods used
  • Have missing data
  • Be unavailable as a data source
  • Have inconsistent time frames
  • Have case identification difficulties

3.  People, including individuals and the general public, are often the main data source. People

  • Have a unique perspective on their experience
  • Are an original source of data
  • Include the general public, who can provide information when individuals are not accessible
  • Can represent geographic areas or specific population segments

Individuals and the general public may also

  • Introduce a self-report bias
  • Not be accessible
  • Have limited overall experience

4.  Observations and pictorial records include trained observers and mechanical measurements. These sources

  • Can provide information on behavioral skills and practices
  • Can supplement self-reports
  • Can be easily quantified and standardized

These sources of data also

  • Are only relevant to physical observation
  • Require reliably trained data collectors
  • Can yield inconsistent data when multiple observers are used
  • Are affected by the accuracy of testing devices
  • Have limited applicability to outcome measurement

Last Wednesday, I had the privilege to attend the OPEN (Oregon Program Evaluators Network) annual meeting.

Michael Quinn Patton, the keynote speaker, talked about developmental evaluation and utilization-focused evaluation.  Utilization-focused evaluation makes sense–use by intended users.

Developmental Evaluation, on the other hand, needs some discussion.

The way Michael tells the story (he teaches a lot through story) is this:

“I had a standard 5-year contract with a community leadership program that specified 2 1/2 years of formative evaluation for program improvement, to be followed by 2 1/2 years of summative evaluation that would lead to an overall decision about whether the program was effective.”  After 2 1/2 years, Michael called for the summative evaluation to begin.  The director was adamant: “We can’t stand still for 2 years.  Let’s keep doing formative evaluation.  We want to keep improving the program…(I) never (want to do a summative evaluation) if it means standardizing the program.  We want to keep developing and changing.”  He looked at Michael sternly, challengingly.  “Formative evaluation!  Summative evaluation!  Is that all you evaluators have to offer?”  Michael hemmed and hawed and said, “I suppose we could do…ummm…we could do…ummm…well, we might do, you know…we could try developmental evaluation!”  Not knowing what that was, the director asked, “What’s that?”  Michael responded, “It’s where you, ummm, keep developing.”  Developmental evaluation was born.

The evaluation field offered, until now, two global approaches to evaluation: formative, for program improvement, and summative, to make an overall judgment of merit and worth.  Now developmental evaluation (DE) offers another approach, one that is relevant to social innovators looking to bring about major social change.  It takes into consideration systems theory, complexity concepts, uncertainty principles, nonlinearity, and emergence.  DE acknowledges that resistance and push back are likely when change happens.  Developmental evaluation recognizes that change brings turbulence and offers an approach that “adapts to the realities of complex nonlinear dynamics rather than trying to impose order and certainty on a disorderly and uncertain world” (Patton, 2011).  Social innovators recognize that outcomes will emerge as the program moves forward and that predefining outcomes limits the vision.

Michael has used the art of Mark M. Rogers to illustrate the point.  The cartoon has two early humans, one with what I would call a wheel, albeit primitive, who is saying, “No go.  The evaluation committee said it doesn’t meet utility specs.  They want something linear, stable, controllable, and targeted to reach a pre-set destination.  They couldn’t see any use for this (the wheel).”

For Extension professionals who are delivering programs designed to lead to a specific change, DE may not be useful.  For those Extension professionals who envision something different, DE may be the answer.  I think DE is worth a look.

Look for my next post after October 14; I’ll be out of the office until then.

Patton, M. Q. (2011). Developmental evaluation. New York: Guilford Press.

One response I got for last week’s query was about on-line survey services.  Are they reliable?  Are they economical?  What are the design limitations?  What are the question format limitations?

Yes.  Depends.  Some.  Not many.

Let me take the easy question first:  Are they economical?

It depends.  Weigh the cost of postage for a paper survey (both out and back) against the time it takes to enter questions into the system, and the cost of the service against the length of the survey.  These are the things to consider.
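To make “it depends” a bit more concrete, here is a minimal back-of-the-envelope sketch in Python. Every dollar figure and time estimate below is an illustrative assumption, not a real quote; plug in your own postage rates, printing costs, subscription fees, and staff time.

```python
# Back-of-the-envelope comparison of a paper (mail) survey vs. an online
# survey service. Every number below is an illustrative assumption:
# substitute your own postage, printing, subscription, and time figures.

def mail_survey_cost(n_recipients, postage_out=0.50, postage_back=0.50,
                     printing_per_copy=0.25, data_entry_hours=10,
                     hourly_rate=15.00):
    """Cost of sending a paper survey out and back, plus hand data entry."""
    per_piece = postage_out + postage_back + printing_per_copy
    return n_recipients * per_piece + data_entry_hours * hourly_rate

def online_survey_cost(setup_hours=4, hourly_rate=15.00,
                       monthly_subscription=19.95, months=1):
    """Cost of entering questions into the service plus the subscription."""
    return setup_hours * hourly_rate + monthly_subscription * months

if __name__ == "__main__":
    n = 300  # hypothetical number of participants to survey
    print(f"Mail survey, {n} participants:   ${mail_survey_cost(n):,.2f}")
    print(f"Online survey, {n} participants: ${online_survey_cost():,.2f}")
```

With these made-up numbers the online option wins easily for a few hundred participants; for a small workshop with a short survey and little data entry, the balance can tip the other way.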

Because most people have access to email today, using an on-line survey service is often the easiest and most economical way to distribute an evaluation survey.  Most institutional review boards view an on-line survey like a mail survey and typically grant a waiver of documentation of informed consent.  The consenting document is the entry screen, and an agree-to-participate question is often included on that screen.

Are they valid and reliable?

Yes, but…the old adage “Garbage in, garbage out” applies here.  Like a paper survey, an internet survey is only as good as the survey questions.  Don Dillman, in the third edition of “Internet, Mail, and Mixed-Mode Surveys” (co-authored with Jolene D. Smyth and Leah Melani Christian), talks about question development.  Since he wrote the book (literally), I use this resource a lot!

What are the design limitations?

Some limitations apply…each online survey service is different.  The most common service is Survey Monkey (www.surveymonkey.com).  The introduction to Survey Monkey says, “Create and publish online surveys in minutes, and view results graphically and in real time.”  The basic account with Survey Monkey is free.  It has limitations (a maximum of 10 questions, 15 question formats, and 100 responses per survey).  You can upgrade to the Pro or Unlimited plan for a subscription fee ($19.95/month or $200/year, respectively).  There are others.  A search using “survey services” returns many options such as Zoomerang or InstantSurvey.

What are the question format limitations?

Not many–both open-ended and closed-ended questions can be asked.  Survey Monkey has 15 different formats from which to choose (see below).  There may be others; this list covers most formats.

  • Multiple Choice (Only one Answer)
  • Multiple Choice (Multiple Answers)
  • Matrix of Choices (Only one Answer per Row)
  • Matrix of Choices (Multiple Answers per Row)
  • Matrix of Drop-down Menus
  • Rating Scale
  • Single Textbox
  • Multiple Textboxes
  • Comment/Essay Box
  • Numerical Textboxes
  • Demographic Information (US)
  • Demographic Information (International)
  • Date and/or Time
  • Image
  • Descriptive Text

Oregon State University has an in-house service sponsored by the College of Business (BSG–Business Survey Groups).  OSU also has an institutional account with Student Voice, an on-line service designed initially for learning assessment, which I have found useful for evaluations.  Check your institution for the options available.  For your next evaluation that involves a survey, think electronically.

Rensis Likert was a sociologist at the University of Michigan.  He is credited with developing the Likert scale.

Before I say a few words about the scale and subsequently the item (two different entities), I want to clarify how to say his name:

Likert (he died in 1981) pronounced his name lick-urt (short i), as in to lick something.  Most people mispronounce it.  I hope he is resting easy…

Likert scales and Likert items are two different things.

A Likert scale is a multi-item instrument composed of items asking opinions (attitudes) on an agreement-disagreement continuum.  The several items have response levels arranged horizontally.  The response levels are anchored with sequential integers as well as with words, an arrangement that assumes equal intervals.  These words–strongly disagree, somewhat disagree, neither agree nor disagree, somewhat agree, strongly agree–are symmetrical around a neutral middle point.  Likert always measured attitude by agreement or disagreement; today the methodology is applied to other domains.

A Likert item is one of the many items that make up such a scale; it has response levels arranged horizontally, anchored with consecutive integers that are more or less evenly spaced, bivalent and symmetrical about a neutral middle.  If it doesn’t have these characteristics, it is not a Likert item–some authors would say that without these characteristics, the item is not even a Likert-type item.  For example, an item asking how often you do a certain behavior with a scale of “never,” “sometimes,” “average,” “often,” and “very often” would not be a Likert item.  Some writers would consider it a Likert-type item.  If the middle point “average” is omitted, it would still be considered a Likert-type item.

Referring to ANY ordered-category item as Likert-type is a misconception.  Unless it has response levels arranged horizontally, anchored with consecutive integers and with words that connote even spacing, and is bivalent, the item is only an ordered-category item, or sometimes a visual analog scale or a semantic differential scale.  More on visual analog scales and semantic differential scales at another time.
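To make those criteria concrete, here is a minimal Python sketch. The anchors and responses are hypothetical, and the check is deliberately rough; it simply encodes the structural features described above (consecutive integer anchors with symmetric word labels around a neutral middle) and sums several items into a scale score.

```python
# Hypothetical five-point agreement item: consecutive integer anchors paired
# with symmetric word labels around a neutral midpoint.
AGREEMENT_ANCHORS = {
    1: "strongly disagree",
    2: "somewhat disagree",
    3: "neither agree nor disagree",
    4: "somewhat agree",
    5: "strongly agree",
}

def looks_like_likert(anchors):
    """Rough check of the structural criteria described above:
    consecutive integers and an odd number of levels with a middle point."""
    levels = sorted(anchors)
    consecutive = levels == list(range(levels[0], levels[0] + len(levels)))
    has_middle = len(levels) % 2 == 1
    return consecutive and has_middle

def scale_score(item_responses):
    """A Likert *scale* score is the sum (or mean) across several items;
    a single item on its own is just a Likert item."""
    return sum(item_responses)

# One hypothetical respondent answering a four-item agreement scale.
responses = [4, 5, 3, 4]
print(looks_like_likert(AGREEMENT_ANCHORS))  # True
print(scale_score(responses))                # 16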

A good friend of mine asked me today if I knew of any attributes (which I interpreted to be criteria) of qualitative data (NOT qualitative research).  My friend likened the quest for attributes for qualitative data to the psychometric properties of a measurement instrument–validity and reliability–that could be applied to the data derived from those instruments.

Good question.  How does this relate to program evaluation, you may ask.  That question takes us to an understanding of paradigm.

A paradigm (according to Scriven in the Evaluation Thesaurus) is a general concept or model for a discipline that may be influential in shaping the development of that discipline.  Paradigms do not (again according to Scriven) define truth; rather, they define prima facie truth (i.e., truth on first appearance), which is not the same as truth.  Scriven goes on to say, “…eventually, paradigms are rejected as too far from reality and they are always governed by that possibility [i.e., that they will be rejected]” (p. 253).

So why is it important to understand paradigms?  They frame the inquiry.  And evaluators are asking questions; that is, they are inquiring.

How inquiry is framed is based on the components of paradigm:

  • ontology–what is the nature of reality?
  • epistemology–what is the relationship between the known and the knower?
  • methodology–what is done to gain knowledge of reality, i.e., the world?

These beliefs shape how the evaluator sees the world and then guide the evaluator in the use of data, whether those data are derived from records, observations, and interviews (i.e., qualitative data) or from measurements, scales, and instruments (i.e., quantitative data).  Each paradigm guides the questions asked and the interpretations brought to the answers to those questions.  This is why paradigms matter to evaluation.

Denzin and Lincoln (2005), in the third edition of the Handbook of Qualitative Research, list what they call interpretive paradigms.  They are described in Chapters 8-14 of that volume.  The paradigms are:

  1. Positivist/post positivist
  2. Constructivist
  3. Feminist
  4. Ethnic
  5. Marxist
  6. Cultural studies
  7. Queer theory

They indicate that each of these paradigms has criteria, a form of theory, and a specific type of narration or report.  If paradigms have criteria, then it makes sense to me that the data derived in the inquiry framed by those paradigms would have criteria.  Certainly, the psychometric properties of validity and reliability (stemming from the positivist paradigm) relate to data, usually quantitative.  It would make sense to me that the parallel, though different, concepts in the constructivist paradigm, trustworthiness and credibility, would apply to data derived from that paradigm, which are often qualitative.

If that is the case, then evaluators need to be at least knowledgeable about paradigms.

In 1963, Campbell and Stanley (in their classic book, Experimental and Quasi-Experimental Designs for Research) discussed the retrospective pretest.  This is the method whereby participants’ attitudes, knowledge, skills, behaviors, etc., existing prior to and after the program are assessed together AFTER the program–a novel approach to capturing what participants knew, felt, and did before they experienced the program.

Does it work?  Yes…and no (according to the folks in the know).

Campbell and Stanley mention the use of the retrospective pretest in measuring the attitudes towards Blacks (they use the term Negro) of soldiers assigned to racially mixed vs. all-white combat infantry units (1947), and in measuring housing project occupants’ attitudes toward being in integrated vs. segregated housing units when there was a housing shortage (1951).  Both tests showed no difference between the two groups in remembering prior attitudes towards the idea of interest.  Campbell and Stanley argue that with only posttest measures, any difference found may have been attributable to selection bias.  They caution readers to “…be careful to note that the probable direction of memory bias is to distort the past…into agreement with (the) present…or has come to believe to be socially desirable…”

This brings up several biases that the Extension professional needs to be concerned with in planning and conducting an evaluation–selection bias, desired response bias, and response shift bias–all of which can have serious implications for the evaluation.

Those are technical words for several limitations that can affect any evaluation.  Selection bias is the preference to put some participants into one group rather than the other.  Campbell and Stanley call this bias a threat to validity.  Desired response bias occurs when participants try to answer the way they think the evaluator wants them to answer.  Response shift bias happens when participants’ frame of reference or understanding changes during the program, often due to misunderstanding or preconceived ideas.

So these are the potential problems.  Are there any advantages or strengths to using the retrospective pretest?  There are at least two.  First, there is only one administration, at the end of the program.  This is advantageous when the program is short and when participants do not like to fill out forms (that is, it minimizes paper burden).  Second, it avoids response-shift bias by not asking participants to rate, before the program, concepts they may not yet understand.

Theodore Lamb (2005) tested the two methods, concluded that the two approaches appeared similar, and recommended the retrospective pretest if conducting a pretest/posttest is difficult or impossible.  He cautions, however, that supplementing the data from the retrospective pretest with other data is necessary to demonstrate the effectiveness of the program.
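For those curious what the analysis of retrospective pretest data might look like, here is a minimal sketch with invented ratings. Each participant answers the same item twice at the end of the program (“then” and “now”), and a paired t-test compares the two sets of self-reports. This is one common way to analyze such data, not the only one, and it assumes the scipy library is available.

```python
from scipy import stats

# Hypothetical end-of-program ratings on a 1-5 knowledge item.
# Each participant answers twice at the same sitting:
# "then" = how much they knew BEFORE the program (retrospective pretest)
# "now"  = how much they know NOW (posttest)
then_ratings = [2, 1, 3, 2, 2, 3, 1, 2, 2, 3]
now_ratings  = [4, 3, 4, 4, 3, 5, 3, 4, 4, 4]

# Paired t-test: the same people supply both ratings, so the comparison
# is within-person rather than between independent groups.
result = stats.ttest_rel(now_ratings, then_ratings)

mean_change = sum(n - t for n, t in zip(now_ratings, then_ratings)) / len(now_ratings)
print(f"Mean self-reported change: {mean_change:.2f} scale points")
print(f"Paired t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

Keeping Lamb’s caution in mind, a significant paired difference here is still self-report; corroborate it with other data before claiming the program was effective.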

There is a vast array of information about this evaluation method.  If you would like to know more, let me know.

Having addressed the question about which measurement scale was used (“Statistics, not the dragon you think”), I want to talk about how many groups are being included in the evaluation and how those groups are determined.

The first part of that question is easy–there will be either one, two, or more than two groups.  Most of what Extension does results in one group, often an intact group.  An intact group is called a population and consists of all the participants in the program.  The number of program participants can be very large or very small.

The Tree School program is an example that has resulted in a very large number of participants (hundreds).  It is a program that has been in existence for about 20 years.  Contacting all of these participants would be inefficient.  On the other hand, the 4-H science teacher training program involved a small number of participants (about 75) and has been in existence for 5 years.  Contacting all of those participants would be efficient.

With a large population, choosing a part of the bigger group is the best approach.  The part chosen is called a sample and is only a part of a population.  Identifying a part of the population starts with the contact list of participants.  The contact list is called the sampling frame.  It is the basis for determining the sample.

Identifying who will be included in the evaluation is called a sampling plan or a sampling approach.  There are two types of sampling approaches–probability sampling and nonprobability sampling.  Probability sampling methods give every member of the population a known chance of selection, which helps assure that the sample represents the population from which it is drawn.  Nonprobability sampling methods do not; selection is based instead on convenience or on characteristics of the population.  Including all participants works well for a population with fewer than 100 participants.  If there are over 100 participants, choosing a subset of the sampling frame will be more efficient and effective.  There are several ways to select a sample and reduce the population to a manageable number of participants.  Probability sampling approaches include:

  • simple random sampling
  • stratified random sampling
  • systematic sampling
  • cluster sampling

Nonprobability sampling approaches include:

  • convenience sampling
  • snowball sampling
  • quota sampling
  • focus groups

More on these sampling approaches later.
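In the meantime, here is a minimal Python sketch of two of the probability approaches, simple random sampling and stratified random sampling, drawn from a hypothetical sampling frame. The participant names, counties, and sample sizes are invented; only the standard library is used.

```python
import random
from collections import defaultdict

# Hypothetical sampling frame: the contact list of program participants,
# each tagged with a county (the stratum used below).
frame = [
    {"name": f"Participant {i}", "county": county}
    for i, county in enumerate(
        ["Benton"] * 60 + ["Linn"] * 30 + ["Polk"] * 10
    )
]

def simple_random_sample(frame, n, seed=42):
    """Every member of the frame has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_random_sample(frame, fraction, seed=42):
    """Sample the same fraction from each stratum (here, each county),
    so smaller strata are not swamped by larger ones."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in frame:
        by_stratum[person["county"]].append(person)
    sample = []
    for members in by_stratum.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

print(len(simple_random_sample(frame, 20)))       # 20 participants overall
print(len(stratified_random_sample(frame, 0.2)))  # about 20, spread across counties
```

Stratifying by county keeps the small Polk group from being missed entirely, which is the usual reason to prefer it over a simple random draw when a population has uneven subgroups.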

I had a conversation today about how to measure whether I am making a difference in what I do.  Although the conversation was about working with differences, I am conscious that the work I do and the work of working with differences transcend most disciplines and positions.  How does it relate to evaluation?

Perspective and voice.

These are two sides of the same coin.  Individuals come to evaluation with a history or perspective.  Individuals voice their view in the development of evaluation plans.  If individuals are not invited and/or do not come to the table for the discussion, a voice is missing.

This conversation went on–the message was that voice and perspective are more important in evaluations that employ a qualitative approach than in those that employ a quantitative approach.  Yes…and no.

Certainly, words have perspective and provide a vehicle for voice.  And words are the basis for qualitative methods.  So this is the “Yes.”  Is this still an issue when the target audience is homogeneous?  Is it still an issue when the evaluator is “different” on some criterion from the target audience?  Or, as one mental health worker once stated, only an addict can provide effective therapy to another addict.  Is that really the case?  Or do voice and perspective always overlay an evaluation?

Let’s look at quantitative methods.  Some would argue that numbers aren’t affected by perspective and voice.  I will argue that the basis for these numbers is words.  If words are turned into numbers, are voice and perspective still an issue?  This is the “Yes and no.”
I am reminded of the story of a brook and a Native American child.  The standardized test asked which of the following is similar to a brook.  The possible responses were (for the sake of this conversation) river, meadow, lake, inlet.  The Native American child, growing up in the desert Southwest, had never heard the word “brook” and consequently got the item wrong.  This was one of many questions where perspective affected the response.  Wrong answers were totaled, that total was subtracted from the possible total, and a score (a number) resulted.  That individual number was grouped with other individual numbers and compared to numbers from another group using a statistical test–for the sake of conversation, a t-test.  Is the resulting statistic of significance valid?  I would say not.  So this is the “No.”  Here the voice and perspective have been obfuscated.

The statistical significance between those groups is clear according to the computation; clear, that is, until one looks at the words behind the numbers.  It is in the words behind the numbers that perspective and voice affect the outcomes.

A colleague of mine, trying to explain observation to a student, said, “Count the number of legs you see on the playground and divide by two.  You have observed the number of students on the playground.”  That is certainly one way to look at the topic.

I’d like to be a bit more precise than that, though.  Observation is collecting information through the use of the senses–seeing, hearing, tasting, smelling, feeling.  To gather observations, the evaluator must have a clearly specified protocol–a step-by-step approach to what data are to be collected and how.  The evaluator typically gets the first exposure to collecting information by observation at a very young age–learning to talk (hearing); learning to feed oneself (feeling); I’m sure you can think of other examples.  When the evaluator starts school and studies science, and the teacher asks the student to “OBSERVE” a phenomenon and record what is seen, the evaluator is exposed to another approach to observation.

As the process becomes more sophisticated, all manner of instruments may assist the evaluator–thermometers, chronometers, GIS, etc.  And for that process to be replicable (for validity), the steps become more and more precise.
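As one way to picture what a “clearly specified protocol” can look like in practice, here is a minimal, hypothetical sketch: each observation is a structured record (observer, time, behavior from a pre-defined list, count), and a crude percent-agreement check compares two observers, one simple response to the multiple-observer consistency problem noted earlier. The fields and data are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    """One structured entry in a hypothetical observation protocol."""
    observer: str
    timestamp: datetime
    behavior: str   # drawn from a pre-defined list in the protocol
    count: int      # how many times the behavior was seen in the interval

def percent_agreement(obs_a, obs_b):
    """Crude inter-observer check: proportion of intervals where two
    observers recorded the same count for the same behavior."""
    pairs = list(zip(obs_a, obs_b))
    agreements = sum(1 for a, b in pairs
                     if a.behavior == b.behavior and a.count == b.count)
    return agreements / len(pairs)

# Two observers coding the same three 5-minute intervals on a playground.
now = datetime(2011, 10, 3, 10, 0)
observer_1 = [Observation("A", now, "cooperative play", c) for c in (3, 5, 2)]
observer_2 = [Observation("B", now, "cooperative play", c) for c in (3, 4, 2)]

print(f"Agreement: {percent_agreement(observer_1, observer_2):.0%}")  # 67%
```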

Does that mean that looking at the playground, counting the legs, and dividing by two has no place?  Those who decry data manipulation would agree that this form of observation yields information of questionable usefulness.  Those who approach observation as an unstructured activity would disagree and say that exploratory observation could result in an emerging premise.

You will see observation as the basis for ethnographic inquiry.  David Fetterman has a small volume (Ethnography: Step by Step), published by Sage, that explains how ethnography is used in fieldwork.  Take simple ethnography a step up and one can read about meta-ethnography by George W. Noblit and R. Dwight Hare.  I think my anthropology friends would say that observation is a tool used extensively by anthropologists.  It is a tool that can be used by evaluators as well.


How many times have you been interviewed?

How many times have you conducted an interview?

Did you notice any similarities?  Probably.

My friend and colleague Ellen Taylor-Powell has defined interviews as a method for collecting information by talking with and listening to people–a conversation, if you will.  These conversations traditionally happen over the phone or face to face; with social media, they could also happen via chat, IM, or some other technology-based approach.  A resource I have found useful is the Evaluation Cookbook.

Interviews can be structured (not unlike a survey with discrete responses) or unstructured (not unlike a conversation).  You might also hear interviews described as consisting of closed-ended questions and open-ended questions.

Perhaps the most common place for interviews is in the hiring process (seen in personnel evaluation).

Another place for the use of interviews is in the performance review process (seen in performance evaluation).

Unless the evaluator is conducting personnel or performance evaluations, the most common place for interviews to occur is when survey methodology is employed.

Dillman (I’ve mentioned him in previous posts) has sections in his second (pp. 140-148) and third (pp. 311-314) editions that talk about the use of interviews in survey construction.  He makes a point in his third edition that I think is important for evaluators to remember, and that is the issue of social desirability bias (p. 313).  Social desirability bias is the possibility that the respondent will answer with what s/he thinks the person asking the questions wants or hopes to hear.  Dillman goes on to say, “Because of the interaction with another person, interview surveys are more likely to produce socially desirable answers for sensitive questions, particularly for questions about potentially embarrassing behavior…”

Expect social desirability response bias with interviewing (and expect differences in social desirability when part of the interview is self-report and part is face-to-face).  Socially desirable responses could (and probably will) occur even when questions do not appear particularly sensitive to the interviewer; the respondent may have a different cultural perspective that increases sensitivity.  That same cultural difference could also manifest as increased agreement with interview questions, often called acquiescence.

Interviews take time, cost more, and often yield a lot of data that may be difficult to analyze.  Sometimes, as with a pilot program, interviews are worth it.  Interviews can be used for formative and summative evaluations.  Consider whether interviews are the best source of evaluation data for the program in question.