Although I have been learning about and doing evaluation for a long time, this week I’ve been searching for a topic to talk about.  A student recently asked me about the politics of evaluation–there is a lot that can be said on that topic, which I will save for another day.  Another student asked me about when to do an impact study and how to bound that study.  Certainly a good topic, too, though one that can wait for another post.  Something I read in another blog got me thinking about today’s post.  So, today I want to talk about gathering demographics.

Last week, in my TIMELY TOPIC post, I mentioned the AEA Guiding Principles. Those Principles, along with the Program Evaluation Standards, make significant contributions in assisting evaluators to make ethical decisions. Evaluators make ethical decisions with every evaluation, guided by these professional standards of conduct. There are five Guiding Principles and five Evaluation Standards. And although these are not prescriptive, they go a long way toward ensuring ethical evaluations. That is a long introduction into gathering demographics.

The guiding principle Integrity/Honesty states that “Evaluators display honesty and integrity in their own behavior, and attempt to ensure the honesty and integrity of the entire evaluation process.” When we look at the entire evaluation process, as evaluators we must strive constantly to maintain both personal and professional integrity in our decision making. One decision we must make involves deciding what we need/want to know about our respondents. As I’ve mentioned before, knowing what your sample looks like is important to reviewers, readers, and other stakeholders. Yet, if we gather these data in a manner that is intrusive, are we being ethical?

Joe Heimlich, in a recent AEA365 post, says that demographic questions “…all carry with them ethical questions about use, need, confidentiality…”  He goes on to say that there are “…two major conditions shaping the decision to include – or to omit intentionally – questions on sexual or gender identity…”:

  1. When such data would further our understanding of the effect or the impact of a program, treatment, or event.
  2. When asking for such data would benefit the individual and/or their engagement in the evaluation process.

The first point relates to gender role issues–for example, are gay men more like or more different from people in other gender categories? And what gender categories did you include in your survey? The second point relates to allowing an individual’s voice to be heard clearly and completely, and to having the categories on our forms reflect their full participation in the evaluation. For example, does the marital status question offer domestic partnership as well as the traditional categories, and are all those traditional categories necessary to hear your participants?

The next time you develop a questionnaire that includes demographic questions, take a second look at the wording–in an ethical manner.

Hello, readers. This week I’m doing something different with this blog. This week, and the third week of each month from now on, I’ll be posting a column called Timely Topic. This will be a post on a topic that someone (that means you, reader) has suggested. A topic that has been buzzing around in conversations. A topic that has relevance to evaluation. This all came about because a colleague from another land grant institution is concerned about the dearth of evaluation skills among Extension colleagues. (Although this comment makes me wonder to whom this colleague is talking, that question is content for another post, another day.) So, thinking about how to get core evaluation information out to more folks, I decided to devote one post a month to TIMELY TOPICS. Today’s post is about “THINKING CAREFULLY.”

Recently, I’ve been asked to review a statistics textbook for my department. This particular book uses a program that is available on everyone’s computer. The text has some important points to make, and today’s post reflects one of those points: thinking carefully about using statistics.

As an evaluator–if only the evaluator of your own programs–you must think critically about the “…context of the data, the source of the data, the method used in data collection, the conclusions reached, and the practical implications” (Triola, 2010, p. 18). The author posits that to understand general methods for using sample data, to make inferences about populations, to understand sampling and surveys, and to grasp the important measures of key characteristics of data and the use of valid statistical methods, one must be able to recognize the misuse of statistics.

I’m sure all of you have heard the quote, “Figures don’t lie; liars figure,” which is attributed to Mark Twain.  I’ve always heard the quote as “Statistics lie and liars use statistics.”  Statistics CAN lie.  Liars CAN use statistics.  That is where thinking carefully comes in–to determine if the statistical conclusions being presented are seriously flawed.

As evaluators, we have a responsibility (according to the AEA Guiding Principles) to conduct systematic, data-based inquiry; provide competent performance; display honesty and integrity…of the entire evaluation process; respect the security, dignity, and self-worth of all respondents; and consider the diversity of the general and public interests and values. This demands that we think carefully about the reporting of data. Triola cautions, “Do not use voluntary response sample data for making conclusions about a population.” How often have you used data from individuals who decided for themselves (self-selected) whether or not to participate in your survey? THINK CAREFULLY about your sample. These data cannot be generalized even to people like your respondents because of the bias introduced by self-selection.

Other examples of misuse of statistics include

  • using correlation to conclude causation;
  • reporting data that involve a sponsor’s product;
  • identifying respondents inappropriately;
  • reporting data affected by a desired-response bias;
  • using small samples to draw conclusions about large groups;
  • implying that being precise is being accurate; and
  • reporting misleading or unclear percentages.

When reporting statistics gathered from your evaluation, THINK CAREFULLY.

Last week I suggested a few evaluation-related resolutions…one I didn’t mention, which is easily accomplished, is reading and/or contributing to AEA365. AEA365 is a daily evaluation blog sponsored by the American Evaluation Association. AEA’s newsletter says: “The aea365 Tip-a-Day Alerts are dedicated to highlighting Hot Tips, Cool Tricks, Rad Resources, and Lessons Learned by and for evaluators (see the aea365 site here). Begun on January 1, 2010, we’re kicking off our second year and hoping to expand the diversity of voices, perspectives, and content shared during the coming year. We’re seeking colleagues to write one-time contributions of 250-400 words from their own experience. No online writing experience is necessary – you simply review examples on the aea365 Tip-a-Day Alerts site, craft your entry according to the contribution guidelines, and send it to Michelle Baron, our blog coordinator. She’ll do a final edit and upload. If you have questions, or want to learn more, please review the site and then contact Michelle at aea365@eval.org. (updated December 2011)”

AEA365 is a valuable site.  I commend it to you.

Now the topic for today: Data sources–the why and the why not (or advantages and disadvantages for the source of information).

Ellen Taylor-Powell, Evaluation Specialist at UWEX, has a handout that identifies sources of evaluation data. These sources are existing information, people, and pictorial records and observations. Each source has advantages and disadvantages.

The source for the information below is the United Way publication, Measuring Program Outcomes (p. 86).

1. Existing information, such as Program Records, is

  • Available
  • Accessible
  • Collected by known sources and methods

Program records can also

  • Be corrupted by poor data collection methods
  • Have missing data
  • Omit post-intervention impact data

2. Another form of existing information is Other Agency Records. These records

  • Offer a different perspective
  • May contain impact data

Other agency records may also

  • Be corrupted by poor data collection methods
  • Have missing data
  • Be unavailable as a data source
  • Have inconsistent time frames
  • Present case-identification difficulties

3. People are often the main data source and include Individuals and the General Public. People

  • Have a unique perspective on their experience
  • Are an original source of data
  • Can provide information (in the case of the general public) when individuals are not accessible
  • Can represent geographic areas or specific population segments

Individuals and the general public may also

  • Introduce self-report bias
  • Not be accessible
  • Have limited overall experience

4. Observations and pictorial records include Trained Observers and Mechanical Measurements. These sources

  • Can provide information on behavioral skills and practices
  • Supplement self-reports
  • Can be easily quantified and standardized

These sources of data also

  • Are relevant only to what can be physically observed
  • Need reliably trained data collectors
  • Often yield inconsistent data when there are multiple observers
  • Are affected by the accuracy of testing devices
  • Have limited applicability to outcome measurement

While I was discussing evaluation in general earlier this week, the colleague with whom I was conversing asked me how data from a post/pre evaluation form are analyzed. I pondered this for a nanosecond and said change scores…one would compute the difference between the post ranking and the pre ranking and subject that change to some statistical test. “What test?” my colleague asked.

So, today’s post is on what test, and why?

First, you need to remember that post/pre data are related responses. SPSS uses the labels “paired samples” and “2-related samples”; those labels go with a parametric test and a non-parametric test, respectively, for responses that come from the same person (two related responses).

Parametric tests (like the t-test) are based on the assumption that the data are drawn from a normal distribution (i.e., a bell-shaped distribution), a distribution described by known parameters (i.e., the mean and standard deviation).

Non-parametric tests (like the Wilcoxon or the McNemar test) do not make assumptions about the population distribution. Instead, these tests rank the data from low to high and then analyze the ranks. Sometimes these tests are known as distribution-free tests because the parameters of the population are not known. Extension professionals work with populations where parameters are not known most of the time.

If you KNOW (reasonably) that the population’s distribution approximates a normal bell curve, choose a parametric test–in the case of post/pre, that would be a paired t-test, because the responses are related.

You need to use a non-parametric test if any of the following conditions is met:

  • the response is a rank or a score and the distribution is not normal;
  • some values are “out of range”–if someone says 11 on a scale of 1 – 10;
  • the data are measurements (like a post/pre) and you are sure the distribution is NOT normal;
  • you don’t have data from a previous sample with which to compare the current sample; or
  • you have a small sample size (statistical tests for normality don’t work well with small samples).

The last criterion is the one to remember.
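If you want to see how little a formal normality test can tell you with a small sample, here is a minimal sketch in Python (SciPy assumed; the scores are hypothetical):

```python
# A minimal sketch of a normality check on a small sample.
# Assumes SciPy is installed; the scores below are hypothetical.
from scipy import stats

post = [4, 5, 3, 5, 4, 4, 5, 3, 4, 5, 4, 3]  # 12 post ratings

stat, p = stats.shapiro(post)  # Shapiro-Wilk test of normality
print(f"W = {stat:.3f}, p = {p:.3f}")
# A small p suggests the data are NOT normal; but with only 12 data
# points the test has little power, so a p above .05 here is weak
# evidence of normality--exactly the caution above.
```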

If you have a large sample, it doesn’t matter if the distribution is normal because the parametric test is robust enough to ignore the distribution. The only caveat is determining what a “large sample” is. One source I read says, “Unless the population distribution is really weird, you are probably safe choosing a parametric test when there are at least two dozen data points in each group.” That means at least 24 data points in each group. If the post/pre evaluation has six questions and each question is answered by 12 people both post and pre, each question has only 12 data points in each group–12 post; 12 pre. You can’t lump the questions (6) and multiply by the number of people (12) and by post and pre (2); each question is viewed as a separate set of data points. My statistics professor always insisted on a sample size of 30 to have enough power to detect a difference if a difference exists.

If you have a large sample and use a non-parametric test, the test is slightly less powerful than a parametric test used with the same large sample. To see what the difference is, use a t-test and a Wilcoxon test to analyze one question on the post/pre and compare the results. The difference won’t be much.
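Here is a minimal sketch of that comparison in Python (SciPy assumed; the ratings are hypothetical):

```python
# A minimal sketch comparing the parametric and non-parametric tests
# on one post/pre question. SciPy assumed; ratings are hypothetical.
from scipy import stats

pre  = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2]  # 12 pre ratings
post = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4, 4, 3]  # same 12 people, post

t_stat, t_p = stats.ttest_rel(post, pre)  # paired (related) t-test
w_stat, w_p = stats.wilcoxon(post, pre)   # Wilcoxon signed-rank test
print(f"paired t-test: t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon:      W = {w_stat:.1f}, p = {w_p:.4f}")
# Report the more conservative (larger) of the two p-values.
```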

If you have a small sample and you use a parametric test with a distribution that is NOT normal, the probability value may be inaccurate. Again, run both tests to see the difference. You want to use the test with the more conservative probability value–the larger one (a p of 0.001 is more conservative than 0.0001).

If you have a small sample and you use a non-parametric test with a normal distribution, the probability value may be too high because the non-parametric test lacks the power to detect a difference. Again, run both tests to see the difference, and choose the test that is more conservative.

My experience is that using a non-parametric test for much of the analyses done with data from Extension-based projects provides a more realistic analysis.

Next week I’ll be attending the American Evaluation Association annual meeting in San Antonio, TX. I’ll be posting when I return, on November 15.

One response I got for last week’s query was about on-line survey services.  Are they reliable?  Are they economical?  What are the design limitations?  What are the question format limitations?

Yes.  Depends.  Some.  Not many.

Let me take the easy question first:  Are they economical?

Depends. Weigh the cost of postage for a paper survey (both out and back) against the time it takes to enter questions into the system, and the cost of the service against the length of the survey. These are things to consider.

Because most people have access to email today, using an on-line survey service is often the easiest and most economical way to distribute an evaluation survey. Most institutional review boards view an on-line survey like a mail survey and typically grant a waiver of documentation of informed consent. The consenting document is the entry screen, and an “agree to participate” question is often included on that screen.

Are they valid and reliable?

Yes, but…the old adage “Garbage in, garbage out” applies here. Like a paper survey, an internet survey is only as good as its questions. Don Dillman, in the third edition of Internet, Mail, and Mixed-Mode Surveys (co-authored with Jolene D. Smyth and Leah Melani Christian), talks about question development. Since he wrote the book (literally), I use this resource a lot!

What are the design limitations?

Some limitations apply…each online survey service is different. The most common service is Survey Monkey (www.surveymonkey.com). The introduction to Survey Monkey says, “Create and publish online surveys in minutes, and view results graphically and in real time.” The basic account with Survey Monkey is free. It has limitations (number of questions [10]; a limited number of question formats [15]; number of responses [100]), and you can upgrade to the Pro or Unlimited plans for a subscription fee ($19.95/month or $200/year, respectively). There are others; a search using “survey services” returns many options, such as Zoomerang or InstantSurvey.

What are the question format limitations?

Not many–both open-ended and closed-ended questions can be asked. Survey Monkey has 15 different formats from which to choose (see below). There may be others, but this list covers most formats.

  • Multiple Choice (Only one Answer)
  • Multiple Choice (Multiple Answers)
  • Matrix of Choices (Only one Answer per Row)
  • Matrix of Choices (Multiple Answers per Row)
  • Matrix of Drop-down Menus
  • Rating Scale
  • Single Textbox
  • Multiple Textboxes
  • Comment/Essay Box
  • Numerical Textboxes
  • Demographic Information (US)
  • Demographic Information (International)
  • Date and/or Time
  • Image
  • Descriptive Text

Oregon State University has an in-house service sponsored by the College of Business (BSG–Business Survey Groups). OSU also has an institutional account with Student Voice, an on-line service designed initially for learning assessment, which I have found useful for evaluations. Check your institution for the options available. For your next evaluation that involves a survey, think electronically.

A good friend of mine asked me today if I knew of any attributes (which I interpreted to be criteria) of qualitative data (NOT qualitative research).  My friend likened the quest for attributes for qualitative data to the psychometric properties of a measurement instrument–validity and reliability–that could be applied to the data derived from those instruments.

Good question. How does this relate to program evaluation, you may ask. That question takes us to an understanding of paradigms.

A paradigm (according to Scriven in the Evaluation Thesaurus) is a general concept or model for a discipline that may be influential in shaping the development of that discipline. Paradigms do not (again according to Scriven) define truth; rather, they define prima facie truth (i.e., truth on first appearance), which is not the same as truth. Scriven goes on to say, “…eventually, paradigms are rejected as too far from reality and they are always governed by that possibility [i.e., that they will be rejected]” (p. 253).

So why is it important to understand paradigms? They frame the inquiry. And evaluators are asking questions; that is, they are inquiring.

How inquiry is framed is based on the components of paradigm:

  • ontology–what is the nature of reality?
  • epistemology–what is the relationship between the known and the knower?
  • methodology–what is done to gain knowledge of reality, i.e., the world?

These beliefs shape how the evaluator sees the world and then guide the evaluator in the use of data, whether those data are derived from records, observations, and interviews (i.e., qualitative data) or from measurements, scales, and instruments (i.e., quantitative data). Each paradigm guides the questions asked and the interpretations brought to the answers to those questions. This is their importance to evaluation.

Denzin and Lincoln (2005), in the third edition of the Handbook of Qualitative Research, list what they call interpretive paradigms, which are described in Chapters 8–14 of that volume. The paradigms are:

  1. Positivist/post positivist
  2. Constructivist
  3. Feminist
  4. Ethnic
  5. Marxist
  6. Cultural studies
  7. Queer theory

They indicate that each of these paradigms has criteria, a form of theory, and a specific type of narration or report. If paradigms have criteria, then it makes sense to me that the data derived from inquiry framed by those paradigms would have criteria as well. Certainly, the psychometric properties of validity and reliability (stemming from the positivist paradigm) relate to data, usually quantitative data. It would make sense that the parallel, though different, concepts in the constructivist paradigm–trustworthiness and credibility–would apply to data derived from that paradigm, which are often qualitative.

If that is the case, then evaluators need to be at least knowledgeable about paradigms.

In 1963, Campbell and Stanley (in their classic book, Experimental and Quasi-Experimental Designs for Research) discussed the retrospective pretest. This is the method whereby the participant’s attitudes, knowledge, skills, behaviors, etc., existing prior to and after the program are assessed together AFTER the program–a novel approach to capturing what participants knew, felt, and did before they experienced the program.

Does it work?  Yes…and no (according to the folks in the know).

Campbell and Stanley mention the use of the retrospective pretest in measuring soldiers’ attitudes toward Blacks (they use the term Negro) when soldiers were assigned to racially mixed vs. all-white combat infantry units (1947), and in measuring housing-project occupants’ attitudes toward integrated vs. segregated housing units during a housing shortage (1951). Both tests showed no difference between the two groups in remembering prior attitudes toward the idea of interest. Campbell and Stanley argue that with only posttest measures, any difference found may have been attributable to selection bias. They caution readers to “…be careful to note that the probable direction of memory bias is to distort the past…into agreement with (the) present…or has come to believe to be socially desirable…”

This brings up several biases that the Extension professional needs to be concerned with in planning and conducting an evaluation: selection bias, desired response bias, and response shift bias.  All of which can have serious implications for the evaluation.

Those are technical words for several limitations that can affect any evaluation. Selection bias is the preference to put some participants into one group rather than the other; Campbell and Stanley call this bias a threat to validity. Desired-response bias occurs when participants try to answer the way they think the evaluator wants them to answer. Response-shift bias happens when participants’ frame of reference or understanding changes during the program, often due to misunderstanding or preconceived ideas.

So these are the potential problems. Are there any advantages or strengths to using the retrospective pretest? There are at least two. First, there is only one administration, at the end of the program. This is advantageous when the program is short and when participants do not like to fill out forms (that is, it minimizes paper burden). Second, it avoids response-shift bias by not introducing information that may not be understood prior to the program.

Theodore Lamb (2005) tested the two methods and concluded that the two approaches appeared similar and recommended the retrospective pretest if conducting a pretest/posttest is  difficult or impossible.  He cautions, however, that supplementing the data from the retrospective pretest with other data is necessary to demonstrate the effectiveness of the program.

There is a vast array of information about this evaluation method.  If you would like to know more, let me know.

There are four ways distributions can be compared. Two were mentioned last week–central tendency and dispersion. Central tendency describes the average value; dispersion reflects the distribution’s variability.

The other two are skewness and kurtosis.

Skew (or skewness) is a measure of lack of symmetry. Skew occurs when one end (or tail) of the distribution sticks out farther than the other. In a positively skewed distribution, the longer tail points toward the larger values (to the right); in a negatively skewed distribution, the longer tail points toward the smaller values (to the left).

Skew can happen when data are clustered at one end of a distribution, as with a test that is too easy or too hard. When the mean is larger (i.e., greater) than the median, the distribution is positively skewed. When the median is larger than the mean, the distribution is negatively skewed.

The other characteristic of a distribution is kurtosis.

Kurtosis refers to the overall shape of the distribution relative to its peak. Distributions can be relatively flat, or platykurtic, or they can be relatively peaked, or leptokurtic. A handy mnemonic: a platykurtic distribution is flat like a platypus, while a leptokurtic distribution leaps up in the middle.

A normal distribution, the bell curve, is called mesokurtic.

These terms are used to describe a distribution in a report or presentation. When all four characteristics of a distribution are described–central tendency, dispersion, skew, and kurtosis–the reader has a clear picture of the data. From that point, frequencies and percentages can be reported; then statistical tests can be performed and reported as well.
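If you analyze your data in Python, a minimal sketch like this (SciPy assumed; the scores are hypothetical) reports all four characteristics:

```python
# A minimal sketch reporting all four characteristics of a distribution.
# SciPy assumed; the scores are hypothetical.
from statistics import mean, stdev
from scipy.stats import skew, kurtosis

scores = [55, 60, 62, 65, 65, 68, 70, 72, 75, 80, 95]

print(f"mean = {mean(scores):.1f}, sd = {stdev(scores):.1f}")
print(f"skew = {skew(scores):.2f}")          # > 0 means positive skew
print(f"kurtosis = {kurtosis(scores):.2f}")  # SciPy reports excess kurtosis:
                                             # 0 mesokurtic, < 0 platykurtic,
                                             # > 0 leptokurtic
```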

The question about interpretation of data arose today.  I was with colleagues and the discussion focused on measures of central tendency and dispersion or variability.  These are important terms and concepts that form the foundation for any data analysis.  It is important to know how they are used.

MEASURES OF CENTRAL TENDENCY

There are five measures of central tendency–measures that reflect the tendency of data to cluster around the center of the group. Two of these (the geometric mean and the harmonic mean) won’t be discussed here, as they are not typically used in the work Extension does. The three I’m talking about are:

  • Mean Symbolized by M or x̄ (read “x-bar”)
  • Median Symbolized by Md (read M subscript d)
  • Mode Symbolized by Mo (read M subscript o)

The mean is the sum of all the numbers divided by how many numbers there are. The median is the middle number of a sorted list (some folks use the 50% point). The mode is the most popular number, the number “voted” for most frequently.

For example, take the scores 2, 3, 3, 5, 7. The mean is (2 + 3 + 3 + 5 + 7)/5 = 4; the median is 3 (the middle of the sorted list); and the mode is 3 (it occurs most often). Sometimes all of these measures fall on the same number, as in a perfectly symmetric distribution; sometimes, as here, they fall on different numbers.
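Python’s standard statistics module will check an example like this for you (a minimal sketch using the numbers above):

```python
# A minimal sketch computing the three measures of central tendency
# for the example scores above.
from statistics import mean, median, mode

scores = [2, 3, 3, 5, 7]
print(mean(scores))    # 4
print(median(scores))  # 3
print(mode(scores))    # 3
```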

MEASURES OF VARIABILITY

There are four measures of variability, three of which I want to mention today. The fourth, known as the mean (average) deviation, is seldom used in Extension work. The three are:

  • Range Symbolized by R
  • Variance Symbolized by V
  • Standard deviation Symbolized by s or SD (for sample) and σ, the lower case Greek letter sigma (for standard deviation of a population).

The range is the difference between the largest and the smallest number in the sample. For example, if the scores run from 3 to 15, the range is 15 − 3 = 12. A distribution that spreads across more of the scale (call it Distribution A) has a larger range than one that clusters tightly (Distribution B).

Variance is more technical. It is the sum of the squared deviations about the mean (each score’s difference from the mean, squared) divided by the number of scores minus one. Subtracting one removes bias from the calculation; that allows for a more conservative estimate, and being more conservative reduces possible error.

There is a mathematical formula for computing the variance.  Fortunately, a computer software program like SPSS or SAS will do it for you.
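For reference, here is that formula for the sample variance, where x̄ is the sample mean and n is the number of scores:

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$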


The standard deviation results when the square root of the variance is taken. It gives us an indication of “…how much each score in a set of scores, on average, varies from the mean” (Salkind, 2004, p. 41). Again, there is a mathematical formula that a software package will compute for you. Most people are familiar with the mean and standard deviation of IQ scores: a mean of 100 and a standard deviation of 15.
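If SPSS or SAS is not handy, Python’s standard library will do the same computation (a minimal sketch; the scores are hypothetical):

```python
# A minimal sketch of the sample variance and standard deviation.
# Both functions divide by n - 1; the scores are hypothetical.
from statistics import variance, stdev

scores = [88, 92, 100, 104, 96, 110, 90]
print(f"variance = {variance(scores):.1f}")
print(f"sd = {stdev(scores):.1f}")
```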

Convention has it that lower case Greek letters are used for the parameters of populations and Roman letters for the corresponding estimates from samples. So you would see σ (lower case sigma) for the standard deviation and μ (lower case mu) for the mean of a population, and s (or sd) for the standard deviation and x̄ (or M) for the mean of a sample.

These statistics relate to the measurement scale you have chosen to use. Permissible statistics for a nominal scale are the frequency and mode; for an ordinal scale, the median and percentiles; for an interval scale, the mean, variance, standard deviation, and Pearson correlation; and for a ratio scale, the geometric mean. So think seriously about reporting a mean for your Likert-type scale. What exactly does that tell you?

Having addressed the question about which measurement scale was used (“Statistics, not the dragon you think”), I want to talk about how many groups are being included in the evaluation and how those groups are determined.

The first part of that question is easy–there will be either one, two, or more than two groups.  Most of what Extension does results in one group, often an intact group.  An intact group is called a population and consists of all the participants in the program.  All program participants can be a very large number or a very small number.

The Tree School program is an example that has resulted in a very large number of participants (hundreds); it has been in existence for about 20 years, and contacting all of its participants would be inefficient. On the other hand, the 4H science teacher training program has involved a small number of participants (about 75) over its 5 years of existence; contacting all of them would be efficient.

With a large population, choosing a part of the bigger group is the best approach.  The part chosen is called a sample and is only a part of a population.  Identifying a part of the population starts with the contact list of participants.  The contact list is called the sampling frame.  It is the basis for determining the sample.

Identifying who will be included in the evaluation is called a sampling plan or a sampling approach. There are two types of sampling approaches–probability sampling and nonprobability sampling. Probability sampling methods assure that the sample represents the population from which it is drawn. Nonprobability sampling methods are based on characteristics of the population. Including all participants works well for a population with fewer than 100 participants. If there are over 100 participants, choosing a subset of the sampling frame will be more efficient and effective. There are several ways to select a sample and reduce the population to a manageable number of participants (a short sketch of the simplest appears after the lists below). Probability sampling approaches include:

  • simple random sampling
  • stratified random sampling
  • systematic sampling
  • cluster sampling

Nonprobability sampling approaches include:

  • convenience sampling
  • snowball sampling
  • quota sampling
  • focus groups
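
As a concrete sketch of the simplest probability approach, here is how a simple random sample might be drawn from a sampling frame in Python (the frame and sample size are hypothetical):

```python
# A minimal sketch of simple random sampling from a sampling frame.
# The frame and sample size are hypothetical.
import random

sampling_frame = [f"participant_{i}" for i in range(1, 501)]  # 500 names

random.seed(42)  # fix the seed so the draw is reproducible
sample = random.sample(sampling_frame, k=100)  # simple random sample of 100
print(sample[:5])

# A systematic sample, by contrast, takes every kth name from the frame.
k = len(sampling_frame) // 100
systematic = sampling_frame[::k][:100]  # every 5th name
```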

More on these sampling approaches later.