Historically, April 15 is tax day (although in 2011, it is April 18)–the day taxes are due to the revenue departments.

State legislatures are dealing with budgets, and Congress is trying to balance the federal budget.

Everywhere one looks, money is the issue–this is especially true in these recession-ridden times.  How does all this relate to evaluation, you ask?  That is the topic for today’s blog: how does money figure into evaluation?

Let’s start with the simple and move to the complex.  Everything costs–and although I’m talking mainly about money, time, personnel, and other resources (like paper, staples, electricity, etc.) must also be taken into consideration.

When we talk about evaluation, four terms typically come to mind:  efficacy, effectiveness, efficiency, and fidelity.

Efficiency is the term that addresses money or costs.  Was the program efficient in its use of resources?  That is the question efficiency asks.

There are (at least) three approaches used to answer that question:

  1. Cost analysis;
  2. Cost effectiveness analysis; and
  3. Cost-benefit analysis.

Simply then:

  1. Cost analysis is the number of dollars it takes to deliver the program, including the salary of the individual(s) planning the program.
  2. Cost-effectiveness analysis is a computation of the target outcomes, measured in an appropriate unit, in ratio to the costs.
  3. Cost-benefit analysis is a ratio of the costs of the program to its benefits, with both measured in the same units, usually money.

How are these computed?

  1. Cost can be measured by how much the consumer is willing to pay.  Costs can be the value of each resource that is consumed in the implementation of the program.  Or cost analysis can be “measuring costs so they can be related to procedures and outcomes” (Yates, 1996, p. 25).   So you list the money spent to implement the program, including salaries, and that is a cost analysis.  Simple.
  2. Cost-effectiveness analysis says that there is some metric in which the outcomes are measured (number of times hands are washed during the day, for example) and that metric is put in ratio to the total costs of the program.  So movement from washing hands only once a day (a bare minimum) to washing hands at least six times a day would have the costs of the program (including salaries) divided by the change in the number of times hands are washed a day (i.e., 5).  The resulting value is the cost-effectiveness ratio.  Complex.
  3. Cost-benefit analysis puts the outcomes in the same metric as the costs–in this case, dollars.  The costs (in dollars) of the program (including salaries) are put in ratio to the outcomes (usually benefits) measured in dollars.  The challenge here is assigning a dollar amount to the outcomes.  How much is frequent hand washing worth?  It is often measured in days saved from communicable/chronic/acute illnesses.  Computations of healthy days (the reduction in days affected by chronic illness) are often difficult to value in dollars.  There is a whole body of literature in health economics on this topic, if you’re interested.  Complicated and complex.  (A small sketch of all three calculations follows this list.)
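To make the arithmetic concrete, here is a minimal sketch in Python.  Every number in it is hypothetical: a made-up program budget, the hand-washing change from the example above, and a made-up dollar value for days of illness averted.

```python
# Hypothetical numbers only -- a sketch of the three cost calculations above.

program_costs = {        # cost analysis: list every dollar spent, including salaries
    "salaries": 12_000,
    "materials": 1_500,
    "travel": 500,
}
total_cost = sum(program_costs.values())            # cost analysis result

# Cost-effectiveness: cost per unit of outcome (added hand washings per day)
handwashings_pre, handwashings_post = 1, 6
cost_effectiveness = total_cost / (handwashings_post - handwashings_pre)

# Cost-benefit: benefits valued in dollars (e.g., sick days averted x value of a day)
benefit_dollars = 40 * 550                          # hypothetical valuation
benefit_cost_ratio = benefit_dollars / total_cost

print(f"Total cost: ${total_cost:,}")
print(f"Cost per additional daily hand washing: ${cost_effectiveness:,.2f}")
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")
```

The hard part, as noted above, is not the division; it is deciding what dollar value to attach to the outcome.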

Yates, B. T. (1996).  Analyzing costs, procedures, processes, and outcomes in human services.  Thousand Oaks, CA: Sage.

Hello, readers.  This week I’m doing something different with this blog.  This week, and the third week of each month from now on, I’ll be posting a column called Timely Topic.  This will be a post on a topic that someone (that means you, reader) has suggested.  A topic that has been buzzing around in conversations.  A topic that has relevance to evaluation.  This all came about because a colleague from another land grant institution is concerned about the dearth of evaluation skills among Extension colleagues.  (Although this comment makes me wonder to whom this colleague is talking, that question is content for another post, another day.)  So, thinking about how to get core evaluation information out to more folks, I decided to devote one post a month to TIMELY TOPICS.  Today’s post is about “THINKING CAREFULLY.”

Recently, I’ve been asked to review a statistics textbook for my department.  This particular book uses a program that is available on everyone’s computer.  The text has some important points to make, and today’s post reflects one of those points: thinking carefully about using statistics.

As an evaluator–if only the evaluator of your own programs–you must think critically about the “…context of the data, the source of the data, the method used in data collection, the conclusions reached, and the practical implications” (Triola, 2010, p. 18).  The author posits that to understand general methods of using sample data, make inferences about populations, understand sampling and surveys, and grasp the important measures of key characteristics of data–as well as use valid statistical methods–one must be able to recognize the misuse of statistics.

I’m sure all of you have heard the quote, “Figures don’t lie; liars figure,” which is attributed to Mark Twain.  I’ve always heard the quote as “Statistics lie and liars use statistics.”  Statistics CAN lie.  Liars CAN use statistics.  That is where thinking carefully comes in–to determine if the statistical conclusions being presented are seriously flawed.

As evaluators, we have a responsibility (according to the AEA guiding principles) to conduct systematic, data-based inquiry; provide competent performance; display honesty and integrity…of the entire evaluation process; respect the security, dignity, and self-worth of all respondents; and consider the diversity of the general and public interests and values.  This demands that we think carefully about the reporting of data.  Triola cautions, “Do not use voluntary response sample data for making conclusions about a population.”  How often have you used data from individuals who decide themselves (self-selected) whether to participate in your survey or not?  THINK CAREFULLY about your sample.  These data cannot be generalized to all people like your respondents because of the bias that is introduced by self-selection.

Other examples of misuse of statistics include

  • using correlation for concluding causation;
  • reporting data that involve a sponsor’s product;
  • identifying respondents inappropriately;
  • reporting data that are affected by a desired-response bias;
  • using small samples to draw conclusions for large groups;
  • implying that being precise is being accurate; and
  • reporting misleading or unclear percentages.

When reporting statistics gathered from your evaluation, THINK CAREFULLY.

While I was discussing evaluation in general earlier this week, the colleague with whom I was conversing asked me how data from a post/pre evaluation form are analyzed.  I pondered this for a nanosecond and said change scores…one would compute the difference between the post ranking and the pre ranking and subject that change to some statistical test.  “What test?” my colleague asked.

So, today’s post is on what test, and why.

First, you need to remember that post/pre data are related responses.  SPSS uses the labels “paired samples” and “2-related samples”; those labels refer to a parametric test and a non-parametric test, respectively, for responses from the same person (two related responses).

Parametric tests (like the t-test) are based on the assumption that the data are collected from a normal distribution (i.e., a bell-shaped distribution), a distribution described by known parameters (i.e., the mean and standard deviation).

Non-parametric tests (like the Wilcoxon or the McNemar test) do not make assumptions about the population distribution.  Instead, these tests rank the data from low to high and then analyze the ranks.  Sometimes these tests are known as distribution-free tests because the parameters of the population are not known.  Extension professionals work with populations whose parameters are not known most of the time.

If you KNOW (reasonably) that the population’s distribution approximates a normal bell curve, choose a parametric test–in the case of post/pre, that would be a t-test, because the responses are related.

You need to use a non-parametric test if any of the following conditions are met:

  • the response is a rank or a score and the distribution is not normal;
  • some values are “out of range”–if someone says 11 on a scale of 1 – 10;
  • the data are measurements (like a post/pre) and you are sure the distribution is NOT normal;
  • you don’t have data from a previous sample with which to compare the current sample; or
  • you have a small sample size (statistical tests to test for normality don’t work with small samples).

The last criterion is the one to remember.

If you have a large sample, it doesn’t matter much if the distribution is normal because the parametric test is robust to departures from normality.  The only caveat is determining what a “large sample” is.  One source I read says, “Unless the population distribution is really weird, you are probably safe choosing a parametric test when there are at least two dozen data points in each group.”  That means at least 24 data points in each group.  If the post/pre evaluation has six questions and each question is answered by 12 people both post and pre, each question has only 12 data points–12 post and 12 pre.  You can’t lump the questions (6) and multiply by the number of people (12) and by post and pre (2); each question is treated as a separate set of data points.  My statistics professor always insisted on a sample size of 30 to have enough power to detect a difference if a difference exists.

If you have a large sample and use a non-parametric test, the test is slightly less powerful than a parametric test used with the same sample.  To see the difference, use both a t-test and a Wilcoxon test to analyze one post/pre question and compare the results.  It won’t be much.
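If you want to try that comparison yourself, here is a minimal sketch in Python using the scipy library; the twelve post and pre ratings are made up, so swap in one of your own post/pre questions.

```python
# A small sketch (made-up ratings) comparing the two related-samples tests
# on a single post/pre question answered by the same 12 people.
from scipy import stats

pre  = [2, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2]   # 12 pre ratings (hypothetical)
post = [4, 4, 3, 5, 4, 3, 4, 5, 3, 4, 5, 3]   # 12 post ratings, same people

t_stat, t_p = stats.ttest_rel(post, pre)      # parametric: paired (related) t-test
w_stat, w_p = stats.wilcoxon(post, pre)       # non-parametric: Wilcoxon signed-rank

print(f"paired t-test:  t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon test:  W = {w_stat:.2f}, p = {w_p:.4f}")
```

Run both on your own data and you will usually see the same conclusion from each, with the non-parametric p value a bit larger.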

If you have a small sample and you use a parametric test with a distribution that is NOT normal, the probability value may be inaccurate.  Again, run both tests to see the difference.  You want to report the test with the more conservative probability value (the larger of the two p values is the more conservative choice).

If you have a small sample and you use a non-parametric test with a normal distribution, the probability value may be too high because the non-parametric test lacks power to detect a difference.  Again, run both tests to see the difference and choose the more conservative one.

My experience is that using a non-parametric test for much of the analyses done with data from Extension-based projects provides a more realistic analysis.

Next week I’ll be attending the American Evaluation Association Annual meeting in San Antonio, TX. I’ll be posting when I return on November 15.

It occurs to me, as I have mentioned before (see July 13, 2010), that data management is the least likely part of evaluation to be taught.

Research design (from which evaluation borrows heavily), methodology (the tools for data collection), and report writing are all courses in which  Extension professionals could have participated.  Data management is not typically taught as a course.

So what’s a body to do???

Well, you could make it up as you go along.  You could ask some senior member how they do it.  You could explore how a major software package (like SPSS) manages data.

I think there are several parts to managing data that all individuals conducting evaluations need to do.

  • Organize data sequentially–applying an order will help in the long run.
  • Create a data dictionary or a data code book.
  • Develop a data base–use what you know.
  • Save the hard copy in a secure place–I know many folks who kept their dissertation data in the freezer in case of fire or flood.

Organize data.

I suggest that, as the evaluations are returned to you, the evaluator, they be numbered sequentially, 1 – n.  This sequential number can also serve as the identifying number, providing the individual with confidentiality.  The identifying number is typically the first column in the data base.

Data dictionary.

This is a hard copy record of the variables and how they are coded.  It includes any idiosyncrasies that occur as the data are being coded so that a stranger could code your data easily.  An idiosyncrasy that often occurs is for a participant to split the difference between two numbers on a scale.  You must decide how it is coded.  You must also note how you coded it the first time, the next time, and the time after that.  If you alternate between coding high and coding low, you need to note that.
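Here is what a bare-bones data dictionary might look like if you also kept it as a small file alongside the data; the variable names and coding rules below are hypothetical.

```python
# A minimal, hypothetical data dictionary kept next to the data file.
data_dictionary = {
    "id":        "Sequential number written on the returned form (1 - n)",
    "q1_wash":   "Times hands washed per day; whole numbers 1-10",
    "q2_satisf": "Satisfaction, 1 = strongly disagree ... 5 = strongly agree; "
                 "a mark between two numbers is always coded to the lower number",
    "q3_county": "County of residence; 1 = Benton, 2 = Linn, 9 = not stated",
}
```

Whatever form it takes (spreadsheet tab, text file, printed page), the point is that every coding decision is written down once and followed every time.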

Data base.

Most folks these days have Excel on their computers.  A few have a specific data analysis software program.  Excel files can be imported into most analysis programs.  Use it.  It is easy.  If you know how, you can even run frequencies and percentages in Excel.  These analyses are the first analyses you conduct.  Use the rows for cases (individual participants) and the columns for variables.  Name each variable in an identifiable manner AND PUT THAT IDENTIFICATION IN THE DATA DICTIONARY!
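If you (or a colleague) work in Python rather than Excel or SPSS, a few lines with the pandas library will produce those first frequencies and percentages; the file name and column name below are made up and assume the layout just described (rows are cases, columns are variables).

```python
# A minimal sketch: first frequencies and percentages from an Excel data file.
# The file name and column name are hypothetical.
import pandas as pd

df = pd.read_excel("program_evaluations.xlsx")        # rows = participants, columns = variables

counts = df["q2_satisf"].value_counts().sort_index()   # frequencies for one variable
percents = (counts / counts.sum() * 100).round(1)      # percentages

print(pd.DataFrame({"n": counts, "%": percents}))
```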

Data security.

I don’t necessarily advocate storing your data in the freezer, although I certainly did when I did my dissertation in the days before laptops and personal computers.  Make sure the data are secured with a password–not only does it protect the data from most nosy people, it makes IRB happy and assures confidentiality.

Oh, and one other point when talking about data.  The word “data” is a plural noun and takes a plural verb; therefore–data are.  I know that the spoken convention is to treat the word “data” as singular; in writing, it is plural.  A good habit to develop.

A good friend of mine asked me today if I knew of any attributes (which I interpreted to be criteria) of qualitative data (NOT qualitative research).  My friend likened the quest for attributes for qualitative data to the psychometric properties of a measurement instrument–validity and reliability–that could be applied to the data derived from those instruments.

Good question.  How does this relate to program evaluation, you may ask?  That question takes us to an understanding of paradigm.

A paradigm (according to Scriven in Evaluation Thesaurus) is a general concept or model for a discipline that may be influential in shaping the development of that discipline.  Paradigms do not (again according to Scriven) define truth; rather, they define prima facie truth (i.e., truth on first appearance), which is not the same as truth.  Scriven goes on to say, “…eventually, paradigms are rejected as too far from reality and they are always governed by that possibility [i.e., that they will be rejected]” (page 253).

So why is it important to understand paradigms?  They frame the inquiry.  And evaluators are asking questions; that is, they are inquiring.

How inquiry is framed is based on the components of paradigm:

  • ontology–what is the nature of reality?
  • epistemology–what is the relationship between the known and the knower?
  • methodology–what is done to gain knowledge of reality, i.e., the world?

These beliefs shape how the evaluator sees the world and then guides the evaluator in the use of data, whether those data are derived from records, observations, interviews (i.e., qualitative data) or those data are derived from measurement,  scales,  instruments (i.e., quantitative data).  Each paradigm guides the questions asked and the interpretations brought to the answers to those questions.  This is the importance to evaluation.

Denzin and Lincoln (2005), in the 3rd edition of the Handbook of Qualitative Research, list what they call interpretive paradigms.  They are described in Chapters 8 – 14 in that volume.  The paradigms are:

  1. Positivist/post positivist
  2. Constructivist
  3. Feminist
  4. Ethnic
  5. Marxist
  6. Cultural studies
  7. Queer theory

They indicate that each of these paradigms has criteria, a form of theory, and a specific type of narration or report.  If paradigms have criteria, then it makes sense to me that the data derived in the inquiry framed by those paradigms would have criteria.  Certainly, the psychometric properties of validity and reliability (stemming from the positivist paradigm) relate to data, usually quantitative.  It would make sense to me that the parallel, though different, concepts in the constructivist paradigm–trustworthiness and credibility–would apply to data derived from that paradigm, which are often qualitative.

If that is the case–then evaluators need to be at least knowledgeable about paradigms.

I was reading another evaluation blog (the American Evaluation Association’s blog AEA365), which talked about database design.  I was reminded that over the years, almost every Extension professional with whom I have worked has asked me the following question: “What do I do with my data now that I have all my surveys back?”

As Leigh Wang points out in her AEA365 comments, “Most training programs and publication venues focus on the research design, data collection, and data analysis phases, but largely leave the database design phase out of the research cycle.”  The questions that this statement raises are:

  1. How do/did you learn what to do with data once you have it?
  2. How do/did you decide to organize it?
  3. What software do/did you use?
  4. How important is it to make the data accessible to colleagues in the same field?

I want to know the answers to those questions.  I have some ideas.  Before I talk about what I do, I want to know what you do.  Email me, or comment on this blog.

There are four ways distributions can be compared.  Two were mentioned last week–central tendency and dispersion.  Central tendency talks about the average value.  Dispersion reflects the distribution’s variability.

The other two are skewness and kurtosis.

Skew (or skewness) is a measure of lack of symmetry.  Skew occurs when one end (or tail) of the distribution sticks out farther than the other.  Like this:

In this picture, the top image is that of positive skew and the bottom picture is negative skew.

Skew can happen when data are clustered at one end of a distribution, as in a test that is too easy or too hard.  When the mean is a larger number (i.e., greater) than the median, the distribution is positively skewed.  When the median is a larger number (i.e., greater) than the mean, the distribution is negatively skewed.

The other characteristic of a distribution is kurtosis.

Kurtosis refers to the overall shape of the distribution relative to its peak.  Distributions can be relatively flat, or platykurtic, or they can be relatively peaked, or leptokurtic.  This drawing provides a mnemonic to remember those terms:

A normal distribution, the bell curve, is called mesokurtic.

These terms are used to describe a distribution in a report or presentation.  When all four characteristics of a distribution are described–central tendency, dispersion, skew, and kurtosis–the reader has a clear picture of the data.  From that point, frequencies and percentages can be reported.  Then statistical tests can be performed and reported as well.
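Here is a short sketch in Python that reports all four characteristics at once; the scores are made up, and in this example the mean sits above the median, so the skew comes out positive.

```python
# A sketch (hypothetical test scores) of all four characteristics of a distribution.
import numpy as np
from scipy import stats

scores = np.array([55, 60, 62, 65, 65, 67, 70, 72, 75, 80, 95])

print(f"mean = {scores.mean():.1f}, median = {np.median(scores):.1f}")   # central tendency
print(f"range = {scores.max() - scores.min()}, sd = {scores.std(ddof=1):.2f}")  # dispersion
print(f"skew = {stats.skew(scores):.2f}")       # > 0 here: the right tail sticks out farther
print(f"kurtosis = {stats.kurtosis(scores):.2f}")  # 0 = mesokurtic, < 0 platykurtic, > 0 leptokurtic
```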

The question about interpretation of data arose today.  I was with colleagues and the discussion focused on measures of central tendency and dispersion or variability.  These are important terms and concepts that form the foundation for any data analysis.  It is important to know how they are used.

MEASURES OF CENTRAL TENDENCY

There are five measures of central tendency.  These are numbers that reflect the tendency of data to cluster around the center of the group.  Two of these (the geometric mean and the harmonic mean) won’t be discussed here as they are not typically used in the work Extension does.  The three I’m talking about are:

  • Mean Symbolized by M (or x̄, read “x bar”)
  • Median Symbolized by Md (read M subscript d)
  • Mode Symbolized by Mo (read M subscript o)

The mean is the sum of all the numbers divided by the total number of numbers.  Like this:

The median is the middle number of a sorted list (some folks use the 50% point), like this:

The mode is the most popular number, the number “voted” most frequently, like this:

Sometimes,  all of these measures fall on the same number, like this:          

Sometimes, all of these measures fall on different numbers, like this:
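A quick sketch in Python shows both situations; the numbers are made up, with one small set where all three measures land on the same value and one where they all differ.

```python
# Hypothetical data illustrating mean, median, and mode.
from statistics import mean, median, mode

same = [3, 4, 4, 4, 5]           # mean = 4, median = 4, mode = 4 -- all the same number
different = [1, 2, 2, 4, 5, 10]  # mean = 4, median = 3, mode = 2 -- all different numbers

for data in (same, different):
    print(data, "-> mean:", mean(data), " median:", median(data), " mode:", mode(data))
```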

MEASURES OF VARIABILITY

There are four measures of variability, three of which I want to mention today.  The fourth, known as the mean (average) deviation, is seldom used in Extension work.  They are:

  • Range Symbolized by R
  • Variance Symbolized by V
  • Standard deviation Symbolized by s or SD (for sample) and σ, the lower case Greek letter sigma (for standard deviation of a population).

The range is the difference between the largest and the smallest number in the sample, like this  

In this example, the blue distribution (distribution A) has a larger range than the red distribution (Distribution B).

Variance is more technical.  It is the sum of the squared deviations (differences from the mean) about the mean, divided by n − 1 (one less than the number of scores).  Dividing by n − 1 instead of n removes bias from the calculation and gives a more conservative estimate, and being more conservative reduces possible error.

There is a mathematical formula for computing the variance.  Fortunately, a computer software program like SPSS or SAS will do it for you.


The standard deviation is the square root of the variance.  It gives us an indication of “…how much each score in a set of scores, on average, varies from the mean” (Salkind, 2004, p. 41).  Again, there is a mathematical formula that a software package will compute for you.  Most people are familiar with the mean and standard deviation of IQ scores: mean = 100 and SD = 15.
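If you want to check your software’s output, Python’s built-in statistics module computes all three measures of variability; the scores below are hypothetical.

```python
# A minimal sketch of the three measures of variability (hypothetical scores).
import statistics

scores = [12, 15, 15, 17, 18, 20, 23, 28]

data_range = max(scores) - min(scores)        # range
variance = statistics.variance(scores)        # sample variance (divides by n - 1)
sd = statistics.stdev(scores)                 # standard deviation = square root of variance

print(f"range = {data_range}, variance = {variance:.2f}, sd = {sd:.2f}")
```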

Convention has it that lower case Greek letters are used for the parameters of populations and Roman letters represent the corresponding estimates from samples.  So you would see σ for the standard deviation (lower case sigma) and μ for the mean (lower case mu) of populations, and s (or SD) for the standard deviation and M (or x̄) for the mean of samples.

These statistics relate to the measurement scale you have chosen to use.  Permissible statistics for a nominal scale are frequency and mode; for an ordinal scale, median and percentiles; for an interval scale, mean, variance, standard deviation, and Pearson correlation; and for a ratio scale, the geometric mean.  So think seriously before reporting a mean for your Likert-type scale.  What exactly would that tell you?

Statistics is not the dragon you think it is.

For many people, the field of statistics is a dragon in disguise, and like dragons, it is something most people shy away from.

I have found Neil Salkind’s book “Statistics for People Who (Think They) Hate Statistics” to be a good reference for understanding the basics of statistics.  The 4th edition is due out in September 2010.  This book isn’t intimidating; it is easy to understand; it isn’t heavy on math or formulas; and it has a lot of tips.  I’m using it for this column.  I keep it on my desk along with Dillman.

Faculty who come to me with questions about analyzing their data typically want to know how to determine statistical significance.  But before I can talk to faculty about statistical significance, there are a few questions that need to be answered.

  • What type of measurement scale have you used?
  • How many groups do you have on which you have data?
  • How many variables do you have for those groups?
  • Are you examining relationships or differences?
  • What question(s) do you want to answer?

Most people immediately jump to what test to use.  Don’t go there.  Start with the measurement scale you have.  Then answer the other questions.

So let’s talk about scales of measurement.  Not all data are created equal.  Some data are easier to analyze than others.  The scale of measurement makes that difference.

There are four scales of measurement and most data fall into one of these four. They are either categorical (even if they have been converted to numbers) or numerical (originally numbers).  They are:

  • nominal
  • ordinal
  • interval
  • ratio

Scales of measurement are rules determining the particular levels at which outcomes are measured.  When you decide how a question will be answered, you are deciding on the scale of measurement; you are agreeing to a particular set of characteristics for that measurement.

Nominal scales name something.  For example, gender is either male, female, or unknown/not stated; ethnicity is one of several names of groups.  When you gather demographic data, such as gender, ethnicity, or race, you are employing a nominal scale.  The data that result from nominal scales are categorical data–that is, data resulting from categories that are mutually exclusive of each other.  The respondent is either male or female, not both.

Ordinal scale orders something; it puts the thing being measured in order–high to low or low to high.  Salkind gives the example of ranking candidates for a job.  Extension professionals (and many/most survey professionals) use ordinal scales in surveys (strongly agree to strongly disagree; don’t like to like a lot).  We do not know how much difference there is between “don’t like” and “like a lot.”  The data that result from ordinal scales are categorical data.

Interval scale is based on equally spaced intervals along a continuum.  Think of a thermometer, a test score, or weight.  We know that the intervals along the scale are equal to one another.  The data that result from interval scales are numerical data.

Ratio scale is a scale with an absolute zero, a point where the characteristic of interest is absent–like zero light or no molecular movement.  This rarely happens in social or behavioral science, the work that most Extension professionals do.  The data that result from ratio scales are numerical data.
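Here is a toy illustration of sorting survey items by scale; the items themselves are hypothetical, but the pairing of each scale with the kind of data it yields follows the descriptions above.

```python
# Hypothetical survey items matched to their measurement scale and data type.
survey_items = {
    "gender":              ("nominal",  "categorical"),
    "satisfaction_1_to_5": ("ordinal",  "categorical"),
    "pretest_score":       ("interval", "numerical"),
    "miles_traveled":      ("ratio",    "numerical"),
}

for item, (scale, kind) in survey_items.items():
    print(f"{item}: {scale} scale -> {kind} data")
```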

Why do we care?

  • Scales are ordered from the least precise (nominal)  to the most precise (ratio).
  • The scale used determines the detail provided by the data collected; more precision, more information.
  • A more precise scale contains all the qualities of the less precise scales (interval has the qualities of ordinal and nominal).

Using an inappropriate scale will invalidate your data and produce spurious outcomes, which in turn yield spurious impacts.

Statistically significant is a term that is often bandied about. What does it really mean? Why is it important?

First–why is it important?

It is important because it helps the evaluator make decisions based on the data gathered.

That makes sense–evaluators have to make decisions so that the findings can be used.  If there isn’t some way to set the findings apart from the vast morass of information, then they are only background noise.  So those of us who do analysis have learned to look at the probability level (written as a “p” value, such as p = 0.05).  The “p” value helps us determine whether something is likely to be true, not necessarily whether it is important.

Second–what does that number really mean?

Probability level answers the question: could this (fill in the blank here) have happened by chance?  If it could easily happen by chance–say 95 times out of 100–then it is probably not a true change.  When evaluators look at probability levels, we want really small numbers.  A small number says that the likelihood this change occurred by chance alone (i.e., that it is not a true change) is very low.  So a really small number like 0.05 means there is only a 5% chance of getting a result like this by chance; informally, you can convert a p value by subtracting it from 100 (100 − 5 = 95, a rough indication of how unlikely the result is to be chance alone).
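Here is a toy illustration of “happening by chance,” computed in Python with scipy: the probability of getting 9 or more heads in 10 flips of a fair coin.  The coin is a stand-in for “no real change”; the numbers are for illustration only.

```python
# Toy example: how likely is 9 or more heads in 10 flips of a fair coin?
from scipy import stats

result = stats.binomtest(9, n=10, p=0.5, alternative="greater")
print(f"p = {result.pvalue:.3f}")   # about 0.011 -- unlikely to be chance alone
```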

Convention has it that for something to be statistically significant, the p value must be no greater than 0.05.  This convention comes from academic research.  Smaller numbers aren’t necessarily better; they simply indicate that a chance result is even less likely.  There are software programs (StatXact, for example) that can compute exact probabilities, so you will see numbers like 0.047.

Exploratory research (as opposed to confirmatory) may use a higher p value, such as p = 0.10, to suggest that the trend is moving in the desired direction.  Some evaluators let the key stakeholders determine whether the probability level (p value) is at a level that indicates importance–for example, 0.062.  Some would argue that 94 times out of 100 is not that much different from 95 times out of 100.
