Molly « Evaluation is an Everyday Activity

This will be very brief.

The answer to the question, “What test do I use?” is, “It all depends.”

If you have one group you can do the following:

check the reliability.
check the validity.
look at relationships between variables.
predict something from other variables.
look at change across time.
look at scores on one variable measured under different conditions (within-group difference).

If you have two groups you can do the following:

compare the two groups on one variable (between group difference).
look at change across time between the two groups.
compare two groups on one variable under different conditions (within-group difference).

If you have more than two groups, it gets more complicated and I’ll talk about that another day. Most Extension work doesn’t have more than two groups.

So you can see, it all depends. More later.

There are four ways distributions can be compared. Two were mentioned last week–central tendency and dispersion. Central tendency describes the average value. Dispersion reflects the distribution’s variability.

The other two are skewness and kurtosis.

Skew (or skewness) is a measure of lack of symmetry. Skew occurs when one end (or tail) of the distribution sticks out farther than the other. Like this:

In this picture, the top image is that of positive skew and the bottom picture is negative skew.

Skew can happen when data are clustered at one end of a distribution, as in a test that is too easy or too hard. When the mean is a larger number (i.e., greater) than the median, the distribution is positively skewed. When the median is a larger number (i.e., greater) than the mean, the distribution is negatively skewed.
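
Here is a minimal sketch of that mean-versus-median comparison in Python, using only the standard library. The function name and the two sets of test scores are made up for illustration, and the comparison is a rule of thumb for the direction of skew, not a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(scores):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"  # the tail sticks out toward the high end
    if m < md:
        return "negative"  # the tail sticks out toward the low end
    return "none"

# Hypothetical scores: a test that was too easy and one that was too hard.
easy_test = [55, 80, 85, 90, 90, 95, 95, 100]  # most scores cluster high
hard_test = [10, 15, 15, 20, 25, 30, 35, 90]   # most scores cluster low

print(skew_direction(easy_test))  # mean 86.25 is below median 90
print(skew_direction(hard_test))  # mean 30 is above median 22.5
```

In the too-easy test a few low scores pull the mean below the median; in the too-hard test one high score pulls the mean above it.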

The other characteristic of a distribution is kurtosis.

Kurtosis refers to the overall shape of the distribution relative to its peak. Distributions can be relatively flat, or platykurtic, or they can be relatively peaked, or leptokurtic. This drawing provides a mnemonic to remember those terms:

A normal distribution, the bell curve, is called mesokurtic.
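
If you want a number to go with those terms, one common option is the moment-based excess kurtosis: the fourth central moment divided by the square of the second central moment, minus 3. This is a sketch under that assumption; statistical packages may apply a small-sample correction and report slightly different values. The function name and the data are hypothetical. Values near zero suggest a mesokurtic shape, negative values a platykurtic one, and positive values a leptokurtic one.

```python
def excess_kurtosis(scores):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (no small-sample correction)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: one relatively flat set and one relatively peaked set.
flat = [1, 2, 3, 4, 5]
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 10]

print(excess_kurtosis(flat))    # negative, so platykurtic
print(excess_kurtosis(peaked))  # positive, so leptokurtic
```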

These terms are used to describe a distribution in a report or presentation. When all four characteristics of a distribution are described (central tendency, dispersion, skew, and kurtosis), the reader has a clear picture of the data. From that point, frequencies and percentages can be reported. Then statistical tests can be performed and reported as well.

The question about interpretation of data arose today. I was with colleagues and the discussion focused on measures of central tendency and dispersion or variability. These are important terms and concepts that form the foundation for any data analysis. It is important to know how they are used.

MEASURES OF CENTRAL TENDENCY

There are five measures of central tendency. These are numbers that reflect the tendency of data to cluster around the center of the group. Two of these (geometric mean and harmonic mean) won’t be discussed here as they are not typically used in the work Extension does. The three I’m talking about are:

Mean Symbolized by X̄ (read bar X) NOTE: The bar X refers to the arithmetic mean of a SAMPLE (see last week’s blog entry).

Median Symbolized by Md (read M subscript d)

Mode Symbolized by Mo (read M subscript o)

The mean is the sum of all the numbers divided by the total number of numbers. Like this:

The median is the middle number of a sorted list (some folks use the 50% point), like this:

The mode is the most popular number, the number “voted” most frequently, like this:

Sometimes, all of these measures fall on the same number, like this:

Sometimes, all of these measures fall on different numbers, like this:
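
A small worked example may help here. This sketch uses Python's standard `statistics` module; the function name and both sets of scores are hypothetical, chosen so that the three measures agree in the first set and differ in the second.

```python
from statistics import mean, median, mode

def central_tendency(scores):
    """Return the mean, median, and mode of a list of scores."""
    return {
        "mean": mean(scores),      # sum of the numbers divided by how many there are
        "median": median(scores),  # middle number of the sorted list
        "mode": mode(scores),      # most frequently occurring number
    }

same = [2, 3, 3, 4, 5, 5, 5, 7, 11]  # sum is 45, nine scores
different = [1, 2, 2, 3, 4, 6, 10]   # sum is 28, seven scores

print(central_tendency(same))       # mean 5, median 5, mode 5
print(central_tendency(different))  # mean 4, median 3, mode 2
```

In the second set, the single high score of 10 pulls the mean above the median, while the mode stays at the most common value.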

MEASURES OF VARIABILITY

There are four measures of variability, three of which I want to mention today. The fourth, known as the mean (average) deviation, is seldom used in Extension work. They are:

Range Symbolized by R

Variance Symbolized by V

Standard deviation Symbolized by s or SD (for sample) and σ, the lower case Greek letter sigma (for standard deviation of a population).

The range is the difference between the largest and the smallest number in the sample, like this:

In this example, the blue distribution (Distribution A) has a larger range than the red distribution (Distribution B).

Variance is more technical. It is the sum of the squared deviations (differences from the mean) divided by the number of scores minus 1. Dividing by the number of scores minus one, rather than by the number of scores, removes the bias from the calculation when you are working with a sample. That gives a more conservative estimate, and being more conservative reduces possible error.

There is a mathematical formula for computing the variance. Fortunately, a computer software program like SPSS or SAS will do it for you.

The standard deviation results when the square root is taken of the variance. It gives us an indication of “…how much each score in a set of scores, on average, varies from the mean” (Salkind, 2004, p. 41). Again, there is a mathematical formula that is computed by a software package. Most people are familiar with the mean and standard deviation of IQ scores: on the most widely used IQ scales, mean = 100 and SD = 15.
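
For readers who want to see the arithmetic that the software does, here is a minimal sketch in Python. It computes the range, the sample variance (sum of squared deviations from the mean divided by the number of scores minus one), and the standard deviation (square root of the variance). The function name and the scores are hypothetical.

```python
import math

def describe_variability(scores):
    """Range, sample variance (n - 1 in the denominator), and standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    # Sum of the squared deviations of each score from the mean.
    sum_of_squares = sum((x - mean) ** 2 for x in scores)
    variance = sum_of_squares / (n - 1)
    return {
        "range": max(scores) - min(scores),
        "variance": variance,
        "sd": math.sqrt(variance),
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, sum of squares is 32
result = describe_variability(scores)
print(result)  # range 7, variance 32/7 (about 4.57), sd about 2.14
```

Python's built-in `statistics.variance` and `statistics.stdev` use the same n minus 1 denominator, so they can be used to check the hand calculation.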

Convention has it that lower case Greek letters are used for parameters of populations and Roman letters for the corresponding estimates from samples. So you would see σ (lower case sigma) for standard deviation and μ (lower case mu) for mean for populations, and s (or SD) for standard deviation and X̄ for mean for samples.

These statistics relate to the measurement scale you have chosen to use. Permissible statistics for a nominal scale are frequency and mode; for ordinal scale, median and percentiles; for an interval scale, mean, variance, standard deviation, and Pearson correlation; and for a ratio scale, the geometric mean. So think seriously about reporting a mean for your Likert-type scale. What exactly does that tell you?
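
The list of permissible statistics can be sketched as a simple lookup. This Python sketch assumes the list is cumulative, meaning each more precise scale also permits the statistics of the less precise scales; that reading is my assumption for the example. The names `SCALES`, `ADDED_STATISTICS`, and `permissible_statistics` are hypothetical.

```python
# Scales ordered from least precise to most precise.
SCALES = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that become permissible at each scale.
ADDED_STATISTICS = {
    "nominal": ["frequency", "mode"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "variance", "standard deviation", "Pearson correlation"],
    "ratio": ["geometric mean"],
}

def permissible_statistics(scale):
    """Statistics permitted for a scale, including those of less precise scales."""
    position = SCALES.index(scale)
    permitted = []
    for name in SCALES[: position + 1]:
        permitted.extend(ADDED_STATISTICS[name])
    return permitted

print(permissible_statistics("ordinal"))
# ['frequency', 'mode', 'median', 'percentiles']
print("mean" in permissible_statistics("ordinal"))  # False
```

Under this lookup, a Likert-type item treated as ordinal gets a median and percentiles, not a mean.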

Having addressed the question about which measurement scale was used (“Statistics, not the dragon you think”), I want to talk about how many groups are being included in the evaluation and how those groups are determined.

The first part of that question is easy–there will be either one, two, or more than two groups. Most of what Extension does results in one group, often an intact group. An intact group is called a population and consists of all the participants in the program. All program participants can be a very large number or a very small number.

The Tree School program is an example that has resulted in a very large number of participants (hundreds). It is a program that has been in existence for about 20 years. Contacting all of these participants would be inefficient. On the other hand, the 4-H science teacher training program involved a small number of participants (about 75) and has been in existence for 5 years. Contacting all participants would be efficient.

With a large population, choosing a part of the bigger group is the best approach. The part chosen is called a sample and is only a part of a population. Identifying a part of the population starts with the contact list of participants. The contact list is called the sampling frame. It is the basis for determining the sample.

Identifying who will be included in the evaluation is called a sampling plan or a sampling approach. There are two types of sampling approaches–probability sampling and nonprobability sampling. Probability sampling methods are those in which every member of the population has a known, nonzero chance of being selected, which is what allows the sample to represent the population from which it is drawn. Nonprobability sampling methods are those in which participants are chosen on some other basis, such as availability or particular characteristics, rather than by chance. Including all participants works well for a population with fewer than 100 participants. If there are over 100 participants, choosing a subset of the sampling frame will be more efficient and effective. There are several ways to select a sample and reduce the population to a manageable number of participants. Probability sampling approaches include:

simple random sampling
stratified random sampling
systematic sampling
cluster sampling

Nonprobability sampling approaches include:

convenience sampling
snowball sampling
quota sampling
focus groups

More on these sampling approaches later.
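
In the meantime, here is a rough sketch of three of the probability approaches (simple random, systematic, and stratified random sampling) drawn from a sampling frame. It uses only Python's standard library. The frame of 300 participants, the county labels, and all function names are hypothetical, and the stratified version assumes proportional allocation, which is only one of several ways to stratify.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Every entry in the frame has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Take every k-th entry after a random start, where k = len(frame) // n."""
    rng = random.Random(seed)
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, stratum_of, fraction, seed=None):
    """Draw the same fraction at random from each stratum (at least one each)."""
    rng = random.Random(seed)
    strata = {}
    for person in frame:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical sampling frame: (participant id, county) for 300 participants.
frame = [(i, "County A" if i < 120 else "County B") for i in range(300)]

srs = simple_random_sample(frame, 30, seed=1)
systematic = systematic_sample(frame, 30, seed=1)
stratified = stratified_sample(frame, lambda person: person[1], 0.10, seed=1)
print(len(srs), len(systematic), len(stratified))  # 30 30 30
```

With proportional allocation, the stratified sample contains 12 participants from County A and 18 from County B, mirroring their shares of the frame.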

I had a conversation today about how to measure whether I was making a difference in what I do. Although the conversation was referring to working with differences, I am conscious that the work I do and the work of working with differences transcend most disciplines and positions. How does it relate to evaluation?

Perspective and voice.

These are two sides of the same coin. Individuals come to evaluation with a history or perspective. Individuals voice their view in the development of evaluation plans. If individuals are not invited and/or do not come to the table for the discussion, a voice is missing.

This conversation went on–the message was that voice and perspective are more important in evaluations which employ a qualitative approach rather than a quantitative approach. Yes—and no.

Certainly, words have perspective and provide a vehicle for voice. And words are the basis for qualitative methods. So this is the “Yes”. Is this still an issue when the target audience is homogeneous? Is it still an issue when the evaluator is “different” on some criteria than the target audience? Or as one mental health worker once stated, only an addict can provide effective therapy to another addict. Is that really the case? Or do voice and perspective always overlay an evaluation?

Let’s look at quantitative methods. Some would argue that numbers aren’t affected by perspective and voice. I will argue that the basis for these numbers is words. If words are turned into numbers, are voice and perspective still an issue? This is the “Yes and no”.

I am reminded of the story of a brook and a Native American child. The standardized test asked which of the following is similar to a brook. The possible responses were (for the sake of this conversation) river, meadow, lake, inlet. The Native American child, growing up in the desert Southwest, had never heard the word “brook”. Consequently, the child got the item wrong. This was one of many questions where perspective affected the response. Wrong answers were totaled, that total was subtracted from the possible total, and a score (a number) resulted. That individual number was grouped with other individual numbers and compared to numbers from another group using a statistical test (for the sake of conversation, a t-test). Is the resulting test of significance valid? I would say not. So this is the “No”. Here the voice and perspective have been obfuscated.

The statistical significance between those groups is clear according to the computation; clear that is until one looks at the words behind the numbers. It is in the words behind the numbers that perspective and voice affect the outcomes.

Statistics is not the dragon you think it is.

For many people, the field of statistics is a dragon in disguise and, as with dragons, most people shy away from it.

I have found Neil Salkind’s book “Statistics for People Who (Think They) Hate Statistics” to be a good reference for understanding the basics of statistics. The 4th edition is due out in September 2010. This book isn’t intimidating; it is easy to understand; it isn’t heavy on the math or formulas; it has a lot of tips. I’m using it for this column. I keep it on my desk along with Dillman.

Faculty who come to me with questions about analyzing their data typically want to know how to determine statistical significance. But before I can talk to faculty about statistical significance, there are a few questions that need to be answered.

What type of measurement scale have you used?
How many groups do you have on which you have data?
How many variables do you have for those groups?
Are you examining relationships or differences?
What question(s) do you want to answer?

Most people immediately jump to what test to use. Don’t go there. Start with what measurement scale you have. Then answer the other questions.

So let’s talk about scales of measurement. Not all data are created equal. Some data are easier to analyze than other data. Scale of measurement makes that difference.

There are four scales of measurement and most data fall into one of these four. They are either categorical (even if they have been converted to numbers) or numerical (originally numbers). They are:

nominal
ordinal
interval
ratio

Scales of measurement are rules determining the particular levels at which outcomes are measured. When you decide on an answer to a question, you are deciding on the scale of measurement; you are agreeing to the particular set of characteristics for that measurement.

Nominal scales name something. For example–gender is either male, female, or unknown/not stated; ethnicity is one of several names of groups. When you gather demographic data, such as gender, ethnicity, or race, you are employing a nominal scale. The data that result from nominal scales are categorical data–that is data resulting from categories which are mutually exclusive from each other. The respondent is either male or female, not both.

Ordinal scale orders something; it puts the thing being measured in order–high to low; low to high. Salkind gives the example of ranking candidates for a job. Extension professionals (and many/most survey professionals) use ordinal scales in surveys (strongly agree to strongly disagree; don’t like to like a lot). We do not know how much difference there is between don’t like and like a lot. The data that result from ordinal scales are categorical data.

Interval scale is based on a continuum of equally spaced intervals along that continuum. Think of a thermometer; test score; weight. We know that the intervals along the scale are equal to one another. The data that result from interval scales are numerical data.

Ratio scale is a scale with an absolute zero, a situation where the characteristic of interest is absent–like zero light or no molecular movement. This rarely happens in social or behavioral science, the work that most Extension professionals do. The data that result from ratio scales are numerical data.

Why do we care?

Scales are ordered from the least precise (nominal) to the most precise (ratio).

The scale used determines the detail provided by the data collected; more precision, more information.
The more precise scale is a scale which contains all the qualities of less precise scales (interval has the qualities of ordinal and nominal).

Using an inappropriate scale will invalidate your data and provide you with spurious outcomes which yield spurious impacts.

A colleague of mine, trying to explain observation to a student, said, “Count the number of legs you see on the playground and divide by two. You have observed the number of students on the playground.” That is certainly one way to look at the topic.

I’d like to be a bit more precise than that, though. Observation is collecting information through the use of the senses–seeing, hearing, tasting, smelling, feeling. To gather observations, the evaluator must have a clearly specified protocol–a step-by-step approach to what data are to be collected and how. The evaluator typically gets the first exposure to collecting information by observation at a very young age–learning to talk (hearing); learning to feed oneself (feeling); I’m sure you can think of other examples. When the evaluator starts school and studies science, when the teacher asks the student to “OBSERVE” the phenomenon and record what is seen, the evaluator is exposed to another approach to the method of observation.

As the process becomes more sophisticated, all manner of instruments may assist the evaluator–thermometers, chronometers, GIS, etc. And for that process to be able to be replicated (for validity), the steps become more and more precise.

Does that mean that looking at the playground and counting the legs and dividing by two has no place? Those who decry data manipulation would agree that this form of observation yields information of questionable usefulness. Those who approach observation as an unstructured activity would disagree and say that exploratory observation could result in an emerging premise.

You will see observation as the basis for ethnographic inquiry. David Fetterman has a small volume (Ethnography: Step by step) published by Sage that explains how ethnography is used in field work. Take simple ethnography a step up and one can read about meta-ethnography by George W. Noblit and R. Dwight Hare. I think my anthropology friends would say that observation is a tool used extensively by anthropologists. It is a tool that can be used by evaluators as well.

How many times have you been interviewed?

How many times have you conducted an interview?

Did you notice any similarities? Probably.

My friend and colleague, Ellen Taylor-Powell, has defined interviews as a method for collecting information by talking with and listening to people–a conversation if you will. These conversations traditionally happen over the phone or face to face–with social media, they could also happen via chat, IM, or some other technology-based approach. A resource I have found useful is the Evaluation Cookbook.

Interviews can be structured (not unlike a survey with discrete responses) or unstructured (not unlike a conversation). You might also hear interviews described as consisting of closed-ended questions or open-ended questions.

Perhaps the most common place for interviews is in the hiring process (seen in personnel evaluation).

Another place for the use of interviews is in the performance review process (seen in performance evaluation).

Unless the evaluator is conducting personnel/performance evaluations, the most common place for interviews to occur is when survey methodology is employed.

Dillman (I’ve mentioned him in previous posts) has sections in his second (pg. 140 – 148) and third (pg. 311-314) editions that talk about the use of interviews in survey construction. He makes a point in his third edition that I think is important for evaluators to remember and that is the issue of social desirability bias (pg. 313). Social desirability bias is the possibility that the respondent would answer with what s/he thinks the person asking the questions would want/hope to hear. Dillman goes on to say, “Because of the interaction with another person, interview surveys are more likely to produce socially desirable answers for sensitive questions, particularly for questions about potentially embarrassing behavior…”

Expect social desirability response bias with interviewing (and expect differences in social desirability when part of the interview is self-report and part is face-to-face). Social desirability responses could (and probably will) occur even when questions do not appear particularly sensitive to the interviewer; the respondent may have a different cultural perspective which increases sensitivity. That same cultural difference could also manifest in increased agreement with interview questions, often called acquiescence.

Interviews take time; cost more; and often yield a lot of data which may be difficult to analyze. Sometimes, as with a pilot program, interviews are worth it. Interviews can be used for formative and summative evaluations. Consider if interviews are the best source of evaluation data for the program in question.

I have six references on case study in my library. Robert K. Yin wrote two seminal books on case studies, one in 1993 (now in a 2nd edition; 1993 was the 1st edition) and the other in 1989 (now in the 4th edition; 1989 was the 1st edition). I have the 1994 edition (2nd edition of the 1989 book), and in it Yin says that “case studies are increasingly commonplace in evaluation research…are the preferred strategy when ‘how’ and ‘why’ questions are being posed, when the investigator has little control over events, and when the focus is on a contemporary phenomenon within some real-life context.”

So what exactly is a case study?

A case study is typically an in-depth study of one or more individuals, institutions, communities, programs, or populations. Whatever the “case”, it is clearly bounded, and what is studied is what is happening and important within those boundaries. Case studies use multiple sources of information to build the case. For a more detailed review, see Wikipedia.

There are three types of case studies:

Explanatory
Exploratory
Descriptive

Over the years, case method has become more sophisticated.

Brinkerhoff has developed a method, the Success Case Method, as an evaluation approach that is “easier, faster, and cheaper than competing approaches, and produces compelling evidence decision-makers can actually use.” As an evaluation approach, this method is quick and inexpensive and, most of all, produces useful results.

Robert E. Stake has taken case study beyond one to many with his recent book, Multiple Case Study Analysis. It looks at cross-case analysis and can be used when broadly occurring phenomena need to be explored, such as leadership or management.

I’ve mentioned four of the six books. If you want to know the others, let me know.

Extension has consistently used survey as a method for collecting information.

Survey collects information through structured questionnaires resulting in quantitative data. Don Dillman wrote the book, Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method. Although mail and individual interviews were once the norm, internet survey software has changed that.

Other ways are often more expedient, less costly, less resource intensive than survey. When needing to collect information, consider some of these other ways:

Case study
Interviews
Observation
Group Assessment
Expert or peer review
Portfolio reviews
Testimonials
Tests
Photographs, slides, videos
Diaries, journals
Logs
Document analysis
Simulations
Stories
Unobtrusive measures

I’ll talk about these in later posts and provide resources for each of these.

When deciding what information collection method (or methods) to use, remember there are three primary sources of evaluation information. Those sources often dictate the methods of information collection. The three sources are:

Existing information
People
Pictorial records and observation

When using existing information, developing a systematic approach to LOOKING at the information source is what is important.

When gathering information from people, ASKING them is the approach to use–and how that asking is structured.

When using pictorial records and observations, determine what you are looking for before you collect information.

Evaluation is an Everyday Activity

Program Evaluation Discussions

Author Archives: Molly

What test do I use

Differing distributions

Central tendency and variability

Population or sample?

Two sides of the same coin

Statistics, not the dragon you think.

Observations–another data collection approach

More on data sources–Interviews

Other ways to gather information–The Case Study

Why Not Survey
