I have been writing this blog since December 2009.  That seems like forever from my perspective.

I write.  I post.  I wait.  Nothing.

Oh, occasionally, I receive an email (which is wonderful and welcome) and early on I received a few comments (that was great).


Recently, nothing.

I know it is summer–and in Extension that is the time for fairs and camps and that means everyone is busy.

Yet, I know that learning happens all the time and you have some amazing experiences that can teach.  So, my good readers: What evaluation question have you had this week?  Any question related to evaluating what you are doing is welcome.  Let me hear from you.  You can email me (molly.engle@oregonstate.edu) or you can post a comment (see comment link below).

It occurs to me, as I have mentioned before (see July 13, 2010), that data management is the least likely part of evaluation to be taught.

Research design (from which evaluation borrows heavily), methodology (the tools for data collection), and report writing are all courses in which  Extension professionals could have participated.  Data management is not typically taught as a course.

So what’s a body to do???

Well, you could make it up as you go along.  You could ask some senior member how they do it.  You could explore how a major software package (like SPSS) manages data.

I think there are several parts to managing data that all individuals conducting evaluations need to do.

  • Organize data sequentially–applying an order will help in the long run.
  • Create a data dictionary or a data code book.
  • Develop a data base–use what you know.
  • Save the hard copy in a secure place–I know many folks who kept their dissertation data in the freezer in case of fire or flood.

Organize data.

I suggest that, as the evaluations are returned to you, the evaluator, they be numbered sequentially, 1 – n.  This sequential number can also serve as the identifying number, providing the individual with confidentiality.  The identifying number is typically the first column in the data base.
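If you enter the returned evaluations into a spreadsheet or statistical package, the sequential identifier can be assigned as the data are entered.  Here is a minimal sketch in Python with pandas; the file and column names are illustrative, not a prescription.

    import pandas as pd

    # Hypothetical workbook of returned evaluations, one row per instrument,
    # entered in the order they came back.
    responses = pd.read_excel("returned_evaluations.xlsx")

    # Assign sequential identifiers 1..n in order of return; this number
    # stands in for the participant's name and preserves confidentiality.
    responses.insert(0, "id", list(range(1, len(responses) + 1)))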

Data dictionary.

This is a hard copy record of the variables and how they are coded.  It includes any idiosyncrasies that occur as the data are being coded so that a stranger could code your data easily.  An idiosyncrasy that often occurs is for a participant to split the difference between two numbers on a scale.  You must decide how it is coded.  You must also note how you coded it the first time so you can code it the same way the next time and the next time.  If you alternate between coding high and coding low, you need to note that.
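A code book can live in a word-processing file, on paper, or alongside the analysis itself.  Below is a minimal sketch of one kept in Python; the variable names, codes, and coding rule are all illustrative.  What matters is that every coding decision gets written down.

    # Illustrative code book: one entry per variable, recording the codes
    # and any coding rules, so a stranger could reproduce the coding.
    code_book = {
        "id": "Sequential number assigned in order of return (1 to n).",
        "att_1": {
            "question": "The workshop met my needs.",
            "codes": {1: "strongly disagree", 2: "somewhat disagree",
                      3: "neither agree nor disagree", 4: "somewhat agree",
                      5: "strongly agree"},
            "rule": "If a respondent marks between two numbers, code the higher "
                    "value; apply the same rule every time and note it here.",
        },
        "gender": {"codes": {1: "female", 2: "male", 9: "not reported"}},
    }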

Data base.

Most folks these days have Excel on their computers.  A few have a specific data analysis software program.  Excel files can be imported into most software programs.  Use it.  It is easy.  If you know how, you can even run frequencies and percentages in Excel.  These analyses are the first analyses you conduct.  Use the rows for cases (individual participants) and the columns for variables.  Name each variable in an identifiable manner AND PUT THAT IDENTIFICATION IN THE DATA DICTIONARY!
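If you prefer to move the Excel file into an analysis environment, the same layout carries over directly.  A sketch with pandas, assuming one row per participant and variable names that match the data dictionary (the file and column names here are made up):

    import pandas as pd

    # One row per case (participant), one column per variable,
    # exactly as laid out in the Excel workbook.
    data = pd.read_excel("workshop_evaluations.xlsx")

    # First analyses: frequencies and percentages for one variable.
    counts = data["att_1"].value_counts().sort_index()
    percents = (counts / counts.sum() * 100).round(1)
    print(pd.DataFrame({"frequency": counts, "percent": percents}))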

Data security.

I don’t necessarily advocate storing your data in the freezer, although I certainly did when I did my dissertation in the days before laptops and personal computers.  Make sure the data are secured with a password–not only does it protect the data from most nosy people, it makes IRB happy and assures confidentiality.

Oh, and one other point when talking about data.  The word “data” is a plural word and takes a plural verb; therefore, data are.  I know that the spoken convention is to treat the word “data” as singular; in writing, it is plural.  A good habit to develop.

Rensis Likert was a social psychologist at the University of Michigan.  He is credited with developing the Likert scale.

Before I say a few words about the scale and subsequently the item (two different entities), I want to clarify how to say his name:

Likert (he died in 1981) pronounced his name lick-urt (short i), like to lick something.  Most people mispronounce it.  I hope he is resting easy…

Likert scales and Likert items are two different things.

A Likert scale is a multi-item instrument composed of items asking opinions (attitudes) on an agreement-disagreement continuum.  The several items have response levels arranged horizontally.  The response levels are anchored with sequential integers as well as with words that assume equal intervals.  These words–strongly disagree, somewhat disagree, neither agree nor disagree, somewhat agree, strongly agree–are symmetrical around a neutral middle point.  Likert always measured attitude by agreement or disagreement; today the methodology is applied to other domains.

A Likert item is one of many that has response levels arranged horizontally and anchored with consecutive integers that are more or less evenly spaced, bivalent and symmetrical about a neutral middle.  If it doesn’t have these characteristics, it is not a Likert item–some authors would say that without these characteristics, the item is not even a Likert-type item.  For example, an item asking how often you do a certain behavior with a scale of “never,” “sometimes,” “average,” “often,” and “very often” would not be a Likert item.  Some writers would consider it a Likert-type item.  If the middle point “average” is omitted, it would still be considered a Likert-type item.

Referring to ANY ordered category item as Likert-type is a misconception.  Unless it has response levels arranged horizontally, anchored with consecutive integers and with words that connote even spacing, and is bivalent, the item is only an ordered-category item or sometimes a visual analog scale or a semantic differential scale.  More on visual analog scales and semantic differential scales at another time.
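To make the distinction concrete, here is a small sketch of a Likert item’s anchors and a Likert scale score computed from several such items; the item names and responses are invented for illustration.

    import pandas as pd

    # Anchors for a single Likert item: consecutive integers, bivalent and
    # symmetrical about a neutral midpoint.
    anchors = {1: "strongly disagree", 2: "somewhat disagree",
               3: "neither agree nor disagree", 4: "somewhat agree",
               5: "strongly agree"}

    # Three illustrative respondents answering five agreement items; the
    # Likert scale score is the sum of the item responses for each person.
    data = pd.DataFrame({"att_1": [5, 3, 2], "att_2": [4, 3, 1],
                         "att_3": [5, 4, 2], "att_4": [4, 3, 3],
                         "att_5": [5, 3, 2]})
    data["attitude_scale"] = data.sum(axis=1)
    print(data)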

Bias causes problems for the evaluator.

Scriven says that the evaluative use of “bias” means the “…same as ‘prejudice’,” with its antonyms being objectivity, fairness, and impartiality.  Bias causes systematic errors of the kind humans are prone to, often due to the tendency to prejudge issues because of previous experience or perspective.

Why is bias a problem for evaluators and evaluations?

  • It leads to invalidity.
  • It results in lack of reliability.
  • It reduces credibility.
  • It leads to spurious outcomes.

What types of bias are there that can affect evaluations?

  • Shared bias
  • Design bias
  • Selectivity bias
  • Item bias

I’m sure there are others.  Knowing how these affect an evaluation is what I want to talk about.

Shared bias: Agreement among/between experts may be due to common error; often seen as conflict of interest; also seen in individual relationships.  For example, an external expert is asked to provide content validation for a nutrition education program that was developed by the evaluator’s sister-in-law.  The likelihood that they share the same opinion of the program is high.

Design bias: Designing an evaluation to favor (or disfavor) a certain target group in order to support the program being evaluated.  For example, selecting a sample of students enrolled in school on a day when absenteeism is high will result in a design bias against lower socio-economic students because absenteeism is usually higher among lower economic groups.

Selectivity bias: When a sample is inadvertently connected to desired outcomes, the evaluation will be affected.  This is similar to the design bias mentioned above.

Item bias: The construction of an individual item on an evaluation scale which adversely affects some subset of the target audience.  For example, a Southwest Native American child who has never seen a brook is presented with an item asking the child to identify a synonym for brook out of a list of words.  This raises the question about the objectivity of the scale as a whole.

Other types of bias that evaluators will experience include desired response bias and response shift bias.

Desired response bias occurs when the participant provides an answer that s/he thinks the evaluator wants to hear.  The responses the evaluator solicits are slanted towards what the participant thinks the evaluator wants to know.  It is often found with general positive bias–that is, the tendency to report positive findings when the program doesn’t merit those findings.  General positive bias  is often seen with grade inflation–an average student is awarded a B when the student has actually earned a C grade.

Response shift bias occurs when the participant changes his/her frame of reference from the beginning of the program to the end of the program and then reports an experience that is less than the experience perceived at the beginning of the program.  This makes the program appear less effective than it actually was.

A good friend of mine asked me today if I knew of any attributes (which I interpreted to be criteria) of qualitative data (NOT qualitative research).  My friend likened the quest for attributes for qualitative data to the psychometric properties of a measurement instrument–validity and reliability–that could be applied to the data derived from those instruments.

Good question.  How does this relate to program evaluation, you may ask.  That question takes us to an understanding of paradigm.

Paradigm (according to Scriven in Evaluation Thesaurus) is a general concept or model for a discipline that may be influential in shaping the development of that discipline.  Paradigms do not (again according to Scriven) define truth; rather, they define prima facie truth (i.e., truth on first appearance), which is not the same as truth.  Scriven goes on to say, “…eventually, paradigms are rejected as too far from reality and they are always governed by that possibility [i.e., that they will be rejected]” (page 253).

So why is it important to understand paradigms?  They frame the inquiry.  And evaluators are asking a question; that is, they are inquiring.

How inquiry is framed is based on the components of paradigm:

  • ontology–what is the nature of reality?
  • epistemology–what is the relationship between the known and the knower?
  • methodology–what is done to gain knowledge of reality, i.e., the world?

These beliefs shape how the evaluator sees the world and then guide the evaluator in the use of data, whether those data are derived from records, observations, interviews (i.e., qualitative data) or from measurements, scales, instruments (i.e., quantitative data).  Each paradigm guides the questions asked and the interpretations brought to the answers to those questions.  This is why paradigms matter to evaluation.

Denzin and Lincoln (2005), in the 3rd edition of the Handbook of Qualitative Research, list what they call interpretive paradigms.  They are described in Chapters 8 – 14 in that volume.  The paradigms are:

  1. Positivist/post positivist
  2. Constructivist
  3. Feminist
  4. Ethnic
  5. Marxist
  6. Cultural studies
  7. Queer theory

They indicate that each of these paradigms has criteria, a form of theory, and a specific type of narration or report.  If paradigms have criteria, then it makes sense to me that the data derived in the inquiry framed by those paradigms would have criteria.  Certainly, the psychometric properties of validity and reliability (stemming from the positivist paradigm) relate to data, usually quantitative.  It would make sense to me that the parallel, though different, concepts in the constructivist paradigm, trustworthiness and credibility, would apply to data derived from that paradigm–often qualitative data.

If that is the case–then evaluators need to be at least knowledgeable about paradigms.

In 1963, Campbell and Stanley (in their classic book, Experimental and Quasi-Experimental Designs for Research) discussed the retrospective pretest.  This is the method whereby the participant’s attitude, knowledge, skills, behaviors, etc. existing prior to and after the program are assessed together AFTER the program.  A novel approach to capturing what the participant knew, felt, and did before experiencing the program.

Does it work?  Yes…and no (according to the folks in the know).

Campbell and Stanley mention the use of the retrospective pretest in measuring attitudes toward Blacks (they use the term Negro) of soldiers who were assigned to racially mixed vs. all-white combat infantry units (1947) and in measuring housing project occupants’ attitudes toward being in integrated vs. segregated housing units when there was a housing shortage (1951).  Both tests showed no difference between the two groups in remembering prior attitudes toward the idea of interest.  Campbell and Stanley argue that with only posttest measures, any difference found may have been attributable to selection bias.  They caution readers to “…be careful to note that the probable direction of memory bias is to distort the past…into agreement with (the) present…or has come to believe to be socially desirable…”

This brings up several biases that the Extension professional needs to be concerned with in planning and conducting an evaluation: selection bias, desired response bias, and response shift bias.  All of which can have serious implications for the evaluation.

Those are technical words for several limitations which can affect any evaluation.  Selection bias is the preference to put some participants into one group rather than the other.  Campbell and Stanley call this bias a threat to validity.  Desired response bias occurs when participants try to answer the way they think the evaluator wants them to answer.  Response shift bias happens when participants’ frame of reference or understanding changes during the program, often due to misunderstanding or preconceived ideas.

So these are the potential problems.  Are there any advantages/strengths to using the retrospective pretest?  There are at least two.  First, there is only one administration, at the end of the program.  This is advantageous when the program is short and when participants do not like to fill out forms (that is, it minimizes paper burden).  And second, it avoids response-shift bias by not introducing information that may not be understood prior to the program.

Theodore Lamb (2005) tested the two methods and concluded that the two approaches appeared similar and recommended the retrospective pretest if conducting a pretest/posttest is  difficult or impossible.  He cautions, however, that supplementing the data from the retrospective pretest with other data is necessary to demonstrate the effectiveness of the program.
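Neither Campbell and Stanley nor Lamb prescribe a particular analysis, but one common way to look at retrospective pretest data is a paired comparison of each participant’s “then” and “now” ratings.  A sketch using SciPy’s Wilcoxon signed-rank test, with invented ratings:

    from scipy.stats import wilcoxon

    # Illustrative ratings from a single administration at the end of the
    # program: each participant rates knowledge "before the program" (then)
    # and "now," both on the same 1-5 scale.
    then_ratings = [2, 3, 2, 1, 3, 2, 2, 3]
    now_ratings = [4, 4, 3, 3, 5, 4, 3, 4]

    # Paired nonparametric test of whether "now" differs from "then".
    result = wilcoxon(then_ratings, now_ratings)
    print(result.statistic, result.pvalue)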

There is a vast array of information about this evaluation method.  If you would like to know more, let me know.

I was reading another evaluation blog (the American Evaluation Association’s blog AEA365) which talked about data base design.  I was reminded that over the years, almost every Extension professional with whom I have worked has asked me the following question: “What do I do with my data now that I have all my surveys back?”

As Leigh Wang points out in her AEA365 comments, “Most training programs and publication venues focus on the research design, data collection, and data analysis phases, but largely leave the database design phase out of the research cycle.”  The questions that this statement raises are:

  1. How do/did you learn what to do with data once you have it?
  2. How do/did you decide to organize it?
  3. What software do/did you use?
  4. How important is it to make the data accessible to colleagues in the same field?

I want to know the answers to those questions.  I have some ideas.  Before I talk about what I do, I want to know what you do.  Email me, or comment on this blog.

Last week, I talked about formative and summative evaluations.  Formative and summative  evaluation roles  can help you prioritize what evaluations you do when.  I was then reminded of another approach to viewing evaluation that relates to prioritizing evaluations that might also be useful.

When I first started in this work, I realized that I could view evaluation in three parts–process, progress, product.  Each part could be conducted or the total approach could be used.  This approach provides insights to different aspects of a program.  It can also provide a picture of the whole program.  Deciding on which part to focus is another way to prioritize an evaluation.

Process evaluation captures the HOW of a program.  Process evaluation has been defined as the evaluation that assesses the delivery of the program (Scheirer, 1994).  Process evaluation identifies what the program is and whether it is delivered as intended, both to the “right audience” and in the “right amount”.  The following questions (according to Scheirer) can guide a process evaluation:

  1. Why is the program expected to produce its results?
  2. For what types of people may it be effective?
  3. In what circumstances may it be effective?
  4. What are the day-to-day aspects of program delivery?

Progress evaluation captures the FIDELITY of a program–that is, did the program do what the planners said would be done in the time allotted? Progress evaluation has been very useful when I have grant activities and need to be accountable for the time-line.

Product evaluation captures a measure of the program’s products or OUTCOMES.  Sometimes outputs are also captured and this is fine.  Just keep in mind that while outputs may be (and often are) necessary, they are not sufficient for demonstrating the impact of the program.  A product evaluation is often summative.  However, it can also be formative, especially if the program planners want to gather information to improve the program rather than to determine the ultimate effectiveness of the program.

This framework may be useful in helping Extension professionals decide what to evaluate and when.  It may help determine which program needs a process, progress, or product evaluation.  Trying to evaluate all of your programs all at once often defeats being purposeful in your evaluation efforts and often leads to results that are confusing, invalid, and/or useless.  It makes sense to choose carefully what evaluation to do when–that is, prioritize.

A question was raised in a meeting this week about evaluation priorities and how to determine them.  This reminded me that perhaps a discussion of formative and summative was needed as knowing about these roles of evaluation will help you answer your questions about priorities.

Michael Scriven coined the terms formative and summative evaluation in the late 1960s.  Applying these terms to the role evaluation plays in a program has been and continues to be a useful distinction for investigators.  Simply put, formative evaluation provides information for program improvement.  Summative evaluation provides information to assist decision makers in making judgments about a program, typically for adoption, continuation, or expansion.  Both are important.

When Extension professionals evaluate a program at the end of a training or other program, typically they are gathering information for program improvement.  The data gathered after a program are for use by the program designers to help improve it.  Sometimes, Extension professionals gather outcome data at the end of a training or other program.  Here, information is gathered to help determine the effectiveness of the program.  These data are typically short term outcome data, and although they are impact data of a sort, they do not reflect the long term effectiveness of a program.  These data gathered to determine outcomes are summative.  In many cases, formative and summative data are gathered at the same time.

Summative data are also gathered to reflect the intermediate and long term outcomes.  As Ellen Taylor-Powell points out when she talks about logic models, impacts are the social, economic, civic, and/or environmental consequences of a program and tend to be longer term.  I find calling these outcomes condition changes helps me keep in mind that they are the consequences or impacts of a program and are gathered using a summative form of evaluation.

So how do you know which to use when?  Ask yourself the following questions:

  1. What is the purpose of the evaluation? Do you want to know if the program works or if the participants were satisfied?
  2. What are the circumstances surrounding the program? Is the program in its early development or late development? Are the politics surrounding the program challenging?
  3. What resources are available for the evaluation? Do you have a lot of time or only a few weeks?  Do you have access to people to help you or are you on your own?
  4. What accountability is required? Do you have to report about the effectiveness of a program or do you just have to offer it?
  5. What knowledge generation is expected or desired? Do you need to generate scholarship or support for promotion and tenure?

Think of the answers to these questions as a decision tree that will help you prioritize your evaluation.  Those answers will help you decide whether you are going to conduct a formative evaluation, a summative evaluation, or include components of both in your evaluation.

Last week, I talked briefly about what test to use to analyze your data.  Most of the evaluation work conducted by Extension professionals results in data from one group, often an intact group or population.  Today I want to talk about what you can do with those data.

One of the first things you can do is to run frequencies and percentages on these data.  In fact, I recommend you compute them as the first analyses you run.  Most software programs (SPSS, SAS, Excel, etc.) will do this for you.  When you run frequencies in SPSS, the computer returns an output that looks something like the first image:

When you compute frequencies in SAS, the resulting output looks like the second image:

Both images report frequencies, percentages of those frequencies, and cumulative percentage (that is, it adds the percents of frequency A to the percent of frequency B, etc. until 100% is reached).

To compute frequencies in Excel, read here.  Excel has a number of COUNT functions depending on what you want to know.
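For those working outside SPSS, SAS, or Excel, the same table of frequencies, percentages, and cumulative percentages can be produced with a few lines of Python; the responses below are invented for illustration.

    import pandas as pd

    # Illustrative answers to one question from a returned survey.
    responses = pd.Series(["agree", "agree", "neutral", "disagree",
                           "agree", "neutral", "agree"])

    freq = responses.value_counts()              # frequencies
    pct = (freq / freq.sum() * 100).round(1)     # percentages
    table = pd.DataFrame({"frequency": freq,
                          "percent": pct,
                          "cumulative percent": pct.cumsum()})
    print(table)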

Once you have computed frequencies and percentages, most people want to know if change occurred.  Although there are other analyses which can be performed (reliability, validity, correlation, prediction), all of these require that you know what type of data you have:

  • nominal–whether people’s answers named something (e.g., gender; marital status);
  • ordinal–whether people ordered their responses on how strongly they agreed (e.g., agree or disagree);
  • interval–the scores on a standardized scale (e.g., temperature or nutrition test).

If you have nominal data and you want to compute change, you need to know how many times participants are answering the questionnaire and how many categories you have in your questions (e.g., pre/post; yes/no).  If you are giving the questionnaire twice and there are two categories for some of your questions, you can compute the McNemar change test.  The McNemar change test is a non-parametric test (meaning that it makes no assumptions about population parameters) that is applied to a 2×2 contingency table.  It tests for changes in responses using the chi-square distribution and is useful for detecting changes in responses in “before-and-after” designs.  A 2×2 contingency table has two columns and two rows.  The frequencies from the nominal data are in the cells where the rows and columns cross; the totals for rows and columns are in the margins (the last row and the far right column).  SPSS computes the following statistics when a crosstabs test is run–Pearson’s Chi-Square, Continuity Correction, Likelihood Ratio, Fisher’s Exact Test, and Linear-by-Linear Association.  A McNemar test can be specified.
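As a sketch of how this looks in practice, the 2×2 table below holds invented counts of yes/no answers to the same question asked before and after a program; the statsmodels package provides a McNemar test.

    from statsmodels.stats.contingency_tables import mcnemar

    # 2x2 contingency table of pre/post yes-no answers to the same question.
    #              post: yes  post: no
    table = [[20, 5],    # pre: yes
             [12, 8]]    # pre: no

    # McNemar's test uses the off-diagonal cells (the people who changed).
    result = mcnemar(table, exact=True)
    print(result.statistic, result.pvalue)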