The topic of survey development seems to be  popping up everywhere–AEA365, Kirkpatrick Partners, eXtension Evaluation Community of Practice, among others.  Because survey development is so important to Extension faculty, I’m providing links and summaries.

 

 AEA365 says:

“… it is critical that you pre-test it with a small sample first.”  Real time testing helps eliminate confusion, improve clarity, and assures that you are asking a question that will give you an answer to what you want to know.  This is so important today when many surveys are electronic.

It is also important to “Train your data collection staff…Data collection staff are the front line in the research process.”  Since they are the people who will be collecting the data, they need to understand the protocols, the rationales, and the purposes of the survey.

Kirkpatrick Partners say:

“Survey questions are frequently impossible to answer accurately because they actually ask more than one question. ”  This is the biggest problem in constructing survey questions.  They provide some examples of asking more than one question.

 

Michael W. Duttweiler, Assistant Director for Program Development and Accountability at Cornell Cooperative Extension stresses the four phases of survey construction:

  1. Developing a Precise Evaluation Purpose Statement and Evaluation Questions
  2. Identifying and Refining Survey Questions
  3. Applying Golden Rules for Instrument Design
  4. Testing, Monitoring and Revising

He then indicates that the next three blog posts will cover point 2, 3, and 4.

Probably my favorite post on survey recently was one that Jane Davidson did back in August, 2012 in talking about survey response scales.  Her “boxers or briefs” example captures so many issues related to survey development.

Writing survey questions which give you useable data that answers your questions about your program is a challenge; it is not impossible.  Dillman writes the book about surveys; it should be on your desk.

Here is the Dillman citation:
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009).  Internet, mail, and mixed-mode surveys: The tailored design method.  Hoboken, NJ: John Wiley & Sons, Inc.

What is the difference between need to know and nice to know?  How does this affect evaluation?  I got a post this week on a blog I follow (Kirkpatrick) that talks about how much data does a trainer really need?  (Remember that Don Kirkpatrick developed and established an evaluation model for professional training back in the 1954 that still holds today.)

Most Extension faculty don’t do training programs per se, although there are training elements in Extension programs.  Extension faculty are typically looking for program impacts in their program evaluations.  Program improvement evaluations, although necessary, are not sufficient.  Yes, they provide important information to the program planner; they don’t necessarily give you information about how effective your program has been (i.e., outcome information). (You will note that I will use the term “impacts” interchangeably with “outcomes” because most Extension faculty parrot the language of reporting impacts.)

OK.  So how much data do you really need?  How do you determine what is nice to have and what is necessary (need) to have?  How do you know?

  1. Look at your logic model.  Do you have questions that reflect what you expect to have happen as a result of your program?
  2. Review your goals.  Review your stated goals, not the goals you think will happen because you “know you have a good program”.
  3. Ask yourself, How will I USE these data?  If the data will not be used to defend your program, you don’t need it.
  4. Does the question describe your target audience?  Although not demonstrating impact, knowing what your target audience looks like is important.  Journal articles and professional presentations want to know this.
  5. Finally, ask yourself, Do I really need to know the answer to this question or will it burden the participant.  If it is a burden, your participants will tend to not answer, then you  have a low response rate; not something you want.

Kirkpatrick also advises to avoid redundant questions.  That means questions asked in a number of ways and giving you the same answer; questions written in positive and negative forms.  The other question that I always include because it will give me a way to determine how my program is making a difference is a question on intention including a time frame.  For example, “In the next six months do you intend to try any of the skills you learned to day?  If so, which one.”  Mazmaniam has identified the best predictor of behavior change (a measure of making a difference) is stated intention to change.  Telling someone else makes the participant accountable.  That seems to make the difference.

 

Reference:

Mazmanian, P. E., Daffron, S. R., Johnson, R. E., Davis, D. A., & Kantrowits, M. P. (1998).   Information about barriers to planned change: A Randomized controlled trail involving continuing medical education lectures and commitment to change.  Academic Medicine, 73(8).

 

P.S.  No blog next week; away on business.

 

 

 

Quantitative data analysis is typically what happens to data that are numbers (although qualitative data can be reduced to numbers, I’m talking here about data that starts as numbers.)  Recently, a library colleague sent me an article that was relevant to what evaluators often do–analyze numbers.

So why, you ask, am I talking about an article that is directed to librarians?  Although that article is is directed at librarians, it has relevance to Extension.  Extension faculty (like librarians), more often than not, use surveys to determine the effectiveness of their programs.  Extension faculty are always looking to present the most powerful survey conclusions (yes, I lifted from the article title), and no you don’t need to have a doctorate in statistics to understand these analyses.  The other good thing about this article is that it provides you with a link to an online survey-specific software:  (Raosoft’s calculator at http://www.raosoft.com/samplesize.html).

This article refers specifically to three metrics that are often overlooked by Extension faculty:  margin of error (MoE), confidence level (CL), and cross-tabulation analysis.   These are three statistics which will help you in your work. The article also does a nice job of listing the eight recommended best practices which I’ve appended here with only some of the explanatory text.

 

Complete List of Best Practices for Analyzing Multiple Choice Surveys

1. Inferential statistical tests. To be more certain of the conclusions drawn from survey data, use inferential statistical tests.

2. Confidence Level (CL). Choose your desired confidence level (typically 90%, 95%, or 99%) based upon the purpose of your survey and how confident you need to be of the results. Once chosen, don’t change it unless the purpose of your survey changes. Because the chosen confidence level is part of the formula that determines the margin of error, it’s also important to document the CL in your report or article where you document the margin of error (MoE).

3. Estimate your ideal sample size before you survey. Before you conduct your survey use a sample size calculator specifically designed for surveys to determine how many responses you will need to meet your desired confidence level with your hypothetical (ideal) margin of error (usually 5%).

4. Determine your actual margin of error after you survey. Use a margin of error calculator specifically designed for surveys (you can use the same Raosoft online calculator recommended above).

5. Use your real margin of error to validate your survey conclusions for your larger population.

6. Apply the chi-square test to your crosstab tables to see if there are relationships among the variables that are not likely to have occurred by chance.

7. Reading and reporting chi-square tests of cross-tab tables.

  • Use the .05 threshold for your chi-square p-value results in cross-tab table analysis.
  • If the chi-square p-value is larger than the threshold value, no relationship between the variables is detected. If the p-value is smaller than the threshold value, there is a statistically valid relationship present, but you need to look more closely to determine what that relationship is. Chi-square tests do not indicate the strength or the cause of the relationship.
  • Always report the p-value somewhere close to the conclusion it supports (in parentheses after the conclusion statement, or in a footnote, or in the caption of the table or graph).

8. Document any known sources of bias or error in your sampling methodology and in your survey design in your report, including but not limited to how your survey sample was obtained.

 

Bottom line:  read the article.

Hightower, C. & Kelly, S. (2012, Spring).  Infer more, describe less: More powerful survey conclusions through easy inferential tests.  Issues in Science and Technology Librarianship. DOI:10.5062/F45H7D64. [Online]. Available at: http://www.istl.org/12-spring/article1.html

Matt Keene, AEAs thought leader for June 2012 says, “Wisdom, rooted in knowledge of thyself, is a prerequisite of good judgment. Everybody who’s anybody says so – Philo Judaeus,

Socrates,  Lao-tse,

Plotinus, Paracelsus,

Swami Ramdas,  and Hobbs.

I want to focus on the “wisdom is a prerequisite of good judgement” and talk about how that relates to evaluation.  I also liked the list of “everybody who’s anybody.”   (Although I don’t know who Matt means by Hobbs–is that Hobbes  or the English philosopher for whom the well known previous figure was named, Thomas Hobbes , or someone else that I couldn’t find and don’t know?)  But I digress…

 

“Wisdom is a prerequisite for good judgement.”  Judgement is used daily by evaluators.  It results in the determination of value, merit, and/or worth of something.  Evaluators make a judgement of value, merit, and/or worth.  We come to these judgements through experience.  Experience with people, activities, programs, contributions, LIFE.  Everything we do provides us with experience; it is what we do with that experience that results in wisdom and, therefore, leads to good judgements.

Experience is a hard teacher; demanding, exacting, and often obtuse.  My 19 y/o daughter is going to summer school at OSU.  She got approval to take two courses and for those courses to transfer to her academic record at her college.  She was excited about the subject; got the book; read ahead; and looked forward to class, which started yesterday.  After class, I had never seen a more disappointed individual.  She found the material uninteresting (it was mostly review because she had read ahead), she found the instructor uninspiring (possibly due to class size of 35).  To me, it was obvious that she needed to re-frame this experience into something positive; she needed to find something she could learn from this experience that would lead to wisdom.  I suggested that she think of this experience as a cross cultural exchange; challenging because of cultural differences.  In truth, a large state college is very different from a small liberal arts college; truly a different culture.  She has four weeks to pull some wisdom from this experience; four weeks to learn how to make a judgement that is beneficial.  I am curious to see what happens.

Not all evaluations result in beneficial judgements; often, the answer, the judgement, is NOT what the stakeholders want to hear.  When that is the case, one needs to re-frame the experience so that learning occurs (both for the individual evaluator as well as the stakeholders) so that the next time the learning, the hard won wisdom, will lead to “good” judgement, even if the answer is not what the stakeholders want to hear.  Matt started his discussion with the saying that “wisdom, rooted in knowledge of self, is a prerequisite for good judgement”.  Knowing your self is no easy task; you can only control what you say, what you do, and how your react (a form of doing/action).  The study of those things is a life long adventure, especially when you consider how hard it is to change yourself.  Just having knowledge isn’t enough for a good judgement; the evaluator needs to integrate that knowledge into the self and own it; then the result will be “good judgements”; the result will be wisdom.

I started this post back in April.  I had an idea that needed to be remembered…it had to do with the unit of analysis; a question which often occurs in evaluation.  To increase sample size and, therefore,  power, evaluators often choose run analyses on the larger number when the aggregate, i.e., smaller number is probably the “true” unit of analysis.  Let me give you an example.

A program is randomly assigned to fifth grade classrooms in three different schools.  School A has three classrooms; school B has two classrooms; and school C has one classroom.  All together, there are approximately 180 students, six classrooms, three schools.  What is the appropriate unit of analysis?  Many people use students, because of the sample size issue.  Some people will use classroom because each got a different treatment.  Occasionally, some evaluators will use schools because that is the unit of randomization.  This issue elicits much discussion.  Some folks say that because students are in the school, they are really the unit of analysis because they are imbedded in the randomization unit.  Some folks say that students is the best unit of analysis because there are more of them.  That certainly is the convention.  What you need to decide is what is the unit and be able to defend that choice.  Even though I would loose power, I think I would go with the the unit of randomization.  Which leads me to my next point–truth.

At the end of the first paragraph, I use the words “true” in quotation marks. The Kirkpatricks in their most recent blog opened with a quote from the US CIA headquarters in Langley Virginia, “”And ye shall know the truth, and the truth shall make you free”.   (We wont’ talk about the fiction in the official discourse, today…)   (Don Kirkpatrick developed the four levels of evaluation specifically in the training and development field.)  Jim Kirkpatrick, Don’s son, posits that, “Applied to training evaluation, this statement means that the focus should be on discovering and uncovering the truth along the four levels path.”  I will argue that the truth is how you (the principle investigator, program director, etc.) see the answer to the question.  Is that truth with an upper case “T” or is that truth with a lower case “t”?  What do you want it to mean?

Like history (history is what is written, usually by the winners, not what happened), truth becomes what do you want the answer to mean.  Jim Kirkpatrick offers an addendum (also from the CIA), that of “actionable intelligence”.  He goes on to say that, “Asking the right questions will provide data that gives (sic) us information we need (intelligent) upon which we can make good decisions (actionable).”  I agree that asking the right question is important–probably the foundation on which an evaluation is based.  Making “good decisions”  is in the eyes of the beholder–what do you want it to mean.

I had a conversation (ok–an electronic conversation) with colleagues a few weeks ago.  The conversation was about the use of inappropriate analyses for manuscripts being submitted.  The specific question raised was about the t-test and went something like this:

Should a t-test be used on a sample that has not been RANDOMLY drawn from a population.  If the sample is not randomly drawn, or if the entire population is used (as is often the case with Extension evaluations), then a t-test is NOT appropriate.

What exactly is the appropriate test?  First, one has to identify if the underlying assumptions have been met; if not, a parametric, i.e., ttest, is not appropriate.

So what exactly are the assumptions underlying the use of a t-test?

Glass and Stanley (1970, pg. 297) [a classic statistics text in the field of education and psychology and there may be a more recent edition than the one I have on my shelf] say that for dependent samples (the kind most often used by Extension professionals ), the sample:

  1. is normally distributed;
  2. has homogeneous variances (i.e., same spread); and
  3. is RANDOMLY drawn from a population.

Some would add that the scale of measurement needs to result in interval or ratio data (not nominal–ordinal is a questionable case) and have a sample size of 30 or over.

This presents a quandary, to say the least.  Extension professionals know that journals want a probability value (if the study is quantitative) and how do you get a probability value with out a t-test?

Answer:  Use a nonparametric equivalent test.

The nonparametric equivalent for a test of dependent means is a Wilcoxon matched pair test.

Marascuilo and McSweeney (1977, pg 5) say that that the researcher needs to select “…a test for which the power of rejection is maximized when the hypothesis is tested false.”  They go on to say, “If the data adhere to the assumptions required for a classical, normally based t or F test, a researcher would be foolish…not (to) use them, since they are optimum, when justified.”  Justification is the key here.  And the justification is: Do the data meet the assumptions for the test?

Their bottom line is that a researcher should never think that a nonparametric test is exclusively a substitute for a parametric test.  It is not.  Use the right test for the hypothesis being tested.  It may be (probably is) a nonparametric test.

 

Citations

Glass, G. V., & Stanley, J. C. (1970). Statistical Methods in Education and Psychology.  Englewood Cliffs, NJ: Prentice-Hall, Inc.

Marascuilo, L. A. & McSweeney, M. (1977).  Nonparametric and Distribution-free Methods for the Social Sciences. Monterey, CA: Brooks/Cole Publishing.

 

 

A colleague asks for advice on handling evaluation stories, so that they don’t get brushed aside as mere anecdotes.  She goes on to say of the AEA365 blog she read, ” I read the steps to take (hot tips), but don’t know enough about evaluation, perhaps, to understand how to apply them.”  Her question raises an interesting topic.  Much of what Extension does can be captured in stories (i.e., qualitative data)  rather than in numbers (i.e., quantitative data).  Dick Krueger, former Professor and Evaluation Leader (read specialist) at the University of Minnesota has done a lot of work in the area of using stories as evaluation.  Today’s post summarizes his work.

 

At the outset, Dick asks the following question:  What is the value of stories?  He provides these three answers:

  1. Stories make information easier to remember
  2. Stories make information more believable
  3. Stories can tap into emotions.

There are all types of stories.  The type we are interested in for evaluation purposes are organizational stories.  Organizational stories can do the following things for an organization:

  1. Depict culture
  2. Promote core values
  3. Transmit and reinforce the culture
  4. Provide instruction to employees
  5. Motivate, inspire, and encourage

He suggests six common types of organizational stories:

  1. Hero stories  (someone in the organization who has done something beyond the normal range of achievement)
  2. Success stories (highlight organizational successes)
  3. Lessons learned stories (what major mistakes and triumphs teach the organization)
  4. “How it works around here” stories (highlight core organizational values reflected in actual practice
  5. “Sacred bundle” stories (a collection of stories that together depict the culture of an organization; core philosophies)
  6. Training and orientation stories (assists new employees in understanding how the organization works)

To use stories as evaluation, the evaluator needs to consider how stories might be used, that is, do they depict how people experience the program?  Do they understand program outcomes?  Do they get insights into program processes?

You (as evaluator) need to think about how the story fits into the evaluation design (think logic model; program planning).  Ask yourself these questions:  Should you use stories alone?  Should you use stories that lead into other forma of inquiry?  Should you use stories that augment/illustrate results from other forms of inquiry?

You need to establish criteria for stories.  Rigor can be applied to story even though the data are narrative.  Criteria include the following:   Is the story authentic–is it truthful?  Is the story verifiable–is there a trail of evidence back to the source of the story?  Is there a need to consider confidentiality?  What was the original intent–purpose behind the original telling?  And finally, what does the story represent–other people or locations?

You will need a plan for capturing the stories.  Ask yourself these questions:  Do you need help capturing the stories?  What strategy will you use for collecting the stories?  How will you ensure documentation and record keeping?  (Sequence the questions; write them down the type–set up; conversational; etc.)  You will also need a plan for analyzing and reporting the stories  as you, the evaluator,  are responsible for finding meaning.

 

A colleague asked me yesterday about authenticating anecdotes–you know–those wonderful stories you gather about how what you’ve done has made a difference in someones life?

 

I volunteer service to a non-profit board (two, actually) and the board members are always telling stories about how “X has happened” and how “Y was wonderful” yet,  my evaluator self says, “How do you know?”  This becomes a concern for organizations which do not have evaluation as part of their mission statement.  Evan though many boards hold accountable the Executive Director, few make evaluation explicit.

Dick Krueger  , who has written about focus groups, also writes and studies the use of stories in evaluation and much of what I will share with y’all today is from his work.

First, what is a story?  Creswell (2007, 2 ed.) defines story as “…aspects that surface during an interview in which the participant describes a situation, usually with a beginning, a middle, and an end, so that the researcher can capture a complete idea and integrate it, intact, into the qualitative narrative”.  Krueger elaborates on that definintion by saying that a story “…deals with an experience of an event, program, etc. that has a point or a purpose.”  Story differs from case study in that case study is a story that tries to understand a system, not an individual event or experience; a story deals with an experience that has a point.  Stories provide examples of core philosophies, of significant events.

There are several purposes for stories that can be considered evaluative.  These include depicting the culture, promoting core values, transmitting and reinforcing current culture, providing instruction (another way to transmit culture), and motivating, inspiring, and/or encouraging (people).  Stories can be of the following types:  hero stories, success stories, lesson-learned stories, core value  stories, cultural stories, and teaching stories.

So why tell a story?  Stories make information easier to remember, more believable, and tap into emotion.  For stories to be credible (provide authentication), an evaluator needs to establish criteria for stories.  Krueger suggests five different criteria:

  • Authentic–is it truthful?  Is there truth in the story?  (Remember “truth” depends on how you look at something.)
  • Verifiable–is there a trail of evidence back to the source?  Can you find this story again?
  • Confidential–is there a need to keep the story confidential?
  • Original intent–what is the basis for the story?  What motivated telling the story? and
  • Representation–what does the story represent?  other people?  other locations?  other programs?

Once you have established criteria for the stories collected, there will need to be some way to capture stories.  So develop a plan.  Stories need to be willingly shared, not coerced; documented and recorded; and collected in a positive situation.  Collecting stories is an example where the protections for  humans in research must be considered.  Are the stories collected confidentially?  Does telling the stories result in little or no risk?  Are stories told voluntarily?

Once the stories have been collected, analyzing and reporting those stories is the final step.  Without this, all the previous work  was for naught.  This final step authenticates the story.  Creswell provides easily accessible guidance for analysis.

…that there is a difference between a Likert item and a Likert scale?**

Did you know that a Likert item was developed by Rensis Likert, a psychometrician and an educator? 

And that the item was developed to have the individual respond to the level of agreement or disagreement with a specific phenomenon?

And did you know that most of the studies on Likert items use a five- or seven-points on the item? (Although sometimes a four- or six-point scale is used and that is called a forced-choice approach–because you really want an opinion, not a middle ground, also called a neutral ground.)

And that the choices in an odd-number choice usually include some variation on the following theme, “Strongly disagree”, “Disagree”, “Neither agree or disagree”, “Agree”, “Strongly Agree”?

And if you did, why do you still write scales, and call them Likert, asking for information using a scale that goes from “Not at all” to “A little extent” to “Some extent” to “Great extent?  Responses that are not even remotely equidistant (that is, have equal intervals with respect to the response options) from each other–a key property of a Likert item.

And why aren’t you using a visual analog scale to get at the degree of whatever the phenomenon is being measured instead of an item for which the points on the scale are NOT equidistant? (For more information on a visual analog scale see a brief description here or Dillman’s book.)

I sure hope Rensis Likert isn’t rolling over in his grave (he died in 1981 at the age of 78).

Extension professionals use survey as the primary method for data gathering.  The choice of survey is a defensible one.  However, the format of the survey, the question content, and the question construction must also be defensible.  Even though psychometric properties (including internal consistency, validity, and other statistics) may have been computed, if the basic underlying assumptions are violated, no psychometric properties will compensate for a poorly designed instrument, an instrument that is not defensible.

All Extension professionals who choose to use survey to evaluate their target audiences need to have scale development as a personal competency.  So take it upon yourself to learn about guidelines for scale development (yes, there are books written on the subject!).

<><><><><>

**Likert scale is the SUM of of responses on several Likert items.  A Likert item is just one 4 -, 5-, 6, or 7-point single statement asking for an opinion.

Reference:  Devellis, R. F. (1991).  Scale development:  Theory and applications. Newbury Park: Sage Publications. Note:  there is a newer edition.

Dillman, D. A, Smyth, J. D., & Christian, L. M. (2009).  Internet, mail, and mixed-mode surveys:  The tailored design method. (3rd ed.). Hoboken, NJ: John Wiley& Sons, Inc.

Last week, I spoke about how to questions  and applying them  to program planning, evaluation design, evaluation implementation, data gathering, data analysis, report writing, and dissemination.  I only covered the first four of those topics.  This week, I’ll give you my favorite resources for data analysis.

This list is more difficult to assemble.  This is typically where the knowledge links break down and interest is lost.  The thinking goes something like this.  I’ve conducted my program, I’ve implemented the evaluation, now what do I do?  I know my program is a good program so why do I need to do anything else?

YOU  need to understand your findings.  YOU need to be able to look at the data and be able to rigorously defend your program to stakeholders.  Stakeholders need to get the story of your success in short clear messages.  And YOU need to be able to use the findings in ways that will benefit your program in the long run.

Remember the list from last week?  The RESOURCES for EVALUATION list?  The one that says:

1.  Contact your evaluation specialist.

2.  Listen to stakeholders–that means including them in the planning.

3.  Read

Good.  This list still applies, especially the read part.  Here are the readings for data analysis.

First, it is important to know that there are two kinds of data–qualitative (words) and quantitative (numbers).  (As an aside, many folks think words that describe are quantitative data–they are still words even if you give them numbers for coding purposes, so treat them like words, not numbers).

  • Qualitative data analysis. When I needed to learn about what to do with qualitative data, I was given Miles and Huberman’s book.  (Sadly, both authors are deceased so there will not be a forthcoming revision of their 2nd edition, although the book is still available.)

Citation: Miles, M. B., & Huberman, A. Michael. (1994). Qualitative data analysis: An expanded source book. Thousand Oaks, CA: Sage Publications.

Fortunately, there are newer options, which may be as good.  I will confess, I haven’t read them cover to cover at this point (although they are on my to-be-read pile).

Citation:  Saldana, J.  (2009). The coding manual for qualitative researchers. Los Angeles, CA: Sage.

Bernard, H. R. & Ryan, G. W. (2010).  Analyzing qualitative data. Los Angeles, CA: Sage.

If you don’t feel like tackling one of these resources, Ellen Taylor-Powell has written a short piece  (12 pages in PDF format) on qualitative data analysis.

There are software programs for qualitative data analysis that may be helpful (Ethnograph, Nud*ist, others).  Most people I know prefer to code manually; even if you use a soft ware program, you will need to do a lot of coding manually first.

  • Quantitative data analysis. Quantitative data analysis is just as complicated as qualitative data analysis.  There are numerous statistical books which explain what analyses need to be conducted.  My current favorite is a book by Neil Salkind.

Citation: Salkind, N. J. (2004).  Statistics for people who (think they) hate statistics. (2nd ed. ). Thousand Oaks, CA: Sage Publications.

NOTE:  there is a 4th ed.  with a 2011 copyright available. He also has a version of this text that features Excel 2007.  I like Chapter 20 (The Ten Commandments of Data Collection) a lot.  He doesn’t talk about the methodology, he talks about logistics.  Considering the logistics of data collection is really important.

Also, you need to become familiar with a quantitative data analysis software program–like SPSS, SAS, or even Excel.  One copy goes a long way–you can share the cost and share the program–as long as only one person is using it at a time.  Excel is a program that comes with Microsoft Office.  Each of these has tutorials to help you.