Ever wonder where the 0.05 probability level came from? Ever wonder whether it is the best number? How many of you were taught in your introduction to statistics course that 0.05 is the probability level necessary for rejecting the null hypothesis of no difference? That confidence may be spurious. As Paul Bakker indicates in the AEA 365 blog post for March 28, “Before you analyze your data, discuss with your clients and the relevant decision makers the level of confidence they need to make a decision.” Do they really need to be 95% confident? Or would 90% confidence be sufficient? What about 75% or even 55%?
Think about it for a minute. If you were a brain surgeon, you wouldn’t want anything less than 99.99% confidence; if you were weighing the risk of a stock market investment, 55% would probably make you a lot of money. The academic community has held to and used the probability level of 0.05 for years (the computation of the p-value dates back to the 1770s). Quoting Wikipedia: “In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.” Fisher first proposed the 0.05 level in 1925, establishing a one-in-20 limit for statistical significance when considering a two-tailed test. Sometimes the academic community makes the probability level even more restrictive, using 0.01 or 0.001 to demonstrate that the findings are significant. Scientific journals typically expect 95% confidence, that is, a probability level of at most 0.05.
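Laplace’s conclusion can be sketched with a simple test of a proportion. Below is a minimal normal-approximation version in Python; the birth counts are illustrative stand-ins, not Laplace’s actual figures.

```python
import math

def two_sided_p_value(successes, n, p0=0.5):
    """Two-sided p-value for H0: true proportion equals p0,
    using the normal approximation to the binomial."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)          # standard error under H0
    z = (p_hat - p0) / se
    return math.erfc(abs(z) / math.sqrt(2))    # both tails of the standard normal

# Illustrative counts: 251,000 boys out of 493,000 births.
# A 50.9% share of boys in a sample this large is wildly
# inconsistent with a true 50/50 split.
p = two_sided_p_value(251_000, 493_000)
print(p < 0.05)  # True
```

With half a million observations, even a one-percentage-point excess produces a vanishingly small p-value, which is why Laplace could call the effect real rather than chance.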
Although I have held to these levels, especially when I publish a manuscript, I have often wondered if they make sense. If I am only curious about a difference, do I need 0.05? Or could I use 0.10 or 0.15 or even 0.20? I have often asked students whether they are conducting confirmatory or exploratory research. Confirmatory research expects a more stringent probability level; exploratory research can tolerate a less stringent one. The 0.05 seems so arbitrary.
Then there is the grounded theory approach which doesn’t use a probability level. It generates theory from categories which are generated from concepts which are identified from data, usually qualitative in nature. It uses language like fit, relevance, workability, and modifiability. It does not report statistically significant probabilities as it doesn’t use inferential statistics. Instead, it uses a series of probability statements about the relationships between concepts.
So what do we do? What do you do? Let me know.
These three questions have buzzed around my head for a while in various formats.
When I attend a conference, I wonder.
When I conduct a program, I wonder, again.
When I explore something new, I am reminded that perhaps someone else has been here and wonder, yet again.
After all, don’t both of these ideas (capacity building and engagement) relate to a “foreign country” and a different culture?
How does all this relate to evaluation? Read on…
Premise: Evaluation is an everyday activity. You evaluate every day, all the time; you call it making decisions. Every time you make a decision, you are building capacity in your ability to evaluate. Sure, some of those decisions may need to be revised. Sure, some of those decisions may just yield “negative” results. Even so, you are building capacity. AND you share that knowledge--with your children (if you have them), with your friends, with your colleagues, with the random shopper in the (grocery) store. That is building capacity. Building capacity can be systematic, organized, sequential. Sometimes formal, scheduled, deliberate. It is sharing “What do I know that they don’t know?” in the hope that they, too, will know it and use it.
Premise: Everyone knows something. In knowing something, evaluation happens--because people made decisions about what is important and what is not. To really engage (not just outreach, which much of Extension does), one needs to “do as” the group being engaged. To do anything else (“doing to” or “doing with”) is simply outreach, and little or no knowledge is exchanged. That doesn’t mean knowledge isn’t distributed; Extension has been doing that for years. It just means the assumption (and you know what assumptions do) is that only the expert can distribute knowledge. Who is to say that the group (target audience, participants) isn’t expert in at least part of what is being communicated? They probably are. It is the idea that they know something that I don’t know (and I would benefit from knowing).
Premise: Everything and everyone is connected. Being prepared is the best way to learn something. Being prepared by understanding culture (I’m not talking only about the intersection of race and gender; I’m talking about all the stereotypes you carry with you all the time) reinforces connections. Learning about other cultures (something everyone can do) helps dispel stereotypes and mitigate stereotype threats. And that is an evaluative task. Think about it. I think it captures the “What do all of us need to know that few of us know?” question.
The topic of survey development seems to be popping up everywhere–AEA365, Kirkpatrick Partners, eXtension Evaluation Community of Practice, among others. Because survey development is so important to Extension faculty, I’m providing links and summaries.
“… it is critical that you pre-test it with a small sample first.” Real-time testing helps eliminate confusion, improve clarity, and ensure that you are asking a question that will give you an answer to what you want to know. This is especially important today, when many surveys are electronic.
It is also important to “Train your data collection staff…Data collection staff are the front line in the research process.” Since they are the people who will be collecting the data, they need to understand the protocols, the rationales, and the purposes of the survey.
Kirkpatrick Partners say:
“Survey questions are frequently impossible to answer accurately because they actually ask more than one question.” This is the biggest problem in constructing survey questions. They provide some examples of asking more than one question.
Michael W. Duttweiler, Assistant Director for Program Development and Accountability at Cornell Cooperative Extension stresses the four phases of survey construction:
He then indicates that the next three blog posts will cover points 2, 3, and 4.
Probably my favorite recent post on surveys was one Jane Davidson wrote back in August 2012 about survey response scales. Her “boxers or briefs” example captures so many issues related to survey development.
Writing survey questions that give you usable data answering your questions about your program is a challenge; it is not impossible. Dillman wrote the book on surveys; it should be on your desk.
Here is the Dillman citation:
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-mode surveys: The tailored design method. Hoboken, NJ: John Wiley & Sons, Inc.
Quantitative data analysis is typically what happens to data that are numbers (although qualitative data can be reduced to numbers, I’m talking here about data that start as numbers). Recently, a library colleague sent me an article that was relevant to what evaluators often do--analyze numbers.
So why, you ask, am I talking about an article that is directed to librarians? Although the article is directed at librarians, it has relevance to Extension. Extension faculty (like librarians), more often than not, use surveys to determine the effectiveness of their programs. Extension faculty are always looking to present the most powerful survey conclusions (yes, I lifted that from the article title), and no, you don’t need a doctorate in statistics to understand these analyses. The other good thing about this article is that it provides a link to an online survey-specific sample size calculator (Raosoft’s calculator at http://www.raosoft.com/samplesize.html).
This article refers specifically to three metrics that are often overlooked by Extension faculty: margin of error (MoE), confidence level (CL), and cross-tabulation analysis. These are three statistics which will help you in your work. The article also does a nice job of listing the eight recommended best practices which I’ve appended here with only some of the explanatory text.
1. Inferential statistical tests. To be more certain of the conclusions drawn from survey data, use inferential statistical tests.
2. Confidence Level (CL). Choose your desired confidence level (typically 90%, 95%, or 99%) based upon the purpose of your survey and how confident you need to be of the results. Once chosen, don’t change it unless the purpose of your survey changes. Because the chosen confidence level is part of the formula that determines the margin of error, it’s also important to document the CL in your report or article where you document the margin of error (MoE).
3. Estimate your ideal sample size before you survey. Before you conduct your survey, use a sample size calculator specifically designed for surveys to determine how many responses you will need to meet your desired confidence level with your hypothetical (ideal) margin of error (usually 5%).
4. Determine your actual margin of error after you survey. Use a margin of error calculator specifically designed for surveys (you can use the same Raosoft online calculator recommended above).
5. Use your real margin of error to validate your survey conclusions for your larger population.
6. Apply the chi-square test to your crosstab tables to see if there are relationships among the variables that are not likely to have occurred by chance.
7. Reading and reporting chi-square tests of cross-tab tables.
8. Document any known sources of bias or error in your sampling methodology and in your survey design in your report, including but not limited to how your survey sample was obtained.
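Best practices 3 and 4 above can also be done without a web calculator. The sketch below uses the standard formulas for a proportion with a finite-population correction, which is, as far as I can tell, what survey calculators like Raosoft’s implement; treat it as an illustration, not a replacement for the article’s recommendations.

```python
import math

# z-scores for the confidence levels mentioned in practice 2
Z = {90: 1.645, 95: 1.96, 99: 2.576}

def ideal_sample_size(population, cl=95, margin=0.05, p=0.5):
    """Practice 3: responses needed before you survey (finite-population
    correction applied; p=0.5 is the most conservative assumption)."""
    n0 = Z[cl] ** 2 * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def actual_margin_of_error(responses, population, cl=95, p=0.5):
    """Practice 4: margin of error for the responses you actually got."""
    se = math.sqrt(p * (1 - p) / responses)
    fpc = math.sqrt((population - responses) / (population - 1))
    return Z[cl] * se * fpc

needed = ideal_sample_size(500)            # e.g., a 500-person mailing list
moe = actual_margin_of_error(needed, 500)  # should land near the 5% target
print(needed, round(moe, 3))
```

If you get fewer responses than the ideal, rerun the second function with the actual count (practice 5): the margin of error widens, and your conclusions should be reported with that wider margin.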
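Practices 6 and 7 (chi-square on cross-tab tables) can be illustrated in a few lines as well. This is the standard Pearson chi-square computation; the satisfaction-by-site table is hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical cross-tab: rows are two program sites,
# columns are "satisfied" / "not satisfied" counts.
stat, df = chi_square([[30, 20], [10, 40]])
print(round(stat, 2), df)  # 16.67 1
```

With 1 degree of freedom, the 0.05 critical value is 3.84, so a statistic of 16.67 indicates a relationship between site and satisfaction that is unlikely to have occurred by chance.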
Bottom line: read the article.
Hightower, C. & Kelly, S. (2012, Spring). Infer more, describe less: More powerful survey conclusions through easy inferential tests. Issues in Science and Technology Librarianship. DOI:10.5062/F45H7D64. [Online]. Available at: http://www.istl.org/12-spring/article1.html
A colleague asks, “What is the appropriate statistical analysis test when comparing means of two groups?”
I’m assuming (yes, I know what assuming does) that parametric tests are appropriate for what the colleague is doing. Parametric tests (e.g., t-test, ANOVA) are appropriate when the population can be assumed to follow a known distribution (typically the normal distribution). If that is the case (and non-parametric tests are not being considered), I need to clarify the assumptions underlying the use of parametric tests, which are more stringent than those for non-parametric tests. Those assumptions are the following:
The sample is randomly drawn from the population, the variable of interest is approximately normally distributed, and, when groups are compared, the group variances are equal.
If those assumptions are met, part of the answer is, “It all depends.” (I know you have heard that before today.)
I will ask the following questions: Are you working with a sample or an intact population? How many groups are you comparing? Are the groups related (like a pre/post test) or independent?
Once I know the answers to these questions I can suggest a test.
My current favorite statistics book, Statistics for People Who (Think They) Hate Statistics, by Neil J. Salkind (4th ed.) has a flow chart that helps you by asking if you are looking at differences between the sample and the population and relationships or differences between one or more groups. The flow chart ends with the name of a statistical test. The caveat is that you are working with a sample from a larger population that meets the above stated assumptions.
How you answer the questions above also determines what test you can use. If you do not know the parameters, you will NOT use a parametric test. If you are using an intact population (and many Extension professionals use intact populations), you will NOT use inferential statistics, as you will not be inferring to anything bigger than what you have at hand. If you have two groups and the groups are related (like a pre-post test or a post-pre test), you will use a parametric or non-parametric test for dependency. If you have two groups and they are unrelated (like boys and girls), you will use a parametric or non-parametric test for independence. If you have more than two groups, you will use a different test altogether.
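The decision logic in the paragraph above can be sketched as a small function. The test names are the standard parametric tests and their usual non-parametric counterparts; the function itself is my own simplification, not Salkind’s flow chart.

```python
def suggest_test(intact_population, num_groups, related, parametric_ok):
    """Rough sketch of the test-selection logic described above."""
    if intact_population:
        # Nothing to infer to: describe, don't infer.
        return "descriptive statistics only"
    if num_groups == 2:
        if related:   # e.g., pre/post on the same people
            return "paired t-test" if parametric_ok else "Wilcoxon signed-rank test"
        return "independent t-test" if parametric_ok else "Mann-Whitney U test"
    # More than two groups
    return "one-way ANOVA" if parametric_ok else "Kruskal-Wallis test"

print(suggest_test(False, 2, True, True))    # paired t-test
print(suggest_test(False, 3, False, False))  # Kruskal-Wallis test
```

The point is not the code but the order of the questions: population versus sample first, then number of groups, then dependence, then whether parametric assumptions hold.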
Extension professionals are rigorous in their content material; they need to be just as rigorous in their analysis of the data collected about that content. Understanding what analyses to use when is a good skill to have.
I came across this quote from Viktor Frankl today (thanks to a colleague):
“…everything can be taken from a man (sic) but one thing: the last of the human freedoms – to choose one’s attitude in any given set of circumstances, to choose one’s own way.” Viktor Frankl (Man’s Search for Meaning – p.104)
I realized that, especially at this time of year, attitude is everything–good, bad, indifferent–the choice is always yours.
How we choose to approach anything depends upon our previous experiences--what I call personal and situational bias. Sadler* has three classifications for these biases. He calls them value inertias (unwanted distorting influences which reflect background experience), ethical compromises (actions for which one is personally culpable), and cognitive limitations (not knowing, for whatever reason).
When we approach an evaluation, our attitude leads the way. If we are reluctant, if we are resistant, if we are excited, if we are uncertain, all these approaches reflect where we’ve been, what we’ve seen, what we have learned, what we have done (or not). We can make a choice how to proceed.
The American Evaluation Association (AEA) has long had a history of supporting difference. That value is embedded in the guiding principles. The two principles which address supporting differences are:
AEA also has developed a Cultural Competence statement. In it, AEA affirms that “A culturally competent evaluator is prepared to engage with diverse segments of communities to include cultural and contextual dimensions important to the evaluation. Culturally competent evaluators respect the cultures represented in the evaluation.”
Both of these documents provide a foundation for the work we do as evaluators as well as relating to our personal and situational bias. Considering them as we enter into the choice we make about attitude will help minimize the biases we bring to our evaluation work. The evaluative question from all this–When has your personal and situational biases interfered with you work in evaluation?
Attitude is always there–and it can change. It is your choice.
Sadler, D. R. (1981). Intuitive data processing as a potential source of bias in naturalistic evaluations. Educational Evaluation and Policy Analysis, 3, 25-31.
I am reading the book, Eaarth, by Bill McKibben (a NY Times review is here). He writes about making a difference in the world on which we live. He provides numerous examples that have all happened in the 21st century, none of them positive or encouraging. He makes the point that the place in which we live today is not, and never will be again, like the place in which we lived when most of us were born. He talks about not saving the Earth for our grandchildren but rather how our parents needed to have done things to save the earth for them–that it is too late for the grandchildren. Although this book is very discouraging, it got me thinking.
Isn’t making a difference what we as Extension professionals strive to do?
Don’t we, like McKibben, need criteria to determine what difference can/could/would be made and what it would look like?
And if we have those criteria well established, won’t we be able to make a difference, hopefully a positive one (think hand washing here)? And won’t that difference be worth the effort we have put into the attempt? Especially if we thoughtfully plan how to determine what that difference is?
We might not be able to recover (according to McKibben, we won’t) the Earth the way it was when most of us were born; I think we can still make a difference–a positive difference–in the lives of the people with whom we work. That is an evaluative opportunity.
I was talking with a colleague about evaluation capacity building (see last week’s post) and the question was raised about thinking like an evaluator. Got me thinking about the socialization of professions and what has to happen to build a critical mass of like-minded people.
Certainly, preparatory programs in academia, taught by experts who have worked in the field a long time (or at least longer than you), start the process. Professional development helps, too: attending meetings where evaluators gather, like the upcoming AEA conference, U.S. regional affiliates (there are many, and they hold conferences and meetings, too), and international organizations (increasing in number, which also host conferences and professional development sessions); let me know if you want to know more about these opportunities. Reading new and timely literature on evaluation provides insights into the language. AND look for the evaluative questions in everyday activities. Questions such as: What criteria? What standards? Which values? What worth? Which decisions?
The socialization of evaluators happens because people who are interested in being evaluators look for the evaluation questions in everything they do. Sometimes, looking for the evaluative question is easy and second nature–like choosing a can of corn at the grocery store; sometimes it is hard and demands collaboration–like deciding on the effectiveness of an educational program.
My recommendation is to start with easy things--corn, chocolate chip cookies, wine, tomatoes; then move to harder things with more variables--what to wear when and where, or whether to include one group or another. The choices you make will all depend upon what criteria are set, what standards have been agreed upon, and what value you place on the outcome or what decision you make.
The socialization process is like a puzzle, something that takes a while to complete, something that is different for everyone, yet ultimately the same. The socialization is not unlike evaluation…pieces fitting together–criteria, standards, values, decisions. Asking the evaluative questions is an ongoing fluid process…it will become second nature with practice.
Hopefully, the technical difficulties with images are no longer a problem and I will be able to post the answers to the history quiz and the post I had hoped to post last week. So, as promised, here are the answers to the quiz I posted the week of July 5. The keyed responses are in BOLD.
7. James W. Altschuld is the go-to person for needs assessment. He is the editor of the Needs Assessment Kit (or everything you wanted to know about needs assessment and didn’t know where to find the answer). He is also the co-author, with Belle Ruth Witkin, of two needs assessment books.
11. Ellen Taylor-Powell, the former Evaluation Specialist at University of Wisconsin Extension Service, is credited with developing the logic model later adopted by the USDA for use by the Extension Service. To go to the UWEX site, click on the words “logic model”.
15. Thomas A. Schwandt, a philosopher at heart who started as an auditor, has written extensively on evaluation ethics. He is also the co-author (with Edward S. Halpern) of Linking Auditing and Metaevaluation.
19. William R. Shadish co-authored (with Laura C. Leviton and Thomas Cook) Foundations of Program Evaluation: Theories of Practice. His work in theories of evaluation practice earned him the Paul F. Lazarsfeld Award for Evaluation Theory from the American Evaluation Association in 1994.
Although I’ve only listed 20 leaders, movers and shakers, in the evaluation field, there are others who also deserve mention: John Owen, Deb Rog, Mark Lipsey, Mel Mark, Jonathan Morell, Midge Smith, Lois-Ellin Datta, Patricia Rogers, Sue Funnell, Jean King, Laurie Stevahn, John McLaughlin, Michael Morris, Nick Smith, Don Dillman, Karen Kirkhart, among others.
If you want to meet the movers and shakers, I suggest you attend the American Evaluation Association annual meeting. In 2011, it will be held in Anaheim CA, November 2 – 5; professional development sessions are being offered October 31, November 1 and 2, and also November 6. More conference information can be found here.
We recently held Professional Development Days for the Division of Outreach and Engagement. This is an annual opportunity for faculty and staff in the Division to build capacity in a variety of topics. The question this training posed was evaluative:
How do we provide meaningful feedback?
Evaluating a conference or a multi-day, multi-session training is no easy task. Gathering meaningful data is a challenge. What can you do? Before you hold the conference (I’m using the word conference to mean any multi-day, multi-session training), decide on the following:
The answer to the first question is easy: YES. If the conference is an annual event (or a regular event), you will want participants’ feedback on their experience, so, yes, you will evaluate the conference. Look at Penn State Tip Sheet 16 for some suggestions. (If this is a one-time event, you may not; though as an evaluator, I wouldn’t recommend ignoring evaluation.)
The second question is more critical. I’ve mentioned in previous blogs the need to prioritize your evaluation. Evaluating a conference can be all consuming and result in useless data UNLESS the evaluation is FOCUSED. Sit down with the planners and ask them what they expect to happen as a result of the conference. Ask them if there is one particular aspect of the conference that is new this year. Ask them if feedback in previous years has given them any ideas about what is important to evaluate this year.
This year, the planners wanted to provide specific feedback to the instructors. The instructors had asked for feedback in previous years. This is problematic if planning evaluative activities for individual sessions is not done before the conference. Nancy Ellen Kiernan, a colleague at Penn State, suggests a qualitative approach called a Listening Post. This approach elicits feedback from participants at the time of the conference. The method involves volunteers who attended the sessions and may require more people than a survey would. To use the Listening Post, you must plan ahead of time to gather these data. Otherwise, you will need to do a survey after the conference is over, and that raises other problems.
The third question is also very important. If the results are just given to the supervisor, the likelihood of their being used by individuals for session improvement or by organizers for overall change is slim. Making the data usable for instructors means summarizing the data in a meaningful way, often visually. There are several ways to visually present survey data, including graphs, tables, or charts. More on that another time. Words often get lost, especially if words dominate the report.
There is a lot of information in the training and development literature that might also be helpful. Kirkpatrick has done a lot of work in this area. I’ve mentioned their work in previous blogs.
There is no one best way to gather feedback from conference participants. My advice: KISS–keep it simple and straightforward.