Ever wonder where the 0.05 probability level came from? Ever wonder whether it is the best number? How many of you were taught in your introductory statistics course that 0.05 is the probability level necessary for rejecting the null hypothesis of no difference? That confidence may be misplaced. As Paul Bakker indicates in the AEA 365 blog post for March 28, "Before you analyze your data, discuss with your clients and the relevant decision makers the level of confidence they need to make a decision." Do they really need to be 95% confident? Or would 90% confidence be sufficient? What about 75% or even 55%?
Think about it for a minute. If you were a brain surgeon, you wouldn't want anything less than 99.99% confidence; if you were weighing the risk of a stock market investment, 55% would probably make you a lot of money. The academic community has held to and used the probability level of 0.05 for years (the computation of the p value dating back to the 1770s). Quoting Wikipedia: "In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect." Fisher first proposed the 0.05 level in 1925, establishing a one in 20 limit for statistical significance when considering a two-tailed test. Sometimes the academic community makes the probability level even more restrictive, using 0.01 or 0.001 to demonstrate that the findings are significant. Scientific journals expect 95% confidence, that is, a probability level of 0.05 or smaller.
Although I have held to these levels, especially when I publish a manuscript, I have often wondered whether this level makes sense. If I am only curious about a difference, do I need 0.05? Or could I use 0.10 or 0.15 or even 0.20? I have often asked students whether they are conducting confirmatory or exploratory research. I think confirmatory research expects a more stringent probability level, and exploratory research a less stringent one. The 0.05 seems so arbitrary.
Then there is the grounded theory approach, which doesn't use a probability level at all. It generates theory from categories, which are generated from concepts, which are identified from data, usually qualitative in nature. It uses language like fit, relevance, workability, and modifiability. It does not report statistically significant probabilities because it doesn't use inferential statistics. Instead, it uses a series of probability statements about the relationships between concepts.
So what do we do? What do you do? Let me know.
Today's post is longer than usual. I think it is important because it captures an aspect of data analysis and evaluation use that many of us skip right over: how to present findings using the tools that are available. Let me know if this works for you.
Ann Emery blogs at Emery Evaluation. She challenged readers a couple of weeks ago to reproduce a bubble chart in either Excel or R. This week she posted the answer. She has given me permission to share that information with you. You can look at the complete post at Dataviz Copycat Challenge: The Answers.
I’ve also copied it here in a shortened format:
“Here’s my how-to guide. At the bottom of this blog post, you can download an Excel file that contains each of the submissions. We each used a slightly different approach, so I encourage you to study the file and see how we manipulated Excel in different ways.
Here’s that chart from page 7 of the State of Evaluation 2012 report. We want to see whether we can re-create the chart in the lower right corner. The visualization uses circles, which means we’re going to create a bubble chart in Excel.
To fool Excel into making circles, we need to create a bubble chart in Excel. Click here for a Microsoft Office tutorial. According to the tutorial, “A bubble chart is a variation of a scatter chart in which the data points are replaced with bubbles. A bubble chart can be used instead of a scatter chart if your data has three data series.”
We’re not creating a true scatter plot or bubble chart because we’re not showing correlations between any variables. Instead, we’re just using the foundation of the bubble chart design – the circles. But, we still need to envision our chart on an x-y axis in order to make the circles.
It helps to sketch this part by hand. I printed page 7 of the report and drew my x and y axes right on top of the chart. For example, 79% of large nonprofit organizations reported that they compile statistics. This bubble would get an x-value of 3 and a y-value of 5.
I didn’t use sequential numbering on my axes. In other words, you’ll notice that my y-axis has values of 1, 3, and 5 instead of 1, 2, and 3. I learned that the formatting seemed to look better when I had a little more space between my bubbles.
Open a new Excel file and start typing in your values. For example, we know that 79% of large nonprofit organizations reported that they compile statistics. This bubble has an x-value of 3, a y-value of 5, and a bubble size of 79%.
Go slowly. Check your work. If you make a typo in this step, your chart will get all wonky.
Highlight the three columns on the right – the x column, the y column, and the frequency column. Don’t highlight the headers themselves (x, y, and bubble size). Click on the “Insert” tab at the top of the screen. Click on “Other Charts” and select a “Bubble Chart.”
First, add the basic data labels. Right-click on one of the bubbles. A drop-down menu will appear. Select “Add Data Labels.” You’ll get something that looks like this:
Second, adjust the data labels. Right-click on one of the data labels (not on the bubble). A drop-down menu will appear. Select "Format Data Labels." A pop-up screen will appear. You need to adjust two things. Under "Label Contains," select "Bubble Size." (The default setting on my computer is "Y Value.") Next, under "Label Position," select "Center." (The default setting on my computer is "Right.")
Your basic bubble chart is finished! Now, you just need to fiddle with the formatting. This is easier said than done, and probably takes the longest out of all the steps.
Here’s how I formatted my bubble chart:
For more details about formatting charts, check out these tutorials.
Click here to download the Excel file that I used to create this bubble chart. Please explore the chart by right-clicking to see how the various components were made. You’ll notice a lot of text boxes on top of each other!”
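If you work in Python rather than Excel, the same trick carries over: plot circles on an invented x-y grid and label them with the percentages. Here is a minimal sketch using matplotlib. The row labels and all of the percentages except the 79% figure mentioned above are hypothetical placeholders, and the function names are my own:

```python
# Hypothetical data in the spirit of the State of Evaluation chart:
# rows are practices (y-axis), columns are organization sizes (x-axis),
# and cell values are the percent reporting. Only the 79% comes from
# the post; everything else is a placeholder.
rows = ["compile statistics", "track outcomes", "use a logic model"]
cols = ["small", "medium", "large"]
pct = [[60, 70, 79],
       [45, 55, 65],
       [20, 30, 40]]

xs, ys, sizes, labels = [], [], [], []
for i, row in enumerate(pct):
    for j, value in enumerate(row):
        xs.append(2 * j + 1)                   # 1, 3, 5 - skipped values give breathing room
        ys.append(2 * (len(pct) - 1 - i) + 1)  # first data row sits at the top
        sizes.append(value ** 2)               # area ~ value^2, so diameter scales with the percent
        labels.append(f"{value}%")

def draw(path="bubbles.png"):
    """Render the chart; matplotlib is imported here so the data prep
    above runs even where matplotlib isn't installed."""
    import matplotlib
    matplotlib.use("Agg")  # write to file, no display needed
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=sizes, alpha=0.5)
    for x, y, text in zip(xs, ys, labels):
        ax.annotate(text, (x, y), ha="center", va="center")
    ax.set_xticks([1, 3, 5])
    ax.set_xticklabels(cols)
    ax.set_yticks([5, 3, 1])
    ax.set_yticklabels(rows)
    ax.set_xlim(0, 6)
    ax.set_ylim(0, 6)
    fig.savefig(path)
```

As in the Excel version, the non-sequential 1-3-5 axis values are purely cosmetic, giving the bubbles room to breathe.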
Just spent the last 40 minutes reading comments that people have made to my posts. Some were interesting; some were advertising (aka marketing) their own sites; one suggested I might revisit the "about" feature of my blog and express why I blog (other than it is part of my work). So I revisited my "about" page, took out the conversational filler, and talked about the reality as I've experienced it for the last three-plus years. So check out the about page–I also updated info about me and my family. The comment about updating my "about" page was a good one. It is an evaluative activity, one that was staring me in the face and I hadn't realized it. I probably need to update my photo as well…next time…:)
At the end of January, participants in an evaluation capacity building program I lead will provide highlights of the evaluations they completed for this program. That the event happens to be in Tucson and I happen to be able to get out of the wet and dreary Northwest is no accident. The event will capstone WECT (Western [Region] Evaluation Capacity Training–say 'west') participants' evaluations of the past 17 months. Since each participant will be presenting their programs and the evaluations they did of those programs, there will be a lot of data (hopefully). The participants and those data could use (or not) a new and innovative take on data visualization. Susan Kistler, AEA's Executive Director, has blogged in AEA365 several times about data visualization. Perhaps these reposts will help.
Susan Kistler says: "Colleagues, I wanted to return to this ongoing discussion. At this year's conference (Evaluation '12), I did a presentation on 25 low-cost/no-cost tech tools for data visualization and reporting. An outline of the tools covered and the slides may be accessed via the related aea365 post here http://aea365.org/blog/?p=7491. If you download the slides, each tool includes a link to access it, cost information, and in most cases supplementary notes and examples as needed.
A couple of the new ones that were favorites included wallwisher and poll everywhere. I also have on my to do list to explore both datawrapper and amCharts over the holidays.
But…am returning to you all to ask if there is anything out there that just makes you do your happy dance in terms of new low-cost, no-cost tools for data visualization and/or reporting." (This is a genuine request–if there is something out there, let Susan know. You can comment on the blog, contact her through AEA (email@example.com), or let me know and I'll forward it.)
Susan also says in Saturday's (December 15, 2012) blog (and this would be very timely for WECT participants):
– Enroll in the Knight Center's free Introduction to Infographics and Data Visualization: The course is online, free, and will be offered between January 12 and February 23. According to the course information, we'll learn the basics of:
“How to analyze and critique infographics and visualizations in newspapers, books, TV, etc., and how to propose alternatives that would improve them.
How to plan for data-based storytelling through charts, maps, and diagrams.
How to design infographics and visualizations that are not just attractive but, above all, informative, deep, and accurate.
The rules of graphic design and of interaction design, applied to infographics and visualizations.
Optional: How to use Adobe Illustrator to create infographics.”
The topic of survey development seems to be popping up everywhere–AEA365, Kirkpatrick Partners, eXtension Evaluation Community of Practice, among others. Because survey development is so important to Extension faculty, I’m providing links and summaries.
"… it is critical that you pre-test it with a small sample first." Real-time testing helps eliminate confusion, improves clarity, and ensures that you are asking a question that will give you an answer to what you want to know. This is especially important today, when many surveys are electronic.
It is also important to “Train your data collection staff…Data collection staff are the front line in the research process.” Since they are the people who will be collecting the data, they need to understand the protocols, the rationales, and the purposes of the survey.
Kirkpatrick Partners say:
"Survey questions are frequently impossible to answer accurately because they actually ask more than one question." This is the biggest problem in constructing survey questions. They provide some examples of asking more than one question.
Michael W. Duttweiler, Assistant Director for Program Development and Accountability at Cornell Cooperative Extension, stresses the four phases of survey construction. He then indicates that the next three blog posts will cover points 2, 3, and 4.
Probably my favorite recent post on surveys was one that Jane Davidson did back in August 2012 about survey response scales. Her "boxers or briefs" example captures so many issues related to survey development.
Writing survey questions that give you usable data answering your questions about your program is a challenge; it is not impossible. Dillman wrote the book on surveys; it should be on your desk.
Here is the Dillman citation:
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-mode surveys: The tailored design method. Hoboken, NJ: John Wiley & Sons, Inc.
What is the difference between need to know and nice to know? How does this affect evaluation? I got a post this week on a blog I follow (Kirkpatrick) that talks about how much data a trainer really needs. (Remember that Don Kirkpatrick developed and established an evaluation model for professional training back in 1954 that still holds today.)
Most Extension faculty don't do training programs per se, although there are training elements in Extension programs. Extension faculty are typically looking for program impacts in their program evaluations. Program improvement evaluations, although necessary, are not sufficient. Yes, they provide important information to the program planner, but they don't necessarily give you information about how effective your program has been (i.e., outcome information). (You will note that I use the term "impacts" interchangeably with "outcomes" because most Extension faculty parrot the language of reporting impacts.)
OK. So how much data do you really need? How do you determine what is nice to have and what is necessary (need) to have? How do you know?
Kirkpatrick also advises avoiding redundant questions–that is, questions asked in several ways that give you the same answer, or questions written in both positive and negative forms. The other question I always include, because it gives me a way to determine how my program is making a difference, is a question on intention that includes a time frame. For example, "In the next six months do you intend to try any of the skills you learned today? If so, which ones?" Mazmanian has identified stated intention to change as the best predictor of behavior change (a measure of making a difference). Telling someone else makes the participant accountable. That seems to make the difference.
Mazmanian, P. E., Daffron, S. R., Johnson, R. E., Davis, D. A., & Kantrowitz, M. P. (1998). Information about barriers to planned change: A randomized controlled trial involving continuing medical education lectures and commitment to change. Academic Medicine, 73(8).
P.S. No blog next week; away on business.
Quantitative data analysis is typically what happens to data that are numbers (although qualitative data can be reduced to numbers, I'm talking here about data that start as numbers). Recently, a library colleague sent me an article that was relevant to what evaluators often do–analyze numbers.
So why, you ask, am I talking about an article that is directed to librarians? Although that article is directed at librarians, it has relevance to Extension. Extension faculty (like librarians), more often than not, use surveys to determine the effectiveness of their programs. Extension faculty are always looking to present the most powerful survey conclusions (yes, I lifted that from the article title), and no, you don't need a doctorate in statistics to understand these analyses. The other good thing about this article is that it provides you with a link to an online survey-specific software tool (Raosoft's calculator at http://www.raosoft.com/samplesize.html).
This article refers specifically to three metrics that are often overlooked by Extension faculty: margin of error (MoE), confidence level (CL), and cross-tabulation analysis. These are three statistics which will help you in your work. The article also does a nice job of listing the eight recommended best practices which I’ve appended here with only some of the explanatory text.
1. Inferential statistical tests. To be more certain of the conclusions drawn from survey data, use inferential statistical tests.
2. Confidence Level (CL). Choose your desired confidence level (typically 90%, 95%, or 99%) based upon the purpose of your survey and how confident you need to be of the results. Once chosen, don’t change it unless the purpose of your survey changes. Because the chosen confidence level is part of the formula that determines the margin of error, it’s also important to document the CL in your report or article where you document the margin of error (MoE).
3. Estimate your ideal sample size before you survey. Before you conduct your survey use a sample size calculator specifically designed for surveys to determine how many responses you will need to meet your desired confidence level with your hypothetical (ideal) margin of error (usually 5%).
4. Determine your actual margin of error after you survey. Use a margin of error calculator specifically designed for surveys (you can use the same Raosoft online calculator recommended above).
5. Use your real margin of error to validate your survey conclusions for your larger population.
6. Apply the chi-square test to your crosstab tables to see if there are relationships among the variables that are not likely to have occurred by chance.
7. Read and report chi-square tests of cross-tab tables.
8. Document any known sources of bias or error in your sampling methodology and in your survey design in your report, including but not limited to how your survey sample was obtained.
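For readers who want to check the arithmetic behind practices 3 through 6, here is a small Python sketch. The margin-of-error and sample-size formulas are the standard ones for a proportion drawn from a simple random sample, with a finite population correction when the population is known (which is, as I understand it, what the Raosoft calculator implements); the function names are my own:

```python
import math

# z-scores for the confidence levels mentioned in the article
Z = {90: 1.645, 95: 1.96, 99: 2.576}

def sample_size(population, cl=95, moe=0.05, p=0.5):
    """Practice #3: responses needed to hit a target margin of error,
    with the finite population correction. p=0.5 is the most
    conservative (largest-sample) assumption."""
    z = Z[cl]
    n0 = z ** 2 * p * (1 - p) / moe ** 2                 # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))   # finite correction

def margin_of_error(n, population=None, cl=95, p=0.5):
    """Practice #4: the margin of error you actually achieved,
    given the n responses you received."""
    z = Z[cl]
    moe = z * math.sqrt(p * (1 - p) / n)
    if population:                                        # finite population correction
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

def chi_square_2x2(table):
    """Practice #6: Pearson chi-square for a 2x2 cross-tab. Compare the
    result to 3.84, the 0.05 critical value at one degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For example, with a population of 500 and the usual defaults (95% CL, 5% MoE), `sample_size(500)` works out to 218 responses; feed the number you actually got back into `margin_of_error` to report your real MoE.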
Bottom line: read the article.
Hightower, C. & Kelly, S. (2012, Spring). Infer more, describe less: More powerful survey conclusions through easy inferential tests. Issues in Science and Technology Librarianship. DOI:10.5062/F45H7D64. [Online]. Available at: http://www.istl.org/12-spring/article1.html
I want to focus on "wisdom is a prerequisite of good judgement" and talk about how that relates to evaluation. I also liked the list of "everybody who's anybody." (Although I don't know who Matt means by Hobbs–is that Hobbes, as in Thomas Hobbes the English philosopher, or someone else that I couldn't find and don't know?) But I digress…
“Wisdom is a prerequisite for good judgement.” Judgement is used daily by evaluators. It results in the determination of value, merit, and/or worth of something. Evaluators make a judgement of value, merit, and/or worth. We come to these judgements through experience. Experience with people, activities, programs, contributions, LIFE. Everything we do provides us with experience; it is what we do with that experience that results in wisdom and, therefore, leads to good judgements.
Experience is a hard teacher: demanding, exacting, and often obtuse. My 19-y/o daughter is going to summer school at OSU. She got approval to take two courses and for those courses to transfer to her academic record at her college. She was excited about the subject; got the book; read ahead; and looked forward to class, which started yesterday. After class, I had never seen a more disappointed individual. She found the material uninteresting (it was mostly review because she had read ahead) and the instructor uninspiring (possibly due to the class size of 35). To me, it was obvious that she needed to re-frame this experience into something positive; she needed to find something she could learn from it that would lead to wisdom. I suggested that she think of the experience as a cross-cultural exchange, challenging because of cultural differences. In truth, a large state university is very different from a small liberal arts college; truly a different culture. She has four weeks to pull some wisdom from this experience; four weeks to learn how to make a judgement that is beneficial. I am curious to see what happens.
Not all evaluations result in beneficial judgements; often, the answer, the judgement, is NOT what the stakeholders want to hear. When that is the case, one needs to re-frame the experience so that learning occurs (both for the individual evaluator as well as the stakeholders) so that the next time the learning, the hard-won wisdom, will lead to "good" judgement, even if the answer is not what the stakeholders want to hear. Matt started his discussion with the saying that "wisdom, rooted in knowledge of self, is a prerequisite for good judgement". Knowing yourself is no easy task; you can only control what you say, what you do, and how you react (a form of doing/action). The study of those things is a lifelong adventure, especially when you consider how hard it is to change yourself. Just having knowledge isn't enough for good judgement; the evaluator needs to integrate that knowledge into the self and own it; then the result will be "good judgements"; the result will be wisdom.
I started this post back in April. I had an idea that needed to be remembered…it had to do with the unit of analysis, a question that often comes up in evaluation. To increase sample size and, therefore, power, evaluators often choose to run analyses on the larger number of units when the aggregate (i.e., the smaller number) is probably the "true" unit of analysis. Let me give you an example.
A program is randomly assigned to fifth grade classrooms in three different schools. School A has three classrooms; school B has two classrooms; and school C has one classroom. All together, there are approximately 180 students, six classrooms, three schools. What is the appropriate unit of analysis? Many people use students, because of the sample size issue. Some people will use classroom because each got a different treatment. Occasionally, some evaluators will use schools because that is the unit of randomization. This issue elicits much discussion. Some folks say that because students are in the school, they are really the unit of analysis because they are embedded in the randomization unit. Some folks say that students are the best unit of analysis because there are more of them. That certainly is the convention. What you need to decide is what the unit is, and be able to defend that choice. Even though I would lose power, I think I would go with the unit of randomization. Which leads me to my next point–truth.
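One way to quantify how much power is really at stake is the Kish design effect: students within the same classroom resemble each other, so analyzing students when classrooms were randomized inflates the variance. A sketch, assuming equal cluster sizes and a hypothetical intraclass correlation (ICC) of 0.15:

```python
def design_effect(cluster_size, icc):
    """Kish design effect: how much the variance of a mean is inflated
    when students are analyzed but classrooms were randomized
    (assumes roughly equal cluster sizes)."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """The nominal n, deflated to what it is 'worth' for inference."""
    return n / design_effect(cluster_size, icc)

# Six classrooms of ~30 students each, with a made-up ICC of 0.15:
deff = design_effect(30, 0.15)               # 1 + 29 * 0.15 = 5.35
n_eff = effective_sample_size(180, 30, 0.15)
```

Under these (made-up) numbers, the 180 students behave like roughly 34 independent observations–much closer to the six classrooms than to 180, which is why analyzing at the student level overstates your power.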
At the end of the first paragraph, I used the word "true" in quotation marks. The Kirkpatricks in their most recent blog opened with a quote from the US CIA headquarters in Langley, Virginia: "And ye shall know the truth, and the truth shall make you free." (We won't talk about the fiction in the official discourse today…) (Don Kirkpatrick developed the four levels of evaluation specifically for the training and development field.) Jim Kirkpatrick, Don's son, posits that, "Applied to training evaluation, this statement means that the focus should be on discovering and uncovering the truth along the four levels path." I will argue that the truth is how you (the principal investigator, program director, etc.) see the answer to the question. Is that truth with an upper case "T" or is that truth with a lower case "t"? What do you want it to mean?
Like history (history is what is written, usually by the winners, not what happened), truth becomes what you want the answer to mean. Jim Kirkpatrick offers an addendum (also from the CIA): "actionable intelligence." He goes on to say that, "Asking the right questions will provide data that gives (sic) us information we need (intelligent) upon which we can make good decisions (actionable)." I agree that asking the right question is important–probably the foundation on which an evaluation is based. Making "good decisions" is in the eyes of the beholder–what do you want it to mean?
I had a conversation (ok–an electronic conversation) with colleagues a few weeks ago. The conversation was about the use of inappropriate analyses for manuscripts being submitted. The specific question raised was about the t-test and went something like this:
Should a t-test be used on a sample that has not been RANDOMLY drawn from a population? If the sample is not randomly drawn, or if the entire population is used (as is often the case with Extension evaluations), then a t-test is NOT appropriate.
What exactly is the appropriate test? First, one has to identify whether the underlying assumptions have been met; if not, a parametric test, such as the t-test, is not appropriate.
So what exactly are the assumptions underlying the use of a t-test?
Glass and Stanley (1970, pg. 297) [a classic statistics text in the field of education and psychology; there may be a more recent edition than the one I have on my shelf] say that for dependent samples (the kind most often used by Extension professionals), the sample must meet the assumptions they enumerate.
Some would add that the scale of measurement needs to be interval or ratio (not nominal; ordinal is a questionable case) and that the sample size needs to be 30 or more.
This presents a quandary, to say the least. Extension professionals know that journals want a probability value (if the study is quantitative)–but how do you get a probability value without a t-test?
Answer: Use a nonparametric equivalent test.
The nonparametric equivalent for a test of dependent means is a Wilcoxon matched pair test.
Marascuilo and McSweeney (1977, pg. 5) say that the researcher needs to select "…a test for which the power of rejection is maximized when the hypothesis tested is false." They go on to say, "If the data adhere to the assumptions required for a classical, normally based t or F test, a researcher would be foolish…not (to) use them, since they are optimum, when justified." Justification is the key here. And the justification is: do the data meet the assumptions for the test?
Their bottom line is that a researcher should never think that a nonparametric test is exclusively a substitute for a parametric test. It is not. Use the right test for the hypothesis being tested. It may be (probably is) a nonparametric test.
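To make the alternative concrete, here is a sketch of the Wilcoxon matched-pairs signed-rank test in plain Python, using the normal approximation to the W statistic (reasonable for roughly 20 or more pairs; for small samples you would consult exact tables or a package such as scipy.stats.wilcoxon). The pre/post scores at the bottom are hypothetical:

```python
import math

def wilcoxon_signed_rank(pre, post):
    """Wilcoxon matched-pairs signed-rank test (normal approximation).
    Returns (W_plus, two-sided p-value)."""
    # Work with the paired differences, dropping zeros as the test requires
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks across ties
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    # W+ = sum of ranks belonging to positive differences
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Normal approximation to the null distribution of W+
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Hypothetical pre/post scores where every participant improved:
pre = [10, 12, 9, 11, 13, 8, 14, 7, 15, 6]
post = [p + gain for p, gain in zip(pre, range(1, 11))]
w, pval = wilcoxon_signed_rank(pre, post)
```

Because the test operates on ranks of the differences rather than their raw values, it asks only for ordinal-strength information and symmetry of the differences, not the normality that the dependent-samples t-test requires.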
Glass, G. V., & Stanley, J. C. (1970). Statistical Methods in Education and Psychology. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Marascuilo, L. A. & McSweeney, M. (1977). Nonparametric and Distribution-free Methods for the Social Sciences. Monterey, CA: Brooks/Cole Publishing.