Ever wonder where the 0.05 probability level came from? Ever wonder if that is the best number? How many of you were taught in your introduction to statistics course that 0.05 is the probability level necessary for rejecting the null hypothesis of no difference? This confidence may be spurious. As Paul Bakker indicates in the AEA 365 blog post for March 28, “Before you analyze your data, discuss with your clients and the relevant decision makers the level of confidence they need to make a decision.” Do they really need to be 95% confident? Or would 90% confidence be sufficient? What about 75% or even 55%?
Think about it for a minute. If you were a brain surgeon, you wouldn’t want anything less than 99.99% confidence; if you were looking at level of risk for a stock market investment, 55% would probably make you a lot of money. The academic community has held to and used the probability level of 0.05 for years (the computation of the p value dates back to the 1770s). (Quoting Wikipedia, “In the 1770s Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. He concluded by calculation of a p-value that the excess was a real, but unexplained, effect.”) Fisher first proposed the 0.05 level in 1925 and established a one in 20 limit for statistical significance when considering a two tailed test. Sometimes the academic community makes the probability level even more restrictive by using 0.01 or 0.001 to demonstrate that the findings are significant. Scientific journals expect 95% confidence, that is, a probability level of 0.05 or smaller.
Although I have held to these levels, especially when I publish a manuscript, I have often wondered if this level makes sense. If I am only curious about a difference, do I need 0.05? Or could I use 0.10 or 0.15 or even 0.20? I have often asked students if they are conducting confirmatory or exploratory research. I think confirmatory research expects a more stringent probability level. I think exploratory research requires a less stringent probability level. The 0.05 seems so arbitrary.
Then there is the grounded theory approach which doesn’t use a probability level. It generates theory from categories which are generated from concepts which are identified from data, usually qualitative in nature. It uses language like fit, relevance, workability, and modifiability. It does not report statistically significant probabilities as it doesn’t use inferential statistics. Instead, it uses a series of probability statements about the relationships between concepts.
So what do we do? What do you do? Let me know.
Statistics is not the dragon you think it is.
For many people, the field of statistics is a dragon in disguise and, as with dragons, most people shy away from it.
I have found Neil Salkind’s book “Statistics for People Who (Think They) Hate Statistics” to be a good reference for understanding the basics of statistics. The 4th edition is due out in September 2010. This book isn’t intimidating; it is easy to understand; it isn’t heavy on the math or formulas; it has a lot of tips. I’m using it for this column. I keep it on my desk along with Dillman.
Faculty who come to me with questions about analyzing their data typically want to know how to determine statistical significance. But before I can talk to faculty about statistical significance, there are a few questions that need to be answered.
Most people immediately jump to what test to use. Don’t go there. Start with what measurement scale you have. Then answer the other questions.
So let’s talk about scales of measurement. Not all data are created equal. Some data are easier to analyze than other data. Scale of measurement makes that difference.
There are four scales of measurement and most data fall into one of these four. They are either categorical (even if they have been converted to numbers) or numerical (originally numbers). They are nominal, ordinal, interval, and ratio.
Scales of measurement are rules determining the particular levels at which outcomes are measured. When you decide on an answer to a question, you are deciding on the scale of measurement; you are agreeing to the particular set of characteristics for that measurement.
Nominal scales name something. For example–gender is either male, female, or unknown/not stated; ethnicity is one of several names of groups. When you gather demographic data, such as gender, ethnicity, or race, you are employing a nominal scale. The data that result from nominal scales are categorical data–that is, data resulting from categories which are mutually exclusive. The respondent is either male or female, not both.
An ordinal scale orders something; it puts the thing being measured in order–high to low; low to high. Salkind gives the example of ranking candidates for a job. Extension professionals (and many/most survey professionals) use ordinal scales in surveys (strongly agree to strongly disagree; don’t like to like a lot). We do not know how much difference there is between “don’t like” and “like a lot.” The data that result from ordinal scales are categorical data.
An interval scale is based on a continuum of equally spaced intervals. Think of a thermometer or a test score. (Weight, which has a true zero, properly belongs with the ratio scale.) We know that the intervals along the scale are equal to one another. The data that result from interval scales are numerical data.
A ratio scale is a scale with an absolute zero, a situation where the characteristic of interest is absent–like zero light or no molecular movement. This rarely happens in social or behavioral science, the work that most Extension professionals do. The data that result from ratio scales are numerical data.
Why do we care?
Using an inappropriate scale will invalidate your data and provide you with spurious outcomes which yield spurious impacts.
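To make the link between scale and analysis concrete, here is a minimal Python sketch. It assumes the usual textbook pairing (mode for nominal data, median for ordinal data, mean for interval or ratio data); the function name summarize and the survey values are invented for illustration and do not come from Salkind or from any real data set.

```python
from statistics import mean, median, mode

# Hypothetical survey responses, invented for illustration only.
gender = ["female", "male", "female", "not stated", "female"]  # nominal
agreement = [1, 2, 2, 4, 5]  # ordinal: 1 = strongly disagree ... 5 = strongly agree
test_scores = [70, 82, 82, 91, 95]  # interval


def summarize(values, scale):
    """Return a summary statistic that suits the scale of measurement."""
    if scale == "nominal":
        return mode(values)    # most frequent category; categories have no order
    if scale == "ordinal":
        return median(values)  # middle rank; the spacing between ranks is unknown
    if scale in ("interval", "ratio"):
        return mean(values)    # equal intervals make averaging meaningful
    raise ValueError(f"unknown scale: {scale}")


print(summarize(gender, "nominal"))
print(summarize(agreement, "ordinal"))
print(summarize(test_scores, "interval"))
```

Averaging the gender codes, by contrast, would produce a number that describes nobody, which is the kind of spurious outcome an inappropriate scale gives you.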
Statistically significant is a term that is often bandied about. What does it really mean? Why is it important?
First–why is it important?
It is important because it helps the evaluator make decisions based on the data gathered.
That makes sense–evaluators have to make decisions so that the findings can be used. If there isn’t some way to set the findings apart from the vast morass of information, then it is only background noise. So those of us who do analysis have learned to look at the probability level (written as a “p” value such as p=0.05). The “p” value helps us judge whether a finding is likely to be real rather than a product of chance, not necessarily whether it is important.
Second–what does that number really mean?
Probability level means–can this (fill in the blank here) happen by chance? If it can occur by chance, say 95 times out of 100, then it is probably not a real change. When evaluators look at probability levels, we want really small numbers. A small number says that a change this large would be unlikely to occur by chance alone. So a p value of 0.05 means that, if nothing but chance were at work, a result like this would turn up only about 5 times out of 100. Strictly speaking, that is not the same as a 95% chance that the change is real, although the number is often read that way. To express a p value as a confidence level, convert it to a percentage and subtract it from 100 (100-5=95).
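Here is a small Python sketch of what “happening by chance” looks like as a calculation. It is only one way to get a p value (an exact two-tailed binomial test), and the scenario is invented: suppose 15 of 20 participants improved, and chance alone would give each participant a 50/50 shot. The function name binomial_two_tailed_p is my own, not from any statistics package.

```python
from math import comb


def binomial_two_tailed_p(successes, trials, p_null=0.5):
    """Exact two-tailed p value for a binomial outcome.

    Adds up the probability, assuming only chance is at work (p_null),
    of every outcome that is no more likely than the one observed.
    """
    def prob(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = prob(successes)
    # The small tolerance guards against floating point ties.
    return sum(prob(k) for k in range(trials + 1)
               if prob(k) <= observed * (1 + 1e-9))


# Hypothetical data: 15 of 20 participants improved.
p_value = binomial_two_tailed_p(15, 20)
print(f"p = {p_value:.3f}")
print(f"confidence level = {100 - p_value * 100:.1f}%")
```

With these made-up numbers the p value comes out a little above 0.04, so a split as lopsided as 15 to 5 (in either direction) would turn up by chance about 4 times in 100.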
Convention has it that for something to be statistically significant, the p value must be 0.05 or smaller. This convention comes from academic research. Smaller numbers aren’t necessarily better; they just indicate that the result is even less likely to have occurred by chance. There are software programs (StatXact for example) that can compute the exact probability, so you may see numbers like 0.047.
Exploratory research (as opposed to confirmatory) may use a higher p value such as p=0.10. This means that the trend is moving in the desired direction. Some evaluators let the key stakeholders determine if the probability level (p value) is at a level that indicates importance, for example, 0.062. Some would argue that 94 times out of 100 is not that much different from 95 times out of 100.
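The same p value can land on either side of the line depending on the level agreed to beforehand. The Python sketch below illustrates this with the two p values mentioned above (0.047 and 0.062). The labels, the levels dictionary, and the choice to count a p value equal to the cutoff as significant are my assumptions for illustration, not a standard.

```python
def is_significant(p_value, level):
    """True when the p value is at or below the agreed probability level."""
    return p_value <= level


# Probability levels agreed on before the analysis (illustrative labels).
levels = {"confirmatory": 0.05, "exploratory": 0.10}

for p_value in (0.047, 0.062):
    for purpose, level in levels.items():
        verdict = "significant" if is_significant(p_value, level) else "not significant"
        print(f"p={p_value} at the {purpose} level of {level}: {verdict}")
```

Nothing about the data changes between the two verdicts for 0.062; only the level the decision makers chose does.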