The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value but to mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance itself (p ≤ 0.05) is also hardly replicable: at a good statistical power of 80%, two studies of a true effect will be ‘conflicting’, meaning that one is significant and the other is not, in one third of cases. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on the replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards.
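As a rough illustration of that upward bias (the selection effect is sometimes called the winner's curse), the following sketch simulates many studies of the same true effect and compares the average estimate across all studies with the average among only the significant ones. The effect size, sample sizes, and test are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: a two-sample comparison with true effect
# d = 0.5 (in SD units) and per-group sizes chosen to give roughly 80%
# and roughly 28% power at alpha = 0.05 (two-sided).
d, crit, n_sim = 0.5, 1.96, 100_000

for n, label in [(64, "~80% power"), (15, "~28% power")]:
    se = np.sqrt(2 / n)                # SE of the difference in means
    est = rng.normal(d, se, n_sim)     # effect estimates across studies
    sig = np.abs(est) > crit * se      # which estimates reach p <= 0.05
    print(f"{label}: mean of all estimates = {est.mean():.2f}, "
          f"mean of significant estimates = {est[sig].mean():.2f}")
```

At ~80% power the significant estimates average about 0.56 instead of the true 0.50; at ~28% power they average nearly double the true effect, which is why filtering the literature through a significance threshold distorts cumulative evidence.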
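The ‘one third’ figure earlier in the paragraph is simple arithmetic: two independent studies that each have 80% power yield exactly one significant result with probability 2 × 0.8 × 0.2 = 0.32. A minimal check under the same illustrative assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same assumed setup: true effect d = 0.5, n = 64 per group (~80% power).
d, n, crit, n_sim = 0.5, 64, 1.96, 100_000
se = np.sqrt(2 / n)

# Two independent replications of the same true effect.
sig1 = np.abs(rng.normal(d, se, n_sim)) > crit * se
sig2 = np.abs(rng.normal(d, se, n_sim)) > crit * se

print("power per study:    ", sig1.mean())            # ~0.80
print("'conflicting' pairs:", (sig1 != sig2).mean())  # ~0.32, about 1/3
```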