Data Sharing

Last semester I took a research ethics class and wrote a paper on preregistration and data sharing in academic research. I have adapted the paper into two blog posts. Here is the first part, on data sharing.

Statistics is the study of uncertainty. Any research study that does not involve the entire population of interest cannot provide a conclusion with 100% certainty. Conclusions can be made with a high degree of certainty (95-99%), but false positives and false negatives are inevitable in any large statistical analysis. This means that studies can fail to make the right call, and after multiple replications the original conclusion may be overturned.

One way to improve the statistical integrity of research is to have a database of the data from non-published studies. Ideally, this database would be accessible to all academic researchers. A researcher would then be able to see the data from other similar studies and compare their own data against it. At a significance level of .05, approximately 1 in 20 studies of an effect that does not exist will still come out statistically significant, which is a false positive. This number applies to theoretically perfect studies that meet all of the statistical assumptions used. Any modelling error increases that rate. With each external replication of a study, the probability of a false positive or a false negative greatly decreases. Grants from the National Science Foundation [1] and the National Institutes of Health [2] currently require that data from funded studies be made available to the public after the study is completed. But not all grants and funding sources require this disclosure. Without a universal requirement for data disclosure, it can be difficult to confirm that a study and its results are legitimate.
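As a rough illustration of the point above about replication, here is a minimal sketch in Python. It assumes the effect being tested does not exist and that every study is independent and run at the same significance level; the function name is made up for this post. Real replications are rarely this clean, so treat the numbers as a best case.

```python
def prob_all_false_positives(alpha: float, n_studies: int) -> float:
    """Chance that n independent studies of a nonexistent effect
    ALL come out statistically significant at level alpha."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    if n_studies < 1:
        raise ValueError("n_studies must be at least 1")
    # Independence assumption: the per-study error rates multiply.
    return alpha ** n_studies


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, "studies:", prob_all_false_positives(0.05, k))
```

Under these assumptions, one significant result has a 1 in 20 chance of being a fluke, while an original study plus one independent replication both being flukes is closer to 1 in 400.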

Advocates of open data say that data sharing saves time and reduces false positives and false negatives. A researcher can look at previously conducted studies and try to replicate the results. The results can then be recalculated from the data by another researcher to confirm accuracy. In a large study with lots of data, it is very easy to make a few mistakes, and these mistakes could cause the results to be misinterpreted. Open data can even help uncover fraudulent studies. There are methods to estimate the probability that data is fraudulent by looking at the relative frequency of the digits. For some kinds of data, the distribution of the digits should be fairly uniform. In 2009, Strategic Vision (a polling company) came under fire for potentially falsifying polls after a FiveThirtyEight analysis [3] found that the digits in its results didn’t look quite right. This isn’t an academic example, but open access data could prevent fraudulent studies from being accepted as fact, as happened with the infamous vaccines-cause-autism study. A statistical analysis of randomness isn’t definitive, but it can raise questions that prompt further investigation of the data. Open data also makes replication easier. False positives and false negatives can cause harm in some cases, and easier replication can help confirm findings more quickly.
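To make the digit-frequency idea above concrete, here is a small sketch in Python. It is not the method used in the FiveThirtyEight analysis; it is a generic chi-square goodness-of-fit check, and it assumes the last digit of each reported whole number should be roughly uniform from 0 to 9. The function names are made up for this post. A large statistic is only a reason to look closer, not proof of fraud.

```python
from collections import Counter

# Standard table value: chi-square critical value for 9 degrees of
# freedom at the 0.05 significance level.
CHI2_CRIT_9DF_05 = 16.919


def last_digit_chi_square(values):
    """Chi-square statistic comparing last-digit counts to a uniform spread."""
    digits = [abs(int(v)) % 10 for v in values]
    if not digits:
        raise ValueError("need at least one value")
    counts = Counter(digits)
    expected = len(digits) / 10  # uniform assumption: each digit equally likely
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


def looks_suspicious(values):
    """True if the last digits deviate from uniform at the 0.05 level."""
    return last_digit_chi_square(values) > CHI2_CRIT_9DF_05


if __name__ == "__main__":
    evenly_spread = list(range(100))             # every last digit appears 10 times
    all_sevens = [7 + 10 * i for i in range(50)]  # every value ends in 7
    print(looks_suspicious(evenly_spread))
    print(looks_suspicious(all_sevens))
```

The check needs a reasonably large sample to mean anything (a common rule of thumb is at least five expected observations per digit), and it only works if the raw numbers are available, which is the case for open data.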

 

Works Cited

[1] Public Access To the Results of NSF-Funded Research. (n.d.). Retrieved April 28, 2017, from https://www.nsf.gov/news/special_reports/public_access/

[2] NIH’s Commitment to Public Accountability. (n.d.). Retrieved April 28, 2017, from https://grants.nih.gov/grants/public_accountability/

[3] Silver, N. (2014, May 07). Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud. Retrieved April 28, 2017, from https://fivethirtyeight.com/features/strategic-vision-polls-exhibit-unusual/