Data Sharing

Last semester I took a research ethics class. I wrote a paper on preregistration and data sharing in academic research, and I decided to adapt the paper into two blog posts. Here is the first part, on data sharing.

Statistics is the study of uncertainty. Any research study that does not involve the entire population of a group cannot provide a definite conclusion with 100% certainty. Conclusions can be made with a high degree of certainty (95-99%), but false positives and false negatives are inevitable in any large statistical analysis. This means that studies can fail to make the right call, and after multiple replications the original conclusion may be overturned.

One way to improve the statistical integrity of research is to maintain a database of data from unpublished studies. Ideally, this database would be accessible to all academic researchers. A researcher would then be able to see the data from other similar studies and compare it with their own. At a significance level of .05, approximately 1 in 20 statistically significant results is a false positive. That rate applies to theoretically perfect studies that meet all the statistical assumptions used; any modeling error increases it. With each external replication of a study, the probability of a false positive or a false negative greatly decreases. Grants from the National Science Foundation [1] and the National Institutes of Health [2] currently require that data from funded studies be made available to the public after the study is completed. But not all grants and funding sources require this disclosure. Without a universal requirement for data disclosure, it can be difficult to confirm that a study and its results are legitimate.

Advocates of open data say that data sharing saves time and reduces false positives and false negatives. A researcher can look at previously conducted studies and try to replicate the results, and the calculations can be rechecked by another researcher to confirm their accuracy. In a large study with lots of data it is very easy to make a few mistakes, and those mistakes can cause the results to be misinterpreted. Open data can even help uncover fraudulent studies. There are methods to estimate the probability that data is fabricated by looking at the relative frequency of its digits: the distribution of trailing digits should be roughly uniform. In 2009, Strategic Vision (a polling company) came under fire for potentially falsifying polls after a FiveThirtyEight analysis [3] found that its digit patterns didn't look quite right. That isn't an academic example, but open access to data could prevent fraudulent studies from being accepted as fact, as in the infamous "vaccines cause autism" study. These statistical tests of randomness aren't definitive, but they can raise questions that prompt further investigation of the data. Open data makes replication easier, and since false positives and false negatives can cause real harm, easier replication helps confirm findings more quickly.
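The digit-frequency idea above can be sketched with a chi-square check: if the trailing digits of reported numbers should be roughly uniform, a large chi-square statistic is a red flag. This is a toy illustration with made-up data, not Strategic Vision's actual numbers.

```python
import random
from collections import Counter

def chi_square_uniform(digits):
    """Chi-square statistic for the hypothesis that digits 0-9 are equally likely."""
    n = len(digits)
    expected = n / 10
    counts = Counter(digits)
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

random.seed(0)

# Trailing digits of genuinely random values should be roughly uniform.
honest = [random.randrange(10) for _ in range(1000)]

# A made-up "fabricated" sample where someone over-uses 7s and never uses 0s.
fabricated = [7] * 250 + [random.randrange(1, 10) for _ in range(750)]

print(chi_square_uniform(honest))      # small: consistent with uniform digits
print(chi_square_uniform(fabricated))  # large: the digits look suspicious
# With 9 degrees of freedom, values above about 16.9 are significant at the .05 level.
```

A large statistic doesn't prove fraud; as the post says, it only raises questions that justify a closer look at the data.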


Works Cited

[1] Public Access To the Results of NSF-Funded Research. (n.d.). Retrieved April 28, 2017, from

[2] NIH’s Commitment to Public Accountability. (n.d.). Retrieved April 28, 2017, from


[3] Silver, N. (2014, May 07). Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud. Retrieved April 28, 2017, from

We Don’t Live in Statsland

Statsland is a magical world that exists only in (certain) statistics textbooks. In Statsland, statistics is easy. We can invoke the Central Limit Theorem and use the normal distribution whenever n is larger than 30. In Statsland we either know or can easily determine the correct distribution. In Statsland, 95% confidence intervals have a 95% chance of containing the real value. But we don't live in Statsland.
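A quick simulation shows what the Statsland version of the Central Limit Theorem promises: averages of samples from even a skewed distribution start to look normal. This sketch uses only Python's standard library and an exponential distribution as the skewed example.

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # Draw n values from a skewed (exponential) distribution and average them.
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# 10,000 sample means, each computed from a sample of size 30.
means = [sample_mean(30) for _ in range(10_000)]

# The exponential(1) distribution has mean 1 and standard deviation 1, so the
# CLT predicts the sample means are roughly normal with mean 1 and
# standard deviation 1/sqrt(30), or about 0.18.
print(statistics.fmean(means))
print(statistics.stdev(means))
```

In Statsland, n = 30 is always enough; in the real world, how fast this convergence happens depends on how skewed the underlying distribution is.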

The point of doing statistics is that it would be too difficult (or impossible) to find the true value for a population. You aren't likely to find the exact value, but you can get pretty close. In a statistics textbook problem, you probably have enough information to do a good job of estimating the desired value, but in applied statistics you may not. If you already know the mean and standard deviation of a population, you don't need to do much (if any) statistics. Any time you have to estimate or substitute information, your model will not perform as well as a theoretically perfect one.

Statistics never was and never will be an exact science. In most cases your model will be wrong. There are no perfect answers. Your confidence intervals will rarely perform exactly as they theoretically should. The requisite sample size to invoke the Central Limit Theorem is not clear cut. Your approach should vary with the individual problem; there is no universal formula for examining data. Applied statistics should be flexible instead of rigid. The world is not a statistics textbook problem, and it should never be treated as one.


Coincidences: A Lesson in Expected Value

As I followed the election I noticed frequent mentions of counties (or cities) known to "predict" the presidential election winner. The idea is that the winner in a certain county has matched the winner of the election for multiple elections in a row. Let's look at county A as an example. To simplify things, let's assume the odds of matching the winner in any presidential election are 50-50. The probability of getting 8 elections right would then be 1 in 256, which means it is unlikely that county A would predict the election by chance. But what about the rest of the counties in America? There are over 3,000 counties in America (according to an Economist article), so we can expect about 12 of them to have "predicted" the winner of the presidential election by chance alone.
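The expected-value arithmetic above is short enough to spell out directly (using the post's simplifying 50-50 assumption and the rough 3,000-county figure):

```python
# Expected number of "bellwether" counties under pure chance: each county
# independently has a (1/2)**8 chance of matching the national winner
# in 8 straight elections.
n_counties = 3000          # rough count of U.S. counties
p_streak = (1 / 2) ** 8    # 1 in 256

expected = n_counties * p_streak
print(expected)  # about 11.7 counties "predict" 8 elections by luck alone
```

So a dozen perfect-streak counties is not evidence of special predictive power; it is roughly what coin flips would produce.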

Rare events happen all the time. Rare events are rare, but rare is not impossible. Let's say there is a (hypothetical) free sweepstakes with a 1 in 100 chance of winning $100. You may think you wouldn't know anyone who won, but if a sweepstakes like this existed you might be surprised by the likely outcome. It may not be likely that you specifically win, but if all your Facebook friends enter the contest, someone you know probably will. If you have at least 99 Facebook friends, it is likely that you or someone you know will win the sweepstakes. You may think it's a coincidence or luck, but it is really math. Expected value can't tell you who is going to win, but it can tell you that someone you know is likely to win. Now, expected value is not a magic bullet. You may have 0 friends win or 2 friends win, but the most likely outcome is that someone will win. Unfortunately, (legitimate) sweepstakes like this don't exist, but it is a good example of how your perception of probability may not match reality. Similarly, it will rain on roughly 1 in 10 of the days when the probability of rain is 10%, but it is easy to pretend it never rains when the forecast says 10%.
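The sweepstakes claim checks out with a one-line complement calculation: the chance that at least one of n independent entrants wins is 1 minus the chance that all of them lose.

```python
# Probability that at least one of n independent entrants wins,
# when each entry wins with probability p.
def prob_someone_wins(n, p):
    return 1 - (1 - p) ** n

p = 1 / 100
print(prob_someone_wins(1, p))    # just you: 1%
print(prob_someone_wins(100, p))  # you plus 99 friends: about 63%
```

With you and 99 friends entering (100 entrants), the probability someone you know wins is about 63%, comfortably more likely than not.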

You may wonder why expected value matters, but it's actually quite important when looking at everyday events. It is easy to underestimate the chance that something odd or rare will happen. You may think it's odd that it rains when the meteorologist says the chance of rain is 10%. Or that it only takes 23 people to have a 50% chance that two of them share the same birthday. It is easy to forget that once-in-a-lifetime events do happen once in a lifetime. How you think about probability is important. So before you yell at the TV meteorologist who said there was a 10% chance of rain when it rained, try to remember that unlikely does not equal impossible.
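The 23-person birthday figure can be verified with the standard complement argument: compute the probability that all n birthdays are distinct, then subtract from 1. This sketch assumes 365 equally likely birthdays and ignores leap years.

```python
# Probability that at least two of n people share a birthday.
def birthday_collision(n):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1 - p_all_distinct

print(birthday_collision(23))  # about 0.507, just over 50%
print(birthday_collision(22))  # about 0.476, just under 50%
```

23 really is the smallest group size where a shared birthday is more likely than not.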

Models May Fail but Statistics Matters Anyway

The 2016 presidential election brought attention to the limitations of statistics. Most models predicted a Clinton win, but Trump will most likely be the president (the results are currently unofficial and recounts are in progress, but most experts believe Trump will be officially elected president). However, no model is 100% certain, and the goal of statistics is to find the most likely event. I have spent the last few weeks reflecting on the results and what they mean for the field of political statistics. Recently I read a book by David Salsburg called The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. It is a history of sorts of how the field was developed and then applied to science. While an exact date for the beginning of statistics is hard to pinpoint, the first journals and departments were founded in the early twentieth century. Statistics is a young field, and it is constantly growing and evolving as more data and situations are studied. In the beginning some of the problems may have been trivial, but it is important to try to understand the world around us. Collecting data from an entire population is incredibly difficult and sometimes impossible, so methods of estimation were created. You may wonder why prediction is necessary or helpful. After all, eventually the election happens and the president is chosen, so why do we care about knowing this in advance? Why does prediction matter? Statistical models and research are not just about what is being studied but about creating better ways to understand the world around us. We can begin to better understand things like the opinions of the people, the development of diseases, and the economy. Statistics can create better government, better medicine, better education, and a better world.
If we can understand how polls measure the voting habits of the American people, then we may be able to get a better picture of citizens' views on multiple issues and candidates. If we can understand how diseases like cancer behave, then we can create better, more individualized medicine. If we can understand how individual students learn and what they know, then we can create a better educational system. Statistics isn't perfect. Statisticians can disagree and both still have valid models and reasoning. The data may be imperfect and incomplete. The model may be wrong. The experiment may seem trivial and unimportant. But there is so much potential for the field of statistics to change our world. Just because prominent statisticians like Nate Silver may not have seen a Trump presidency as the most likely event doesn't mean that the field should be discounted.

Statistics 101

I figured a great place to start would be to explain statistics by defining the basic terms in non-mathematical language.

What: Statistics is the study of data.
Why: To understand the world around us and try to make better decisions.

What: An outlier is a data point that is far away from the rest of the data.
Why: Outliers affect the mean and standard deviation.

What: The population is the entire group of people or objects you are studying.
Why: It is important to understand your population so that you are collecting the right data.

What:  The sample is a group of individuals taken from the population.
Why:  It would be almost impossible to collect data on the entire population in most cases.  So statisticians use samples to help make decisions.

Measures of Central Tendency
What: Measures of central tendency are ways to find the middle of the data set.
Why: Statistics is about finding the most likely event and a way to do that is to find the middle.

What: The mean is the average of a set of data. It is the total of the data divided by the number of data points. It is a measure of central tendency.
Why: The mean is a way to find the middle, but it can be skewed by outliers. However, the mean is still a great way to find the middle in most situations.

What: The median is the data point that is in the middle of the data.
Why: The median is not affected by outliers, which makes it useful in cases with outliers like income (there are people who make hundreds of times the median income).
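The income point above is easy to see with a small made-up data set: one very high earner drags the mean far from the typical value, while the median barely moves.

```python
import statistics

# Hypothetical incomes: nine ordinary earners and one very high one (the outlier).
incomes = [40_000, 42_000, 45_000, 48_000, 50_000,
           52_000, 55_000, 58_000, 60_000, 5_000_000]

print(statistics.mean(incomes))    # 545000.0 -- pulled far upward by the outlier
print(statistics.median(incomes))  # 51000.0  -- stays near the typical earner
```

This is why income statistics are usually reported as medians rather than means.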

Measures of Variability
What: Measures of variability are ways to determine how spread out the data is.
Why: Measures of variability help to compare the data and make decisions.

What: The range is the difference between the smallest and largest values.
Why: The range is used to understand how spread out the data is. It is affected by outliers.

Standard Deviation
What: The standard deviation is a way of measuring the differences in the data. For a sample it is defined by the formula s = √( Σ(x − x̄)² / (n − 1) ), where Σ is the sum, x is a data point, x̄ is the mean, and n is the number of data points.

Why: Standard deviation helps define the statistical distributions.
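The formula above translates directly into a few lines of code. This sketch computes the sample standard deviation by hand and checks it against Python's built-in `statistics.stdev`.

```python
import math
import statistics

def sample_std_dev(data):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_std_dev(data))        # matches the library implementation below
print(statistics.stdev(data))
```

Dividing by n − 1 instead of n (Bessel's correction) is what makes this the sample standard deviation rather than the population standard deviation.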

Inter-Quartile Range (IQR)
What: The interquartile range is the difference between the 25th and 75th percentiles.
Why: It helps find the spread in the center of the data, and isn’t affected by outliers.

Normal Distribution
What: The most commonly used distribution in statistics.
Why: With enough data points, the averages of samples from almost any distribution follow the normal distribution (this is the Central Limit Theorem).

Margin of Error
What: The margin of error is a way of quantifying the error in a sample. Samples don't have all the information, so they have error.
Why: Since samples are incomplete, they don't carry all the information about the population. The margin of error helps us acknowledge that the observed mean may differ from the actual mean.
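As a concrete sketch of margin of error, here is the textbook approximation for a proportion from a simple random sample (the 1.96 is the z-value for 95% confidence; the poll numbers are hypothetical):

```python
import math

# Approximate 95% margin of error for a sample proportion:
# MOE = z * sqrt(p * (1 - p) / n), with z = 1.96 for 95% confidence.
def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1,000 people where 52% favor a candidate:
print(margin_of_error(0.52, 1000))  # about 0.031, i.e. plus or minus 3.1 points
```

This is why polls of around 1,000 respondents so often report a margin of error near plus or minus 3 percentage points.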


This is not all of statistics, but these are the basic terms I will frequently use.