Models May Fail but Statistics Matters Anyway

The 2016 presidential election brought attention to the limitations of statistics. Most models predicted a Clinton win, but Trump will most likely be the president (the results are currently unofficial and recounts are in progress, but most experts believe that Trump will be officially elected president). However, no model is 100% certain, and the goal of statistics is to find the most likely event. I have spent the last few weeks reflecting on the results and what they mean for the field of political science statistics.

Recently I read a book by David Salsburg called The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. It’s a history of sorts of how the field was developed and then applied to science. While an exact date for the beginning of statistics is hard to pinpoint, the first journals and departments were founded in the early twentieth century. Statistics is a young field, and it is constantly growing and evolving as more data and situations are studied. In the beginning some of the problems may have been trivial, but it is important to try to understand the world around us. Collecting data from an entire population is incredibly difficult and sometimes impossible, so methods of estimation were created.

You may wonder why prediction is necessary or helpful. After all, eventually the election happens and the president is chosen, so why do we care about knowing this in advance? Why does prediction matter? Statistical models and research are not just about what is being studied but about creating better ways to understand the world around us. We can begin to better understand things like the opinions of the people, the development of diseases, and the economy. Statistics can create better government, better medicine, better education, and a better world.
If we can understand how polls measure the voting habits of the American people, then we may be able to get a better picture of citizens’ views on multiple issues and candidates. If we can help understand how diseases like cancer behave, then we can create better, more individualized medicine. If we can understand how individual students learn and what they know, then we can create a better educational system. Statistics isn’t perfect. Statisticians can disagree and still both have valid models and reasoning. The data may be imperfect and incomplete. The model may be wrong. The experiment may seem trivial and unimportant. But there is so much potential for the field of statistics to change our world. Just because prominent statisticians like Nate Silver may not have seen a Trump presidency as the most likely event doesn’t mean that the field should be discounted.

Statistics 101

I figured a great start would be to explain what statistics is by defining the basic terms in non-mathematical language.

Statistics
What: Statistics is the study of data.
Why: To understand the world around us and try to make better decisions.

Outlier
What: An outlier is a data point that is far away from the rest of the data.
Why: Outliers affect the mean and standard deviation.

Population
What: The population is the entire group of people or objects you are studying.
Why: It is important to understand your population so that you are collecting the right data.

Sample
What:  The sample is a group of individuals taken from the population.
Why:  It would be almost impossible to collect data on the entire population in most cases.  So statisticians use samples to help make decisions.

Measures of Central Tendency
What: Measures of central tendency are ways to find the middle of the data set.
Why: Statistics is about finding the most likely event and a way to do that is to find the middle.

Mean
What: The mean is the average of a set of data. It is the total of the data divided by the number of data points. It is a measure of central tendency.
Why: The mean is a way to find the middle, but it can be skewed by outliers.  However, the mean is still a great way to find the middle in most situations.

Median
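What: The median is the value in the middle of the data once the data is sorted from smallest to largest. If there is an even number of data points, it is the average of the two middle values.
Why: The median is largely unaffected by outliers, which makes it useful in cases with outliers like income (there are people who make hundreds of times the median income).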
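
To see the difference between the mean and the median, here is a small sketch in Python. The incomes are made-up numbers (in thousands of dollars) chosen only for illustration, and the variable names are my own.

```python
from statistics import mean, median

# Hypothetical yearly incomes in thousands of dollars (made-up numbers).
incomes = [32, 41, 45, 50, 58]

# The same group after one very high earner (an outlier) joins it.
incomes_with_outlier = incomes + [5000]

print("Without outlier: mean =", mean(incomes), "median =", median(incomes))
print("With outlier:    mean =", mean(incomes_with_outlier),
      "median =", median(incomes_with_outlier))
```

In this example the one outlier pulls the mean from about 45 up to 871, while the median only moves from 45 to 47.5.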

Measures of Variability
What: Measures of variability are ways to determine how spread out the data is.
Why: Measures of variability help to compare data sets and make decisions.

Range
What: The range is the difference between the smallest and largest values.
Why: The range is used to understand how spread out the data is. It is affected by outliers.

Standard Deviation
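What:  The standard deviation is a way of measuring how far the data points typically are from the mean.  For a sample, it is defined by the following formula, where Σ is the sum, x is the data point, x̄ is the mean of the data, and n is the number of data points.

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

Why: Standard deviation helps define statistical distributions.  For example, the normal distribution is described by its mean and its standard deviation.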
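
For readers who like to see the arithmetic, here is a minimal sketch of the standard deviation calculation in Python. It assumes the usual sample version of the formula (dividing by n - 1), and the data values are made up for illustration. The result is checked against Python's built-in `statistics.stdev`.

```python
import math
from statistics import stdev

# Made-up data set used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)                      # number of data points
x_bar = sum(data) / n              # the mean of the data

# Square each point's distance from the mean, then add them up (the sigma step).
sum_of_squares = sum((x - x_bar) ** 2 for x in data)

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum_of_squares / (n - 1))

print("By hand: ", round(s, 3))
print("Built-in:", round(stdev(data), 3))
```

Both lines should print the same value, which is a quick way to confirm the steps match the formula.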

Inter-Quartile Range (IQR)
What: The inter-quartile range is the difference between the 75th and 25th percentiles.
Why: It helps find the spread in the center of the data, and it isn’t affected by outliers.
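
Here is a rough sketch comparing the range and the IQR in Python on made-up data. One caveat: different software uses slightly different rules for computing percentiles, so the exact IQR may differ a little between tools. This sketch uses the default method of Python's `statistics.quantiles`, and the helper names `data_range` and `iqr` are my own.

```python
from statistics import quantiles

def data_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

def iqr(values):
    """Difference between the 75th and 25th percentiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    return q3 - q1

# Made-up data, and the same data with the largest value replaced by an outlier.
data = [1, 3, 4, 5, 6, 7, 9, 11]
data_with_outlier = [1, 3, 4, 5, 6, 7, 9, 100]

print("Range:", data_range(data), "vs", data_range(data_with_outlier))
print("IQR:  ", iqr(data), "vs", iqr(data_with_outlier))
```

In this example the range jumps from 10 to 99 because of the outlier, while the IQR stays the same.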

Normal Distribution
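What: The normal distribution is the symmetric, bell-shaped curve that is the most commonly used distribution in statistics.
Why:  With enough data points, the averages of many kinds of data approximately follow the normal distribution, even when the individual data points do not. This idea is known as the central limit theorem.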
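
One way to read this idea is through the central limit theorem: averages of many independent data points tend to look bell-shaped even when the raw data does not. The simulation below is only a sketch under my own assumptions. The exponential distribution, the sample size of 50, and the 2,000 repetitions are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # data points in each sample (arbitrary choice)
NUM_SAMPLES = 2000   # number of samples drawn (arbitrary choice)

# Each data point comes from an exponential distribution with mean 1,
# which is strongly skewed and not bell-shaped.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

center = mean(sample_means)
spread = stdev(sample_means)

# For a bell-shaped curve, roughly 95% of values fall within
# two standard deviations of the center.
share_within_two = sum(
    abs(m - center) <= 2 * spread for m in sample_means
) / NUM_SAMPLES

print("Center of the sample means:", round(center, 3))
print("Spread of the sample means:", round(spread, 3))
print("Share within two standard deviations:", round(share_within_two, 3))
```

If the sample means are close to normal, the last number should land near 0.95.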

Margin of Error
What: The margin of error is a way of describing the error in a sample. It tells us how far a result from a sample is likely to be from the true value for the whole population.
Why: Since samples are incomplete, they don’t have all the information on the entire population.  Margin of error helps us acknowledge that the observed mean may be different from the actual mean.
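
As an illustration, here is a minimal sketch of the margin of error for a poll result in Python. It assumes a simple random sample and the common 95% confidence level (z = 1.96), and the poll numbers are hypothetical. Real polls often use weighting and other adjustments that this sketch ignores. The function name `margin_of_error` is my own.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate margin of error for a share (proportion) from a
    simple random sample. z = 1.96 corresponds to 95% confidence."""
    return z * math.sqrt(share * (1 - share) / sample_size)

# Hypothetical poll: 1,000 people surveyed, 50% support a candidate.
poll_share = 0.50
poll_size = 1000

moe = margin_of_error(poll_share, poll_size)
print(f"{poll_share:.0%} plus or minus {moe:.1%}")
```

For this hypothetical poll the margin of error is about 3.1 percentage points. Under these assumptions, a poll four times as large would cut it roughly in half.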

 

This is not the end of statistics, but these are the basic terms I will frequently use.

Welcome

My name is Brittany Alexander.  I completed an undergraduate degree in Mathematics at Texas Tech University in May 2018.  I am currently a Ph.D. student in the Statistics department at Texas A&M.  My passion is statistics and how it affects the world around us, with a focus on political science.  Currently, I am researching methods of predicting American elections and analyzing public opinion data in general.  What I have learned in my research is that people may not understand statistics and the role they play in our lives.  My goal is to educate people about basic statistical concepts like margin of error, correlation vs. causation, and why an average isn’t always the best way of finding the middle of a data set.  Data is everywhere, from what TV shows we watch to how many steps our fitness tracker records.  I want to help you understand the world around you by explaining how you can use statistics in your daily life.  This blog is a mix of posts focusing on statistical education and data-centric political and polling analysis, with some posts at the intersection of the two.  I try to use as little theory and math as possible in my explanations.  My opinions are always my own, and I am committed to transparency in my political coverage.