What my Undergraduate Research experience was like in Statistics

I am entering my third and final year of my undergraduate degree. I have been doing research almost since day one, and I wanted to share what my experience was like. As a statistician, I feel I have to mention that this is a sample size of 1 and may not reflect all undergraduate research experiences.

First, I want to give a little background. The summer before my senior year of high school, I was chosen to participate in an NSF (National Science Foundation) funded REU (Research Experience for Undergraduates) at Texas Tech. There I was exposed to what research was like: we had a series of workshops, each led by a different researcher, over a two-week period. I loved the Texas Tech math department and decided to attend Texas Tech for my undergraduate degree. I met my current research advisor, Dr. Ellingson, at the REU.

Right after classes started during my freshman year, I emailed Dr. Ellingson to see if I could do research with him. I started work on image analysis (Dr. Ellingson's specialty). I was also following the GOP nomination race because it interested me, and I had an idea to predict the nomination using Bayesian statistics, similar to how Five Thirty Eight predicts elections. I had talked with Dr. Ellingson before about political science statistics and the need for a statistically sound, open source academic model. He agreed to help guide me through the process of building a model to predict the GOP nomination process.

At the time of the GOP nomination my math background was pretty limited, so I decided to just use Bayes' theorem with a normal distribution for the likelihood. I did all the calculations in Excel, using CSV files of poll data downloaded from Huffington Post Pollster, and I used previous voting results from similar states as the prior in my model. More info about my model can be found here. What I found most challenging was making the many decisions about how I was going to predict the election. I also struggled with decisions about delegate assignments, which often involved breaking the results down by congressional district even when the poll data was statewide. After the first Super Tuesday (March 1st) I began to realize how difficult it is to find a good prior state and to reassign the support of candidates who dropped out of the race. The nomination process taught me that failure is inevitable in research, especially in statistics, where everything is at least slightly uncertain.
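To give a concrete sense of the kind of calculation involved, here is a minimal sketch in Python of a normal-prior, normal-likelihood Bayesian update. The numbers are hypothetical, and my actual calculations were done in Excel with real poll data, but the basic idea is the same.

    # A minimal sketch of a normal-prior, normal-likelihood Bayesian update.
    # The numbers below are hypothetical and only illustrate the idea.

    def normal_update(prior_mean, prior_var, obs_mean, obs_var):
        """Combine a normal prior with a normal likelihood (conjugate update)."""
        post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
        post_mean = post_var * (prior_mean / prior_var + obs_mean / obs_var)
        return post_mean, post_var

    # Prior: a candidate's share of the vote in a similar state that already voted.
    prior_mean, prior_var = 0.35, 0.04
    # Likelihood: the average of recent polls in the state being predicted.
    poll_mean, poll_var = 0.30, 0.01

    mean, var = normal_update(prior_mean, prior_var, poll_mean, poll_var)
    print(f"posterior support: {mean:.3f} (variance {var:.4f})")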

In the summer of 2016, I started gearing up for the general election. I decided to use SciPy (a Python package for science and statistics) to make my predictions. Making the programs was incredibly difficult; I had over a dozen variations to match different combinations of poll data. I had the programs up and running by early October, but I discovered a couple of bugs that invalidated my early test predictions. The original plan was to run the model on the swing states two or three times before the real election. In the middle of October I discovered a bug in one of my programs, which I then had to fix in every program, and finally I did some manual calculations to confirm that the programs worked. It was difficult to admit that my early predictions were totally off, but I am glad I found the problem before the election. Research isn't like a homework assignment with answers in a solution manual. You don't know exactly what is going to happen, and it is easy to make mistakes.

I ended up writing a paper on my 2016 general election model. Writing a paper on your own research is very different from writing a paper on other people's research. My paper was 14 pages (and over 6,500 words) long, and only about one or two pages covered other people's research on the topic. It took a very long time to write, and I went through 17 drafts. I hated writing the paper at first, but when I finished it felt amazing. It was definitely worth the effort.

Undergraduate research is difficult, but I loved the entire process. I got to work with real data to solve a real problem. I learned how to read a research paper, and eventually I got to write my own. I got to give presentations both to general audiences and to mathematicians and statisticians. I got to use my research to inform others about statistics. If you are thinking about doing undergraduate research, you definitely should.

 

Data Sharing

Last semester I took a research ethics class.  I wrote a paper on preregistration and data sharing in academic research. I decided to modify the paper into two blog posts. Here is the first part on data sharing.

Statistics is the study of uncertainty. Any research study that does not involve the entire population of a group cannot provide a definite conclusion with 100% certainty. Conclusions can be made with a high degree of certainty (95-99%), but false positives and false negatives are inevitable in any large statistical analysis. This means that studies can fail to make the right call, and after multiple replications the original conclusion may be overturned.

One way to improve the statistical integrity of research is to have a database of the data from non-published studies. Ideally, this database would be accessible to all academic researchers. A researcher would then be able to see the data from other similar studies and compare it with his or her own. At a significance level of .05, a study of a true null effect has roughly a 1 in 20 chance of producing a statistically significant result purely by chance. That rate applies to theoretically perfect studies that meet all of the statistical assumptions used; any modelling error increases it. With each external replication of a study, the probability that the overall conclusion is a false positive or a false negative greatly decreases. Grants from the National Science Foundation [1] and the National Institutes of Health [2] currently require that data from funded studies be made available to the public after the study is completed, but not all grants and funding sources require this disclosure. Without a universal requirement for data disclosure, it can be difficult to confirm that a study and its results are legitimate.
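To make the benefit of replication concrete, here is a quick back-of-the-envelope illustration in Python. It assumes a true null effect, fully independent studies, and a significance level of .05 for each one, which are strong simplifying assumptions.

    # Back-of-the-envelope: the chance that several independent studies of a
    # true null effect ALL come back statistically significant by chance alone.
    # Assumes independence and a .05 significance level for every study.
    alpha = 0.05

    for replications in range(1, 4):
        p_all_false_positives = alpha ** replications
        print(f"{replications} independent significant result(s) by chance: "
              f"{p_all_false_positives:.6f}")

With one study the chance is .05, with two independent replications it drops to .0025, and with three it is barely over one in ten thousand.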

Advocates of open data say that data sharing saves time and reduces false positives and false negatives. A researcher can look at previously conducted studies and try to replicate the results, and the original results can be recalculated by another researcher to confirm their accuracy. In a large study with lots of data it is very easy to make a few mistakes, and these mistakes could cause the results to be misinterpreted. Open data can even help discover fraudulent studies. There are methods to estimate the probability that data is fraudulent by looking at the relative frequency of the digits, which should be roughly uniform. In 2009, Strategic Vision (a polling company) came under fire for potentially falsifying polls after a Five Thirty Eight analysis [3] found that the digits in its reported results didn't look quite right. That isn't an academic example, but open access to data could keep fraudulent studies from being accepted as fact, as happened with the infamous vaccines-cause-autism study. These statistical analyses of randomness aren't definitive, but they can raise questions that prompt further investigation of the data. Open data makes replication easier, and easier replication can help confirm findings more quickly, which matters because false positives and false negatives can cause real harm.
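Here is a rough sketch of what a digit-frequency check like that can look like in Python. The numbers below are made up for illustration, and this is not the exact analysis Five Thirty Eight performed; a suspicious p-value is a reason to look closer, not proof of fraud.

    # A rough sketch of a trailing-digit frequency check on reported results.
    # The values below are made up; this is only an illustration of the idea.
    from collections import Counter
    from scipy.stats import chisquare

    reported_values = [52, 47, 39, 58, 41, 44, 63, 37, 49, 51,
                       46, 53, 48, 42, 57, 38, 54, 43, 59, 47]

    # Count how often each trailing digit (0-9) appears.
    digit_counts = Counter(v % 10 for v in reported_values)
    observed = [digit_counts.get(d, 0) for d in range(10)]

    # Under honest reporting the trailing digits should be close to uniform,
    # so compare the observed counts against a uniform expectation.
    stat, p_value = chisquare(observed)
    print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")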

 

Works Cited

[1] Public Access To the Results of NSF-Funded Research. (n.d.). Retrieved April 28, 2017, from https://www.nsf.gov/news/special_reports/public_access/

[2] NIH’s Commitment to Public Accountability. (n.d.). Retrieved April 28, 2017, from https://grants.nih.gov/grants/public_accountability/


[3] Silver, N. (2014, May 07). Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud. Retrieved April 28, 2017, from https://fivethirtyeight.com/features/strategic-vision-polls-exhibit-unusual/

A Non-Technical Overview of My Research

Recently I have been writing up a draft of a research article on my general election model to submit for academic publication. That paper is technical, though, and requires some exposure to statistical research to understand, so I wanted to explain my research here without going into all the technical details.

Introduction

The President of the United States is elected every four years. The winner is decided by the Electoral College, made up of electors chosen by their home states. Usually the electors are chosen based on the winner of their state's popular vote, and they vote for that winner. Nate Silver correctly predicted the winner of the 2008 election using Bayesian statistics, getting 49 out of 50 states correct. Silver certainly wasn't the first person to predict the election, but he received a lot of attention for his model. Silver runs Five Thirty Eight, which covers statistics and current events. Bayesian statistics is a branch of statistics that uses information you already know (called a prior) and adjusts the model as more information comes in. My model, like Nate Silver's, used Bayesian statistics. We do not know the details of Silver's model beyond the fact that it uses Bayesian statistics. To the best of my knowledge, my method is the first publicly available model that used poll data from other states as the prior. I made a prediction for 2016, where I correctly predicted 6 states. Then the model was applied to 2008 and 2012, where my predictions of state winners matched the predictions of Five Thirty Eight.

Methodology

I took poll data from Pollster, which provided CSV files for the 2016 and 2012 elections; for 2008 I had to create the CSVs by hand. I wrote a series of computer programs in Python (a common programming language) to analyze the polls. My model used the normal distribution. My approach divided the 50 states into 5 regional categories: swing states, southern red states, midwestern red states, northern blue states, and western blue states. The poll data sources used as the priors were national polls, Texas, Nebraska, New York, and California, respectively. This approach is currently believed to be unique, but since multiple models are proprietary it is unknown whether it has been used before. I only used polls if they were added to Pollster before the Saturday before the election date. For the 2016 election analysis this meant November 5th, and I posted my predictions on November 5th.

I outline more of my method here.
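As a small illustration of two pieces of the methodology, here is a sketch in Python of the regional prior assignments and the rule that only polls added before the Saturday before the election are used. The polls in it are hypothetical, and my actual programs were considerably more involved.

    # A minimal sketch of the regional prior sources and the poll cutoff rule.
    # The polls listed below are hypothetical.
    from datetime import date

    # Each regional category and the poll source used as its prior.
    PRIOR_SOURCE = {
        "swing": "National",
        "southern red": "Texas",
        "midwestern red": "Nebraska",
        "northern blue": "New York",
        "western blue": "California",
    }

    CUTOFF = date(2016, 11, 5)  # the Saturday before the 2016 election

    # Hypothetical polls: (date added to Pollster, candidate support in percent).
    polls = [
        (date(2016, 10, 20), 46.0),
        (date(2016, 11, 3), 48.0),
        (date(2016, 11, 7), 51.0),  # added after the cutoff, so excluded
    ]

    usable = [support for added, support in polls if added <= CUTOFF]
    print(f"prior source for a swing state: {PRIOR_SOURCE['swing']}")
    print(f"{len(usable)} of {len(polls)} polls used, "
          f"average support {sum(usable) / len(usable):.1f}%")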

Results and Discussion

My model worked pretty well compared to other models. Below is a table of several models and their success rates at predicting the winning candidate in all 50 states and Washington D.C. (for example, calling 50 of the 51 races correctly gives 50/51 ≈ 0.98039).

Model                             2008      2012      2016      Average
Real Clear Politics               0.96078   0.98039   0.92157   0.95425
Princeton Election Consortium     0.98039   0.98039   0.90196   0.95425
Five Thirty Eight (Polls Plus)    0.98039   1.00000   0.90196   0.96078
PredictWise (Fundamental)         N/A       0.98039   0.90196   0.94118
Sabato's Crystal Ball             1.00000   0.96078   0.90196   0.95425
My Model                          0.98039   1.00000   0.88235   0.95425

As you can see, all of the models do a similar job of picking the winner in each state, which is what determines the Electoral College outcome. There are other ways to compare accuracy, but I don't want to discuss them here since they get a little technical. No one was right for every state in every election, and it would probably be impossible to create a model that consistently predicts the winner in all states, because of the variability of political opinions. Election prediction is not an exact science. But there is potential to apply polling analysis to estimate public opinion on certain issues and politicians. Right now the errors in polls are too large to determine public opinion on close issues, but further research could find ways to reduce the error in polling analysis.