My New Project: Revised Models to Predict American Presidential Elections Preregistration

My current project is a series of new models to predict American presidential elections, building on the original model with some minor changes. The new models combine three different methods to reassign undecided voters, two different conjugate priors, and three different ways to calculate with the Gaussian conjugate prior. The models deal with hypothetical election results involving only the two major parties’ candidates. In total there are 12 models. This is a pre-registration post laying out my methodology, some thoughts on what I think will happen, and what I am looking for in the results.

One of the key features of this project is that while it still takes a similar approach of using poll data from other states as the prior, it expands the prior to a pooled collection of all the polls from within the category. I believe this new method will help address some of the issues I faced choosing a single source of polls as the prior, and it may help in swing states, where the prior will draw on polls from other swing states.

One of the goals of this project is to have better definitions of swing states and prior regions. The original model had definitions that were admittedly somewhat ad hoc. In this new project, I define a swing state as a state that has been won by both a Democratic candidate and a Republican candidate in the past four elections. Overall, I like this definition because it is easy to use, but I wish it could capture future swing states like Indiana in 2008, and Michigan, Pennsylvania, and Wisconsin in 2016.

Since I don’t have the same time constraints I had with the 2016 model, I have been able to put more thought into how prior regions should be defined. This time I am going to stick closer to the US Census regions (found here) and divide the West and Midwest Census regions into a red-state and a blue-state subgroup. I am going to split the Southern and Northeastern regions into two subgroups each with the same partisan alignment. I am going to move Delaware, Maryland, and Washington DC into the Middle Atlantic subregion of the Northeastern region, since the Southern blue states are too small a group on their own and Washington DC and Delaware usually only have a handful of polls. I think these states would benefit from being joined with the Middle Atlantic region, and the move will help even out the between-state demographic variation. Since the Census regions are based more on geography than culture and politics, according to the history of the Census regions found here, I feel comfortable doing this.

I am also changing my mind from the previous model on the placement of Missouri. The fact that the race was so close in Missouri in 2008 indicates to me that its political culture may be more like the Midwest than the South. To me, a key feature of the Midwest (and the smaller Western states) is that state partisanship is weaker than in other states, and swings are more common compared to the Northeastern region or the South.
I am going to keep Missouri in the Midwest region, where it sits in the US Census regions. I am splitting the Northeast into the Middle Atlantic and New England subgroups. In the South, I am going to split it into two regions: one containing the West South Central region plus Tennessee and Kentucky, and another with the South Atlantic region plus Mississippi and Alabama. Dividing the South was a difficult decision, but I looked at the Electorate Profiles and decided that this was the best way to preserve demographic similarity among key groups (Whites, Hispanics, African Americans, college-educated individuals, high-income earners, and people in poverty) within the Southern regions. Deciding the group for the Southern blue states was hard because they were too small a group to stand alone, and while the Middle Atlantic region wasn’t a great fit, it was the best fit.
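The swing-state definition above is simple enough to sketch as a small function. This is only an illustration of the rule, not code from the models; the winner sequences below are examples (the first follows a Florida-like pattern for 2004–2016).

```python
def is_swing_state(winners):
    """A state is a swing state if both major parties won it
    at least once in the past four presidential elections."""
    recent = winners[-4:]
    return "D" in recent and "R" in recent

# Illustrative winner sequences, oldest to newest:
print(is_swing_state(["R", "D", "D", "R"]))  # Florida-like pattern -> True
print(is_swing_state(["D", "D", "D", "D"]))  # safe blue state -> False
```

As noted above, the rule is backward-looking by construction, so a state like Wisconsin before 2016 would come out False even though it was about to swing.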

The models use three different methods to reassign undecided and minor-party voters. The first method reassigns the voters based on the past election results. The second method splits the undecided voters equally between the two candidates. The third method reassigns the undecided voters in proportion to each candidate’s support.

For example, consider a poll of a hundred people with 50 supporters of the Democrat, 40 supporters of the Republican, and 10 undecided voters, in a state that voted 60% for the Democrat and 40% for the Republican in the last election. Under the first method, 6 of the undecided voters would be reassigned to the Democratic candidate and the other 4 to the Republican candidate, making the poll results 56 for the Democrat and 44 for the Republican. The second method would reassign 5 voters to each candidate, making the adjusted poll results 55 and 45. Under the third method, the Democratic candidate received 55.556% of the two-party support and the Republican received 44.444%; since this translates to fractions of a person, the figures of 5.556 and 4.444 undecided voters are rounded to 6 and 4 respectively.

I realize I could drop the undecided voters from the polls, as done in this paper by Lock & Gelman, but I am using poll data to predict the election result rather than taking a time-series approach. I haven’t found anyone using past election results to reassign voters. FiveThirtyEight splits the undecideds evenly between the two candidates, which is why I included that method. This paper by Christensen & Florence discusses the proportional reassignment of undecided voters; it describes an undergraduate project on predicting elections and has been a heavy inspiration for my research.
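The three reassignment methods can be sketched as small helpers and checked against the worked example above. This is a minimal illustration, not the models’ actual code; each helper keeps the poll total intact by giving the Republican whatever remains after rounding the Democratic share.

```python
def reassign_past(d, r, u, past_d_share):
    """Method 1: split undecideds by the previous election's two-party result."""
    to_d = round(u * past_d_share)
    return d + to_d, r + (u - to_d)

def reassign_even(d, r, u):
    """Method 2: split undecideds evenly between the two candidates."""
    return d + u / 2, r + u / 2

def reassign_proportional(d, r, u):
    """Method 3: split undecideds in proportion to two-party support in the poll."""
    d_share = d / (d + r)
    to_d = round(u * d_share)
    return d + to_d, r + (u - to_d)

# Worked example from the text: 50 D, 40 R, 10 undecided; the state
# voted 60/40 for the Democrat in the last election.
print(reassign_past(50, 40, 10, 0.60))    # (56, 44)
print(reassign_even(50, 40, 10))          # (55.0, 45.0)
print(reassign_proportional(50, 40, 10))  # (56, 44)
```

In this example methods 1 and 3 happen to agree because the past result (60%) and the poll’s two-party share (55.556%) round to the same split; they diverge when the state’s past vote differs more from the current polling.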

Conjugate Prior and Calculation Methods

These models use either the beta or the Gaussian conjugate prior. The goal of the models is to predict the proportion of votes for the Democratic candidate among the two major-party candidates. The data is binomial with a Bernoulli likelihood, but the extent of the independence of respondents concerns me. I think individuals show up multiple times in the polls, meaning that the observations are not independent. If the data were truly i.i.d., I would be comfortable using the beta conjugate prior, but since that is likely not the case, I am afraid this causes an underestimation of the variance. I am curious what effects using the normal approximation to the binomial distribution has in the context of predicting elections from polls. I also want to see the effects that the different methods of reassigning voters and the new prior have on the original calculation method from the previous study. In the original study, I used the standard deviation and count of polls inside the Gaussian conjugate prior.

There are four different base models: a beta conjugate prior model; a Gaussian model that uses the normal approximation to the binomial distribution and updates after every poll; a Gaussian model that averages the polls, finds the standard deviation of the poll data, and uses that information to make the calculation; and a Gaussian model that combines the polls into one giant poll and applies the normal approximation to the binomial distribution. If I had to choose the better assumption, I would say that polls are independent rather than that people are independent, but I plan on eventually exploring ways to relax the independence assumption.
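As a sketch of the two conjugate updates, assuming illustrative numbers (the pooled regional prior counts and the single adjusted poll below are hypothetical, not data from the project):

```python
def beta_update(a, b, d, r):
    """Beta-binomial conjugate update: add Democratic supporters to alpha
    and Republican supporters to beta."""
    return a + d, b + r

def gaussian_update(mu, var, p_hat, n):
    """Normal approximation to the binomial: treat the poll's p_hat as normal
    with variance p_hat * (1 - p_hat) / n, then do a known-variance
    conjugate Gaussian update (precision-weighted average)."""
    data_var = p_hat * (1 - p_hat) / n
    post_var = 1 / (1 / var + 1 / data_var)
    post_mu = post_var * (mu / var + p_hat / data_var)
    return post_mu, post_var

# Hypothetical pooled regional prior: 600 D vs 400 R respondents.
a, b = beta_update(600, 400, 56, 44)   # one adjusted poll: 56 D, 44 R
print(a / (a + b))                     # beta posterior mean of the D share

# The same poll under the Gaussian model, starting from a prior
# with matching mean (0.6) and a variance based on 1000 respondents.
mu0, var0 = 0.6, 0.6 * 0.4 / 1000
mu1, var1 = gaussian_update(mu0, var0, 0.56, 100)
print(mu1, var1)                       # posterior mean lands near the beta result
```

Because beta updates simply add counts, updating poll by poll and pooling everything into one giant poll give the same beta posterior; the per-poll versus pooled distinction only matters for the Gaussian variants described above.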

Choosing the “Best” Model

I don’t think I am going to take all twelve models and turn them into multilevel models or run simulations. Based on the data I have, every model is run 153 times (once for each of the 50 states plus DC, for each of the 2008, 2012, and 2016 elections). The pooled models would likely not translate well into a time-series model. The main question I am asking is: do these changes make the model more accurate, or at least as accurate as the original model? I also want to know whether the method used to reassign undecided voters matters. I don’t think it will, since the proportion of undecided voters is small and the polls and the past vote are usually similar. I don’t like the idea of splitting the undecided vote evenly between the two candidates because I think it doesn’t work as well in highly partisan states: I don’t believe that undecided voters in West Virginia or Massachusetts are going to turn out and vote equally for the two major candidates. What I am hoping to get out of this is a rough idea of whether any of these changes have a practical effect on accuracy. If there is no difference, I will probably opt for proportionally reassigning voters and iteratively updating the model.

Looking Forward to Further Research

This project is an intermediate step in testing the use of poll data from other areas as part of the prior in a Bayesian model to predict American national elections. Since there are a lot of key changes in this new set of models, I want to get more data on the accuracy of my idea of exclusively poll-based models to predict elections. Later, I hope to turn this into a time-series multilevel model, with and without the inclusion of a fundamental model. I don’t have anything against fundamental modeling, but an exclusively poll-based model requires less data collection than fundamental modeling. I want to see the viability of this method, because if it can match the performance of fundamental models, then it may be the better strategy. In the future, I want to make my own fundamental model that treats swing states differently from partisan states. I intend to look at state-level and regional-level effects on voting behavior. The big assumption of this method is that state-level effects within a region are small, and that pooling the polls across a region mitigates them enough that the pooled polls are a good preliminary estimate of voting behavior.