2018 Midterms

It’s (finally) election season again, which means my Friday politics blog posts are starting back up from now until at least the end of August. I am starting a Statistics Ph.D. program at Texas A&M in the fall, so I might not have time to blog regularly for the rest of election season, but I will try. I am so excited to follow the 2018 midterm elections!

This summer I plan to immerse myself in the midterm elections the way I did for the 2016 Presidential race. The goal right now is to use the model I already built to predict the 33 Senate races, and possibly a few House races if I have time. This election gives me the opportunity to build a few new, different kinds of priors to use in my Iterative Gaussian model. I will continue to break the races into groups and use a method similar to what I have used to predict Presidential elections, but I am planning to bring in poll data from the Generic House Ballot with some adjustments.

The Senate races bring new challenges because each race has a different set of candidates, which will complicate how I choose races to serve as the prior. What will probably happen is that I will break races down into categories based on how competitive each race is and which direction I think it is leaning. Hopefully, polling for every Senate race will start to pop up after the primaries. If some of the races don’t have any polls, I might have to get creative.

Right now, I think Republicans will keep the Senate majority. There are only 8 Republican Senate seats up for reelection. Of those, only Nevada was not won by Trump in the last election. Additionally, there are vulnerable Democratic senators in West Virginia, Missouri, Montana, and Indiana, which are all states Trump won by a large margin. Given the polarization of the electorate, I don’t think enough Trump voters (at least those happy with his performance) can be convinced to vote for a Democrat. I also believe Ted Cruz will be reelected to his seat in Texas. O’Rourke might have more money and is gaining popularity, but it will be an uphill battle for him to convince enough moderate Republicans and Republican-leaning independents to vote for him, even though he will probably vote the party line if elected.

My New Project: Revised Models to Predict American Presidential Elections Preregistration

My current project is a series of new models to predict American Presidential elections, similar to the original model but with some changes. The new models use 3 different methods to reassign undecided voters, 2 different conjugate priors, and 3 different ways to perform the calculation with the Gaussian conjugate prior. The models deal with hypothetical election results involving only the two major parties’ candidates. In total there are 12 models. This is a preregistration post with my methodology, some thoughts on what I think will happen, and what I am looking for in the results.

One of the key features of this project is that while it still takes a similar approach of using poll data from other states as the prior, it expands the prior to a pooled collection of all the polls from within the category. I believe this new method will help address some of the issues I faced when choosing one source of polls as the prior, and it will possibly help in swing states, where it will use polls from other swing states.

One of the goals of this project is to have better definitions of swing states and prior regions. The original model had definitions that were admittedly somewhat ad hoc. In this new project, I define a swing state as a state that has been won by both a Democratic candidate and a Republican candidate in the past four elections. Overall, I like this definition because it is easy to use, but I wish it could capture future swing states like Indiana in 2008 and Michigan, Pennsylvania, and Wisconsin in 2016.

Since I don’t have the same time constraints I had with the 2016 model, I have been able to put more thought into how prior regions should be defined. This time I am going to stick closer to the US Census regions (found here) and divide the West and Midwest Census regions into a red-state and a blue-state subgroup. I am going to split the Southern and Northeastern regions into two subgroups each, with both subgroups sharing the region’s partisan alignment. I am going to move Delaware, Maryland, and Washington DC into the Middle Atlantic subregion of the Northeastern region, since those three on their own would be too small a region and Washington DC and Delaware usually have only a handful of polls. I think these states would benefit from being joined with the Middle Atlantic region, and doing so will help even out the between-state demographic variation. Since the Census regions are based more on geography than on culture and politics, according to the history of the Census regions found here, I feel comfortable doing this.

I am also changing my mind from the previous model on the placement of Missouri. The fact that the race was so close in Missouri in 2008 indicates to me that its political culture may be more like the Midwest than the South. To me, a key feature of the Midwest (and the smaller Western states) is that state partisanship is weaker than in other states, and swings are more common than in the Northeastern region or the South. So I am going to keep Missouri in the Midwest region, where it sits in the US Census regions. I am splitting the Northeast into the Middle Atlantic and New England subgroups. I am splitting the South into two regions: one containing the West South Central region plus Tennessee and Kentucky, and another containing the South Atlantic region plus Mississippi and Alabama. Dividing the South was a difficult decision, but I looked at the Electorate Profiles and decided that this was the best way to preserve demographic similarity among key groups (Whites, Hispanics, African Americans, college-educated individuals, high-income earners, and the percentage in poverty) within the Southern regions. Deciding on the group for the Southern blue states was hard because they were too small a group to stand alone, and while the Middle Atlantic region wasn’t a great fit, it was the best fit.

The models use three different methods to reassign undecided and minor-party voters. The first method reassigns the voters based on the past election results. The second method splits the undecided voters equally between the two candidates. The third method reassigns the undecided voters in proportion to each candidate’s support. For example, consider a poll of a hundred people with 50 supporters of the Democrat, 40 supporters of the Republican, and 10 undecided voters, in a state that voted 60% for the Democrat and 40% for the Republican in the last election. Under the first method, 4 of the undecided voters would be reassigned to the Republican candidate and the other 6 to the Democratic candidate, making the poll results 56 for the Democrat and 44 for the Republican. The second method would reassign 5 voters to each candidate, making the adjusted poll results 55 and 45. Under the third method, the Democratic candidate received 55.556% of the two-party support and the Republican received 44.444%; since this translates to fractions of a person, the reassigned figures of 5.556 and 4.444 are rounded to 6 and 4 respectively. I realize I could drop the undecided voters from the polls, as done in this paper by Lock & Gelman, but I am using poll data to predict the election result rather than taking a time series approach. I haven’t found anyone using past election results to reassign voters. FiveThirtyEight splits the undecideds evenly between the two candidates, which is why I included that method. This paper by Christensen & Florence discusses the proportional reassignment of undecided voters; it describes an undergraduate project on predicting elections and has been a heavy inspiration for my research.
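
Here is a minimal sketch of the three reassignment methods applied to the worked example above. The function names and the rounding rule are my own illustration, not the exact code used in the models.

```python
# Sketch of the three undecided-voter reassignment methods described above.
# Numbers follow the worked example: 50 D, 40 R, 10 undecided, and a state
# that previously voted 60% D / 40% R. Rounding choices are illustrative.

def reassign_past_results(dem, rep, undecided, past_dem_share):
    """Method 1: split undecideds according to the last election's two-party result."""
    dem_add = round(undecided * past_dem_share)
    return dem + dem_add, rep + (undecided - dem_add)

def reassign_evenly(dem, rep, undecided):
    """Method 2: split undecideds equally between the two candidates."""
    half = undecided / 2
    return dem + half, rep + half

def reassign_proportionally(dem, rep, undecided):
    """Method 3: split undecideds in proportion to each candidate's two-party support."""
    dem_share = dem / (dem + rep)
    dem_add = round(undecided * dem_share)
    return dem + dem_add, rep + (undecided - dem_add)

print(reassign_past_results(50, 40, 10, 0.60))   # (56, 44)
print(reassign_evenly(50, 40, 10))               # (55.0, 45.0)
print(reassign_proportionally(50, 40, 10))       # (56, 44)
```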

Conjugate Prior and Calculation Methods

These models use either the beta or the Gaussian conjugate prior. The goal of the models is to predict the proportion of votes for the Democratic candidate among the two major-party candidates. The data is binomial with a Bernoulli likelihood, but the extent of the independence of respondents concerns me. I think some individuals show up multiple times in the polls, meaning the observations are not independent. If the data were truly i.i.d., I would be comfortable using the beta conjugate prior, but since that is likely not the case, I am afraid this causes an underestimation of the variance. I am curious what effect using the normal approximation to the binomial distribution has in the context of predicting elections from polls. I also want to see the effects that the different methods of reassigning voters and the new prior have on the original calculation method from the previous study. In the original study, I used the standard deviation and count of polls inside the Gaussian conjugate prior. There are 4 different model types: a beta conjugate prior model; a Gaussian model that uses the normal approximation to the binomial distribution and updates after every poll; a Gaussian model that averages the polls, finds the standard deviation of the poll data, and uses that information to make the calculation; and a Gaussian model that combines the polls into one giant poll and uses the normal approximation to the binomial distribution. If I had to choose the better assumption, I would say that polls are independent rather than that people are independent. But I plan on eventually exploring ways to remove the independence assumption.
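
To make the update step concrete, below is a minimal sketch of a beta count update and a poll-by-poll normal-normal update under the normal approximation. The prior values and poll numbers are placeholders for illustration, not the actual pooled prior regions described above.

```python
import math

# Beta update: the prior counts could come from pooled prior-region polls,
# and each new in-state poll adds its two-party respondent counts.
def beta_update(alpha, beta, dem_respondents, rep_respondents):
    return alpha + dem_respondents, beta + rep_respondents

# Normal-normal update using the normal approximation to the binomial:
# each poll contributes a mean p_hat with variance p_hat * (1 - p_hat) / n.
def gaussian_update(prior_mean, prior_var, p_hat, n):
    data_var = p_hat * (1 - p_hat) / n
    post_var = 1 / (1 / prior_var + 1 / data_var)
    post_mean = post_var * (prior_mean / prior_var + p_hat / data_var)
    return post_mean, post_var

# Illustrative prior (placeholder values) updated with two hypothetical polls.
mean, var = 0.52, 0.02 ** 2
for p_hat, n in [(0.55, 800), (0.53, 600)]:
    mean, var = gaussian_update(mean, var, p_hat, n)
print(round(mean, 4), round(math.sqrt(var), 4))
```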

Choosing the “Best” Model

I don’t think I am going to take all twelve models and turn them into multilevel models or run simulations. Based on the data I have, every model is run 153 times (3 elections times the 50 states plus DC) to predict the 2008, 2012, and 2016 elections. The pooled models would likely not translate well into a time series model. The main question I am asking is: do these changes make the model more accurate, or at least as accurate as the original model? I also want to know whether the method used to reassign undecided voters matters. I don’t think it will, since the proportion of undecided voters is small and the difference between the polls and the past vote is usually small. I don’t like the idea of splitting the vote evenly between the two candidates because I don’t think it works as well in highly partisan states. I don’t think that the undecided voters in West Virginia or Massachusetts are ever going to turn out and vote equally for the two major candidates. What I am hoping to get out of this is a rough idea of whether any of these changes have a practical effect on accuracy. If there is no difference, I will probably opt for proportionally reassigning voters and iteratively updating the model.
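
One straightforward way to compare the variants, consistent with the RMSE figures I report elsewhere, is to compute each model’s RMSE against the actual two-party results. A minimal sketch with made-up numbers:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error of predicted vs. actual two-party vote share (percent)."""
    predicted, actual = np.asarray(predicted, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical predictions from two model variants for three state-year runs.
actual = [53.0, 47.1, 59.2]
print(rmse([52.1, 48.3, 61.0], actual))  # variant A
print(rmse([52.8, 47.5, 58.6], actual))  # variant B
```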

Looking Forward to Further Research

This project is an intermediate step in the process of testing the use of poll data from other areas as part of the prior in a Bayesian model to predict American national elections. Since there are a lot of key changes in this new set of models, I want to get more data on the accuracy of my idea of exclusively poll-based models. What I hope to do later is turn this into a time series multilevel model, with and without the inclusion of a fundamentals model. I don’t have anything against fundamentals modeling, but an exclusively poll-based model requires less data collection. I want to see how viable this method is, because if it can match the performance of fundamentals models, then it may be a better strategy. In the future I want to build my own fundamentals model that treats swing states differently from partisan states. I intend to look at state-level and regional-level effects on voting behavior. The big assumption of this method is that state-level effects within a region are small and that pooling the polls across a region mitigates them, so that the pooled polls are a good preliminary estimate of voting behavior.

Correction Notice for Results

I was rechecking my error calculations after I received a comment about them from a reviewer of my paper. An example calculation was incorrect. This was a minor error, but further examination led to the discovery of an error in the 2-party error of my model for 2012. The reported error was approximately half of what it should have been. This mistake made my model falsely appear more accurate than the FiveThirtyEight model due to the underestimation. All of the error calculations are currently being reexamined for possible errors. I have already recalculated all the errors, but I want to check them a couple more times to be safe. A corrected table will be posted once it is checked again.

Update: 12/5

No other major errors were found in the rechecking process. All calculations have been checked three times since the discovery of the error in my model’s 2012 2-Party figure.

Update: 1/20: Fixed a typo in the tested model’s 2008 figures for both All Candidates and 2-Party, and adjusted the averages.

Below are the updated tables to replace the former tables used in both the ESR Virtual Poster and the USPROC Paper:

| | Tested Model RMSE | Tested Model RMSE Swing States | RCP RMSE Swing States | 538 RMSE | 538 RMSE Swing States |
|---|---|---|---|---|---|
| 2008 All Candidates | 3.5474 | 3.14788 | 4.23389 | 3.19332 | 1.66958 |
| 2008 2-Party | 2.89669 | 2.57051 | 3.63513 | 3.0305 | 1.47846 |
| 2012 All Candidates | 3.25139 | 1.94492 | 2.33511 | 2.38019 | 1.2979 |
| 2012 2-Party | 2.37053 | 1.17163 | 1.61076 | 1.98642 | 0.9342 |
| 2016 All Candidates | 6.82013 | 3.95985 | 3.32952 | 5.37952 | 3.56511 |
| 2016 2-Party | 3.95985 | 3.14325 | 2.04295 | 3.81296 | 2.31948 |
| All Candidates Average | 4.53964 | 3.01755 | 3.299507 | 3.65101 | 2.17753 |
| 2-Party Average | 3.07569 | 2.29513 | 2.42961 | 2.94329 | 1.57738 |
| 2-Party Average Compared to 538 | 0.95695 | 0.68727 | 0.64923 | | |
| 2-Party Compared to RCP | | 1.05859 | | | |

| | Tested Model RMSE | Tested Model RMSE SS | RCP RMSE SS | FiveThirtyEight Polls Plus RMSE | 538 RMSE SS |
|---|---|---|---|---|---|
| 2008 | 3.5474 | 3.14788 | 4.23389 | 3.19332 | 1.66958 |
| 2008 2-Party | 2.89669 | 2.57051 | 3.63513 | 3.0305 | 1.47846 |
| 2012 | 3.25139 | 1.94492 | 2.33511 | 2.38019 | 1.2979 |
| 2012 2-Party | 2.37053 | 1.17163 | 1.61076 | 1.98642 | 0.9342 |
| 2016 | 6.82013 | 6.42335 | 8.23311 | 5.37952 | 4.14228 |
| 2016 2-Party | 3.95985 | 3.03986 | 1.89412 | 3.81296 | 2.41263 |
| 2016 SS without UT and AZ | | 3.99534 | 3.32952 | | 3.56511 |
| 2016 SS without UT and AZ 2-Party | | 3.14325 | 2.04295 | | 2.31948 |
| Overall Average | 4.53964 | 3.83872 | 4.93404 | 3.65101 | 2.36992 |
| 2-Party Average | 3.07569 | 2.26067 | 2.38 | 2.94329 | 1.60843 |
| 2-Party Average Compared to 538 | 0.95695 | 0.71148 | 0.67581 | | |
| 2-Party Compared to RCP | | 1.05279 | | | |

What a Pulmonary Embolism Taught Me About Statistics

On May 3rd, 2017, I was released from the hospital following an overnight stay for the treatment of a pulmonary embolism.  I am now almost fully recovered.  I think this experience is a great opportunity to teach statistics through a real-life example. I learned three things from this experience.

Vastly different fields can have the same underlying statistical processes

So far I have worked almost exclusively with political science data. My research is about how to estimate a proportion from a sample and how to compare it to other proportions. When my doctor told me I might have a pulmonary embolism, I wanted to see the data for myself. So I read the journal articles, FDA case reports, and any data I could find to try to estimate the chance that I had a pulmonary embolism. What I quickly realized is that the data about adverse drug reactions had a lot in common with the political science data I was familiar with. The data had issues with nonresponse bias and limitations due to less-than-ideal sample sizes. Although political science and pharmacology are very different fields, they share similar kinds of statistical problems.

Bayesian statistics is a powerful tool in many fields

Through this process, I saw how Bayesian statistics could help solve a difficult and important problem. My doctor came by and saw me during my brief hospital stay. She explained that while it was unlikely that any random woman in her twenties would have a pulmonary embolism, the details of my case suggested that the probability I had one was significant. In short, the Bayesian mindset is about incorporating your prior beliefs and adapting them in the presence of additional information. I don’t think my doctor explicitly used Bayes’ theorem (the formal formula for updating a probability given new information), but she used Bayesian reasoning. She had initial beliefs about the cause of my symptoms, and she updated her beliefs when she got new information (like lab results). This is probably normal reasoning for a doctor trying to diagnose a patient, but it showed me how Bayesian statistics could be applied to other fields. A more formal use of Bayesian statistics would provide even better estimates of these probabilities. I always knew Bayesian statistics could be useful in cases besides politics, but this experience showed me a new area I am interested in researching.
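
To make that reasoning concrete, here is a toy Bayes’ theorem calculation. The prevalence, sensitivity, and false-positive rate below are made up purely for illustration and have nothing to do with my actual case.

```python
# Toy Bayes' theorem example with made-up numbers: P(PE | positive test).
prior = 0.01           # hypothetical prior probability of a PE given the symptoms
sensitivity = 0.95     # hypothetical P(positive test | PE)
false_positive = 0.10  # hypothetical P(positive test | no PE)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # a positive result raises a 1% prior to roughly 9%
```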

I am interested in applying statistics to fields besides politics

I wish I could have discovered my interest in biostatistics without a life-threatening medical event, but I am glad I discovered it. I was exposed to a problem that is important and that uses some of the same techniques I encountered in my work on political science. While I still love political science statistics, I feel like I have now answered the question of what I can research in years when there is no major election. I enjoyed reading clinical trials and studies and analyzing their statistics. Maybe someday I can even study how to improve the statistical methods used to prevent and diagnose pulmonary embolisms like mine.

Six months after returning home from the hospital, I am grateful that God has found a way to use my PE for good.

What My Undergraduate Research Experience in Statistics Was Like

I am entering my third and final year of my undergraduate degree. I have been doing research since almost day one, and I wanted to share what my experience was like. As a statistician, I feel like I have to mention that this is from a sample size of 1 and may not reflect all undergraduate research experiences.

First, I want to give a little background. The summer before my senior year of high school, I was chosen to participate in an NSF-funded (National Science Foundation) REU (Research Experience for Undergraduates) at Texas Tech. There I was exposed to what research is like. We had a series of workshops, each led by a different researcher, over a two-week period. I loved the Texas Tech math department and decided to attend Texas Tech for my undergraduate degree. I met my current research advisor, Dr. Ellingson, at the REU.

Right after classes started during my freshman year, I decided to email Dr. Ellingson and see if I could do research with him. I started work on image analysis (Dr. Ellingson’s specialty). I was also following the GOP nomination because it was interesting to me. I had an idea to predict the nomination using Bayesian statistics, similar to how FiveThirtyEight predicts elections. I had talked with Dr. Ellingson about political science statistics before and how there was a need for a statistically sound, open-source academic model. He agreed to help guide me through the process of building a model to predict the GOP nomination process.

At the time of the GOP nomination my math background was pretty limited, so I decided to just use Bayes’ theorem with a normal distribution to estimate the likelihood. I did all the calculations in Excel, and I downloaded CSV files of the poll data from Huffington Post Pollster. I used previous voting results from similar states as the prior in my model. More info about my model can be found here. What I found most challenging was making the many decisions about how I was going to predict the election. I also struggled with the decisions about delegate assignments, which often involved breaking the results down by congressional district even when the poll data was statewide. After the first Super Tuesday (March 1st), I began to realize how difficult it is to find a good prior state and to reassign the support of candidates who dropped out of the race. The nomination process taught me that failure is inevitable in research, especially in statistics, where everything is at least slightly uncertain.

In the summer of 2016, I started gearing up for the general election. I decided to use SciPy (a Python package for science and statistics) to make my predictions. Writing the programs was incredibly difficult. I had over a dozen variations to match different combinations of poll data. I had the programs up and running by early October, but I discovered a couple of bugs that invalidated my early test predictions. The original plan was to run the model on the swing states two or three times before the real election. In the middle of October I discovered a bug in one of my programs, and I then had to fix it in every program. I finally did some manual calculations to confirm the programs worked. It was difficult to admit that my early predictions were totally off, but I am glad I found the bug before the election. Research isn’t like a homework assignment with answers in a solution manual. You don’t know exactly what is going to happen, and it is easy to make mistakes.

I ended up writing a paper on my 2016 general election model. Writing a paper on your own research is very different from writing a paper on other people’s research. My paper was 14 pages (and over 6,500 words) long, and only about one or two pages were about other people’s research on the topic. It took a very long time to write, and I went through 17 drafts. I hated writing the paper at first, but when I finished, it felt amazing. It was definitely worth the effort.

Undergraduate research is difficult, but I loved the entire process. I got to work with real data to solve a real problem. I learned how to read a research paper, and eventually I got to write my own. I got to give presentations to both general audiences and to mathematicians and statisticians. I got to use my research to inform others about statistics. If you are thinking about doing undergraduate research, you definitely should.

Data Sharing

Last semester I took a research ethics class.  I wrote a paper on preregistration and data sharing in academic research. I decided to modify the paper into two blog posts. Here is the first part on data sharing.

Statistics is the study of uncertainty. Any research study that does not involve the entire population of a group will not be able to provide a definite conclusion with 100% certainty. Conclusions can be made with a high degree of certainty (95-99%), but false positives and false negatives are inevitable in any large statistical analysis. This means that studies can fail to make the right call, and after multiple replications the original conclusion may be overturned.

One way to improve the statistical integrity of research is to have a database of the data from non-published studies. Ideally, this database would be accessible to all academic researchers. A researcher would then be able to see the data from other, similar studies and compare it with their own. At a significance level of .05, approximately 1 in 20 studies of a true null effect will nevertheless produce a statistically significant result, a false positive. That rate applies to theoretically perfect studies that meet all of the statistical assumptions used; any modeling error increases it. With each external replication of a study, the probability of a false positive or a false negative greatly decreases. Grants from the National Science Foundation [1] and the National Institutes of Health [2] currently require that data from funded studies be made available to the public after the study is completed, but not all grants and funding sources require this disclosure. Without a universal requirement for data disclosure, it can be difficult to confirm that a study and its results are legitimate.

Advocates of open data say that data sharing saves time and reduces false positives and false negatives. A researcher can look at previously conducted studies and try to replicate the results, and the results can be recalculated by another researcher to confirm their accuracy. In a large study with lots of data, it is very easy to make a few mistakes, and those mistakes could cause the results to be misinterpreted. Open data can even help discover fraudulent studies. There are methods to estimate the probability that data is fraudulent by looking at the relative frequency of its digits, which should be roughly uniform. In 2009, Strategic Vision (a polling company) came under fire for potentially falsifying polls after a FiveThirtyEight analysis [3] discovered that the digits in its results didn’t look quite right. This isn’t an academic example, but open-access data could prevent fraudulent studies from being accepted as fact, as in the infamous vaccines-cause-autism study. Statistical analyses of randomness aren’t definitive, but they can raise questions that prompt further investigation of the data. Open data makes replication easier, and easier replication can help confirm findings more quickly, which matters because false positives and false negatives can cause real harm.
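
As a rough sketch of the kind of digit-frequency check described above (not the method FiveThirtyEight actually used), a chi-square goodness-of-fit test against a uniform distribution of trailing digits might look like this; the digits here are randomly generated stand-ins, not any real poll’s numbers.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(1)
# Pretend these are the trailing digits of a few hundred reported poll numbers.
trailing_digits = rng.integers(0, 10, size=300)

observed = np.bincount(trailing_digits, minlength=10)
stat, p_value = chisquare(observed)  # expected counts default to uniform
print(stat, p_value)  # a tiny p-value is a red flag worth investigating, not proof of fraud
```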

 

Works Cited

[1] Public Access To the Results of NSF-Funded Research. (n.d.). Retrieved April 28, 2017, from https://www.nsf.gov/news/special_reports/public_access/

[2] NIH’s Commitment to Public Accountability. (n.d.). Retrieved April 28, 2017, from https://grants.nih.gov/grants/public_accountability/

[3] Silver, N. (2014, May 07). Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud. Retrieved April 28, 2017, from https://fivethirtyeight.com/features/strategic-vision-polls-exhibit-unusual/

My New Project

Update 09/23/17: I am switching to two-proportion z-tests. I am setting the population proportion to .5 to prevent an underestimation of the variance.

This is a bit of a technical post; I will have a better explanation later.

Post-election, I have been working on a paper and thinking about what to do next. I am really interested in breaking down voter behavior in the swing states. I have collected exit poll data from the 11 swing states. I want to test whether voter behavior across the swing states was consistent with the national vote or with the swing-state average.

For phase 1 of this experiment, I will run a Chi-Square Test of Homogeneity comparing each swing state to the average of the other swing states and to the national vote. I will look at each category four different ways: Trump vs. not Trump, Clinton vs. not Clinton, Other vs. Clinton and Trump, and overall. This will probably be around 1,500 tests. I will use an initial alpha level of 0.05. I will then run two-proportion z-tests on the tests where the p-value was less than 0.05, with the alternative in the direction that matches the data.
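
A rough sketch of what one phase 1 comparison could look like, using SciPy for the chi-square test of homogeneity and statsmodels for the follow-up one-sided two-proportion z-test; the exit-poll counts below are invented placeholders.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical Trump vs. not-Trump counts in one demographic category:
# row 0 = one swing state, row 1 = pooled remaining swing states.
table = np.array([[520, 480],
                  [4600, 5400]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square p-value: {p:.4f}")

if p < 0.05:
    # Follow up with a one-sided two-proportion z-test in the direction the data suggests.
    counts = table[:, 0]       # Trump supporters in each group
    nobs = table.sum(axis=1)   # total respondents in each group
    alternative = "larger" if counts[0] / nobs[0] > counts[1] / nobs[1] else "smaller"
    z, p_z = proportions_ztest(counts, nobs, alternative=alternative)
    print(f"z-test p-value: {p_z:.4f}")
```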

For phase 2, I will collect data from 2008 and 2012 in states where the share of significant tests is larger than would be expected by chance. Then I will compare voting behavior with the Chi-Square Test of Homogeneity on 2008 vs. 2012, 2008 vs. 2016, and 2012 vs. 2016. Significant results will then be tested using a two-proportion z-test.

I am going with the Chi-Square test first for two reasons: the Chi-Square test is not subject to errors about the direction of an effect, and it is less sensitive than a two-proportion z-test. I have to be very careful in my interpretation of the results, since an analysis this large means there is a big potential for false positives and false negatives. This analysis will probably take me most of next year. I’ll give an update on my progress in December.

My Comments on the Special Elections in 2017

I thought I would provide my perspective on the special election that occurred last week in Kansas and the upcoming special House election in Georgia.

For full disclosure, I am a Republican who is against some of the President’s policies on immigration and health care.

I do not think Trump’s performance will have a major effect on the voting behavior of people with strong party ties. Republicans vote Republican most of the time, and Democrats vote Democrat most of the time. Independents and moderates are more of a wild card, and they may not vote the same way they did in 2016.

The districts in question are in no way representative of the whole country. Any result from these elections cannot be applied to the whole country or used to “predict” the entire midterm election outcome. You could maybe use the results for certain similar districts, but certainly not for the entire country. For statistical analysis to work properly, the samples need to be reasonably representative.

Special elections are all about who turns out. In the Kansas election, Democrats spent a lot of money and attention on the race, since there are only a few races this year. The money and the lack of an incumbent are probably why the race was closer than the 2016 race. The 2017 Kansas race also had about half the votes of the 2016 race, and a change in turnout that big can affect the outcome. In Georgia, I expect a race that is closer than usual for that district, but still a Republican win. I doubt that a Democrat will win a majority of the votes in the primary.

These special elections need to be interpreted in context. They are two races in House districts that haven’t been competitive in years. We should not even try to extrapolate from these races to the entire country. Favorability polls are a much better indicator of political sentiment. However, I think the favorability polls, like the general election polls, could be underestimating Trump’s support. It has been difficult to get Republicans to respond to polls, and this may affect their accuracy. After the midterms in 2018, there will be a clearer picture of support for the Republican party. Until then we can only guess.

We Don’t Live in Statsland

Statsland is a magical world that exists only in (certain) statistics textbooks. In Statsland, statistics is easy. We can invoke the Central Limit Theorem and use the normal distribution whenever n is larger than 30. In Statsland we either know or can easily determine the correct distribution. In Statsland, 95% confidence intervals have a 95% chance of containing the real value. But we don’t live in Statsland.

The point of doing statistics is that it would be too difficult (or impossible) to find the true value for a population. You aren’t likely to find the exact value, but you can get pretty close. In a statistics textbook problem, you probably have enough information to do a good job of estimating the desired value, but in applied statistics you may not have as much information. If you already know the mean and standard deviation of a population, you do not need to do much (if any) statistics. Any time you have to estimate or substitute information, your model will not perform as well as a theoretically perfect model.

Statistics never was and never will be an exact science. In most cases, your model will be wrong. There are no perfect answers. Your confidence intervals will rarely perform exactly as they theoretically should. The sample size required to invoke the Central Limit Theorem is not clear-cut. Your approach should vary with the individual problem. There is no universal formula for examining data. Applied statistics should be flexible instead of rigid. The world is not a statistics textbook problem and should never be treated as such.
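
As a small illustration of that point, here is a quick toy simulation (my own example, not from any textbook) estimating how often the standard 95% t-interval for a mean actually covers the true mean when n is 30 but the data come from a heavily skewed distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 30, 20_000
true_mean = np.exp(0.5)  # mean of a lognormal(0, 1) distribution

covered = 0
for _ in range(reps):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)

print(covered / reps)  # typically noticeably below the nominal 0.95 for this skewed case
```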

 

A Non-Technical Overview of My Research

Recently I have been writing up a draft of a research article on my general election model to submit for academic publication. But that paper is technical and requires some exposure to statistical research to understand. Here I want to explain my research without going into all the technical details.

Introduction

The President of the United States is elected every four years. The Electoral College decides the winner through the votes of electors chosen by each state. Usually the electors are chosen based on the winner of their state, and they vote for that winner. Nate Silver correctly predicted the winner of the 2008 election with Bayesian statistics, getting 49 out of 50 states correct. Silver certainly wasn’t the first person to predict the election, but he received a lot of attention for his model. Silver runs FiveThirtyEight, a website about statistics and current events. Bayesian statistics is a branch of statistics that uses information you already know (called a prior) and adjusts the model as more information comes in. My model, like Nate Silver’s, uses Bayesian statistics. We do not know the details of Silver’s model beyond the fact that it uses Bayesian statistics. To the best of my knowledge, my method is the first publicly available model that uses poll data from other states as the prior. A prediction was made for 2016, where I correctly predicted 6 states. Then the model was applied to 2008 and 2012, where my predictions of the state winners matched the predictions of FiveThirtyEight.

Methodology

I took poll data from Pollster, which provided CSV files for the 2016 and 2012 elections. For 2008 I had to create the CSVs by hand. I wrote a series of computer programs in Python (a common programming language) to analyze the data. My model used the normal distribution. My approach divided the 50 states into 5 regional categories: swing states, southern red states, midwestern red states, northern blue states, and western blue states. The poll data sources used as the priors were national polls, Texas, Nebraska, New York, and California, respectively. This approach is currently believed to be unique, but since multiple models are proprietary, it is unknown whether it has been used before. I only used polls if they were added to Pollster before the Saturday before Election Day. For the 2016 election analysis this meant November 5th, and I posted my predictions on November 5th.
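
For a rough sense of what the data handling looked like, here is a hedged sketch of filtering polls to the cutoff date and tagging each state with its prior source. The file name and column names are hypothetical, not the actual Pollster export format.

```python
import pandas as pd

# Hypothetical file and column names; the real Pollster CSVs are structured differently.
polls = pd.read_csv("state_polls_2016.csv", parse_dates=["date_added"])

# Only keep polls added before the Saturday before Election Day (Nov 5, 2016).
polls = polls[polls["date_added"] <= "2016-11-05"]

# Map each regional category to the poll source used as its prior.
prior_source = {
    "swing": "national",
    "southern_red": "Texas",
    "midwestern_red": "Nebraska",
    "northern_blue": "New York",
    "western_blue": "California",
}
polls["prior_source"] = polls["region_category"].map(prior_source)
```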

I outline more of my method here.

Results and Discussion

My model worked pretty well compared to other models. Below is a table of other models and their success rates at predicting the winning candidate in all 50 states plus Washington D.C.

| Race | Real Clear Politics | Princeton Election Consortium | Five Thirty Eight (Polls Plus) | PredictWise (Fundamental) | Sabato’s Crystal Ball | My Model |
|---|---|---|---|---|---|---|
| 2008 Winner Accuracy | 0.96078 | 0.98039 | 0.98039 | N/A | 1 | 0.98039 |
| 2012 Winner Accuracy | 0.98039 | 0.98039 | 1 | 0.98039 | 0.96078 | 1 |
| 2016 Winner Accuracy | 0.92157 | 0.90196 | 0.90196 | 0.90196 | 0.90196 | 0.88235 |
| Average Accuracy | 0.95425 | 0.95425 | 0.96078 | 0.94118 | 0.95425 | 0.95425 |

As you can see, all the models do a similar job of picking the winner in each state, which is what determines the Electoral College. There are other ways to compare accuracy, but I don’t want to discuss them here since they get a little technical. No model was right for every state in every election. It would probably be impossible to create a model that consistently predicts the winner in all states, because of the variability of political opinions. Election prediction is not an exact science. But there is the potential to apply polling analysis to estimate public opinion on certain issues and politicians. Right now the errors in polls are too large to determine public opinion on close issues, but further research could find ways to reduce error in polling analysis.