election_text_analysis

Note to Dr. Brambor and the teaching assistants: thank you for everything in this class! Based on Dr. Brambor’s feedback, I reduced the scope of my project relative to the proposal: I removed the lemmatization and n-gram steps and focused on data from 2008-2020. I also concentrated on the data steps we learned in this class rather than on the analysis. I was able to use the time series file directly from ANES, which has almost all questions normalized but doesn’t contain the open-ended responses. For this project, I joined the open-ended responses from 2008, 2012, 2016, and 2020 with the time series data, then built functions to calculate the frequency of words across a set of responses.

Package on PyPI

https://pypi.org/project/election_text_analysis/

Docs

https://election-text-analysis.readthedocs.io/en/latest/

Overview

Functions to load and analyze open-ended data from the ANES election perception surveys, which are conducted every four years.

ANES (American National Election Studies) conducts a large-scale survey every four years, coinciding with US Presidential elections. The survey focuses on voter preferences and election-related behavior, as well as questions on public opinion and attitudes. These studies are conducted as pre-election and post-election interviews.

Some of the most interesting questions asked in the ANES survey are open-ended text responses. These include questions asking voters what they like (and dislike) about each party’s candidate. For some years of data, these questions were asked for other positions as well (e.g., House and Senate candidates).

These open-ended responses offer the potential for fascinating analysis of how voter preferences and reasoning (in their own words) have changed over time. This module aims to make it easier to analyze those open-ended responses.

Installation

$ pip install election_text_analysis

Usage

>>> from election_text_analysis import download_data, read_data, analyze

>>> # Downloads all data necessary to analyze open-ends from 2008-2020
>>> download_data.download_all()

>>> # Loads a dataframe of data over time, with open-ends from 2008-2020
>>> df = read_data.read_all_data()
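
The dataframe returned above combines the time series file with the open-ended responses for 2008-2020. The package's internals are not reproduced here; the following is only a minimal sketch of that kind of join on toy data. The `case_id` column and both toy frames are hypothetical stand-ins, not the real ANES column names or files.

```python
import pandas as pd

# Toy stand-in for the time series file: one row per respondent per year.
# 'case_id' is a hypothetical respondent identifier used only for illustration.
timeseries = pd.DataFrame({
    "VCF0004": [2016, 2016, 2020, 2020],
    "case_id": [1, 2, 1, 2],
})

# Toy stand-in for a single year's open-ended response file.
open_ends_2020 = pd.DataFrame({
    "VCF0004": [2020, 2020],
    "case_id": [1, 2],
    "Dislike About Democratic Candidate": ["too old", "taxes"],
})

# A left join keeps every time series row; rows without a matching
# open-ended response get NaN in the text column.
joined = timeseries.merge(
    open_ends_2020,
    on=["VCF0004", "case_id"],
    how="left",
    validate="one_to_one",  # fail loudly if a key is duplicated on either side
)
print(joined)
```

A left join is used here so that respondents from years without open-ended data stay in the frame, which matches the idea of one dataframe covering all years.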

Dataset

The full codebook for all columns can be found at https://electionstudies.org/anes_timeseries_cdf_codebook_var_20220916/

The “Year” variable is stored in column ‘VCF0004’. For example, here is a count of rows of data for every year since 1984 (the first year for which we have open-ended data):

>>> greater_than_1984 = df[df['VCF0004'] >= 1984]
>>> greater_than_1984['VCF0004'].value_counts().sort_index()
VCF0004
1984    2257
1986    2176
1988    2040
1990    1980
1992    2485
1994    1795
1996    1714
1998    1281
2000    1807
2002    1511
2004    1212
2008    2322
2012    5914
2016    4270
2020    8280
Name: count, dtype: int64

There are 8 open-ended columns:

open_ended_columns = [
    'Like About Democratic Candidate',
    'Dislike About Democratic Candidate',
    'Like About Republican Candidate',
    'Dislike About Republican Candidate',
    'Like About Democratic Party',
    'Dislike About Democratic Party',
    'Like About Republican Party',
    'Dislike About Republican Party',
]
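
Before comparing groups, it can help to check how many usable responses each open-ended column has in each year. The helper below is a sketch and not part of the package: `count_responses_by_year` is a hypothetical name, and the toy frame stands in for the real dataframe.

```python
import pandas as pd

def count_responses_by_year(df, columns, year_column="VCF0004"):
    """Count non-missing, non-blank responses per year for each column."""
    # True where a cell holds actual text (not NaN/None, not only whitespace).
    answered = df[columns].apply(
        lambda s: s.notna() & (s.astype(str).str.strip() != "")
    )
    # Summing booleans per year gives the number of usable responses.
    return answered.groupby(df[year_column]).sum()

# Toy data for illustration only.
toy = pd.DataFrame({
    "VCF0004": [2016, 2016, 2020],
    "Like About Democratic Party": ["jobs", None, "   "],
})
counts = count_responses_by_year(toy, ["Like About Democratic Party"])
print(counts)
```

With the real data, the same call could be made as `count_responses_by_year(df, open_ended_columns)`.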

What people dislike about the Democratic Candidate vs the Republican Candidate in 2020:

dem_dislike_2020   = df[df['VCF0004'] == 2020]['Dislike About Democratic Candidate']
repub_dislike_2020 = df[df['VCF0004'] == 2020]['Dislike About Republican Candidate']
analyze.summarize_word_frequency_differences(dem_dislike_2020, repub_dislike_2020, group_1_label="Dem 2020 dislikes", group_2_label="Repub 2020 dislikes")

These words occurred more often in Dem 2020 dislikes:
        word  Dem 2020 dislikes freq  Repub 2020 dislikes freq
      age                7.729841                  0.219912
    biden                6.240370                  0.619752
      old                5.906523                  0.479808
    years                7.498716                  2.219112
     left                5.392912                  0.239904
 abortion                5.110426                  0.359856
    party                5.264510                  0.799680
   mental                4.571135                  0.419832
    taxes                4.725218                  0.839664
socialist                3.775039                  0.019992


These words occurred more often in Repub 2020 dislikes:
         word  Dem 2020 dislikes freq  Repub 2020 dislikes freq
    racist                1.129944                 11.735306
    people                5.136107                 13.114754
   country                7.524397                 13.954418
      lies                0.873138                  7.057177
      liar                1.206985                  7.277089
      lack                1.438110                  7.297081
everything                2.670776                  8.256697
     covid                0.667694                  6.117553
  pandemic                0.308166                  5.317873
      self                0.256805                  3.118752
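
The two tables above appear to be ordered by the gap between the two frequency columns. The sketch below shows one way such a comparison could be computed. It is not the implementation of `analyze.summarize_word_frequency_differences`: the function names are hypothetical, the tokenizer is a simple regular expression, and it assumes that a frequency is the percentage of non-empty responses containing a word, which may differ from the unit the package reports.

```python
import re
from collections import Counter

def word_frequencies(responses):
    """Percent of non-empty responses containing each word at least once.

    Assumption: this definition of "frequency" is illustrative only.
    """
    counts = Counter()
    n_responses = 0
    for text in responses:
        # Skip missing values (None/NaN) and blank strings.
        if not isinstance(text, str) or not text.strip():
            continue
        n_responses += 1
        # A set counts each word once per response.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    if n_responses == 0:
        return {}
    return {word: 100.0 * c / n_responses for word, c in counts.items()}

def top_differences(group_1, group_2, k=10):
    """Return the k words whose frequency in group_1 most exceeds group_2."""
    freq_1 = word_frequencies(group_1)
    freq_2 = word_frequencies(group_2)
    gaps = {
        word: freq_1.get(word, 0.0) - freq_2.get(word, 0.0)
        for word in set(freq_1) | set(freq_2)
    }
    # Largest gap first; ties are broken alphabetically for stable output.
    ranked = sorted(gaps, key=lambda word: (-gaps[word], word))
    return [(w, freq_1.get(w, 0.0), freq_2.get(w, 0.0)) for w in ranked[:k]]

# Toy responses for illustration only.
group_a = ["old", "old", "old age", "taxes"]
group_b = ["liar", "liar", "old", "covid"]
for row in top_differences(group_a, group_b, k=3):
    print(row)
```

Swapping the two arguments gives the words that are more common in the second group, mirroring the pair of tables the package prints.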

What people dislike about the Democratic Candidate in 2020 vs 2016:

dem_dislike_2016   = df[df['VCF0004'] == 2016]['Dislike About Democratic Candidate']
analyze.summarize_word_frequency_differences(dem_dislike_2020, dem_dislike_2016, group_1_label="2020 Dem dislikes", group_2_label="2016 Dem dislikes")

These words occurred more often in 2020 Dem dislikes:
        word  2020 Dem dislikes freq  2016 Dem dislikes freq
      age                7.729841                0.269750
    years                7.498716                1.078998
    biden                6.240370                0.000000
  country                7.524397                1.425819
president                7.832563                1.888247
      old                5.906523                0.192678
     left                5.392912                0.346821
    party                5.264510                0.385356
   mental                4.571135                0.000000
    taxes                4.725218                0.385356


These words occurred more often in 2016 Dem dislikes:
            word  2020 Dem dislikes freq  2016 Dem dislikes freq
         liar                1.206985               10.481696
         lies                0.873138                6.589595
       emails                0.051361                5.086705
        trust                1.386749                6.319846
        email                0.000000                4.393064
    dishonest                0.487930                4.662813
   dishonesty                0.128403                3.159923
untrustworthy                0.205444                2.967245
      scandal                0.051361                2.543353
      clinton                0.359527                2.658960

What people dislike about the Republican Candidate in 2020 vs 2016:

repub_dislike_2016 = df[df['VCF0004'] == 2016]['Dislike About Republican Candidate']
analyze.summarize_word_frequency_differences(repub_dislike_2020, repub_dislike_2016, group_1_label="2020 Repub dislikes", group_2_label="2016 Repub dislikes")

These words occurred more often in 2020 Repub dislikes:
        word  2020 Repub dislikes freq  2016 Repub dislikes freq
  country                 13.954418                  2.522460
   people                 13.114754                  5.563234
    covid                  6.117553                  0.000000
president                  9.496202                  3.420871
    trump                  7.117153                  1.105736
     lies                  7.057177                  1.243953
 pandemic                  5.317873                  0.000000
     liar                  7.277089                  2.073255
 american                  4.258297                  0.621977
   office                  4.258297                  0.691085


These words occurred more often in 2016 Repub dislikes:
           word  2020 Repub dislikes freq  2016 Repub dislikes freq
  experience                  0.599760                  5.977885
       mouth                  1.939224                  3.489979
       views                  1.379448                  2.660677
       think                  4.938025                  6.081548
   political                  1.559376                  2.522460
  temperment                  0.039984                  0.967519
inexperience                  0.059976                  0.932965
       women                  3.618553                  4.457498
       bigot                  0.899640                  1.727713
        know                  1.459416                  2.246026

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

election_text_analysis was created by Nikhila Anand. It is licensed under the terms of the MIT license.

Credits

election_text_analysis was created with cookiecutter and the py-pkgs-cookiecutter template.