`election_text_analysis.read_data`

Tools to read data from ANES election survey results.

There are two types of files: (1) a single timeseries file that contains demographics and other variables over time, and (2) one file per survey year that contains the open-ended responses (the format differs by year).

Although the format of the open-ends file differs by year, it always contains the open-ends in some structured format, as well as a “Case ID” key that indicates which participant the row maps to.

For the timeseries data, a utility function is provided to read the data as a Pandas DataFrame.

For each year of open-ended data, a utility function is provided to read the specific format from that year and map column labels onto standard labels for each open-ended question. This function also converts the “Case ID” (or similar) column into a key that can be mapped to the timeseries data.

Finally, a utility function is provided to add the open-ended data from specified years/files to the overall timeseries file.
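The combining step described above can be sketched with pandas on toy data. This is only an illustration under stated assumptions: the `"<year>_<case id>"` key format, the `year_case_id` index name, and the `age` column are hypothetical and are not taken from the module's source.

```python
import pandas as pd

# Toy timeseries frame indexed by a hypothetical "<year>_<case id>" key.
timeseries = pd.DataFrame(
    {"age": [34, 51, 27]},
    index=pd.Index(["2012_1", "2016_1", "2020_1"], name="year_case_id"),
)

# Toy per-year open-ended frames, keyed the same way (assumed format).
open_ends_2016 = pd.DataFrame(
    {"Like About Democratic Candidate": ["experience"]},
    index=pd.Index(["2016_1"], name="year_case_id"),
)
open_ends_2020 = pd.DataFrame(
    {"Like About Democratic Candidate": ["policies"]},
    index=pd.Index(["2020_1"], name="year_case_id"),
)

# Stack the per-year frames, then left-join so every timeseries row is kept.
# Rows with no open-ended data for their year end up with missing values.
open_ends = pd.concat([open_ends_2016, open_ends_2020])
combined = timeseries.join(open_ends, how="left")
```

A left join is used here so that participants from years without open-ended files still appear in the result.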

Module Contents

Functions

`read_all_data`([data_dir, timeseries_filename, ...])	Read and combine all necessary data from a given directory. This reads the timeseries data file,
`read_timeseries_data`([data_dir, filename])	Reads the timeseries data file from the given data_dir. This loads the CSV as a
`read_2020_open_ends`([data_dir, filename])	Reads the 2020 open-ended data file. In this file, each open-ended response
`read_2016_open_ends`([data_dir, filename, ...])	Reads the 2016 open-ended data file. In this file, each open-ended response
`read_2012_open_ends`([data_dir, filename])	Reads the 2012 open-ended data file. In this file, each open-ended response
`read_2008_open_ends`([data_dir, filename])	Reads the 2008 open-ended data file. In this file, each open-ended response

election_text_analysis.read_data.read_all_data(data_dir='downloaded_data', timeseries_filename='anes_timeseries_cdf_csv_20220916.csv', open_ends_2020_filename='2020.xlsx', open_ends_2016_filename='2016.xlsx', overall_2016_filename='anes_timeseries_2016_rawdata.txt', open_ends_2012_filename='2012.xlsx', open_ends_2008_filename='2008.xls')

Read and combine all necessary data from a given directory. This reads the timeseries data file, reads the open-ended data file for each year, and combines them before returning a DataFrame.

This function takes a large number of filename parameters. The defaults are all set to the download names used by download_data.py, so they only need to be overridden when using different filenames.

Parameters:

data_dir (str (optional, default="downloaded_data")) – The directory to read the downloaded data files from (defaults to downloaded_data)
timeseries_filename (str (optional, default="anes_timeseries_cdf_csv_20220916.csv")) – An optional filename to read timeseries data from
open_ends_2020_filename (str (optional, default="2020.xlsx")) – An optional filename to read 2020 open-ended data from
open_ends_2016_filename (str (optional, default="2016.xlsx")) – An optional filename to read 2016 open-ended data from
overall_2016_filename (str (optional, default="anes_timeseries_2016_rawdata.txt")) – An optional filename to read 2016 overall data, in order to create a mapping to the keys used in the open-ended data
open_ends_2012_filename (str (optional, default="2012.xlsx")) – An optional filename to read 2012 open-ended data from
open_ends_2008_filename (str (optional, default="2008.xls")) – An optional filename to read 2008 open-ended data from

Returns:

A combined DataFrame of all timeseries data, with open-ended data appended in columns (wherever present). This DataFrame is indexed by a unique identifier that combines year and case ID (a participant identifier). This DataFrame has each column from the original timeseries dataset, as well as the following open-ended columns: ‘Like About Democratic Candidate’, ‘Dislike About Democratic Candidate’, ‘Like About Republican Candidate’, ‘Dislike About Republican Candidate’, ‘Like About Democratic Party’, ‘Dislike About Democratic Party’, ‘Like About Republican Party’, ‘Dislike About Republican Party’

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_all_data
>>> ts_df = read_all_data()
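Because the open-ended columns are only filled in where data is present, a common next step is to keep just the rows that have at least one response. The sketch below uses a toy stand-in for the combined DataFrame; the index values and the `age` column are hypothetical, and only the eight column labels come from the description above.

```python
import pandas as pd

# The eight standard open-ended column labels listed under "Returns".
OPEN_ENDED_COLUMNS = [
    "Like About Democratic Candidate",
    "Dislike About Democratic Candidate",
    "Like About Republican Candidate",
    "Dislike About Republican Candidate",
    "Like About Democratic Party",
    "Dislike About Democratic Party",
    "Like About Republican Party",
    "Dislike About Republican Party",
]

# Toy stand-in for the combined frame (hypothetical index values).
combined = pd.DataFrame(
    index=pd.Index(["2012_1", "2016_1", "2020_1"], name="year_case_id"),
    columns=OPEN_ENDED_COLUMNS,
    dtype=object,
)
combined["age"] = [34, 51, 27]  # hypothetical timeseries column
combined.loc["2016_1", "Like About Democratic Party"] = "healthcare"

# Keep only participants with at least one open-ended response.
has_open_end = combined[OPEN_ENDED_COLUMNS].notna().any(axis=1)
with_text = combined[has_open_end]
```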

election_text_analysis.read_data.read_timeseries_data(data_dir='downloaded_data', filename='anes_timeseries_cdf_csv_20220916.csv')

Reads the timeseries data file from the given data_dir. This loads the CSV as a Pandas DataFrame, and creates an index from year and case ID (participant ID).

Parameters:

data_dir (str (optional, default="downloaded_data")) – The directory to read the timeseries file from (defaults to downloaded_data)
filename (str (optional, default="anes_timeseries_cdf_csv_20220916.csv")) – The CSV file to read data from

Returns:

A DataFrame containing the timeseries data, with an index created by combining year and Case ID.

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_timeseries_data
>>> ts = read_timeseries_data()
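The index built from year and case ID can be illustrated as follows. This is a minimal sketch, not the module's implementation: the helper `add_year_case_index`, the column names `year` and `case_id`, and the `"<year>_<case id>"` format are all assumptions made for the example.

```python
import pandas as pd


def add_year_case_index(df, year_col="year", case_col="case_id"):
    """Return a copy of df indexed by a combined '<year>_<case id>' key.

    Hypothetical helper: column names and key format are assumptions.
    """
    out = df.copy()
    key = out[year_col].astype(int).astype(str) + "_" + out[case_col].astype(int).astype(str)
    out.index = pd.Index(key.to_numpy(), name="year_case_id")
    return out


# Toy rows: case IDs repeat across years, so neither column is unique alone.
raw = pd.DataFrame(
    {"year": [2016, 2016, 2020], "case_id": [1, 2, 1], "age": [34, 51, 27]}
)
indexed = add_year_case_index(raw)
```

Combining the two columns matters because a case ID is only unique within a single survey year.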

election_text_analysis.read_data.read_2020_open_ends(data_dir='downloaded_data', filename='2020.xlsx')

Reads the 2020 open-ended data file. In this file, each open-ended response is stored in a different tab. This function calls the read_open_ends_by_tab function with the correct set of tabs for 2020 data.

Parameters:

data_dir (str (optional, default="downloaded_data")) – The directory to read the open-ended data file from (defaults to downloaded_data)
filename (str (optional, default="2020.xlsx")) – The Excel file to read data from

Returns:

A DataFrame containing the open-ended data, with an index created by combining year and Case ID.

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_2020_open_ends
>>> df = read_2020_open_ends()
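The one-tab-per-question layout can be turned into one column per question roughly as sketched below. The signature of read_open_ends_by_tab is not documented here, so this uses a separate hypothetical helper, `combine_tabs`. The dict of DataFrames stands in for what `pandas.read_excel(path, sheet_name=None)` returns, and the tab names and the `case_id` and `response` column names are assumptions.

```python
import pandas as pd


def combine_tabs(sheets, tab_labels, year, case_col="case_id", text_col="response"):
    """Build one column per tab, keyed by '<year>_<case id>' (hypothetical helper)."""
    columns = []
    for tab_name, label in tab_labels.items():
        sheet = sheets[tab_name]
        key = str(year) + "_" + sheet[case_col].astype(str)
        columns.append(
            pd.Series(sheet[text_col].to_numpy(), index=key.to_numpy(), name=label)
        )
    # Outer alignment on the key: a participant missing from a tab gets NaN there.
    combined = pd.concat(columns, axis=1)
    combined.index.name = "year_case_id"
    return combined


# Toy stand-in for a workbook with one tab per open-ended question.
sheets = {
    "Tab A": pd.DataFrame({"case_id": [1, 2], "response": ["honest", "experienced"]}),
    "Tab B": pd.DataFrame({"case_id": [2, 3], "response": ["too partisan", "taxes"]}),
}
tab_labels = {
    "Tab A": "Like About Democratic Candidate",
    "Tab B": "Dislike About Democratic Candidate",
}
open_ends = combine_tabs(sheets, tab_labels, year=2020)
```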

election_text_analysis.read_data.read_2016_open_ends(data_dir='downloaded_data', filename='2016.xlsx', overall_2016_filename='anes_timeseries_2016_rawdata.txt')

Reads the 2016 open-ended data file. In this file, each open-ended response is stored in a different tab. This function calls the read_open_ends_by_tab function with the correct set of tabs for 2016 data. Because the 2016 open-ended data is keyed by a different key (V160001_orig, the original ID) rather than the usual year + case ID key, we also remap the index from V160001_orig to V160001. In order to do this, we read V160001_orig and V160001 from the original (non-open-ended) 2016 data file.

Parameters:

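data_dir (str (optional, default="downloaded_data")) – The directory to read the data files from (defaults to downloaded_data)
filename (str (optional, default="2016.xlsx")) – The Excel file to read data from
overall_2016_filename (str (optional, default="anes_timeseries_2016_rawdata.txt")) – The overall 2016 data file, read in order to map V160001_orig to V160001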

Returns:

A DataFrame containing the open-ended data, with an index created by combining year and Case ID.

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_2016_open_ends
>>> df = read_2016_open_ends()
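The remapping from V160001_orig to V160001 can be sketched on toy data as follows. Only the two variable names come from the description above; the ID values and the `"2016_<case id>"` key format are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the overall 2016 file, reduced to the two ID columns.
overall = pd.DataFrame(
    {"V160001_orig": [300001, 300002, 300003], "V160001": [1, 2, 3]}
)

# Toy open-ended data keyed by the original ID (values are made up).
open_ends = pd.DataFrame(
    {"Like About Republican Party": ["low taxes", "strong defense"]},
    index=pd.Index([300003, 300001], name="V160001_orig"),
)

# Look up each original ID's case ID, then rebuild the year + case ID key.
id_map = overall.set_index("V160001_orig")["V160001"]
case_ids = open_ends.index.map(id_map)
remapped = open_ends.copy()
remapped.index = pd.Index(
    ["2016_" + str(case_id) for case_id in case_ids], name="year_case_id"
)
```

In this sketch an original ID that is absent from the overall file would map to a missing value, so a real implementation would likely want to check for unmapped IDs before building the key.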

election_text_analysis.read_data.read_2012_open_ends(data_dir='downloaded_data', filename='2012.xlsx')

Reads the 2012 open-ended data file. In this file, each open-ended response is stored in a different column in a single tab.

Parameters:

data_dir (str (optional, default="downloaded_data")) – The directory to read the open-ended data file from (defaults to downloaded_data)
filename (str (optional, default="2012.xlsx")) – The Excel file to read data from

Returns:

A DataFrame containing the open-ended data, with an index created by combining year and Case ID.

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_2012_open_ends
>>> df = read_2012_open_ends()
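For the single-tab layout, mapping year-specific column names onto the standard labels amounts to a rename followed by building the key. The raw column names below (`caseid`, `q_like_dem_cand`, `q_dislike_dem_cand`) are hypothetical placeholders rather than the actual headers in the 2012 file, and the `"2012_<case id>"` key format is likewise an assumption.

```python
import pandas as pd

# Toy stand-in for the single sheet: one column per open-ended question.
raw = pd.DataFrame(
    {
        "caseid": [1, 2],
        "q_like_dem_cand": ["honest", "experienced"],
        "q_dislike_dem_cand": ["too cautious", "taxes"],
    }
)

# Hypothetical mapping from year-specific headers to the standard labels.
column_labels = {
    "q_like_dem_cand": "Like About Democratic Candidate",
    "q_dislike_dem_cand": "Dislike About Democratic Candidate",
}

open_ends = raw.rename(columns=column_labels)
# Remove the case ID column and use it to build the year + case ID key.
case_ids = open_ends.pop("caseid")
open_ends.index = pd.Index(
    ["2012_" + str(case_id) for case_id in case_ids], name="year_case_id"
)
```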

election_text_analysis.read_data.read_2008_open_ends(data_dir='downloaded_data', filename='2008.xls')

Reads the 2008 open-ended data file. In this file, each open-ended response is stored in a different column in a single tab.

Parameters:

data_dir (str (optional, default="downloaded_data")) – The directory to read the open-ended data file from (defaults to downloaded_data)
filename (str (optional, default="2008.xls")) – The Excel file to read data from

Returns:

A DataFrame containing the open-ended data, with an index created by combining year and Case ID.

Return type:

pandas.DataFrame

Examples

>>> from election_text_analysis.read_data import read_2008_open_ends
>>> df = read_2008_open_ends()
