election_text_analysis.analyze

Module Contents

Functions

get_unique_words(text[, stopwords])

Given a piece of text, returns all unique words in the text.

get_word_frequencies(series)

Given a Pandas Series, calculates the total word frequency across all items in the series.

compare_word_frequencies(group_1_series, group_2_series)

Given two Pandas Series objects, calculate the frequency of each word in each series, as well as the difference in frequency between the two series.

summarize_word_frequency_differences(group_1_series, ...)

Given two Pandas Series objects, summarize the difference in frequency between the two series.

Attributes

DEFAULT_STOPWORDS

election_text_analysis.analyze.DEFAULT_STOPWORDS = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',...
election_text_analysis.analyze.get_unique_words(text, stopwords=DEFAULT_STOPWORDS)

Given a piece of text, returns all unique words in the text. This also converts the text to lowercase, removes punctuation, joins negations (by replacing “not word” with “not_word”), removes stopwords, and removes words of 2 or fewer characters.

Parameters:
  • text (str) – The text string from which to extract all unique words

  • stopwords (list (optional)) – A list of stopwords (or "filler words") to exclude from the final list of unique words. Defaults to ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'don', 'must', 'dont']

Returns:

A list of strings, each of which is a unique word in our original input text

Return type:

list(str)

Examples

>>> from election_text_analysis.analyze import get_unique_words
>>> get_unique_words("This is a sentence, it is a long sentence...", stopwords=[])
['long', 'sentence', 'this']
>>> get_unique_words("This is a sentence, it is a long sentence...")
['long', 'sentence']
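
The cleaning steps described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name unique_words_sketch, the shortened SKETCH_STOPWORDS list, the order of the steps, and the alphabetical sorting of the result are all assumptions made for illustration.

```python
import re
import string

# Hypothetical, shortened stopword list used only in this sketch.
SKETCH_STOPWORDS = ["this", "is", "a", "it", "not"]


def unique_words_sketch(text, stopwords=SKETCH_STOPWORDS):
    """Illustrative approximation of the described cleaning steps."""
    # 1. Convert to lowercase.
    text = text.lower()
    # 2. Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Join negations: "not word" becomes "not_word".
    text = re.sub(r"\bnot\s+(\w+)", r"not_\1", text)
    # 4. Drop stopwords and words of 2 or fewer characters.
    excluded = set(stopwords)
    kept = {w for w in text.split() if w not in excluded and len(w) > 2}
    # Sorting is an assumption, made so the output is deterministic.
    return sorted(kept)


print(unique_words_sketch("This is a sentence, it is a long sentence...", stopwords=[]))
print(unique_words_sketch("This is not good"))
```

Joining negations before removing stopwords matters in this sketch: "not" is itself a stopword, so removing stopwords first would discard the negation before it could be attached to the following word.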
election_text_analysis.analyze.get_word_frequencies(series)

Given a Pandas Series, calculates the total word frequency across all items in the series. This finds the unique words in each item of the series, and then computes the fraction of items (a value between 0 and 1) that contain each unique word.

Parameters:

series (Pandas.Series) – The series from which we will calculate word frequencies (assumed to contain open-ended comments)

Returns:

A dict mapping each word to its frequency, expressed as the fraction of items that contain the word.

Return type:

dict

Examples

>>> import pandas as pd
>>> from election_text_analysis import analyze
>>> sentences = pd.Series(["This is a sentence, it is a long sentence...", "This is another sentence", "A third sentence"])
>>> analyze.get_word_frequencies(sentences)
{'long': 0.3333333333333333,
 'sentence': 1.0,
 'another': 0.3333333333333333,
 'third': 0.3333333333333333}
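
The frequency here is a per-item (document) frequency, not a raw word count: a word that appears twice in one comment still counts once for that comment. A minimal sketch of that calculation, using a plain list of strings in place of a Pandas Series and a simplified, hypothetical tokenizer (simple_tokens) in place of get_unique_words:

```python
import re
from collections import Counter


def simple_tokens(text):
    """Hypothetical stand-in tokenizer: lowercase words longer than 2 characters."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w != "this"}


def word_frequencies_sketch(items, tokenize=simple_tokens):
    """Fraction of items containing each word (illustrative only)."""
    items = list(items)
    if not items:
        return {}
    counts = Counter()
    for text in items:
        # A set ensures each word is counted at most once per item.
        counts.update(set(tokenize(text)))
    return {word: count / len(items) for word, count in counts.items()}


sentences = [
    "This is a sentence, it is a long sentence...",
    "This is another sentence",
    "A third sentence",
]
print(word_frequencies_sketch(sentences))
```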
election_text_analysis.analyze.compare_word_frequencies(group_1_series, group_2_series)

Given two Pandas Series objects, calculate the frequency of each word in each series, as well as the difference in frequency between the two series.

Parameters:
  • group_1_series (Pandas.Series) – The first series from which we will calculate word frequency

  • group_2_series (Pandas.Series) – The second series from which we will calculate word frequency

Returns:

A list of dicts of frequency information, where each dict contains: word, group_1 (frequency in group_1_series), group_2 (frequency in group_2_series), and delta (group_1 - group_2)

Return type:

list

Examples

>>> import pandas as pd
>>> from election_text_analysis import analyze
>>> group_1_series = pd.Series(["This is a sentence, it is a long sentence...", "This is another sentence", "A third sentence"])
>>> group_2_series = pd.Series(["These are sentences", "This is also a sentence", "All of these are sentences happily"])
>>> analyze.compare_word_frequencies(group_1_series, group_2_series)
[{'word': 'sentences',
  'group_1': 0.0,
  'group_2': 0.6666666666666666,
  'delta': -0.6666666666666666},
 {'word': 'happily',
  'group_1': 0.0,
  'group_2': 0.3333333333333333,
  'delta': -0.3333333333333333},
 {'word': 'also',
  'group_1': 0.0,
  'group_2': 0.3333333333333333,
  'delta': -0.3333333333333333},
 {'word': 'third',
  'group_1': 0.3333333333333333,
  'group_2': 0.0,
  'delta': 0.3333333333333333},
 {'word': 'long',
  'group_1': 0.3333333333333333,
  'group_2': 0.0,
  'delta': 0.3333333333333333},
 {'word': 'another',
  'group_1': 0.3333333333333333,
  'group_2': 0.0,
  'delta': 0.3333333333333333},
 {'word': 'sentence',
  'group_1': 1.0,
  'group_2': 0.3333333333333333,
  'delta': 0.6666666666666667}]
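
The comparison can be pictured as a merge of two word-to-frequency dicts, where a word missing from one group is treated as having frequency 0.0. The sketch below assumes that behavior and sorts by ascending delta, as the example output above suggests; the function name compare_frequencies_sketch and the alphabetical tie-breaking are illustrative choices, not the library's.

```python
def compare_frequencies_sketch(freqs_1, freqs_2):
    """Merge two word -> frequency dicts into a list of comparison rows."""
    rows = []
    for word in set(freqs_1) | set(freqs_2):
        group_1 = freqs_1.get(word, 0.0)  # absent words count as 0.0
        group_2 = freqs_2.get(word, 0.0)
        rows.append({
            "word": word,
            "group_1": group_1,
            "group_2": group_2,
            "delta": group_1 - group_2,
        })
    # Ascending delta; ties are broken alphabetically (an arbitrary choice).
    return sorted(rows, key=lambda row: (row["delta"], row["word"]))


freqs_1 = {"sentence": 1.0, "long": 1 / 3}
freqs_2 = {"sentence": 1 / 3, "sentences": 2 / 3}
for row in compare_frequencies_sketch(freqs_1, freqs_2):
    print(row)
```

A negative delta marks a word that is more common in the second group, and a positive delta marks one that is more common in the first.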
election_text_analysis.analyze.summarize_word_frequency_differences(group_1_series, group_2_series, group_1_label='Group 1', group_2_label='Group 2', num_words=10)

Given two Pandas Series objects, summarize the difference in frequency between the two series. This first calculates the frequency of each word in each series, and then prints a summary of the words that are more frequent in group_1_series and in group_2_series, respectively. In the printed summary, frequencies are shown as percentages.

Parameters:
  • group_1_series (Pandas.Series) – The first series from which we will calculate word frequency

  • group_2_series (Pandas.Series) – The second series from which we will calculate word frequency

  • group_1_label (str (optional)) – A label to use when printing the output summary for group_1_series

  • group_2_label (str (optional)) – A label to use when printing the output summary for group_2_series

  • num_words (int (optional)) – The number of words to print out in each summary (defaults to 10)

Returns:

A list of dicts of frequency information, where each dict contains: word, group_1 (frequency in group_1_series), group_2 (frequency in group_2_series), and delta (group_1 - group_2)

Return type:

list

Examples

>>> import pandas as pd
>>> from election_text_analysis import analyze
>>> group_1_series = pd.Series(["This is a sentence, it is a long sentence...", "This is another sentence", "A third sentence"])
>>> group_2_series = pd.Series(["These are sentences", "This is also a sentence", "All of these are sentences happily"])
>>> analyze.summarize_word_frequency_differences(group_1_series, group_2_series)
These words occurred more often in Group 1:
       word  Group 1 freq  Group 2 freq
0  sentence    100.000000     33.333333
1     third     33.333333      0.000000
2      long     33.333333      0.000000
3   another     33.333333      0.000000
These words occurred more often in Group 2:
        word  Group 1 freq  Group 2 freq
0  sentences           0.0     66.666667
1    happily           0.0     33.333333
2       also           0.0     33.333333
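
One way to picture the summary step is as a split of the comparison rows by the sign of delta, keeping the num_words largest differences on each side and printing frequencies as percentages. The sketch below assumes rows shaped like the output of compare_word_frequencies; the function name summarize_sketch, the plain-text layout, and the returned pair of lists are illustrative and do not reproduce the library's formatting or return value.

```python
def summarize_sketch(rows, group_1_label="Group 1", group_2_label="Group 2", num_words=10):
    """Print the words with the largest frequency gap in each direction."""
    # Largest positive deltas first: words more frequent in group 1.
    more_in_1 = sorted((r for r in rows if r["delta"] > 0), key=lambda r: -r["delta"])[:num_words]
    # Most negative deltas first: words more frequent in group 2.
    more_in_2 = sorted((r for r in rows if r["delta"] < 0), key=lambda r: r["delta"])[:num_words]
    for label, subset in ((group_1_label, more_in_1), (group_2_label, more_in_2)):
        print(f"These words occurred more often in {label}:")
        for r in subset:
            # Frequencies are converted from fractions to percentages.
            print(f"  {r['word']:<12}{100 * r['group_1']:>10.2f}{100 * r['group_2']:>10.2f}")
    return more_in_1, more_in_2


rows = [
    {"word": "sentences", "group_1": 0.0, "group_2": 2 / 3, "delta": -2 / 3},
    {"word": "long", "group_1": 1 / 3, "group_2": 0.0, "delta": 1 / 3},
    {"word": "sentence", "group_1": 1.0, "group_2": 1 / 3, "delta": 2 / 3},
]
summarize_sketch(rows, num_words=2)
```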