NTTC (Name That Twitter Community!): Process and analyze community-detected data

by Chris Lindgren chris.a.lindgren@gmail.com Distributed under the BSD 3-clause license. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.

Overview

A set of functions that process and create topic models from a sample of tweets drawn from community-detected Twitter networks. It also analyzes whether potential community hubs persist across periods (by top mentioned users, top RTers, or both).

It assumes you seek answers to the following questions:

  1. Which communities persist, and which are ephemeral, across periods in the corpora, and when?
  2. What can these communities be named, based on their top RTs and users, top mentioned users, and generated topic models?
  3. Of these communities, what are their topics over time?
     • TODO: Build a corpus of tweets per community group across periods and then build LDA models for each set.

Accordingly, it assumes you want to investigate communities across already defined periods, and the tweets from each detected community within those periods, with the goal of naming each community and examining its topics over time in the corpus.

It functions only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).

Warning: nttc performs no custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.

System requirements

  • arrow
  • tsm
  • nltk
  • networkx
  • matplotlib
  • pandas
  • numpy
  • emoji
  • pprint
  • gensim
  • spacy
  • re
  • tqdm
  • sklearn
  • joblib
  • MulticoreTSNE
  • hdbscan
  • seaborn
  • stop_words

Installation

pip install nttc

Objects

nttc initializes and uses the following objects:

periodObject

Object that stores per-community subgraph data for a period. Its properties are as follows:

  • .comm_nums: List of retrieved community numbers from the imported nodes data
  • .subgraphs_dict: Dictionary of period's community nodes and edges data.

communitiesObject

Object with properties that generate topic models and help you name communities more easily. Its properties are as follows:

  • .content_slice: dict of a sample community's content segments
  • .split_docs: split version of sampled tweets
  • .id2word: dict version of split_docs
  • .texts: Listified version of sample
  • .corpus: List of sample terms with frequency counts
  • .readme: Optional human-readable printout version
  • .model: Stores the LDA topic model object
  • .perplexity: Computed perplexity score of topic model
  • .coherence: Computed coherence score of topic model
  • .top_rts: Sample of the top 10 RTers and RTs for the community
  • .top_mentions: Sample of top 10 people mentioned
  • .full_hub: Combined version of top_rts and top_mentions as a DataFrame

communityGroupsObject

Object with properties that analyze community likeness scores and then group alike communities across periods. Its properties are as follows:

  • .best_matches_mentions: Dictionary of per Period with per Period hub top_mentions (users) values as lists
  • .best_matches_rters: Dictionary of per Period with per Period hub top_rters (users) values as lists
  • .sorted_filtered_comms: List of tuples, where each tuple has 1) the tested pair of communities between 2 periods, and 2) their JACC score. Example: ('1_0x4_0', 0.4286)
  • .groups_mentions: A list of sets, where each set is a group of communities with alike top mentions across periods, based on your given JACC threshold:
    [{'1_8', '2_18'},
    {'3_7', '4_2'},
    {'7_11', '8_0'},
    {'10_11', '4_14', '5_14', '6_7', '9_11'},
    {'1_0', '2_11', '3_5', '4_0', '5_5', '6_12'},
    {'10_10', '1_9', '2_3', '3_3', '4_6', '5_2', '6_3', '7_0', '8_2', '9_4'},
    {'10_6', '1_2', '2_4', '3_4', '4_13', '5_6', '6_5', '7_4', '8_7', '9_0'},
    {'10_0', '1_12', '2_6', '3_0', '4_5', '5_7', '6_6', '7_3', '8_9', '9_5'}]
  • .groups_rters: A list of sets, where each set is a group of communities with alike top RTers across periods, based on your given JACC threshold:
    [{'1_8', '2_18', '5_14'},
    {'10_20', '5_18'},
    {'5_2', '6_3', '7_0'},
    {'5_1', '7_1'},
    {'10_12', '2_3', '3_13', '6_8', '7_5', '8_4', '9_1'}]

General Functions

nttc contains the following general functions:

  • initializePO: Initializes a periodObject().
  • initializeCGO: Initializes a communityGroupsObject().
  • get_csv: Loads CSV data as a pandas DataFrame.
  • batch_csv: Merges a folder of CSV files into either one allPeriodsObject that stores a dict of all network nodes and edges per period, or returns only that dict if no object is passed as an argument (see the sketch after this list).
  • write_csv: Writes DataFrame as a CSV file.
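As an illustration of the batch_csv idea, the following sketch reads every CSV file in a folder into a dict of pandas DataFrames keyed by filename. It is a minimal sketch only; the helper name and argument are hypothetical, not the nttc call signature.

    # Hedged sketch of the batch_csv idea; merge_csv_folder is a hypothetical
    # helper, not the nttc API.
    import glob
    import os

    import pandas as pd

    def merge_csv_folder(folder_path):
        """Return {filename_without_extension: DataFrame} for every CSV in a folder."""
        merged = {}
        for csv_path in sorted(glob.glob(os.path.join(folder_path, '*.csv'))):
            key = os.path.splitext(os.path.basename(csv_path))[0]
            merged[key] = pd.read_csv(csv_path)
        return merged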

Infomap Data-Processing Functions

nttc contains the following functions to process data into a usable format for the Infomap network analysis system.

For example, it takes an edge list with usernames (username1, username2), and it translates it into the necessary Pajek file format (.net).

listify_unique_users

Takes an edge list (a list of [source, target] lists) and creates a list of unique users.

check_protected_dataype_names

Verifies that edge names don't conflict with Python's protected datatype names. If one does, it appends 2 underscores to the name's end and logs it.

index_unique_users

Takes the list of unique users and appends IDs to them.

target_part_lookup

Looks up the target in the unique-user list and returns it to netify_edges().

write_net_dict

Writes a Dict of vertices (nodes) and arcs (edges) in preparation for formatting it into the Pajek file format (.net). It returns a dictionary akin to the following:

p_dict = {
        'vertices': verts, # A List of vertices (nodes) with an ID [1, user1]
        'arcs': arcs # A list of arcs (edges) [source, target]
    }

vert_lookup

Helper function for write_net_dict. It finds the matching username and returns the period-based ID.

netify_edges

Accepts a list of lists (edges) and replaces the usernames with their unique IDs. This prepares the output for the Infomap code system.

write_net_txt

Outputs a .txt file with edges in the .net format for the Infomap system (a sketch of the full edge-list-to-.net transformation follows the example):

source target [optional weight]
  1 2
  2 4
  2 8
  5 4
  ...
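The following sketch illustrates the whole transformation the functions above perform, from an edge list of usernames to a Pajek-style .net file. The variable names, data, and output filename are illustrative only, not the nttc API.

    # Illustrative sketch of the edge-list-to-Pajek transformation; names are
    # examples only, not nttc function signatures.
    edges = [['user1', 'user2'], ['user2', 'user4'], ['user2', 'user8']]

    # 1. Collect unique users and assign 1-based IDs
    #    (cf. listify_unique_users / index_unique_users).
    unique_users = sorted({name for edge in edges for name in edge})
    ids = {name: i + 1 for i, name in enumerate(unique_users)}

    # 2. Replace usernames with their IDs (cf. netify_edges).
    net_edges = [[ids[source], ids[target]] for source, target in edges]

    # 3. Write the Pajek-style .net text (cf. write_net_dict / write_net_txt).
    with open('period_1.net', 'w') as net_file:
        net_file.write('*Vertices {}\n'.format(len(unique_users)))
        for name, idx in sorted(ids.items(), key=lambda kv: kv[1]):
            net_file.write('{} "{}"\n'.format(idx, name))
        net_file.write('*Arcs\n')
        for source, target in net_edges:
            net_file.write('{} {}\n'.format(source, target))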

It also contains functions that enable you to isolate and output a CSV file with the hubs from each period. It does so with custom parsers for the infomap .map and .ftree file formats:

read_map_or_ftree

Helper function for infomap_hub_maker. Slices a period's .map or .ftree file into its line-by-line indices and returns a dict of those values for use.

indices_getter

Helper function for batch_map. Parses each line in the file and returns a list of lists, where each sublist is a line in the file.

batch_map

Retrieves all map files in a single directory. It assumes that only the desired files are in said directory. Returns a dict keyed by each file's naming scheme, matched with a custom regex pattern. Each key denotes a file, and its value is a list of lists, where each sublist is a line in the file (see the sketch after the argument list below).

  • regex= Regular expression for filename scheme
  • path= String. Path for directory with .map or .ftree files
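A minimal sketch of that directory-batching pattern with Python's glob and re modules; the filename regex, helper name, and keying scheme are assumptions, not nttc's defaults.

    # Hedged sketch of batch_map's pattern: glob a directory, key each file by a
    # regex capture group in its filename, and split every line into a list.
    import glob
    import os
    import re

    def collect_map_files(path, regex=r'p(\d+)\.ftree'):
        """Hypothetical helper: returns {period_key: list of line-split lists}."""
        collected = {}
        for file_path in sorted(glob.glob(os.path.join(path, '*'))):
            match = re.search(regex, os.path.basename(file_path))
            if not match:
                continue
            with open(file_path) as f:
                collected[match.group(1)] = [line.strip().split() for line in f]
        return collected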

networks_controller

Uses the Dict data structure hydrated by the following functions:

  • .batch_map()
  • .ftree_edge_maker(), and
  • .infomap_hub_maker().

It appends node names to edge data and also creates a node list for each module.

  • Args:
    • p_sample: Tuple of Integers. Desired period range to sample.
    • m_sample: Tuple of Integers. Desired module range to sample.
    • Both assume a continuous range: 1-10, 3-6, etc.
    • Dict. Output from batch_map(), ftree_edge_maker(), and infomap_hub_maker(), which includes:
      • DataFrame. Module edge data.
      • DataFrame. Module node data with names.
  • Return:
    • dict_network: Appends more accessible edge and node data.

network_organizer

Organizes infomap .ftree network edge and node data into Dict.

  • Args:
    • m_edges: DataFrame. Per period module edge data
    • m_mod: List of Dicts. Per period list of module data
  • Return:
    • return_dict: Dict. Network node and edge data with names:
          { 
              return_dict: {
                  'nodes': DataFrame,
                  'edges': DataFrame
              }
          }

content_sampler

Samples content in each period per module, based on map-equation, flow-based community detection (the random vs. top-n option is sketched below the argument list).

  • Args:
    • network: Dict. Edge and node data for each community across periods.
    • corpus: DataFrame.
    • period_dates: Dict of lists. Dates for each period of the corpus.
    • sample_size: Integer.
    • random: Boolean. True pulls randomized sample. False pulls top x tweets.
  • Return:
    • Dict of DataFrames. Sample of content in each module per period
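The random/top-n distinction can be pictured with plain pandas, as below. The DataFrame and column names are invented placeholders; nttc's own sampler works per module and per period.

    # Illustrative only: how a random sample differs from a "top x" sample in pandas.
    import pandas as pd

    df_module_tweets = pd.DataFrame({
        'username': ['a', 'b', 'c', 'd'],
        'retweets': [40, 25, 10, 2],
    })
    sample_size = 2

    # random=True: a randomized sample of rows.
    random_sample = df_module_tweets.sample(n=sample_size, random_state=42)
    # random=False: the top x rows after sorting on an engagement column.
    top_sample = df_module_tweets.sort_values('retweets', ascending=False).head(sample_size)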

sample_getter

Samples corpus based on module edge data from infomap data.

  • Args:
    • sample_size: Integer. Number of edges to sample. To keep all results, use a -1 (Int) value.
    • edges: List of Dicts. Edge data.
    • period_corpus: DataFrame. Content corpus to be sampled.
    • sample_type: String. Current options include:
      • 'modules': Samples tweets based on community module source-target relations.
      • 'ht_groups': Samples tweets based on use of hashtags. Must provide a list of strings.
    • user_threshold:
    • random: Boolean. True will randomly sample the fully retrieved set of tweet content.
    • ht_list: List of strings. If sampling via hashtag groups, provide a list of the hashtags. Default is None.

  • Return:
    • DataFrame. Sampled content, based on infomap module edges.

infomap_edges_sampler

Samples edges in each period per module, based on map-equation, flow-based community detection.

  • Args:
    • network: Dict. Edge data for each module across periods.
    • sample_size: Integer.
    • column_name: String. Name of desired column to sample.
    • random: Boolean. True pulls randomized sample. False pulls top x tweets.
  • Return:
    • Dict of DataFrames. Sample of edges in each module per period

ranker

Appends rank and percentages at different aggregate levels.

  • Args:
    • rank_type= String. Argument option for type of ranking to conduct. Currently only per_hub.
    • tdhn= Dict of the corpus. The function traverses its 'info_hub'.
  • Return
    • tdhn= Updated 'info_hub' with a 'percentage_total' per hub and a 'spot' for each node per hub.
  • TODO: Add per_party and per_hubname

append_rank

Helper function for ranker(). It appends the rank number for the 'spot' value.

append_percentages

Helper function for ranker(). Appends each node's total_percentage to the list

  • Args:
    • rl= List of lists. Ranked list of nodes per hub

score_summer

Tallies scores from each module per period and appends a score_total to each node instance per module for every period.

  • Args:
    • dhn= Dict of hubs returned from info_hub_maker

get_period_flow_total

Helper function for score_summer. Tallies scores per Period across hubs.

  • Args:
    • lpt= List. Contains hub totals per Period.
  • Return
    • Float. Total flow score for a Period.

get_score_total

Helper function for score_summer. Tallies scores per Hub.

  • Args:
    • list_nodes= List of Dicts
  • Return
    • total= Float. Total flow score for a Hub.

infomap_hub_maker

Takes the fully hydrated Dict of the map or ftree files and parses its nodes into per-period and per-module Dicts.

  • Args:
    • file_type= String. 'map' or 'ftree' file type designation
    • dict_map= Dict of map files
    • mod_sample_size= Integer. Number of modules to sample
    • hub_sample_size= Integer. Number of nodes to sample for the "hub" of each module
  • Output:
    • dict_map= Dict with new info_hub key hydrated with hubs

output_infomap_hub

Takes fully hydrated infomap dict and outputs it as a CSV file.

  • Args:
    • header= column names for DataFrame and CSV;
      • Assumes they're in order with period and hub in first and second position
    • dict_hub= Hydrated Dict of hubs
    • filtered_hub_length= Int. Desired length of hub
    • path= Output path
    • file= Output file name

sampling_module_hubs

Compares hub set with tweet data to ultimately output sampled tweets with hub information.

  • Args:
  • period_dates: Dict of lists that include dates for each period of the corpus
  • period_check: String for option: Check against 'single' or 'multiple'
  • period_num: Integer. If period_check == 'single', provide integer of period number.
  • df_all_tweets: Pandas DataFrame of tweets
  • df_hubs: Pandas DataFrame of infomapped hubs
  • top_rts_sample: Integer of desired sample size of sorted top tweets (descending order)
  • hub_sample: Integer of desired sample size to output
  • hub_cols: List of column names (String) from hub file desired to preserve and append to sample.
  • Returns DataFrame of top sampled tweets

add_infomap

Helper function for sampling_module_hubs. It cross-references the sampled tweets with the hubs data for a given period.

  • Args:
  • dft: DataFrame of sampled tweet data
  • dfh: Full DataFrame of hubs data
  • period_num: Integer of particular period number
  • Returns List of Dicts with hub and info_name mentions info

batch_output_period_hub_samples

Periodic batch output that saves sampled tweets as a CSV. Assumes successively numbered periods.

  • Args:
  • module_output: DataFrame of tweet sample data per Period per Module
  • period_total: Integer of total number of periods
  • file_ext: String of desired filename extension pattern
  • period_path: String of desired path to save the files
  • Returns nothing

periodObject Functions

  • get_comm_nums: Filters unique community column values into List
  • comm_sender and write_community_list: These 2 functions create a dict of nodes and edges to be saved as a property, .subgraphs_dict, of a periodObject. They do so as follows:
    • Create a List of nodes per community.
    • Create a List of edges per community.
    • Append a dict of these lists to a comprehensive dict for the period.
    • Append this period dict to the periodObject property .subgraphs_dict.
    • Return the object.
  • add_comm_nodes_edges: Function to more quickly generate a new networkX graph of specific communities in a period.
  • add_all_nodes_edges: Function to more quickly generate a new networkX graph of all communities in a period.
  • draw_subgraphs: Draws subgraphs with the networkX module, and can do so with multiple communities across periods (see the sketch after this list).
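A brief sketch of what .subgraphs_dict enables once populated: building and drawing one community's subgraph with networkX. The node and edge data below are invented placeholders, and the drawing style is not nttc's own.

    # Hedged sketch: draw one community's subgraph from its node and edge lists.
    import matplotlib.pyplot as plt
    import networkx as nx

    community_nodes = ['user1', 'user2', 'user3']
    community_edges = [('user1', 'user2'), ('user2', 'user3')]

    g = nx.DiGraph()
    g.add_nodes_from(community_nodes)
    g.add_edges_from(community_edges)

    nx.draw_networkx(g, with_labels=True, node_size=300)
    plt.savefig('community_subgraph.png')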

communitiesObject Functions

create_hub_csv_files

Writes all of the objects' top RT'd and top mentions information as a CSV of "hubs".

get_comm_nums

Filters a DataFrame's community column values into a List.

get_all_comms

Slices the full data set into communities and their respective tweets.

  • Args:
    • dft: Dataframe
    • col_community: String. Column name for community
    • col_tweets: String. Column name for tweet content

comm_dict_writer

Writes per-community content segments into a dictionary.

  • Args:
    • comm_list= List of community numbers / labels
    • df_content= DataFrame of data set in question
    • comm_col= String of column name for community/module
    • content_col= String of column name for content to parse and examine
    • sample_size_percentage= Desired percentage to sample from full set
  • Returns Dict of sliced DataFrames (value) as per their community/module (key)

split_community_tweets

Isolates a community's content, then splits each tweet's string into a list of strings, preparing them for topic modeling.

  • Args:
  • col_name: String. Community label as String,
  • dict_comm_obj: Dict of community objects
  • sample_size_percentage: Float. Between 0 and 1.
  • Returns a DataFrame of content for the respective community

clean_split_docs

Removes punctuation, lowercases, removes stopwords, and converts the result into a DataFrame for topic modeling (a minimal cleaning sketch follows).
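A minimal cleaning sketch, assuming an English stopword list from the stop_words package and gensim's simple_preprocess tokenizer; nttc's own implementation may differ in its details, and the sample tweets are invented.

    # Illustrative cleaning pass: lowercase, strip punctuation, drop stopwords.
    from gensim.utils import simple_preprocess
    from stop_words import get_stop_words

    stops = set(get_stop_words('en'))
    tweets = ['Vote NOW for the bill!', 'The committee meets Tuesday.']

    # simple_preprocess lowercases, removes punctuation, and tokenizes each tweet.
    split_docs = [
        [token for token in simple_preprocess(tweet) if token not in stops]
        for tweet in tweets
    ]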

tm_maker

Creates the data for a topic model (TM) and builds a gensim LDA TM (a minimal sketch follows the argument list).

  • Args: Pass many of the gensim LDATopicModel() object arguments here, plus some helpers. See their documentation for more details (https://radimrehurek.com/gensim/models/ldamodel.html).
  • random_seed: Integer. Value for randomized seed.
  • single: Boolean. True assumes only one period of data being evaluated.
  • split_comms:
    • If 'single' False, Dict of objects with respective TM data.
    • If 'single' True, object with TM data
  • num_topics: Integer. Number of topics to produce (k value)
  • random_state: Integer. Introduce random runs.
  • update_every: Integer. "Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning."
  • chunksize: Integer. "Number of documents to be used in each training chunk."
  • passes: Integer. "Number of passes through the corpus during training."
  • alpha: String. Pass options available via gensim package
  • per_word_topics: Boolean.
  • Returns: Either an updated Dict of objects or a single Dict, now ready for visualization or printing.
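A minimal gensim sketch using the hyperparameters listed above; the token lists and parameter values are placeholders rather than nttc defaults, and the perplexity/coherence lines mirror the .perplexity and .coherence properties described earlier.

    # Hedged sketch of the gensim LDA pipeline that tm_maker wraps.
    from gensim import corpora, models

    split_docs = [['vote', 'bill', 'senate'], ['committee', 'hearing', 'vote']]

    id2word = corpora.Dictionary(split_docs)
    corpus = [id2word.doc2bow(doc) for doc in split_docs]

    lda_model = models.LdaModel(
        corpus=corpus,
        id2word=id2word,
        num_topics=2,
        random_state=100,
        update_every=1,
        chunksize=100,
        passes=10,
        alpha='auto',
        per_word_topics=True,
    )

    perplexity = lda_model.log_perplexity(corpus)
    coherence = models.CoherenceModel(
        model=lda_model, texts=split_docs, dictionary=id2word, coherence='c_v'
    ).get_coherence()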

get_hubs_top_rts

Appends hubs' top 10 RT'd tweets and usernames to respective period and community object.

  • Args:
  • Dataframe of hub top mentions,
  • Dict of Objects with .top_rts,
  • String of period number
  • Returns: Dict Object with new .top_rts per Object

get_hubs_mentions

Appends hubs' mentions data to the respective period and community object.

  • Args:
  • Dataframe of hub mentions,
  • Dict of Objects,
  • String of column name for period,
  • String of period number,
  • String of column name for the community number
  • Returns: Dict Object with new .top_mentions per Object

merge_rts_mentions

Merges hubs' sources and mentions data as a full list per Community.

communityGroupsObject Functions

matching_dict_processor

Processes input dataframe of network community hubs for use in the tsm.match_communities() function.

  • Args: A dataframe with Period, Period_Community (1_0), and top mentioned (highest in-degree) users
  • Returns: Dictionary of per Period with per Period_Comm hub values as lists:
            {'1': {'1_0': ['nancypelosi',
               'chuckschumer',
               'senfeinstein',
               'kamalaharris',
               'barackobama',
               'senwarren',
               'hillaryclinton',
               'senkamalaharris',
               'repadamschiff',
               'corybooker'],
               ...
               },
               ...
               '10': {'10_3': [...] }
            }

match_maker

Takes the period dict from matching_dict_processor() and submits it to the tsm.match_communities() method. It assigns, filters, and sorts the returned values into a list of tuples with the findings (a sketch of the underlying Jaccard comparison follows the example below).

  • Args:
  • Dictionary of per Period with per Period_Comm hub values as lists;
  • filter_jacc threshold value (float) between 0 and 1.
  • Returns: List of tuples: period_communityxperiod_community, JACC score
            [('1_0x4_0', 0.4286),
            ('1_0x2_11', 0.4615),
            ('1_0x3_5', 0.4615),
            ... ]
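For intuition, a score like 0.4286 is simply the Jaccard index of two communities' top-user sets: 3 shared users out of 7 distinct users. The usernames below are illustrative; tsm.match_communities() performs the actual comparison.

    # Illustrative Jaccard computation for two communities' top mentions.
    comm_1_0 = {'nancypelosi', 'chuckschumer', 'senwarren', 'corybooker', 'senfeinstein'}
    comm_4_0 = {'nancypelosi', 'chuckschumer', 'senwarren', 'repadamschiff', 'kamalaharris'}

    shared = comm_1_0 & comm_4_0                   # 3 users in common
    distinct = comm_1_0 | comm_4_0                 # 7 distinct users overall
    jacc = round(len(shared) / len(distinct), 4)   # 3 / 7 = 0.4286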

plot_bar_from_counter

Plots the community comparisons as a bar chart (see the sketch after the argument list).

  • Args:
  • ax=None # Resets the chart
  • counter = List of tuples returned from match_maker(),
  • path = String of desired path to directory,
  • output = String value of desired file name (.png)
  • Returns: Nothing.
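A minimal matplotlib sketch of plotting match_maker() output as a bar chart; the data, labels, and styling below are illustrative, not nttc's own.

    # Hedged sketch: bar chart of community-pair Jaccard scores.
    import matplotlib.pyplot as plt

    counter = [('1_0x4_0', 0.4286), ('1_0x2_11', 0.4615), ('1_0x3_5', 0.4615)]
    labels, scores = zip(*counter)

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(labels, scores)
    ax.set_xlabel('Compared period_communities')
    ax.set_ylabel('Jaccard score')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('community_comparisons.png')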

community_grouper()

Controller function for the process of grouping together communities found to be similar across periods in the corpus. It uses the 1) group_reader() and 2) final_grouper() functions to complete this categorization process.

  • Args: Accepts the network object (net_obj) with the returned value from nttc.match_maker(), which should be saved as the .sorted_filtered_comms property: a list of tuples with sorted and filtered community pairs and their scores. Only the community values are used.
  • Returns: A list of sets, where each set is a grouped recurrent community: For example, 1_0, where 1 is the period, and 0 is the designated community number.
    [{'1_8', '2_18'},
     {'3_7', '4_2'},
     {'7_11', '8_0'},
     {'10_11', '4_14', '5_14', '6_7', '9_11'},
     {'1_0', '2_11', '3_5', '4_0', '5_5', '6_12'},
     {'10_10', '1_9', '2_3', '3_3', '4_6', '5_2', '6_3', '7_0', '8_2', '9_4'},
     {'10_6', '1_2', '2_4', '3_4', '4_13', '5_6', '6_5', '7_4', '8_7', '9_0'},
     {'10_0', '1_12', '2_6', '3_0', '4_5', '5_7', '6_6', '7_3', '8_9', '9_5'}]
        
  • NOTE: This algorithm isn't perfect. It needs some refinement, since it may output some overlaps. However, it certainly filters down the potential persistent communities with either top_mentions or top_rters across periods, so it saves you some manual comparative analysis labor.

group_reader()

Takes the period_community pairs and appends them to a dict if intersections occur. However, the returned dict requires further analysis and processing, due to the unknown order and content of the sorted and filtered communities, which is why community_grouper() then sends them to final_grouper() after completion here.

  • Args:
  • Accepts the initial group dict, which is cross-referenced by the pair of period_community values extracted via a regex expression.
  • Returns: A dict of oversaturated comparisons, which are sent to final_grouper() for final analysis, reduction, and completion.

final_grouper()

Takes the period_community dictionaries and tests for their intersections. It then joins any intersecting sets with .union and appends them to a localized running list, which accrues into a running master list for that community. From there, each community result is sorted by length in descending order (a simplified sketch of this merging follows the argument list).

  • Args: Accepts the group dict from group_reader().
  • Returns: A dict of all unique period_community elements (2 or more) found to be similar.
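A simplified sketch of the intersection-and-union merging that group_reader() and final_grouper() perform together: pairs that share a period_community member are folded into one set, and the resulting sets are sorted by length. The pairs below are invented for illustration.

    # Illustrative merge of overlapping community pairs into grouped sets.
    pairs = [{'1_0', '4_0'}, {'1_0', '2_11'}, {'3_7', '4_2'}, {'2_11', '3_5'}]

    groups = []
    for pair in pairs:
        merged = set(pair)
        untouched = []
        for group in groups:
            if merged & group:      # intersection found, so absorb the group
                merged |= group     # join with union
            else:
                untouched.append(group)
        untouched.append(merged)
        groups = untouched

    groups.sort(key=len, reverse=True)  # sort by length, descending
    # groups -> [{'1_0', '2_11', '3_5', '4_0'}, {'3_7', '4_2'}]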

Distribution update terminal commands

# Create new distribution of code for archiving
sudo python3 setup.py sdist bdist_wheel

# Distribute to Python Package Index
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*