This repository provides the implementation accompanying the paper Clustering Digital Ego Networks by Tie Strength: A Scalable, Platform-Independent Method, accepted for presentation at the ISKE 2025 conference. It primarily contains the code base required to ensure reproducibility of the analyses, including scripts for preprocessing, clustering, and visualization of ego-network tie strength distributions.
Further information about the conference can be found at https://iske2025.com/
The datasets used in this study contain sensitive information and, to preserve user privacy, are not published in this repository. Only the code is made publicly available, to ensure transparency and reproducibility of the analyses.
Researchers who wish to access the data for academic and non-commercial purposes may request it by contacting the corresponding author at the email address provided below. Each request will be evaluated individually to ensure compliance with privacy and ethical guidelines.
Corresponding author: Masoud Fatemi
Email: firstname[dot]lastname[at]uef[dot]fi
Fatemi, M., Laitinen, M., & Fränti, P. (2025). Clustering Digital Ego Networks by Tie Strength: A Scalable, Platform-Independent Method. The 20th International Conference on Intelligent Systems and Knowledge Engineering (ISKE 2025), Shunde, China.
This script generates Figure 3 in the paper. It visualizes yearly trends in the collected Twitter data across regions (Nordic countries, UK, US, and Australia).
- It plots the total number of collected accounts and tweets per year using bar charts.
- Input: Pre-processed JSON files containing yearly frequency counts (stored under json/).
- Output: Bar plots showing comparative growth of accounts and tweets across regions.
- Usage: Toggle the users and tweets flags in the script to choose which statistics to plot.
This script generates Figure 4 in the paper. It demonstrates data cleaning, outlier filtering, and visualization of Twitter interaction measures across regions (UK, US, AU, Nordic).
- It compares raw vs. cleaned+filtered data distributions using KDE plots.
- Input: A CSV file located in data/df.csv containing:
  - interactions (IS)
  - interactions_p (RIS)
  - social_similarities (SS)
  - outliers (OUT)
  - country
- Output: KDE density plots with mean (μ) and standard deviation (σ) annotations for each region.
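The idea behind a KDE plot annotated with μ and σ can be sketched with the standard library alone. This is a minimal illustration, not the script's code: the sample values are made up, the bandwidth uses a normal-reference rule of thumb chosen here as an assumption, and the population standard deviation is used although the paper may use the sample version.

```python
import math
from statistics import mean, pstdev


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical values standing in for one measure in one region.
samples = [1.0, 1.5, 2.0, 2.5, 3.0]
sigma = pstdev(samples)

# Assumption: normal-reference rule of thumb for the bandwidth.
bandwidth = 1.06 * sigma * len(samples) ** (-1 / 5)
density = gaussian_kde(samples, bandwidth)

# Text that would be placed on the plot as an annotation.
label = f"μ={mean(samples):.2f}, σ={sigma:.2f}"
print(label, density(2.0))
```

In practice a plotting library would draw the curve; the sketch only shows what the curve and its annotation represent.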
This script generates Tables 3, 4, and 5 in the paper. It performs clustering on Twitter interaction measures after cleaning and outlier removal.
- Steps performed:
  - Remove invalid data points (e.g., ego networks, zero values).
  - Filter outliers using Median Absolute Deviation (MAD).
  - Rescale the outliers (OUT) measure before normalization.
  - Normalize all measures (z-score).
  - Apply KMeans clustering.
  - Generate summary tables with statistics and clustering results.
- Input: CSV file located in data/df.csv containing:
  - interactions (IS)
  - interactions_p (RIS)
  - social_similarities (SS)
  - outliers (OUT)
  - country
  - userids
- Output:
  - Printed summary tables:
    - Table 3: Average values per country after filtering.
    - Table 4: Cluster centroids and sizes.
    - Table 5: Cluster counts per country.
  - (Optional) CSV with clustered data if save=True.
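The MAD filtering and z-score steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the script's actual code: the 0.6745 scaling constant and the 3.5 cutoff follow a common modified z-score convention and are not taken from the paper, and the input values are invented.

```python
from statistics import mean, median, pstdev


def mad_filter(values, threshold=3.5):
    """Keep values whose modified z-score is within the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be flagged reliably.
        return list(values)
    # 0.6745 makes the MAD comparable to a standard deviation
    # for normally distributed data (assumed convention).
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= threshold]


def zscore(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical interaction counts; 250 is an obvious outlier.
interactions = [3, 4, 5, 4, 6, 5, 250]
kept = mad_filter(interactions)
normalized = zscore(kept)
print(kept)
```

The normalized measures would then be passed to the clustering step; the filter is applied before normalization so that extreme values do not distort the mean and standard deviation.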
This script generates Figure 5 in the paper. It visualizes the distribution of clustered users across regions using stacked bar charts, based on the output dataset produced by clustering.py.
- Input:
  - CSV file located in data/df_filtered_clustered_.csv
  - = number of clusters (e.g., 4)
- Processing:
  - Aggregates cluster assignments per region.
  - Converts counts into percentages.
  - Generates stacked bar charts with annotated percentages.
- Output:
  - Stacked bar chart comparing cluster composition across regions.
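The aggregation behind the stacked bars (counts per region converted to percentages) can be sketched as below. The function name, the `(region, cluster)` row format, and the sample rows are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict


def cluster_percentages(rows):
    """Map (region, cluster) pairs to {region: {cluster: percent}}."""
    counts = defaultdict(Counter)
    for region, cluster in rows:
        counts[region][cluster] += 1

    result = {}
    for region, per_cluster in counts.items():
        total = sum(per_cluster.values())
        # Each region's shares sum to 100, one stacked bar per region.
        result[region] = {
            c: 100.0 * n / total for c, n in sorted(per_cluster.items())
        }
    return result


# Hypothetical cluster assignments.
rows = [("UK", 0), ("UK", 0), ("UK", 1), ("UK", 2), ("US", 1), ("US", 1)]
shares = cluster_percentages(rows)
print(shares)
```

Normalizing within each region makes regions of very different sizes directly comparable in the chart.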
This script generates Figure 6 in the paper. It validates k-means clustering on Twitter interaction measures using multiple evaluation metrics.
- Processing steps:
  - Clean dataset (remove invalid entries).
  - Filter outliers using MAD.
  - Rescale and normalize measures.
  - Run k-means clustering for k = 2..6.
  - Compute Silhouette Score, Calinski-Harabasz Score, and WB-index.
- Input: CSV file located in data/df.csv containing the measures and country labels.
- Output:
  - Printed scores for each metric at different k values.
  - Plots of score trends across cluster sizes.
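Of the three metrics, the WB-index is the least likely to be available in common libraries, so a small sketch may help. It assumes the usual definition WB = k · SSW / SSB (within-cluster over between-cluster sum of squares, scaled by the number of clusters k, lower is better); the script's own implementation may differ, and the points and labels below are invented.

```python
def wb_index(points, labels):
    """WB = k * SSW / SSB for a labelled set of points (needs k >= 2)."""
    dim = len(points[0])
    n = len(points)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    # Group points by their cluster label.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    ssw = 0.0  # within-cluster sum of squares
    ssb = 0.0  # between-cluster sum of squares
    for members in groups.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        ssw += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
        ssb += size * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
    return len(groups) * ssw / ssb


# Two well-separated pairs of hypothetical points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = wb_index(points, [0, 0, 1, 1])  # groups nearby points together
bad = wb_index(points, [0, 1, 0, 1])   # splits each nearby pair
print(good, bad)
```

A partition that keeps nearby points together yields a much lower value than one that splits them, which is the behaviour used when comparing k = 2..6.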
This script generates Figure 7 and Table 6 in the paper. It visualizes the gender composition (Male/Female) within tie-strength clusters across countries using stacked bar charts.
- Input:
  - data/df_filtered_clustered_.csv (output from clustering script)
  - data/gender.csv (predicted gender labels per Twitter ID)
- Processing:
  - Merge gender with cluster assignments.
  - Filter to AU, UK, US and Male/Female only.
  - Compute gender percentages per cluster-country group.
  - Plot stacked bars with annotated percentages.
  - (Optional) Print gender distribution tables if table6=True.
- Output: Stacked bar charts showing the gender composition of each tie-strength cluster in AU, UK, and US.
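The merge, filter, and percentage steps above can be sketched with plain dictionaries. The user IDs, the dictionary layout, and the function name are hypothetical; the real script reads the two CSV files listed as input and may handle missing labels differently.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for the two input files, keyed by user ID.
clusters = {
    "u1": ("UK", 0),
    "u2": ("UK", 0),
    "u3": ("US", 1),
    "u4": ("Nordic", 1),
    "u5": ("UK", 0),
}
gender = {"u1": "Male", "u2": "Female", "u3": "Female", "u4": "Male", "u5": "Unknown"}


def gender_shares(clusters, gender,
                  countries=("AU", "UK", "US"),
                  genders=("Male", "Female")):
    """Return {(country, cluster): {gender: percent}} after merge and filter."""
    counts = defaultdict(Counter)
    for uid, (country, cluster) in clusters.items():
        label = gender.get(uid)  # users without a gender label are skipped
        if country in countries and label in genders:
            counts[(country, cluster)][label] += 1
    return {
        key: {g: 100.0 * ctr[g] / sum(ctr.values()) for g in genders}
        for key, ctr in counts.items()
    }


shares = gender_shares(clusters, gender)
print(shares)
```

Each `(country, cluster)` entry corresponds to one stacked bar, with the Male and Female shares summing to 100.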