This implementation follows the steps outlined in the proposed approach:
- The `preprocess_data` function filters out users with fewer than a specified number of ratings and normalizes the rating values between 0 and 1.
- The `extract_quasi_identifiers` function extracts the rating vectors for each user as quasi-identifiers.
- The `compute_similarities` function computes the pairwise cosine similarities between rating vectors of users across the two datasets.
- The `perform_record_linkage` function performs record linkage using the Fellegi-Sunter model from the `recordlinkage` library, taking the similarities and probabilities as input.
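
To make the first three steps concrete, here is a minimal sketch in plain Python. It is not the implementation described above: the input layout (a list of `(user_id, item_id, rating)` tuples), the `min_ratings` and `items` parameters, the choice of min-max scaling, and the use of 0.0 for unrated items are all assumptions made for illustration. The linkage step is left out because its inputs depend on how the `recordlinkage` classifier is configured.

```python
from collections import defaultdict
from math import sqrt


def preprocess_data(ratings, min_ratings=2):
    """Drop sparse users and min-max scale ratings to [0, 1].

    `ratings` is assumed to be a list of (user_id, item_id, rating) tuples.
    """
    by_user = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating

    # Keep only users with at least `min_ratings` ratings.
    kept = {u: r for u, r in by_user.items() if len(r) >= min_ratings}
    values = [v for r in kept.values() for v in r.values()]
    if not values:
        return {}

    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all ratings are equal
    return {
        u: {i: (v - lo) / span for i, v in r.items()}
        for u, r in kept.items()
    }


def extract_quasi_identifiers(user_ratings, items):
    """Build one fixed-length rating vector per user, ordered by `items`.

    Unrated items are filled with 0.0 (an assumption of this sketch).
    """
    return {
        u: [r.get(i, 0.0) for i in items]
        for u, r in user_ratings.items()
    }


def cosine(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def compute_similarities(vectors_a, vectors_b):
    """Pairwise cosine similarity between users of the two datasets."""
    return {
        (ua, ub): cosine(va, vb)
        for ua, va in vectors_a.items()
        for ub, vb in vectors_b.items()
    }
```

With this layout, a user who appears in both datasets with the same ratings gets a similarity close to 1.0, which is the signal the linkage step would then weigh. One side effect of min-max scaling worth noting is that the lowest rating maps to 0.0 and so becomes indistinguishable from an unrated item in the vector.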