DuplicatePreventionSurveyArticle
Survey of Ideas on Duplicate Prevention
**Labels:** Phase-Design, duplicates, import
Summarizes discussions, issues, and other ideas (through June 6, 2008) about identifying, preventing, and squashing duplicate events and venues.
- Ignore source_id. That is, if all other fields are identical, treat the object as a duplicate. Either (a) don't import it, or (b) for events only, import it and mark it as a duplicate.
- Use the source's UID. Add a column to each table to store uid_from_source. When importing new objects, compare each new object's uid_from_source to those already stored. If they are identical, don't import. (Or possibly don't import if the stored object was updated after the new object was updated.) This solves the problem of re-importing venues from the same platform, e.g., multiple Cube Spaces from Upcoming. See De-duping imports.
- Simplify fields. Simplify fields before saving. When importing, simplify the new object before duplicate checking. Examples: strip leading and trailing spaces, strip HTML from most fields. See Issue93; Issue126; cf. De-duping events/venues.
- Use Google's canonical address to detect duplicate venues. See Issue9.
- Normalize fields. See De-duping events/venues.
- Fuzzy matching with Levenshtein distance. See RailsConf code drive.
- Flag potential duplicates for gardening.
- "Fingerprint" stored objects. See EventDuplication.
- NotifyingUserOfDuplicates (outline of screens to notify the user about importing duplicates).
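
The "ignore source_id", "simplify fields", and "fingerprint" ideas above can be combined in a small sketch. This is only an illustration in Python, not the project's code: the field names in `COMPARED_FIELDS`, the regex-based tag stripping, and the choice of SHA-1 over the simplified fields are all assumptions, and the fingerprint described in EventDuplication may be defined differently.

```python
import hashlib
import html
import re

# Hypothetical field names; the real event/venue schema may differ.
# source_id is deliberately absent so it never affects the comparison.
COMPARED_FIELDS = ("title", "description", "url", "start_time", "venue")

# Crude tag matcher, good enough for a sketch (not a full HTML parser).
_TAG_RE = re.compile(r"<[^>]+>")


def simplify(value):
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    if not isinstance(value, str):
        return value
    text = html.unescape(_TAG_RE.sub("", value))
    return " ".join(text.split())


def simplified_record(record):
    """Return only the compared fields, each one simplified."""
    return {name: simplify(record.get(name)) for name in COMPARED_FIELDS}


def is_duplicate(a, b):
    """Two records are duplicates if every compared field matches."""
    return simplified_record(a) == simplified_record(b)


def fingerprint(record):
    """Hash the simplified fields so a duplicate can be found in one lookup."""
    simplified = simplified_record(record)
    parts = [repr(simplified[name]) for name in COMPARED_FIELDS]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Storing the fingerprint in an indexed column would let an importer check for an existing match without comparing every field of every stored object, at the cost of recomputing it whenever the simplification rules change.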
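
The uid_from_source check described in the list can be sketched as follows. The `Venue` class, the `source` field, the lookup keyed by `(source, uid_from_source)`, and the `refresh_if_newer` flag are hypothetical names introduced for illustration. The flag models one reading of the parenthetical variant: skip the import when the stored object is at least as recent as the incoming one.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Venue:
    title: str
    source: str            # e.g. "upcoming"; hypothetical field
    uid_from_source: str   # the column proposed in the list above
    updated_at: datetime


def should_import(new, stored_by_uid, refresh_if_newer=False):
    """Decide whether to import `new`.

    `stored_by_uid` maps (source, uid_from_source) to the stored venue.
    """
    stored = stored_by_uid.get((new.source, new.uid_from_source))
    if stored is None:
        return True  # this UID has not been seen from this source
    if not refresh_if_newer:
        return False  # basic rule: identical UID means don't import
    # Variant: import again only if the incoming copy is newer.
    return new.updated_at > stored.updated_at
```

Keying on the source as well as the UID is an assumption made here so that identical UIDs from two different platforms are not mistaken for the same object.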
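
Fuzzy matching and flagging for gardening could work together roughly as in the sketch below. It is written from scratch for illustration and is not the code from the RailsConf code drive; the `max_distance` threshold of 2 and the lowercase, whitespace-collapsing normalization are assumptions that would need tuning against real data.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def potential_duplicates(titles, max_distance=2):
    """Return pairs of titles close enough to flag for manual review."""
    normalized = [" ".join(t.lower().split()) for t in titles]
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if levenshtein(normalized[i], normalized[j]) <= max_distance:
                pairs.append((titles[i], titles[j]))
    return pairs
```

The pairwise loop is quadratic in the number of titles, so for a large table it would make sense to compare only within a narrower candidate set, such as objects that share a date or a postal code.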
INITIAL REVIEW NEEDED