A lightweight Go package implementing Charikar's Simhash algorithm for generating hash fingerprints and calculating similarity, ideal for deduplication and content fingerprinting

ErfanMomeniii/simhash

simhash

simhash is a lightweight Go package for generating Simhash tokens and calculating their similarity using Moses Charikar's Simhash algorithm. It is well suited to text deduplication, plagiarism detection, near-duplicate detection, and content fingerprinting.

For detailed usage, see the package documentation.


Documentation

Install

To get started with simhash, install it using:

go get github.com/erfanmomeniii/simhash

Next, include it in your application:

import "github.com/erfanmomeniii/simhash"

Quick Start

The following example demonstrates how to generate Simhash tokens and calculate similarity:

package main

import (
	"fmt"
	"github.com/erfanmomeniii/simhash"
)

func main() {
	// Create a new Simhash instance
	s := simhash.NewSimhash()

	// Add features with weights
	s.AddFeature("example", 2)
	s.AddFeature("test", 5)

	// Generate a Simhash token
	token1 := s.GenerateToken()

	// Create another Simhash instance with different features
	s2 := simhash.NewSimhash()
	s2.AddFeature("example", 2)
	s2.AddFeature("testcase", 5)

	// Generate another token
	token2 := s2.GenerateToken()

	// Compute similarity between the two tokens
	similarity := simhash.ComputeSimilarity(token1, token2)

	// %.2f matches the two-decimal similarity shown in the output below
	fmt.Printf("Token1: %s\nToken2: %s\nSimilarity: %.2f\n", token1, token2, similarity)
}

Output:

Token1: F9E6E6EF197C2B25
Token2: FDA981914657B7D1
Similarity: 43.75

Features

Add Feature

Add features with their weights to the Simhash generator:

s.AddFeature("example", 5)
s.AddFeature(12345, 10)

Generate Token

Generate a 64-bit Simhash token, returned as a 16-character hexadecimal string, from the added features:

token := s.GenerateToken()

Compute Similarity

Calculate the similarity between two Simhash tokens as a percentage derived from their normalized Hamming distance:

similarity := simhash.ComputeSimilarity(token1, token2)
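
To make the percentage concrete, here is a minimal sketch that recomputes the similarity for the two tokens from the Quick Start output using only the standard library. The `similarityPercent` function is a hypothetical helper, and the formula (the share of the 64 bit positions on which the tokens agree) is an assumption inferred from that output rather than taken from the package source.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
)

// similarityPercent parses two 64-bit hexadecimal tokens and returns the
// percentage of bit positions on which they agree.
func similarityPercent(a, b string) (float64, error) {
	x, err := strconv.ParseUint(a, 16, 64)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", a, err)
	}
	y, err := strconv.ParseUint(b, 16, 64)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", b, err)
	}
	// XOR leaves a 1 in every position where the tokens differ,
	// so the population count is the Hamming distance.
	distance := bits.OnesCount64(x ^ y)
	return float64(64-distance) / 64 * 100, nil
}

func main() {
	sim, err := similarityPercent("F9E6E6EF197C2B25", "FDA981914657B7D1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Similarity: %.2f\n", sim) // prints "Similarity: 43.75"
}
```

The two tokens differ in 36 of 64 bits, so 28 bits agree and 28/64 gives 43.75, matching the Quick Start output.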

Supported Feature Types

The AddFeature method accepts the following types:

  • Strings: e.g., "example"
  • Numbers: integer and floating-point values, e.g., 123 or 1.5
  • Byte slices: e.g., []byte("example")
  • Any other type: Converted using JSON serialization
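
The list above suggests that each feature is reduced to bytes before hashing. A minimal sketch of that kind of dispatch, assuming a type switch with a JSON fallback, might look like the following. The `toBytes` helper is hypothetical and the exact byte encoding the package uses for numbers is not stated, so the decimal-text encoding here is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toBytes is a hypothetical helper that normalizes a feature value
// to a byte slice suitable for hashing.
func toBytes(v any) ([]byte, error) {
	switch t := v.(type) {
	case string:
		return []byte(t), nil
	case []byte:
		return t, nil
	case int, int8, int16, int32, int64,
		uint, uint8, uint16, uint32, uint64,
		float32, float64:
		// Assumption: numbers are encoded as their decimal text.
		return []byte(fmt.Sprint(t)), nil
	default:
		// Everything else falls back to JSON serialization.
		return json.Marshal(t)
	}
}

func main() {
	inputs := []any{"example", 123, []byte("example"), map[string]int{"a": 1}}
	for _, in := range inputs {
		b, err := toBytes(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%T -> %s\n", in, b)
	}
}
```

Under this scheme a string and a byte slice with the same contents yield the same bytes, whereas a struct or map is hashed by its JSON form, so field names and ordering affect the result.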

Contributing

Pull requests are welcome! For any changes, please open an issue first to discuss the proposed modification. Ensure tests are updated accordingly.
