Commit a8e80f7

Author: Tomasz bla Fortuna
Fix scoring of very frequent trigrams.
They were given score 0 in f32 space and that made them incomparable. Widening the usable frequency range and keeping score non-zero fixes that and gives better result ordering when using "should" with popular trigrams.
1 parent fbf7528 commit a8e80f7

3 files changed: +20 -10 lines changed

Cargo.lock

+1-1
Some generated files are not rendered by default.

Cargo.toml

+1-1
@@ -1,6 +1,6 @@
 [package]
 name = "fuzzdex"
-version = "1.1.0"
+version = "1.2.0"
 authors = ["Tomasz bla Fortuna <[email protected]>"]
 edition = "2021"
 license = "MIT"

src/fuzzdex/indexer.rs

+18-8
@@ -76,22 +76,32 @@ impl Indexer {
          *
          * Let's try to put average count as "1".
          *
-         * Hyperbolic function can smooth the scores and put them in nice range:
+         * Hyperbolic function can smooth the scores and put them in a nice range:
          * 0.5 + tanh(x)/2
-         * Has range 0 - 1 for values -inf to inf (-3 to 3 de facto).
-         * 0.5 + tanh((avg - val - 1) / avg)/2
-         * Will have 0.5 at exactly average, distinguish all lower values
-         * (higher score) up to 0.87, and will distinguish plenty of higher
-         * values.
+         * Has range 0 - 1 for values -inf to inf. Only around -5 to 5 is
+         * meaningful for a 32 bit float). Hence we will divide by 5*max.
+         *
+         * 0.5 + tanh(5.0 * (avg - val - 1) / max)/2
+         * Will have around 0.5 at average, max 1. Distinguish all lower values
+         * (higher score), and will distinguish plenty of higher values.
          */
 
         let average: f32 = self.db.values()
             .map(|v| v.positions.len())
             .sum::<usize>() as f32 / self.db.len() as f32;
 
+        let max: usize = self.db
+            .values()
+            .map(|v| v.positions.len())
+            .max_by_key(|val| *val)
+            .unwrap_or(1);
+
         for (_trigram, entry) in self.db.iter_mut() {
-            let input = entry.score;
-            let score = 0.5 + ((average - input - 1.0) / average).tanh() / 2.0;
+            let popularity = entry.score;
+            let centered = average - popularity - 1.0;
+            let ranged = 5.0 * centered / (max as f32);
+            let zero_to_one = 0.5 + (ranged).tanh() / 2.0;
+            let score = zero_to_one;
             entry.score = score;
         }
         Index::new(self, cache_size)
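
The effect described in the commit message can be reproduced outside the indexer. The sketch below is not part of the commit: `old_score` and `new_score` are hypothetical free functions that copy the two formulas from the diff, and the counts (average 10, maximum 2000) are made-up illustration values. With the old scaling, any trigram far above the average pushes the `tanh` argument so far negative that the `f32` result is exactly -1, so the score collapses to 0. With the new scaling the argument stays within roughly -5, and the scores stay non-zero and ordered.

```rust
/// Scoring before the commit: the tanh argument is scaled by the average count.
fn old_score(average: f32, popularity: f32) -> f32 {
    0.5 + ((average - popularity - 1.0) / average).tanh() / 2.0
}

/// Scoring after the commit: the tanh argument is scaled by 5 / max,
/// so the most popular trigram lands near -5 instead of saturating.
fn new_score(average: f32, max: f32, popularity: f32) -> f32 {
    let centered = average - popularity - 1.0;
    let ranged = 5.0 * centered / max;
    0.5 + ranged.tanh() / 2.0
}

fn main() {
    // Hypothetical statistics, chosen only to illustrate the saturation.
    let (average, max) = (10.0_f32, 2000.0_f32);
    for popularity in [1.0_f32, 10.0, 1000.0, 2000.0] {
        println!(
            "count {:>6}: old = {:e}, new = {:e}",
            popularity,
            old_score(average, popularity),
            new_score(average, max, popularity)
        );
    }
}
```

In this setup the old formula returns the same score for the counts 1000 and 2000, while the new one still ranks the rarer trigram higher, which is the ordering improvement the commit message refers to.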
