Commit a76423a

Release v0.2.0 (#22)
* Release v0.2.0 with optimizations
1 parent 33f1c57 commit a76423a

6 files changed: 9 additions, 13 deletions
Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "SimString"
 uuid = "2e3c4037-312d-4650-b9c0-fcd0fc09aae4"
 authors = ["Bernard Brenyah"]
-version = "0.1.0"
+version = "0.2.0"
 
 [deps]
 CircularArrays = "7a955b69-7140-5f4e-a0ed-f168c5e2e749"

README.md

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ This package is be particulary useful for natural language processing tasks whic
 - [X] Fast algorithm for string matching
 - [X] 100% exact retrieval
 - [X] Support for unicodes
+- [X] Support for building databases directly from text files
 - [ ] Custom user defined feature generation methods
 - [ ] Mecab-based tokenizer support
-- [X] Support for building databases directly from text files
 - [ ] Support for persistent databases
 
 ## Suported String Similarity Measures
@@ -41,7 +41,7 @@ pkg> add SimString
 The few (and selected) brave ones can simply grab the current experimental features by simply adding the master branch to your development environment after invoking the package manager with `]`:
 
 ```julia
-pkg> add SimString#master
+pkg> add SimString#main
 ```
 
 You are good to go with bleeding edge features and breakages!

docs/src/index.md

Lines changed: 5 additions & 2 deletions
@@ -7,16 +7,18 @@ CurrentModule = SimString
 Documentation for [SimString](https://github.com/PyDataBlog/SimString.jl).
 
 A native Julia implementation of the CPMerge algorithm, which is designed for approximate string matching.
-This package is be particulary useful for natural language processing tasks which demand the retrieval of strings/texts from a very large corpora (big amounts of texts). Currently, this package supports both Character and Word based N-grams feature generations and there are plans to open the package up for custom user defined feature generation methods.
+This package is be particulary useful for natural language processing tasks which require the retrieval of strings/texts from a very large corpora (big amounts of texts). Currently, this package supports both Character and Word based N-grams feature generations and there are plans to open the package up for custom user defined feature generation methods.
+
+CPMerge Paper: [https://aclanthology.org/C10-1096/](https://aclanthology.org/C10-1096/)
 
 ## Features
 
 - [X] Fast algorithm for string matching
 - [X] 100% exact retrieval
 - [X] Support for unicodes
+- [X] Support for building databases directly from text files
 - [ ] Custom user defined feature generation methods
 - [ ] Mecab-based tokenizer support
-- [X] Support for building databases directly from text files
 - [ ] Support for persistent databases
 
 ## Suported String Similarity Measures
@@ -82,6 +84,7 @@ desc = describe_collection(db)
 ## Release History
 
 - 0.1.0 Initial release.
+- 0.2.0 Added support for unicodes
 
 ```@index
 ```

src/dictdb.jl

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ end
 Internal function to lookup feature sets by size and feature
 """
 function lookup_feature_set_by_size_feature(db::DictDB, size, feature)
-    if feature ∉ keys(db.lookup_cache[size])
+    if !haskey(db.lookup_cache[size], feature)
         db.lookup_cache[size][feature] = get(db.string_feature_map[size], feature, Set{String}())
     end
     return db.lookup_cache[size][feature]
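
The `src/dictdb.jl` change above keeps the same fill-on-miss caching behaviour and only swaps the membership test for `haskey`. As a rough, self-contained sketch of that pattern (the names `cached_lookup!`, `string_feature_map` and `lookup_cache` below are hypothetical stand-ins for one size bucket, not SimString's actual API):

```julia
# Hypothetical stand-ins for db.string_feature_map[size] and db.lookup_cache[size].
string_feature_map = Dict("ab" => Set(["abc", "cab"]))
lookup_cache = Dict{String, Set{String}}()

# Fill the cache on a miss, then return the cached entry.
function cached_lookup!(cache, source, feature)
    if !haskey(cache, feature)
        # Unknown features are cached as an empty set so later lookups are also hits.
        cache[feature] = get(source, feature, Set{String}())
    end
    return cache[feature]
end

hit = cached_lookup!(lookup_cache, string_feature_map, "ab")
miss = cached_lookup!(lookup_cache, string_feature_map, "zz")
```

A possible alternative, if a single call is preferred, is the do-block form of `get!`, which inserts the computed default only when the key is absent.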

src/features.jl

Lines changed: 0 additions & 2 deletions
@@ -10,7 +10,6 @@ end
 Internal function to pad AbstractVector types with specified padder
 """
 function pad_string(x::AbstractVector, padder::AbstractString)
-    # TODO: Insert a padder as the first and last element of x with undef
     insert!(x, 1, padder)
     push!(x, padder)
     return x
@@ -96,7 +95,6 @@ end
 Internal function to count and pad generated character-level ngrams (including duplicates)
 """
 function cummulative_ngram_count(x)
-    # TODO: Use length of x initiate non allocated ngrams
     counter = Dict{eltype(x), Int}()
 
     return map(x) do val
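
For readers following the `src/features.jl` hunks, here is a minimal sketch of what the two touched helpers appear to do. The padding function mirrors the lines shown in the diff; the body of the counting function is not visible in the hunk, so the version below is an assumption, and both names (`pad_tokens!`, `running_count`) are hypothetical.

```julia
# Mirrors the pad_string body shown in the diff: add the padder at both ends, in place.
function pad_tokens!(x::AbstractVector, padder::AbstractString)
    insert!(x, 1, padder)
    push!(x, padder)
    return x
end

# Assumed behaviour: pair every element with its running occurrence count,
# so repeated n-grams become distinct features.
function running_count(x)
    counter = Dict{eltype(x), Int}()
    return map(x) do val
        counter[val] = get(counter, val, 0) + 1
        (val, counter[val])
    end
end

tokens = pad_tokens!(["a", "b", "a"], "#")
counted = running_count(tokens)
```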

src/search.jl

Lines changed: 0 additions & 5 deletions
@@ -102,11 +102,6 @@ function search!(measure::AbstractSimilarityMeasure, db_collection::DictDB, quer
     # Generate features from query string
     features = extract_features(db_collection.feature_extractor, query)
 
-    # Metadata from the generated features (length, min & max sizes)
-    # length_of_features = length(features)
-    # min_feature_size = minimum_feature_size(measure, length_of_features, α)
-    # max_feature_size = maximum_feature_size(measure, db_collection, length_of_features, α)
-
     results = String[]
 
     # Generate and return results from the potential candidate size pool
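
The commented-out lines removed from `src/search.jl` refer to minimum and maximum feature sizes, which bound the "candidate size pool" mentioned in the remaining comment. As an illustration only, assuming the Jaccard measure and using hypothetical helper names (the package's own `minimum_feature_size` and `maximum_feature_size` take a measure and, for the maximum, the database, and may differ), the bounds could be sketched as:

```julia
# Hypothetical helpers, not SimString.jl's API.
# For Jaccard similarity >= α, a match Y for a query X must satisfy
# α * |X| <= |Y| <= |X| / α.
jaccard_min_size(n::Integer, α::Real) = ceil(Int, α * n)
jaccard_max_size(n::Integer, α::Real) = floor(Int, n / α)

n_features = 6   # number of features extracted from the query
α = 0.5          # similarity threshold
candidate_sizes = jaccard_min_size(n_features, α):jaccard_max_size(n_features, α)
```

Only database entries whose feature-set size falls inside `candidate_sizes` need to be examined for this threshold.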
