Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Returning currency without price #12

Open
rennerocha opened this issue Jul 3, 2019 · 7 comments
Open

Returning currency without price #12

rennerocha opened this issue Jul 3, 2019 · 7 comments

Comments

@rennerocha
Copy link

Does it makes sense to return a Price instance with currency set and without a price value?

Actually if we have a substring inside a string that matches with an existing currency, it will be returned even if we are not mentioning the currency:

In [1]: Price.fromstring("STRING WITH NO PRICE BUT HAS A WORD THAT CONTAINS PART OF A CURRENCY NAME: EUROPE")
Out[1]: Price(amount=None, currency='EUR')

This happens because we use regexes to match the currencies inside a string. But this approach would create a problem when single-letter currencies is handled better (#3) .

I can see two options to handle this scenario:

  1. Consider as currency only if the substring is surrounded by whitespaces:
In [2]: Price.fromstring("SOMETHING EUROPE SOMETHING")
Out[2]: Price(amount=None, currency=None)

In [3]: Price.fromstring("SOMETHING EUR SOMETHING")
Out[3]: Price(amount=None, currency='EUR')
  1. Consider as currency only if the entire string matches exactly with the currency name:
In [4]: Price.fromstring("EUR")
Out[4]: Price(amount=None, currency='EUR')

In [5]: Price.fromstring("SOMETHING EUR SOMETHING")
Out[5]: Price(amount=None, currency=None)
@Gallaecio
Copy link
Member

I think option 1 is the way to go. We just need to improve the regular expressions that we use to match currencies so that "Europa" does not count as "EUR".

@rennerocha
Copy link
Author

What about single letter currencies? Do the same?

In [1]: Price.fromstring("SOMETHING R SOMETHING")
Out[1]: Price(amount=None, currency='R')

In [1]: Price.fromstring("SOMETHING WORD SOMETHING")
Out[1]: Price(amount=None, currency=None)

@Gallaecio
Copy link
Member

Gallaecio commented Jul 4, 2019

I think so.

When a real-life scenario comes up where this is troublesome, we can look into improvements to solve or mitigate such issues. For example, ignore specific currency expressions on specific locales (I assume we are getting locale support à la dateparser eventually), such as ignoring "R" on languages where "R" is a common word and they always use a different text when referring to the "R" currency.

@ejulio
Copy link

ejulio commented Oct 7, 2019

I think a price is a combination of currency and value.
@rennerocha , can you share a specific case where this behavior is required?
Like, having the currency/symbol, but not the price?

My concern is that, as it is a specific case, it could be in a place specific for currency parsing from a string, instead of returning a price without value.

@kmike
Copy link
Member

kmike commented Oct 18, 2019

This is not very constructive, I know.. Just an observation: converting HTML to text is a tricky problem, which may require a full browser to do properly; e.g. depending on a website, a whitespace should or shouldn't be inserted around <span> tags. This means we can't 100% rely on whitespaces in these strings handled properly. Option 1 can still be a good tradeoff though.

@Manish-210
Copy link

I think instead of relying on whitespaces completely, it will be good to check that the currency's string ( eg: EUR ) is not surrounded by alphabets.

In [2]: Price.fromstring("SOMETHING EUROPE SOMETHING")
Out[2]: Price(amount=None, currency=None)

because there is " O " after EUR

In [2]: Price.fromstring("EUR SOMETHING")
Out[2]: Price(amount=None, currency='Eur')

I also agree with the @kmike tags in HTML will mostly surround the strings.
Consider:

In [2]: Price.fromstring("<span>EUR something </span>")
Out[2]: Price(amount=None, currency='Eur')

In the above example, EUR is not surrounded by any alphabet, it is surrounded by >. So it might be a good idea to only check if there is not any alphabet present adjacent to a string of currency.

@Gallaecio
Copy link
Member

I think the input to price-parser should not include HTML tags, I think the user it meant to clean that up before passing the input to number-parser. I think @kmike’s point is that, when a user does that clean up, the result may not have those spaces.

I’m thinking for example of 10 EUR<span>*</span>, which could (and probably should) result in 10 EUR* when HTML is stripped, and we should aim for extraction to work nonetheless. This example is rather straightforward, but things can get messier. Think, for example, of 10 EUR<sup>no taxes</sup> resulting in 10 EURno taxes; to support something like that, we would need a more complex solution, but it would still be doable (e.g. support all-uppercase currency codes if followed by a lowercase letter). So, using word boundaries is a good start, but things could get more complicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants