FIX parsing for Long in scientific notation #674

ilmirons · 2025-02-10T15:52:10Z

Hi. I noticed DeferNum uses parseLongUnsafe to parse number strings in scientific notation format, which seems not to be correct. I don't know if this is a performance choice (I get converting with n.toLong may be slower), but shouldn't at that point throw immediately be a better option? I wrote some tests and swapped parseLongUnsafe(s) with n.toLong. Give it a look when you have time.

Thanks for the great work,

Andrea

rossabaker

Thanks! A few initial thoughts inline:

ast/shared/src/test/scala/jawn/ArbitraryUtil.scala

rossabaker · 2025-02-10T23:24:21Z

ast/shared/src/main/scala/jawn/ast/JValue.scala


  final override def asInt: Int = n.toInt
-  final override def asLong: Long = util.parseLongUnsafe(s)
+  final override def asLong: Long = n.toLong


Checking my understanding: right now it sometimes returns an incorrect value, and you want it to throw? Changing behavior to start throwing scares me, particularly in a method not documented to throw. But silently returning bad values isn't great, either. And I guess it would align the behavior more closely with .asInt.

There's a util.parseLong(s), which isn't as fast as util.parseLongUnsafe(s), but about 30% faster than .toLong and does throw on invalid input. Maybe that would be a better implementation?

No, I would like it not to throw but to produce the correct result.

Looking at both parseLong and parseLongUnsafe I see no specific handling for exponent and also as stated in the scaladoc of the former

Stated more precisely, accepted values:

- conform to the pattern: -?(0|([1-9][0-9]*)) - are within [-9223372036854775808, 9223372036854775807]

So these functions cannot handle 2e3 or any number in scientific notation or with a '.' by design, but still this kind of strings are passed to DeferNum
I see 3 solutions to this:

Just use .toLong. Simple, not so performant, but correct. Also the performance loss would be only when accessing the Long value of a Double through JValue AST, not during parsing

Use parseLong, that you mentioned: according to code in JawnFacade it will throw on any input, as Strings passed to DeferNum either have a '.' or a 'e'

// from JawnFacade final def jnum(s: CharSequence, decIndex: Int, expIndex: Int): JValue = if (decIndex == -1 && expIndex == -1) DeferLong(s.toString) else DeferNum(s.toString)

So basically we would fall in (non-)solution 3
3. State clearly a number in scientific notation/with a decimal point can only be a Double. It's up to the dev know how the field he is getting has been written. Getting a Double as a Long produces error Hence throw immediately or return Option.empty if trying to get a Long from DeferNum. This is the (non-)solution I like the least, but has 0 impact on performances and we are talking about a very rare case. That's saying: we choose not to support long in scientific notation

Edit: I initially thought it was just scientific notation and adding a flag could solve the problem, but looking again at the code I realized even decimal point is not handled by parseLongUnsafe/parseLong.

rossabaker · 2025-02-10T23:28:23Z

ast/shared/src/main/scala/jawn/ast/JValue.scala

  final override def getInt: Option[Int] = Some(n.toInt)
-  final override def getLong: Option[Long] = Some(util.parseLongUnsafe(s))
+  final override def getLong: Option[Long] = Some(n.toLong)


I think it's strange that both of these methods return Some with the JNum is not integral, but I guess that's probably behavior best left alone at this point.

I'm not sure I understood: are you suggesting to leave the code as it is? I made a small demo program and as it is it produces wrong results, unless I'm missing something.

// Scala 2.12.20 // Temurin 22.0.2 // jawn-util 1.6.0 package com.example import org.typelevel.jawn.util.{parseLong, parseLongUnsafe} object Main { def main(args: Array[String]): Unit = { List( ("2.0", 2.0), // parseLongUnsafe(2.0) = 180 (expected 2.0) ("2.5", 2.5), // parseLongUnsafe(2.5) = 185 (expected 2.5) ("2e3", 2e3), // parseLongUnsafe(2e3) = 733 (expected 2000.0) ("2.5e0", 2.5e0), // parseLongUnsafe(2.5e0) = 19030 (expected 2.5) ("2e+3", 2e+3), // parseLongUnsafe(2e+3) = 7253 (expected 2000.0) ("2.5e-1", 2.5e-1), // parseLongUnsafe(2.5e-1) = 190271 (expected 0.25) ("9.223372036854776e18", 9.223372036854776e18), // parseLongUnsafe(9.223372036854776e18) = -4010348331692976762 (expected 9.223372036854776E18) ("-9.223372036854776e+18", -9.223372036854776e18)) // parseLongUnsafe(-9.223372036854776e+18) = 3209995169510665050 (expected -9.223372036854776E18) .foreach { t => try { println(s"parseLongUnsafe(${t._1}) = " + parseLongUnsafe(t._1) + " (expected " + t._2 + ")") } catch { // when switching to parseLong everything falls here case e: Throwable => println(s"parseLongUnsafe(${t._1}) = " + e) } } } }

PS digging down the rabbit hole it seems that also the use of parseLongUnsafe in DeferLong is not correct, as there is no check at parser level that the digit string parsed represent a number in the Long range, just that there is no . or e.

// Parser.scala lines 166-176 if (c == '-') { j += 1 c = at(j) } if (c == '0') { j += 1 c = at(j) } else if ('1' <= c && c <= '9') while ({ j += 1; c = at(j); '0' <= c && c <= '9' }) () // can easily pass digit strings that are out of Long range else die(i, "expected digit") // No further checks in the lines below

After the string is passed to jnum in JawnFacade (see code in comment above)
And in DeferLong

// JValue.scala, lines 224-226 case class DeferLong(s: String) extends JNum { lazy val n: Long = util.parseLongUnsafe(s) // should be parseLong

So this too can produce wrong results for number outside of Long range (which are anyway valid as JSON as everything is supposed to be encoded as double). This also hint at the matter of who is in charge of semantic checks: IMHO should be the Facade, but comment on parseLongUnsafe seems to suggest the contrary:

/** * Parse the given character sequence as a single Long value (64-bit signed integer) in decimal (base-10). * * For valid inputs, this method produces the same values as `parseLong`. However, by avoiding input validation it is * up to 50% faster. * * For inputs which `parseLong` throws an error on, `parseLongUnsafe` may (or may not) throw an error, or return a * bogus value. This method makes no guarantees about how it handles invalid input. * * This method should only be used on sequences which have already been parsed (e.g. by a Jawn parser). When in doubt, * use `parseLong(cs)`, which is still significantly faster than `java.lang.Long.parseLong(cs.toString)`. */ def parseLongUnsafe(cs: CharSequence): Long = { // ...

Tried to fix this too, but somehow we lose idempotency on JS CharSequence/String parsing. I was unable to investigate how this happens, but it's just JS. Would it be a good idea to rewire CharSequence to String parsing in JS implementation? At the end of the day we have not a CharSequence type in JS.

Fixed problem with last commit. Was a more general issue about equality.

ilmirons added 2 commits February 10, 2025 16:00

FIX parsing for Long numbers in scientific notation

7b60483

FIX syntax

6eb482e

ilmirons mentioned this pull request Feb 10, 2025

FIX parsing for Long in scientific notation #673

Closed

ilmirons changed the title ~~Scientific notation ast support~~ FIX parsing for Long in scientific notation Feb 10, 2025

rossabaker reviewed Feb 10, 2025

View reviewed changes

ilmirons requested a review from rossabaker February 13, 2025 10:00

ilmirons added 9 commits February 13, 2025 12:38

Rewrote Arbitrary for scientific notation tests as lists

978bbfc

Swapped parseLongUnsafe for safer parseLong in DeferLong

f8c8a70

FIX formatting

27ea0c3

FIX formatting (2)

d2dd387

FIX formatting (3)

bbebe50

FIX match expression (stable type)

f56d052

FIX binary compatibility (resume old iface)

b59ab4f

FIX formatting

5c54741

FIX Forgot for a moment Long is not nullable :/

8350e79

ilmirons marked this pull request as draft February 17, 2025 08:51

ilmirons added 3 commits February 21, 2025 11:20

No idempotency on JS Charsequence/String rendering

6870485

Formatting

70ea50c

FIX equality for DeferLong

f79eaa0

ilmirons marked this pull request as ready for review March 4, 2025 17:04

Merge branch 'main' into scientific-notation-ast-support

4925dd7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

FIX parsing for Long in scientific notation #674

FIX parsing for Long in scientific notation #674

Uh oh!

ilmirons commented Feb 10, 2025 •

edited

Loading

Uh oh!

rossabaker left a comment

Uh oh!

Uh oh!

rossabaker Feb 10, 2025

Uh oh!

rossabaker Feb 10, 2025

Uh oh!

ilmirons Feb 11, 2025 •

edited

Loading

Uh oh!

rossabaker Feb 10, 2025

Uh oh!

ilmirons Feb 11, 2025 •

edited

Loading

Uh oh!

ilmirons Feb 13, 2025

Uh oh!

ilmirons Feb 21, 2025

Uh oh!

ilmirons Mar 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FIX parsing for Long in scientific notation #674

Are you sure you want to change the base?

FIX parsing for Long in scientific notation #674

Uh oh!

Conversation

ilmirons commented Feb 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rossabaker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rossabaker Feb 10, 2025

Choose a reason for hiding this comment

Uh oh!

rossabaker Feb 10, 2025

Choose a reason for hiding this comment

Uh oh!

ilmirons Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rossabaker Feb 10, 2025

Choose a reason for hiding this comment

Uh oh!

ilmirons Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilmirons Feb 13, 2025

Choose a reason for hiding this comment

Uh oh!

ilmirons Feb 21, 2025

Choose a reason for hiding this comment

Uh oh!

ilmirons Mar 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ilmirons commented Feb 10, 2025 •

edited

Loading

ilmirons Feb 11, 2025 •

edited

Loading

ilmirons Feb 11, 2025 •

edited

Loading