Paths() Function Support #277

Open
osevill opened this issue Mar 30, 2025 · 10 comments

Comments

@osevill

osevill commented Mar 30, 2025

I currently use jq to convert a json object to csv format, generically without referring to field names (usually a json object with an array). The filter looks like this:
jq -r '(.[] | map(paths(scalars|true)) | unique) as $cols | map(.[] as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) |  map(@csv) | .[]'
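For readers less familiar with jq, the filter above can be read as three steps: collect the union of all scalar paths as columns, look up every column in every record, and emit a header plus one row per record. A minimal Python sketch of that logic follows. It only models the semantics; the helper names (`scalar_paths`, `get_path`, `to_table`) and the sample document are hypothetical, and `get_path` simplifies jq's `getpath` by returning None wherever a step cannot be followed.

```python
def scalar_paths(value, prefix=()):
    """Yield the path (tuple of keys/indices) of every scalar inside value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from scalar_paths(child, prefix + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from scalar_paths(child, prefix + (index,))
    else:
        yield prefix


def get_path(value, path):
    """Follow path step by step; any step that cannot be followed gives None."""
    for step in path:
        if isinstance(value, dict):
            value = value.get(step)
        elif isinstance(value, list) and isinstance(step, int) and 0 <= step < len(value):
            value = value[step]
        else:
            return None
    return value


def sort_key(path):
    """Order paths the way jq orders arrays: numbers before strings, element by element."""
    return tuple((0, step) if isinstance(step, int) else (1, step) for step in path)


def to_table(document):
    """Return (header, rows) for a document shaped like {"some_key": [record, ...]}."""
    records = [rec for array in document.values() for rec in array]
    columns = sorted({p for rec in records for p in scalar_paths(rec)}, key=sort_key)
    header = [".".join(str(step) for step in path) for path in columns]
    rows = [[get_path(rec, path) for path in columns] for rec in records]
    return header, rows


# Hypothetical sample input, not taken from the thread.
doc = {"items": [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "meta": {"ok": True}}]}
header, rows = to_table(doc)
print(header)  # ['id', 'meta.ok', 'tags.0', 'tags.1']
print(rows)    # [[1, None, 'a', 'b'], [2, True, None, None]]
```

The sketch stops before CSV quoting, which the original filter delegates to jq's @csv.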

I would like to see if there's any performance gain using jaq for the above task, but when I get started, trying paths(scalars), I get an "undefined filter" error.
Since the function paths() doesn't seem to be supported, is there an alternate way to get paths to scalar values in jaq?

Also, is there jaq-specific documentation that lists available operators and functions?

Thanks!

@01mf02
Owner

01mf02 commented Mar 31, 2025

Hi @osevill, you should be able to test your filter with jaq using two changes:

  • Instead of getpath, you can use jqjq's _getpath. Just copy def _getpath($path): ...; to the beginning of your filter. I would then also put your filter into a file and import that with -f file.jq.
  • Instead of paths(scalars | true), you should be able to use something like paths as $p | _getpath($p) | scalars | $p. This is probably slower than jq's paths(scalars), but that is unlikely to matter much in your case, because that code does not seem to be in a hot loop.
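The second suggestion enumerates every path first and then keeps only those whose value is a scalar. A rough Python model of that idea is sketched below; the function names are hypothetical and the code mirrors the shape of the jq expression rather than jaq's implementation. Each candidate path is followed again from the root, which is one plausible reason for the extra cost mentioned above.

```python
def all_paths(value, prefix=()):
    """Model of paths/0: yield the path of every descendant, containers included."""
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        return  # scalars have no descendants
    for key, child in children:
        yield prefix + (key,)
        yield from all_paths(child, prefix + (key,))


def lookup(value, path):
    """Follow a path that is known to exist in value."""
    for step in path:
        value = value[step]
    return value


def is_scalar(value):
    """True for anything that is not an object or an array (null and false included)."""
    return not isinstance(value, (dict, list))


def scalar_paths_via_lookup(value):
    """Model of: paths as $p | _getpath($p) | scalars | $p."""
    return [p for p in all_paths(value) if is_scalar(lookup(value, p))]


# Hypothetical sample input.
sample = {"a": 1, "b": {"c": [2, {"d": None}]}}
print(scalar_paths_via_lookup(sample))
# [('a',), ('b', 'c', 0), ('b', 'c', 1, 'd')]
```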

Let us know what your performance (and your final filter) looks like!

Regarding jaq-specific documentation: I've submitted a PR to the jq repository that adds this information to the jq manual itself, showing for example that paths(f) is not supported in jaq. However, I have not heard anything back in three months. :( The jq manual, IMO, is the logical place for such information.

@01mf02
Owner

01mf02 commented Mar 31, 2025

(Small improvement suggestion: . | map(tostring) is equivalent to just map(tostring), because . | f is equivalent to f for any f.)

@osevill
Author

osevill commented Apr 1, 2025

Thank you for the suggestions. Will implement and let you know.

@01mf02
Owner

01mf02 commented Apr 9, 2025

Hi @osevill, just wanted to let you know that your use case motivated me to implement support for getpath and paths/1 (#280). This should also increase the performance of paths/0. If you want to give it a try, that jaq branch should be able to run your filter now.
(It would also be nice to have some example input that you run your filter on.)

@osevill
Author

osevill commented Apr 10, 2025

tyvm.
Will look into it further this weekend.
Always relied on the binary releases previously, so I would have to install the rust compiler and follow the instructions on the main project page to compile from the path-values branch.
If I run this:

$ cargo install --locked jaq
$ cargo install --locked --git https://github.com/01mf02/jaq # latest development version

...will I get the path-values branch, or do I need to specify the branch above?

Also, what's the preferred method for providing you sample json input data?
Thx.

@01mf02
Owner

01mf02 commented Apr 10, 2025

...will I get the path-values branch, or do I need to specify the branch above?

$ cargo install --locked --git https://github.com/01mf02/jaq --branch path-values

should do it.

Also, what's the preferred method for providing you sample json input data?

Just posting one or two lines of JSON in this thread, for example. :)

The best thing would be to have a small jq script that produces arbitrarily large input data.
For example, to produce a large array: jq -n '[limit(1000; repeat("Hello world!"))]'
That makes it easier to benchmark things, because one can adapt the size of the input data until the execution of the script to evaluate takes a convenient amount of time.
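The same idea, a generator whose output size is controlled by a single number, can be sketched in Python for anyone who prefers to script the input outside jq. The function name `make_input` and its default item are assumptions made for illustration; the default mirrors the "Hello world!" array from the jq one-liner above.

```python
import json


def make_input(n, item="Hello world!"):
    """Return JSON text for an array that repeats item n times."""
    return json.dumps([item] * n)


# Adjust n until the filter under test runs for a convenient amount of time.
small = make_input(3)
print(small)  # ["Hello world!", "Hello world!", "Hello world!"]
print(len(json.loads(make_input(1000))))  # 1000
```

Passing a dictionary as item produces an array of identical records instead of strings.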

@osevill
Author

osevill commented Apr 12, 2025

Successfully compiled the path-values branch and did some testing with the filter in my original post:

jq -r '(.[] | map(paths(scalars|true)) | unique) as $cols | map(.[] as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) |  map(@csv) | .[]'

The updated jaq seems to work fine with this filter when the keys are the same in every element of the json array. However, when keys differ from element to element, I get the following error:
Error: cannot use null as iterable (array or object)

Here's a sample schema typical of what I use with jq to convert to csv. There are 3 elements in the array. The first includes inner_array_1, the second includes inner_array_2, and the third element includes both. (For my use case with the above jq filter, it's appropriate to consider the nested array keys as part of the respective outer array element, i.e., no matter how many elements are in the inner arrays, there should only be 3 csv rows because the nested array elements don't define new records. Only the outer array elements define new records in this use case.) I also added an additional scalar "field6" key to the third outer array element.

{
  "outer_array": [
    {
      "record_no": 1,
      "inner_array_1": [
        {
          "IA1_field1": "IA1_value_1",
          "IA1_field2": "IA1_value_2",
          "IA1_field3": "IA1_value_3"
        },
        {
          "IA1_field1": "IA1_value_4",
          "IA1_field2": "IA1_value_5",
          "IA1_field3": "IA1_value_6"
        },
        {
          "IA1_field1": "IA1_value_7",
          "IA1_field2": "IA1_value_8",
          "IA1_field3": "IA1_value_9"
        }
      ],
      "OA_field1": {
        "name": "name1"
      },
      "OA_field2": 2,
      "OA_field3": 3,
      "OA_field4": 4,
      "OA_field5": 5
    },
    {
      "record_no": 2,
      "inner_array_2": [
        {
          "IA2_field1": "IA2_value_1",
          "IA2_field2": "IA2_value_2",
          "IA2_field3": "IA2_value_3"
        },
        {
          "IA2_field1": "IA2_value_4",
          "IA2_field2": "IA2_value_5",
          "IA2_field3": "IA2_value_6"
        },
        {
          "IA2_field1": "IA2_value_7",
          "IA2_field2": "IA2_value_8",
          "IA2_field3": "IA2_value_9"
        }
      ],
      "OA_field1": {
        "name": "name2"
      },
      "OA_field2": "b",
      "OA_field3": "c",
      "OA_field4": "d",
      "OA_field5": "e"
    },
    {
      "record_no": 3,
      "inner_array_1": [
        {
          "IA1_field1": "IA1_value_10",
          "IA1_field2": "IA1_value_11",
          "IA1_field3": "IA1_value_12"
        },
        {
          "IA1_field1": "IA1_value_13",
          "IA1_field2": "IA1_value_14",
          "IA1_field3": "IA1_value_15"
        },
        {
          "IA1_field1": "IA1_value_16",
          "IA1_field2": "IA1_value_17",
          "IA1_field3": "IA1_value_18"
        }
      ],
      "inner_array_2": [
        {
          "IA2_field1": "IA2_value_10",
          "IA2_field2": "IA2_value_11",
          "IA2_field3": "IA2_value_12"
        },
        {
          "IA2_field1": "IA2_value_13",
          "IA2_field2": "IA2_value_14",
          "IA2_field3": "IA2_value_15"
        },
        {
          "IA2_field1": "IA2_value_16",
          "IA2_field2": "IA2_value_17",
          "IA2_field3": "IA2_value_18"
        }
      ],
      "OA_field1": {
        "name": "name3"
      },
      "OA_field2": 7,
      "OA_field3": 8,
      "OA_field4": "f",
      "OA_field5": "g",
      "OA_field6": "h"
    }
  ]
}

Even when I simplify the sample json to just 2 records with 2 keys each (one scalar that matches between the records and one object that differs), I get the same "cannot use null as iterable" error.

{
  "outer_array": [
    {
      "record_no": 1,
      "OA_field1": {
        "name": "name1"
      }
    },
    {
      "record_no": 2,
      "OA_field2": {
        "name": "name2"
      }
    }
  ]
}

...but if both keys of the record are scalar and the second scalar key differs between records, the path-values jaq branch successfully converts to csv:

{
  "outer_array": [
    {
      "record_no": 1,
      "OA_field1": "name1"
    },
    {
      "record_no": 2,
      "OA_field2": "name2"
    }
  ]
}

Regards

@01mf02
Owner

01mf02 commented Apr 14, 2025

The "cannot use null as iterable" error is to be expected, because of the way I implemented getpath in jaq. In a nutshell, jaq expands getpath([$a, $b, $c]) to .[$a][$b][$c], and indexing a null value yields an error in jaq, which explains what you are seeing.
Your filter works in jaq with a small change, namely replacing getpath($col) with getpath($col)? // null. That should preserve the behaviour of your filter in jq and also make it explicit what should happen when you try to access a value at a path that does not exist.

(.[] | map(paths(scalars|true)) | unique) as $cols |
map(.[] as $row | ($cols | map(. as $col | $row | getpath($col)? // null))) as $rows |
([($cols | map(. | map(tostring) | join(".")))] + $rows) |  map(@csv) | .[]
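The difference between the failing and the fixed lookup can be modelled in a few lines of Python. This is only a sketch of the behaviour described in this comment, not jaq's code: `JqError`, `index`, `getpath_strict` and `getpath_or_null` are hypothetical names, and the model assumes that a missing object key yields null while indexing null raises an error. It also reflects that jq's // operator replaces false as well as null with its right-hand side.

```python
class JqError(Exception):
    """Stands in for a runtime error in this model."""


def index(value, key):
    """Model of .[key]: a missing key gives None, indexing None is an error."""
    if isinstance(value, dict) and isinstance(key, str):
        return value.get(key)
    if isinstance(value, list) and isinstance(key, int):
        return value[key] if 0 <= key < len(value) else None
    raise JqError(f"cannot index {value!r} with {key!r}")


def getpath_strict(value, path):
    """Model of getpath([$a, $b, $c]) expanded to .[$a][$b][$c]."""
    for key in path:
        value = index(value, key)
    return value


def getpath_or_null(value, path):
    """Model of getpath($col)? // null: errors and false both become None."""
    try:
        result = getpath_strict(value, path)
    except JqError:
        return None
    return None if result is False else result


# The two simplified records from earlier in the thread.
rec1 = {"record_no": 1, "OA_field1": {"name": "name1"}}
rec2 = {"record_no": 2, "OA_field2": {"name": "name2"}}

print(getpath_strict(rec1, ["OA_field1", "name"]))   # name1
print(getpath_or_null(rec2, ["OA_field1", "name"]))  # None
```

In this model, calling getpath_strict on rec2 with the same path raises JqError, because the first step yields None and the second step then tries to index it. A path of length one never reaches that second step, which matches the observation that records differing only in scalar keys convert without error.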

@01mf02
Owner

01mf02 commented Apr 14, 2025

I made a small change to implement getpath more efficiently (c7fa9c0). After this, jaq's performance is a bit better than jq's, but not by much:

$ hyperfine -M 2 -L jq jq,target/release/jaq '{jq} -f osevill.jq bla.json'
Benchmark 1: jq -f osevill.jq bla.json
  Time (mean ± σ):      1.345 s ±  0.027 s    [User: 1.288 s, System: 0.049 s]
  Range (min … max):    1.326 s …  1.365 s    2 runs
 
Benchmark 2: target/release/jaq -f osevill.jq bla.json
  Time (mean ± σ):      1.300 s ±  0.002 s    [User: 1.280 s, System: 0.013 s]
  Range (min … max):    1.298 s …  1.302 s    2 runs
 
Summary
  target/release/jaq -f osevill.jq bla.json ran
    1.03 ± 0.02 times faster than jq -f osevill.jq bla.json

I generated bla.json by repeating {"record_no": 1, ...} 10,000 times. I also added | empty at the end of your filter in order not to measure output performance.
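The "1.03 ± 0.02" in the summary can be reproduced from the reported means and standard deviations. The sketch below assumes standard first-order error propagation for a ratio of two independent measurements, which appears consistent with the figures hyperfine printed; the variable names are chosen for illustration.

```python
import math

# Mean and standard deviation in seconds, copied from the benchmark output above.
jq_mean, jq_sd = 1.345, 0.027
jaq_mean, jaq_sd = 1.300, 0.002

# Speedup of jaq relative to jq.
ratio = jq_mean / jaq_mean

# Relative errors of the two measurements add in quadrature for a ratio.
ratio_sd = ratio * math.sqrt((jq_sd / jq_mean) ** 2 + (jaq_sd / jaq_mean) ** 2)

print(f"{ratio:.2f} ± {ratio_sd:.2f}")  # 1.03 ± 0.02
```

With only two runs per command, the uncertainty is dominated by the spread of the jq timings.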

@osevill
Author

osevill commented Apr 15, 2025

Thank you again for your time working on this request. I'll recompile the path-values branch to get the updated getpath, and try it out again. Should have some time toward the end of the week.
Regards.

2 participants