
Releases: johnkerl/miller

sort-within-records, unsparsify -f, misc updates; Go-port beta

29 Nov 21:07

Features

Bugfixes

  • The count -n feature was not implemented as intended; this is now fixed, fulfilling #370, reported by @aborruso.
  • Pretty-print format now works correctly with --headerless-csv-output, as reported on #384 by @agguser.
  • The seqgen verb now correctly tracks NR and FNR in the records it emits.
  • An intermittent JSON-parsing bug reported on #394 by @sjackman has been fixed.

Documentation

This is the first release since the readthedocs move as requested by @pabloab on #375. The intention is that you will be able to select documentation specific to 5.10.0 there; I may have something to fix here.

Go-port preview

While the mods for this 5.10.0 release are quite minor, intense development time has been spent over the last few months on the Go port, tracked here and here, which will ultimately become Miller 6.

The completion of the port is still some months away. While most verbs, and most of the DSL, have been ported -- with many new features in place as tracked here -- significant gaps remain. These include the "big" verbs join, nest, reshape, stats1, and stats2, along with all the date-time-related DSL functions, etc.

Nonetheless, if you wish to experiment with the Go executables for the Miller 6 beta, please find MacOS and Linux versions attached. (I don't know how to make these for Windows yet, sorry!)

I'd love any and all advance help with the Go port, including bug reports, feature requests, etc. -- from Miller end-users and developers alike. This is exciting and fulfilling work, and I look forward to getting it completed.

Security update: disallow --prepipe in .mlrrc

03 Sep 01:34

As of Miller 5.9.0, you can have a .mlrrc file containing preferred flags.

As reported in #363, someone could prepare a repository, zipfile, or tarfile (containing datasets, for example) and send it to you. If a .mlrrc file in it contained a line of the form prepipe do_something_bad; cat, then running any mlr command in there would run the do_something_bad command (whatever that might be).

The fix is to (a) disallow prepipe within .mlrrc files and (b) as a consolation, allow new prepipe-zcat and prepipe-gunzip options, which are safe to use.
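
The idea behind (a) and (b) can be sketched as a filter applied while reading an rc file. This is only a Python illustration, not Miller's C implementation: the function name load_rc_lines and the line syntax it accepts (an option name followed by a space or "=") are assumptions made for the example.

```python
# Hypothetical sketch of the rule described above; not Miller's actual parser.

def load_rc_lines(lines):
    """Split rc-file lines into (accepted, rejected) option lines."""
    accepted, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The option name is the first token, before any '=' or whitespace.
        tokens = line.lstrip("-").replace("=", " ", 1).split(None, 1)
        if not tokens:
            continue
        if tokens[0] == "prepipe":
            # An arbitrary shell command taken from an untrusted file: refuse.
            rejected.append(line)
        else:
            # Everything else passes, including prepipe-zcat / prepipe-gunzip,
            # which name a fixed decompressor rather than a free-form command.
            accepted.append(line)
    return accepted, rejected
```

The point of the design is that the fixed-name options carry no user-supplied command text, so a file shipped inside someone else's archive cannot use them to run anything unexpected.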

This is published as CVE-2020-15167. Many thanks to @koernepr for the report!

.mlrrc feature, and fix Windows build

19 Aug 18:54
  • You can now save common defaults in a ~/.mlrrc. For example, if you normally process CSV files, you can say that in your ~/.mlrrc and you can leave off the --csv flag from your mlr commands. You can read more about this feature here, or in man mlr, or in mlr --help. This feature was requested in #339.
  • The AppVeyor build is now unbroken and as a result there are Windows artifacts for this build. Sorry about the delay!! :^/

Better environment-variable support, new 'count' verb, bugfixes

03 Aug 23:19

Features

  • The new count verb is a keystroke-saver for stats1 -a count -f {some field name}.
  • --jsonx and --ojsonx are keystroke-savers for --json --jvstack and --ojson --jvstack, which is to say, multi-line pretty-printed JSON format.
  • The new -s name=value feature for mlr put and mlr filter gives you simpler access to environment variables in your Miller script, as requested in #315.
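
To make the difference between single-line and multi-line ("vertically stacked") JSON concrete, here is a small Python sketch. It only illustrates the two layouts; the helper names are made up for the example, and Miller's exact whitespace may differ from what Python's json module produces.

```python
import json

record = {"a": 1, "b": "xyz"}


def single_line(rec):
    """One record per line, roughly what plain JSON output looks like."""
    return json.dumps(rec)


def stacked(rec):
    """One key-value pair per line, the multi-line pretty-printed layout."""
    return json.dumps(rec, indent=2)


print(single_line(record))
print(stacked(record))
```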

Bugfixes

  • mlr format-values is no longer SEGVing on CSV/TSV input. This was reported on #330.
  • #313 fixes a corner case when field names within command-line arguments have embedded newlines.
  • Line/column indicators for JSON-formatting error messages are now correct (previously they were showing up as 0).
  • end {print NF} no longer SEGVs. This was reported in #330.
  • Several broken doc links were fixed up as reported on #329.

Windows note

  • The AppVeyor build has been broken for a while so there is no Windows executable attached to this release -- when I fix that there will be a 5.8.1 with Windows binaries. My apologies for the delay. Issue #354 is open to track this.

Ports, bugfixes, and keystroke-savers

17 Mar 03:24

Ports

Features

Bugfixes

  • A bug regarding optional regex-pattern groups was fixed in #277.
  • As of #294 you can now specify --implicit-csv-header for the join-file in mlr join.
  • A bug with spaces in XTAB-file values was fixed on #296.
  • A bug with missing final newline for XTAB-formatted files using MMAP files was fixed on #301.

Documentation

Note

Support for mmap mode has been entirely discontinued. This is an invisible change and should not affect you at all. For anyone interested in lower-level details, though, the summary is as follows:

  • For an incremental performance gain (perhaps 10-20% of run time at most, but see below), within the C source code one can use the mmap system call to access input files via pointer arithmetic rather than malloc-and-memcopy using stdio.
  • However, mmap is not available when reading from standard input, which cannot be memory-mapped.
  • This means all file-format readers are implemented twice within the Miller source code.
  • While I try to regression-test Miller thoroughly, running all canned tests through mmap and stdio mode, I've nonetheless found my mmap implementations liable to corner-cases which I miss but users find: for example #29, #102, and #296.
  • As tracked on #160, various operating systems do not release mmapped pages after use as one might intuit. This means that for large files and/or large numbers of files, I have long needed to have Miller opt out of mmap usage for precisely those cases which most need the performance gain: see #160, #181, and #256.
  • Additionally, mmap is not used at all for Windows/MSYS2 so there is nothing to lose there.

For these reasons, keeping mmap mode isn't worth the development overhead.
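
The "implemented twice" point can be illustrated with a minimal Python sketch of the same line reader written both ways. This is not Miller's C code; the function names and the write_temp helper are invented for the example.

```python
import mmap
import os
import tempfile


def lines_via_stdio(path):
    """Read lines with ordinary buffered file I/O (this also works for pipes)."""
    with open(path, "rb") as f:
        return [line.rstrip(b"\n") for line in f]


def lines_via_mmap(path):
    """Read lines by mapping the file into memory and slicing it.

    Needs a real, non-empty file: standard input or a pipe cannot be mapped.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            lines = m[:].split(b"\n")
    # A trailing newline leaves one empty piece at the end; drop it so that
    # files with and without a final newline are handled alike.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


def write_temp(content):
    """Helper for the demo: write bytes to a temporary file, return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    return path
```

Keeping two such readers in agreement on corner cases (here, a missing final newline) is the kind of duplicated effort described in the list above.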

As of release 5.7.0, the mlr executable will still accept the --mmap and --no-mmap command-line flags as no-ops, for backward compatibility.

The caveat for you is that for everyday small files, the default was previously mmap mode and is now stdio (except mlr ... < filename or ... | mlr ... which have always used stdio). There is the off chance that this will newly reveal an old, latent bug or two somewhere.

I've re-run regressions in valgrind mode to aggressively catch any errors, but please let me know ASAP via a GitHub issue about any unexpected behavior in 5.7.0.

Miller 5.6.2: Bug fix for CSV/TSV with many files

22 Sep 00:20

Bug fixes:

  • #271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.

Documentation:

Mobile-friendly docs

17 Sep 03:23

The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly.

All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0

[Screenshots: the documentation site before and after the change]

System calls / external commands, ASV/USV support, and bulk numeric formatting

13 Sep 02:25

Features:

  • The new system DSL function allows you to run arbitrary shell commands and store their output in field values. Some example usages are documented here. This is in response to issues #246 and #209.

  • There is now support for ASV and USV file formats. This is in response to issue #245.

  • The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.
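
For readers unfamiliar with ASV and USV: they are delimiter-separated formats like CSV, but they use dedicated separator characters instead of commas and newlines. ASV uses the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between records; USV uses the visible Unicode symbols U+241F and U+241E. The Python sketch below assumes a header-first layout as in CSV; the function name is hypothetical and this is not Miller's reader.

```python
ASV_FS, ASV_RS = "\x1f", "\x1e"      # ASCII unit separator, record separator
USV_FS, USV_RS = "\u241f", "\u241e"  # their visible Unicode counterparts


def parse_separated(text, fs, rs):
    """Parse header-first delimited text into a list of dicts."""
    rows = [chunk.split(fs) for chunk in text.split(rs) if chunk]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


asv_text = "a\x1fb\x1e1\x1f2\x1e3\x1f4\x1e"
print(parse_separated(asv_text, ASV_FS, ASV_RS))
```

Because these separators essentially never occur inside ordinary data, field values can contain commas, tabs, and newlines without any quoting rules.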

Documentation:

Bugfixes:

Note:

Thanks to @aborruso @davidselassie @joelparkerhenderson for the bug reports and feature requests!! :)

Positional indexing and other data-cleaning features

01 Sep 03:03

Features:

  • The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value via $[[[3]]]. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.

  • There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.

  • There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.

  • Another pair is #181 and #256. While Miller uses mmap internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues when reading either large files or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.

  • There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.

  • There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.

  • There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.

  • There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.
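
Two of the features above, positional indexing and urandrange, can be modeled in a few lines of Python. This is only a sketch of the semantics as described in these notes, using an insertion-ordered dict as a stand-in for a Miller record; the function names other than urandrange are invented, and out-of-range positions are not modeled.

```python
import random


def field_name_at(record, position):
    """Name of the field at a 1-up position, like reading $[[3]]."""
    return list(record)[position - 1]


def field_value_at(record, position):
    """Value of the field at a 1-up position, like reading $[[[3]]]."""
    return list(record.values())[position - 1]


def rename_at(record, position, new_name):
    """Rename the field at a position, keeping field order (like assigning to $[[3]])."""
    old_name = field_name_at(record, position)
    return {(new_name if k == old_name else k): v for k, v in record.items()}


def urandrange(low, high):
    """Same formula as in the notes: low + (high - low) * urand()."""
    return low + (high - low) * random.random()


rec = {"a": 1, "b": 2, "c": 3}
print(field_name_at(rec, 3), field_value_at(rec, 3))
print(rename_at(rec, 3, "z"))
print(urandrange(2, 5))
```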

Documentation:

Bugfixes:

  • There was a SEGV using nest within then-chains, fixed in response to #220.

  • Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.

An extra thank-you:

I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (#212). Mille grazie al mio amico @aborruso!

New data-cleaning features, Windows mlr.exe, limited localtime support, and bugfixes

14 Oct 20:25

Features:

Builds:

Documentation:

Bugfixes:

  • There was a memory leak for TSV-format files only, as reported by @treynr on #181.

  • Dollar signs in regular expressions were not being escaped properly, as reported by @dohse on #171.