Releases: johnkerl/miller
sort-within-records, unsparsify -f, misc updates; Go-port beta
Features
- The unsparsify -f feature fulfills #387 from @sjackman.
- The new sort-within-records verb is a long-standing request. It was already underway in the Go port and has been backported to C.
- Likewise the truncate DSL function.
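As a rough illustration of what these two additions do, here is a minimal Python sketch. It is not Miller code. It assumes that sort-within-records orders a record's fields lexically ascending by field name, and that truncate keeps at most a given number of leading characters; the function names are chosen for this example only.

```python
def sort_within_record(record):
    """Return a copy of the record with its fields ordered by field name."""
    return dict(sorted(record.items()))


def truncate(s, max_len):
    """Keep at most max_len leading characters of s."""
    return s[:max_len]


record = {"c": 3, "a": 1, "b": 2}
print(sort_within_record(record))  # {'a': 1, 'b': 2, 'c': 3}
print(truncate("abcdefgh", 3))     # abc
```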
Bugfixes
- The count -n feature was not implemented as intended. This fulfills #370, reported by @aborruso.
- Pretty-print format now works correctly with --headerless-csv-output, as reported on #384 by @agguser.
- The seqgen verb now correctly tracks NR and FNR in the records it emits.
- An intermittent JSON-parsing bug reported on #394 by @sjackman has been fixed.
Documentation
This is the first release since the readthedocs move as requested by @pabloab on #375. The intention is that you will be able to select documentation specific to 5.10.0 there; I may have something to fix here.
Go-port preview
While the mods for this 5.10.1 release are quite minor, intense development time has been spent over the last few months on the Go port, tracked here and here, which will ultimately become Miller 6.
The completion of the port is still some months away. While most verbs, and most of the DSL, have been ported -- with many new features in place as tracked here -- significant gaps remain. These include the "big" verbs join, nest, reshape, stats1, and stats2, along with all the date-time-related DSL functions, etc.
Nonetheless, if you wish to experiment with the Go executables for the Miller 6 beta, please find MacOS and Linux versions attached. (I don't know how to make these for Windows yet, sorry!)
I'd love any and all advance help with the Go port including bug reports, feature requests, etc. -- both from Miller end-users as well as developers. This is exciting and fulfilling work, and I look forward to getting it completed.
Security update: disallow --prepipe in .mlrrc
As of Miller 5.9.0, you can have a .mlrrc file containing preferred flags.
As reported in #363, it would be possible for someone to prepare a repository or some other zipfile/tarfile, for example, containing datasets, and send it to you. They could include a .mlrrc file with a line of the form prepipe do_something_bad; cat in that repository, so when you ran any mlr commands in there, it would run the do_something_bad command (whatever that might be).
The fix is to (a) disallow prepipe within .mlrrc files; and (b) as a consolation, allow new prepipe-zcat and prepipe-gunzip options which are safe to use.
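The shape of the fix can be sketched in Python as follows. This is an illustration only, not Miller's actual C implementation: the function name, the return shape, and the assumed rc-file syntax (one option per line, optional leading dashes, `#` comments) are all hypothetical. The idea is that an option carrying an arbitrary command is never honored from a file, while the two fixed-command variants are.

```python
# Hypothetical names throughout; not Miller's actual implementation.
SAFE_PREPIPE_OPTIONS = {"prepipe-zcat", "prepipe-gunzip"}


def load_rc_flags(lines):
    """Split rc-file lines into (accepted, rejected) option lines."""
    accepted, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments (assumed rc syntax)
        # The option name is whatever precedes the first space or '=',
        # with any leading dashes removed.
        parts = line.lstrip("-").replace("=", " ", 1).split(None, 1)
        if not parts:
            continue
        name = parts[0]
        if name.startswith("prepipe") and name not in SAFE_PREPIPE_OPTIONS:
            rejected.append(line)  # arbitrary command: never run it from a file
        else:
            accepted.append(line)
    return accepted, rejected


accepted, rejected = load_rc_flags(
    ["# my defaults", "csv", "prepipe do_something_bad; cat", "prepipe-gunzip"]
)
print(accepted)  # ['csv', 'prepipe-gunzip']
print(rejected)  # ['prepipe do_something_bad; cat']
```

The safe variants take no argument, so a file author cannot smuggle a command into them.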
This is published as CVE-2020-15167. Many thanks to @koernepr for the report!
.mlrrc feature, and fix Windows build
- You can now save common defaults in a ~/.mlrrc. For example, if you normally process CSV files, you can say that in your ~/.mlrrc and you can leave off the --csv flag from your mlr commands. You can read more about this feature here, or in man mlr, or in mlr --help. This feature was requested in #339.
- The AppVeyor build is now unbroken and as a result there are Windows artifacts for this build. Sorry about the delay!! :^/
Better environment-variable support, new 'count' verb, bugfixes
Features
- The new count verb is a keystroke-saver for stats1 -a count -f {some field name}.
- --jsonx and --ojsonx are keystroke-savers for --json --jvstack and --ojson --jvstack, which is to say, multi-line pretty-printed JSON format.
- The new -s name=value feature for mlr put and mlr filter gives you simpler access to environment variables in your Miller script, as requested in #315.
Bugfixes
- mlr format-values is no longer SEGVing on CSV/TSV input. This was reported on #330.
- #313 fixes a corner case when field names within command-line arguments have embedded newlines.
- Line/column indicators for JSON-formatting error messages are now correct (previously they were showing up as 0).
- end {print NF} no longer SEGVs. This was reported in #330.
- Several broken doc links were fixed up as reported on #329.
Windows note
- The AppVeyor build has been broken for a while so there is no Windows executable attached to this release -- when I fix that there will be a 5.8.1 with Windows binaries. My apologies for the delay. Issue #354 is open to track this.
Ports, bugfixes, and keystroke-savers
Ports
- Miller is available via MacPorts thanks to @herbygillot. Miller tracking issue is #273.
- An Alpine Linux port is pending this release thanks to @terorie. Miller tracking issue is #293.
Features
- The new remove-empty-columns and skip-trivial-records verbs are keystroke-savers for things which would otherwise require DSL syntax, as tracked in #274.
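To make the intent of the two verbs concrete, here is a small Python sketch of the same ideas over a list of dict records. It is an analogue, not Miller code. It assumes a "trivial" record is one with no fields or only empty-string values, and an "empty" column is one whose value is the empty string in every record.

```python
def skip_trivial_records(records):
    """Yield only records that have at least one non-empty value."""
    for rec in records:
        if any(value != "" for value in rec.values()):
            yield rec


def remove_empty_columns(records):
    """Drop every field whose value is empty in all records.

    All records must be seen before any can be emitted, since a column
    is only known to be empty once the input is exhausted.
    """
    records = list(records)
    keep = {k for rec in records for k, v in rec.items() if v != ""}
    return [{k: v for k, v in rec.items() if k in keep} for rec in records]


rows = [
    {"a": "1", "b": "", "c": "3"},
    {"a": "", "b": "", "c": ""},
    {"a": "4", "b": "", "c": ""},
]
kept = list(skip_trivial_records(rows))
print(remove_empty_columns(kept))
# [{'a': '1', 'c': '3'}, {'a': '4', 'c': ''}]
```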
Bugfixes
- A bug regarding optional regex-pattern groups was fixed in #277.
- As of #294 you can now specify --implicit-csv-header for the join-file in mlr join.
- A bug with spaces in XTAB-file values was fixed on #296.
- A bug with missing final newline for XTAB-formatted files using MMAP files was fixed on #301.
Documentation
- Look-and-feel at http://johnkerl.org/miller/doc/ is (hopefully) improved, including clearer visual indication of which section/page you're currently looking at. Note that this change has been live for a few weeks, as look-and-feel-related doc-mods from post-5.6.2 were backported to http://johnkerl.org/miller/doc/.
- #282 improves DSL-function documentation at http://johnkerl.org/miller/doc/reference-dsl.html#Built-in_functions_for_filter_and_put,_summary
Note
Support for mmap mode has been entirely discontinued. This is an invisible change and should not affect you at all. For anyone interested in lower-level details, though, the summary is as follows:
- For an incremental performance gain (perhaps 10-20% run time at most, but see below), within the C source code one can use the mmap system call to access input files via pointer arithmetic rather than malloc-and-memcopy using stdio.
- However mmap is not available when reading from standard input -- it cannot be memory-mapped.
- This means all file-format readers are implemented twice within the Miller source code.
- While I try to regression-test Miller thoroughly, running all canned tests through mmap and stdio mode, I've nonetheless found my mmap implementations liable to corner-cases which I miss but users find: for example #29, #102, and #296.
- As tracked on #160, various operating systems do not release mmapped pages after use as one might intuit, meaning that for large files and/or large numbers of files, I've for a long time now needed to have Miller opt out of mmap usage for precisely those cases which most need the performance gain: see #160, #181, and #256.
- Additionally, mmap is not used at all for Windows/MSYS2 so there is nothing to lose there.
For these reasons, keeping mmap mode isn't worth the development overhead.
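The duplication described above can be seen in miniature in the following Python sketch, which uses Python's standard mmap module rather than Miller's C code. The same task (counting newline-terminated records) is written twice, once with buffered reads and once against a memory-mapped file; the function names and the sample file are made up for this example.

```python
import mmap
import os
import tempfile


def count_lines_stdio(path):
    """Count newlines using ordinary buffered reads, which copy data into buffers."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(65536), b""))


def count_lines_mmap(path):
    """Count newlines by scanning the file's pages mapped into memory."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count, pos = 0, m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count


with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a=1\nb=2\nc=3\n")
print(count_lines_stdio(tmp.name), count_lines_mmap(tmp.name))  # 3 3
os.remove(tmp.name)
```

Only the stdio variant can be pointed at a pipe or standard input, so it has to exist regardless. The mmap variant is a second copy of the same logic to test and maintain.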
As of release 5.7.0, the mlr executable will still accept the --mmap and --no-mmap command-line flags as no-ops, for backward compatibility.
The caveat for you is that for everyday small files, the default was previously mmap mode and is now stdio (except mlr ... < filename or ... | mlr ... which have always used stdio). There is the off chance that this will newly reveal an old, latent bug or two somewhere.
I've re-run regressions in valgrind mode to aggressively catch any errors, but please let me know ASAP via a GitHub issue about any unexpected behavior in 5.7.0.
Miller 5.6.2: Bug fix for CSV/TSV with many files
Bug fixes:
- #271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.
Documentation:
- The new http://johnkerl.org/miller/doc/whyc-details.html is an elaboration on http://johnkerl.org/miller/doc/whyc.html which answers a question posed by @BurntSushi on Reddit a couple years ago which I did not address in detail at the time.
Mobile-friendly docs
The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly.
All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0
System calls / external commands, ASV/USV support, and bulk numeric formatting
Features:
- The new system DSL function allows you to run arbitrary shell commands and store their output in field values. Some example usages are documented here. This is in response to issues #246 and #209.
- There is now support for ASV and USV file formats. This is in response to issue #245.
- The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.
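For readers unfamiliar with ASV and USV, the following Python sketch shows the general idea of separator-delimited text using those conventions. It is not Miller's reader. It assumes ASV uses the ASCII unit separator (0x1F) between fields and the ASCII record separator (0x1E) between records, that USV uses the Unicode characters U+241F and U+241E for the same roles, and that the first record is a header; the function name is hypothetical.

```python
# Separator pairs: (field separator, record separator).
ASV = ("\x1f", "\x1e")      # ASCII unit separator, ASCII record separator
USV = ("\u241f", "\u241e")  # Unicode symbols for unit and record separator


def parse_separated(text, separators):
    """Split text into dict records, using the first record as the header."""
    field_sep, record_sep = separators
    rows = [row.split(field_sep) for row in text.split(record_sep) if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


asv_text = "a\x1fb\x1e1\x1f2\x1e3\x1f4\x1e"
print(parse_separated(asv_text, ASV))
# [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
```

Because these separators essentially never occur inside real field values, commas, tabs, and newlines in the data need no quoting or escaping.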
Documentation:
- The new DKVP I/O in Python sample code now works for Python 2 as well as Python 3.
- There is a new cookbook entry on doing multiple joins. This is in response to issue #235.
Bugfixes:
- The toupper, tolower, and capitalize DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h. The internationalization page has also been expanded. This is in response to issue #254.
- #250 fixes a bug using in-place mode in conjunction with verbs (such as rename or sort) which take field-name lists as arguments.
- #253 fixes a bug in the label verb when one or more names are common between old and new.
- #251 fixes a corner-case bug when (a) input is CSV; (b) the last field ends with a comma and no newline; (c) input is from standard input and/or --no-mmap is supplied.
Note:
Thanks to @aborruso @davidselassie @joelparkerhenderson for the bug reports and feature requests!! :)
Positional indexing and other data-cleaning features
Features:
- The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value by $[[[3]]]. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.
- There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.
- There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.
- Another pair is #181 and #256. While Miller uses mmap internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues with reading either large files, or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.
- There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.
- There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.
- There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.
- There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.
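Two of the features above lend themselves to a quick illustration. The Python sketch below mimics positional indexing on an insertion-ordered dict and spells out the urandrange formula quoted in the list. It is an analogue rather than Miller DSL: the helper names are invented for this example, positions are 1-up, and n is assumed to lie between 1 and the number of fields.

```python
import random


def positional_name(record, n):
    """Name of the n-th field (1-up), analogous to reading $[[n]]."""
    return list(record)[n - 1]


def positional_value(record, n):
    """Value of the n-th field (1-up), analogous to reading $[[[n]]]."""
    return list(record.values())[n - 1]


def rename_positional(record, n, new_name):
    """Rename the n-th field, keeping its position, analogous to assigning to $[[n]]."""
    old_name = positional_name(record, n)
    return {(new_name if k == old_name else k): v for k, v in record.items()}


def urandrange(low, high):
    """Uniform draw between low and high: low + (high - low) * urand()."""
    return low + (high - low) * random.random()


rec = {"a": 1, "b": 2, "c": 3}
print(positional_name(rec, 3))          # c
print(positional_value(rec, 3))         # 3
print(rename_positional(rec, 3, "z"))   # {'a': 1, 'b': 2, 'z': 3}
```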
Documentation:
- The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'%3F'_in_regexes%3F resolves #203.
- The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date%3F resolves #208.
- #244 fixes a documentation issue while highlighting the need for #241.
Bugfixes:
- There was a SEGV using nest within then-chains, fixed in response to #220.
- Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.
An extra thank-you:
I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (#212). Mille grazie al mio amico @aborruso!
New data-cleaning features, Windows mlr.exe, limited localtime support, and bugfixes
Features:
- The new clean-whitespace verb resolves #190 from @aborruso. Along with the new functions strip, lstrip, rstrip, collapse_whitespace, and clean_whitespace, there is now both coarse-grained and fine-grained control over whitespace within field names and/or values. See the linked-to documentation for examples.
- The new altkv verb resolves #184 which was originally opened via an email request. This supports mapping value-lists such as a,b,c,d to alternating key-value pairs such as a=b,c=d.
- The new fill-down verb resolves #189 by @aborruso. See the linked-to documentation for examples.
- The uniq verb now has a uniq -a which resolves #168 from @sjackman.
- The new regextract and regextract_or_else functions resolve #183 by @aborruso.
- The new ssub function arises from #171 by @dohse, as a simplified way to avoid escaping characters which are special to regular-expression parsers.
- There are new localtime functions in response to #170 by @sitaramc. However note that as discussed on #170 these do not undo one another in all circumstances. This is a non-issue for timezones which do not do DST. Otherwise, please use with disclaimers: localdate, localtime2sec, sec2localdate, sec2localtime, strftime_local, and strptime_local.
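As a rough picture of the altkv mapping and the whitespace helpers listed above, here is a Python sketch. These are analogues written for illustration, not Miller's implementations. The whitespace semantics are assumptions: strip trims both ends, collapse_whitespace squeezes runs of whitespace to a single space, and clean_whitespace does both. The note above only describes even-length input for altkv, so this sketch chooses to reject odd-length input.

```python
import re


def altkv(values):
    """Pair up alternating values: [a, b, c, d] becomes {a: b, c: d}."""
    if len(values) % 2 != 0:
        # Odd-length input is outside what the note describes; rejecting
        # it is a choice made by this sketch only.
        raise ValueError("altkv sketch expects an even number of values")
    return dict(zip(values[0::2], values[1::2]))


def strip(s):
    """Remove leading and trailing whitespace."""
    return s.strip()


def collapse_whitespace(s):
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", s)


def clean_whitespace(s):
    """Collapse interior whitespace, then strip both ends."""
    return strip(collapse_whitespace(s))


print(altkv("a,b,c,d".split(",")))       # {'a': 'b', 'c': 'd'}
print(repr(clean_whitespace("  a   b  ")))  # 'a b'
```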
Builds:
- Windows build-artifacts are now available in Appveyor at https://ci.appveyor.com/project/johnkerl/miller/build/artifacts, and will be attached to this and future releases. This resolves #167, #148, and #109.
- Travis builds at https://travis-ci.org/johnkerl/miller/builds now run on OSX as well as Linux.
Documentation:
- put/filter documentation was confusing as reported by @NikosAlexandris on #169.
- The new FAQ entry http://johnkerl.org/miller-releases/miller-head/doc/faq.html#How_to_rectangularize_after_joins_with_unpaired? resolves #193 by @aborruso.
- The new cookbook entry http://johnkerl.org/miller/doc/cookbook.html#Options_for_dealing_with_duplicate_rows arises from #168 from @sjackman.
- The unsparsify documentation had some words missing as reported by @tst2005 on #194.
- There was a typo in the cookbook page http://johnkerl.org/miller/doc/cookbook.html#Full_field_renames_and_reassigns as fixed by @tst2005 in #192.