29 Nov 21:07

johnkerl

5a453f8

sort-within-records, unsparsify -f, misc updates; Go-port beta

Features

The unsparsify -f feature fulfills #387 from @sjackman .
The new sort-within-records verb is an old ask; it was developed first in the Go port and then backported to C.
Likewise the truncate DSL function.
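
For readers unfamiliar with these two verbs, here is a minimal Python sketch of their effect on a record stream. It is not Miller's implementation: records are modeled as plain dicts, the function names are hypothetical, and it assumes that sort-within-records orders fields by key and that unsparsify -f fills only the named fields, using an empty string as the filler.

```python
def sort_within_records(records):
    """Yield each record with its fields reordered by key (ascending)."""
    for rec in records:
        yield {key: rec[key] for key in sorted(rec)}


def unsparsify_fields(records, field_names, filler=""):
    """Yield each record, adding any of field_names it lacks.

    Only the named fields are considered, so each record can be emitted
    as soon as it is read, without first scanning the whole input.
    """
    for rec in records:
        out = dict(rec)
        for name in field_names:
            out.setdefault(name, filler)
        yield out


records = [{"c": 3, "a": 1, "b": 2}, {"b": 5}]
print(list(sort_within_records(records)))
print(list(unsparsify_fields(records, ["a", "b"])))
```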

Bugfixes

The count -n feature was not implemented as intended. This fulfills #370, reported by @aborruso.
Pretty-print format now works correctly with --headerless-csv-output, as reported on #384 by @agguser.
The seqgen verb now correctly tracks NR and FNR in the records it emits.
An intermittent JSON-parsing bug reported on #394 by @sjackman has been fixed.

Documentation

This is the first release since the readthedocs move as requested by @pabloab on #375. The intention is that you will be able to select documentation specific to 5.10.0 there; I may have something to fix here.

Go-port preview

While the mods for this 5.10.0 release are quite minor, intense development time has been spent over the last few months on the Go port, tracked here and here, which will ultimately become Miller 6.

The completion of the port is still some months away. While most verbs, and most of the DSL, have been ported -- with many new features in place as tracked here -- significant gaps remain. These include the "big" verbs join, nest, reshape, stats1, and stats2, along with all the date-time-related DSL functions, etc.

Nonetheless, if you wish to experiment with the Go executables for the Miller 6 beta, please find MacOS and Linux versions attached. (I don't know how to make these for Windows yet, sorry!)

I'd love any and all advance help with the Go port including bug reports, feature requests, etc. -- both from Miller end-users as well as developers. This is exciting and fulfilling work, and I look forward to getting it completed.

Contributors

aborruso, sjackman, and 2 other contributors

03 Sep 01:34

johnkerl

v5.9.1

f34cbcd

Security update: disallow --prepipe in .mlrrc

As of Miller 5.9.0, you can have a .mlrrc file containing preferred flags.

As reported in #363, it would be possible for someone to prepare a repository or some other zipfile/tarfile, for example, containing datasets, and send it to you. They could have a line of the form prepipe do_something_bad; cat in that repository, so when you ran any mlr commands in there, it would run the do_something_bad command (whatever that might be).

The fix is to (a) disallow prepipe within .mlrrc files and (b), as a consolation, allow new prepipe-zcat and prepipe-gunzip options, which are safe to use.
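
The idea behind the fix can be sketched in a few lines of Python. This is only an illustration, not Miller's C code: the function name load_rc_flags is hypothetical, and the rc-file syntax it accepts (one flag per line, with the value separated by = or whitespace, and # comments) is an assumption. The fixed-name options pass through because they carry no user-supplied command.

```python
def load_rc_flags(lines):
    """Turn rc-file lines into command-line-style flags, refusing prepipe."""
    flags = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" in line:
            key, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
        key = key.strip().lstrip("-")
        value = value.strip()
        if key == "prepipe":
            # An arbitrary command supplied by whoever wrote the file:
            # never honor it from an rc file.
            continue
        flags.append("--" + key)
        if value:
            flags.append(value)
    return flags


print(load_rc_flags(["csv", "prepipe do_something_bad; cat", "prepipe-gunzip"]))
```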

This is published as CVE-2020-15167. Many thanks to @koernepr for the report!

19 Aug 18:54

johnkerl

v5.9.0

640af6e

.mlrrc feature, and fix Windows build

You can now save common defaults in a ~/.mlrrc. For example, if you normally process CSV files, you can say that in your ~/.mlrrc and you can leave off the --csv flag from your mlr commands. You can read more about this feature here, or in man mlr, or in mlr --help. This feature was requested in #339.
The AppVeyor build is now unbroken and as a result there are Windows artifacts for this build. Sorry about the delay!! :^/

03 Aug 23:19

johnkerl

v5.8.0

e5cdbc7

Better environment-variable support, new 'count' verb, bugfixes

Features

The new count verb is a keystroke-saver for stats1 -a count -f {some field name}.
--jsonx and --ojsonx are keystroke-savers for --json --jvstack and --ojson --jvstack, which is to say, multi-line pretty-printed JSON format.
The new -s name=value feature for mlr put and mlr filter gives you simpler access to environment variables in your Miller script, as requested in #315.

Bugfixes

mlr format-values is no longer SEGVing on CSV/TSV input. This was reported on #330.
#313 fixes a corner case when field names within command-line arguments have embedded newlines.
Line/column indicators for JSON-formatting error messages are now correct (previously they were showing up as 0).
end {print NF} no longer SEGVs. This was reported in #330.
Several broken doc links were fixed up as reported on #329.

Windows note

The AppVeyor build has been broken for a while so there is no Windows executable attached to this release -- when I fix that there will be a 5.8.1 with Windows binaries. My apologies for the delay. Issue #354 is open to track this.

17 Mar 03:24

johnkerl

v5.7.0

a4037d3

Ports, bugfixes, and keystroke-savers

Ports

Miller is available via MacPorts thanks to @herbygillot. Miller tracking issue is #273.
An Alpine Linux port is pending this release thanks to @terorie. Miller tracking issue is #293.

Features

The new remove-empty-columns and skip-trivial-records verbs are keystroke-savers for things which would otherwise require DSL syntax, as tracked in #274.

Bugfixes

A bug regarding optional regex-pattern groups was fixed in #277.
As of #294 you can now specify --implicit-csv-header for the join-file in mlr join.
A bug with spaces in XTAB-file values was fixed on #296.
A bug with missing final newline for XTAB-formatted files using MMAP files was fixed on #301.

Documentation

Look-and-feel at http://johnkerl.org/miller/doc/ is (hopefully) improved, including clearer visual indication of which section/page you're currently looking at. Note that this change has been live for a few weeks, as look-and-feel-related doc-mods from post-5.6.2 were backported to http://johnkerl.org/miller/doc/.
#282 improves DSL-function documentation at http://johnkerl.org/miller/doc/reference-dsl.html#Built-in_functions_for_filter_and_put,_summary

Note

Support for mmap mode has been entirely discontinued. This is an invisible change and should not affect you at all. For anyone interested in lower-level details, though, the summary is as follows:

For an incremental performance gain (perhaps 10-20% run time at most, but see below), within the C source code one can use the mmap system call to access input files via pointer arithmetic rather than malloc-and-memcopy using stdio.
However mmap is not available when reading from standard input -- it cannot be memory-mapped.
This means all file-format readers are implemented twice within the Miller source code.
While I try to regression-test Miller thoroughly, running all canned tests through mmap and stdio mode, I've nonetheless found my mmap implementations liable to corner-cases which I miss but users find: for example #29, #102, and #296.
As tracked on #160, various operating systems do not release mmapped pages after use as one might intuit, meaning that for large files and/or large numbers of files, I've for a long time now needed to have Miller opt out of mmap usage for precisely those cases which most need the performance gain: see #160, #181, and #256.
Additionally, mmap is not used at all for Windows/MSYS2 so there is nothing to lose there.
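
To make the duplication concrete, here is a small Python sketch of the two access paths. It is not Miller's C code, and the function names are hypothetical. It shows the same line-splitting logic written once against a memory-mapped file and once against an ordinary stream, with standard input only able to use the latter; it also assumes the mapped file is non-empty, since Python refuses to map an empty file.

```python
import mmap
import sys


def read_lines_stdio(stream):
    """Read lines from any binary stream, including standard input."""
    return [line.rstrip(b"\n") for line in stream]


def read_lines_mmap(path):
    """Read lines from a regular, non-empty file by memory-mapping it."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # readline() returns b"" at end of file, which stops the iterator.
            return [line.rstrip(b"\n") for line in iter(mapped.readline, b"")]


def read_lines(path=None):
    """Use the mmap reader for a named file, else fall back to stdio."""
    if path is None:
        return read_lines_stdio(sys.stdin.buffer)
    return read_lines_mmap(path)
```

Every format reader needs both variants, and the two must agree on every corner case, such as a missing final newline.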

For these reasons, keeping mmap mode isn't worth the development overhead.

As of release 5.7.0, the mlr executable will still accept the --mmap and --no-mmap command-line flags as no-ops, for backward compatibility.

The caveat for you is that for everyday small files, the default was previously mmap mode and is now stdio (except mlr ... < filename or ... | mlr ... which have always used stdio). There is the off chance that this will newly reveal an old, latent bug or two somewhere.

I've re-run regressions in valgrind mode to aggressively catch any errors, but please let me know ASAP via a GitHub issue about any unexpected behavior in 5.7.0.

22 Sep 00:20

johnkerl

v5.6.2

001321a

Miller 5.6.2: Bug fix for CSV/TSV with many files

Bug fixes:

#271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.

Documentation:

The new http://johnkerl.org/miller/doc/whyc-details.html is an elaboration on http://johnkerl.org/miller/doc/whyc.html which answers a question posed by @BurntSushi on Reddit a couple years ago which I did not address in detail at the time.

Contributors

BurntSushi

17 Sep 03:23

johnkerl

v5.6.1

6811478

Mobile-friendly docs

The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly.

All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0

13 Sep 02:25

johnkerl

v5.6.0

377ce82

System calls / external commands, ASV/USV support, and bulk numeric formatting

Features:

The new system DSL function allows you to run arbitrary shell commands and store their output in field values. Some example usages are documented here. This is in response to issues #246 and #209.
There is now support for ASV and USV file formats. This is in response to issue #245.
The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.
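
As a rough picture of what the system function does, here is a Python sketch. It is not the Miller DSL: the helper name run_shell is hypothetical, and it assumes the command's standard output is captured with its final newline removed before being stored in the record.

```python
import subprocess


def run_shell(command):
    """Run a shell command and return its stdout without the final newline."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=False
    )
    return completed.stdout.rstrip("\n")


record = {"a": "pan"}
# Store the command's output in a new field of the record.
record["greeting"] = run_shell("echo hello")
print(record)
```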

Documentation:

The new DKVP I/O in Python sample code now works for Python 2 as well as Python 3.
There is a new cookbook entry on doing multiple joins. This is in response to issue #235.

Bugfixes:

The toupper, tolower, and capitalize DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h. The internationalization page has also been expanded. This is in response to issue #254.
#250 fixes a bug using in-place mode in conjunction with verbs (such as rename or sort) which take field-name lists as arguments.
#253 fixes a bug in the label verb when one or more names are common between old and new.
#251 fixes a corner-case bug when (a) input is CSV; (b) the last field ends with a comma and no newline; (c) input is from standard input and/or --no-mmap is supplied.

Note:

Thanks to @aborruso @davidselassie @joelparkerhenderson for the bug reports and feature requests!! :)

01 Sep 03:03

johnkerl

v5.5.0

beeb9ae

Positional indexing and other data-cleaning features

Features:

The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value by $[[[3]]]. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.
There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.
There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.
Another pair is #181 and #256. While Miller uses mmap internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues with reading either large files, or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.
There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.
There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.
There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.
There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.
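
Two of the items above are easy to picture in code. The Python sketch below is not the Miller DSL, and the function names are hypothetical. It shows a rename by 1-based field position, which is the kind of thing assigning to $[[3]] enables, and the urandrange formula exactly as stated above, with random.random() standing in for urand().

```python
import random


def rename_field_at(record, position, new_name):
    """Return a copy of record with the field at 1-based position renamed.

    Field order is preserved; the value is left untouched.
    """
    old_name = list(record)[position - 1]
    return {
        (new_name if key == old_name else key): value
        for key, value in record.items()
    }


def urandrange(low, high):
    """Scale a uniform draw from [0, 1) onto the range from low to high."""
    return low + (high - low) * random.random()


print(rename_field_at({"a": 1, "b": 2, "c": 3}, 3, "z"))
print(urandrange(2, 5))
```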

Documentation:

The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'%3F'_in_regexes%3F resolves #203.
The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date%3F resolves #208.
#244 fixes a documentation issue while highlighting the need for #241.

Bugfixes:

There was a SEGV using nest within then-chains, fixed in response to #220.
Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.

An extra thank-you:

I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (#212). Mille grazie al mio amico @aborruso!

14 Oct 20:25

johnkerl

5.4.0

d3dbfb7

New data-cleaning features, Windows mlr.exe, limited localtime support, and bugfixes

Features:

The new clean-whitespace verb resolves #190 from @aborruso. Along with the new functions strip, lstrip, rstrip, collapse_whitespace, and clean_whitespace, there is now both coarse-grained and fine-grained control over whitespace within field names and/or values. See the linked-to documentation for examples.
The new altkv verb resolves #184 which was originally opened via an email request. This supports mapping value-lists such as a,b,c,d to alternating key-value pairs such as a=b,c=d.
The new fill-down verb resolves #189 by @aborruso. See the linked-to documentation for examples.
The uniq verb now has a uniq -a which resolves #168 from @sjackman.
The new regextract and regextract_or_else functions resolve #183 by @aborruso.
The new ssub function arises from #171 by @dohse, as a simplified way to avoid escaping characters which are special to regular-expression parsers.
There are new localtime functions in response to #170 by @sitaramc. However note that as discussed on #170 these do not undo one another in all circumstances. This is a non-issue for timezones which do not do DST. Otherwise, please use with disclaimers: localdate, localtime2sec, sec2localdate, sec2localtime, strftime_local, and strptime_local.
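
The altkv mapping described above can be sketched in Python as follows. This is an illustration of the a,b,c,d to a=b,c=d pairing only, not Miller's implementation. Since an odd number of values is not covered here, the sketch simply rejects that case.

```python
def altkv(values):
    """Pair up alternating values as key, value, key, value, ..."""
    if len(values) % 2 != 0:
        raise ValueError("this sketch only handles an even number of values")
    return dict(zip(values[0::2], values[1::2]))


print(altkv(["a", "b", "c", "d"]))
```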

Builds:

Windows build-artifacts are now available in Appveyor at https://ci.appveyor.com/project/johnkerl/miller/build/artifacts, and will be attached to this and future releases. This resolves #167, #148, and #109.
Travis builds at https://travis-ci.org/johnkerl/miller/builds now run on OSX as well as Linux.
An Ubuntu 17 build issue was fixed by @singalen on #164.

Documentation:

put/filter documentation was confusing as reported by @NikosAlexandris on #169.
The new FAQ entry http://johnkerl.org/miller-releases/miller-head/doc/faq.html#How_to_rectangularize_after_joins_with_unpaired? resolves #193 by @aborruso.
The new cookbook entry http://johnkerl.org/miller/doc/cookbook.html#Options_for_dealing_with_duplicate_rows arises from #168 from @sjackman.
The unsparsify documentation had some words missing as reported by @tst2005 on #194.
There was a typo in the cookbook page http://johnkerl.org/miller/doc/cookbook.html#Full_field_renames_and_reassigns as fixed by @tst2005 in #192.

Bugfixes:

There was a memory leak for TSV-format files only as reported by @treynr on #181.
Dollar signs in regular expressions were not being escaped properly as reported by @dohse on #171.

Releases: johnkerl/miller
