Releases: johnkerl/miller
Minor feature enhancements, and portability
- Portability (affecting the CSV-RFC reader) for the Debian packaging request: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800074. The latter greatly increases the number of platforms on which Miller has been validated.
- `mlr decimate`: http://johnkerl.org/miller/doc/reference.html#decimate
- Integer-preservation feature for `mlr top` and `mlr stats1` with percentiles: if inputs are integers then the corresponding outputs will be integers as well (unless `-F` is given, which forces all-float output).
- `mlr histogram` now has an `--auto` option for autocomputing lower and upper limits: http://johnkerl.org/miller/doc/reference.html#histogram
- `mlr uniq` and `mlr count-distinct` now have a `-n` flag to show only the counts of distinct values, rather than listing all distinct values: http://johnkerl.org/miller/doc/reference.html#uniq http://johnkerl.org/miller/doc/reference.html#count-distinct
- The `strlen` function correctly handles UTF-8 string data.
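To illustrate what UTF-8-aware string length means, here is a minimal Python sketch. It is not Miller's C implementation, and the helper names `byte_length` and `char_length` are hypothetical; it only shows why counting bytes and counting characters give different answers for non-ASCII data.

```python
def byte_length(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def char_length(s: str) -> int:
    """Number of Unicode code points in s."""
    return len(s)


# U+00E9 (e with acute accent) is one code point but two UTF-8 bytes.
word = "h\u00e9llo"
print(byte_length(word), char_length(word))
```

A byte-counting length function would report the first number; a UTF-8-aware one reports the second.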
Allow scientific notation in DSL literals; mlr bar --auto
- Miller has always supported scientific notation in field values, e.g. `x=1e6`. However, it had never supported scientific notation in DSL literals, e.g. `mlr put '$y = $x + 1e6'`. This release fixes that.
- Additionally, `mlr bar` now has an `--auto` flag which holds all records in memory and computes limits from the data, so you don't have to compute them separately and pass them in via `--lo` and `--hi`.
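The reason an auto-limits option has to hold all records in memory is that the limits are unknown until the last record has been seen. A rough Python sketch of that two-pass idea follows; the function name `auto_limits` and the dict-per-record representation are assumptions for illustration, not Miller internals.

```python
def auto_limits(records, field):
    """Hold every record, then compute (lo, hi) for `field` from the data.

    Hypothetical helper: records are dicts of string values, as they
    would be after parsing key=value text. Records lacking the field
    are kept but do not contribute to the limits.
    """
    held = list(records)  # first pass: retain everything
    values = [float(r[field]) for r in held if field in r]
    if not values:
        raise ValueError(f"no values found for field {field!r}")
    return held, min(values), max(values)


records = [{"x": "3"}, {"x": "1e6"}, {"y": "2"}]
held, lo, hi = auto_limits(records, "x")
print(len(held), lo, hi)
```

Note that `float("1e6")` accepts scientific notation in a field value, which mirrors the first bullet above.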
Integer and float arithmetic, improved documentation, minor feature enhancements
Integer/float arithmetic
The key feature of the 3.0.0 release, and the reason for the major version increment, is a change to arithmetic. Previously all numbers were scanned into `mlr put` and `mlr filter` functions as floating-point -- then, only recast to integer as necessary for integer operations. Since IEEE doubles have 53 bits of precision (52 mantissa bits along with an implicit leading one) while 64-bit integers have 64, this meant that full 64-bit integer significance could not be passed through Miller functions.
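The precision loss described above can be seen directly by round-tripping integers through a double. This is a small Python sketch (the helper name `survives_double` is hypothetical); Python floats are IEEE doubles on all common platforms.

```python
def survives_double(n: int) -> bool:
    """True if integer n is unchanged after a round trip through a double."""
    return int(float(n)) == n


# 2**53 is exactly representable; 2**53 + 1 needs a 54th bit and is rounded.
print(survives_double(2**53))       # True
print(survives_double(2**53 + 1))   # False
print(survives_double(2**63 - 1))   # False: largest signed 64-bit integer
```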
As of the 3.0.0 release, numbers in Miller are int (64 bits) or float (double-precision). Numbers scannable as integers are treated as integers. The sum, difference, and product of two integers is another integer -- except when overflow would occur, at which point a floating-point result is produced. Integer division is pythonic, namely, `7/2` is 3.5, and `7//2` is 3. Mixed integer/float operations produce float. Bitwise operators are now supported.
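The rules in this paragraph can be modelled in a few lines of Python. This is only a sketch of the stated behavior, not Miller's C code: Python integers are arbitrary-precision, so the 64-bit range is emulated explicitly, and the helper name `int_op` is hypothetical.

```python
import operator

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def int_op(op, a, b):
    """Apply op to a and b under the int-preserving rule.

    int (op) int stays int when the result fits in 64 bits; on overflow
    the result is returned as a float instead. Any float operand makes
    the result a float, as in ordinary Python arithmetic.
    """
    result = op(a, b)
    if isinstance(result, int) and not (INT64_MIN <= result <= INT64_MAX):
        return float(result)
    return result


print(int_op(operator.add, 1, 2))            # 3 (int)
print(int_op(operator.add, INT64_MAX, 1))    # a float, since 2**63 overflows
print(int_op(operator.add, 1, 2.5))          # 3.5 (mixed gives float)
print(7 / 2, 7 // 2)                         # 3.5 3
```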
You now have more control over arithmetic, not less. The only real compatibility change is that some numbers will now be printing like `123` rather than `123.0000`.
For full details please see http://johnkerl.org/miller/doc/reference.html#Arithmetic.
New functions for filter and put
- Since integers are now fully supported in `mlr put` and `mlr filter`, it is now possible to have the bitwise operators `| ^ & << >>`. These operate on 64-bit integers and produce 64-bit-integer results.
- Modular arithmetic is implemented by `madd`, `msub`, `mmul`, and `mexp`.
- `urandint` and `urand32` are in addition to the existing `urand`.
- `sgn` complements `abs`.
- `strftime` and `strptime` are generalizations of `sec2gmt` and `gmt2sec`. These are pass-throughs to system `strftime` and `strptime`; see your local manpages for available time-formatting options.
- Please see http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put for more information.
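For readers unfamiliar with modular arithmetic, the four functions named above can be approximated in Python as follows. The three-argument order `(a, b, m)` is an assumption here, and Python's `%` always returns a non-negative result for a positive modulus, which may not match Miller's handling of negative operands; check the reference page before relying on edge cases.

```python
def madd(a: int, b: int, m: int) -> int:
    """(a + b) mod m."""
    return (a + b) % m


def msub(a: int, b: int, m: int) -> int:
    """(a - b) mod m."""
    return (a - b) % m


def mmul(a: int, b: int, m: int) -> int:
    """(a * b) mod m."""
    return (a * b) % m


def mexp(a: int, b: int, m: int) -> int:
    """(a ** b) mod m, using fast modular exponentiation."""
    return pow(a, b, m)


print(madd(5, 3, 7), msub(5, 6, 7), mmul(3, 4, 7), mexp(2, 10, 7))
```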
Verbs
- `mlr grep`: http://johnkerl.org/miller/doc/reference.html#grep
- `mlr cat -n` option: http://johnkerl.org/miller/doc/reference.html#cat
- `mlr stats1 skewness` and `mlr stats1 kurtosis`: http://johnkerl.org/miller/doc/reference.html#stats1
- `mlr bar` allows for some simple terminal-level visualization: http://johnkerl.org/miller/doc/reference.html#bar
- `mlr join` now has full support for heterogeneous data: records lacking all the join keys are treated the same as any other left-unpaired or right-unpaired records. This was tracked on issue #82.
I/O options
- `mlr --xvright` for XTAB output
- `mlr --headerless-csv-output` for CSV/CSV-lite output
Documentation
- The `mlr.1` manpage is now autogenerated.
- There is now documentation on operator precedence and function semantics.
- HTML pages at http://johnkerl.org/miller/doc/ are now PDF-renderable.
- Per-release documents are available at http://johnkerl.org/miller/doc/release-docs.html. (The documents at http://johnkerl.org/miller/doc/ have always tracked head, and they continue to do so.)
Iterative stats, exclude-filter, implicit-CSV-header, and other features
- `mlr stats1` and `stats2` now support a `-s` feature in which means, linear regressions, etc. evolve record-by-record as new records appear over time. This is particularly useful in `tail -f` contexts. See also http://johnkerl.org/miller/doc/reference.html#stats1 and http://johnkerl.org/miller/doc/reference.html#stats2.
- `mlr filter` now supports a `-x` flag to negate the sense of the filter: instead of editing logic expressions, e.g. from `mlr filter '$x < 10 || $x > 20'` to `mlr filter '$x >= 10 && $x <= 20'`, you can simply do `mlr filter -x '$x < 10 || $x > 20'`. See also http://johnkerl.org/miller/doc/reference.html#filter.
- In the event a CSV file lacks header lines, you can use `mlr --implicit-csv-header` to add positional headers `1,2,3,...`. You can also convert those to desired text using `mlr label`. See also http://johnkerl.org/miller/doc/reference.html#label.
- Heterogeneity support is improved for `sort`, `stats1`, `stats2`, `step`, `head`, `tail`, `top`, `sample`, `uniq`, and `count-distinct`. See also #79.
- `mlr stats2` now has a logistic-regression feature, but I recommend treating it as experimental until some numerical-stability issues involving my naïve Newton-Raphson solver are worked out -- namely, it doesn't converge in all cases.
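The equivalence behind the `-x` flag is De Morgan's law: negating the whole expression gives the same result as rewriting each comparison by hand. A quick Python check of the example from the list above (the function names are hypothetical, and this covers only numeric `x`; how Miller treats absent or non-numeric fields is not modelled):

```python
def original(x):
    """The filter as first written: x < 10 || x > 20."""
    return x < 10 or x > 20


def hand_negated(x):
    """The hand-edited negation: x >= 10 && x <= 20."""
    return x >= 10 and x <= 20


def with_x_flag(x):
    """What a negate-the-filter flag amounts to: not (original)."""
    return not original(x)


# The two negations agree on every value tried, including the boundaries.
print(all(hand_negated(x) == with_x_flag(x) for x in range(-5, 40)))
```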
Bug fix for mlr top -a
Memory management was incorrect in `mlr top -a`.
Regex support, gsub, reservoir sampling, iterative stats, and other features
Regex support
- http://johnkerl.org/miller/doc/reference.html#Regular_expressions
- http://johnkerl.org/miller/doc/reference.html#put
- http://johnkerl.org/miller/doc/reference.html#filter
- http://johnkerl.org/miller/doc/reference.html#having-fields
- http://johnkerl.org/miller/doc/reference.html#cut
- http://johnkerl.org/miller/doc/reference.html#rename
gsub function
`gsub` supplements the existing `sub` function: it replaces all matches, whereas `sub` replaces only the first. Includes regex support.
http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put
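The replace-once versus replace-all distinction can be sketched with Python's `re` module. The helper names and argument order below are hypothetical, and Python's regex syntax is not identical to the regex library Miller uses, so treat this as an analogy rather than a description of Miller's functions.

```python
import re


def sub_once(s: str, pattern: str, replacement: str) -> str:
    """Replace only the first match of pattern in s."""
    return re.sub(pattern, replacement, s, count=1)


def gsub_all(s: str, pattern: str, replacement: str) -> str:
    """Replace every non-overlapping match of pattern in s."""
    return re.sub(pattern, replacement, s)


print(sub_once("banana", "a", "o"))         # bonana
print(gsub_all("banana", "a", "o"))         # bonono
print(gsub_all("a1b22c", "[0-9]+", "#"))    # a#b#c
```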
Reservoir sampling
http://johnkerl.org/miller/doc/reference.html#sample
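For background, reservoir sampling draws a fixed-size uniform sample from a stream of unknown length in a single pass. The sketch below is the classic "Algorithm R" in Python; whether `mlr sample` uses exactly this variant is not stated in these notes, and the function name and `rng` parameter are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from stream."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
print(len(sample))
```

Only `k` items are held at any time, which is what makes the technique suitable for streaming input.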
Iterative stats1/stats2
Use `mlr stats1 -s ...` or `mlr stats2 -s ...` to print averages, min/max, correlation, etc. on every record. Useful in `tail -f` contexts when you want to see statistics evolving as the data evolve in time.
http://johnkerl.org/miller/doc/reference.html#stats1
http://johnkerl.org/miller/doc/reference.html#stats2
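One way to produce statistics that evolve record by record is to keep running accumulators and emit the current values after each update. The Python sketch below uses Welford's algorithm for the mean and variance; this is a swapped-in textbook technique, not necessarily how Miller accumulates, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running count, mean, sample variance, min, and max."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean
        self.min = None
        self.max = None

    def update(self, x):
        """Fold one new value into the accumulators."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = x if self.min is None else min(self.min, x)
        self.max = x if self.max is None else max(self.max, x)

    def variance(self):
        """Sample variance, or None with fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else None


stats = RunningStats()
for value in [1, 2, 3, 4]:
    stats.update(value)
    print(stats.n, stats.mean, stats.min, stats.max)  # one line per record
```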
Minor
- Initial delta for `mlr step -a delta` is now 0, matching initial 1 for `mlr step -a ratio`
- Usage messages consistently go to stdout when asked for via `-h`, and stderr in case of command-line syntax errors
- Online help is confined to 80-character column width, except for `mlr -f` which is all single-line greppable
- `Header/data length mismatch` error messages for CSV/CSV-lite now include file/line context
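As a small illustration of the first item, a delta stepper has no previous value to subtract on the first record, so some initial value must be chosen; per the note above, that value is now 0. A Python sketch (the generator name `step_delta` is hypothetical and this is not Miller's code):

```python
def step_delta(values):
    """Yield each value's difference from its predecessor; first delta is 0."""
    previous = None
    for v in values:
        yield 0 if previous is None else v - previous
        previous = v


print(list(step_delta([3, 5, 4])))  # [0, 2, -1]
```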
Autoconfig support
Documentation at http://johnkerl.org/miller/doc/build.html
Resolves #9
Most of the work here due to @0-wiz-0
Multi-character RS,FS,PS
You can process CRLF-terminated DKVP files with `mlr --dkvp --rs crlf`.
You can process LF-terminated CSV files with `mlr --csv --rs lf`.
You can process TSV using `mlr --fs tab`; you can convert TSV to CSV using `mlr --ifs tab --ofs comma`.
Many more combinations are possible.
Please see `mlr -h` for more information.
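To make the roles of the three separators concrete, here is a minimal Python sketch of splitting DKVP text with arbitrary, possibly multi-character, record (RS), field (FS), and pair (PS) separators. The function name `parse_dkvp` is hypothetical, and the sketch ignores details such as how Miller keys fields that lack a pair separator.

```python
def parse_dkvp(text, rs="\n", fs=",", ps="="):
    """Split text into records (on rs), fields (on fs), and key/value pairs (on ps)."""
    records = []
    for line in text.split(rs):
        if not line:
            continue  # skip the empty chunk after a trailing record separator
        record = {}
        for field in line.split(fs):
            key, _, value = field.partition(ps)
            record[key] = value
        records.append(record)
    return records


# CRLF-terminated input, analogous to --rs crlf.
print(parse_dkvp("a=1,b=2\r\nc=3\r\n", rs="\r\n"))

# Multi-character field and pair separators.
print(parse_dkvp("a:=1;;b:=2\n", fs=";;", ps=":="))
```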
There is one minor, backward-incompatible change which I felt not worth calling this 3.0.0: default field separator for NIDX format is now space, not comma.
Improved read performance for RFC4180 CSV
Resolves #51
RFC-compliant CSV input is now about 60% faster than at initial feature release (https://github.com/johnkerl/miller/releases/tag/v2.0.0). It remains about 50% slower than CSV-lite.
Reduce tar-file size
Addresses #61