Dimension convention and table <--> matrix conversions #50
Disagreed. I'm against introducing case distinctions into the interface. If this is a common problem, it may be a better idea to provide helper functionality at the interface locations, rather than performing ad-hoc surgery on the interface design? (Or maybe I misunderstood your suggestion?)
@ablaom's opinion is perhaps the most important here: he seems like the most likely person to suffer the consequences (bad or good).
I'm not sure I understand the problem exactly. For supervised models, data arrives as any generic table (currently a Query iterable table) whose type is completely unknown. MLJBase presently provides a convenience function for conversion. Perhaps the benefit you are looking for is one of the casualties of an agnostic data interface?

Edit: Query.jl has since been dropped in favour of Tables.jl.
Apologies for the lack of clarity, and maybe my suggestion doesn't make sense. For simplicity, let's just assume the input is a matrix that the user has loaded somehow. My question is: if the user feeds a transposed matrix rather than a plain one, will things still work?
@tlienart, and how on earth should any machinery know, or distinguish between, the two orientations? That is, how should any machinery incapable of mind reading know?

@ablaom, I very much agree with the agnostic data interface. There's a standard way in which the interface expects data, and I only see problems arising from adding redundant type or formatting options.
No no, I think the issue I'm raising is much simpler than the data-agnostic interface, with which I wholeheartedly agree. The key element is that in Julia, a transpose is its own type and not a `Matrix` (it's a kind of "view", if you like). See below:

```julia
X = randn(50, 10);
typeof(X)            # Array{Float64,2}
typeof(transpose(X)) # LinearAlgebra.Transpose{Float64,Array{Float64,2}}
```

So if your functions only expect a `Matrix`:

```julia
foo(X::Matrix{Float64}) = maximum(X)
foo(X)            # will work
foo(transpose(X)) # will NOT work (MethodError)
```

To fix this you need to write:

```julia
foo(X::AbstractMatrix{Float64}) = maximum(X)
foo(X)            # will work as before
foo(transpose(X)) # will ALSO work
```

I hope this clarifies the confusion, sorry. So all I'm saying is to make sure that no error is thrown if a transpose is fed by the user (who may not know that MLJ expects a plain `Matrix`).
Strictly speaking, the user should not feed a matrix at all; the interface expects tabular data. You raise an issue here, which is what we should do about the user who inserts a matrix anyway. It would be nice to allow it (because it saves conversions most of the time), but there is the danger of abuse. Ah, but for supervised models there is no danger.
Ah, I understand. I'm slightly shocked by the fact that in Julia, transposes do not behave like matrices, or, at least, do not live in a union type or similar. But surely there are good reasons for this? Anyway, regarding MLJ, this seems to be an issue with transposes rather than with the MLJ interface.
Ok, fair enough. However, I imagine that this (the user feeding a matrix) will happen a lot, don't you think? Especially anyone coming from scikit-learn (the Python one) might try to do that. I would just suggest that the type annotations accept `AbstractMatrix` rather than `Matrix`. To avoid issues we could then check in the fitting code that the dimensions match.
Note that "matching shapes" for `X` and `y` means that the number of rows of `X` equals the length of `y`.
I think the union type you're thinking of, @fkiraly, is precisely `AbstractMatrix`.
Regarding @ablaom's earlier suggestion on checking equal length of `X` and `y`, etc.: I'd recommend avoiding too many case distinctions, as they tend to make the interface more brittle. It's very easy to forget which different cases or coercions need to be covered, unless there's a dedicated interface point for input handling.
About `AbstractMatrix`: it would solve the dispatch problem, but not the problem that there are still arrays and transposes floating around. So if the user may only use one of the three obvious types that could pop up, two of the three cases will still throw an error. Pragmatically, if we burden the user with a type restriction anyway (since they may not use an array!), it should be the more sensible restriction of "use an iterable, please".
Previous comment made before reading the preceding two, sorry. I don't agree on dimension checks at the level of model definitions. Surely that role lives at a higher level of the interface. There still seems to be some confusion about the responsibility of the interface implementer, and over what she has no control. The implementation author writes the low-level fitting code; what the user supplies at the top level is outside her control.
Re @fkiraly:

There is one single natural point for supervised models to check that `X` and `y` have matching shapes.
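A minimal sketch of what such a single, central shape check might look like, under the n x p convention (rows are observations); the helper name is illustrative, not actual MLJ code:

```julia
# Hypothetical central validation point: data and model meet here,
# so individual model fit methods need no dimension checks of their own.
function check_dimensions(X::AbstractMatrix, y::AbstractVector)
    size(X, 1) == length(y) ||
        throw(DimensionMismatch("X has $(size(X, 1)) rows; y has $(length(y)) entries"))
    return nothing
end

check_dimensions(randn(50, 10), randn(50))   # passes silently
```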
Right, so if I understand you correctly, you are suggesting to perform such checks at a single point in the interface rather than in each model?
I'm sorry if I'm being dense and not translating your suggestion properly. I just believe that the possibility that a user tries to give a transposed matrix is extremely likely, especially for non-technical users, who would also be confused by the fact that a transpose is not a `Matrix`.
Yes. Once you have decided on the natural choice of type for the data, the question becomes: should we allow the user (at the machine level) to present matrix/abstract-matrix inputs?

Pros: performance benefit (but not at an expected bottleneck, 99% of the time).

Cons: the designers of both generic data interfaces decided against viewing matrices as tables, probably for good reasons (although I don't know what they are).

I vote for simplicity. No matrices. Provide the user with a conversion function. What do others think?
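For what it's worth, Tables.jl now offers a copy-free matrix wrapper; a sketch of the kind of conversion function proposed here, assuming a recent Tables.jl:

```julia
using Tables

# Wrap a user-supplied matrix as a table; columns are auto-named
# Column1, Column2, ... by Tables.jl.
X = rand(4, 3)
Xtable = Tables.table(X)

Tables.istable(Xtable)   # true
Tables.matrix(Xtable)    # and back again, for algorithms that want a matrix
```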
I'm good with your suggestion favoring simplicity, as long as there's enough testing of the interfaces with various inputs.
Agreed, and also agreed with @ablaom in particular on the additional argument that the model is the wrong interface level for random type coercion. As @ablaom implicitly points out, in fact we already do have a dedicated input handling interface: the data container/provider framework iterables. Hence the clean way to deal with matrices is on the data side of this rather than on the model/MLJ side. Which would mean that, in consequence, the clean way of dealing with the matrices is raising an issue with IterableTables.jl rather than here.
Please file an issue in Tables.jl if you think a method wrapping a matrix as a table can be useful. See also JuliaData/Tables.jl#56 and JuliaData/Tables.jl#58, which are a bit different but related.
Important design question: it would be nice to avoid unnecessary data "coercions" (usually table -> matrix) in meta-modelling. The two main cases are ensembling and resampling by cross-validation.

The meta-algorithms I have in mind need to apply pattern-selection to the data. In the simple design this is always row-access to the user-supplied table, under the convention that rows are observations. The alternative is to not impose a convention, in which case we may as well drop the convention altogether.

My main disappointment, if we go the simple route, is the case of homogeneous ensembles, which can be quite large. In most other use-cases I can imagine, the performance/resource bottlenecks are elsewhere. What do others think?

Edit: an extra complication with the "sophisticated" option is that it requires adding extra methods to support it.
I don't have a strong opinion regarding this, but it may be worth looking at how StatsModels handles this, since it does a lot of similar work (see JuliaStats/StatsModels.jl#71 for the upcoming implementation).
Question/request for clarification, @ablaom: why, or in which form, are coercions necessary in meta-modelling? I've been trying to think about what you mean, but I'm not sure. Isn't indexing by row index set (as supported by tables) sufficient here? Why does it have to be a matrix?
The external package algorithm being wrapped may require a matrix for its fit method (and a matrix may indeed be the most efficient form for it to work with). Remember, we have no control over the preferred input of the externally defined algorithm. Also, row selection of a table is generally slower (may be serial, as opposed to random-access). Consider, for the sake of illustration, 2-fold cross-validation: I want to fit on one half of the rows and test on the other, then swap.

Simple design: select rows from the table and coerce each selection to a matrix, once per fit/predict call.

Sophisticated design: coerce the whole table to a matrix once, then select rows of the matrix.
The simple design requires four coercions from table to matrix (say), and the selection is generally slower (serial, not random-access). The sophisticated design requires one coercion, and the selection is generally faster. These coercions probably don't matter too much for CV, but when ensembling lots of simple models, the fit may be on the same order as coercion/selection.
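A minimal Julia sketch of the two designs, with hypothetical helpers `matrix` (table-to-matrix coercion) and `selectrows` (row selection of a table); all names are illustrative, not the actual MLJ API:

```julia
# Simple design: select rows of the table, coerce each selection.
# 2-fold CV => four table-to-matrix coercions (two fits, two predicts).
function cv_simple(model, Xtable, y, A, B)
    m1 = fit(model, matrix(selectrows(Xtable, A)), y[A])   # coercion 1
    e1 = predict(m1, matrix(selectrows(Xtable, B)))        # coercion 2
    m2 = fit(model, matrix(selectrows(Xtable, B)), y[B])   # coercion 3
    e2 = predict(m2, matrix(selectrows(Xtable, A)))        # coercion 4
    return e1, e2
end

# Sophisticated design: coerce once, then use fast matrix row indexing.
function cv_sophisticated(model, Xtable, y, A, B)
    X = matrix(Xtable)                                     # single coercion
    m1 = fit(model, X[A, :], y[A])
    e1 = predict(m1, X[B, :])
    m2 = fit(model, X[B, :], y[B])
    e2 = predict(m2, X[A, :])
    return e1, e2
end
```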
@ablaom, I see. I still personally think that the onus should be on the table to provide a uniform interface, with quick-access functionality optional. E.g., if an "access_as_matrix" flag is set internally in `Xtable`, it would not need coercion, or coercion would be instantaneous, since the outer layer is just an access layer. This would also make a lot of sense for other wrapped storage modalities, such as sparse matrices. In this respect, I disagree with the consensus in JuliaData/Tables.jl#56. In my opinion, it would make much sense to have functionality to wrap a matrix coercion-free with table accessor syntax, even if there is no 1:1 mapping between matrices and tables. Then, in the simple design, coercions would still be called, but would be of negligible computational impact, no? Since the table sees "I'm a matrix internally" and just loops the query through to random access internally.
@ablaom, on a higher level: I'm generally against "fixing a lack of features on the wrong side of an inter-package interface". Not only does this lead to interface proliferation (e.g., now you suddenly need to tell the ensembling wrapper when to coerce), but it may lead to the inter-package interface breaking when the other package goes through consolidation/refactoring, since it doesn't take into account the functionality you've sneakily tacked on in your package.
Hmm. I think I understand your objections, and do not at all disagree. However, I do not understand the proposed solution:
Agreed, but Tables.jl (and Query) are far from providing this kind of access. Tables.jl returns a row (or column) iterator. The production of this iterator may be fast or slow, depending on the table being accessed, but access is necessarily sequential from the outsider's point of view. At least this is my understanding.
I'm sorry, I don't understand. How, for example, can I set "access_as_matrix" for a CSV file (a kind of data source), or even a DataFrame (a vector of vectors of differing eltypes)? There is no matrix to access. Perhaps there is some misunderstanding about the sense in which Tables.jl is a "wrapper". Tables.jl does not wrap data in a new object. Rather, it provides methods for accessing all tabular data objects in the same way.
Sorry for creating confusion, @ablaom. It is perhaps created by me referring ambiguously to "Tables.jl in its current state" and "abstract iterator interfaces in general". A clean tabular data container interface usually provides a front-end, i.e., a user interface or access layer, and a back-end, i.e., a data interface or storage layer. As far as I understand, Tables.jl provides a unified set of access patterns for the front-end, and by proxy implements some back-ends through dispatch. If so, given the nice front-end, I don't see what would stop you from writing methods dispatching on matrices, or "my matrix plus schema" objects. Similarly, you could implement a back-end data container where internally you can switch matrix representation vs table representation on and off (via the "flag"). But none of these exist. So the wrapper I meant is not the accessor patterns in Tables.jl, but a purpose-built data structure which could be accessed via the patterns. Does that make more sense?
I.e., DataFrames.jl or similar could provide such functionality. At least there needs to be some type-based case distinction between accessor and back-end, since a matrix itself doesn't store a schema.
It could be possible to add a trait to Tables.jl indicating whether a given table type can be efficiently indexed. If not, you could copy its contents to a type that fulfills this requirement (e.g. using `columntable`).
@nalimilan Is there not still the problem that Tables.jl does not provide fast random-access methods, even if the type supports it, because only iterators are returned? Or is this a planned feature enhancement?
That can probably be considered if it fills a need. Or maybe you can already just call `Tables.columntable` and index into the resulting column vectors.
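For instance, a small sketch of the columntable route, assuming Tables.jl's `Tables.columntable` (which materializes a `NamedTuple` of column vectors):

```julia
using Tables

# An already column-based source: a NamedTuple of vectors.
nt = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

cols = Tables.columntable(nt)   # no copy needed for a column-based source
cols.a[[1, 3]]                  # fast random row access within a column
```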
@nalimilan Thanks for that. I did not realise that `Tables.columntable` essentially returns a view and not a copy where possible. This makes a big difference in those cases. So, if doing row selection, you would recommend first making a `Tables.columntable` rather than a `Tables.rowtable` (my current naive implementation)?

Perhaps you would be willing to review my current row/column selection add-on to Tables.jl? It is here.

I must say, it would be easier for a Tables.jl user like me if Tables.jl provided row/column/cartesian indexing options, so that I didn't have to think about such details. Is the reluctance to provide such functionality because the project has other priorities (perfectly understandable!), or is there some other objection to providing it? This functionality seems orthogonal to the existing functionality, as far as I can tell.

At this point it is difficult for me to say how much of a performance advantage I would get from Tables.jl providing the indexing feature. Mostly the benefit would be small, I'm sure. The cases I am thinking about are meta-algorithms like bagging and gradient boosting, which we would like to apply to arbitrary models (in external packages) and not just on a case-by-case basis. In these ensemble methods, one might have hundreds or thousands of "simple" learners which are extremely quick to train (e.g., a decision stump), and perhaps this training time is comparable to the time to access the data through "inefficient" indexing. (I put "inefficient" in quotes because I agree now this may or may not be so bad, depending on the supplied table type.)

To avoid "inefficient" indexing, MLJ must cache the data or wrap tables in objects with a (possibly) duplicate storage layer, along the lines I think @fkiraly is suggesting above. Both options are complications to our design which, in my opinion, are out of place.
@fkiraly On reflection, I am in complete agreement with your objections to my "sophisticated" attempts to circumvent repetitive data coercions. All row/column access to data should be applied through the one interface (Tables.jl, or some functional or functional/object wrap). I will deprecate the relevant functionality.
AFAICT it makes sense to collect the data into a column-based structure up front when repeated row access is needed.
I've looked at it quickly, and it looks good to me. Some features sound like they could be moved to CategoricalArrays.
I can't speak for @quinnj (better to comment at JuliaData/Tables.jl#48 for possible improvements to Tables.jl), but I can see at least two reasons. First, starting with iteration over rows and columns suits most use cases, so that's a good start. Second, efficient indexing isn't possible for all sources without making a copy, contrary to iteration, which works for all sources. But that doesn't mean we shouldn't investigate the most appropriate way of dealing with indexing. It would be too bad if packages had to duplicate features internally to make it work.

Anyway, I think you should really benchmark the different options first. Without reliable information on performance, design discussions cannot be very precise. If you know the most appropriate storage format you would need Tables.jl to produce (which could possibly vary depending on whether the source is column-oriented or row-oriented), that would help in discussing concrete options.
AFAIK, we use the `n x p` convention, where `n` is the number of observations. It seems to me that we should however make the whole machinery able to take transposes (especially after ranting against other packages not offering this by hardcoding `Matrix{T}`).

A fix would be to replace all occurrences of `::Matrix{Float64}` by `::AbstractMatrix{Float64}`.

Thoughts?