tseries class #294 (base: master)

Conversation
The reason I started this is that I wanted a data structure that makes it easy to bootstrap the determination of the gradient flow scales. The gradient flow quantities are a little like correlation functions in that they have a "second" dimension (flow time) [source-sink separation in the `cf` case] which is orthogonal to the measurement direction. Since I did not want to use just a 2D array to which all possible "observables" and "source-sink separations" are added, it became natural to represent the timeseries as a tidy data frame with "data" variables and "explanatory" variables. A 3D field observed at 100 "md times" could, for example, be represented as follows:
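A minimal sketch of what such a tidy representation could look like, assuming hypothetical column names (`md_idx` for the measurement index, `x`, `y`, `z` for the field coordinates, `val` for the measured value) rather than whatever the PR actually uses:

```r
# hypothetical sketch: a 3D field measured at 100 md times as a tidy data
# frame; md_idx, x, y and z are "explanatory" variables, val is a "data"
# variable, and each row holds exactly one observation
coords <- expand.grid(x = 0:3, y = 0:3, z = 0:3)
df <- do.call(rbind, lapply(1:100, function(md_idx) {
  cbind(md_idx = md_idx, coords, val = rnorm(nrow(coords)))
}))
head(df)
```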
Such a timeseries can then be bootstrapped. Compatible timeseries can be added, subtracted or multiplied, and the tidy format makes them easy to plot; a sketch of both is shown below.
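A hedged sketch of the arithmetic and plotting idea, using plain data frames, `merge` and ggplot2 (the actual tseries operators in the PR are elided above and may work differently):

```r
library(ggplot2)

# two "compatible" timeseries: identical explanatory variables (md_idx, t),
# each with its own data variable
a <- data.frame(md_idx = rep(1:100, each = 3), t = rep(1:3, 100), val = rnorm(300))
b <- data.frame(md_idx = rep(1:100, each = 3), t = rep(1:3, 100), val = rnorm(300))

# addition of compatible timeseries: join on the explanatory variables,
# then operate on the data columns only
ab <- merge(a, b, by = c("md_idx", "t"), suffixes = c(".a", ".b"))
ab$val <- ab$val.a + ab$val.b

# the tidy format plugs directly into ggplot2
ggplot(ab, aes(x = md_idx, y = val, colour = factor(t))) +
  geom_line()
```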
For …
R/timeseries.R (Outdated)

```r
{
  idcs <- sample_idcs[row_idx, , drop = TRUE]
  # compute the row indices corresponding to the bootstrap sample indices
  df_sample_idcs <- unlist(lapply(idcs, function(x){ which(x == .tseries$data$md_idx) }))
```
I'm afraid this approach is doomed to failure because this statement (repeated `boot.R` times) is too expensive when `.tseries$data$md_idx` is of length O(100k), which is the case for the situations of interest... Basically I think I've found the reason why high-dimensional data should not be stored in long format if it is to be bootstrapped... back to the drawing board!
I have a workaround using the `grr` package which at least makes it bearable...
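A hedged sketch of how `grr` could replace the repeated `which()` calls from the snippet above, assuming the workaround is based on `grr::matches()` (the actual code is not shown here, and the ordering of the result may differ from the `lapply` version):

```r
library(grr)

# stand-ins for the real objects: a long md_idx column and one bootstrap
# sample of measurement indices
md_idx <- rep(1:1000, each = 100)                # ~100k rows in long format
idcs   <- sample(1:1000, 1000, replace = TRUE)   # bootstrap sample indices

# baseline: one full which() scan over all ~100k rows per bootstrap index
slow <- unlist(lapply(idcs, function(x) which(x == md_idx)))

# grr::matches() finds all matching index pairs in one vectorised pass;
# column y holds the row indices into md_idx
m    <- matches(idcs, md_idx, all.x = FALSE, all.y = FALSE)
fast <- m$y
```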
Apparently I'm drawn to play devil's advocate, as my first reaction was: tidy data is nice, but what about performance? In the Lüscher analysis I found that data frames with bootstrapped quantities (boot summary) tend to have millions of elements and become hard to handle. As an aside, I am not exactly sure what your operations do, but would …
To be honest, performance overall is excellent, even for rather high-dimensional data. The only problem is the bootstrapping itself, which requires lots of calls to … Once the bootstrap samples have been generated, the functions that I use … I have some documentation left to write and hope to get around to that soon so that I can push the latest version for review.
Yeah, now that you mentioned it, I remember that I used …
That's good to hear! I do like the dynamic flexibility of the tidy data format. It feels a bit as if you are blurring the hierarchy of paramvalf, right? In my analysis I use the hadron …
No, I don't think so. One wouldn't use this, for example, to combine the data from multiple ensembles (unless one is applying a transformation to the bootstrap samples of multiple ensembles, such as a fit, at which point "ensemble" ceases to be an explanatory variable). It would be possible, however.
The idea really is to just support arbitrary-dimensioned timeseries data without the meta-data requirements of …, where … and may contain further columns. The call … It's a bit opaque, so I need to make an effort to write good examples and documentation...
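A purely hypothetical sketch of what such a container could look like; the constructor name, argument names and columns below are all my assumptions, since the actual calls are elided above (only the `$data` member is suggested by the diff excerpt earlier):

```r
# hypothetical: a tseries is a tidy data frame plus a declaration of which
# columns are explanatory variables and which are data variables
make_tseries <- function(data, explanatory, values) {
  stopifnot(all(c(explanatory, values) %in% names(data)))
  structure(list(data = data, explanatory = explanatory, values = values),
            class = "tseries")
}

ts <- make_tseries(
  data = data.frame(md_idx = rep(1:100, each = 3),
                    t      = rep(c(0.5, 1.0, 1.5), 100),
                    val    = rnorm(300)),
  explanatory = c("md_idx", "t"),
  values      = "val"
)
```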
…the PP fit and the combined PP/PA fit on the M_ps effective mass plot. Use the value of the latter as a representative value for M_ps in the result table.
…ow analysis to not lose it
…en online measurement data and the rows of
```r
}
else if(fLength > length(tmpdata$V1)) {
  stop(paste("readcmifiles: file", files[i], "is too short. Aborting...\n"))
dat <- my_lapply(
```
@urbach if you're curious, this pattern of iterating in parallel using `mclapply` (or in this case `pbmclapply`) over the files and binding the results together with `do.call(rbind, dat)` in line 281 below is about a factor of 100 faster than building the data frame ahead of time and replacing it bit by bit with the read data.
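As an illustration of the pattern described above (the per-file reader and the file list are placeholders, not the actual readcmifiles code):

```r
library(parallel)

# placeholder per-file reader; the actual readcmifiles parsing is elided
read_one <- function(f) read.table(f, header = FALSE)

files <- list.files("data", full.names = TRUE)

# read each file into its own small data frame, in parallel ...
dat <- mclapply(files, read_one, mc.cores = 4)

# ... and bind them all together once at the end. This avoids the slow
# alternative of preallocating one big data frame and overwriting it
# piece by piece, which repeatedly copies the whole object.
combined <- do.call(rbind, dat)
```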
Even single-threaded, it is many, many times faster.
…utput.data' by a 'skip_output_data' parameter and instead rely on a 'traj_from' parameter to filter based on the trajectory index. This of course breaks ALL analysis drivers...
…used to express all auto-correlation times in terms of unit length trajectories and to plot non-unit-length trajectories at unit-length coordinates
…charge also at t = (w0/a)^2 to make it comparable across lattice spacings (instead of just using the maximal flow time)
…large (as happens sometimes during thermalisation)
…e chosen in analysis_tmlqcd_gradient_flow. This can help with gradient flow data for which the flow time was not chosen long enough to reach the typical definition point (0.3).
…t non-integer labels for the gamma structures and to include an optional sample label
…d structure to store real and imaginary components of correlation functions instead of storing (r,i) pairs in sequence, we thus need a new IO function for this
This is a proposal for a new type of container for hadron to deal with arbitrary timeseries data. The basic idea is to represent everything as tidy data frames and to implement transformations and analysis on these.