
tseries class #294 (Open)

wants to merge 33 commits into master

Conversation

kostrzewa (Member)

This is a proposal for a new type of container for hadron to deal with arbitrary timeseries data. The basic idea is to represent everything as tidy data frames and to implement transformations and analysis on these.

@kostrzewa (Member Author) commented Jul 28, 2020

The reason I started this is that I wanted a data structure that makes it easy to bootstrap the determination of the gradient flow scales. The gradient flow quantities are a little like correlation functions in that they have a "second" dimension (flow time) [source-sink separation in the cf case] which is orthogonal to the measurement direction. Since I did not want to use just a 2D array to which all possible "observables" and "source-sink separations" are added, it became quite natural to represent the timeseries as a tidy data frame with "data" variables and "explanatory" variables.

A 3D field observed at 100 "md times" could for example be represented as follows:

md_idx <- 1:100
# all combinations of the spatial coordinates and the md time index
xyz_md <- expand.grid(x = 0:9, y = 0:9, z = 0:9, md_idx = md_idx)

# one (noisy) value per row of the tidy grid
val <- sin(xyz_md$x) * cos(xyz_md$y) * log(1 + xyz_md$z) * rnorm(nrow(xyz_md))
threed_field_tseries <- 
  tseries_orig(data = cbind(xyz_md, val = val),
               explanatory_vars = c('x', 'y', 'z'))

and bootstrapped

bs_threed_field_tseries <- 
  bootstrap.tseries(threed_field_tseries,
                    boot.R = 100, boot.l = 10, seed = 12345,
                    sim = 'geom', endcorr = TRUE, serial = FALSE)

Compatible timeseries can be added, subtracted or multiplied

diff_threed_field_tseries <- bs_threed_field_tseries - bs_second_threed_field_tseries

and the tidy format makes them easy to plot

p <- ggplot2::ggplot(diff_threed_field_tseries$central,
                     ggplot2::aes(x = x, y = y, fill = val)) +
       ggplot2::facet_wrap(z ~ ., ncol = 5) +
       ggplot2::geom_tile() +
       ggplot2::labs(title = "diff_threed_field_tseries") +
       ggplot2::theme_bw()
plot(p)

[image: facet-wrapped tile plot of diff_threed_field_tseries$central]

For threed_field_tseries above, I could add multiple timeseries to the data structure which all have the same explanatory_vars. I'm in the process of writing mutate functions which allow such a set of resampled timeseries to be combined into a new timeseries with the same explanatory variables. Similarly, one can imagine performing reductions along one of the dimensions: say we would like to sum over the z direction. This would transform the timeseries with explanatory_vars = c("x", "y", "z") into one with explanatory_vars = c("x", "y") (a sketch of such a reduction is given below). Of course, the situation of dealing with derived timeseries (without original data) is also foreseen.
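As an illustration only (not the actual tseries API; the real reductions would go through the mutate-style functions mentioned above), such a reduction over z amounts to the following on the underlying tidy data frame, assuming the data column is called val:

library(dplyr)

# sketch: sum the field over z, keeping (x, y, md_idx) as the remaining variables
summed_over_z <- threed_field_tseries$data %>%
  group_by(x, y, md_idx) %>%
  summarise(val = sum(val), .groups = "drop")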

R/timeseries.R Outdated
{
  idcs <- sample_idcs[row_idx, , drop = TRUE]
  # compute the row indices corresponding to the bootstrap sample indices
  df_sample_idcs <- unlist(lapply(idcs, function(x){ which(x == .tseries$data$md_idx) }))
@kostrzewa (Member Author):

I'm afraid this approach is doomed to failure because this statement (repeated boot.R times) is too expensive when .tseries$data$md_idx is of length O(100k), which is the case for the situations of interest... Basically I think I've found the reason why high-dimensional data should not be stored in long format if it is to be bootstrapped... back to the drawing board!

@kostrzewa (Member Author):

I have a workaround using the grr package which at least makes it bearable...
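For reference, one generic way to avoid the repeated which() scans is to build the lookup from md_idx value to row indices once, outside the bootstrap loop. This is only an illustrative base-R alternative, not the grr-based workaround used here:

# build the md_idx -> row-indices lookup once
row_lookup <- split(seq_len(nrow(.tseries$data)), .tseries$data$md_idx)

# per bootstrap sample, a cheap list lookup replaces the which() scan
df_sample_idcs <- unlist(row_lookup[as.character(idcs)], use.names = FALSE)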

@martin-ueding (Contributor)

Apparently I'm drawn to play devil's advocate as my first reaction was: Tidy data is nice, but what about performance? In the Lüscher Analysis I found that data frames with bootstrapped quantities (boot summary) tend to have millions of elements and become hard to handle.

As an aside, geom_raster is faster than geom_tile if you have a grid-like layout, which is exactly what expand.grid gives you.
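Applied to the plotting snippet above, that is roughly the following (just an illustrative swap of the geom; the rest of the call is unchanged):

# on a regular grid, geom_raster is a faster drop-in replacement for geom_tile
p <- ggplot2::ggplot(diff_threed_field_tseries$central,
                     ggplot2::aes(x = x, y = y, fill = val)) +
       ggplot2::facet_wrap(z ~ ., ncol = 5) +
       ggplot2::geom_raster()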

I am not exactly sure what your operations do, but would group_by from dplyr be helpful for you?

@kostrzewa (Member Author)

> Apparently I'm drawn to play devil's advocate as my first reaction was: Tidy data is nice, but what about performance? In the Lüscher Analysis I found that data frames with bootstrapped quantities (boot summary) tend to have millions of elements and become hard to handle.

To be honest, performance overall is excellent, even for rather high-dimensional data. The only problem is the bootstrapping itself, which requires lots of calls to which, and that turns out to be surprisingly slow. So slow, in fact, that the grr package provides a much faster implementation. There are still many calls, but it's completely within acceptable bounds.

Once the bootstrap samples have been generated, the functions that dplyr >= 1.0.0 provides are more than fast enough and bootstrapping fits to the data is rather snappy.

> I am not exactly sure what your operations do, but would group_by from dplyr be helpful for you?

I use group_by_at because the grouping variables are dynamic (as I support arbitrary-dimensioned data).
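In sketch form (assuming the tseries object exposes $data and $explanatory_vars as in tseries_orig above; the summary itself is only a placeholder):

# group by whatever explanatory variables this particular tseries carries,
# then reduce each group; mean(val) stands in for the real operation
grouped <- dplyr::group_by_at(ts$data, .vars = ts$explanatory_vars)
reduced <- dplyr::summarise(grouped, val = mean(val), .groups = "drop")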

I have some documentation left to write and hope to get around to that soon so that I can push the latest version for review.

@kostrzewa (Member Author)

> As an aside, geom_raster is faster than geom_tile if you have a grid-like layout.

Yeah, now that you mention it, I remember that I used geom_raster elsewhere to draw correlation matrices.

@martin-ueding (Contributor)

That's good to hear! I do like the dynamic flexibility of the tidy data format.

It feels a bit as if you are blurring the hierarchy of paramvalf, right? In my analysis I use the hadron cf class and a bunch of parameters separate from that. With your data structure one can basically have various parameters in the tseries class itself, with actual support for them? That sounds very versatile!

@kostrzewa (Member Author) commented Aug 3, 2020

> It feels a bit as if you are blurring the hierarchy of paramvalf, right?

No, I don't think so. One wouldn't use this, for example, to combine the data from multiple ensembles (unless one is applying a transformation to the bootstrap samples of multiple ensembles, such as a fit, at which point "ensemble" ceases to be an explanatory variable). It would be possible, however.

> In my analysis I use the hadron cf class and a bunch of parameters separate from that. With your data structure one can basically have various parameters in the tseries class itself, with actual support for them? That sounds very versatile!

The idea really is just to support arbitrary-dimensioned timeseries data without the meta-data requirements of cf and raw_cf (things like specifying nrObs, Time and so on), which don't apply here. Then I provide functions to bootstrap and jackknife this data. Further, the basic arithmetic operations are supported (carried out on the central value and the resampling samples, with automatic calculation of standard errors). Finally, I provide functionality akin to mutate and transmute in dplyr, where one provides a "plan" of transformations in the form of expressions which are then applied to each observable in a tseries object, either on the original data (for special cases) or on the samples and central value. For example, this function interpolates all gradient flow quantities to specific value(s) of the flow time:

gf_interpolate <- function(raw_gradflow, target_ts){
  dplyr_avail <- requireNamespace("dplyr")
  if( !dplyr_avail ){
    stop("The 'dplyr' package is required to use this function!\n")
  }
  
  gf_tseries <- tseries_orig(data = dplyr::rename(raw_gradflow, md_idx = traj),
                             explanatory_vars = c("t"))

  all_obs <- dplyr::setdiff(colnames(gf_tseries$data), c("t", "md_idx"))

  res <- lapply(
    X = target_ts,
    FUN = function(target_t){
      reduce_plan <- lapply(
        X = all_obs,
        FUN = function(obs){
          exp_string <- sprintf("approx(x = t, y = %s, xout = %f)$y",
                                obs, target_t$val)
          parse(text = exp_string)
        }
      )
      names(reduce_plan) <- sprintf("%s_%s", all_obs, rep(target_t$name, times = length(all_obs)))

      apply_reduce_plan.tseries(ts = gf_tseries, reduce_vars = c("t"), reduce_plan)$data
    }
  )
  return(res)
}

where raw_gradflow is a tidy data frame with columns

traj t P Eplaq Esym tsqEplaq tsqEsym Wsym
[...]

and may contain further columns. target_ts is a list of pairs (name, val) which give the name of the flow time scale and the value of t it corresponds to (so things like t_0, w_0, t_1).
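Concretely, target_ts could be shaped like this (the numerical values are made up purely for illustration):

target_ts <- list(
  list(name = "t0", val = 2.5),   # flow time corresponding to t_0 (invented value)
  list(name = "w0", val = 3.1)    # flow time corresponding to w_0 (invented value)
)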

The call apply_reduce_plan.tseries(ts = gf_tseries, reduce_vars = c("t"), reduce_plan) applies the expression approx(x = t, y = %s, xout = %f)$y to the data, automatically grouping as appropriate.

It's a bit opaque, so I need to make an effort to write good examples and documentation...

}
else if(fLength > length(tmpdata$V1)) {
stop(paste("readcmifiles: file", files[i], "is too short. Aborting...\n"))
dat <- my_lapply(
@kostrzewa (Member Author):

@urbach if you're curious, this pattern of iterating in parallel using mclapply (or in this case pbmclapply) over the files and binding the results together with do.call(rbind, dat) in line 281 below is about a factor of 100 faster than building the data frame ahead of time and replacing it bit by bit with the read data.

@kostrzewa (Member Author):

Even single-threaded, it is many, many times faster.
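For context, the pattern being described is roughly the following sketch (read_one_file, the file list, and the column layout are hypothetical placeholders, not the actual readcmifiles code):

library(parallel)

# hypothetical list of input files
files <- list.files("path/to/data", full.names = TRUE)

# hypothetical per-file reader returning one tidy data frame per file
read_one_file <- function(file) {
  tmp <- read.table(file)
  data.frame(md_idx = tmp$V1, val = tmp$V2)
}

# read all files in parallel, then bind the pieces together in one step,
# instead of preallocating a big data frame and overwriting it file by file
dat <- parallel::mclapply(files, read_one_file, mc.cores = 4)
dat <- do.call(rbind, dat)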

…utput.data' by a 'skip_output_data' parameter and instead rely on a 'traj_from' parameter to filter based on the trajectory index. This of course breaks ALL analysis drivers...
…used to express all auto-correlation times in terms of unit length trajectories and to plot non-unit-length trajectories at unit-length coordinates
…charge also at t = (w0/a)^2 to make it comparable across lattice spacings (instead of just using the maximal flow time)