
tseries class #294 (Open)

wants to merge 33 commits into master

Conversation

kostrzewa (Member)

This is a proposal for a new type of container for hadron to deal with arbitrary timeseries data. The basic idea is to represent everything as tidy data frames and to implement transformations and analysis on these.

@kostrzewa (Member Author) commented Jul 28, 2020

The reason I started this is that I wanted a data structure that makes it easy to bootstrap the determination of the gradient flow scales. The gradient flow quantities are a little like correlation functions in that they have a "second" dimension (flow time) [source-sink separation in the cf case] which is orthogonal to the measurement direction. Since I did not want to use just a 2D array to which all possible "observables" and "source-sink separations" are added, it became quite natural to represent the timeseries as a tidy data frame with "data" variables and "explanatory" variables.

A 3D field observed at 100 "md times" could for example be represented as follows:

md_idx <- 1:100
# all combinations of the spatial coordinates and the md time index
xyz_md <- expand.grid(x = 0:9, y = 0:9, z = 0:9, md_idx = md_idx)

# one (noisy) value per row of the tidy grid
val <- sin(xyz_md$x) * cos(xyz_md$y) * log(1 + xyz_md$z) * rnorm(nrow(xyz_md))
threed_field_tseries <- 
  tseries_orig(data = cbind(xyz_md, val = val),
               explanatory_vars = c('x', 'y', 'z'))

and bootstrapped

bs_threed_field_tseries <- 
  bootstrap.tseries(threed_field_tseries,
                    boot.R = 100, boot.l = 10, seed = 12345,
                    sim = 'geom', endcorr = TRUE, serial = FALSE)

Compatible timeseries can be added, subtracted or multiplied

diff_threed_field_tseries <- bs_threed_field_tseries - bs_second_threed_field_tseries

and the tidy format makes them easy to plot

p <- ggplot2::ggplot(diff_threed_field_tseries$central,
                     ggplot2::aes(x = x, y = y, fill = val)) +
       ggplot2::facet_wrap(z ~ ., ncol = 5) +
       ggplot2::geom_tile() +
       ggplot2::labs(title = "diff_threed_field_tseries") +
       ggplot2::theme_bw()
plot(p)

[image: facet-wrapped tile plot of diff_threed_field_tseries$central]

For threed_field_tseries above, I could add multiple timeseries to the data structure which all have the same explanatory_vars. I'm in the process of writing mutate functions which allow such a set of resampled timeseries to be combined into a new timeseries with the same explanatory variables. Similarly, one can imagine performing reductions along one of the dimensions: say we would like to sum over the z direction. This would transform the timeseries with explanatory_vars = c("x", "y", "z") into one with explanatory_vars = c("x", "y") (a sketch of such a reduction is given below). Of course, the situation of dealing with derived timeseries (without original data) is also foreseen.
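As an illustration only (not the actual tseries API; the real reductions would go through the mutate-style functions mentioned above), such a reduction over z amounts to the following on the underlying tidy data frame, assuming the data column is called val:

library(dplyr)

# sketch: sum the field over z, keeping (x, y, md_idx) as the remaining variables
summed_over_z <- threed_field_tseries$data %>%
  group_by(x, y, md_idx) %>%
  summarise(val = sum(val), .groups = "drop")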

R/timeseries.R Outdated
{
  idcs <- sample_idcs[row_idx, , drop = TRUE]
  # compute the row indices corresponding to the bootstrap sample indices
  df_sample_idcs <- unlist(lapply(idcs, function(x){ which(x == .tseries$data$md_idx) }))
@kostrzewa (Member Author):

I'm afraid this approach is doomed to failure because this statement (repeated boot.R times) is too expensive when .tseries$data$md_idx is of length O(100k), which is the case for the situations of interest... Basically I think I've found the reason why high-dimensional data should not be stored in long format if it is to be bootstrapped... back to the drawing board!

@kostrzewa (Member Author):

I have a workaround using the grr package which at least makes it bearable...
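For reference, one generic way to avoid the repeated which() scans is to build the lookup from md_idx value to row indices once, outside the bootstrap loop. This is only an illustrative base-R alternative, not the grr-based workaround used here:

# build the md_idx -> row-indices lookup once
row_lookup <- split(seq_len(nrow(.tseries$data)), .tseries$data$md_idx)

# per bootstrap sample, a cheap list lookup replaces the which() scan
df_sample_idcs <- unlist(row_lookup[as.character(idcs)], use.names = FALSE)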

@martin-ueding (Contributor)

Apparently I'm drawn to play devil's advocate as my first reaction was: Tidy data is nice, but what about performance? In the Lüscher Analysis I found that data frames with bootstrapped quantities (boot summary) tend to have millions of elements and become hard to handle.

As an aside, geom_raster is faster than geom_tile if you have a grid-like layout, which is exactly what expand.grid gives you.
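Applied to the plotting snippet above, that is roughly the following (just an illustrative swap of the geom; the rest of the call is unchanged):

# on a regular grid, geom_raster is a faster drop-in replacement for geom_tile
p <- ggplot2::ggplot(diff_threed_field_tseries$central,
                     ggplot2::aes(x = x, y = y, fill = val)) +
       ggplot2::facet_wrap(z ~ ., ncol = 5) +
       ggplot2::geom_raster()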

I am not exactly sure what your operations do, but would group_by from dplyr be helpful for you?

@kostrzewa (Member Author)

> Apparently I'm drawn to play devil's advocate as my first reaction was: Tidy data is nice, but what about performance? In the Lüscher Analysis I found that data frames with bootstrapped quantities (boot summary) tend to have millions of elements and become hard to handle.

To be honest, performance overall is excellent, even for rather high-dimensional data. The only problem is the bootstrapping itself, which requires lots of calls to which, and that turns out to be surprisingly slow. So slow, in fact, that the grr package provides a much faster implementation. There are still many calls, but it's completely within acceptable bounds.

Once the bootstrap samples have been generated, the functions that dplyr >= 1.0.0 provides are more than fast enough and bootstrapping fits to the data is rather snappy.

> I am not exactly sure what your operations do, but would group_by from dplyr be helpful for you?

I use group_by_at because the grouping variables are dynamic (as I support arbitrary-dimensioned data).
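In sketch form (assuming the tseries object exposes $data and $explanatory_vars as in tseries_orig above; the summary itself is only a placeholder):

# group by whatever explanatory variables this particular tseries carries,
# then reduce each group; mean(val) stands in for the real operation
grouped <- dplyr::group_by_at(ts$data, .vars = ts$explanatory_vars)
reduced <- dplyr::summarise(grouped, val = mean(val), .groups = "drop")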

I have some documentation left to write and hope to get around to that soon so that I can push the latest version for review.

@kostrzewa (Member Author)

> As an aside, geom_raster is faster than geom_tile if you have a grid-like layout.

Yeah, now that you mention it, I remember that I used geom_raster elsewhere to draw correlation matrices.

@martin-ueding (Contributor)

That's good to hear! I do like the dynamic flexibility of the tidy data format.

It feels a bit as if you are blurring the hierarchy of paramvalf, right? In my analysis I use the hadron cf class and a bunch of parameters separate from that. With your data structure one can basically have various parameters in the tseries class itself, with actual support for them? That sounds very versatile!

@kostrzewa (Member Author) commented Aug 3, 2020

> It feels a bit as if you are blurring the hierarchy of paramvalf, right?

No, I don't think so. One wouldn't use this, for example, to combine the data from multiple ensembles (unless one is applying a transformation to the bootstrap samples of multiple ensembles, such as a fit, at which point "ensemble" ceases to be an explanatory variable). It would be possible, however.

> In my analysis I use the hadron cf class and a bunch of parameters separate from that. With your data structure one can basically have various parameters in the tseries class itself, with actual support for them? That sounds very versatile!

The idea really is just to support arbitrary-dimensioned timeseries data without the meta-data requirements of cf and raw_cf (things like specifying nrObs, Time and so on), which don't apply here. Then I provide functions to bootstrap and jackknife this data. Further, the basic arithmetic operations are supported (carried out on the central value and the resampling samples, with automatic calculation of standard errors). Finally, I provide functionality akin to mutate and transmute in dplyr, where one provides a "plan" of transformations in the form of expressions which are then applied to each observable in a tseries object, either on the original data (for special cases) or on the samples and central value. For example, this function interpolates all gradient flow quantities to specific value(s) of the flow time:

gf_interpolate <- function(raw_gradflow, target_ts){
  dplyr_avail <- requireNamespace("dplyr")
  if( !dplyr_avail ){
    stop("The 'dplyr' package is required to use this function!\n")
  }
  
  gf_tseries <- tseries_orig(data = dplyr::rename(raw_gradflow, md_idx = traj),
                             explanatory_vars = c("t"))

  all_obs <- dplyr::setdiff(colnames(gf_tseries$data), c("t", "md_idx"))

  res <- lapply(
    X = target_ts,
    FUN = function(target_t){
      reduce_plan <- lapply(
        X = all_obs,
        FUN = function(obs){
          exp_string <- sprintf("approx(x = t, y = %s, xout = %f)$y",
                                obs, target_t$val)
          parse(text = exp_string)
        }
      )
      names(reduce_plan) <- sprintf("%s_%s", all_obs, rep(target_t$name, times = length(all_obs)))

      apply_reduce_plan.tseries(ts = gf_tseries, reduce_vars = c("t"), reduce_plan)$data
    }
  )
  return(res)
}

where raw_gradflow is a tidy data frame with columns

traj t P Eplaq Esym tsqEplaq tsqEsym Wsym
[...]

and may contain further columns. target_ts is a list of pairs (name, val) which give the name of the flow time scale and the value of t it corresponds to (so things like t_0, w_0, t_1).
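Concretely, target_ts could be shaped like this (the numerical values are made up purely for illustration):

target_ts <- list(
  list(name = "t0", val = 2.5),   # flow time corresponding to t_0 (invented value)
  list(name = "w0", val = 3.1)    # flow time corresponding to w_0 (invented value)
)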

The call apply_reduce_plan.tseries(ts = gf_tseries, reduce_vars = c("t"), reduce_plan) applies the expression approx(x = t, y = %s, xout = %f)$y to the data, automatically grouping as appropriate.

It's a bit opaque, so I need to make an effort to write good examples and documentation...

}
else if(fLength > length(tmpdata$V1)) {
stop(paste("readcmifiles: file", files[i], "is too short. Aborting...\n"))
dat <- my_lapply(
@kostrzewa (Member Author):

@urbach if you're curious, this pattern of iterating in parallel using mclapply (or in this case pbmclapply) over the files and binding the results together with do.call(rbind, dat) in line 281 below is about a factor of 100 faster than building the data frame ahead of time and replacing it bit by bit with the read data.

@kostrzewa (Member Author):

Even single-threaded, it is many, many times faster.
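For context, the pattern being described is roughly the following sketch (read_one_file, the file list, and the column layout are hypothetical placeholders, not the actual readcmifiles code):

library(parallel)

# hypothetical list of input files
files <- list.files("path/to/data", full.names = TRUE)

# hypothetical per-file reader returning one tidy data frame per file
read_one_file <- function(file) {
  tmp <- read.table(file)
  data.frame(md_idx = tmp$V1, val = tmp$V2)
}

# read all files in parallel, then bind the pieces together in one step,
# instead of preallocating a big data frame and overwriting it file by file
dat <- parallel::mclapply(files, read_one_file, mc.cores = 4)
dat <- do.call(rbind, dat)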

…utput.data' by a 'skip_output_data' parameter and instead rely on a 'traj_from' parameter to filter based on the trajectory index. This of course breaks ALL analysis drivers...
…used to express all auto-correlation times in terms of unit length trajectories and to plot non-unit-length trajectories at unit-length coordinates
…charge also at t = (w0/a)^2 to make it comparable across lattice spacings (instead of just using the maximal flow time)