Problem with cA2.60.32 on P=-10-1 irrep=B2 #30

Open
martin-ueding opened this issue May 6, 2020 · 8 comments

@martin-ueding
Contributor

martin-ueding commented May 6, 2020

I am re-running the projections on cA2.60.32, and they work fine for almost all irreps in every configuration. The one exception is P = (-1, 0, -1) in the B₂ irrep, which fails on every configuration, always with this output:

Opening HDF5 files …
[1] "correlators/C2c_cnfg0000.h5"
[1] "correlators/C4cC_cnfg0000.h5"
[1] "correlators/C4cD_cnfg0000.h5"
[1] "correlators/C6cC_cnfg0000.h5"
[1] "correlators/C6cCD_cnfg0000.h5"
[1] "correlators/C6cD_cnfg0000.h5"
  Done
Loading correlators from HDF5 files …

 *** caught segfault ***
address 0x22039, cause 'memory not mapped'

Traceback:
 1: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,     compoundAsDataFrame = compoundAsDataFrame, drop = drop, ...)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
 6: try({    obj <- H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile,         h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame,         drop = drop, ...)})
 7: h5readDataset(h5dataset, index = index, start = start, stride = stride,     block = block, count = count, compoundAsDataFrame = compoundAsDataFrame,     drop = drop, ...)
 8: h5read(file_handles[[diagram]], datasetname)
 9: FUN(X[[i]], ...)
10: lapply(needed_names, load_dataset)
11: numericprojection::numeric_projection(c(-1, 0, -1), "B2", 0)
An irrecoverable exception occurred. R is aborting now ...

I have tried restarting these jobs, but that did not help. We had some random segfaults before, but this one is consistent. It seems to have something to do with the actual files, and it happens on all of the nodes that I have tried.

The only difference in input is the prescription file, and that does not differ from the one used for the other ensembles. The cases related to this one by a global rotation work just fine.

For the time being I will just skip that B₂ irrep at P² = 2, but it feels very peculiar and I still have no idea what is happening there.

@martin-ueding
Contributor Author

For some reason this went through for two configurations this time:

$ ls resolved_-10-1_B2_*
resolved_-10-1_B2_2496.js  resolved_-10-1_B2_5328.js
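
To see which configurations are still missing, the output file names can be compared against the expected configuration numbers. A minimal sketch in Python (chosen only for illustration, the project itself is R), assuming the naming pattern `resolved_<momentum>_<irrep>_<config>.js` inferred from the two files above. The function `missing_configs` and the list of expected configurations are hypothetical.

```python
import re


def missing_configs(filenames, momentum, irrep, expected):
    """Return the expected configuration numbers that have no output file.

    Assumes files are named resolved_<momentum>_<irrep>_<config>.js,
    a pattern inferred from the listing above.
    """
    pattern = re.compile(
        "resolved_" + re.escape(momentum) + "_" + re.escape(irrep) + r"_(\d+)\.js"
    )
    found = set()
    for name in filenames:
        match = pattern.fullmatch(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(expected) - found)


if __name__ == "__main__":
    # Hypothetical list of expected configurations, for illustration only.
    names = ["resolved_-10-1_B2_2496.js", "resolved_-10-1_B2_5328.js"]
    print(missing_configs(names, "-10-1", "B2", [0, 2496, 5328]))
```

In practice the file names could come from `os.listdir(".")`, and the expected list from whatever defines the set of configurations for the ensemble.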

@kostrzewa
Member

This is very strange. My first instinct would be to guess that it's related to having too many HDF5 files open at the same time (I could imagine that these are internally opened using mmap), but this would suggest that things would also fail elsewhere.

@kostrzewa
Member

I guess in the original description you mean (-1, 0, -1) rather than (-1, 0, 1), correct?

@martin-ueding
Contributor Author

I really don't get it either. There are not too many HDF5 files open: I start a new R process for every configuration and every irrep, and it just crashes. Since it worked on two configurations, there cannot be something completely wrong with the program or the files.
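
With one process per configuration and irrep, a driver can at least record which jobs died from a signal, since a segfault in the child does not take the driver down. A minimal sketch in Python (chosen only for illustration, the project itself is R), assuming a POSIX system where `subprocess` reports death by signal as a negative return code. The helper `run_isolated` and the stand-in child commands are hypothetical.

```python
import signal
import subprocess
import sys


def run_isolated(cmd, timeout=None):
    """Run cmd in a child process and classify how it ended.

    Returns "ok", "exit <code>", or "signal <NAME>". On POSIX, a child
    killed by a signal has a negative return code.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    code = proc.returncode
    if code == 0:
        return "ok"
    if code < 0:
        try:
            return "signal " + signal.Signals(-code).name
        except ValueError:
            # Signal number without a symbolic name on this platform.
            return "signal " + str(-code)
    return "exit " + str(code)


if __name__ == "__main__":
    # Stand-in children: one exits cleanly, one kills itself with SIGSEGV.
    good = [sys.executable, "-c", "pass"]
    bad = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
    ]
    print(run_isolated(good))
    print(run_isolated(bad))
```

For the real jobs the command might be something like `["Rscript", "-e", "numericprojection::numeric_projection(c(-1, 0, -1), 'B2', 0)"]`, using the call from the traceback. Whether the jobs are actually launched that way is an assumption.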

@kostrzewa
Member

I meant globally. When there are O(30) projection jobs running, the number of memory-mapped files will be rather large, and this might be problematic for Lustre. What if you run a projection for a single config on QBIG?

@martin-ueding
Contributor Author

> What if you run a projection for a single config on QBIG?

After all the projections were done, I did try that to see what the issue was. It seems that the problem occurs even when a single irrep is the only projection running on the whole cluster.

I will find out how the other ensembles fare with that. Perhaps it is always this irrep, or perhaps it is just this irrep on cA2.60.32.

@kostrzewa
Member

Hah, we figured it out in the end. @matfischer observed the same problem and it was solved by reinstalling rhdf5 :)

@kostrzewa
Member

not so fast, apparently...

@kostrzewa kostrzewa reopened this Jan 21, 2021