Si example problems #169

Zouzw76 · 2025-01-08T13:27:55Z

Zouzw76
Jan 8, 2025

Hi Nakib,
Thank you for the great work on elphbolt.
I am trying to run the silicon examples but have run into some problems. Could you please offer some insight?
Versions:
elphbolt 1.1 (passed all the tests)
opencoarrays-2.9.3
gcc-12.3.0
mpich-4.1.2
• When I ran the Si examples using the (6, 24) mesh with 16 images, elphbolt completed successfully. However, I encountered the following error after the message 'Thanks for using elphbolt. Bye!' was output:

The output ends with the following:

• Disregarding this error, I ran the Si example using a (50, 150) mesh on four nodes, each equipped with 28 Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz cores, just as you did. However, the code remained in a state like the one below for more than six hours. Is this normal, or could it be caused by the error mentioned above?

I am looking forward to your reply.
Best,
Ziwen

Answered by Zouzw76 · Mar 7, 2025 (the full answer is the Mar 7, 2025 post further down this thread)

nakib · 2025-01-08T13:41:27Z

nakib
Jan 8, 2025
Maintainer

Dear Ziwen,

Thanks for your kind words and your question.

It is not normal for the code to be stuck at that stage. I have seen this before when OpenCoarrays was built against an unsupported version of MPI. Can you try the following?

Unload any existing MPI module on your cluster.
Rebuild OpenCoarrays with

./install.sh --with-fortran <path to>/gfortran \
--with-cxx <path to>/g++ \
--with-c <path to>/gcc

This way, OpenCoarrays will download a compatible version of mpich during the build process.

Make sure that this internal version of mpich is actually being used by checking the output of caf --show and cafrun --show.
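As a rough sketch of that check: the snippet below prints what each wrapper reports and looks for the MPICH installation prefix in it. The `MPICH_PREFIX` value is a placeholder you need to set yourself, and the helper `uses_prefix` is hypothetical, not part of OpenCoarrays. It also assumes that the command printed by `--show` contains the path of the MPI it wraps, which may not hold on every install.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the text printed by a wrapper
# mentions the given MPICH installation prefix.
uses_prefix() {
  local shown="$1" prefix="$2"
  case "$shown" in
    *"$prefix"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Placeholder: set this to wherever install.sh put its own MPICH build.
MPICH_PREFIX="${MPICH_PREFIX:-/path/to/internal/mpich}"

for wrapper in caf cafrun; do
  if command -v "$wrapper" >/dev/null 2>&1; then
    shown="$("$wrapper" --show 2>&1)"
    if uses_prefix "$shown" "$MPICH_PREFIX"; then
      echo "$wrapper: references $MPICH_PREFIX"
    else
      echo "$wrapper: does NOT reference $MPICH_PREFIX"
      echo "  reported: $shown"
    fi
  else
    echo "$wrapper: not found on PATH"
  fi
done
```

If either wrapper points at a system MPI instead, an MPI module is probably still loaded in your environment.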

Please report back if this solves your issue.

Another point: I recommend staying up to date with the latest develop branch, if possible.

Best,
Nakib

6 replies

nakib Jan 9, 2025
Maintainer

Hi Ziwen,

Thanks for the details. It looks like you have set your data dump directory in your home directory on the cluster. This code does a lot of disk i/o and the data dump should be happening on a dedicated scratch directory. Please ask your cluster manager about best practices for running i/o intensive codes. And please let me know if this solves the issue.

Best,
Nakib

Zouzw76 Jan 19, 2025
Author

Hi Nakib,

I have got a dedicated scratch directory from the cluster manager, but the problem persists. My code has been running for about 4 days on 4 nodes, each equipped with 28 Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz cores. It exits abnormally at the "Calculating interactions" section. Below is the script I submitted:

Could you please provide some insight into this?
Best,
Ziwen

nakib Jan 19, 2025
Maintainer

Dear Ziwen,

4 days of runtime for this calculation is definitely not the expected behavior. I would first like to make sure that your datadump directory is on your scratch and not on your log-in node. You should set the datadumpdir in input.nml as a direct path to some directory in your scratch. Have you done this already?
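One quick way to confirm this, as a sketch only: pull the datadumpdir value out of input.nml and ask which filesystem it lives on. The helper name `get_datadumpdir` is made up for illustration, and the parsing assumes a single quoted `datadumpdir = '...'` assignment on one line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value assigned to datadumpdir in a
# namelist file, with the surrounding quotes stripped.
get_datadumpdir() {
  grep -i -m1 'datadumpdir' "$1" \
    | sed -e "s/^[^=]*=[[:space:]]*//" -e "s/^['\"]//" -e "s/['\"].*$//"
}

nml="${1:-input.nml}"
if [ -f "$nml" ]; then
  dir="$(get_datadumpdir "$nml")"
  echo "datadumpdir = $dir"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    # Show the filesystem that actually holds the dump directory.
    df -h "$dir"
  else
    echo "warning: '$dir' is missing or not writable"
  fi
else
  echo "no $nml found in $(pwd)"
fi
```

If `df` reports your home filesystem rather than scratch, the dump is still going to the wrong place. A relative path here resolves against the directory the job runs in, which is one reason to use a full path.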

Please include the full input.nml, job submission script, and your output file in your response.

Best,
Nakib

Zouzw76 Jan 19, 2025
Author

Hi Nakib,

I simply copied all my input files to a dedicated scratch directory ('workdir' in the job submission script) and then ran the elphbolt code. The dedicated scratch directory was provided by the cluster manager. Below is the job submission script:

The input.nml was copied from the Si-example, with only the number of meshes changed. Below is the input.nml:

Below is the output:

It exited abnormally because it exceeded the allowed runtime on the nodes being used.
Best,
Ziwen

nakib Jan 19, 2025
Maintainer

Thanks. The behavior seems quite unpredictable. For example, in your previous run, the code went through this part of the problem quite quickly. I have not seen this behavior in any of the clusters I have run this code on. But in order to diagnose your issue, let's look at a cheaper calculation. Can you reduce the problem parameters to a 20x20x20 q-mesh and 60x60x60 k-mesh? Then check if the calculations finish properly on just 1 node. Then run again on 2 nodes. And so on.
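To keep the node-by-node test organized, something like the sketch below could list the runs to try. It only prints the commands and submits nothing. The executable path is a placeholder, the 28 cores per node is taken from your description, and running one image per core is an assumption; adapt the printed line to your scheduler's job script.

```shell
#!/usr/bin/env bash
CORES_PER_NODE="${CORES_PER_NODE:-28}"     # cores per node, as reported in this thread
MAX_NODES="${MAX_NODES:-4}"                # largest node count to try
ELPHBOLT="${ELPHBOLT:-/path/to/elphbolt}"  # placeholder path to the executable

# Number of coarray images if every core on each node hosts one image.
images_for() {
  echo $(( $1 * CORES_PER_NODE ))
}

# Print one planned run per node count, smallest first.
for nodes in $(seq 1 "$MAX_NODES"); do
  n="$(images_for "$nodes")"
  echo "nodes=$nodes images=$n: cafrun -n $n $ELPHBOLT > run_${nodes}node.out"
done
```

Comparing where each `run_<N>node.out` stops should show at which node count the hang first appears.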

How is your cluster set up? Are there very strict limits on how much i/o a job can do? Is it a standard cluster with Infiniband connectivity?

Have you tried to run the code on another cluster?

Best,
Nakib

Zouzw76 · 2025-01-19T12:03:35Z

Zouzw76
Jan 19, 2025
Author

Dear Nakib,
The cluster manager has confirmed that our cluster uses Infiniband connectivity and has no I/O limits. I successfully ran the code with a 12x12x12 q-mesh and a 48x48x48 k-mesh on 2 nodes. Below is the output file:

Thank you for your suggestion. I plan to try the (20, 60) mesh, the (30, 90) mesh, and so on. I have not yet run the code on another cluster, but I will attempt to do so if the issue persists on this one.

1 reply

nakib Jan 19, 2025
Maintainer

Sounds good, and please keep me updated. By the way, you can attach text files on GitHub; there is no need to paste images.

Best,
Nakib

Zouzw76 · 2025-03-07T07:22:48Z

Zouzw76
Mar 7, 2025
Author

Dear Nakib,
The issue I encountered has been resolved after recompiling elphbolt. As you suggested, I conducted additional tests with different (q, k) meshes. I found that when running the code on a single node with 96 cores, all calculations completed successfully within a reasonable time. However, when using multiple nodes, especially more than three, the code froze unexpectedly and exited abnormally. This led me to suspect a parallelization issue.
To address this, I recompiled the latest elphbolt using GCC 12.3.0, OpenCoarrays 2.10.2, and FPM 0.9.0. Moreover, I obtained a temporary storage directory from the cluster administrator that is better suited for intensive I/O operations. Now, the code runs normally.
Best regards,
Ziwen

1 reply

nakib Mar 7, 2025
Maintainer

Dear Ziwen,

Thanks a lot for letting me know. I'm glad the issue has been sorted out.

Best regards,
Nakib
