
During installation of OpenHPC, when trying to run pdsh -w c[1] systemctl start, it gives an error #1918

Closed
farrukhndm opened this issue Dec 16, 2023 · 13 comments

Comments

@farrukhndm

farrukhndm commented Dec 16, 2023

Dear Team,
Can anyone help with the error I am facing below when trying to start munge?

The commands below run without any error:

# systemctl enable munge   
# systemctl enable slurmctld
# systemctl start munge
# systemctl start slurmctld
# systemctl restart php-fpm
# pdsh -w c[1] systemctl start slurmd

The command below gives the error shown; can anyone help guide me?

# pdsh -w c[1] systemctl start munge

[root@master ~]# pdsh -w c[1] systemctl start munge
c1: Job for munge.service failed because the control process exited with error code.
c1: See "systemctl status munge.service" and "journalctl -xe" for details.
pdsh@master: c1: ssh exited with exit code 1
@adrianreber
Member

@farrukhndm You need to check the error messages on the compute node. Please run systemctl status munge.service or journalctl -xe on c1.
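For reference, those same checks can also be run from the master via pdsh (a minimal sketch, relying on the same root ssh access pdsh is already using; the --no-pager/-n options just keep the output from being paged or cut short):

# pdsh -w c1 'systemctl status munge.service --no-pager'
# pdsh -w c1 'journalctl -u munge -n 50 --no-pager'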

@martin-g
Contributor

It might be a copy/paste thing, but pdsh -w c[1] systemctl start slurmd should really be pdsh -w $c[1] systemctl start slurmd; note the $ in $c[1].

@farrukhndm
Author

farrukhndm commented Dec 20, 2023

It might be a copy/paste thing, but pdsh -w c[1] systemctl start slurmd should really be pdsh -w $c[1] systemctl start slurmd; note the $ in $c[1].

This time I tried again to run with c[1], without the $, and it ran without any error. Does that mean it is OK, or should I verify anything on c1?
Further:
[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

@farrukhndm
Author

@farrukhndm You need to check the error messages on the compute node. Please run systemctl status munge.service or journalctl -xe on c1.


Here is output 1:
[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
   Active: failed (Result: exit-code) since Tue 2023-12-19 21:09:12 EST; 8h ago
     Docs: man:munged(8)
  Process: 1160 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 19 21:09:12 c1 systemd[1]: Starting MUNGE authentication service...
Dec 19 21:09:12 c1 munged[1174]: munged: Error: Failed to check logfile "/var/l>
Dec 19 21:09:12 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 19 21:09:12 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 19 21:09:12 c1 systemd[1]: Failed to start MUNGE authentication service.

[root@c1 ~]#

Here is output 2:

[root@c1 ~]# journalctl -xe
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit session-3.scope has finished starting up.

-- The start-up result is done.
Dec 20 06:06:03 c1 systemd[1]: Started Session 5 of user root.
-- Subject: Unit session-5.scope has finished start-up
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit session-5.scope has finished starting up.

-- The start-up result is done.
Dec 20 06:06:03 c1 systemd-logind[1141]: New session 5 of user root.
-- Subject: A new session 5 has been created for user root
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Documentation: https://www.freedesktop.org/wiki/Software/systemd/multiseat

-- A new session with the ID 5 has been created for the user root.

-- The leading process of the session is 1546.

@martin-g
Contributor

This time I tried again to run with c[1], without the $, and it ran without any error. Does that mean it is OK, or should I verify anything on c1?
Further:
[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

Why without the $ again?
Your first message was without it. Now you should try with it!

@martin-g
Contributor

Dec 19 21:09:12 c1 munged[1174]: munged: Error: Failed to check logfile "/var/l>

I think this is the cause. The message is truncated, so it is not clear which file exactly is problematic.
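To see the full line, the status output can be printed untruncated, or the unit's journal dumped directly (a small sketch; -l/--full and --no-pager are standard systemd options):

[root@c1 ~]# systemctl status munge.service -l --no-pager
[root@c1 ~]# journalctl -u munge --no-pager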

@farrukhndm
Author

farrukhndm commented Dec 20, 2023

This time I tried again to run with c[1], without the $, and it ran without any error. Does that mean it is OK, or should I verify anything on c1?
Further:
[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

Why without the $ again? Your first message was without it. Now you should try with it!

Below is the output with the $:
[root@master ~]# pdsh -w $c[1] systemctl start slurmd
1: ssh: connect to host 0.0.0.1 port 22: Invalid argument
pdsh@master: 1: ssh exited with exit code 255
[root@master ~]#

After this, I logged in on the c1 node:

login as: root
[email protected]'s password:
[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
Active: failed (Result: exit-code) since Wed 2023-12-20 12:40:00 EST; 4min 2>
Docs: man:munged(8)
Process: 1152 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 20 12:39:59 c1 systemd[1]: Starting MUNGE authentication service...
Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 20 12:40:00 c1 systemd[1]: Failed to start MUNGE authentication service.

@martin-g
Contributor

How do you define the c array?
Do you use the input.local templates? For example, at

c_ip[0]=172.16.1.1
c_ip[1]=172.16.1.2
c_ip[2]=172.16.1.3
c_ip[3]=172.16.1.4
you can see how c_ip is being defined.
It seems you use something custom because the templates use c_name and c_ip as array names.
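For illustration only, here is a bash sketch of how such arrays are defined and indexed; the values are made up, not taken from your input.local:

c_name[0]=c1
c_ip[0]=172.16.1.1
echo ${c_name[0]}   # prints c1
echo ${c_ip[0]}     # prints 172.16.1.1

Reading an array element always needs the ${...} form.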

About the actual error - Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>:
I think it is caused by a permissions issue with /var/log/munge/munged.log.
What is the output of ls -alR /var/log/munge*?
What is the error if you try to run munged manually, i.e. without systemd?
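As a sketch of those two checks (assuming the stock paths; runuser and munged's --foreground flag are standard):

# ls -alR /var/log/munge*
# runuser -u munge -- /usr/sbin/munged --foreground

The foreground run should print the full, untruncated error to the terminal. If I recall correctly, munged also refuses a log path whose parent directories are group- or world-writable, so ls -ld /var/log on c1 is worth checking too.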

@martin-g
Contributor

Actually, I was mistaken about pdsh!
It is smart enough to deal with c[1], so there is no need for the $!
You can focus only on the munged failure on the compute node.
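A brief illustration of that, plus a guess at what happened with the $ earlier (the hostlist lines are standard pdsh syntax; the $c[1] explanation is an assumption about bash expansion, not something verified on your system):

# pdsh -w c1 systemctl start slurmd
# pdsh -w c[1] systemctl start slurmd
# pdsh -w c[1-4] systemctl start slurmd    # nodes c1 through c4

With $c[1], bash does not treat that as array indexing (that would be ${c[1]}); with no variable c set, the argument collapses to just [1], which pdsh expands to the lone host 1, hence the strange ssh target in the earlier output.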

@martin-g
Contributor

Please also try [root@master ~]# pdsh -l root -w c[1] systemctl start slurmd

@farrukhndm
Author

farrukhndm commented Dec 21, 2023

Actually, I was mistaken about pdsh! It is smart enough to deal with c[1], so there is no need for the $! You can focus only on the munged failure on the compute node.

It's OK, it worked fine without the $ as below; I will now check the munge error and update you.
[root@master ~]# pdsh -w c[1] systemctl start slurmd
[root@master ~]#

@farrukhndm
Author

How do you define the c array? Do you use the input.local templates? For example, at

c_ip[0]=172.16.1.1
c_ip[1]=172.16.1.2
c_ip[2]=172.16.1.3
c_ip[3]=172.16.1.4

you can see how c_ip is being defined.
It seems you use something custom because the templates use c_name and c_ip as array names.
About the actual error - Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>: I think it is caused by a permissions issue with /var/log/munge/munged.log. What is the output of ls -alR /var/log/munge*? What is the error if you try to run munged manually, i.e. without systemd?

Here is the output:


[root@master ~]# ls -alR /var/log/munge*
/var/log/munge:
total 8
drwx------   2 munge munge   51 Dec 20 06:45 .
drwxr-xr-x. 21 root  root  4096 Dec 21 13:34 ..
-rw-r-----   1 munge munge    0 Dec 20 06:45 munged.log
-rw-r-----   1 munge munge 1736 Dec 20 06:45 munged.log-20231220

The error is still the same on c1:
[email protected]'s password:
Last login: Wed Dec 20 12:43:27 2023 from 192.168.1.200
[root@c1 ~]# systemctl status munge.service
● munge.service - MUNGE authentication service
Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor prese>
Active: failed (Result: exit-code) since Wed 2023-12-20 12:40:00 EST; 14min >
Docs: man:munged(8)
Process: 1152 ExecStart=/usr/sbin/munged (code=exited, status=1/FAILURE)

Dec 20 12:39:59 c1 systemd[1]: Starting MUNGE authentication service...
Dec 20 12:39:59 c1 munged[1167]: munged: Error: Failed to check logfile "/var/l>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Control process exited, code=exit>
Dec 20 12:40:00 c1 systemd[1]: munge.service: Failed with result 'exit-code'.
Dec 20 12:40:00 c1 systemd[1]: Failed to start MUNGE authentication service.


A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 30, 2024