Fan cycling indefinitely #28

Closed
B0ndo2 opened this issue Mar 13, 2024 · 9 comments

Comments

@B0ndo2

B0ndo2 commented Mar 13, 2024

I have a ThinkPad T14s Gen 3 on which I installed zcfan. The fan is cycling like crazy. I am monitoring the CPU and GPU temperatures, and they never reach 70 or 61. It also feels like the fan always runs at high speed.

Mar 13 16:21:58 XX zcfan[11590]: [FAN] Temperature now 63C, fan set to low
Mar 13 16:22:26 XX zcfan[11590]: [FAN] Temperature now 50C, fan set to off
Mar 13 16:24:52 XX zcfan[11590]: [FAN] Temperature now 76C, fan set to medium
Mar 13 16:24:56 XX zcfan[11590]: [FAN] Temperature now 47C, fan set to off
Mar 13 16:25:45 XX zcfan[11590]: [FAN] Temperature now 70C, fan set to low
Mar 13 16:25:49 XX zcfan[11590]: [FAN] Temperature now 48C, fan set to off
Mar 13 16:25:59 XX zcfan[11590]: [FAN] Temperature now 61C, fan set to low
Mar 13 16:26:02 XX zcfan[11590]: [FAN] Temperature now 48C, fan set to off
Mar 13 16:26:22 XX zcfan[11590]: [FAN] Temperature now 61C, fan set to low
Mar 13 16:26:25 XX zcfan[11590]: [FAN] Temperature now 47C, fan set to off

zcfan.conf

max_temp 85
med_temp 75
low_temp 60
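
For reference, the way three thresholds like these could map a reading to a fan level can be sketched as follows. This is a minimal illustration and not zcfan's actual code: the function name `pick_level`, the level names, and the `hysteresis` parameter (assumed here to be 10 degrees, which would be consistent with the fan switching off at around 50C in the log above) are all assumptions.

```python
# Hypothetical sketch of threshold-based fan level selection; not zcfan's code.
LEVELS = ["off", "low", "medium", "max"]


def pick_level(temp_c, current, low=60, med=75, high=85, hysteresis=10):
    """Return the fan level for temp_c, given the current level.

    Stepping up happens as soon as a threshold is reached. Stepping down
    only happens once the temperature is at least `hysteresis` degrees
    below the threshold of the current level (an assumed behaviour).
    """
    thresholds = [None, low, med, high]  # index matches LEVELS
    # Find the highest level whose threshold has been reached.
    target = 0
    for i in (1, 2, 3):
        if temp_c >= thresholds[i]:
            target = i
    cur = LEVELS.index(current)
    if target >= cur:
        return LEVELS[target]
    # Going down: stay put until clearly below the current level's threshold.
    if temp_c > thresholds[cur] - hysteresis:
        return current
    return LEVELS[target]
```

Under these assumptions, a short spike to 76C moves the fan to medium, and a reading of 47C a few seconds later drops it straight back to off, which is the kind of cycling shown in the log.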

@cdown
Owner

cdown commented Mar 15, 2024

Which hwmon input is producing those numbers? There was some discussion about blacklisting in #25, but I'd be intrigued to find out which sensor is actually producing these values, especially since it's not static in this case.

@B0ndo2
Author

B0ndo2 commented Mar 15, 2024

I don't know which ones. How do I find out?

@rudolf81

rudolf81 commented Mar 15, 2024

@B0ndo2

Try the (attached) script - maybe it can help.
get_temps.txt

Output looks like this:
Directory: /sys/class/hwmon/hwmon0/
Name: BAT0

Directory: /sys/class/hwmon/hwmon1/
Name: nvme
temp1_input: 32850 (Composite)
temp3_input: 67850 (Sensor 2)

Directory: /sys/class/hwmon/hwmon2/
Name: amdgpu
temp1_input: 43000 (edge)

Directory: /sys/class/hwmon/hwmon3/
Name: AC

Directory: /sys/class/hwmon/hwmon4/
Name: acpitz
temp1_input: 44000

Directory: /sys/class/hwmon/hwmon5/
Name: k10temp
temp1_input: 44375 (Tctl)

Directory: /sys/class/hwmon/hwmon6/
Name: thinkpad
temp1_input: 44000 (CPU)
cat: /sys/class/hwmon/hwmon6/temp2_input: No such device or address
temp2_input: (GPU)
temp3_input: 44000
temp4_input: 0
temp5_input: 44000
temp6_input: 44000
temp7_input: 44000
temp8_input: 0

Directory: /sys/class/hwmon/hwmon7/
Name: ath11k_hwmon
temp1_input: 41000

(Huh, it seems /sys/class/hwmon/hwmon6/temp2_input is not readable on my system...)
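
For anyone who cannot open the attachment, roughly the same information can be collected with a few lines of Python. This is a hedged sketch and not the attached script: the function name `read_hwmon_temps` and its `root` parameter are made up for illustration, and it skips inputs that exist but cannot be read (like temp2_input above) instead of printing an error.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, input_file): millidegrees C} for readable sensors."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        try:
            name = (chip / "name").read_text().strip()
        except OSError:
            name = chip.name  # fall back to the directory name
        for temp_file in sorted(chip.glob("temp*_input")):
            try:
                value = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                # Some inputs exist but cannot be read or are empty; skip them.
                continue
            readings[(name, temp_file.name)] = value
    return readings
```

Calling `read_hwmon_temps()` with no arguments reads the real sysfs tree; values are in millidegrees Celsius, so 44000 means 44.0C.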

@B0ndo2
Author

B0ndo2 commented Mar 18, 2024

Here is the output

Directory: /sys/class/hwmon/hwmon0/
Name: AC

Directory: /sys/class/hwmon/hwmon1/
Name: acpitz
  temp1_input: 65000

Directory: /sys/class/hwmon/hwmon2/
Name: BAT0

Directory: /sys/class/hwmon/hwmon3/
Name: nvme
  temp1_input: 49850 (Composite)
  temp2_input: 49850 (Sensor 1)
  temp3_input: 45850 (Sensor 2)

Directory: /sys/class/hwmon/hwmon4/
Name: ucsi_source_psy_USBC000:001

Directory: /sys/class/hwmon/hwmon5/
Name: ucsi_source_psy_USBC000:002

Directory: /sys/class/hwmon/hwmon6/
Name: thinkpad
  temp1_input: 65000 (CPU)
cat: /sys/class/hwmon/hwmon6/temp2_input: No such device or address
  temp2_input:  (GPU)
  temp3_input: 47000
  temp4_input: 0
  temp5_input: 37000
  temp6_input: 57000
  temp7_input: 54000
cat: /sys/class/hwmon/hwmon6/temp8_input: No such device or address
  temp8_input: 

Directory: /sys/class/hwmon/hwmon7/
Name: coretemp
  temp1_input: 58000 (Package id 0)
  temp2_input: 58000 (Core 0)
  temp6_input: 58000 (Core 4)

Directory: /sys/class/hwmon/hwmon8/
Name: iwlwifi_1
  temp1_input: 53000

@stefancircuit

stefancircuit commented Jul 26, 2024

I suspect I have the same issue. There seems to be a mysterious high temperature being detected that isn't shown anywhere else. For example:

Jul 26 15:33:40 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 79C, fan set to medium
Jul 26 15:34:00 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 50C, fan set to low
Jul 26 15:40:52 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 72C, fan set to medium
Jul 26 15:45:12 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 50C, fan set to low
Jul 26 15:49:47 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 73C, fan set to medium

Checking the sensors around the same time as the last entry shows:

Fri 26 Jul 15:52:12 BST 2024
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +45.0°C  

ucsi_source_psy_USBC000:003-isa-0000
Adapter: ISA adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

BAT0-acpi-0
Adapter: ACPI interface
in0:          17.74 V  

thinkpad-isa-0000
Adapter: ISA adapter
fan1:        3642 RPM
fan2:        3725 RPM
CPU:          +51.0°C  
GPU:          +46.0°C  
temp3:        +48.0°C  
temp4:         +0.0°C  
temp5:        +46.0°C  
temp6:        +47.0°C  
temp7:        +38.0°C  
temp8:            N/A  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +52.0°C  (high = +110.0°C, crit = +110.0°C)
Core 0:        +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 1:        +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 2:        +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 3:        +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 4:        +45.0°C  (high = +110.0°C, crit = +110.0°C)
Core 5:        +45.0°C  (high = +110.0°C, crit = +110.0°C)
Core 6:        +45.0°C  (high = +110.0°C, crit = +110.0°C)
Core 7:        +45.0°C  (high = +110.0°C, crit = +110.0°C)
Core 8:        +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 12:       +45.0°C  (high = +110.0°C, crit = +110.0°C)
Core 16:       +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 20:       +46.0°C  (high = +110.0°C, crit = +110.0°C)
Core 24:       +47.0°C  (high = +110.0°C, crit = +110.0°C)
Core 28:       +47.0°C  (high = +110.0°C, crit = +110.0°C)
Core 32:       +50.0°C  (high = +110.0°C, crit = +110.0°C)
Core 33:       +50.0°C  (high = +110.0°C, crit = +110.0°C)

ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0:           0.00 V  (min =  +0.00 V, max =  +0.00 V)
curr1:         0.00 A  (max =  +0.00 A)

nvme-pci-0400
Adapter: PCI adapter
Composite:    +33.9°C  (low  = -20.1°C, high = +77.8°C)
                       (crit = +81.8°C)
Sensor 1:     +33.9°C  (low  = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +51.0°C  (crit = +108.0°C)

Or via cat /proc/acpi/ibm/thermal:

Fri 26 Jul 15:53:57 BST 2024
temperatures:	53 47 49 0 47 49 38 -128

Nothing seems to be close to the ~70 degree temp being reported. Do you have any ideas where it might be coming from?

edit: After watching the output of the script given above, I think what happens is there are very short temperature spikes. I'm not sure if they're real or errors in the sensor data.

Observing just the temp1 of the CPU:

while true; do date && cat /sys/class/hwmon/hwmon7/temp1_input; sleep 0.1; done;

For less than a second, the temp seems to jump up and then back down by several degrees:

Fri 26 Jul 17:16:13 BST 2024
67000
Fri 26 Jul 17:16:13 BST 2024
67000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
67000

If zcfan notices this jump, it'll often kick the fan into a higher mode for a while, which doesn't seem necessary. I wonder if it should keep a running 1-2 second average to avoid spikes?
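
To make the suggestion concrete, a time-windowed running average could look something like this. It is only a sketch of the idea and not something zcfan is known to implement: the class name `RollingAverage` and the `window_s` parameter are hypothetical.

```python
from collections import deque


class RollingAverage:
    """Average of the samples seen in the last `window_s` seconds."""

    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Record a sample and return the average over the current window."""
        self.samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        while self.samples and timestamp - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return sum(v for _, v in self.samples) / len(self.samples)
```

With one sample per second and a 2-second window, a single 81C reading between two 67C readings would be reported as roughly 72C instead of 81C. The cost is a slower reaction to genuine temperature rises.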

@rudolf81

Hey @stefancircuit

Wow, that is strange.

Are you saying you measured a temp difference from 67 to 81, within 100ms?
That sounds rather impossible...? I would think the thermal mass...

Were you running any specific load on the system at the time?

What is your hwmon7's name?
(you can run my script posted above).
Mine is "ath11k_hwmon", which is wifi.

(Lots of docs suggest the number of the "hwmon" sensors can jump between boots... I don't think I've observed that however).

Anyway - if your hwmon7 is your wifi... it might not be covered by the CPU cooling heatpipes... like mine (T16 G1 AMD): https://laptopmedia.com/wp-content/uploads/2022/08/internals-1000x711.jpg

Sooo, if that is the case, you may have a similar problem to what I had: needing to be able to blacklist a sensor from zcfan's fan-management algorithm, which just takes the highest temperature of any sensor and uses that to set the fan speed.
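
The "highest reading wins, minus a blacklist" idea can be sketched like this. It is an illustration of the approach described above, not zcfan's code: the function name `hottest`, the `excluded` parameter, and the (chip, input) key format are assumptions.

```python
def hottest(readings, excluded=()):
    """Return (key, millidegrees C) of the hottest non-excluded sensor, or None."""
    candidates = {k: v for k, v in readings.items() if k not in excluded}
    if not candidates:
        return None
    key = max(candidates, key=candidates.get)
    return key, candidates[key]


# Values borrowed from the example output earlier in the thread, purely for illustration.
readings = {
    ("nvme", "temp3_input"): 67850,
    ("k10temp", "temp1_input"): 44375,
    ("amdgpu", "temp1_input"): 43000,
}
print(hottest(readings))
print(hottest(readings, excluded={("nvme", "temp3_input")}))
```

Without the exclusion the NVMe input decides the fan level; with it, the k10temp reading does.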

See: #25

I've reached a dead end with my issue...
My SSD exposed some additional dud temp sensor, that was always stuck on the same temp, and causing issues in the zcfan algo...

In the end I modified zcfan to hardcode an exclusion on that one sensor on my laptop, and that solved the algo issue, but it exposed a new issue: when setting the desired fan level by writing to /proc/acpi/ibm/fan, the whole system may crash. At random.

It is actually something I observed with zcfan and my experimentation with it... I thought it was my own bad code (but how... zcfan is in userspace...) - but later I wrote my own "zcfan" in python to read sensors, accommodate sensor blacklisting, compute brackets, and set the fan level accordingly.
It worked great, except my system would still hard crash - at random.
No amount of playing with the intervals of writing the fan level or the watchdog timer, could fix this.

I don't know how to troubleshoot this further. I'm guessing this issue might be unique to the combination of the IBM ACPI driver and my motherboard/BIOS, or else zcfan would not work for anyone...

Actually, the original problem I had, which led me to try zcfan, was that the system's automatic fan speed control had an issue... Most of the time it would be fine, but sometimes it would get into a loop of spinning up and down, over and over, quite fast. Probably up and down in about 5 seconds. Over and over. Even with no load on the system.

I think fundamentally, the IBM ACPI driver, which does run in kernel space, is not that good, and it leads to these issues we've seen:

  • rapidly fluctuating temperature readings
  • random crashes on writing to it
  • reporting bogus sensors
  • spinning up & down

Hmmm, come to think of it, the fluctuating fan speed issue I had is kinda similar to what you and @B0ndo2 reported, but maybe at a faster pace? Maybe it has to do with the polling frequency set up in zcfan... maybe it's fundamentally driven by the same issue.

I'm not sure how to troubleshoot this issue further, or where to go for help.

@stefancircuit - have you had any random system crashes while experimenting with zcfan?

If not, you can try that sensor blacklisting route... or if your hwmon7 temp1 is some part of your CPU/GPU, and you do want it to drive your fan speed... maybe you can pre-process those readings via a moving average or a low-pass filter or something like that to smooth out the bumps...
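
As a rough idea of what such pre-processing might look like, here is a first-order low-pass filter (an exponential moving average). It is a sketch only: the class name `LowPass` and the smoothing factor `alpha` are hypothetical, and a suitable `alpha` would depend on the polling interval.

```python
class LowPass:
    """First-order low-pass filter (exponential moving average)."""

    def __init__(self, alpha=0.2):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state = None

    def update(self, value):
        """Feed one raw reading and return the smoothed value."""
        if self.state is None:
            self.state = float(value)  # seed with the first sample
        else:
            # Move a fraction `alpha` of the way towards the new reading.
            self.state += self.alpha * (value - self.state)
        return self.state
```

A smaller `alpha` suppresses short spikes more strongly but also makes the fan slower to respond to a real, sustained rise.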

Last question - which version of the IBM ACPI driver are you running?
You can check with:
cat /proc/acpi/ibm/driver

I'm on 0.26

Thanks!

@stefancircuit

Hi, thanks for your response @rudolf81. I was not running any particular load, just idling. hwmon7 is the CPU for me:

Directory: /sys/class/hwmon/hwmon7/
Name: coretemp
  temp1_input: 47000 (Package id 0)
  temp2_input: 41000 (Core 0)
  temp3_input: 41000 (Core 1)
  temp4_input: 40000 (Core 2)
  temp5_input: 41000 (Core 3)
  temp6_input: 40000 (Core 4)
  temp7_input: 41000 (Core 5)
  temp8_input: 40000 (Core 6)
  temp9_input: 41000 (Core 7)

My assumption was that temp1 was the closest thing to the "overall" CPU temperature, however it might just be the max of all the cores or something. So I don't think blacklisting would work in my case.

I don't get any crashes, just these (possibly spurious) sub-second temperature spikes that push the fan speed up randomly.

The acpi version is as follows:

sudo cat /proc/acpi/ibm/driver 

driver:	ThinkPad ACPI Extras
version:	0.26

But yeah, it seems like a rolling average would help smooth out anomalies.

That being said I'm currently running Ubuntu 22, and I tried 24 over the weekend which seems to just fix the problems. I'm not sure why, or what changed but the temps are down and the fan stays mostly off without extra tools. This is a very new laptop model (P1 Gen7) so maybe it just needs whatever mysterious packages exist in the newer OS. 🤷‍♂️

I'll still keep using zcfan for a bit as I cannot upgrade fully yet, but hopefully that will be the longer term solution.

@rudolf81

Hi @stefancircuit

Interesting.

You would think that the temperatures you get, via querying sysfs or procfs, somehow come directly from the actual sensors of the components.

Updating some OS packages is not likely to alter what those readings are?? (...unless they already have some smoothing algo applied? But you'd think that would be a concern one layer above, not something coming from the actual sensors themselves...)

Anyway, when you get back into Ubuntu 24, it would be awesome if you could share the version of the ThinkPad ACPI Extras driver.

Thanks.

@cdown
Owner

cdown commented Nov 21, 2024

From discussion it feels like this is in the vein of #25, so closing so we can discuss there. Thanks!

cdown closed this as completed Nov 21, 2024