
I have a Thinkpad T430 with an i7-3520M CPU. This is rated to run at 2.90GHz, but it can boost up to 3.5GHz.

Previously I would use Windows 10. It is an old laptop, and CPU temperatures can get quite high, but on Windows it would never heat up to the point where it would shut off on me. It would instead stop boosting and lower the clock speed to compensate for the high heat, if I remember correctly. When the laptop is docked, due to the less ideal cooling setup it would heat up faster, but it would still not overheat on Windows 10.

On Kubuntu 24.04 (kernel 6.8.0-51), I noticed that the CPU can reach very high temperatures of around 95°C during high CPU usage. The clock rate also seems to be constantly boosting to 3.5GHz, despite the high temperatures. At high temperatures, I noticed in top that there are some processes called "idle_inject" that have a high CPU usage. After googling those, it seems that the kernel uses these processes to inject idle wait states to force the CPU to idle and cool down, if I understand correctly. However, it doesn't seem to help, and the CPU clock stays at 3.5GHz.

On prolonged workloads, the laptop would eventually force shutdown. I feel that this isn't normal behaviour and there is something wrong with the way the kernel is managing the CPU frequency. What can I do?

Note: I realize that I could do hardware fixes like cleaning/replacing the fan, replacing the thermal paste, etc. But the fact is that the CPU is boosting when it shouldn't be, and I believe this is controlled in software, and didn't happen on Windows. So there should be a way to fix it.

EDIT: Here are some thermald logs

Dec 30 09:08:47 the9a3eedi-linux thermald[1475]: 13 CPUID levels; family:model:stepping 0x6:3a:9 (6:58:9)
Dec 30 09:08:47 the9a3eedi-linux thermald[1475]: 13 CPUID levels; family:model:stepping 0x6:3a:9 (6:58:9)
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: sensor id 4 : No temp sysfs for reading raw temp
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: sensor id 4 : No temp sysfs for reading raw temp
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: sensor id 4 : No temp sysfs for reading raw temp
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: Config file /etc/thermald/thermal-conf.xml does not exist
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: Config file /etc/thermald/thermal-conf.xml does not exist
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: Config file /etc/thermald/thermal-conf.xml does not exist
Dec 30 09:08:49 the9a3eedi-linux thermald[1475]: Polling mode is enabled: 4
Dec 30 09:08:49 the9a3eedi-linux systemd[1]: Started thermald.service - Thermal Daemon Service.

9a3eedi
  • 183

3 Answers


As a first step, the program / process which causes the high CPU load should be identified. Open a terminal and run top. This will show all running processes, including their consumption of CPU time and RAM. If the list is not sorted by CPU consumption, press Shift+P to sort accordingly. If there is any process consuming a lot of CPU time which can be safely killed, do so and check for improvements.
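For a one-off snapshot that can be pasted into a post, the same information can be obtained non-interactively. This is a minimal sketch assuming the procps version of ps that Ubuntu ships; the function name top_cpu is made up for illustration.

```shell
# Print the N busiest processes by CPU share (default 10), header included.
# Note: ps reports CPU use averaged over each process's lifetime,
# whereas top shows use over its most recent refresh interval.
top_cpu() {
    count="${1:-10}"
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n "$((count + 1))"
}

top_cpu 10
```

Because of the lifetime averaging, a process that only recently started burning CPU may rank lower here than it does in top.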

As a second step, the settings of the Linux kernel for the CPU can be checked manually. The /sys pseudo filesystem has a directory for each CPU core which holds the corresponding information. Switch to the directory for CPU0 (should be identical for all CPUs anyway) with cd /sys/devices/system/cpu/cpu0/cpufreq/ and check the file scaling_governor: cat scaling_governor. The output is a governor name such as "ondemand", "schedutil", "powersave" or "performance". Note that intel_pstate is the name of a scaling driver, not a governor; the active driver is shown by cat scaling_driver. The supported governors for your CPU are listed in the file scaling_available_governors. You can manually set the scaling governor for each CPU core. The following command sets the "powersave" governor for the 1st CPU core (assuming it is supported for the given CPU):

echo powersave | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

With the generic cpufreq governors, powersave means that the CPU core always runs at the lowest frequency, independent of the workload. This should prevent the CPU from overheating, but may make the system unresponsive. Alternatively, conservative can be used instead of powersave to step up to higher frequencies gradually. If the intel_pstate driver is active, only performance and powersave are offered, and powersave there still scales the frequency with the load. A detailed description of the different scaling governors can be found in the Linux Kernel Documentation.

Please note that it is on regular functioning systems normally not necessary to change the scaling governor manually.
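To check every core at once rather than only cpu0, something like the following can be used. It is a sketch that assumes the usual sysfs layout; the function name show_governors and its optional base-directory argument are illustrative only.

```shell
# List the cpufreq driver and governor of every core.
# The base directory is a parameter only so the function can be tried
# against a copy of the tree; on a real system the default applies.
show_governors() {
    base="${1:-/sys/devices/system/cpu}"
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        # Skip cores (or an unmatched glob) without cpufreq information
        [ -r "$dir/scaling_governor" ] || continue
        cpu="${dir%/cpufreq}"
        driver="unknown"
        if [ -r "$dir/scaling_driver" ]; then
            driver="$(cat "$dir/scaling_driver")"
        fi
        echo "${cpu##*/}: driver=$driver governor=$(cat "$dir/scaling_governor")"
    done
}

show_governors
```

To apply a governor to all cores in one go, tee accepts several files: echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor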

The directory /sys/devices/system/cpu/cpu0/ also holds usage statistics, depending on the CPU and the driver in use. If the directory cpufreq/stats exists, change into it and have a look at the file time_in_state: cat time_in_state. Each line lists a frequency (in kHz) and the time the core has spent at it, in units of 10 ms. The idle states ("C-states") are found under cpuidle/, which contains one directory named stateX per idle state, where X is a number starting at 0. In each of them, the file time (cat time) holds the total time the core has spent in that idle state, in microseconds, and the file name holds the name of the state. State 0 is the shallowest state (usually POLL on x86); the higher the number, the deeper the sleep state. Compare the time in state0 with the values of the deeper idle states.
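The checks described above can be combined into one small script. This is only a sketch, assuming the common layout in which frequency statistics live in cpufreq/stats/time_in_state and idle states in cpuidle/stateX below the per-core directory; the function name cpu_residency is invented for this example.

```shell
# Summarise frequency residency and idle-state residency for one core.
# Assumed layout: <cpu>/cpufreq/stats/time_in_state
#                 <cpu>/cpuidle/stateN/{name,time}
cpu_residency() {
    cpu_dir="${1:-/sys/devices/system/cpu/cpu0}"

    stats="$cpu_dir/cpufreq/stats/time_in_state"
    if [ -r "$stats" ]; then
        echo "Frequency residency (kHz, units of 10 ms):"
        # Most-used frequencies first, at most five lines
        sort -k2,2nr "$stats" | head -n 5
    else
        echo "No cpufreq stats at $stats"
    fi

    for state in "$cpu_dir"/cpuidle/state[0-9]*; do
        [ -r "$state/time" ] || continue
        echo "${state##*/} ($(cat "$state/name" 2>/dev/null)): $(cat "$state/time") us"
    done
}

cpu_residency
```

On systems where the stats directory is absent, the function just reports that and still lists the idle states.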

If there is no process consuming a lot of CPU time and the CPU spends the majority of its time at a lower frequency and / or in a deeper idle state but overheats anyway, there may be something wrong with the cooling system.

noisefloor
  • 1,769

The process thermald should take care of your problem. Your specific problem is that your thermald is not customized, due to a missing config file... so it runs with default thermald parameters that may not work in your case.

Config file /etc/thermald/thermal-conf.xml does not exist

Do man thermald and man thermal-conf.xml for examples on how to create your own config file.

Use this config.sh command to get you started:

echo ""
echo "Types per thermal_zone*"
echo "-----------------------"
cat /sys/class/thermal/thermal_zone*/type
echo ""
echo "Temps per thermal_zone*"
echo "-----------------------"
cat /sys/class/thermal/thermal_zone*/temp
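The two cat commands above print types and temperatures as separate lists, which have to be matched up by position. A variant that prints them side by side might look like this; it is a sketch, and the function name list_thermal_zones is made up for illustration.

```shell
# Print each thermal zone with its type and temperature in degrees Celsius.
# The kernel reports the temperature in millidegrees Celsius.
list_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/type" ] || continue
        # Some sensors cannot be read; skip those instead of aborting
        millideg="$(cat "$zone/temp" 2>/dev/null)" || continue
        [ -n "$millideg" ] || continue
        echo "${zone##*/} $(cat "$zone/type") $((millideg / 1000)) C"
    done
}

list_thermal_zones
```

The zone name and type printed on each line are the values that go into the Path and Type entries of a ThermalSensor block such as the ones in the config below.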

Here's my thermal-conf.xml:

<?xml version="1.0"?>
<ThermalConfiguration>
<Platform>
        <Name>Dell Inspiron-7700-AIO</Name>
        <ProductName>*</ProductName>
        <Preference>QUIET</Preference>
        <ThermalSensors>
<!--
                <ThermalSensor>
                        <Type>acpitz</Type>
                        <Path>/sys/class/thermal/thermal_zone0/</Path>
                        <AsyncCapable>0</AsyncCapable>
                </ThermalSensor>
                <ThermalSensor>
                        <Type>iwlwifi_1</Type>
                        <Path>/sys/class/thermal/thermal_zone1/</Path>
                        <AsyncCapable>0</AsyncCapable>
                </ThermalSensor>
-->
                <ThermalSensor>
                        <Type>x86_pkg_temp</Type>
                        <Path>/sys/class/thermal/thermal_zone2/</Path>
                        <AsyncCapable>0</AsyncCapable>
                </ThermalSensor>
        </ThermalSensors>
<!--
commented out
-->
        <ThermalZones>
                <ThermalZone>
                        <Type>cpu</Type>
                        <TripPoints>
                                <TripPoint>
                                        <SensorType>x86_pkg_temp</SensorType>
                                        <Temperature>65000</Temperature>
                                        <type>passive</type>
                                        <ControlType>PARALLEL</ControlType>
                                        <CoolingDevice>
                                                <index>0</index>
                                                <type>Fan</type>
                                                <influence>30</influence>
                                                <SamplingPeriod>5</SamplingPeriod>
                                        </CoolingDevice>
                                        <CoolingDevice>
                                                <index>5</index>
                                                <type>Processor</type>
                                                <influence>80</influence>
                                                <SamplingPeriod>5</SamplingPeriod>
                                        </CoolingDevice>
                                        <CoolingDevice>
                                                <index>13</index>
                                                <type>intel_powerclamp</type>
                                                <influence>100</influence>
                                                <SamplingPeriod>5</SamplingPeriod>
                                        </CoolingDevice>
                                </TripPoint>
                        </TripPoints>
                </ThermalZone>
        </ThermalZones>
</Platform>
</ThermalConfiguration>

Ask questions if you need to.

heynnema
  • 73,649

According to the specifications, Tjunction is 105°C.

Mobile i7 CPUs from that era reduce the clock rate when they hit this number, not before. Mine does as well, and it's been running fine for over ten years.

If you get a system shutdown, then you are running into a second limit beyond that, where reducing the clock rate no longer helps. This is an older machine, so it will probably need new thermal paste or a replacement fan at some point (mine certainly did).
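The limits the kernel itself knows about can be read from the trip points of each thermal zone. The following is a sketch assuming the trip_point_N_temp and trip_point_N_type files under /sys/class/thermal; which trip points a given machine exposes depends on its firmware, and the function name list_trip_points is illustrative.

```shell
# Print every trip point the kernel exposes, as: zone, type, temperature.
# Temperatures are stored in millidegrees Celsius.
list_trip_points() {
    base="${1:-/sys/class/thermal}"
    for temp_file in "$base"/thermal_zone*/trip_point_*_temp; do
        [ -r "$temp_file" ] || continue
        type_file="${temp_file%_temp}_type"
        zone="${temp_file%/*}"
        millideg="$(cat "$temp_file" 2>/dev/null)" || continue
        [ -n "$millideg" ] || continue
        echo "${zone##*/} $(cat "$type_file" 2>/dev/null) $((millideg / 1000)) C"
    done
}

list_trip_points
```

A trip point of type critical is the temperature at which the kernel initiates a shutdown.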

I'm using the ondemand governor, with

echo 1 >/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load

so background processes don't cause a frequency switch. That also helps immensely: if I don't need something right away, I can start it in the background with nice or batch.
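The ondemand tunable only exists while the ondemand governor is active, and writing to it needs root. A guarded version of the command could look like the sketch below; the function name set_ignore_nice_load and its optional path argument are invented for illustration.

```shell
# Enable ignore_nice_load if the tunable is present and writable.
# The path is absent when another governor or driver (for example
# intel_pstate) is in use, and not writable without root.
set_ignore_nice_load() {
    tunable="${1:-/sys/devices/system/cpu/cpufreq/ondemand/ignore_nice_load}"
    if [ -w "$tunable" ]; then
        echo 1 > "$tunable"
        echo "ignore_nice_load=$(cat "$tunable")"
    else
        echo "not writable: $tunable"
    fi
}
```

Run it from a root shell as set_ignore_nice_load, then start low-priority work with, for example, nice -n 19 followed by the command.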