I recently bought a new motherboard/CPU (ASUS ROG Strix Z690-A Gaming WiFi D4 / Intel Core i7-12700K) and initially kept using my existing Ubuntu 20.04 installation, which dated back to my previous Z370 motherboard.
During my initial usage (training deep learning models) I noticed odd behavior: sudden segfaults at seemingly random intervals (sometimes minutes after starting training, sometimes after several hours), or my NVMe drive suddenly disappearing, giving me errors like this:
I thought this might be caused by the old installation, so I did a fresh install of Ubuntu 20.04, installed the latest NVIDIA driver (515) along with the required software (PyTorch 1.11, Anaconda3), and started training again.
I hit one complete system hang (I was browsing in the latest version of Firefox when everything froze and nothing would respond; even Ctrl+Alt+F-keys wouldn't work), so I had to hard reset. I rebooted, resumed training, and then another segfault occurred, which looks like this.
Note the "RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault." line in the log below:
Train: 22 [1200/5004 ( 24%)] Loss: 3.231 (3.24) Time-Batch: 0.110s, 2325.76/s LR: 1.000e-01 Data: 0.003 (0.130)
Train: 22 [1400/5004 ( 28%)] Loss: 3.278 (3.24) Time-Batch: 0.102s, 2500.91/s LR: 1.000e-01 Data: 0.002 (0.128)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
main()
File "/home/hossein/pytorch-image-models/train.py", line 525, in main
train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
loss_scaler(loss, optimizer,
File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 43, in __call__
self._scaler.scale(loss).backward(create_graph=create_graph)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2405) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-06-08_20:28:21
host : hossein-pc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2405)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
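One thing I'm planning to try next is enabling Python's faulthandler inside every DataLoader worker, so that if a worker segfaults again I at least get a Python-level stack dump instead of just "killed by signal". A minimal sketch of what I mean (the dataset, batch size and worker count here are placeholders, not my actual timm training setup):

import faulthandler
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # Dump the Python stack of all threads to stderr if this worker receives
    # SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
    faulthandler.enable(all_threads=True)

# Placeholder dataset; in reality this is the ImageNet folder loaded by timm.
dataset = TensorDataset(torch.zeros(64, 3, 224, 224), torch.zeros(64, dtype=torch.long))
loader = DataLoader(dataset, batch_size=16, num_workers=8, worker_init_fn=worker_init_fn)

for images, targets in loader:
    pass  # the actual training step goes here

If the crash happens inside a C extension (the JPEG decoder, for example), the dump should at least show which Python call triggered it.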
It should be noted that prior to the upgrade this very same script ran without a hitch for a week nonstop, so I'm 99% sure the script is fine.
I have also run the AIDA64 stress test (CPU/FPU/Cache) successfully, once for 2.5 hours in a single sitting and for about an hour each on several other occasions.
I have also run memtest for 6 hours (basically all the default tests, 4 passes) successfully. I had these issues before upgrading the BIOS, and after updating to the latest BIOS version I still face the segfaults.
At this point I'm at a complete loss as to what the cause could be. I also installed Ubuntu 22.04, which froze mid-training as well, at which point I reinstalled Ubuntu 20.04, and all of this happened today!
Below are my previous errors in case they matter:
- A weird error where the path to my dataset on the NVMe drive came out corrupted; note the error:
FileNotFoundError: [Errno 2] No such file or directory: /media/hossein/SSE/ImageNdt_DataS`t/trainjn036970p7/n03693007_276q.JPEG
The correct path was
/media/hossein/SSD/ImageNet_DataSet/train/n03697007/n03697007_2760.JPEG
This looked like a memory issue to me, and I also wondered whether the NVMe drive was acting up because of heat, so I moved the drive (Samsung 980 1TB) to another M.2 slot lower down the board (previously it sat between the CPU socket and the graphics card slot, where it ran very hot, around 65-76 °C). I updated the BIOS after this. To rule out on-disk corruption I'm also thinking of verifying the whole dataset; see the sketch after the error logs below.
- A connection abort error (the NVMe drive suddenly disappeared; see the picture above). There is also a small mount-watchdog idea for this further down, after the error logs:
Train: 0 [2200/5004 ( 44%)] Loss: 6.439 (6.79) Time-Batch: 0.120s, 2135.13/s LR: 1.000e-01 Data: 0.006 (0.092)
Train: 0 [2400/5004 ( 48%)] Loss: 6.337 (6.76) Time-Batch: 0.118s, 2164.70/s LR: 1.000e-01 Data: 0.003 (0.092)
WARNING: Skipped sample (index 1068111, file n04347754/n04347754_93404.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 349727, file n02115641/n02115641_30352.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 910908, file n03908714/n03908714_3517.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 894431, file n03877472/n03877472_17451.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 779988, file n03590841/n03590841_10648.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 213196, file n02089078/n02089078_8336.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 629596, file n03000134/n03000134_6084.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 1221601, file n07753592/n07753592_1779.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 1089611, file n04399382/n04399382_31586.JPEG). [Errno 107] Transport endpoint is not connected
Traceback (most recent call last):
...
raise exception
ConnectionAbortedError: Caught ConnectionAbortedError in DataLoader worker process 10.
...
ConnectionAbortedError: [Errno 103] Software caused connection abort: '/media/hossein/SSD/ImageNet_DataSet/train/n02883205/n02883205_6142.JPEG'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3721) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
...
- Another segmentation fault:
Train: 0 [1600/5004 ( 32%)] Loss: 6.612 (6.85) Time-Batch: 0.103s, 2492.08/s LR: 1.000e-01 Data: 0.002 (0.099)
Train: 0 [1800/5004 ( 36%)] Loss: 6.658 (6.82) Time-Batch: 0.108s, 2376.50/s LR: 1.000e-01 Data: 0.008 (0.099)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
main()
File "/home/hossein/pytorch-image-models/train.py", line 525, in main
train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
loss_scaler(loss, optimizer,
File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 48, in __call__
self._scaler.step(optimizer)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in _maybe_opt_step
if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in <genexpr>
if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6063) is killed by signal: Segmentation fault.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5934) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-06-04_18:17:11
host : hossein-pc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5934)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
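Regarding the corrupted-path error above: since single characters in the filename came out flipped, my guess is RAM rather than the disk, but to rule out on-disk corruption I'm thinking of walking the whole train set and verifying every JPEG, roughly like this (the root path is my real one, the rest is only a sketch):

import os
from PIL import Image

root = "/media/hossein/SSD/ImageNet_DataSet/train"
bad = []
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            with Image.open(path) as img:
                img.verify()  # structural check only, does not fully decode the image
        except Exception as exc:
            bad.append((path, exc))

print(f"{len(bad)} unreadable files")
for path, exc in bad[:20]:
    print(path, exc)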
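And for the disappearing-drive case (the connection abort errors), a tiny watchdog running next to training could at least pin down the exact moment the mount dies. This is only an idea; the mount point is my real one, the interval and logging are arbitrary:

import os
import time
import threading

MOUNT = "/media/hossein/SSD"

def watch_mount(interval=5):
    while True:
        try:
            os.statvfs(MOUNT)  # starts raising OSError (e.g. errno 107) once the drive drops
        except OSError as exc:
            print(f"[{time.ctime()}] mount check failed: {exc}", flush=True)
        time.sleep(interval)

threading.Thread(target=watch_mount, daemon=True).start()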
Is there any way I can know for sure which part is causing the segfaults?
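The only partial idea I have myself is to let the crashing worker leave a core dump and then open it with gdb to see which native library it died in. A rough sketch of the worker-side part (whether a dump actually gets written also depends on the system's core_pattern, so treat this as an idea, not a recipe):

import resource

def worker_init_fn(worker_id):
    # Raise the soft core-dump limit to the hard limit so a segfaulting worker
    # can leave a core file instead of dying silently.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

Opening the resulting core file in gdb against the Python binary should at least name the faulting shared library.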
Side note:
Under Windows, using Hard Disk Sentinel, my NVMe drive reports 100% health.
Also, I have the swap file disabled completely since I have 32 GB of RAM (could this have anything to do with the segfaults?).
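(My understanding is that running out of memory with no swap would normally get the process killed by the OOM killer rather than segfault it, but to rule it out I could simply log the available memory while training, roughly like this sketch:)

import time

def mem_available_gib():
    # MemAvailable is reported in kB in /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)
    return float("nan")

while True:
    print(f"{time.ctime()}  MemAvailable: {mem_available_gib():.1f} GiB", flush=True)
    time.sleep(10)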
Here are the contents of dmesg and dmesg.0, plus the "Important" and "Hardware" categories from the Logs app in Ubuntu:
dmesg content: https://pastebin.com/fctcEmnB
dmesg.0: https://pastebin.com/mmvR8hSV
logs content-important: https://pastebin.com/NsVgsxYx
logs content-hardware: https://pastebin.com/cYyPCgCL
and here is the output of smartctl:
(base) hossein@hossein-pc:~$ sudo smartctl -a -x /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.13.0-48-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 980 1TB
Serial Number: S649NJ0R331701H
Firmware Version: 2B4QFXO7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 5
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization: 465,389,219,840 [465 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 d311422bf4
Local Time is: Thu Jun 9 10:11:14 2022 +0430
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055): Comp DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x10): *Other*
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 5.24W - - 0 0 0 0 0 0
1 + 4.49W - - 1 1 1 1 0 0
2 + 2.19W - - 2 2 2 2 0 500
3 - 0.0500W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 1000 9000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 62,083,877 [31.7 TB]
Data Units Written: 7,758,068 [3.97 TB]
Host Read Commands: 626,314,400
Host Write Commands: 90,214,169
Controller Busy Time: 1,148
Power Cycles: 290
Power On Hours: 1,668
Unsafe Shutdowns: 49
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 38 Celsius
Temperature Sensor 2: 38 Celsius
Thermal Temp. 2 Transition Count: 185
Thermal Temp. 2 Total Time: 54
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Update 3:
I got some new error logs. This happened today, and it seems to have something to do with ntfs-3g (taken from /var/log):
There is also the Logs application in Ubuntu, which seemingly has logs for several categories; which of them should I be looking into?
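While I wait for pointers on which category matters, this is the kind of quick filter I could run over the saved dmesg dumps to spot anything relevant (the file names are just what I saved the pastebins as, and the keyword list is my own guess):

import re

KEYWORDS = re.compile(r"segfault|mce|machine check|nvme|ntfs|i/o error", re.IGNORECASE)

for log_file in ["dmesg.txt", "dmesg.0.txt"]:
    with open(log_file, errors="replace") as f:
        for line in f:
            if KEYWORDS.search(line):
                print(f"{log_file}: {line.rstrip()}")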