
I recently bought a new motherboard/CPU (Asus ROG Strix Z690-A Gaming WiFi D4 / Intel i7-12700K) and initially kept using my existing Ubuntu 20.04 installation, which dated from my previous Z370 motherboard.
During my initial usage (training deep learning models) I noticed weird behavior: sudden segfaults after seemingly random intervals (sometimes just minutes after starting the training, sometimes after several hours), or my NVMe drive suddenly disappearing, leaving me with errors like this:
(screenshot: the errors shown when the NVMe drive disappears)

I thought maybe this was due to the old installation, so I went ahead and installed a fresh Ubuntu 20.04, installed the latest NVIDIA driver (515) along with the required packages (PyTorch 1.11, Anaconda3), and started again. I hit one system hang (I was browsing in the latest version of Firefox when everything froze and nothing would respond; even Ctrl+Alt+F* wouldn't work), so I had to hard reset. I rebooted, resumed training, and then another segfault occurred, which looks like this.
Notice the RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault. line below:

Train: 22 [1200/5004 ( 24%)]  Loss: 3.231 (3.24)  Time-Batch: 0.110s, 2325.76/s  LR: 1.000e-01  Data: 0.003 (0.130)
Train: 22 [1400/5004 ( 28%)]  Loss: 3.278 (3.24)  Time-Batch: 0.102s, 2500.91/s  LR: 1.000e-01  Data: 0.002 (0.128)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
    main()
  File "/home/hossein/pytorch-image-models/train.py", line 525, in main
    train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
  File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
    loss_scaler(loss, optimizer,
  File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2477) is killed by signal: Segmentation fault. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2405) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-08_20:28:21
  host      : hossein-pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2405)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

It should be noted that prior to upgrading, this very same script worked without a hitch for a week nonstop, so I'm 99% sure the script is fine.
I have also run the AIDA64 stress test (CPU/FPU/Cache) for 2.5 hours in one sitting, and for multiple 1-hour runs on other occasions, all successfully.
I have also run memtest for 6 hours (basically all the default tests, 4 passes) successfully. I had these issues before upgrading the BIOS, and after updating to the latest BIOS version I still face this segfault.
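
For completeness, the memory can also be stress-tested from inside the running system while everything else is idle. Here is a minimal sketch using the memtester package (the package name, test size, and pass count are my assumptions; leave a few GB unallocated so the desktop stays responsive):

sudo apt install memtester
# lock and test 16 GiB of the 32 GiB of RAM for one pass
sudo memtester 16G 1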

At this point I'm at a complete loss as to what could be the cause. I also installed 22.04, which froze mid-training as well, at which point I reinstalled Ubuntu 20.04 again; all of this happened today!

Below are my previous errors in case they matter:

  • Weird error where the path to my dataset, located on my NVMe drive, was corrupted; note the error:
FileNotFoundError: [Errno 2] No such file or directory: /media/hossein/SSE/ImageNdt_DataS`t/trainjn036970p7/n03693007_276q.JPEG

The correct path was

/media/hossein/SSD/ImageNet_DataSet/train/n03697007/n03697007_2760.JPEG

This looked like a memory issue to me, and I thought maybe my NVMe drive was misbehaving due to heat, so after this I moved the drive (Samsung 980 1TB) to another slot lower down the board (previously it was installed between the CPU socket and the graphics card slot, which made it run very hot, around 65-76 °C). I updated the BIOS after this. (A sketch for watching the drive temperature follows after this list of errors.)

  • Connection abort error (the NVMe drive suddenly disappeared; see the picture above):
Train: 0 [2200/5004 ( 44%)]  Loss: 6.439 (6.79)  Time-Batch: 0.120s, 2135.13/s  LR: 1.000e-01  Data: 0.006 (0.092)
Train: 0 [2400/5004 ( 48%)]  Loss: 6.337 (6.76)  Time-Batch: 0.118s, 2164.70/s  LR: 1.000e-01  Data: 0.003 (0.092)
WARNING: Skipped sample (index 1068111, file n04347754/n04347754_93404.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 349727, file n02115641/n02115641_30352.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 910908, file n03908714/n03908714_3517.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 894431, file n03877472/n03877472_17451.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 779988, file n03590841/n03590841_10648.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 213196, file n02089078/n02089078_8336.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 629596, file n03000134/n03000134_6084.JPEG). [Errno 103] Software caused connection abort
WARNING: Skipped sample (index 1221601, file n07753592/n07753592_1779.JPEG). [Errno 107] Transport endpoint is not connected
WARNING: Skipped sample (index 1089611, file n04399382/n04399382_31586.JPEG). [Errno 107] Transport endpoint is not connected
Traceback (most recent call last):
 ...
  raise exception
ConnectionAbortedError: Caught ConnectionAbortedError in DataLoader worker process 10.
 ...
ConnectionAbortedError: [Errno 103] Software caused connection abort: '/media/hossein/SSD/ImageNet_DataSet/train/n02883205/n02883205_6142.JPEG'

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3721) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
...
  • Another segmentation fault:
Train: 0 [1600/5004 ( 32%)]  Loss: 6.612 (6.85)  Time-Batch: 0.103s, 2492.08/s  LR: 1.000e-01  Data: 0.002 (0.099)
Train: 0 [1800/5004 ( 36%)]  Loss: 6.658 (6.82)  Time-Batch: 0.108s, 2376.50/s  LR: 1.000e-01  Data: 0.008 (0.099)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/hossein/pytorch-image-models/train.py", line 736, in <module>
    main()
  File "/home/hossein/pytorch-image-models/train.py", line 525, in main
    train_metrics = train_one_epoch(epoch, model, loader_train, optimizer, train_loss_fn, args,
  File "/home/hossein/pytorch-image-models/train.py", line 600, in train_one_epoch
    loss_scaler(loss, optimizer,
  File "/home/hossein/pytorch-image-models/timm/utils/cuda.py", line 48, in __call__
    self._scaler.step(optimizer)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in _maybe_opt_step
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in <genexpr>
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6063) is killed by signal: Segmentation fault. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5934) of binary: /home/hossein/anaconda3/bin/python3
Traceback (most recent call last):
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hossein/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-06-04_18:17:11
  host      : hossein-pc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5934)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
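
As mentioned above, drive heat was one of my suspicions, so a quick way to keep an eye on the controller temperature while training runs is something like the following sketch (the device path /dev/nvme0n1 matches my setup; adjust as needed):

# print the NVMe SMART temperature readings every 5 seconds
watch -n 5 'sudo smartctl -A /dev/nvme0n1 | grep -i temperature'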

Is there any way I can know for sure which part is causing the segfaults?

Side note:
Under Windows, using Hard Disk Sentinel, my NVMe drive shows as 100% healthy. Also, I have the swap file completely disabled, since I have 32 GB of RAM (could this have anything to do with the segfaults?).
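
To rule the missing swap out as a factor, a temporary swap file can be enabled and the training re-run; a minimal sketch (the 8 GiB size is an arbitrary assumption):

# create and enable a temporary 8 GiB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# confirm it is active
swapon --show
free -h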

Here are the contents of dmesg and dmesg.0, plus the contents of the Important and Hardware categories from the Logs app in Ubuntu:

dmesg content: https://pastebin.com/fctcEmnB
dmesg.0: https://pastebin.com/mmvR8hSV
logs content-important: https://pastebin.com/NsVgsxYx
logs content-hardware: https://pastebin.com/cYyPCgCL
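
For anyone checking these locally, a search along the following lines can be used to look for disk, NVMe, or machine-check errors in the kernel log (the grep patterns are my assumptions; add others as needed):

# search the current kernel log
sudo dmesg --ctime | grep -iE "i/o error|nvme|mce|segfault"
# the same search over the previous boot
sudo journalctl -k -b -1 | grep -iE "i/o error|nvme|mce|segfault"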

and here is the output of smartctl:

(base) hossein@hossein-pc:~$ sudo smartctl -a -x /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.13.0-48-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 1TB
Serial Number:                      S649NJ0R331701H
Firmware Version:                   2B4QFXO7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      5
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            465,389,219,840 [465 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 d311422bf4
Local Time is:                      Thu Jun  9 10:11:14 2022 +0430
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x10):        *Other*

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.24W       -        -    0  0  0  0        0       0
 1 +     4.49W       -        -    1  1  1  1        0       0
 2 +     2.19W       -        -    2  2  2  2        0     500
 3 -   0.0500W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    62,083,877 [31.7 TB]
Data Units Written:                 7,758,068 [3.97 TB]
Host Read Commands:                 626,314,400
Host Write Commands:                90,214,169
Controller Busy Time:               1,148
Power Cycles:                       290
Power On Hours:                     1,668
Unsafe Shutdowns:                   49
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               38 Celsius
Thermal Temp. 2 Transition Count:   185
Thermal Temp. 2 Total Time:         54

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Update 3:

I got some new error logs; this happened today, and it seems to have something to do with ntfs-3g: (screenshots of the ntfs-3g error messages)

  • Can you dig around in dmesg for I/O errors? I'm suspecting that something is going wonky with your SSD (perhaps an intermittent controller failure due to heat?). Also, if you could use smartctl to get some drive data (particularly how much data has been written to the drive), that may be helpful.
    – ArrayBolt3
    Commented Jun 8, 2022 at 19:51
  • Thanks a lot, but how should I do that? There is a dmesg in /var/log, and there is this Logs application in Ubuntu which seemingly has logs for several categories; which should I be looking into?
    – Hossein
    Commented Jun 9, 2022 at 0:32
  • In a terminal, run "nano /var/log/dmesg". Then press Ctrl+W, type "I/O error", and hit Enter to search for it. You can search for more errors by pressing Ctrl+W and Enter to find the next successive error.
    – ArrayBolt3
    Commented Jun 9, 2022 at 2:47
  • Thanks a lot, really appreciate it. I didn't find any "I/O error", but I uploaded the contents of dmesg and the logs to Pastebin; maybe someone can spot something out of the ordinary in there.
    – Hossein
    Commented Jun 9, 2022 at 4:34

1 Answer


OK, here's the update.
All of those issues were caused by my memory running at a 2400 MHz clock (XMP was disabled!), so, as I suspected, it was a memory issue.

Based on my research, for Intel 12th-gen CPUs/motherboards the validated memory speed is 3200 MHz (this is what's tested and should work flawlessly); speeds as low as 2400 MHz seemingly work, but not without issues under heavy loads.

That's when I went ahead and activated the XMP profile, and my memory started running at its advertised speed (3000 MHz). I also overclocked it to 3200 MHz and 3600 MHz and found that from 3000 MHz upward the issues go away, so there's no need to push the memory above 3000 MHz for now.

So always activate XMP, and note the speed as well. If you have a system-intensive workload where your CPU/GPU/RAM/disk are all at 100% utilization 24/7, this is a must; otherwise you may not see any issues right away.
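
To verify from inside Linux what speed the memory is actually running at (rather than trusting the BIOS screen), a check like the following can be used; this is a sketch, and the exact output format varies by board:

# report the configured speed of each DIMM (requires root)
sudo dmidecode -t memory | grep -i speed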

Note that I had no issues under Windows: even though I ran benchmarks for an extended amount of time and tested each component individually, nothing popped up. The problem only reared its ugly head when the whole system was under heavy load!

As to why I didn't bother activating XMP in the first place: I thought the motherboard and the system overall would work better if I stuck to the default settings ("optimized", as Asus calls them), and that using XMP (and thus an overclock, which I had no use for anyway) would introduce bugs, since the whole platform was still very new and I didn't want any hassle. Obviously I was wrong: always set XMP first!
