Strange segfault after several hours of running program

Question

I wrote a program closely based on hello_pi/hello_video.c. After several hours of running, it crashes with very strange segfaults that I cannot for the life of me figure out how to resolve. Here is what I am running:

Raspberry PI 3B+
Raspbian Lite (Buster)

I have checked the following:

I am not under-voltage. I am using the official Raspberry Pi 5.1 V, 2.5 A power supply.
I did not overclock the system. It is running at the stock 1.4 GHz.
The system has heat sinks and a fan to keep the temperature low.
The system is also fully up to date (apt update, apt upgrade, rpi-update, etc.).
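
The power and throttling checks can also be confirmed in software. The following is a minimal sketch, assuming the `vcgencmd` tool that ships with Raspbian is on the PATH. The helper names `decode_throttled` and `read_throttled` are hypothetical, and the bit meanings are those documented for `vcgencmd get_throttled`:

```python
import subprocess

# Bit positions reported by `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value):
    """Return the description of every flag bit that is set in value."""
    return [text for bit, text in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]


def read_throttled():
    """Run vcgencmd and parse output such as 'throttled=0x50000'."""
    out = subprocess.run(["vcgencmd", "get_throttled"],
                         capture_output=True, text=True, check=True).stdout
    return int(out.strip().split("=", 1)[1], 16)


if __name__ == "__main__":
    flags = decode_throttled(read_throttled())
    print("\n".join(flags) if flags
          else "no under-voltage or throttling recorded since boot")
```

Running this after a crash would show whether an under-voltage or throttling event happened at any point since boot, which a one-off check at startup could miss.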

Here are a few of the errors I am receiving from the address sanitizer built into gcc:

==26987==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x45009021 at pc 0x769ca548 bp 0x5a7fa474 sp 0x5a7fa040
READ of size 8 at 0x45009021 thread T450 (ILCS_HOST)
    #0 0x769ca547  (/usr/lib/arm-linux-gnueabihf/libasan.so.5+0x3a547)

Address 0x45009021 is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib/arm-linux-gnueabihf/libasan.so.5+0x3a547) 
Shadow bytes around the buggy address:
  0x28a011b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a011c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a011d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a011e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a011f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x28a01200: fa fa fa fa[fa]fa fa fa fa fa fa fa fa fa fa fa
  0x28a01210: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a01220: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a01230: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a01240: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x28a01250: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
Thread T450 (ILCS_HOST) created by T443 here:
    #0 0x769db9c7 in pthread_create (/usr/lib/arm-linux-gnueabihf/libasan.so.5+0x4b9c7)
    #1 0x7693b203 in vcos_thread_create /home/dom/projects/staging/userland/interface/vcos/pthreads/vcos_pthreads.c:212

Thread T443 created by T0 here:
    #0 0x769db9c7 in pthread_create (/usr/lib/arm-linux-gnueabihf/libasan.so.5+0x4b9c7)
    #1 0x74ee1c57 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/usr/lib/arm-linux-gnueabihf/libstdc++.so.6+0x9dc57)
    #2 0x6bb02a7f  (<unknown module>)

==26987==ABORTING

==23420==ERROR: AddressSanitizer: SEGV on unknown address 0x0019c2c0 (pc 0x74b060f0 bp 0x74b08f3c sp 0x6acfe3b8 T2)
==23420==The signal is caused by a READ memory access.
    #0 0x74b060ef in completion_thread (/opt/vc/lib/libvchiq_arm.so+0x20ef)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/opt/vc/lib/libvchiq_arm.so+0x20ef) in completion_thread
Thread T2 (VCHIQ completio) created by T0 here:
    #0 0x769609c7 in pthread_create (/usr/lib/arm-linux-gnueabihf/libasan.so.5+0x4b9c7)
    #1 0x768c0203 in vcos_thread_create /home/dom/projects/staging/userland/interface/vcos/pthreads/vcos_pthreads.c:212

==23420==ABORTING

Here is the output from gdb during those crashes:

Thread 3 "VCHIQ completio" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x6fbfeb00 (LWP 15664)]
0x751c60f0 in completion_thread () from /opt/vc/lib/libvchiq_arm.so

All the crashes seem to be related to "completion_thread()". I have no idea what that is.

Any help would be greatly appreciated!

Almost sounds like a memory leak. Watch it for a while and see if its virtual size increases without bound. This http://www.unknownroad.com/rtfm/gdbtut/gdbsegfault.html looks like it might be helpful with gdb. — bls, Jul 22 '19 at 21:08
I take it you have based this on some example code. Does the example code crash in the same way? If it doesn't you need to find the bug you have added. — joan, Jul 22 '19 at 21:09
@bls well I have checked the memory usage and that does not increase over time. So I am not sure it is a memory leak — ktb92677, Jul 22 '19 at 21:26
@joan I have based it on many bits and pieces of sample code from all over. I am sure I am the one who introduced the issue, but it is extremely odd that neither the sanitizer nor gdb says that the actual error is in my code. — ktb92677, Jul 22 '19 at 21:28
That's probably not unusual at all. Some piece of memory is getting trashed I expect. Good luck debugging. (Without code I doubt anyone can give more than general advice.) — Mark Smith, Jul 22 '19 at 21:33
@MarkSmith I was mostly looking for someone who has perhaps seen this issue before, or for debugging advice. None of these errors tell me where the fault is, and because it only shows up every 3-4 hours or so (occasionally longer), it is really hard to narrow down. — ktb92677, Jul 22 '19 at 21:36
@ktb92677, A system that crashes after some time, from hours to weeks, is a common problem. The usual cause is some resource being used up. A specific case is a "memory leak", where memory is consumed until none is left, e.g. by repeatedly or recursively calling a function that opens a new socket, a new file, a new fork, and so on. Every time you open a socket you use more memory, until none is left. It is difficult to garbage-collect sockets, because you don't know whether they are really garbage. A common workaround is to use a watchdog to reboot when the system hangs, or just to reboot every hour/day/week. — tlfong01, Jul 23 '19 at 01:47
@ktb92677, In case you wish to use a watchdog, here is one: (1) https://raspberrypi.stackexchange.com/questions/99584/cut-power-on-a-remote-raspberry-pi-3-via-another-raspi — tlfong01, Jul 23 '19 at 04:38
You need to find out when you introduced the bug. Start from the base example and make changes until it starts crashing again. That will localise where you introduced the bug. — joan, Jul 23 '19 at 09:44
@tlfong01 Unfortunately a watchdog program is not going to cut it in my case. Thank you for the suggestion though — ktb92677, Jul 23 '19 at 16:13
@joan Thank you for the suggestion. Although it is painstaking to have to sit for 3-4 hours just waiting for the system to crash, it looks like that is what I will have to do. — ktb92677, Jul 23 '19 at 16:14
@ktb92677, Yes, I agree a watchdog or regular reboot is only a workaround, not a solution. I was wondering if there is a way to speed up your sit-and-wait-to-see approach. Perhaps you can modify your swap memory size, or set a slow-burning time bomb, to speed up the memory leak. In my watchdog answer, I use a test time bomb to crash the system within minutes to verify that the watchdog is working. Perhaps you can also log the process number, count sockets, or similar things to narrow down the problematic area. Good luck and cheers. — tlfong01, Jul 24 '19 at 01:39
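
The resource-exhaustion hypothesis raised in the comments can be checked without waiting for a crash by logging the process's virtual size, resident size, and open file descriptor count at intervals. The following is a minimal sketch, assuming a Linux /proc filesystem. The helper names, the 60-second interval, and passing the PID on the command line are illustrative choices, not anything from the original program:

```python
import os
import sys
import time


def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from the contents of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])
    return fields


def sample(pid):
    """Take one snapshot of memory use and open file descriptors for pid."""
    with open(f"/proc/{pid}/status") as f:
        snap = parse_status(f.read())
    # Each entry in /proc/<pid>/fd is one open descriptor (file, socket, pipe).
    snap["fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    return snap


if __name__ == "__main__":
    pid = int(sys.argv[1])
    while True:
        print(time.strftime("%H:%M:%S"), sample(pid), flush=True)
        time.sleep(60)
```

The script has to run as the same user as the target process (or as root) to read its fd directory. A steadily growing VmSize or descriptor count would suggest a leak, while flat numbers would point more toward memory corruption than toward exhaustion.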
