#7093 closed Bugs (fixed)
Building "system" segfaults on AIX 6.1 / vacpp
Reported by: | Owned by: | Vladimir Prus | |
---|---|---|---|
Milestone: | To Be Determined | Component: | build |
Version: | Boost 1.50.0 | Severity: | Regression |
Keywords: | aix | Cc: |
Description
Machine:
- AIX 6.1, 4-CPU PowerPC_POWER7
Compiler:
- IBM XL C/C++ for AIX, V11.1
Steps to reproduce:
- unpack Boost source tarball
./bootstrap.sh --with-libraries=system
./b2
Observed behaviour:
$ ./bootstrap.sh --with-libraries=system
-n Building Boost.Build engine with toolset vacpp...
tools/build/v2/engine/bin.aixppc/b2
-n Unicode/ICU support for Boost.Regex?...
not found.
Generating Boost.Build configuration in project-config.jam...
Bootstrapping is done. To build, run:
./b2
To adjust configuration, edit 'project-config.jam'.
Further information:
- Command line help: ./b2 --help
- Getting started guide: http://www.boost.org/more/getting_started/unix-variants.html
- Boost.Build documentation: http://www.boost.org/boost-build2/doc/html/index.html
$ ./b2
Building the Boost C++ Libraries.
Component configuration:
- chrono : not building
- date_time : not building
- exception : not building
- filesystem : not building
- graph : not building
- graph_parallel : not building
- iostreams : not building
- locale : not building
- math : not building
- mpi : not building
- program_options : not building
- python : not building
- random : not building
- regex : not building
- serialization : not building
- signals : not building
- system : building
- test : not building
- thread : not building
- timer : not building
- wave : not building
...found 78 targets...
...updating 17 targets...
common.mkdir stage
common.mkdir stage/lib
common.mkdir bin.v2
common.mkdir bin.v2/libs
common.mkdir bin.v2/libs/system
common.mkdir bin.v2/libs/system/build
common.mkdir bin.v2/libs/system/build/vacpp
common.mkdir bin.v2/libs/system/build/vacpp/release
common.mkdir bin.v2/libs/system/build/vacpp/release/threading-multi
Segmentation fault (core dumped)
Also tried:
./b2 address-model=64
(same behaviour, with release/address-model-64/threading-multi as last output instead of release/threading-multi)
Observed since:
- boost-1.50.0. (Worked fine with boost-1.49.0.)
Unfortunately, I am not familiar enough with either AIX 6.1 or the Boost build process to know what other information might be helpful for you, or how to get it, but I am willing to be talked through additional debugging steps and to provide relevant logs or whatnot.
Attachments (6)
Change History (26)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Component: | Building Boost → build |
---|---|
Owner: | set to |
Please provide the requested information; otherwise we cannot help fix any possible issue.
comment:3 by , 10 years ago
Sorry, I somehow missed the update by steven_watanabe.
The following is using TRUNK code (r82720). No sense in testing against code two releases back.
First off, behaviour with TRUNK b2 is still as described above.
Building b2 with --debug makes the problem go away, so unfortunately there's no debug backtrace to be had.
Memory usage on the machine is nowhere near 100% (this being a server which has about 80 gigs of RAM installed), so it seems steven was right in suspecting an optimization issue.
Sorry that I couldn't be of more help.
comment:4 by , 10 years ago
Okay. An optimization problem. I'll at least need a backtrace for any debugging, so...
- In $BOOST/tools/build/v2/engine/build.jam find the line that says "toolset vacpp xlc : ..."
- Edit the release flags to add debug symbols. (-g)
- re-run the bootstrap script
- run gdb ./b2 to get a backtrace from the error
That should at least give me an idea of what to look at.
comment:5 by , 10 years ago
I couldn't get gdb to produce any sensible output:
(gdb) bt
#0 0x10008ec0 in ?? ()
Fortunately, dbx was a bit more forthcoming. Output from commands where it made sense:
Segmentation fault in . at 0x10008ec0
0x10008ec0 (???) 93830000 stw r28,0x0(r3)
(dbx) listi
0x10008ec0 (???) 93830000 stw r28,0x0(r3)
0x10008ec4 (???) 7c06292e stwx r0,r6,r5
0x10008ec8 (???) 408100b8 ble 0x10008f80 (???)
0x10008ecc (???) 7c8300d0 neg r4,r3
0x10008ed0 (???) 7c1f00d0 neg r0,r31
0x10008ed4 (???) 7c652378 or r5,r3,r4
0x10008ed8 (???) 7fe00378 or r0,r31,r0
0x10008edc (???) 38830004 addi r4,0x4(r3)
0x10008ee0 (???) 7ca5fe70 srawi r5,r5,0x1f
0x10008ee4 (???) 38df0004 addi r6,0x4(r31)
(dbx) registers
$r0:0x00000000 $stkp:0x2ff20310 $toc:0x30003d5c $r3:0x00000001
$r4:0x00000000 $r5:0x00000000 $r6:0x30004330 $r7:0x00000004
$r8:0x00000000 $r9:0xf0696f04 $r10:0x00000004 $r11:0x00000000
$r12:0x1003c318 $r13:0xf0619018 $r14:0x66666667 $r15:0x30004848
$r16:0x00000000 $r17:0x00000000 $r18:0x00000001 $r19:0x00000000
$r20:0x30af2ef8 $r21:0x00000000 $r22:0x00000001 $r23:0x30b20d48
$r24:0x30b19538 $r25:0x304b3e18 $r26:0x30473d8c $r27:0x00000002
$r28:0x00000001 $r29:0x30b45c78 $r30:0x00000000 $r31:0x30af2ef8
$iar:0x10008ec0 $msr:0x0000d032 $cr:0x44200284 $link:0x1003c338
$ctr:0xd013fc20 $xer:0x2000001a
Condition status = 0:g 1:g 2:e 5:e 6:l 7:g
[unset $noflregs to view floating point registers]
[unset $novregs to view vector registers]
[unset $novsregs to view vector scalar registers]
in . at 0x10008ec0
0x10008ec0 (???) 93830000 stw r28,0x0(r3)
(dbx) corefile
Process Name: ./b2
Version: 430
Flags: CORE_VERSION_1 | UBLOCK_VALID | USTACK_VALID | LE_VALID
Signal: SEGV
Process Mode: 32 bit
(dbx) coremap
Mapping: Stack (size=0x3000)
  from (address): 0x2ff20000 - 0x2ff23000
  to (offset) : 0xb10 - 0x3b10
  in file : core
Mapping: Loaded Module Text (size=0x59e44)
  from (address): 0x10000000 - 0x10059e44
  to (offset) : 0x0 - 0x59e44
  in file : ./b2
Mapping: Loaded Module Data (size=0x6eff)
  from (address): 0x300005c9 - 0x300074c8
  to : not available
Mapping: Loaded Module Text (size=0x3d0c3a)
  from (address): 0xd0118500 - 0xd04e913a
  to (offset) : 0x29500 - 0x3fa13a
  in file : /usr/lib/libc.a
Mapping: Loaded Module Text (size=0x93e)
  from (address): 0xd0529100 - 0xd0529a3e
  to (offset) : 0x100 - 0xa3e
  in file : /usr/lib/libcrypt.a
Mapping: Loaded Module Data (size=0xce938)
  from (address): 0xf0616290 - 0xf06e4bc8
  to : not available
Mapping: Loaded Module Data (size=0x128)
  from (address): 0xf06e5608 - 0xf06e5730
  to : not available
(dbx) where
list_copy(??) at 0x10008ec0
cmd_new(??, ??, ??, ??) at 0x1003c334
make1cmds(??) at 0x10038960
make1b(??) at 0x100377b4
make1(??) at 0x100371a0
make(??, ??) at 0x10034bc0
builtin_update_now(??, ??) at 0x10025f3c
function_run(??, ??, ??) at 0x1001efbc
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
parse_file(??, ??) at 0x1003eb78
main(??, ??, ??) at 0x10000d68
(dbx) proc
{
-Identification/Authentication Info---------------------------
pi_pid: 10092636
pi_sid: 23658612
pi_ppid: 25297130
pi_pgrp: 6907
pi_uid: 6907
pi_suid: 6907
---------------Controlling TTY Info---------------------------
pi_ttyp: 10092636
pi_ttyd: 0x0000000000000001
pi_ttympx: 0x0000000000000000
-----------------------------Scheduler Information------------
pi_nice: 0x00000014
pi_state: SACTIVE
pi_flags: SLOAD | SNOSWAP | STRCME | SEXECED
pi_flags2: <none>
pi_thcount: 1
pi_cpu: 0
pi_pri: 67
---------------------------------------------File Management--
pi_maxofile: 0x00000004
pi_cmask: 0x0002
pi_cdir: 0x73504020
pi_rdir: 0x00000000
pi_comm: "b2"
----------------------------------Memory----------------------
pi_adspace: 0x000000007f80f480
pi_majflt: 0x0000000000000000
pi_minflt: 0x0000000000000f50
pi_repage: 0x0000000000000000
pi_size: 0x000000000000091b
pi_utime: N/A
pi_stime: N/A
-------Credentials, Accounting, Profiling & Resource Limits---
pi_cred: (use proc cred)
pi_ru: (use proc ru)
pi_cru: (use proc cru)
pi_ioch: 0x0000000000104f9e
pi_irss: 0x00000000000b75f0
pi_start: Thu Mar 7 14:43:32 2013
pi_rlimit: (use proc rlimit)
-Memory Usage-------------------------------------------------
pi_drss: 0x00000000000008cc
pi_trss: 0x000000000000005a
pi_dvm: 0x00000000000008cc
pi_pi_prm: 0x0000000000000000
pi_tsize: 0x000000000004e4c9
pi_dsize: 0x0000000010b47c30
pi_sdsize: 0x0000000000000000
------------------Signal Management---------------------------
pi_signal: (use proc signal)
pi_sigflags: (use proc sigflags)
pi_sig: <none>
---------------------------------WLM Information--------------
pi_classname: Unclassified
pi_tag: <none>
pi_chk_utime: N/A
pi_chk_ctime: N/A
}
(dbx) fd
0: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
1: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
2: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
3: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
(dbx) map
Entry 1:
  Object name: ./b2
  Text origin: 0x10000000
  Text length: 0x59e44
  Data origin: 0x300005c9
  Data length: 0x6eff
  File descriptor: 0x5
Entry 2:
  Object name: /usr/lib/libcrypt.a
  Member name: shr.o
  Text origin: 0xd0529100
  Text length: 0x93e
  Data origin: 0xf06e5608
  Data length: 0x128
  File descriptor: 0x7
Entry 3:
  Object name: /usr/lib/libc.a
  Member name: shr.o
  Text origin: 0xd0118500
  Text length: 0x3d0c3a
  Data origin: 0xf0616290
  Data length: 0xce938
  File descriptor: 0x9
Other commands weren't as helpful:
(dbx) malloc
libcdebug.a cannot be initialized.
Sorry that you have to talk me through this, but I am not familiar with either Boost Jam or AIX-based debugging. I'm developing on Cygwin / Linux mostly, then merely compile the application on Visual Studio / AIX. When AIX starts throwing problems at me, I feel pretty much lost. ;-)
comment:6 by , 10 years ago
Okay. I guess the next step is to start adding printfs.
I would start with cmd_new in command.c.
#include <stdio.h>

/* at the start of cmd_new */
printf("printing targets: "); fflush(stdout);
list_print(targets);
printf("\n");
printf("printing sources"); fflush(stdout);
list_print(sources);
printf("\n");
printf("printing shell"); fflush(stdout);
list_print(shell);
printf("\n");
fflush(stdout);
The possible results are:
1. b2 runs without errors -- The problem is in cmd_new. Generate assembly so I can see what the compiler is doing.
2. These printfs produce incomplete output because b2 crashes inside one of the calls to list_print. This indicates a problem in the caller. I'd need to know which list failed, and the values printed for the other lists.
3. b2 crashes in the same place as before -- The problem is either in cmd_new or in list_copy. My guess would be cmd_new, but I'd like to see assembler dumps of both functions.
I expect that (2) is the most likely. In this case, repeat this process in make.c (around line 1050, at the call to cmd_new). The important variables are nt, shell, ns, chunk, and start.
Hmm. This actually looks a bit like a use-after-free error. I think I'd really like to see the assembler for list_copy. Also, does AIX have any tool like valgrind that you can use?
comment:7 by , 10 years ago
Output from the printf()'s:
...found 83 targets...
...updating 20 targets...
printing targets: bin.v2
printing sources
printing shell
common.mkdir bin.v2
printing targets: bin.v2/libs
printing sources
printing shell
common.mkdir bin.v2/libs
printing targets: bin.v2/libs/system
printing sources
printing shell
common.mkdir bin.v2/libs/system
printing targets: bin.v2/libs/system/build
printing sources
printing shell
common.mkdir bin.v2/libs/system/build
printing targets: bin.v2/libs/system/build/vacpp
printing sources
printing shell
common.mkdir bin.v2/libs/system/build/vacpp
printing targets: bin.v2/libs/system/build/vacpp/debug
printing sources
printing shell
common.mkdir bin.v2/libs/system/build/vacpp/debug
printing targets: bin.v2/libs/system/build/vacpp/debug/error_code.o
printing sourceslibs/system/src/error_code.cpp
printing shell
Segmentation fault (core dumped)
comment:8 by , 10 years ago
That definitely puts us in case (3). Can you generate assembler for command.c and lists.c? I'm not familiar with xlc, but I know -S is fairly common for this. Be sure to use exactly the same optimization options as bootstrap.
comment:9 by , 10 years ago
It took me a moment to figure out what exactly the command line used by Jam would be, as it does not show up in bootstrap.log. Then I came up with the idea of putting a "GNARF" in the options in build.jam, and checking the error message in the log, which *did* print the command line.
So, the two attached assembler files were generated via:
xlc -o bin.aixppc/command.s -DNDEBUG -DOPT_HEADER_CACHE_EXT -DOPT_GRAPH_DEBUG_EXT -DOPT_SEMAPHORE -DOPT_AT_FILES -DOPT_DEBUG_PROFILE -DOPT_FIX_TARGET_VARIABLES_EXT -DOPT_IMPROVED_PATIENCE_EXT -DYYSTACKSIZE=5000 -S -O3 -qstrict -qinline -bmaxdata:0x40000000 command.c
(Equivalent for lists.c.)
I don't know of a Valgrind-like tool for AIX, but some of the compiler debugging options look promising. I'll toss them at the problem as soon as I get around to it. However, I can only access the AIX machine during office hours, which puts me on a budget here.
Thanks for your help, anyway.
comment:10 by , 10 years ago
Here's my current analysis of the behavior:
The error appears on the instruction:
lwz r0,0(r3) #0x0000057c
Inside the block labelled __L578 in list_copy.
This instruction corresponds to the source lists.c:34
freelist[ bucket ] = result->next;
r3 holds the variable result and its value is 0x1, hence the seg-fault. This means that the free list is corrupted. Now, since this is for lists of size 1, that's probably where the 0x1 comes from. (The next pointer in the free list occupies the same memory as the size in the LIST struct.)
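To make the overlap concrete, here is a minimal C sketch. The two struct definitions are simplified, hypothetical stand-ins for the ones in lists.c (only the first field of each is modelled), so treat this as an illustration of the layout argument rather than the actual Boost.Jam types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: how a block looks while it sits on the free list. */
struct freelist_node_sketch {
    struct freelist_node_sketch *next;   /* link to the next free block */
};

/* Hypothetical stand-in: how the same block looks once it is a live list. */
struct list_sketch {
    struct {
        int size;                        /* number of elements in the list */
    } impl;
};

/* Returns 1 when the link and the size start at the same byte of the block. */
int fields_overlap(void)
{
    return offsetof(struct freelist_node_sketch, next)
        == offsetof(struct list_sketch, impl.size);
}

/* Aborts if the layout assumption of this sketch does not hold. */
void require_overlap(void)
{
    assert(fields_overlap());
}
```

With this layout, a size of 1 written into a block and later read back as a link would show up as the pointer value 0x1, which matches the value seen in r3.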
The most likely culprit is list_sublist, since (a) it was the last list operation called before the error and (b) this was the first time that list_sublist was called with a non-empty list. I'll review this function, but it'll take a little while since I'm not very familiar with PPC assembly.
What might help for tracking this down is in lists.c:
#undef NDEBUG
#include <assert.h>
and sprinkle
assert((unsigned long)freelist[0] != 1ul);
around. Adding this assertion between lines 33 and 34 in list_alloc should catch the corruption just before the segfault.
by , 10 years ago
Attachment: | lists.s.assert added |
---|
lists.s assembler output, with assert added in line 34 and NDEBUG undefined.
comment:11 by , 10 years ago
You won't like this...
Adding the assert in line 33/34 makes the error go away.
I added the assembler of lists.c with the assert and undef line added.
comment:12 by , 10 years ago
That's actually a useful data point, as it proves that the problem is in lists.c.
comment:13 by , 10 years ago
I've found the culprit. As I suspected, it's in list_sublist.
__L830: # 0x00000830 (H.10.NO_SYMBOL+0x830)
    neg r3,r0
    addi r0,r28,4
    stw r30,0(r6)
    or r7,r6,r3
    lwz r3,0(r6)
    stwx r3,r5,r4
line 34 (list_alloc): freelist[ bucket ] = result->next;
line 171 (list_copy_range): result->impl.size = size
result is the same pointer in both functions. This is reordered to
result->impl.size = size;
freelist[ bucket ] = result->next;
which is equivalent to
result->impl.size = size;
freelist[ bucket ] = (struct freelist_node*)size;
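The effect of that reordering can be simulated without relying on any compiler behaviour. The sketch below is an assumption-laden model, not code from lists.c: block_t, pop_then_set_size and set_size_then_pop are hypothetical names, the block is accessed only through memcpy so the sketch itself has no aliasing problem, and the size is stored as a uintptr_t so that it has the same width as the link (in the 32-bit process above, int and pointers are both 4 bytes).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One raw block, wide enough to hold either a link or a size. */
typedef struct {
    unsigned char bytes[sizeof(uintptr_t)];
} block_t;

/* Intended order: read the link first, then overwrite the block with the size.
   Returns the value that would become the new free-list head. */
uintptr_t pop_then_set_size(block_t *b, uintptr_t size)
{
    uintptr_t head;
    assert(b != NULL);
    memcpy(&head, b->bytes, sizeof head);   /* freelist[bucket] = result->next */
    memcpy(b->bytes, &size, sizeof size);   /* result->impl.size = size        */
    return head;
}

/* Reordered: the size is stored first, so the "link" read afterwards
   is really the size. */
uintptr_t set_size_then_pop(block_t *b, uintptr_t size)
{
    uintptr_t head;
    assert(b != NULL);
    memcpy(b->bytes, &size, sizeof size);   /* result->impl.size = size        */
    memcpy(&head, b->bytes, sizeof head);   /* freelist[bucket] = result->next */
    return head;
}
```

If a block whose link is some valid address is passed to both functions with a size of 1, the first returns the original link while the second returns 1. A free-list head of 1 is what the next allocation then tries to dereference.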
comment:14 by , 10 years ago
This is almost certainly caused by strict aliasing. You can disable strict aliasing with -qalias=noansi. The attached patch should fix the problem permanently. If you can confirm this, I'll commit it.
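For readers unfamiliar with the issue: one common way to make this kind of overlap visible to the compiler is to put both views of the block into a single union and access it only through that union. The sketch below is not the patch attached to this ticket (its contents are not shown here); node_sketch, sketch_release and sketch_acquire are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Both views of a block live in one union, so reads of the link and
   writes of the size go through the same type. */
typedef union node_sketch {
    union node_sketch *next;   /* view while the block is on the free list */
    int size;                  /* view while the block is a live list      */
} node_sketch;

static node_sketch *freelist_sketch = NULL;

/* Push a block onto the free list. */
void sketch_release(node_sketch *n)
{
    assert(n != NULL);
    n->next = freelist_sketch;
    freelist_sketch = n;
}

/* Pop a block from the free list and mark it with its size.
   Returns NULL when the free list is empty. */
node_sketch *sketch_acquire(int size)
{
    node_sketch *result = freelist_sketch;
    if (result == NULL)
        return NULL;
    freelist_sketch = result->next;   /* read the link through the union   */
    result->size = size;              /* then overwrite it through the same union */
    return result;
}
```

Because the load and the store both name members of the same union object, a conforming compiler should not be free to swap them. Whether that is enough for a particular xlc version is exactly the kind of thing that would need to be confirmed on the affected machine.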
follow-up: 16 comment:15 by , 10 years ago
I can confirm that -qalias=noansi (added to the release compiler options) solves the problem.
However, the provided patch does *not* solve the problem. (Do you need another ASM dump?)
comment:16 by , 10 years ago
Replying to Martin Baute <solar@…>:
I can confirm that -qalias=noansi (added to the release compiler options) solves the problem.
However, the provided patch does *not* solve the problem.
Ugh. Try this one. If it doesn't work, I think it's a compiler bug.
(Do you need another ASM dump?)
No, it's highly unlikely to be different from the original.
comment:19 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Well, I'd like to see a backtrace for the error. You'll need to make a debug build of b2:
There are a couple of possibilities:
If the build completes without errors, we're probably either dealing with an optimizer bug or undefined behavior exposed by compiler optimization. The other possibility I can think of is that you're running out of memory. Boost.Build uses a lot of memory and doesn't handle failed allocations correctly.
Also, try building b2 in the trunk.
I know I did fix at least one optimizer problem on AIX, and I'd like to make sure that your problem hasn't already been fixed.