Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#7093 closed Bugs (fixed)

Building "system" segfaults on AIX 6.1 / vacpp

Reported by: Martin Baute <solar@…> Owned by: Vladimir Prus
Milestone: To Be Determined Component: build
Version: Boost 1.50.0 Severity: Regression
Keywords: aix Cc:

Description

Machine:

  • AIX 6.1, 4-CPU PowerPC_POWER7

Compiler:

  • IBM XL C/C++ for AIX, V11.1

Steps to reproduce:

  • unpack Boost source tarball
  • ./bootstrap.sh --with-libraries=system
  • ./b2

Observed behaviour:

$ ./bootstrap.sh --with-libraries=system

-n Building Boost.Build engine with toolset vacpp... 
tools/build/v2/engine/bin.aixppc/b2
-n Unicode/ICU support for Boost.Regex?... 
not found.
Generating Boost.Build configuration in project-config.jam...

Bootstrapping is done. To build, run:

    ./b2
    
To adjust configuration, edit 'project-config.jam'.
Further information:

   - Command line help:
     ./b2 --help
     
   - Getting started guide: 
     http://www.boost.org/more/getting_started/unix-variants.html
     
   - Boost.Build documentation:
     http://www.boost.org/boost-build2/doc/html/index.html


$ ./b2

Building the Boost C++ Libraries.



Component configuration:

    - chrono                   : not building
    - date_time                : not building
    - exception                : not building
    - filesystem               : not building
    - graph                    : not building
    - graph_parallel           : not building
    - iostreams                : not building
    - locale                   : not building
    - math                     : not building
    - mpi                      : not building
    - program_options          : not building
    - python                   : not building
    - random                   : not building
    - regex                    : not building
    - serialization            : not building
    - signals                  : not building
    - system                   : building
    - test                     : not building
    - thread                   : not building
    - timer                    : not building
    - wave                     : not building

...found 78 targets...
...updating 17 targets...
common.mkdir stage
common.mkdir stage/lib
common.mkdir bin.v2
common.mkdir bin.v2/libs
common.mkdir bin.v2/libs/system
common.mkdir bin.v2/libs/system/build
common.mkdir bin.v2/libs/system/build/vacpp
common.mkdir bin.v2/libs/system/build/vacpp/release
common.mkdir bin.v2/libs/system/build/vacpp/release/threading-multi
Segmentation fault (core dumped)

Also tried:

  • ./b2 address-model=64 (same behaviour, with release/address-model-64/threading-multi as last output instead of release/threading-multi)

Observed since:

  • boost-1.50.0. (Worked fine with boost-1.49.0.)

Unfortunately I am not familiar enough with either AIX 6.1 or the Boost build process to know what other information might be helpful to you, or how to get it, but I am willing to be talked through additional debugging steps and to provide relevant logs or whatever else is needed.

Attachments (6)

lists.s (105.2 KB ) - added by Martin Baute <solar@…> 10 years ago.
lists.s assembler output
command.s (10.1 KB ) - added by Martin Baute <solar@…> 10 years ago.
command.s assembler output
lists.s.assert (110.6 KB ) - added by Martin Baute <solar@…> 10 years ago.
lists.s assembler output, with assert added in line 34 and NDEBUG undefined.
lists.c.patch (483 bytes ) - added by Steven Watanabe 10 years ago.
Patch to fix the problem
lists.c.2.patch (1.5 KB ) - added by Steven Watanabe 10 years ago.
Patch to fix the problem (take 2)
lists.patch (2.1 KB ) - added by Steven Watanabe 10 years ago.
One more attempt


Change History (26)

comment:1 by Steven Watanabe, 10 years ago

Well, I'd like to see a backtrace for the error. You'll need to make a debug build of b2:

cd tools/build/v2/engine
./build.sh --debug
cd -
gdb tools/build/v2/engine/bin.*.debug/b2

There are a couple of possibilities:

If the debug build runs without errors, we're probably dealing with either an optimizer bug or undefined behavior exposed by compiler optimization. The other possibility I can think of is that you're running out of memory. Boost.Build uses a lot of memory and doesn't handle failed allocations correctly.

Also, try building b2 from trunk:

svn checkout http://svn.boost.org/svn/boost/trunk/tools/build/v2/engine

I know I did fix at least one optimizer problem on AIX, and I'd like to make sure that your problem hasn't already been fixed.

comment:2 by viboes, 10 years ago

Component: Building Boost → build
Owner: set to Vladimir Prus

Please provide the requested information; otherwise we cannot help fix the issue.

comment:3 by Martin Baute <solar@…>, 10 years ago

Sorry, I somehow missed the update by steven_watanabe.

The following is using TRUNK code (r82720). No sense in testing against code two releases back.

First off, behaviour with TRUNK b2 is still as described above.

Building b2 with --debug makes the problem go away, so unfortunately there's no debug backtrace to be had.

Memory usage on the machine is nowhere near 100% (this being a server which has about 80 gigs of RAM installed), so it seems Steven was right to suspect an optimization issue.

Sorry that I couldn't be of more help.

comment:4 by Steven Watanabe, 10 years ago

Okay. An optimization problem. I'll at least need a backtrace for any debugging, so...

  1. In $BOOST/tools/build/v2/engine/build.jam find the line that says "toolset vacpp xlc : ..."
  2. Edit the release flags to add debug symbols. (-g)
  3. re-run the bootstrap script
  4. run gdb ./b2 to get a backtrace from the error

That should at least give me an idea of what to look at.

comment:5 by Martin Baute <solar@…>, 10 years ago

I couldn't get gdb to produce any sensible output:

(gdb) bt
#0  0x10008ec0 in ?? ()

Fortunately, dbx was a bit more forthcoming. Output from commands where it made sense:

Segmentation fault in . at 0x10008ec0
0x10008ec0 (???) 93830000         stw   r28,0x0(r3)
(dbx) listi
0x10008ec0 (???) 93830000         stw   r28,0x0(r3)
0x10008ec4 (???) 7c06292e        stwx   r0,r6,r5
0x10008ec8 (???) 408100b8         ble   0x10008f80 (???) 
0x10008ecc (???) 7c8300d0         neg   r4,r3
0x10008ed0 (???) 7c1f00d0         neg   r0,r31
0x10008ed4 (???) 7c652378          or   r5,r3,r4
0x10008ed8 (???) 7fe00378          or   r0,r31,r0
0x10008edc (???) 38830004        addi   r4,0x4(r3)
0x10008ee0 (???) 7ca5fe70       srawi   r5,r5,0x1f
0x10008ee4 (???) 38df0004        addi   r6,0x4(r31)
(dbx) registers
  $r0:0x00000000  $stkp:0x2ff20310   $toc:0x30003d5c    $r3:0x00000001  
  $r4:0x00000000    $r5:0x00000000    $r6:0x30004330    $r7:0x00000004  
  $r8:0x00000000    $r9:0xf0696f04   $r10:0x00000004   $r11:0x00000000  
 $r12:0x1003c318   $r13:0xf0619018   $r14:0x66666667   $r15:0x30004848  
 $r16:0x00000000   $r17:0x00000000   $r18:0x00000001   $r19:0x00000000  
 $r20:0x30af2ef8   $r21:0x00000000   $r22:0x00000001   $r23:0x30b20d48  
 $r24:0x30b19538   $r25:0x304b3e18   $r26:0x30473d8c   $r27:0x00000002  
 $r28:0x00000001   $r29:0x30b45c78   $r30:0x00000000   $r31:0x30af2ef8  
 $iar:0x10008ec0   $msr:0x0000d032    $cr:0x44200284  $link:0x1003c338  
 $ctr:0xd013fc20   $xer:0x2000001a  
          Condition status = 0:g 1:g 2:e 5:e 6:l 7:g 
        [unset $noflregs to view floating point registers]
        [unset $novregs to view vector registers]
        [unset $novsregs to view vector scalar registers]
in . at 0x10008ec0
0x10008ec0 (???) 93830000         stw   r28,0x0(r3)
(dbx) corefile
 Process Name:  ./b2
 Version:       430
 Flags:         CORE_VERSION_1 | UBLOCK_VALID | USTACK_VALID | LE_VALID
 Signal:        SEGV
 Process Mode:  32 bit
(dbx) coremap
Mapping: Stack (size=0x3000)
   from (address): 0x2ff20000 - 0x2ff23000
   to (offset)   : 0xb10 - 0x3b10
   in file       : core

Mapping: Loaded Module Text (size=0x59e44)
   from (address): 0x10000000 - 0x10059e44
   to (offset)   : 0x0 - 0x59e44
   in file       : ./b2

Mapping: Loaded Module Data (size=0x6eff)
   from (address): 0x300005c9 - 0x300074c8
   to  : not available

Mapping: Loaded Module Text (size=0x3d0c3a)
   from (address): 0xd0118500 - 0xd04e913a
   to (offset)   : 0x29500 - 0x3fa13a
   in file       : /usr/lib/libc.a

Mapping: Loaded Module Text (size=0x93e)
   from (address): 0xd0529100 - 0xd0529a3e
   to (offset)   : 0x100 - 0xa3e
   in file       : /usr/lib/libcrypt.a

Mapping: Loaded Module Data (size=0xce938)
   from (address): 0xf0616290 - 0xf06e4bc8
   to  : not available

Mapping: Loaded Module Data (size=0x128)
   from (address): 0xf06e5608 - 0xf06e5730
   to  : not available
(dbx) where
list_copy(??) at 0x10008ec0
cmd_new(??, ??, ??, ??) at 0x1003c334
make1cmds(??) at 0x10038960
make1b(??) at 0x100377b4
make1(??) at 0x100371a0
make(??, ??) at 0x10034bc0
builtin_update_now(??, ??) at 0x10025f3c
function_run(??, ??, ??) at 0x1001efbc
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
evaluate_rule(??, ??) at 0x10031cb8
function_call_rule(??, ??, ??, ??, ??, ??, ??) at 0x1000fe24
function_run(??, ??, ??) at 0x10020428
parse_file(??, ??) at 0x1003eb78
function_run(??, ??, ??) at 0x1002135c
parse_file(??, ??) at 0x1003eb78
main(??, ??, ??) at 0x10000d68
(dbx) proc
{
-Identification/Authentication Info---------------------------
 pi_pid:        10092636                pi_sid:         23658612
 pi_ppid:       25297130                pi_pgrp:            6907
 pi_uid:            6907                pi_suid:            6907

---------------Controlling TTY Info---------------------------
 pi_ttyp:       10092636                pi_ttyd:        0x0000000000000001
 pi_ttympx:     0x0000000000000000

-----------------------------Scheduler Information------------
 pi_nice:       0x00000014              pi_state:       SACTIVE
 pi_flags:      SLOAD | SNOSWAP | STRCME | SEXECED 
 pi_flags2:     <none>
 pi_thcount:           1                pi_cpu:        0
 pi_pri:              67

---------------------------------------------File Management--
 pi_maxofile:   0x00000004              pi_cmask:       0x0002
 pi_cdir:       0x73504020              pi_rdir:        0x00000000        
 pi_comm:       "b2"

----------------------------------Memory----------------------
 pi_adspace:    0x000000007f80f480
 pi_majflt:     0x0000000000000000      pi_minflt:      0x0000000000000f50
 pi_repage:     0x0000000000000000      pi_size:        0x000000000000091b
 pi_utime:                   N/A        pi_stime:                    N/A

-------Credentials, Accounting, Profiling & Resource Limits---
 pi_cred:       (use proc cred)
 pi_ru:         (use proc ru)
 pi_cru:        (use proc cru)
 pi_ioch:       0x0000000000104f9e      pi_irss:        0x00000000000b75f0
 pi_start:      Thu Mar  7 14:43:32 2013
 pi_rlimit:     (use proc rlimit)

-Memory Usage-------------------------------------------------
 pi_drss:       0x00000000000008cc      pi_trss:        0x000000000000005a
 pi_dvm:        0x00000000000008cc      pi_pi_prm:      0x0000000000000000
 pi_tsize:      0x000000000004e4c9      pi_dsize:       0x0000000010b47c30
 pi_sdsize:     0x0000000000000000

------------------Signal Management---------------------------
 pi_signal:     (use proc signal)       pi_sigflags:    (use proc sigflags)
 pi_sig:        <none>


---------------------------------WLM Information--------------
 pi_classname:  Unclassified
 pi_tag:        <none>
 pi_chk_utime:        N/A               pi_chk_ctime:         N/A
                                                               }
(dbx) fd
0: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
1: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
2: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
3: { fp = 0x0000000000000001, flags = ALLOCATED, count = 0 }
(dbx) map
Entry 1:
   Object name: ./b2
   Text origin:     0x10000000
   Text length:     0x59e44
   Data origin:     0x300005c9
   Data length:     0x6eff
   File descriptor: 0x5

Entry 2:
   Object name: /usr/lib/libcrypt.a
   Member name: shr.o
   Text origin:     0xd0529100
   Text length:     0x93e
   Data origin:     0xf06e5608
   Data length:     0x128
   File descriptor: 0x7

Entry 3:
   Object name: /usr/lib/libc.a
   Member name: shr.o
   Text origin:     0xd0118500
   Text length:     0x3d0c3a
   Data origin:     0xf0616290
   Data length:     0xce938
   File descriptor: 0x9

Other commands weren't as helpful:

(dbx) malloc
libcdebug.a cannot be initialized.

Sorry that you have to talk me through this, but I am not familiar with either Boost Jam or AIX-based debugging. I'm developing on Cygwin / Linux mostly, then merely compile the application on Visual Studio / AIX. When AIX starts throwing problems at me, I feel pretty much lost. ;-)

comment:6 by Steven Watanabe, 10 years ago

Okay. I guess the next step is to start adding printfs.

I would start with cmd_new in command.c.

#include <stdio.h>

/* at the start of cmd_new */
printf("printing targets: ");
fflush(stdout);
list_print(targets);
printf("\n");
printf("printing sources");
fflush(stdout);
list_print(sources);
printf("\n");
printf("printing shell");
fflush(stdout);
list_print(shell);
printf("\n");
fflush(stdout);

The possible results are:

  1. b2 runs without errors -- The problem is in cmd_new. Generate assembly so I can see what the compiler is doing.
  2. These printfs produce incomplete output because b2 crashes inside one of the calls to list_print. This indicates a problem in the caller. I'd need to know which list failed, and the values printed for the other lists.
  3. b2 crashes in the same place as before -- The problem is either in cmd_new or in list_copy. My guess would be cmd_new, but I'd like to see assembler dumps of both functions.

I expect that (2) is the most likely. In this case, repeat this process in make.c, around line 1050, at the call to cmd_new. The important variables are nt, shell, ns, chunk, and start.

Hmm. This actually looks a bit like a use-after-free error. I think I'd really like to see the assembler for list_copy. Also, does AIX have any tool like valgrind that you can use?

comment:7 by Martin Baute <solar@…>, 10 years ago

Output from the printf()'s:

...found 83 targets...
...updating 20 targets...
printing targets: bin.v2
printing sources
printing shell
common.mkdir bin.v2
printing targets: bin.v2/libs
printing sources
printing shell
common.mkdir bin.v2/libs
printing targets: bin.v2/libs/system
printing sources
printing shell
common.mkdir bin.v2/libs/system
printing targets: bin.v2/libs/system/build
printing sources
printing shell
common.mkdir bin.v2/libs/system/build
printing targets: bin.v2/libs/system/build/vacpp
printing sources
printing shell
common.mkdir bin.v2/libs/system/build/vacpp
printing targets: bin.v2/libs/system/build/vacpp/debug
printing sources
printing shell
common.mkdir bin.v2/libs/system/build/vacpp/debug
printing targets: bin.v2/libs/system/build/vacpp/debug/error_code.o
printing sourceslibs/system/src/error_code.cpp
printing shell
Segmentation fault (core dumped)

comment:8 by Steven Watanabe, 10 years ago

That definitely puts us in case (3). Can you generate assembler for command.c and lists.c? I'm not familiar with xlc, but I know -S is fairly common for this. Be sure to use exactly the same optimization options as bootstrap.

by Martin Baute <solar@…>, 10 years ago

Attachment: lists.s added

lists.s assembler output

by Martin Baute <solar@…>, 10 years ago

Attachment: command.s added

command.s assembler output

comment:9 by Martin Baute <solar@…>, 10 years ago

It took me a moment to figure out what exactly the command line used by Jam would be, as it does not show up in bootstrap.log. Then I came up with the idea of putting a "GNARF" in the options in build.jam, and checking the error message in the log, which *did* print the command line.

So, the two attached assembler files were generated via:

xlc -o bin.aixppc/command.s -DNDEBUG -DOPT_HEADER_CACHE_EXT -DOPT_GRAPH_DEBUG_EXT -DOPT_SEMAPHORE -DOPT_AT_FILES -DOPT_DEBUG_PROFILE -DOPT_FIX_TARGET_VARIABLES_EXT -DOPT_IMPROVED_PATIENCE_EXT -DYYSTACKSIZE=5000 -S -O3 -qstrict -qinline -bmaxdata:0x40000000 command.c

(Equivalent for lists.c.)

I don't know of a Valgrind-like tool for AIX, but some of the compiler debugging options look promising. I'll toss them at the problem as soon as I get around to it. However, I can only access the AIX machine during office hours, which puts me on a budget here.

Thanks for your help, anyway.

comment:10 by Steven Watanabe, 10 years ago

Here's my current analysis of the behavior:

The error appears on the instruction:

lwz        r0,0(r3)  #0x0000057c

Inside the block labelled __L578 in list_copy.

This instruction corresponds to the source lists.c:34

freelist[ bucket ] = result->next;

r3 holds the variable result and its value is 0x1, hence the segfault. This means that the free list is corrupted. Now, since this is for lists of size 1, that's probably where the 0x1 comes from. (The next pointer in the free list occupies the same memory as the size in the LIST struct.)
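
The overlap described above can be sketched as follows. This is a minimal illustration and not the actual lists.c source: the `block` union, its field types, and the helper `bits_seen_as_next` are assumptions made for the example. Only `freelist_node` and `next` are names that appear in this ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, not the real lists.c declarations. */
struct freelist_node {
    struct freelist_node *next;   /* link used while the block is free */
};

typedef union {
    struct freelist_node link;    /* view while the block sits on the free list */
    uintptr_t size;               /* view once the block is a live list header */
} block;

/* Store a size into a block, then return the raw bits that a free-list
   reader would find where it expects `next`. The bytes are copied out
   with memcpy so no invalid pointer is ever formed or dereferenced. */
static uintptr_t bits_seen_as_next(uintptr_t size)
{
    block b;
    uintptr_t raw;

    assert(sizeof raw <= sizeof b);
    b.link.next = NULL;
    b.size = size;                /* overwrites the word that held the link */
    memcpy(&raw, &b, sizeof raw);
    return raw;
}
```

With this layout, storing a size of 1 leaves the value 1 in the word that the free list treats as `next`, which is consistent with the 0x1 seen in r3.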

The most likely culprit is list_sublist, since (a) it was the last list operation called before the error and (b) this was the first time that list_sublist was called with a non-empty list. I'll review this function, but it'll take a little while since I'm not very familiar with PPC assembly.

What might help for tracking this down is in lists.c:

#undef NDEBUG
#include <assert.h>

and sprinkle

assert((unsigned long)freelist[0] != 1ul);

around. Adding this assertion between lines 33 and 34 in list_alloc should catch the corruption just before the segfault.

by Martin Baute <solar@…>, 10 years ago

Attachment: lists.s.assert added

lists.s assembler output, with assert added in line 34 and NDEBUG undefined.

comment:11 by Martin Baute <solar@…>, 10 years ago

You won't like this...

Adding the assert in line 33/34 makes the error go away.

I attached the assembler output of lists.c with the assert and #undef lines added.

comment:12 by Steven Watanabe, 10 years ago

That's actually a useful data point, as it proves that the problem is in lists.c.

comment:13 by Steven Watanabe, 10 years ago

I've found the culprit. As I suspected, it's in list_sublist.

__L830:                                 # 0x00000830 (H.10.NO_SYMBOL+0x830)
        neg        r3,r0
        addi       r0,r28,4
        stw        r30,0(r6)
        or         r7,r6,r3
        lwz        r3,0(r6)
        stwx       r3,r5,r4
line 34(list_alloc):       freelist[ bucket ] = result->next;
line 171(list_copy_range): result->impl.size = size;

result is the same pointer in both functions. This is reordered to

result->impl.size = size;
freelist[ bucket ] = result->next;

which is equivalent to

result->impl.size = size;
freelist[ bucket ] = (struct freelist_node*)size;
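
The effect of the reordering can be reproduced deterministically by writing both orders out by hand. The following is a minimal sketch, assuming a single free-list bucket and a union-based `node` type. The names `node`, `pop_in_order`, and `pop_reordered` are hypothetical and are not taken from lists.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical block: the first word is either a free-list link or a size. */
typedef union node {
    union node *next;
    uintptr_t size;
} node;

/* Intended order: unlink first, then reuse the word as the size.
   Returns the new free-list head as raw bits. */
static uintptr_t pop_in_order(node *head, uintptr_t size)
{
    node *result = head;
    node *new_head;

    assert(result != NULL);
    new_head = result->next;      /* freelist[ bucket ] = result->next; */
    result->size = size;          /* result->impl.size = size;          */
    return (uintptr_t)new_head;
}

/* Order the optimizer produced: the size store comes first, so the value
   later read as `next` is the size itself. The word is copied out with
   memcpy so the example never forms a bogus pointer. */
static uintptr_t pop_reordered(node *head, uintptr_t size)
{
    node *result = head;
    uintptr_t new_head_bits;

    assert(result != NULL);
    result->size = size;
    memcpy(&new_head_bits, result, sizeof new_head_bits);
    return new_head_bits;
}
```

Given a two-element free list, `pop_in_order` returns the address of the second block, while `pop_reordered` called with a size of 1 returns 1. The next allocation from that bucket would then dereference 0x1.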

comment:14 by Steven Watanabe, 10 years ago

This is almost certainly caused by strict aliasing. You can disable strict aliasing with -qalias=noansi. The attached patch should fix the problem permanently. If you can confirm this, I'll commit it.
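
For context, one generic way to keep such overlapping accesses correct under type-based alias analysis is to go through memcpy, which accesses memory as bytes, so the compiler has to assume the link read and the size write may touch the same storage. The sketch below illustrates that idea only. It is not the attached patch, and `read_link`, `write_link`, `write_size`, and `take_block` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Read the free-list link stored in the first bytes of a block. */
static void *read_link(const void *block)
{
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}

/* Write a free-list link into the first bytes of a block. */
static void write_link(void *block, void *next)
{
    memcpy(block, &next, sizeof next);
}

/* Reuse the same bytes for the element count once the block is handed out. */
static void write_size(void *block, size_t size)
{
    memcpy(block, &size, sizeof size);
}

/* Pop the first block from *head and stamp it with `size`. */
static void *take_block(void **head, size_t size)
{
    void *result = *head;

    assert(result != NULL);
    *head = read_link(result);    /* unlink before the bytes are reused */
    write_size(result, size);
    return result;
}
```

Blocks passed to these helpers need to be at least as large as a pointer and a `size_t`. Compiling with -qalias=noansi, as suggested above, avoids the problem without source changes by disabling the type-based aliasing assumptions.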

by Steven Watanabe, 10 years ago

Attachment: lists.c.patch added

Patch to fix the problem

comment:15 by Martin Baute <solar@…>, 10 years ago

I can confirm that -qalias=noansi (added to the release compiler options) solves the problem.

However, the provided patch does *not* solve the problem. (Do you need another ASM dump?)

comment:16 by Steven Watanabe, 10 years ago (in reply to: 15)

Replying to Martin Baute <solar@…>:

> I can confirm that -qalias=noansi (added to the release compiler options) solves the problem.
>
> However, the provided patch does *not* solve the problem.

Ugh. Try this one. If it doesn't work, I think it's a compiler bug.

> (Do you need another ASM dump?)

No, it's highly unlikely to be different from the original.

by Steven Watanabe, 10 years ago

Attachment: lists.c.2.patch added

Patch to fix the problem (take 2)

by Steven Watanabe, 10 years ago

Attachment: lists.patch added

One more attempt

comment:17 by Steven Watanabe, 10 years ago

This last patch gets rid of anything even remotely dubious.

comment:18 by Martin Baute <solar@…>, 10 years ago

Either patch works fine. Well done!

comment:19 by Steven Watanabe, 10 years ago

Resolution: fixed
Status: new → closed

(In [83408]) Prevent incorrect reordering with xlc -qalias=ansi. Fixes #7093.

comment:20 by Steven Watanabe, 10 years ago

Thank you very much for your help debugging this.
