Tuesday, August 10, 2010

Mixed-Language Programming and External Linkage

The C++ standard provides a mechanism called linkage specification for combining, in a single program, code written in different programming languages and compiled by their respective compilers. A linkage specification describes the protocol for linking functions or procedures written in different languages. Linkage itself is the term the C++ standard uses to describe whether a name declared in one place can refer to the same entity from another file, or from another scope within the same file. Three types of linkage exist:
  1. No linkage
  2. Internal linkage
  3. External linkage
Names local to a function, such as its parameters and local variables, have no linkage and hence can be accessed only within that function.

Sometimes it is necessary to declare functions and other objects within a single file in a way that allows them to reference each other, but not to be accessible from outside that file. This can be done through internal linkage. Symbols with internal linkage refer to the same object only within a single source file. Prefixing a file-scope declaration with the keyword static gives it internal linkage instead of the default external linkage.

Objects that have external linkage are all considered to be located at the outermost level of the program. This is the default linkage for functions and for non-const variables declared outside of a function. All instances of a particular name with external linkage refer to the same object in the program. If two or more declarations of the same symbol have external linkage but incompatible types (for example, a mismatch between declaration and definition), the behaviour is undefined: the program may fail to link, crash, or misbehave at run time. The rest of the article discusses one of the issues with mixed code and provides a recommended solution with external linkage.
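The three kinds of linkage described above can be sketched in a single source file. This is only an illustration; the names next_ticket, ticket_counter, and bump are hypothetical and do not come from the example used later in this article.

```cpp
// External linkage (the default for a function declared at file scope):
// another source file could declare next_ticket() and call it.
int next_ticket();

// Internal linkage: 'static' keeps the counter private to this file.
static int ticket_counter = 0;

// Internal linkage for a helper function as well.
static int bump(int value) { return value + 1; }

int next_ticket() {
    // No linkage: 'issued' is a local variable, usable only inside this function.
    int issued = bump(ticket_counter);
    ticket_counter = issued;
    return issued;
}
```

Another file that declares int next_ticket(); would link against this definition, whereas a reference to ticket_counter or bump from that file would remain unresolved.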

In the real world, it is very common to use functionality written in one programming language from code written in another. A trivial example is a C++ programmer relying on the standard C library (libc) to sort a series of integers with the "quick sort" routine. It works because the C++ implementation's headers already declare the C library functions with the appropriate language linkage. But we need to take additional care when we use our own libraries written in C from a C++ program; otherwise the build may fail with link errors caused by unresolved symbols. Consider the following example:

Assume that we're writing C++ code and wish to call a C function from it. Here's the code for the callee, i.e., the C routine:

%cat greet.h
extern char *greet();

%cat greet.c
#include "greet.h"

char *greet() {
    return ((char *) "Hello!");
}

%cc -G -o libgreet.so greet.c

Note: The extern keyword declares a variable or function and specifies that it has external linkage, i.e., its name is visible from files other than the one in which it's defined.
Let's try to call the C function greet() from a C++ program:

%cat mixedcode.cpp
#include <iostream>
#include "greet.h"

int main() {
    char *greeting = greet();
    std::cout << greeting << "\n";
    return (0);
}

 
%CC -lgreet mixedcode.cpp
Undefined                       first referenced
 symbol                            in file
char*greet()                    mixedcode.o
ld: fatal: Symbol referencing errors. No output written to a.out

Though the C++ code is linked with libgreet.so, the dynamic library that holds the implementation of greet(), linking failed with an undefined symbol error. What went wrong?

The reason for the link error is that a typical C++ compiler mangles (encodes) function names to support function overloading. So, the symbol greet is changed to something else depending on the algorithm implemented in the compiler during the name mangling process. Hence the object file does not have the symbol greet anywhere in the symbol table. The symbol table of mixedcode.o confirms this. Let's have a look at the symbol tables of both libgreet.so and mixedcode.o:

%elfdump -s libgreet.so

Symbol Table Section:  .symtab
index    value       size     type bind oth ver shndx       name
...
[1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS         libgreet.so
...
[37]  0x00000268 0x00000004  OBJT GLOB  D    0 .rodata     _lib_version
[38]  0x000102f3 0x00000000  OBJT GLOB  D    0 .data1      _edata
[39]  0x00000228 0x00000028  FUNC GLOB  D    0 .text       greet
[40]  0x0001026c 0x00000000  OBJT GLOB  D    0 .dynamic    _DYNAMIC

%elfdump -s mixedcode.o

Symbol Table Section:  .symtab
index    value       size     type bind oth ver shndx       name
[0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF
[1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS         mixedcode.cpp
[2]  0x00000000 0x00000000  SECT LOCL  D    0 .rodata
[3]  0x00000000 0x00000000  FUNC GLOB  D    0 UNDEF     
    __1cDstd2l6Frn0ANbasic_ostream4Ccn0ALchar_traits4Cc____pkc_2_
[4]  0x00000000 0x00000000  FUNC GLOB  D    0 UNDEF       __1cFgreet6F_pc_
[5]  0x00000000 0x00000000  NOTY GLOB  D    0 UNDEF       __1cDstdEcout_
[6]  0x00000010 0x00000050  FUNC GLOB  D    0 .text       main
[7]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS         __fsr_init_value

%dem __1cFgreet6F_pc_

__1cFgreet6F_pc_ == char*greet()

char*greet() has been mangled to __1cFgreet6F_pc_ by the C++ compiler. That's why the static linker (ld) couldn't match the mangled symbol referenced by mixedcode.o against the plain symbol greet exported by libgreet.so.
Note that a C compiler that complies with the C99 standard may mangle some names. For example, on systems in which linkers cannot accept extended characters, a C compiler may encode the universal character name in forming valid external identifiers.
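Returning to the C++ side, a small sketch may help show why mangling is needed at all. The function name describe is hypothetical; the point is only that two functions share one source-level name, so the compiler has to encode the parameter types into two distinct linker symbols.

```cpp
#include <string>

// Two overloads share the source-level name 'describe'. The linker works
// with flat symbol names, so the compiler must emit a different (mangled)
// symbol for each overload.
std::string describe(int) { return "int"; }
std::string describe(double) { return "double"; }

// A function with C language linkage keeps its plain name in the symbol
// table, which is why at most one function named describe_c may have it.
extern "C" const char *describe_c() { return "C linkage"; }
```

Running elfdump -s (or nm) on the resulting object file should list two differently encoded symbols for describe and an unencoded describe_c; the exact encoding depends on the compiler.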
 
How to solve this problem?
 
The C++ standard provides a mechanism called linkage specification to enable smooth compilation of mixed code. Linkage between C++ and non-C++ code fragments is called language linkage. All function types, function names, and variable names have C++ language linkage by default. A different language linkage can be requested with a linkage specification, which takes one of the following forms:

extern string-literal declaration
extern string-literal { declaration-list }

The string-literal specifies the language linkage associated with the declared functions, for example, C or C++. Every C++ implementation provides for linkage to functions written in the C language ("C") and linkage to C++ ("C++").
The solution to the problem under discussion is to ask the C++ compiler to use C naming conventions (that is, no C++ mangling) for the external functions to be called, so that we can use external C functions from C++ code without any issues. We can accomplish this using the linkage to C. The following declaration of greet() in greet.h should resolve the problem:

extern "C" char *greet();

Because we are calling C code from a C++ program, C linkage is used for the routine greet(). The linkage directive extern "C" tells the compiler to use C naming instead of C++ mangling for the function, and to use C calling conventions when passing external information to the linker. In other words, the C linkage specification forces the C++ compiler to adopt C conventions, which are not the same as C++ conventions.
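The same directive also works in the other direction: a function defined in C++ can be given C linkage so that C code can call it. A minimal sketch, assuming the C library's qsort() as the C caller; the names compare_ints and sort_ints are hypothetical.

```cpp
#include <cstdlib>

// Comparator with C language linkage, so its type matches what the
// C library's qsort() expects to call back.
extern "C" int compare_ints(const void *lhs, const void *rhs) {
    const int a = *static_cast<const int *>(lhs);
    const int b = *static_cast<const int *>(rhs);
    return (a > b) - (a < b);  // avoids the overflow risk of 'a - b'
}

// Sorts 'count' integers in place using the C library routine.
void sort_ints(int *values, std::size_t count) {
    std::qsort(values, count, sizeof(int), compare_ints);
}
```

Calling sort_ints on an array holding 3, 1, 2 leaves it in ascending order.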

So, let's modify the header greet.h, and recompile:

%cat greet.h
#if defined __cplusplus
extern "C" {
#endif

char *greet();

#if defined __cplusplus
}
#endif

%cc -G -o libgreet.so greet.c
%CC -lgreet mixedcode.cpp
%./a.out
Hello!

It works! Since the header greet.h is used in both C and C++ files, it is necessary to guard extern "C" with the C++ compiler's predefined macro __cplusplus. This is because a C compiler doesn't recognize the "C" portion of extern "C" and reports an error for it.
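The guard pattern can be tried out in a single file that compiles unchanged as either C or C++. This is a sketch only; the function name answer is hypothetical, and in a real project the declaration would live in a shared header and the definition in a .c file.

```cpp
/* What a shared header might contain: the extern "C" wrapper is visible
   only when a C++ compiler defines __cplusplus. */
#ifdef __cplusplus
extern "C" {
#endif

int answer(void);

#ifdef __cplusplus
}
#endif

/* The definition keeps the C linkage given by the earlier declaration,
   so it does not need to repeat the linkage specification. */
int answer(void) { return 42; }
```

A C compiler sees only the plain declaration and definition, while a C++ compiler sees the declaration inside an extern "C" block and therefore emits the unmangled symbol answer.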
Let's have a look at the symbol table of mixedcode.o one more time:
 
%CC -c -lgreet mixedcode.cpp
%elfdump -s mixedcode.o

Symbol Table Section:  .symtab
index    value       size     type bind oth ver shndx       name
[0]  0x00000000 0x00000000  NOTY LOCL  D    0 UNDEF
[1]  0x00000000 0x00000000  FILE LOCL  D    0 ABS         mixedcode.cpp
[2]  0x00000000 0x00000000  SECT LOCL  D    0 .rodata
[3]  0x00000000 0x00000000  FUNC GLOB  D    0 UNDEF     
    __1cDstd2l6Frn0ANbasic_ostream4Ccn0ALchar_traits4Cc____pkc_2_
[4]  0x00000000 0x00000000  FUNC GLOB  D    0 UNDEF       greet
[5]  0x00000000 0x00000000  NOTY GLOB  D    0 UNDEF       __1cDstdEcout_
[6]  0x00000010 0x00000050  FUNC GLOB  D    0 .text       main
[7]  0x00000000 0x00000000  NOTY GLOB  D    0 ABS         __fsr_init_value

This time the function name greet was not mangled by the C++ compiler, so the linker could match the reference in mixedcode.o against the symbol greet in libgreet.so and build the executable.
