Binaries and Dependencies¶
Learning Objectives¶
In this section we will …¶
Understand why we build Python packages with native binaries: 1) performance and 2) library integration
Understand different components of the binary build process and their role: headers, libraries, compilers, linkers, build systems, system introspection tools, package managers
Understand basic requirements for binary compatibility: a) C-runtime library compatibility and b) shared library compatibility
Understand scikit-build’s role in coordinating components of the binary build process and conda’s role in resolving dependencies and creating compatible platform binaries
Tutorial¶
Introduction¶
This section discusses the creation of Python packages that contain native binaries.
First, we explain why building Python packages with native binaries is often desirable or necessary for scientific applications.
Next, we provide an overview of the requirements to build native binaries. Within this context, we explain how scikit-build and conda-build make life easier when we want to satisfy these requirements.
Finally, we run an exercise where we build a Python package with native binaries and analyze the different stages of the build process.
Motivation¶
Scientific computing applications demand higher performance than other domains because of the:
Size of the datasets to be analyzed
Complexity of the algorithms evaluated
In order to achieve high performance, programs can:
Minimize the number of CPU operations required to achieve a given task
Execute in parallel to leverage multi-core, many-core, and GPGPU system architectures
Carefully and precisely manage memory allocation and use
Greater performance is achieved with native binaries over CPython because:
Tasks are compiled down to minimal processor operations, as opposed to high level programming language instructions that must be interpreted
Parallel computing is not impaired by CPython’s Global Interpreter Lock (GIL)
Memory can be managed explicitly and deterministically
Many existing scientific codes are written in programming languages other than Python. It is necessary to re-use these libraries since:
Resources are not available to re-implement work that is sometimes the result of multiple decades of effort from multiple researchers.
The scientific endeavor is built on the practice of reproducing and building on the top of the efforts of our predecessors.
The lingua franca of computing is the C programming language because most operating systems themselves are written in C.
As a consequence,
Native binaries reflect the characteristics of, and compatibility with, the C language
The reference implementation of Python, CPython, is implemented in C
CPython supports binary extension modules written in C
Most other pre-compiled programming languages have a compatibility layer with C
Python is an excellent language for integrating scientific codes!
Common programming languages compiled into native libraries for scientific computing include:
Fortran
C
C++
Cython
Rust
Build Components and Requirements¶
Build component categories:
- build tools
Tools used in the build process, such as the compiler, linker, build system, system introspection tool, and package manager
Example compilers:
GCC
Clang
Visual Studio
Compilers translate source code from a human-readable form to a machine-readable form.
Example linkers:
ld
ld.gold
link.exe
Linkers combine the results of compilers into a shared library that is executed at program runtime.
Example build systems:
distutils.build_ext
Unix Makefiles
Ninja
MSBuild in Visual Studio
Build systems coordinate invocation of the compiler and linker, pass the required flags, and ensure that only out-of-date build targets are rebuilt.
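As a sketch of this coordination, a hand-written Unix Makefile rule (assuming the `hello.c` source file from above) rebuilds each target only when its inputs have changed:

```make
# hello.o is rebuilt only when hello.c has changed since the last build.
hello.o: hello.c
	gcc -c -fPIC hello.c -o hello.o

# The shared library is relinked only when hello.o has changed.
libhello.so: hello.o
	gcc -shared hello.o -o libhello.so
```

Tools such as Ninja and MSBuild apply the same out-of-date logic with generated, rather than hand-written, rules.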
Example system introspection tools:
CMake
GNU Autotools
Meson
System introspection tools examine the host system for available build tools, the location of build dependencies, and properties of the build target to generate the appropriate build system configuration files.
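A minimal sketch of such a configuration, using CMake (assuming a `hello.c` source file alongside; the project name is illustrative):

```cmake
cmake_minimum_required(VERSION 3.15)
project(hello LANGUAGES C)

# CMake introspects the host for a working C compiler and toolchain,
# then generates files for a chosen build system (Makefiles, Ninja, ...).
add_library(hello SHARED hello.c)
```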
Example package managers:
conda
pip
apt
yum
chocolatey
homebrew
Package managers resolve dependencies so the required build host artifacts are available for the build.
- build host artifacts
These are files required on the host system performing the build. This includes header files, *.h files, which define the C program symbols, i.e. variable and function names, for the native binary with which we want to integrate. This also usually includes the native binaries themselves, i.e. the executable or shared library. An important exception to this rule is libpython, which we do not need on some platforms due to weak linking rules.
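For CPython extension modules, the most important build host artifact is the Python.h header. One way to locate it, using the standard library’s sysconfig module (a sketch, assuming a Unix-like shell with `python3` on the PATH):

```shell
# Print the directory containing Python.h for the current interpreter.
python3 -c "import sysconfig; print(sysconfig.get_paths()['include'])"
```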
- target system artifacts
These are artifacts intended to be run on the target system, typically the shared library C-extension.
When the build host system is different from the target system, we are cross-compiling.
For example, building a Linux Python package on macOS is cross-compiling. In this case, macOS is the host system and Linux is the target system.
Distributable binaries must use a compatible C-runtime.
The table below lists the usual C runtime implementation, compiler, and their provenance for each operating system.

| | Linux | macOS | Windows |
|---|---|---|---|
| C runtime | glibc (GNU C Library) | libSystem | Microsoft C run-time |
| Compiler | GCC | Clang | Microsoft C/C++ Compiler (cl.exe) |
| Provenance | System package manager | OSX SDK within XCode | Visual Studio |
Linux C-runtime compatibility is determined by the version of glibc used for the build.
The glibc library shared by the system is backward compatible but not forward compatible. That is, a package built on an older system will work on a newer system, while a package built on a newer system will not work on an older system.
The manylinux project provides Docker images that have an older version of glibc to use for distributable Linux packages.
The C-runtime on macOS is determined by a build time option, the macOS deployment target, which defines the minimum version of macOS to support, e.g. 10.9.
A macOS system comes with support for building binaries that run on its version of macOS and older versions.
The XCode toolchain comes with SDKs that support multiple target versions of macOS.
When building a wheel, this can be specified with --plat-name:
python setup.py bdist_wheel --plat-name macosx-10.6-x86_64
The C-runtime used on Windows is associated with the version of Visual Studio.
| CPython Version | x86 (32-bit) | x64 (64-bit) |
|---|---|---|
| 3.5 and above | Visual Studio 14 2015 | Visual Studio 14 2015 Win64 |
| 3.3 to 3.4 | Visual Studio 10 2010 | Visual Studio 10 2010 Win64 |
| 2.7 to 3.2 | Visual Studio 9 2008 | Visual Studio 9 2008 Win64 |
Distributable binaries are also built to be compatible with a certain CPU architecture class, for example:
x86_64 (currently the most common)
x86
ppc64le
Scientific Python Build Tools¶
scikit-build is an improved build system generator for CPython C/C++/Fortran/Cython extensions.
scikit-build provides better support for additional compilers, build systems, cross compilation, and locating dependencies and their associated build requirements.
The scikit-build package is fundamentally just glue between the setuptools Python module and CMake.
To build and install a project configured with scikit-build:
pip install .
To build and install a project configured with scikit-build for development:
pip install -e .
To build and package a project configured with scikit-build:
pip wheel -w dist .
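A minimal sketch of that glue, a setup.py that hands the native build over to CMake (assuming the scikit-build package is installed and a CMakeLists.txt sits next to it; the project name and version are illustrative):

```python
# setup.py: scikit-build's setup() drives CMake instead of
# setuptools' built-in compiler invocation.
from skbuild import setup

setup(
    name="hello",
    version="1.2.3",
    packages=["hello"],
)
```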
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux.
Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer.
Conda was created for Python programs, but it can package and distribute software for any language.
scikit-build and conda abstract away and manage platform-specific details for you!
Exercises¶
Exercise 1: Build a Python Package with a C++ Extension Module¶
Download the hello-cpp example C++ project and build a wheel package with the commands:
cd hello-cpp
pip wheel -w dist --verbose .
Examine files referenced in the build output. What is the purpose of all referenced files?
Exercise 2: Build a Python Package with a Cython Extension Module¶
Download the hello-cython example Cython project and build a wheel package with the commands:
cd hello-cython
pip wheel -w dist --verbose .
Examine files referenced in the build output. What is the purpose of all referenced files?
Bonus Exercise 3: Build a Distributable Linux Wheel Package¶
If Docker is installed, create a dockcross manylinux bash driver script. From a bash shell, run:
# cd into the hello-cpp project from Exercise 1
cd hello-cpp
docker run --rm dockcross/manylinux-x64 > ./dockcross-manylinux-x64
chmod +x ./dockcross-manylinux-x64
The dockcross driver script simplifies execution of commands in the isolated Docker build environment that use sources in the current working directory.
To build a distributable Python 3.6 Python wheel, run:
./dockcross-manylinux-x64 /opt/python/cp36-cp36m/bin/pip wheel -w dist .
which will output:
Processing /work
Building wheels for collected packages: hello-cpp
Running setup.py bdist_wheel for hello-cpp ... done
Stored in directory: /work/dist
Successfully built hello-cpp
and produce the wheel:
./dist/hello_cpp-1.2.3-cp36-cp36m-linux_x86_64.whl
To find the version of glibc required by the extension, run:
./dockcross-manylinux-x64 bash -c 'cd dist && unzip -o hello_cpp-1.2.3-cp36-cp36m-linux_x86_64.whl && objdump -T hello/_hello.cpython-36m-x86_64-linux-gnu.so | grep GLIBC'
What glibc version compatibility is required for this binary?
manylinux: https://github.com/pypa/manylinux
Bonus Exercise 4: Setting up continuous integration¶
See the master-with-ci branch of the hello-cpp example:
Use scikit-ci for simpler and centralized CI configuration for Python extensions.
Use scikit-ci-addons, a set of scripts useful to help drive CI.
On CircleCI, use manylinux dockcross images including scikit-build, cmake and ninja packages.