VecTcl

Performance considerations

Shimmering performance

Due to the way the grammar is defined, a value to be interpreted as a NumArray must be parsed recursively down to the level of individual elements at the moment of shimmering, i.e. in the setFromAnyProc. At this moment, it must accept every possible number of dimensions and any data type. Since numerical objects can be very large, the setFromAnyProc tries to perform the conversion without triggering a string representation. This is done by looking at the typePtr field of the Tcl_Obj representing the input value before attempting to fetch the value using Tcl_GetFooFromObj(). Only if the typePtr indicates none of the integer, double, complex, or list types is the string representation examined. A plain sequence of Tcl_GetFooFromObj() calls is insufficient to prevent the string conversion, which might even be triggered multiple times, especially when non-core numeric data types like complex values or NumArray slices are involved. Therefore, the efficiency of converting a pure list of doubles into a NumArray depends on the fact that the core data types int, double, and list are registered using Tcl_RegisterObjType(). An alternative interface satisfying the needs of NumArray's setFromAnyProc would be a Tcl_MaybeGetFooFromObj() function which fails if the Tcl_Obj in question cannot be converted to the requested value type without triggering a string conversion.
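To illustrate the idea, a minimal sketch of such a guarded conversion using only the public Tcl C API might look as follows; the helper name follows the proposed Tcl_MaybeGetFooFromObj() naming, and the set of accepted types is illustrative:

#include <tcl.h>

/* Sketch: convert objPtr to a double only if this cannot trigger a
 * string conversion. The set of accepted types is illustrative; it
 * relies on "int" and "double" being registered via
 * Tcl_RegisterObjType(), so that Tcl_GetObjType() can find them. */
static int
MaybeGetDoubleFromObj(Tcl_Interp *interp, Tcl_Obj *objPtr, double *dblPtr)
{
    static const Tcl_ObjType *intType = NULL, *doubleType = NULL;

    if (intType == NULL) {
        intType = Tcl_GetObjType("int");
        doubleType = Tcl_GetObjType("double");
    }
    if (objPtr->typePtr != NULL
            && (objPtr->typePtr == doubleType || objPtr->typePtr == intType)) {
        /* safe: the value already has a numeric internal representation */
        return Tcl_GetDoubleFromObj(interp, objPtr, dblPtr);
    }
    return TCL_ERROR;  /* refuse rather than shimmer via the string rep */
}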

Just as important as the conversion of Tcl's native data types into NumArray is the reverse direction. At some point, the data must leave the internal representation of NumArray. The only official way to do this is the updateStringProc. But the string representation is not always the final step in the conversion. If the NumArray is not scalar, then in most cases the code using the result expects a list, and the generated string representation is merely used to parse the values back into a list. For instance, consider a program which wishes to draw a line on the canvas computed by VecTcl.

# compute a sine wave using VecTcl
vexpr { 
	x=linspace(0,10,1000)
	y=sin(x)*50
	coords = hstack(x*10+100, y+100)
	coords = reshape(coords, 2000)
}
# display on canvas
.c create line $coords

As long as the canvas is not rewritten to understand NumArrays directly, coords shimmers via its string representation to a list of doubles. VecTcl contains an experimental feature to enable direct conversion from the internal representation, enabled by --enable-listpatch during compilation. It requires a small patch to the Tcl core to work. The patch replaces calls to setListFromAny in the core (a single file, tclListObj.c) with equivalent calls to Tcl_ConvertToType() and removes the const qualifier from the list Tcl_ObjType. Thus VecTcl can patch the list Tcl_ObjType and wrap its setFromAny proc, such that NumArrays are checked first. The conversion code constructs a list of scalar values from one-dimensional NumArrays. Higher-dimensional NumArrays are returned as a list of slices into the original buffer. Thus, a lindex into a 1000×1000 matrix shimmers the object into a list of 1000 slices, and the accessed row into a list of 1000 doubles. Without the patch, in contrast, a million doubles are printed into ASCII.

A cleaner way to achieve this goal would be for the core to provide a method to register an optimized myObjTypeToList() function or, even more generally, a way to register an A-to-B transform. The core itself could profit from such an infrastructure, too, since, e.g., the dict and list types can be converted into each other circumventing the string representation, which is currently done by making the list and dict setFromAny procs know of each other.
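With the patched core, the wrapping could look roughly like the following sketch; numArrayType and NumArrayToListObj() are placeholder names for VecTcl internals, not the actual identifiers:

#include <tcl.h>

/* Sketch of the experimental list patch at work: wrap the list type's
 * setFromAnyProc so that NumArrays are recognized before the generic
 * string-based conversion runs. */
static Tcl_ObjType *listType;               /* writable after the patch */
static Tcl_SetFromAnyProc *origSetFromAny;  /* saved original proc */

extern const Tcl_ObjType *numArrayType;     /* assumed: VecTcl's NumArray type */
extern int NumArrayToListObj(Tcl_Interp *interp, Tcl_Obj *objPtr);
                                            /* assumed: internal conversion */

static int
ListSetFromAnyWrapper(Tcl_Interp *interp, Tcl_Obj *objPtr)
{
    if (objPtr->typePtr == numArrayType) {
        /* build the list directly from the internal representation */
        return NumArrayToListObj(interp, objPtr);
    }
    return origSetFromAny(interp, objPtr);  /* fall back to string parsing */
}

void
InstallListPatch(void)
{
    /* the cast is legitimate only with the core patch applied, which
     * removes the const qualifier from the list Tcl_ObjType */
    listType = (Tcl_ObjType *) Tcl_GetObjType("list");
    origSetFromAny = listType->setFromAnyProc;
    listType->setFromAnyProc = ListSetFromAnyWrapper;
}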

Execution speed of vexpr: Compilation

Compilation speed of vexpr in its current implementation is satisfactory. Simple statements compile in less than a millisecond, and relatively long expressions or programs, such as the integration loop in José Ignacio Marín's example, take a few milliseconds to compile:

vproc shipTrajectory {} {
    numSteps = 13000
    x = zeros(numSteps + 1, 2); # m
    v = zeros(numSteps + 1, 2); # m/s

    x[0, 0] = 15e6
    x[0, 1] = 1e6
    v[0, 0] = 2e3
    v[0, 1] = 4e3

    for i=0:numSteps-1 {
        a = acceleration(x[i,:])
        v[i+1,0] = v[i,0]+::h*a[0]
        v[i+1,1] = v[i,1]+::h*a[1]
        x[i+1,0] = x[i,0]+::h*v[i,0]
        x[i+1,1] = x[i,1]+::h*v[i,1]
    }
    list(x,v)
}

Execution speed of a sequence of commands

On the other hand, the execution speed of this code is very slow compared to the clock speed of the computer. On the same machine on which the standard benchmarks are performed, it takes 600 milliseconds to integrate the 13,000 steps. A single iteration of the loop therefore takes on the order of 100,000 clock cycles. The code hidden in acceleration needs a few multiplications, a division, and a square root, but this still doesn't justify 100,000 clock cycles on a modern machine. When the updates in the inner loop are rewritten in vector form, the execution speed nearly doubles. This code

v[i+1,:] = v[i,:]+::h*a
x[i+1,:] = x[i,:]+::h*v[i,:]

takes 350 milliseconds to complete the 13,000 steps. The main reason for this difference is that the time for actually performing the computation is insignificant. Calling a C-coded command which doesn't do anything from a for loop takes about 200 ns per iteration, i.e. almost 500 clock cycles on this machine. The second variant of the code simply runs faster because it eliminates half of the command calls. To make this kind of code run fast, the speed of the Tcl bytecode engine will have to be improved, or an alternative backend for the VecTcl compiler must be sought. Circumventing the Tcl bytecode compiler and directly assembling Tcl bytecode could solve certain issues, for instance the stringification of literals, which currently restricts constant folding in order not to generate Tcl code consisting mostly of huge literal strings. This is unlikely to improve performance by a large margin, though, since scalar values and small literal lists are already constant-folded.
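The dispatch overhead is easy to reproduce: a C-coded command that does nothing, timed from Tcl with time {nop} 1000000, measures pure dispatch cost. A minimal sketch (the command and package names are illustrative):

#include <tcl.h>

/* A command that performs no work; timing it from Tcl isolates the
 * cost of command dispatch itself. */
static int
NopObjCmd(ClientData clientData, Tcl_Interp *interp,
          int objc, Tcl_Obj *const objv[])
{
    return TCL_OK;
}

int
Nop_Init(Tcl_Interp *interp)
{
    if (Tcl_InitStubs(interp, "8.6", 0) == NULL) {
        return TCL_ERROR;
    }
    Tcl_CreateObjCommand(interp, "nop", NopObjCmd, NULL, NULL);
    return Tcl_PkgProvide(interp, "nop", "1.0");
}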

One possible alternative implementation could generate code in C, which still calls the same underlying C commands, but does not dispatch through the Tcl interpreter. Such an implementation is within reach with the advent of the tcc4tcl extension, which can compile C code very quickly into memory and execute it from there. Using the critcl extension to compile with an external compiler to disk and loading the code from there would lead to much better optimized code; however, for the purposes of VecTcl this seems impractical, because it requires the heavy-weight installation of an external compiler with all its dependencies. The compilation would also be orders of magnitude slower and would pay off only under certain circumstances, namely when the source code is transformed into C-level loops by deeper analysis, instead of into a sequence of command dispatches.
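One way such generated C code could avoid interpreter dispatch is to look up the implementation of a command once via Tcl_GetCommandInfo() and call its objProc directly. The following sketch assumes the elementwise addition is reachable as a command named numarray::+ (an assumption for illustration):

#include <tcl.h>

/* Fetch the objProc of a command once and invoke it directly, skipping
 * name resolution and the bytecode engine. The command name
 * "numarray::+" is an assumption for illustration. */
static int
CallDirect(Tcl_Interp *interp, Tcl_Obj *a, Tcl_Obj *b)
{
    Tcl_CmdInfo info;
    Tcl_Obj *objv[3];
    int result;

    if (!Tcl_GetCommandInfo(interp, "numarray::+", &info)
            || !info.isNativeObjectProc) {
        return TCL_ERROR;
    }
    objv[0] = Tcl_NewStringObj("numarray::+", -1);
    objv[1] = a;
    objv[2] = b;
    Tcl_IncrRefCount(objv[0]);
    /* direct call into the command implementation, no dispatch */
    result = info.objProc(info.objClientData, interp, 3, objv);
    Tcl_DecrRefCount(objv[0]);
    return result;  /* on TCL_OK, the sum is in the interp result */
}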

Execution speed of a single NumArray vector command

The execution performance of a VecTcl expression can also be bound by the performance of the underlying NumArray operations. This is the case when only a few instructions are performed which operate on a large number of data points simultaneously, such as a long vector or a huge matrix. In fact, programmers who use languages similar to VecTcl, for instance Matlab or NumPy, often seek to write their code in this style, usually termed vectorization, because it generally yields the fastest programs. Most elemental operations of NumArray already reach the maximum achievable performance, as evidenced by the memory benchmark, where an elemental operation such as adding two vectors is measured against the speed of memcpy. memcpy is a very tough competitor, since it doesn't compute anything while copying the values and is heavily optimized by the compiler. The benchmark shows that most of the pointwise operations come close to the memory bandwidth for vectors, but performance depends on the shape of the matrix.

A possible improvement on this end would be the use of highly optimized basic linear algebra subroutines (BLAS), such as those provided by ATLAS or the Intel MKL, for important special cases. This could be done as a compile-time option, so that the basic package is not unconditionally dependent on these external libraries. Such a library would also boost performance in more complex operations like matrix multiplication and equation solving, and is necessary to be competitive with other numerical packages in this regard. Due to the existence of a standard unified interface, the best available BLAS could be selected at compile time with little effort in the VecTcl codebase.

Another simple enhancement, generally applicable to all internal loops, is the use of OpenMP. OpenMP annotates C code with pragmas which supporting compilers turn into parallelized code. It is widely supported and provides parallel execution with very little effort. It should be noted that these optimizations only come into play beyond a certain operand length; from the memory benchmark it is evident that below a size of 10,000 bytes, or roughly 1,000 floating point values in double precision, the command dispatch takes more time than the actual computation.
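As a sketch of the OpenMP approach, a pointwise addition loop could be annotated like this; the function name is illustrative and not the actual VecTcl internal (compile with, e.g., gcc -fopenmp):

#include <stddef.h>

/* Pointwise addition of two double vectors. The pragma lets a
 * supporting compiler distribute the iterations over all cores;
 * without OpenMP it is silently ignored and the loop runs serially. */
void
VecAddDouble(double *result, const double *a, const double *b, size_t n)
{
    ptrdiff_t i;
    #pragma omp parallel for
    for (i = 0; i < (ptrdiff_t)n; i++) {
        result[i] = a[i] + b[i];
    }
}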

Execution speed of a single complex expression

The third type of performance impact on the speed of a compiled expression can be seen as a mixture of the first two; it arises when a vectorized expression with large vectors consists of many terms. Consider an expression such as

vexpr {r = a.*a+b.*b }

Each of the subexpressions is evaluated in sequence; the expression is evaluated as if we had written

vexpr {
	temp1 = a.*a
	temp2 = b.*b
	r = temp1 + temp2
}

This impacts performance in two ways. First, it performs three passes in total over the input data. Second, if the vectors are sufficiently long, temp1 and temp2 will not remain in the CPU cache and are spilled to memory, which leads to high latency when they are added together and, in consequence, to many wasted clock cycles in which the CPU just waits for the data from main memory. In contrast, consider the same loop written in a compiled language:

for (int i=0; i<N; i++) {
	r[i] = a[i]*a[i] + b[i]*b[i];
}

This code makes only one pass over the data. It can load the operands a[i] and b[i] in parallel, does not cause cache misses, and holds all intermediate values in registers. Therefore, it should execute at least 2-3 times faster than the version above. This speed bottleneck is genuinely hard to overcome; even compiled languages have suffered from this problem. In fact, C++ was considered "too slow for numerical computation" until expression templates were invented, a tricky template technique whose main purpose is to move the elementwise expression into the inner loop.

One way to tackle this problem could be blocking, that is, breaking the vector expression into smaller chunks such that the temporaries still fit into the cache, and executing highly optimized fixed-size code pieces for each chunk. This is the approach taken by, e.g., the Python numexpr extension. Good blocking is also the backbone of fast higher-level algorithms inside LAPACK and fast BLAS libraries. Another approach could again be code generation, producing native machine code using, e.g., tcc4tcl, which executes the loop in one pass. However, this is not likely to be implemented soon, as it requires deeper analysis and type/dimensionality inference to work. In addition, up to a certain level of expression complexity performance may actually suffer, due to the weaker optimization of tcc in comparison to a native compiler.
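A minimal sketch of blocked evaluation for the example r = a.*a + b.*b, assuming plain double vectors; the chunk size is illustrative and would be tuned to the actual cache size:

#include <stddef.h>

#define CHUNK 1024  /* 8 KiB of doubles per temporary; tune to cache size */

void
EvalBlocked(double *r, const double *a, const double *b, size_t n)
{
    double t1[CHUNK], t2[CHUNK];

    for (size_t start = 0; start < n; start += CHUNK) {
        size_t len = (n - start < CHUNK) ? n - start : CHUNK;
        /* each fixed-size pass runs over cache-resident temporaries */
        for (size_t i = 0; i < len; i++) t1[i] = a[start+i] * a[start+i];
        for (size_t i = 0; i < len; i++) t2[i] = b[start+i] * b[start+i];
        for (size_t i = 0; i < len; i++) r[start+i] = t1[i] + t2[i];
    }
}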

The performance improvements suggested in this section all share the nice property that they can be implemented within the vexpr command without the client code ever noticing that the compilation is directed via C into native machine code, or into bytecode for a blocking expression evaluator. This is a consequence of the fact that vexpr receives the expression as a string, due to Tcl's syntax rules, in contrast to, e.g., similar solutions in Python, where the expression must be explicitly supplied as a string or decorated. However, the proposed performance improvements still remain to be implemented.
