Threaded code

In computer science, the term threaded code refers to a compiler implementation technique in which the generated code consists almost entirely of calls to subroutines. The code may be processed by an interpreter, or it may simply be a sequence of machine-code call instructions.

Threaded code has better code density than code generated by alternative code-generation techniques and calling conventions, at the expense of slightly slower execution speed (the cost is usually one extra machine instruction per dispatch). However, a program small enough to fit fully in a computer processor's cache may run faster than a larger program that suffers many cache misses.[1]

Threaded code is best known as the implementation technique used in the programming language Forth; it has also been used in many implementations of BASIC, some implementations of COBOL, early versions of B, and other languages for small minicomputers and for amateur radio satellites.

Preparatory history

The common way to make computer programs is to 'translate' a program written in some symbolic language into machine code using a compiler. The resulting code is typically fast but nonportable, since the binary code is designed for a specific hardware platform. A different approach uses a virtual machine instruction set that has no particular target hardware; an interpreter executes it on each new target machine.

Early computers had relatively little memory. For example, most Data General Nova, IBM 1130, and many Apple II computers had only 4 K words of RAM installed. Consequently a lot of time was spent trying to find ways to reduce the size of programs so they would fit in the memory available. At the same time, computers were relatively slow, so simple interpretation was very noticeably slower than executing machine code.

Instead of writing out every step of an operation in every part of the program where it was needed, programmers saved memory by writing each step of such operations once (see "Don't repeat yourself") and placing it in a subroutine.

This process — code refactoring — is used today, although for different reasons. The top-level application in these programs may consist of nothing but subroutine calls. Many of these subroutines, in turn, also consist of nothing but lower level subroutine calls.

Mainframes and some early microprocessors such as the RCA 1802 required several instructions to call a subroutine. In the top-level application and in many subroutines, that sequence is constantly repeated, only the subroutine address changing from one call to the next. Using memory to store the same instructions repeatedly is wasteful.

To save space, programmers squeezed that series of subroutine calls into a contiguous list containing only the addresses of the subroutines, and used a tiny "interpreter" to call each subroutine in turn.

This is identical to the way other programmers squeezed a series of jumps in a branch table, dispatch table, or virtual method table into a list containing only the destination addresses, and used a small selector to branch to the selected destination. In threaded code and these other techniques, the program becomes a list of entry points to the actual code to be executed.

Over the years, programmers have created many variations on that "interpreter" or "small selector". The particular address in the list of addresses may be extracted using an index, general purpose register or pointer. The addresses may be direct or indirect, contiguous or non-contiguous (linked by pointers), relative or absolute, resolved at compile time or dynamically built. No one variation is "best".

Development

To save space, programmers squeezed the lists of subroutine calls into simple lists of subroutine addresses, and used a small loop to call each subroutine in turn. For example:

start:
  ip = &thread     // ip points to the first address in the thread (&pushA)
top:
  jump *ip++       // jump to the routine whose address ip points at, then advance ip
thread:
  &pushA
  &pushB
  &add
  ...
pushA:
  *sp++ = A        // push A onto the operand stack
  jump top         // return to the dispatch loop
pushB:
  *sp++ = B        // push B onto the operand stack
  jump top
add:
  *sp++ = *--sp + *--sp   // pop two operands and push their sum
  jump top

In this case, decoding the bytecodes is performed once, during program compilation or program load, so it is not repeated each time an instruction is executed. This can save much time and space when decode and dispatch overhead is large compared to the execution cost.

Note, however, that the addresses in the thread for &pushA, &pushB, etc., are two or more bytes each, compared to the one byte typically used by the decode-and-dispatch interpreter described under token threading below. In general, instructions for a decode-and-dispatch interpreter may be any size. For example, a decode-and-dispatch interpreter that simulates an Intel Pentium decodes instructions that range from 1 to 16 bytes. However, bytecoded systems typically choose 1-byte codes for the most common operations. Thus, the thread often has a higher space cost than bytecodes. In most uses, the reduction in decode cost outweighs the increase in space cost.

Note also that while bytecodes are nominally machine-independent, the format and value of the pointers in threads generally depend on the target machine which is executing the interpreter. Thus, an interpreter might load a portable bytecode program, decode the bytecodes to generate platform-dependent threaded code, then execute threaded code without further reference to the bytecodes.
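As a rough sketch of this load-time translation, the following C program (all handler and table names are invented for illustration) decodes a small bytecode array into a thread of handler addresses exactly once, then executes the thread by following the addresses with no further decoding:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: decode portable bytecodes into a thread of handler
   addresses once, at "load time", then execute the thread with no further
   reference to the bytecodes. */

typedef void (*handler)(void);

static int stack[32], *sp = stack;    /* operand stack */
static int A = 2, B = 3;
static int running = 1;

static void pushA(void) { *sp++ = A; }
static void pushB(void) { *sp++ = B; }
static void add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }
static void halt(void)  { running = 0; }

/* decode table: bytecode value -> handler address */
static const handler table[] = { pushA, pushB, add, halt };

int main(void) {
    const unsigned char bytecode[] = { 0, 1, 2, 3 };   /* pushA pushB add halt */
    handler thread[sizeof bytecode];

    /* decoding happens exactly once */
    for (size_t i = 0; i < sizeof bytecode; i++)
        thread[i] = table[bytecode[i]];

    /* execution: just follow the addresses */
    for (handler *ip = thread; running; )
        (*ip++)();

    printf("%d\n", stack[0]);                          /* prints 5 */
    return 0;
}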

The dispatch loop above is simple, so it can be duplicated in each handler, removing jump top from the list of machine instructions needed to execute each interpreter instruction. For example:

start:
  ip = &thread
  jump *ip++
thread:
  &pushA
  &pushB
  &add
  ...
pushA:
  *sp++ = A
  jump *ip++
pushB:
  *sp++ = B
  jump *ip++
add:
  *sp++ = *--sp + *--sp
  jump *ip++

This is called direct threaded code (DTC). Although the technique is older, the first widely circulated use of the term "threaded code" is probably Bell's article "Threaded Code" from 1973.[2]

Charles H. Moore invented a more compact notation in 1970 for his Forth virtual machine: indirect threaded code (ITC). Originally, Moore invented this because it was easy and fast on Nova minicomputers, which have an indirection bit in every address. He said (in published remarks in Byte magazine's Forth issue) that he found it so convenient that he propagated it into all his later Forth designs.

Some Forth compilers compile Forth programs into direct-threaded code, while others make indirect-threaded code. The programs act the same either way.

Threading models

Practically all executable threaded code uses one or another of these methods for invoking subroutines (each method is called a "threading model").

Direct threading

Addresses in the thread are the addresses of machine-language routines. This form is simple, but may have overheads because the thread consists only of machine addresses, so all further parameters must be loaded indirectly from memory. Some Forth systems produce direct-threaded code. On many machines direct threading is faster than subroutine threading (see the benchmark results discussed under subroutine threading below).

As an example, a stack machine might execute the sequence "push A, push B, add". That might be translated to the following thread and routines, where ip is initialized to the address &thread.

thread:      pushA: *sp++ = A          pushB: *sp++ = B         add:  *sp++ = *--sp + *--sp
  &pushA            jump *ip++                jump *ip++              jump *ip++
  &pushB
  &add
  ...

Alternatively, operands may be included in the thread. This can remove some indirection needed above, but makes the thread larger:

thread:      push: *sp++ = *ip++      add: *sp++ = *--sp + *--sp
  &push            jump *ip++              jump *ip++
  &A
  &push
  &B
  &add
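
A minimal sketch of this operands-in-thread form in C (using GCC's labels-as-values extension, a common but non-standard way to write threaded code in C; all names are invented for illustration) keeps handler addresses and operand addresses interleaved in one array, and ends every handler with an indirect jump through the thread pointer:

#include <stdio.h>

/* Direct-threaded sketch: the thread interleaves handler addresses with
   their operands, as in the example above.  NEXT plays the role of
   "jump *ip++" at the end of each handler. */

#define NEXT goto **ip++

int main(void) {
    static int A = 2, B = 3;
    int stack[32], *sp = stack;

    /* the thread: handler address, operand, handler address, operand, ... */
    void *thread[] = {
        &&push, &A,
        &&push, &B,
        &&add,
        &&halt
    };
    void **ip = thread;

    NEXT;                               /* start interpreting */

push:
    *sp++ = *(int *)*ip++;              /* the operand follows the handler address */
    NEXT;
add:
    { int b = *--sp, a = *--sp; *sp++ = a + b; }
    NEXT;
halt:
    printf("%d\n", stack[0]);           /* prints 5 */
    return 0;
}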

Indirect threading

Indirect threading uses pointers to locations that in turn point to machine code. The indirect pointer may be followed by operands which are stored in the indirect "block" rather than being stored repeatedly in the thread. Thus, indirect code is often more compact than direct-threaded code, but the indirection also typically makes it slower, though still usually faster than bytecode interpreters. Where the handler operands include both values and types, the space savings over direct-threaded code may be significant. Older Forth systems typically produce indirect-threaded code.

As an example, if the goal is to execute "push A, push B, add", the following might be used. Here, ip is initialized to the address &thread, each code fragment (push, add) is found by double-indirecting through ip, and operands to each code fragment are found in the first-level indirection following the address of the fragment.

thread:      i_pushA:   push:                   add:
  &i_pushA     &push      *sp++ = *(*ip + 1)      *sp++ = *--sp + *--sp
  &i_pushB     &A         jump *(*ip++)           jump *(*ip++)
  &i_add     i_pushB:
               &push
               &B
             i_add:
               &add
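
A corresponding C sketch of indirect threading (again relying on GCC's labels-as-values extension, with invented names) keeps the address of the current indirect block in a separate pointer w, so each handler can find its operand stored next to the handler address in the block:

#include <stdio.h>

/* Indirect-threaded sketch: the thread holds pointers to small "indirect
   blocks"; the first cell of each block holds the handler address and any
   operands follow it, so an operand is stored once per block rather than
   once per use in the thread. */

#define NEXT  do { w = *ip++; goto **w; } while (0)

int main(void) {
    static int A = 2, B = 3;
    int stack[32], *sp = stack;

    /* indirect blocks: { handler address, operand... } */
    static void *i_pushA[2], *i_pushB[2], *i_add[1], *i_halt[1];

    /* the thread: one pointer per instruction, each pointing at a block */
    void **thread[] = { i_pushA, i_pushB, i_add, i_halt };
    void ***ip = thread;
    void **w;

    i_pushA[0] = &&push;  i_pushA[1] = &A;
    i_pushB[0] = &&push;  i_pushB[1] = &B;
    i_add[0]   = &&add;
    i_halt[0]  = &&halt;

    NEXT;

push:
    *sp++ = *(int *)w[1];       /* operand lives in the block, next to the handler address */
    NEXT;
add:
    { int b = *--sp, a = *--sp; *sp++ = a + b; }
    NEXT;
halt:
    printf("%d\n", stack[0]);   /* prints 5 */
    return 0;
}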

Subroutine threading

So-called "subroutine-threaded code" (also "call-threaded code") consists of a series of machine-language "call" instructions (or addresses of functions to "call", as opposed to direct threading's use of "jump"). Early compilers for ALGOL, Fortran, Cobol and some Forth systems often produced subroutine-threaded code. The code in many of these systems operated on a last-in-first-out (LIFO) stack of operands, which had well-developed compiler theory. Most modern processors have special hardware support for subroutine "call" and "return" instructions, so the overhead of one extra machine instruction per dispatch is somewhat diminished. Anton Ertl has stated "that, in contrast to popular myths, subroutine threading is usually slower than direct threading."[3] However, Ertl's most recent tests[4] show that subroutine threading is faster than direct threading in 15 out of 25 test cases. Ertl's most recent tests show that direct threading is the fastest threading model on Xeon, Opteron, and Athlon processors; indirect threading is the fastest threading model on Pentium M processors; and subroutine threading is the fastest threading model on Pentium 4, Pentium III, and PPC processors.

As an example, call threading for "push A, push B, add" looks like this:

thread:           pushA:            pushB:         add:
  call pushA        *sp++ = A         *sp++ = B      *sp++ = *--sp + *--sp
  call pushB        ret               ret            ret
  call add
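
In portable C, the closest approximation of this is a list of function addresses invoked in order by a tiny loop; a genuine subroutine-threading compiler would instead emit native call instructions, letting the hardware call/return mechanism do the sequencing with no loop at all. A sketch, with invented names:

#include <stddef.h>
#include <stdio.h>

/* Call-threading approximation: the "thread" is a list of routine
   addresses, and a small loop performs one call per entry. */

static int stack[32], *sp = stack;
static int A = 2, B = 3;

static void pushA(void) { *sp++ = A; }
static void pushB(void) { *sp++ = B; }
static void add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }

typedef void (*word)(void);

int main(void) {
    /* addresses of the routines to call, in order */
    const word thread[] = { pushA, pushB, add };

    for (size_t i = 0; i < sizeof thread / sizeof thread[0]; i++)
        thread[i]();                    /* one "call" per thread entry */

    printf("%d\n", stack[0]);           /* prints 5 */
    return 0;
}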

Token threading

Token threaded code uses lists of 8- or 12-bit indexes into a table of pointers. Token threaded code is notably compact, without much special effort by a programmer. It is usually half to three-fourths the size of other threaded code, which is itself a quarter to an eighth the size of compiled code. The table's pointers can be either indirect or direct. Some Forth compilers produce token-threaded code. Some programmers consider the "p-code" generated by some Pascal compilers, as well as the bytecodes used by .NET, Java, BASIC and some C compilers, to be token threading.

Historically, a common approach is bytecode, which uses 8-bit opcodes and, often, a stack-based virtual machine. A typical interpreter of this kind is known as a "decode and dispatch interpreter", and follows the form:

bytecode:         top:                   pushA:         pushB:          add:
  0 /*pushA*/       i = decode(vpc++)      *sp++ = A      *sp++ = B       *sp++ = *--sp + *--sp
  1 /*pushB*/       addr = table[i]        jump top       jump top        jump top
  2 /*add*/         jump *addr

If the virtual machine uses only byte-size instructions, decode() is simply a fetch from bytecode, but often there are commonly used 1-byte instructions plus some less-common multibyte instructions, in which case decode() is more complex. The decoding of single byte opcodes can be very simply and efficiently handled by a branch table using the opcode directly as an index.
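
A decode-and-dispatch (token-threaded) interpreter along these lines might look like the following C sketch, in which the one-byte token itself indexes the branch table on every dispatch (handler names and opcode numbering are invented for illustration):

#include <stdio.h>

/* Token-threaded / decode-and-dispatch sketch: the program stays as
   one-byte tokens, and every dispatch indexes a table of handler
   addresses with the token.  Contrast this with decoding the tokens
   into a thread of addresses once, at load time. */

enum { OP_PUSHA, OP_PUSHB, OP_ADD, OP_HALT };

static int stack[32], *sp = stack;
static int A = 2, B = 3;
static int running = 1;

static void do_pushA(void) { *sp++ = A; }
static void do_pushB(void) { *sp++ = B; }
static void do_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }
static void do_halt(void)  { running = 0; }

/* branch table: the opcode itself is the index */
static void (*const table[])(void) = { do_pushA, do_pushB, do_add, do_halt };

int main(void) {
    const unsigned char bytecode[] = { OP_PUSHA, OP_PUSHB, OP_ADD, OP_HALT };
    const unsigned char *vpc = bytecode;     /* virtual program counter */

    while (running)
        table[*vpc++]();                     /* decode (index) and dispatch on every step */

    printf("%d\n", stack[0]);                /* prints 5 */
    return 0;
}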

For instructions where the individual operations are simple, such as "push" and "add", the overhead involved in deciding what to execute is larger than the cost of actually executing it, so such interpreters are often much slower than machine code. However for more complex ("compound") instructions, the overhead percentage is proportionally less significant.

Huffman threading

Huffman threaded code consists of lists of Huffman codes. A Huffman code is a variable-length bit string used to identify a unique item. A Huffman-threaded interpreter locates subroutines using an index table or a tree of pointers that can be navigated by the Huffman code. Huffman threaded code is one of the most compact representations known for a computer program. The table and the codes are organized by measuring the frequency with which each subroutine occurs in the code: frequent calls are given the shortest codes, and operations with approximately equal frequencies are given codes with nearly equal bit-lengths. Most Huffman-threaded systems have been implemented as direct-threaded Forth systems, and used to pack large amounts of slow-running code into small, cheap microcontrollers. Most published uses have been in toys, calculators, and watches.
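
A minimal sketch of the idea in C, assuming an invented code assignment (add = 0, pushA = 10, pushB = 110, halt = 111; a real system would derive the codes from measured frequencies) and a hand-built decode tree, might look like this:

#include <stdio.h>

/* Huffman-threaded sketch: each operation is identified by a variable-
   length bit code, and the interpreter walks a small binary tree of
   pointers, one bit at a time, until it reaches a handler. */

static int stack[32], *sp = stack;
static int A = 2, B = 3;
static int running = 1;

static void do_pushA(void) { *sp++ = A; }
static void do_pushB(void) { *sp++ = B; }
static void do_add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }
static void do_halt(void)  { running = 0; }

/* decode tree: interior nodes have two children, leaves name a handler */
struct node { const struct node *child[2]; void (*handler)(void); };

static const struct node leaf_add   = { {0, 0}, do_add   };
static const struct node leaf_pushA = { {0, 0}, do_pushA };
static const struct node leaf_pushB = { {0, 0}, do_pushB };
static const struct node leaf_halt  = { {0, 0}, do_halt  };
static const struct node n2   = { { &leaf_pushB, &leaf_halt }, 0 };
static const struct node n1   = { { &leaf_pushA, &n2 },        0 };
static const struct node root = { { &leaf_add,   &n1 },        0 };

int main(void) {
    /* "pushA pushB add halt" = 10 110 0 111, packed MSB-first */
    const unsigned char thread[] = { 0xB3, 0x80 };
    unsigned bitpos = 0;

    while (running) {
        const struct node *n = &root;
        while (n->handler == 0) {                  /* walk one bit at a time */
            unsigned bit = (thread[bitpos >> 3] >> (7 - (bitpos & 7))) & 1;
            bitpos++;
            n = n->child[bit];
        }
        n->handler();                              /* reached a leaf: run it */
    }
    printf("%d\n", stack[0]);                      /* prints 5 */
    return 0;
}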

Lesser-used threading

  • String threading, where operations are identified by strings, usually looked up in a hash table. This was used in Charles H. Moore's earliest Forth implementations and in the University of Illinois's experimental hardware-interpreted computer language. It is also used in Bashforth. (A minimal sketch appears below.)
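
A minimal string-threading sketch in C, with invented names and a linear search standing in for the hash table:

#include <stdio.h>
#include <string.h>

/* String-threaded sketch: each step of the thread is the name of an
   operation, looked up in a table to find the routine to run. */

static int stack[32], *sp = stack;
static int A = 2, B = 3;

static void pushA(void) { *sp++ = A; }
static void pushB(void) { *sp++ = B; }
static void add(void)   { int b = *--sp, a = *--sp; *sp++ = a + b; }

static const struct { const char *name; void (*fn)(void); } dict[] = {
    { "pushA", pushA }, { "pushB", pushB }, { "add", add },
};

static void run(const char *const *thread) {
    for (; *thread != NULL; thread++)              /* one name per step */
        for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i].name, *thread) == 0) { dict[i].fn(); break; }
}

int main(void) {
    const char *const thread[] = { "pushA", "pushB", "add", NULL };
    run(thread);
    printf("%d\n", stack[0]);                      /* prints 5 */
    return 0;
}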

Branches

The examples above show no branches. For all interpreters, a branch changes the thread pointer (ip above). As an example, a conditional branch taken when the top-of-stack value is zero might be encoded as follows. Note that &thread[123] is the location to jump to, not the address of a handler, so it must be skipped over (ip++) whether or not the branch is taken.

thread:              brz:
  ...                  tmp = *ip++
  &brz                 if (*--sp == 0)
  &thread[123]           ip = tmp
  ...                  jump *ip++
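
A complete, runnable rendering of such a branch in the direct-threaded C style used earlier (GCC labels-as-values; all names invented) is this small countdown loop:

#include <stdio.h>

/* Direct-threaded branch example: the thread cell after &&brz holds the
   address of the target thread cell, and is consumed (ip++) whether or
   not the branch is taken, as in the pseudocode above. */

#define NEXT goto **ip++

int main(void) {
    int stack[32], *sp = stack;
    int counter = 3;

    void *thread[7];
    void **ip = thread;

    thread[0] = &&push_counter;   /* top of the loop                       */
    thread[1] = &&brz;            /* if the counter was zero, go to "done" */
    thread[2] = &thread[6];       /* branch target (the &&done cell)       */
    thread[3] = &&dec_counter;
    thread[4] = &&jump;           /* unconditional jump back to the top    */
    thread[5] = &thread[0];
    thread[6] = &&done;

    NEXT;

push_counter:
    *sp++ = counter;
    NEXT;
dec_counter:
    counter--;
    NEXT;
brz:
    {
        void **target = (void **)*ip++;   /* consumed whether taken or not */
        if (*--sp == 0)
            ip = target;                  /* redirect the thread pointer */
    }
    NEXT;
jump:
    ip = (void **)*ip;                    /* unconditional: follow the target cell */
    NEXT;
done:
    printf("counter reached %d\n", counter);   /* prints: counter reached 0 */
    return 0;
}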

Common amenities

Separating the data and return stacks in a machine eliminates a great deal of stack-management code, substantially reducing the size of the threaded code. The dual-stack principle originated independently three times: in Burroughs large systems, Forth, and PostScript. It is also used in some Java virtual machines.

Three registers are often present in a threaded virtual machine, and a fourth exists for passing data between subroutines ('words'). These are:

  • ip or i (instruction pointer) of the virtual machine (not to be confused with the program counter of the underlying hardware implementing the VM)
  • w (work pointer)
  • rp or r (return stack pointer)
  • sp or s (parameter stack pointer for passing parameters between words)

Often, threaded virtual machines such as implementations of Forth have a simple virtual machine at heart, consisting of three primitives. Those are:

  • nest, also called docol
  • unnest, or semi_s (;s)
  • next

In an indirect-threaded virtual machine, such as the one given here, the operations are:

next:   (ip)+ ->   w    ;  jmp (w)+
nest:    ip   -> -(rp)  ;  w -> ip  ;  next
unnest: (rp)+ ->   ip   ;  next

This is perhaps the simplest and fastest interpreter or virtual machine.
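
Rendered in the same C-with-computed-goto style as the earlier sketches (all word names are invented; a real Forth kernel would express this in assembly), the three primitives and one colon definition might look like this. Note that in this sketch w is left pointing at the code field, so nest enters the body at w + 1, whereas the register-transfer notation above auto-increments w instead:

#include <stdio.h>

/* Indirect-threaded inner interpreter showing next, nest (docol) and
   unnest (;s).  Each word is an array whose first cell is its code
   field; a colon definition carries a list of word addresses after it.
   Register names follow the text: ip, w, rp, sp. */

typedef void *cell;

int main(void) {
    static int A = 2, B = 3;
    int    dstack[32], *sp = dstack;     /* parameter (data) stack    */
    cell  *rstack[32], **rp = rstack;    /* return stack              */
    cell  *ip, *w;                       /* thread and word pointers  */

    /* word definitions: [0] = code field, [1..] = parameter field */
    static cell w_pushA[1], w_pushB[1], w_add[1], w_exit[1], w_halt[1];
    static cell w_addAB[5];              /* a colon definition        */
    static cell top[2];                  /* top-level thread          */

    w_pushA[0] = &&do_pushA;
    w_pushB[0] = &&do_pushB;
    w_add[0]   = &&do_add;
    w_exit[0]  = &&unnest;               /* ;s                        */
    w_halt[0]  = &&do_halt;

    w_addAB[0] = &&nest;                 /* docol                     */
    w_addAB[1] = w_pushA;
    w_addAB[2] = w_pushB;
    w_addAB[3] = w_add;
    w_addAB[4] = w_exit;

    top[0] = w_addAB;                    /* call addAB, then halt     */
    top[1] = w_halt;
    ip = top;

next:                                    /* fetch next word address into w, jump via its code field */
    w = *ip++;
    goto **w;

nest:                                    /* docol: save ip on the return stack, enter the body */
    *rp++ = ip;
    ip = w + 1;
    goto next;

unnest:                                  /* ;s: restore ip from the return stack */
    ip = *--rp;
    goto next;

do_pushA: *sp++ = A;                                    goto next;
do_pushB: *sp++ = B;                                    goto next;
do_add:   { int b = *--sp, a = *--sp; *sp++ = a + b; }  goto next;
do_halt:  printf("%d\n", dstack[0]);     /* prints 5 */
          return 0;
}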
