Duff's device

In computer science, Duff's device is an optimized implementation of a serial copy that uses a technique widely applied in assembly language for loop unwinding. It was discovered by Tom Duff in November 1983, while he was working for Lucasfilm. It is perhaps the most dramatic use of case label fall-through in the C programming language to date. Duff does not claim credit for discovering the concept of loop unrolling, only for this particular expression of it in C.

Contents

1 Background

2 Original version

3 Reason it works

4 Performance

5 Stroustrup's version

6 Books

7 References

8 External links

Background

Loop unrolling reduces the number of branches executed by doing the work of several iterations in each pass through the loop. When the number of iterations is not divisible by the unrolling factor, a common assembly-language technique is to jump directly into the middle of the unrolled loop so that the first pass copies only the remainder.

Duff wanted the same optimization for his own code and achieved it in C, unrolling his loop so that each pass assigns up to eight values.

Original version

Traditionally, a serial copy would look like:

    do {                      /* count > 0 assumed */
        *to = *from++;        /* Note that the 'to' pointer is NOT incremented */
    } while (--count > 0);

Note that the 'to' pointer is not incremented because Duff was copying to a single memory-mapped output register.

While optimizing this, Duff realized that an unrolled version of his loop could be implemented by interlacing the structures of a switch and a loop.

    send(to, from, count)
    register short *to, *from;
    register count;
    {
        register n = (count + 7) / 8;
        switch(count % 8) {
        case 0: do { *to = *from++;
        case 7:      *to = *from++;
        case 6:      *to = *from++;
        case 5:      *to = *from++;
        case 4:      *to = *from++;
        case 3:      *to = *from++;
        case 2:      *to = *from++;
        case 1:      *to = *from++;
                } while(--n > 0);
        }
    }

Note that the device can just as easily be applied with any other unrolling factor, not just 8.
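As a rough illustration of a different unrolling factor, the sketch below applies the same structure with four copies per pass. It is written in ANSI C rather than Duff's original K&R style. The name send4, the volatile qualifier on the destination, and the assert on count are assumptions made for this example, not part of Duff's code.

```c
#include <assert.h>

/* Hypothetical 4-way variant of the device. `send4` is an illustrative name.
   `to` is volatile because it stands in for a memory-mapped output register. */
void send4(volatile short *to, const short *from, int count)
{
    int n;

    assert(count > 0);            /* same precondition as the original */
    n = (count + 3) / 4;          /* number of passes through the loop */
    switch (count % 4) {          /* enter mid-loop so pass one copies the remainder */
    case 0: do { *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
```

Only two things change with the unrolling factor: the constants in the rounding division and the modulus, and the number of case labels inside the loop body.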

Reason it works

Duff's device is based on an algorithm widely used by assembly programmers to minimize the number of tests and branches during a copy, and it appears out of place when implemented in C. The device is nevertheless valid, legal C by virtue of two properties of the language:

Relaxed specification of the switch statement in the language's definition. At the time of the device's invention the language was defined by the first edition of The C Programming Language, which requires only that the controlled statement of the switch be a syntactically valid (compound) statement, within which case labels can appear prefixing any sub-statement. Because, in the absence of a break statement, the flow of control falls through from a statement controlled by one case label to the statement controlled by the next, the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.

The ability to legally jump into the middle of a loop in C.

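Note that, as documented in the comment appearing in Duff's unoptimized version, the code assumes that count is strictly positive. If count is zero, the switch selects case 0 and the loop body runs once in full, performing eight copies instead of none.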
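The effect of the precondition can be made visible with an instrumented copy that counts its stores. The sketch below is an illustration only: the names send_counted and send_guarded, the returned store count, and the guard itself are assumptions added for this example and do not appear in Duff's code.

```c
#include <assert.h>

/* Illustrative instrumented device: returns how many stores it performed.
   The name and the return value are assumptions for demonstration. */
int send_counted(volatile short *to, const short *from, int count)
{
    int stores = 0;
    int n = (count + 7) / 8;

    switch (count % 8) {
    case 0: do { *to = *from++; stores++;
    case 7:      *to = *from++; stores++;
    case 6:      *to = *from++; stores++;
    case 5:      *to = *from++; stores++;
    case 4:      *to = *from++; stores++;
    case 3:      *to = *from++; stores++;
    case 2:      *to = *from++; stores++;
    case 1:      *to = *from++; stores++;
            } while (--n > 0);
    }
    return stores;
}

/* Hypothetical guarded wrapper: skips the device when there is nothing to copy. */
int send_guarded(volatile short *to, const short *from, int count)
{
    int stores;

    if (count <= 0)
        return 0;
    stores = send_counted(to, from, count);
    assert(stores == count);      /* for count > 0 the device stores exactly count values */
    return stores;
}
```

With a source buffer of at least eight elements, calling send_counted with a count of 0 enters at case 0 and reports eight stores, while send_guarded reports none. For a count of 11 both report eleven: three stores on the first, partial pass and eight on the second.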

Performance

Many compilers will optimize the switch into a jump table just as would be done in an assembler implementation. C's default fall-through in case statements has long been one of its most controversial features; Duff observed that "This code forms some sort of argument in that debate, but I'm not sure whether it's for or against."^[1]

The main speed gain over a simple, straightforward loop comes from loop unwinding, which reduces the number of branches performed; branches are computationally expensive because they can force the pipeline to be flushed, and hence stalled. The switch statement handles the remainder when the count is not evenly divisible by the number of unrolled operations (in this example eight copies are unrolled, so the switch handles the extra 1–7 elements automatically).

This automatic handling of the remainder may not be the best solution on all systems and compilers. In some cases two loops may actually be faster: one unrolled loop for the main copy and a second loop for the remainder. The problem appears to come down to the compiler's ability to optimize the device correctly; the device may also interfere with pipelining and branch prediction on some architectures.^[2] When numerous instances of Duff's device were removed from the XFree86 Server in version 4.0, performance improved.^[3] Before using this code, it is therefore worth running benchmarks to verify that it really is the fastest option on the target architecture, at the target optimization level, with the target compiler.
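The two-loop alternative mentioned above might look like the following sketch. The function name send_two_loops, the volatile destination, and the assert are assumptions for illustration. Whether this form is faster than the device depends on the compiler and target, so it should be measured rather than assumed.

```c
#include <assert.h>

/* Illustrative two-loop alternative: an unrolled main loop followed by a
   simple loop for the remainder. `send_two_loops` is a hypothetical name. */
void send_two_loops(volatile short *to, const short *from, int count)
{
    int n, rem;

    assert(count >= 0);
    n = count / 8;                /* full groups of eight copies */
    rem = count % 8;              /* leftover copies, 0 to 7 */

    while (n-- > 0) {             /* main loop: one branch per eight copies */
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
    }
    while (rem-- > 0)             /* remainder loop: one branch per copy */
        *to = *from++;
}
```

Unlike the device, this form handles a count of zero without a separate guard, because neither loop body executes. The values still reach the output register in source order; only the position of the partial group moves from the first pass to the end.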

Stroustrup's version

The original device was made for copying to a (memory-mapped) register. To actually copy memory from one location to another, an auto-increment must be added to every reference to the 'to' pointer, like so:

*to++ = *from++;

This modified form of the device appears as a "what does this code do?" exercise in Bjarne Stroustrup's book The C++ Programming Language, presumably because novice programmers cannot be expected to know about memory-mapped output registers. However, the standard C library provides the function memcpy for this purpose; it is unlikely to perform worse than this code, and it may contain architecture-specific optimizations that make it significantly faster.
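A minimal sketch of the memory-to-memory form next to the memcpy call that replaces it is shown below. The names copy_device and copy_library and the guard on count are assumptions made for this example; this is not the exact text of the exercise in Stroustrup's book.

```c
#include <assert.h>
#include <string.h>

/* Memory-to-memory form of the device: both pointers advance.
   `copy_device` is an illustrative name, not taken from any library. */
void copy_device(short *to, const short *from, int count)
{
    int n;

    if (count <= 0)               /* nothing to copy; also protects the switch */
        return;
    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}

/* The standard-library route; source and destination must not overlap. */
void copy_library(short *to, const short *from, int count)
{
    assert(count >= 0);
    memcpy(to, from, (size_t)count * sizeof *from);
}
```

Both functions leave the destination with the same contents for non-overlapping buffers. Note that memcpy takes a size in bytes, so the element count is multiplied by the element size.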

Books

Stroustrup, Bjarne, The C++ Programming Language, Third Edition. Addison-Wesley, ISBN 0-201-88954-4

Kernighan, Brian and Dennis Ritchie, The C Programming Language.

References

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.

[1] Duff's device from FOLDOC

[2] James Ralston's USENIX 2003 Journal

[3] Ted Tso on XFree86 and performance, Linux Kernel Archive ML

External links

Description and original mail by Duff at Lysator

Wikipedia's example annotated at Stack Overflow

Explanation from c-faq.com

Article at Dr.Dobb's Journal

Article at FOLDOC

Article at the Jargon File

Article at CodeMaestro

Google copy of original USENET post

Simon Tatham's coroutines in C utilizes the same switch/case trick

Adam Dunkels' Protothreads - Lightweight, Stackless Threads in C also uses nested switch/case statements (see also The lightest lightweight threads, Protothreads)

