10 Optimizations

1 Non processor specific

The following sections describe the general optimizations done by the compiler; they are not processor specific. Some of them must be enabled with a compiler switch, while others are done automatically (those which require a switch are noted as such).

1 Constant folding

In Free Pascal, if the operand(s) of an operator are constants, they will be evaluated at compile time.

Example

   x:=1+2+3+6+5;

will generate the same code as

   x:=17;

Furthermore, if an array index is a constant, the offset will be evaluated at compile time. This means that accessing MyData[5] is as efficient as accessing a normal variable.

Finally, calling the Chr, Hi, Lo, Ord, Pred, or Succ functions with constant parameters generates no run-time library calls; instead, the values are evaluated at compile time.
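As a rough illustration of the cases above, the following sketch uses only constant operands, a constant array index and standard functions applied to constants. The program name, SecondsPerDay and the variable names are made up for this example; the point is merely that every right-hand side can be reduced to a single constant before the program runs.

```pascal
program FoldDemo;

const
  SecondsPerDay = 24 * 60 * 60;   { constant expression, value 86400 }

var
  MyData: array[0..9] of Longint;
  x: Longint;
  c: Char;

begin
  x := 1 + 2 + 3 + 6 + 5;         { same code as x := 17 }
  MyData[5] := SecondsPerDay;     { constant index: the offset is known at compile time }
  c := Chr(Ord('A') + 1);         { Chr and Ord applied to constants }
  Writeln(x, ' ', MyData[5], ' ', c);   { prints: 17 86400 B }
end.
```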

2 Constant merging

Using the same constant string, floating point value or constant set two or more times generates only one copy of that constant.

3 Short cut evaluation

Evaluation of a boolean expression stops as soon as the result is known, which makes code execute faster than if all boolean operands were evaluated.
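Besides speed, short cut evaluation is what makes the common "check the pointer, then use it" idiom safe. The sketch below assumes hypothetical names (PNode, TNode, IsPositive) chosen only for illustration: the right-hand operand dereferences the pointer, so it must not be evaluated when the left-hand operand is already False.

```pascal
program ShortCircuitDemo;

type
  PNode = ^TNode;
  TNode = record
    Value: Longint;
    Next: PNode;
  end;

{ Returns True when P points to a node holding a positive value. }
function IsPositive(P: PNode): Boolean;
begin
  { If P is nil the result is already known to be False,
    so P^.Value is not evaluated. }
  IsPositive := (P <> nil) and (P^.Value > 0);
end;

var
  N: TNode;

begin
  N.Value := 42;
  N.Next := nil;
  Writeln(IsPositive(@N));    { TRUE }
  Writeln(IsPositive(nil));   { FALSE }
end.
```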

4 Constant set inlining

Using the in operator is always more efficient than using the equivalent <>, =, <=, >=, < and > operators. This is because range comparisons can be done more easily with in than with normal comparison operators.
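To make the comparison concrete, here is a small sketch with two hypothetical functions, IsAlnumCmp and IsAlnumIn, that perform the same test in both styles. It only checks that the two forms agree for every character; it does not measure their speed.

```pascal
program InDemo;

{ Range test written with ordinary comparison operators. }
function IsAlnumCmp(c: Char): Boolean;
begin
  IsAlnumCmp := ((c >= '0') and (c <= '9')) or
                ((c >= 'A') and (c <= 'Z')) or
                ((c >= 'a') and (c <= 'z'));
end;

{ The same test written with a constant set and the in operator. }
function IsAlnumIn(c: Char): Boolean;
begin
  IsAlnumIn := c in ['0'..'9', 'A'..'Z', 'a'..'z'];
end;

var
  c: Char;
  Mismatches: Longint;

begin
  Mismatches := 0;
  for c := #0 to #255 do
    if IsAlnumCmp(c) <> IsAlnumIn(c) then
      Inc(Mismatches);
  Writeln('mismatches: ', Mismatches);   { mismatches: 0 }
end.
```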

5 Small sets

Sets which contain fewer than 33 elements can be directly encoded using a 32-bit value; therefore no run-time library calls are required to evaluate operands on these sets. They are directly encoded by the code generator.
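A set over a small enumeration is a typical candidate. In the following sketch the type and function names (TPermission, TPermissions, CanWrite) are invented for illustration; the base type has three elements, well under the limit mentioned above.

```pascal
program SmallSetDemo;

type
  TPermission = (pRead, pWrite, pExecute);
  TPermissions = set of TPermission;   { base type has only 3 elements }

{ Membership test on a small set. }
function CanWrite(p: TPermissions): Boolean;
begin
  CanWrite := pWrite in p;
end;

var
  Perms: TPermissions;

begin
  Perms := [pRead, pExecute];
  Writeln(CanWrite(Perms));      { FALSE }
  Perms := Perms + [pWrite];     { set union }
  Writeln(CanWrite(Perms));      { TRUE }
end.
```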

6 Range checking

Assignments of constants to variables are range checked at compile time, which removes the need to generate run-time range checking code.

7 And instead of modulo

When the second operand of a mod on an unsigned value is a constant power of 2, an and instruction is used instead of an integer division. This generates more efficient code.
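The equivalence behind this optimization can be checked with a short sketch. Mod8 and And7 are hypothetical helper names; for an unsigned value, the remainder modulo 8 is simply the low three bits, which is what the mask 7 selects.

```pascal
program ModAndDemo;

{ Remainder modulo 8, written with mod. }
function Mod8(x: Cardinal): Cardinal;
begin
  Mod8 := x mod 8;
end;

{ The same remainder, obtained by masking the low three bits. }
function And7(x: Cardinal): Cardinal;
begin
  And7 := x and 7;
end;

var
  x: Cardinal;
  Mismatches: Longint;

begin
  Mismatches := 0;
  for x := 0 to 1000 do
    if Mod8(x) <> And7(x) then
      Inc(Mismatches);
  Writeln('mismatches: ', Mismatches);   { mismatches: 0 }
end.
```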

8 Shifts instead of multiply or divide

When one of the operands in a multiplication is a power of two, the multiplication is encoded using arithmetic shift instructions, which generates more efficient code.

Similarly, if the divisor in a div operation is a power of two, it is encoded using arithmetic shift instructions.

The same is true when accessing array indexes which are powers of two: the address is calculated using arithmetic shifts instead of the multiply instruction.
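The following sketch shows the arithmetic identity these replacements rely on. It is restricted to unsigned values to keep it simple, and the helper names Times8 and Div4 are assumptions made for this example only.

```pascal
program ShiftDemo;

{ Multiplication by a power of two (8 = 2^3). }
function Times8(x: Cardinal): Cardinal;
begin
  Times8 := x * 8;
end;

{ Division by a power of two (4 = 2^2). }
function Div4(x: Cardinal): Cardinal;
begin
  Div4 := x div 4;
end;

var
  x: Cardinal;
  Mismatches: Longint;

begin
  Mismatches := 0;
  for x := 0 to 1000 do
  begin
    { Shifting left by 3 multiplies by 8; shifting right by 2 divides by 4. }
    if Times8(x) <> (x shl 3) then
      Inc(Mismatches);
    if Div4(x) <> (x shr 2) then
      Inc(Mismatches);
  end;
  Writeln('mismatches: ', Mismatches);   { mismatches: 0 }
end.
```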

9 Automatic alignment

By default, all variables larger than a byte are guaranteed to be aligned at least on a word boundary.

Alignment on the stack and in the data section is processor dependent.

10 Smart linking

This feature removes all unreferenced code in the final executable file, making the executable file much smaller.

Smart linking is switched on with the -Cx command-line switch, or using the {$SMARTLINK ON} global directive.

11 Inline routines

The following runtime library routines are coded directly into the final executable: Lo, Hi, High, Sizeof, TypeOf, Length, Pred, Succ, Inc, Dec and Assigned.

12 Case optimization

When using the -O1 (or higher) switch, case statements will be generated using a jump table if appropriate, to make them execute faster.
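A case statement with a dense run of consecutive labels is the kind of code where a jump table is a plausible choice. The sketch below uses a hypothetical DayName function; whether a jump table is actually emitted depends on the switches and the target, which this example does not verify.

```pascal
program CaseDemo;

{ Maps 1..7 to a day name; any other value yields 'unknown'. }
function DayName(Day: Longint): String;
begin
  case Day of
    1: DayName := 'Monday';
    2: DayName := 'Tuesday';
    3: DayName := 'Wednesday';
    4: DayName := 'Thursday';
    5: DayName := 'Friday';
    6: DayName := 'Saturday';
    7: DayName := 'Sunday';
  else
    DayName := 'unknown';
  end;
end;

begin
  Writeln(DayName(3));   { Wednesday }
  Writeln(DayName(9));   { unknown }
end.
```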

13 Stack frame omission

Under specific conditions, the stack frame (entry and exit code for the routine, see section CallingConventions) will be omitted, and variables will be accessed directly via the stack pointer.

Conditions for omission of the stack frame:

The function has neither parameters nor local variables.
Routine does not call other routines.
Routine does not contain assembler statements. However, an assembler routine may omit its stack frame.
Routine is not declared using the Interrupt directive.
Routine is not a constructor or destructor.

14 Register variables

When using the -Or switch, local variables or parameters which are used very often will be moved to registers for faster access.

2 Processor specific

This section lists the low-level optimizations performed, on a per-processor basis.

1 Intel 80x86 specific

Here follows a listing of the optimizing techniques used in the compiler:

When optimizing for a specific processor (-Op1, -Op2, -Op3), the following is done:
- In case statements, a check is done whether a jump table or a sequence of conditional jumps should be used for optimal performance.
- A number of strategies are selected for peephole optimization, e.g. movzbl (%ebp), %eax will be changed into xorl %eax,%eax; movb (%ebp),%al for Pentium and PentiumMMX.
When optimizing for speed (-OG, the default) or size (-Og), a choice is made between using shorter instructions (for size) such as enter $4, or longer instructions subl $4,%esp for speed. When smaller size is requested, things aren't aligned on 4-byte boundaries. When speed is requested, things are aligned on 4-byte boundaries as much as possible.
Fast optimizations (-O1): activate the peephole optimizer
Slower optimizations (-O2): also activate the common subexpression elimination (formerly called the "reloading optimizer")
Uncertain optimizations (-Ou): With this switch, the common subexpression elimination algorithm can be forced into making uncertain optimizations.
Although you can enable uncertain optimizations in most cases, for people who do not understand the following technical explanation, it might be the safest to leave them off.

Remark: If uncertain optimizations are enabled, the CSE algorithm assumes that:
- If something is written to a local/global variable or a procedure/function parameter, this value doesn't overwrite the value to which a pointer points.
- If something is written to memory pointed to by a pointer variable, this value doesn't overwrite the value of a local/global variable or a procedure/function parameter.
The practical upshot of this is that you cannot use the uncertain optimizations if you both write and read local or global variables directly and through pointers (this includes Var parameters, as those are pointers too).
The following example will produce bad code when you switch on uncertain optimizations:
```
Var temp: Longint;

Procedure Foo(Var Bar: Longint);
Begin
  If (Bar = temp)
    Then
      Begin
        Inc(Bar);
        If (Bar <> temp) then Writeln('bug!')
      End
End;

Begin
  Foo(Temp);
End.
```
It produces bad code because you access the global variable Temp both through its name Temp and through a pointer, in this case the Bar variable parameter, which is nothing but a pointer to Temp in the above code.
On the other hand, you can use the uncertain optimizations if you access global/local variables or parameters through pointers, and only access them through this pointer.
For example:
```
Type TMyRec = Record
                a, b: Longint;
              End;
     PMyRec = ^TMyRec;


     TMyRecArray = Array [1..100000] of TMyRec;
     PMyRecArray = ^TMyRecArray;

Var MyRecArrayPtr: PMyRecArray;
    MyRecPtr: PMyRec;
    Counter: Longint;

Begin
  New(MyRecArrayPtr);
  For Counter := 1 to 100000 Do
    Begin
       MyRecPtr := @MyRecArrayPtr^[Counter];
       MyRecPtr^.a := Counter;
       MyRecPtr^.b := Counter div 2;
    End;
End.
```
This will produce correct code, because the global variable MyRecArrayPtr is not accessed directly, but only through a pointer (MyRecPtr in this case).
In conclusion, one could say that you can use uncertain optimizations only when you know what you're doing.

2 Motorola 680x0 specific

Using the -O2 switch performs several optimizations on the generated code, the most notable being:

Sign extension from byte to long will use EXTB
Returning of functions will use RTD
Range checking will generate no run-time calls
Multiplication will use the long MULS instruction, no runtime library call will be generated
Division will use the long DIVS instruction, no runtime library call will be generated

3 Optimization switches

This is where the various optimizing switches and their actions are described, grouped per switch.

-On:

with n = 1..3: these switches activate the optimizer. A higher level automatically includes all lower levels.

Level 1 (-O1) activates the peephole optimizer (common instruction sequences are replaced by faster equivalents).
Level 2 (-O2) enables the assembler data flow analyzer, which allows the common subexpression elimination procedure to remove unnecessary reloads of registers with values they already contain.
Level 3 (-O3) enables uncertain optimizations. For more info, see -Ou.

-OG:

This causes the code generator (and the optimizer, if activated) to favor faster, but larger, instruction sequences (such as "subl $4,%esp") over slower, smaller instructions ("enter $4"). This is the default setting.

-Og:

This one is exactly the reverse of -OG, and as such these switches are mutually exclusive: enabling one will disable the other.

-Or:

This setting causes the code generator to check which variables are used most, so it can keep those in a register.

-Opn:

with n = 1..3: Setting the target processor does NOT activate the optimizer. It merely influences the code generator and, if activated, the optimizer:

During the code generation process, this setting is used to decide whether a jump table or a sequence of successive jumps provides the best performance in a case statement.
The peephole optimizer makes a number of decisions based on this setting. For example, it translates certain complex instructions, such as
```
movzbl (mem), %eax
```
to a combination of simpler instructions
```
xorl %eax, %eax
movb (mem), %al
```
for the Pentium.

-Ou:

This enables uncertain optimizations. You cannot always use these, however. The previous section explains when they can be used, and when they cannot be used.

4 Tips to get faster code

Here, some general tips for getting better code are presented. They mainly concern coding style.

Find a better algorithm. No matter how much you and the compiler tweak the code, a quicksort will (almost) always outperform a bubble sort, for example.
Use variables of the native size of the processor you're writing for. This is currently 32-bit or 64-bit for Free Pascal, so it is best to use longint and cardinal variables.
Turn on the optimizer.
Write your if/then/else statements so that the code in the "then"-part gets executed most of the time (improves the rate of successful jump prediction).
Profile your code (see the -pg switch) to find out where the bottlenecks are. If you want, you can rewrite those parts in assembler. You can take the code generated by the compiler as a starting point. When given the -a command-line switch, the compiler will not erase the assembler file at the end of the assembly process, so you can study the assembler file.

5 Tips to get smaller code

Here are some tips for getting the smallest code possible.

Find a better algorithm.
Use the -Og compiler switch.
Group global static variables of the same size together within a module, to minimize the number of alignment directives (which increase the size of the .bss and .data sections unnecessarily). Internally, this is because all static data is written to the assembler file in the order in which it is declared in the Pascal source code.
Do not use the cdecl modifier, as this generates about 1 additional instruction after each subroutine call.
Use the smartlinking options for all your units (including the system unit).
Do not use ansistrings and exception support, as these require a lot of code overhead.
Turn off range checking and stack-checking.

Free Pascal Compiler
2001-09-22