Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
    Status: Draft
    Type: Standards Track
    Erlang-Version: OTP_R12B-4
    Created: 27-Aug-2008
    Post-History:
****
EEP 22: Range checking for binaries
----

Abstract
========

A module may request that bit fields be range checked.

Specification
=============

A new directive is added.

    -bit_range_check(Wanted).

where Wanted is 'false' or 'true'.

Recall that a segment of a bit string (or binary) has the form

    Value [':' Size] ['/' Type_Specifier_List]

where `Type_Specifier_List` includes such things as 'integer',
'signed', and 'unsigned'.  Currently the documentation states
that

    "Signedness ... Only matters for matching and when the type
     is integer.  The default is unsigned."

Combining the `Size` with the `Unit` gives a `Size_In_Bits`.
The on-line Erlang manual does not state in section 6.16 that
in constructing a bit string the bottom `Size_In_Bits` bits of
an integer are used with the rest quietly ignored, but it is so.

The directive `-bit_range_check(false)` makes explicit the
programmer's intention that this C-like truncation should happen.

The directive `-bit_range_check(true)` says that it is a checked
run-time error in

    Value:Size/unsigned-integer-unit:1

or constructions otherwise equivalent to it if Value does not
lie in the range `0 <= Value < 2**Size`, and it is a checked
run-time error in

    Value:Size/signed-integer-unit:1

or constructions otherwise equivalent to it if Value does not
lie in the range `-(2**(Size-1)) <= Value < 2**(Size-1)`.

The error that is raised is like the error that would be raised
for `(1//0):Size/Type_Specifier_List` except for using 'badrange'
instead of 'badarith'.

The behaviour of integer bit syntax segments in the absence of
a `-bit_range_check` directive is implementation defined and
subject to change.

The BEAM system is extended with a new instruction or instructions
similar to the existing instruction or instructions for integer
segments but checking the range.  The compiler is extended to
generate them for `<<...>>` expressions in the range of a
`-bit_range_check(true)` directive.

A `-bit_range_check` directive may not appear after a bit syntax
pattern or expression or after another `-bit_range_check` directive.

Motivation
==========

It keeps on coming as an unpleasant surprise to Erlang programmers
that this truncation happens.  Quiet destruction of information is
otherwise alien to Erlang:  integer arithmetic is unbounded, not
wrapped as in some (but not all) C systems; element/2 doesn't take
indices modulo tuple size but raises an exception if the index is
out of range, and so on.

In any case where the truncation is wanted, an Erlang programmer
can already write

    (Value rem 256):unsigned-integer

and the Erlang compiler could notice this and optimise the 'rem'
operation away, so the truncation is not only unusual in Erlang,
it is also unexpected in this particular case.

It is not only unexpected, it removes a chance to find mistakes,
so it would seem to be undesirable.

Edwin Fine asked "How difficult could it be to add optional run-
time checking to detect this condition without a serious risk of
adverse effects on the correctness of Erlang run-time execution?"

Björn Gustavsson replied "it would be better to add optional
support in the compiler to turn on checks (either for an entire
module, or for individual segments of a binary).  If someone
writes an EEP, we will consider implementing it."

This is that EEP.

Rationale
=========

The Erlang/OTP team regard the old behaviour as a feature,
and wish to retain it.  In particular, they wish modules that
were written expecting the old behaviour to continue to work
(for now) without modification.

One alternative would be to add new syntax, such as having a
new 'checked' specifier, so that

    Value/checked-unsigned-integer

would require a value in the range 0..255.
But many Erlang programmers will want to use this as the normal
case, and will not like the safe version being so much more effort
to write than the unsafe version.

It appears that "truncation wanted/not wanted" is not a matter
of this expression or that, but of this programmer or that,
and we can expect that each module will be written by someone
expecting only one behaviour or expecting only the other.

Adding a

    -bit_range_check(true).

directive to a module is more work than doing nothing at all,
but programmers who want this behaviour should be able to set up
their editing environment to have this line in their template for
creating new Erlang modules.

There are several questions:

- Should this apply to bit strings as well as integers?
- What should the name of the directive be?
- What should the argument(s) of the directive be?
- Should multiple instances of the directive be allowed in
  a module?

Bit strings:  `Assume X = <<5:3/unsigned-integer-unit:1>>`.
Currently, `<<X:2/bits>>` quietly truncates `X`.  This drops bits
from the right of `X`, giving `<<2:2>>`.  If this worked the same
as integers, you would expect `<<1:2>>`.  This is certainly
very odd.  Since we get truncation on the left and padding on
the left for integers, we naturally expect padding on the
right for bit strings to go with truncation on the right.
But `<<X:4/bits>>` isn't `<<10:4>>`, it's a runtime exception.
All very odd indeed.  It would certainly be desirable to have
an easy way for the programmer to indicate whether they wanted
truncation on the left or the right and padding on the left or
the right.  Perhaps a new built in function

    set_bit_size(Bit_String, Desired_Width,
                 Truncation, Padding, Fill)

    Bit_String : a bit string
    Desired_Width : a non-negative integer, the width wanted
    Truncation: 'left' | 'right' | 'error';
        if bit_size(Bit_String) > Desired_Width
            truncate on the left/truncate on the right/
            report an error
    Padding: 'left' | 'right' | 'error';
        if bit_size(Bit_String) < Desired_Width
            pad on the left/pad on the right/report an error
    Fill: 0 | 1 | 'copy';
        pad with 0/pad with 1/pad with a copy of the
        last bit at the end where padding is done.

However, that idea is only partly baked, and is not part of the
current proposal.  As things currently stand, using the bit
syntax and relying on implicit truncation is the simplest way
to extract the leading bits of a bit string.

As long as the name of the directive is intention-revealing,
it doesn't matter very much what it is.
I proposed `bit_range_check` because it is all about checking,
ranges in bit syntax, but since in this draft it does NOT apply
to bit string segments, perhaps `bit_integer_range_check` would
be better.

The arguments false and true seem clear enough.
Alternatives would be something like

    -bit_integer_range(check).
    -bit_integer_range(no_check).

That would be fine too.

Classical Pascal compilers let you do things like

    {$I-}   (* disable index checks *)
    (* code with no index checks *)
    {$I+}   (* re-enable index checks *)

Allowing multiple `-bit_range_check` directives in a module could
let you use code written for the old approach inside a module
that otherwise uses the new approach.  I don't believe that we
want to encourage that sort of thing:  it is MUCH easier when
reading a module if all of it follows the same rule.

It is also easier for an Erlang compiler that expects to be able
to process function definitions in any order.  The compiler can
check for one of these directives anywhere in a module before it
handles any bit syntax forms anywhere.  However, it is easier for
people reading a module if, when they first see a `<<...>>`
construction, they have already seen any directive that might
affect what it means.

The restrictions on the number and placement of these directives
can always be relaxed later if necessary.

Backwards Compatibility
=======================

All existing Erlang code remains acceptable with unchanged semantics.

Reference Implementation
========================

None, because I still can't find my way around the compiler.

Copyright
=========

This document has been placed in the public domain.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"