Author: Raimo Niskanen <raimo(at)erlang(dot)org>,
            Kiko Fernandez-Reyes <kiko(at)erlang(dot)org>
    Status: Final/27-w  Implemented in OTP release 27
            with a warning in OTP version 26.1
    Type: Standards Track
    Created: 07-Jun-2023
    Erlang-Version: OTP-27.0
    Post-History:
        https://erlangforums.com/t/feature-heredocs-triple-quoted-text/2638/26
        https://github.com/erlang/otp/pull/7451
        https://erlangforums.com/t/triple-quoted-strings-adjacent-strings-without-white-space/3017
****
EEP 64: Triple-Quoted Strings
----

Abstract
========

This EEP proposes the introduction of *Triple-Quoted Strings*,
and defines their semantics.  The main benefit is to allow
multi-line strings in an easy and useful way,
i.e. with indentation, similar to other languages, e.g. [Elixir][].

Their first use case is for in-module documentation attributes
containing [Markdown][] or similarly formatted text where verbatim text
is desirable since any documentation text format has its own notion
of escape sequences which will collide with Erlang's escape sequences.

Rationale
=========

Today (June 2023), writing multi-line strings is awkward and arguably ugly.
They may contain escape sequences and have no concept of indentation:

    foo () ->
        case bar() of
             ok ->
                 X = "First line
    Second line with \"\\*not emphasized\\* Markdown\"
    Third line",
                 {ok, X}
        end.

The content's indentation cannot adhere to the surrounding code's
and the `*` has to be doubly escaped to get a  `\*` character
sequence into the actual content.

In a documentation attribute as suggested in [EEP 59][],
the indentation problem is not that pronounced because
the documentation attribute itself is not much indented:

    -doc "
    First line
    Second line with \"\\*not emphasized\\* Markdown\"
    Third line".

But it sure looks better with indentation and not having
to quote the backslashes:

    -doc """
        First line
        Second line with "\*not emphasized\* Markdown"
        Third line
        """.

The main reason to consider this EEP is for documentation
attributes, where not having to worry about escape sequences
is this EEP's most attractive property.  Introducing a new string
format, however, will also require defining how it shall behave
in Erlang code.

Having a string format that is only allowed in attributes would
simply be very strange and the one suggested in this EEP
would also be useful in Erlang code.

Design Decisions
----------------

An attribute is an Erlang form in the source code
that consists of a `-` token, an atom, one value term and a full stop
(dot).  The value term may be enclosed in parentheses
(which is not very interesting for documentation attributes).

    -doc "  Badly formatted
    documentation paragraph
    /-\\
    \\-/".

A documentation attribute should have a string as its content term,
and here we want to use our new and more convenient triple-quoted string
instead of a normal string:

    -doc """
          Better formatted
        documentation paragraph
        /-\
        \-/
        """.

### Verbatim Strings

We want the strings to be verbatim since then they are sure
to not clash with any documentation text format.  In [Markdown][]
(and [AsciiDoc][]) it is the use of the backslash character (`\`)
that collides since it has a meaning in normal Erlang strings,
so every backslash would need to be escaped into double backslash
if the string accepts escape sequences.

This is in many cases not a big problem, though.  [Elixir][] also uses
[Markdown][] as documentation format and mostly ignores this problem.
But in some modules, it *is* a problem, such as the regular expression
module, and there [Sigils][] for verbatim text are used to avoid
quoting all backslashes.

We could also do like Elixir and choose both, but then we have
to implement [Sigils][] or something like that before getting
triple-quoted strings that are useful enough.

So if we have to choose one format, it should be the Verbatim one,
since that is the more general format, also in code;
where you simply would want to define a string for some purpose,
impossible to foresee.

This does not close the door on [Sigils][].  We can state that
we have merely chosen that the default for triple-quoted strings
is verbatim, and for normal strings it is escaped.  Although
it is a bit annoying that we then do not have the same defaults
as [Elixir][], there are other more annoying subtle differences
between strings in the languages.  See [Comparison with Elixir][].

Because the strings are verbatim, and we want to use them
to define any string in code, we cannot have the limitation
that there always is a newline at the end.  Therefore the
last newline (the newline on the line that precedes
the string ending line) should be stripped.  It is easy
to add a newline if an ending newline is needed.

[Elixir][] does not strip the last newline and dodges this problem
by escaping newlines.  If you do not want the last newline you put
a backslash last on the last line.  This does not work in their
verbatim strings, though.  So you have to choose between either
verbatim with ending newline or having to escape backslashes.

A slightly annoying consequence of verbatim strings with a fixed
end marker is that there is no possibility to create a string containing
such an end marker.  Therefore this EEP, as a possible extension,
suggests allowing not only triple-quoted strings,
but 3-or-more-quoted strings.  Then a start and end marker
can be chosen that is not part of the string,
and any string can be created.  This is a not so pretty solution
for a very rare corner case.

### Triple-Quoted String Scanner Token

A triple-quoted string must be a token that the scanner recognizes
as a string, which makes it suitable for a documentation attribute
value term.  It starts and ends with three double quotes: `"""`.

Double quotes, `"`, are chosen because normal Erlang strings use them
and this is just a new variant.  Since double quotes are used
a triple-quoted string shall, as a normal string,
produce a list of characters (Unicode code-points).

It would be more convenient if a triple-quoted string produced
an UTF-8 binary, but that would be a surprising feature for double quotes,
and the documentation build process can work around this by converting
the character list into the needed binary chunk.

In source code a triple-quoted string is valid within a binary,
so producing a Unicode binary is reasonably straightforward:

    X = <<"""
        Line 1
        Line 2
        """/utf8>>

The 2 + 7 characters ("`<<`" + "`/utf8>>`") overhead is not exhausting
since we are targeting multi-line strings.

As a future expansion it has been proposed to use [Sigils][] (prefixes)
for specialized strings such as regular expressions,
interpolated variables ([PR-7343][]), Unicode binary strings, etc.
For example: `X = ~u"Tschüß"` for an UTF-8 encoded binary.

### Triple-Quoted String Start

After the starting `"""` only white-space is allowed
up to the end of the line.

As a possible future expansion we might allow text here
that shouldn't be part of the string content, but could be
a hint for e.g. syntax highlighting and indentation handling
in an editor / pretty printer.

    -doc """ md
        Markdown content
        * Bullet list
        """.

The scanner does not need to have any special treatment
of the characters on the line after the starting
`"""` except that it should not search for an ending `"""`.

A later step strips the characters up to and including the newline
from the string content.

If any of these characters is not white-space, a syntax error is reported.

### Triple-Quoted String End

All characters are collected as they are (verbatim) and becomes
the string content.

A triple-quoted string ends with newline followed by optional white-space
and then `"""`.  This completes the scanner token.

A later step uses the white-space on the ending line
as the definition of the string's indentation
and strips that particular white-space sequence from every line
in the string, and strips the newline preceding the ending line.

If any of the lines do not start with the defined indentation
either because the line is too short or if the prefix differs,
a syntax error is reported.  For convenience and to adhere
to editor conventions, however; an empty line may be
completely empty instead of indented, but if it starts with
white-space characters that is not a newline,
they must be the defined indentation.

Requiring that all lines (except empty lines) must have
exactly the same indentation characters is a simple solution
to not have to define how indentation white-space
(tab vs. space) normalization should be done,
and also seems like a reasonable requirement.

### `CR`, `LF` and White-Space

The characters `CR` - Unicode code-point value 13,
`LF` - code-point 10, and white-space - as defined
in the Erlang scanner today, are handled as usual by the scanner,
except that if the line preceding the ending line
ends in `CR LF` then also the `CR` is interpreted
as part of the newline and is stripped along with the `LF`.
This is a convenience for systems with `CR LF` newlines.

Other than that, `CR`, `LF` and white-space within the string
is passed through as-is.

This means that the examples in the following text
where a normal string is used as matching reference
assumes that the source code has `LF`-only newlines,
such as:

    """
    
    X
    """ = "\nX"

If the source code has `CR LF` newlines that example instead becomes:

    """
    
    X
    """ = "\r\nX"

This example works in both cases but may be harder to read:

    """
    
    X
    """ = "
    X"

### Leading and trailing newline

The rules above strips one leading and one trailing newline.
This is a simple convention that also gives control over
the string's content:

Example 1:

    """
    
      X
    
    """ = "\n  X\n"

Example 2:

    """
    X
    """ = "X"

Example 3:

    """
     
    """ = ""

Note that the following could be a syntax error; too short
multi-line string since trailing newline shall be stripped
both from the starting line and from the last content line,
so the content could be seen as less than empty,
but it is more convenient to regard that newline as doubly stripped,
and thereby allow this as also an empty string:

    """
    """ = ""

### Indentation

The rules above facilitates indentation of the content
to adhere to the surrounding code.  The ending line
determines the indentation.

Example 1:

    """
    This string
    is not indented
    """ =
        "This string\nis not indented"

Example 2:

    """
        This string
        is indented
        """ =
        "This string\nis indented"

Example 3:

    """
          This indented string
        has an indented first line
        """ =
        "  This indented string\nhas an indented first line"

    """

Example 4:

    foo() ->
        X =
            """
              This indented string
            has an indented first line

            and an empty line that is not indented
            """,
        %% That content line 3 is empty instead of indented
        %% is only visible if you "touch" the text
        %% with the cursor or the mouse
        X =
            "  This indented string\n"
            "has an indented first line\n"
            "\n"
            "and an empty line that is not indented".

Example 5:

    """
    This is a syntax error (incorrect indentation)
        """

Example 6:

    """ This is a syntax error
    (non-white-space on start line)
    """

Example 6:

    """
    This is an incomplete string so the scanner will search forward
    for the end, and the shell will block waiting for more lines,
    since these quote characters are not a valid string ending: """

### Backwards incompatibility

This is valid today:

    X = """
        X
        """

It is equivalent to:

    X = "" "
        X
        " ""

Which is equivalent to:

    X = "
        X
        "

Which is equivalent to:

    X = "\n    X\n"

But with the suggested triple-quoted strings the first
code snippet would instead be equivalent to:

    X = "X"

Also, this is valid today:

    X = """ xxx
      X
        """

But according to this EEP it would be two syntax errors:

1. The start line has got non-white-space after `"""`.
2. The first content line has incorrect indentation.

There are many other similar constructions that also
would be syntax errors.

* It is far from likely that anyone has deliberately
  used `"""` in source code to mean an empty string
  concatenated to another string.
* Most today allowed combinations with `"""` will cause
  syntax errors.  Only a few will have a subtly changed
  behaviour (string content).
* Users can simply grep for `"""` in their source code.
  Creating the same sequence e.g through macros would
  be harder to find; the worst problem would not
  be new syntax errors (hard to miss), but changed
  behaviour.  And the changed behaviour would be
  a slightly different string content.

Allowing 3-or-more-quoted strings as suggested
in the next chapter has the potential to cause
maybe larger backwards incompatibilities.
Here is an example:

    X = """"
        ++ foo() ++
        """"

That code is valid today and will prepend the empty string
before and append the empty string after the return value of `foo()`.
When introducing 3-or-more-quoted strings it will instead become:

    X = "++ foo() ++"

The code example is really strange, though.

Therefore, it should be very unlikely that anyone
encounters a real backwards incompatibility problem
from the suggestions in this EEP.

But, to help users find accidental uses of 3-or-more-quoted
strings, a compiler warning should be introduced
early in the current Erlang/OTP release, and this EEP
may be implemented in the next.

Such a warning may, unfortunately, be more complicated
to implement then this EEP, because the scanner cannot
emit warnings.  Nor can the parser.

A warning can be implemented by letting the scanner
emit a special dummy token that the parser strips and ignores.
The preprocessor can then peek at the scanned token stream
and emit the warning.  This way there is no change in
the parser output so the rest of the preprocessor
and compiler as well as e.g. parse transforms are unaffected.

### Quoting of `"""`

With the rules above there is no possibility to have `"""`
first on a line in a triple-quoted string.

This would be allowed:

    -doc """
        A triple-quoted string starts with: """
        and ends with: """
        """.

As long as `"""` isn't first on a line.  Unfortunately
that is a lie since the ending according to this EEP
*should* be first on a line...

It would be possible to work around in Erlang code:

    X = """
        A triple-quoted string starts with: """
        and ends with:
        
        """ "\"\"\"".

That is ugly.

We can either ignore this quirk since it is only
when placed first on a line that `"""` becomes a problem,
or we can use the [GitHub Flavored Markdown][Markdown] trick
to allow 3 or more start characters and matching end characters
so this would be valid:

    X = """"
        A triple-quoted string starts with: """
        and ends with:
        """
        """"

### Comparison with Elixir

[Elixir][] has got triple-quoted strings and names them *heredocs*.

They are delimited with `"""` as in this suggestion and have
got very nearly the same rules for start line, end line and indentation.
Here are the known differences:

* The last end of line is not stripped.
* They produce UTF-8 encoded binaries, just like `"` quoted
  strings also does in [Elixir][].
* They accept escape sequences.
* Newlines can be escaped.
* There is an escape sequence `"\a"`.
* They do not allow more than 3 `"` characters as string start,
  which is a suggestion in this EEP
* They accept [Sigils][], which allows the string to be verbatim
  but then the final newline cannot be avoided.

This EEP suggests triple-quoted strings that are as [Elixir][]'s
*heredocs* without interpolation and escaping:

    ~S"""
    Heredoc without interpolation and escaping
    """

The [Elixir][] string has a final newline that this EEP suggests
should always be removed, since without escape sequences
there is no possibility to remove it.

[EEP 59]: https://www.erlang.org/eeps/eep-0059
    "EEP 59: Module attributes for documentation"

[PR-7343]: https://github.com/erlang/otp/pull/7343
    "Feature: String Interpolation"

[Markdown]: https://github.github.com/gfm/
    "GitHub Flavored Markdown"

[AsciiDoc]: https://asciidoc.org/
    "AsciiDoc plain text markup language"

[Elixir]: https://elixir-lang.org
    "The Elixir programming language"

[Sigils]: https://elixir-lang.org/getting-started/sigils.html
    "Elixir Sigils"

[Comparison with Elixir]: #comparison-with-elixir
    "Comparison with Elixir"

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal
license, whichever is more permissive.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"
[VimVar]: <> " vim: set fileencoding=utf-8 expandtab shiftwidth=4 softtabstop=4: "