← The Tungsten Specification

2. Lexical Analysis

A Tungsten program is read by a lexical analyzer, or lexer, which converts an input stream of Unicode characters into a stream of tokens. If more than one token can match a sequence of characters in the source file, the lexer will form the longest possible lexical element. The stream of tokens output from the lexer is processed by a parser. Some tokens are discarded after processing.

This chapter describes how the lexical analyzer breaks a file into tokens.

Besides NL, SP, INDENT, and DEDENT; the following categories of tokens exist: identifiers, keywords, literals, operators, delimiters, and comments.

When defining lexical syntax, all whitespace is described explicitly.

2.1 Source code

Tungsten code must be encoded as UTF-8.

Tungsten code may exist in source files, or passed as a string to the eval function. In either case, the code is a sequence of Unicode characters processed by a lexer. Lexical analysis of the character stream, according to the grammar defined in this chapter, results in a stream of tokens. These tokens form the input of the parser grammar defined in later chapters of this specification.

If a file cannot be decoded as UTF-8, an Encoding<Error> must be raised.

Note: a UTF-8 byte order mark (BOM) U+FEFF ZERO WIDTH NO-BREAK SPACE may be the first character present, but is neither required nor recommended.

Text in source files must not be canonicalized or normalized by the lexer. For simplicity, this document will use the unqualified term character to refer to a single Unicode code point.

Tungsten source files may have the following extensions: .w, .wc, .ws, .wd.

2.2 Line structure

A Tungsten program is divided into one or more logical lines.

2.2.1 Logical lines

The end of a logical line is represented by the token NL. Statements cannot cross logical line boundaries except where NL is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules.

2.2.2 Physical lines

A physical line is a sequence of characters terminated by a line terminator. In Tungsten, the only valid line terminators are the U+000A LINE FEED character and the end of file (or end of input).

NL = (U+0A | EOF) .

These must not be recognized as line terminators:

Tungsten does not impose any limits on the length of a line.

2.2.3 Explicit line joining

When a physical line begins with zero or more spaces followed by a period ^\s*\. it will be joined with the preceding logical line, removing the whitespace and any comments in between.

# This code
list.select &.nonzero?
    .uniq               # only one of each
    .sort

# will be interpreted as
list.select(&:nonzero?).uniq.sort

Note: Joining lines with a backslash is not supported as it frequently results in hard to read code.

2.2.4 Implicit line joining

Expressions contained within the following pairs can be split over more than one physical line:

 ( … )  parentheses
 [ … ]  square brackets
 { … }  curly braces
<[ … ]> angle square pairs
<( … )> angle parentheses
<< … >> double angle brackets

 %i[  ] array of symbols
 %w[  ] array of words
%wc[  ] array of words for case

months = [ 'January', 'February', 'March'     # List of month names
         , 'April',   'May',      'June'
         , 'July',    'August',   'September'
         , 'October', 'November', 'December'
         ]

Implicitly continued lines can be commented. The indentation of the continuation lines is not important. Blank continuation lines are allowed. There is no NL token between implicitly continued lines.

2.2.5 Blank Lines

A physical line that contains only whitespace with an optional comment is ignored, i.e., no NL token is generated.

2.2.6 End of File

Tungsten source is terminated by whichever comes first:

An EOF token is used to indicate the end of file.

2.3 Whitespace

Tungsten's grammar is more particular about whitespace than most other languages.

SP = "\U{20}" .

One or more spaces (U+0020 SPACE) are collapsed into a single SP token. In many places, the grammar is disambiguated by adding whitespace.

Example: 10m/s^2 is a decimal literal defining an amount of acceleration, 10m/s ^ 2 means the 2nd power of 10m/s.

Infix operators must be surrounded by whitespace characters.

Tab characters (U+0009 CHARACTER TABULATION) are only allowed within string literals.

2.4 Indentation

Indentation in source files must be two spaces. Lines in the same scope must have the same indent. Changes in indentation produce INDENT and DEDENT tokens.

2.5 Comments

A comment starts with an unquoted hash character # followed by a space or bang, and terminates at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules apply. Comments are ignored by the syntax; they do not emit tokens.

Comment = "#" (SP | "!") { ~ NL } NL .

2.6 Preprocessing Directives

Preprocessing directives are governed by tokens described by the following lexical definition:

Letter  = "A"…"Z" | "_" .
Token   = "#" Letter { Letter } .
Boolean = 'true' | 'false' .
Rule    = Token "=" Boolean .

Note: Preprocessing tokens beginning with W_ are reserved for use by the implementation.

puts "starting [Time.now]" #W_DEBUG
puts "loaded [file]"       #W_VERBOSE

# TODO: Refine this syntax
#[development]
#![profile]

2.7 Identifiers and Keywords

Identifiers (also referred to as names) are described by the following lexical definitions.

The syntax of identifiers in Tungsten is based on [UAX #31: Unicode Identifier and Pattern Syntax][tr31], with elaboration and changes as defined below:

Within the ASCII range U+0001U+007F, the valid characters for identifiers are the uppercase letters AZ, the lowercase letters az, the underscore _ and, except as an identifier start, the digits 09.

Identifiers are unlimited in length. Case is significant.

Identifier   = XID_Start { XID_Continue } .
ID_Start     = (* all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property *) .
ID_Continue  = (* all characters in ID_Start, plus characters in the categories Mn, Mc, Nd, Pc, and others with the Other_ID_Continue property *) .
XID_Start    = (* all characters in ID_Start whose NFKC normalization is in "ID_Start { XID_Continue }" *) .
XID_Continue = (* all characters in ID_Continue whose NFKC normalization is in "ID_Continue" *) .

Token: ID

The Unicode category codes mentioned above stand for:

* Lu uppercase letters
* Ll lowercase letters
* Lt titlecase letters
* Lm modifier letters
* Lo other letters
* Nl letter numbers
* Mn nonspacing marks
* Mc spacing combining marks
* Nd decimal numbers
* Pc connector punctuations
* Other_ID_Start    explicit list of characters in PropList.txt to support backwards compatibility
* Other_ID_Continue likewise

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

Characters in the category Currency_Symbol (Sc) are reserved for use by decimal literals.

2.7.1 Keywords

The following identifiers are used as reserved words, or keywords of the language, and cannot be used as identifiers.

They must be spelled exactly as written here:

break
case continue
else elsif exit
false
if in
next nil
raise redo rescue retry return
self super
trait true
unless until use
when while
yield

__DIR__
__FILE__
__LINE__
__METHOD__
__MODULE__

The following tokens are reserved for future expansion of the Tungsten language:

asm async await
macro
of out
ptr
secret sync
type
uniq
with

mut mod freeze
safe unsafe

always and as assert assigns at
bad begin by
class compare
do
end ensure error every export extends extern
fn for from
is
ln
module
noop not
or
private protected public
reraise rm
then


abort abstract alias align always args asm assert assigns async atomic auto await
base begin binding bitstype body bool byte bytetype
cache cast catch char clone compile const continue
debug default defer deferred defined? del delegate delete delta deprecated done dynamic
eager ensure enum eps eval event every except exec exit export external
factory fail fallthrough field final finally for foreach foreign from function
get global goto guard
immutable implements implicit import imports include inherit inline interface internal invariant involatile item
lambda lazy let library load local loop
macro map match me mixin mutable
namespace new none nothrow null
object of on operator out override
package packed parallel parse part perform pragma privately proc property pub pure
raises range record ref repeat require restrict resume rethrow
safe scope sealed set shadow shared sizeof static struct suspend switch sync synchronized
template test this throw throws to trait transient trap try type typealias typedef typeof
undef undefined union unreachable unsafe use using
val var version void volatile
with without

INFINITY ∞
NAN
quietly

on off
yes no
good bad

# maybe
done
elif extern

# magic methods
@@caller
@@message

# Unused Ruby keywords
BEGIN
END
__ENCODING__
__END__

Backquote-enclosed strings can be used if you really need to use a reserved word as an identifier.

crop.`yield` * 1⋅bushel⋅acre⁻¹

2.7.2 Context-Dependent Constants

Constant Description
__DIR__ The directory name of the current script
__FILE__ The file name of the current script
__LINE__ The number of the current line in the file

2.7.3 Reserved classes of identifiers

@todo Python reserves _*, __*__, and __* http://docs.python.org/3.4/reference/lexical_analysis.html

2.8 String literals

Literals are notations for constant values of some built-in types.

2.8.1 Characters

A Character represents one Unicode code point.

Character literals are described by the following lexical definition:

Hex              = "0"…"9" | "A"…"F" .
LiteralCharacter = "U+" [Hex] [Hex] Hex Hex Hex Hex .

Examples:

# Range of allowed code points
U+0000…U+10FFFF

# LATIN CAPITAL LETTER A
U+0041
U+000041

U+0041.class
=> CodePoint

U+0041 == "\u{41}".codepoints.first == "A".codepoints.first

A character may also be written with a :- prefix followed by a single character, or a backslash escape. The value of the literal is the code point of that character:

:-)     # => the character ")"  (code point 41)
:-(     # => the character "("  (code point 40)
:-A     # => the character "A"  (code point 65)
:-\n    # => LINE FEED          (code point 10)

The :- must be followed by a non-whitespace character; :- (with a space) is not a character literal. This is a general character-literal form, not a fixed table of emoticons — :-X always denotes the character X. The recognized backslash escapes are \0, \n, \r, \t, \s, \\, \', and \".

2.8.2 Strings

A String represents a sequence of Unicode code points.

String literals are described by the following lexical definitions:

LiteralString = '"' { StringItem } '"' .
StringItem    = StringChar | StringEscape | StringExp .
StringChar    = (* any Unicode character except "\" or newline or '"' *) .
Character     = "\U{00}"…"\U{10FFFF}" .
StringExp     = "[" Expression "]" .
StringEscape  =
              | "\0"
              | "\a"
              | "\b"
              | "\c"
              | "\e"
              | "\f"
              | "\l"
              | "\n"
              | "\r"
              | "\s"
              | "\t"
              | "\v"
              | "\""
              | "\'"
              | "\["
              | "\\"
              | "\x" Hex Hex
              | "\o" Octal Octal Octal
              | "\u" Hex Hex Hex Hex
              | "\U[" { Hex Hex [Hex Hex] [Hex Hex] } "]"
              | "\N[" Characters { "," Characters } "]"
              | "\P[" Characters [ "=" Characters ] "]"
              | "\" Character
              .

String literals are delimited by matching double quotes. The backslash character is used to escape characters that are unprintable or otherwise have special meaning, such as a newline, backslash itself, or the double quote character.

The following escape sequences are recognized:

Escape sequence Control Unicode Abbr Character name Description, C0 of ISO 646
\0 \@ U+0000 NUL NULL A control character used to accomplish media-fill or time-fill. Null characters may be inserted into or removed from a stream of data without affecting the information content of that stream. But then the addition or removal of these characters may affect the information layout and/or the control of equipment.
\^A U+0001 SOH START OF HEADING A transmission control character used as the first character of a heading of an information message.
\^B U+0002 STX START OF TEXT A transmission control character which precedes a text and which is used to terminate a heading.
\^C U+0003 ETX END OF TEXT A transmission control character which terminates a text.
\^D U+0004 EOT END OF TRANSMISSION A transmission control character used to indicate the conclusion of the transmission of one or more texts.
\^E U+0005 ENQ ENQUIRY A transmission control character used as a request for a response from a remote station; the response may include station identification and/or station status. When a "Who are you" function is required on the general switched transmission network, the first use of ENQ after the connection is established shall have the meaning "Who are you" (station identification). Subsequent use of ENQ may, or may not, include the function "Who are you", as determined by agreement.
\^F U+0006 ACK ACKNOWLEDGE A transmission control character transmitted by a receiver as an affirmative response to the sender.
\a \^G U+0007 BEL ALERT A control character that is used when there is a need to call for attention; it may control alarm or attention devices.
\b \^H U+0008 BS BACKSPACE A format effector which moves the active position one character position backwards on the same line.
\t \^I U+0009 TAB CHARACTER TABULATION A format effector which advances the active position to the next pre-determined character position on the same line.
\n, \l \^J U+000A LF LINE FEED A format effector which advances the active position to the same character position of the next line.
\v \^K U+000B VT LINE TABULATION A format effector which advances the active position to the same character position on the next pre-determined line.
\f \^L U+000C FF FORM FEED A format effector which advances the active position to the same character position on a pre-determined line of the next form or page.
\r, \c \^M U+000D CR CARRIAGE RETURN A format effector which moves the active position to the first character position on the same line.
\^N U+000E SO SHIFT OUT A control character which is used in conjunction with SHIFT IN and ESCAPE to extend the graphic character set of the code. It may alter the meaning of octets 33 - 126 (dec). The effect of this character when using code extension techniques is described in International Standard ISO 2022.
\^O U+000F SI SHIFT IN A control character which is ued in conjunction with SHIFT OUT and ESCAPE to extend the graphic character set of the code. It may reinstate the standard meanings of the octets which follow it. The effect of this character when using code extension techniques is described in International Standard ISO 2022.
\^P U+0010 DLE DATA LINK ESCAPE A transmission control character which will change the meaning of a limited number of contiguously following characters. It is used exclusively to provide supplementary data transmission control functions. Only graphic characters and trensmission control characters can be used in DLE sequences.
\^Q U+0011 DC1 DEVICE CONTROL ONE A device control character which is primarily intended for turning on or starting an ancillary device. If it is not required for this purpose, it may be used to restore a device to the basic mode of operation (see also DC2 and DC3), or for any other device control function not provided by other DCs.
\^R U+0012 DC2 DEVICE CONTROL TWO A device control character which is primarily intended for turning on or starting an ancillary device. If it is not required for this purpose, it may be used to set a device to a special mode of operation (in which case DC1 is used to restore normal operation), or for any other device control function not provided by other DCs.
\^S U+0013 DC3 DEVICE CONTROL THREE A device control character which is primarily intended for turning off or stopping an ancillary device. This function may be a secondary level stop, for example, wait, pause, stand-by or halt (in which case DC1 is used to restore normal operation). If it is not required for this purpose, it may be used for any other device control function not provided by other DCs.
\^T U+0014 DC4 DEVICE CONTROL FOUR A device control character which is primarily intended for turning off, stopping or interrupting an ancillary device. If it is not required for this purpose, it may be used for any other device control function not provided by other DCs.
\^U U+0015 NAK NEGATIVE ACKNOWLEDGE A transmission control character transmitted by a receiver as a negative response to the sender.
\^V U+0016 SYN SYNCHRONOUS IDLE A transmission control character used by a synchronous transmission system in the absence of any other character (idle condition) to provide a signal from which synchronism may be achieved or retained between data terminal equipment.
\^W U+0017 ETB END OF TRANSMISSION BLOCK A transmission control character used to indicate the end of a transmission block of data where data is divided into such blocks for transmission purposes.
\^X U+0018 CAN CANCEL A character, or the first character of a sequence, indicating that the data preceding it is in error. As a result, this data is to be ignored. The specific meaning of this character must be defined for each application and/or between sender and recipient.
\^Y U+0019 EM END OF MEDIUM A control character that may be used to identify the physical end of a medium, or the end of the used portion of a medium, or the end of the wanted portion of data recorded on a medium. The position of this character does not necessarily correspond to the physical end of the medium.
\^Z U+001A SUB SUBSTITUTE A control character used in the place of a character that has been found to be invalid or in error. SUB is intended to be introduced by automatic means.
\e \^[ U+001B ESC ESCAPE A control character which is used to provide additional control functions. It alters the meaning of a limited number of contiguously following bit combinations. The use of this character is specified in Internation Standard ISO 2022.
\^\ U+001C FS INFORMATION SEPARATOR FOUR A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a file.
\^] U+001D GS INFORMATION SEPARATOR THREE A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a group.
\^^ U+001E RS INFORMATION SEPARATOR TWO A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a record.
\^_ U+001F US INFORMATION SEPARATOR ONE A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a unit.
\s U+0020 SP SPACE
\" U+0022 QUOTATION MARK
\' U+0027 APOSTROPHE
\[ U+005B LEFT SQUARE BRACKET
\\ U+005C REVERSE SOLIDUS
\` U+0060 GRAVE ACCENT
\d \? U+007F DEL DELETE
[expression] Interpolate value of expression
\oddd Unicode codepoint with octal value 'ddd'
\xhh Unicode codepoint with hex value 'hh'
\uhhhh Unicode codepoint with hex value 'hhhh'
\U[xx xxxx xxxxxx] 1 or more Unicode codepoints, by hex value
\N[NAME 1, NAME 2] 1 or more Unicode codepoints, by name
\P[prop=value] 1 or more Unicode codepoints, by property
\x x

2.8.3 ASCII-only Strings

An ASCII string represents a sequence of ASCII characters.

ASCII string literals are described by the following lexical definitions:

LiteralASCII = "'" { ASCIIPart } "'" .
ASCIIPart    = ASCIIChar | ASCIIEscape | ASCIIExp .
ASCIIChar    = (* any ASCII character except "\" or newline or "'" *) .
Character    = "\U{00}"…"\U{10FFFF}" .
ASCIIExp     = "[" Expression "]" .
ASCIIEscape  =
             | "\0"
             | "\a"
             | "\b"
             | "\c"
             | "\e"
             | "\f"
             | "\l"
             | "\n"
             | "\r"
             | "\s"
             | "\t"
             | "\v"
             | "\""
             | "\'"
             | "\["
             | "\\"
             | "\x" Hex Hex
             | "\o" Octal Octal Octal
             | "\" Character
             .
|b5 b6 b7 ---------> |000|001|010|011|100|101|110|111|
|b4 |b3 |b2 |b1 |r\c | - | - | - | - | - | - | - | - |
| 0 | 0 | 0 | 0 |  0 |NUL|DLE|SP | 0 | @ | P | ` | p |
| 0 | 0 | 0 | 1 |  1 |SOH|DC1| ! | 1 | A | Q | a | q |
| 0 | 0 | 1 | 0 |  2 |STX|DC2| " | 2 | B | R | b | r |
| 0 | 0 | 1 | 1 |  3 |ETX|DC3| # | 3 | C | S | c | s |
| 0 | 1 | 0 | 0 |  4 |EOT|DC4| $ | 4 | D | T | d | t |
| 0 | 1 | 0 | 1 |  5 |ENQ|NAK| % | 5 | E | U | e | u |
| 0 | 1 | 1 | 0 |  6 |ACK|SYN| & | 6 | F | V | f | v |
| 0 | 1 | 1 | 1 |  7 |BEL|ETB| ' | 7 | G | W | g | w |
| 1 | 0 | 0 | 0 |  8 | BS|CAN| ( | 8 | H | X | h | x |
| 1 | 0 | 0 | 1 |  9 | HT| EM| ) | 9 | I | Y | i | y |
| 1 | 0 | 1 | 0 | 10 | LF|SUB| * | : | J | Z | j | z |
| 1 | 0 | 1 | 1 | 11 | VT|ESC| + | ; | K | [ | k | { |
| 1 | 1 | 0 | 0 | 12 | FF| FS| , | < | L | \ | l | | |
| 1 | 1 | 0 | 1 | 13 | CR| GS| - | = | M | ] | m | } | 
| 1 | 1 | 1 | 0 | 14 | SO| RS| . | > | N | ^ | n | ~ |
| 1 | 1 | 1 | 1 | 15 | SI| US| / | ? | O | _ | o |DEL|
Binary Oct Dec Hex Abr C Name
0b000_0000 0o000 0 0x00 NUL ^@ \0 Null
0b000_0001 0o001 1 0x01 SOH ^A Start of Heading
0b000_0010 0o002 2 0x02 STX ^B Start of Text
0b000_0011 0o003 3 0x03 ETX ^C End of Text
0b000_0100 0o004 4 0x04 EOT ^D End of Transmission
0b000_0101 0o005 5 0x05 ENQ ^E Enquiry
0b000_0110 0o006 6 0x06 ACK ^F Acknowledgement
0b000_0111 0o007 7 0x07 BEL ^G \a Bell
0b000_1000 0o010 8 0x08 BS ^H \b Backspace
0b000_1001 0o011 9 0x09 HT ^I \t Horizontal Tab
0b000_1010 0o012 10 0x0A LF ^J \n Line Feed
0b000_1011 0o013 11 0x0B VT ^K \v Vertical Tab
0b000_1100 0o014 12 0x0C FF ^L \f Form Feed
0b000_1101 0o015 13 0x0D CR ^M \r Carraige Return
0b000_1110 0o016 14 0x0E SO ^N Shift Out
0b000_1111 0o017 15 0x0F SI ^O Shift In
0b001_0000 0o020 16 0x10 DLE ^P Data Link Escape
0b001_0001 0o021 17 0x11 DC1 ^Q Device Control 1 (often XON)
0b001_0010 0o022 18 0x12 DC2 ^R Device Control 2
0b001_0011 0o023 19 0x13 DC3 ^S Device Control 3 (often XOFF)
0b001_0100 0o024 20 0x14 DC4 ^T Device Control 4
0b001_0101 0o025 21 0x15 NAK ^U Negative Acknowledgement
0b001_0110 0o026 22 0x16 SYN ^V Synchronous Idle
0b001_0111 0o027 23 0x17 ETB ^W End of Transmission Block
0b001_1000 0o030 24 0x18 CAN ^X Cancel
0b001_1001 0o031 25 0x19 EM ^Y End of Medium
0b001_1010 0o032 26 0x1A SUB ^Z Substitute
0b001_1011 0o033 27 0x1B ESC ^[ \e Escape
0b001_1100 0o034 28 0x1C FS ^\ File Separator
0b001_1101 0o035 29 0x1D GS ^] Group Separator
0b001_1110 0o036 30 0x1E RS ^^ Record Separator
0b001_1111 0o037 31 0x1F US ^_ Unit Separator
0b010_0000 0o040 32 0x20
0b010_0001 0o041 33 0x21 !
0b010_0010 0o042 34 0x22 "
0b010_0011 0o043 35 0x23 #
0b010_0100 0o044 36 0x24 $
0b010_0101 0o045 37 0x25 %
0b010_0110 0o046 38 0x26 &
0b010_0111 0o047 39 0x27 '
0b010_1000 0o050 40 0x28 (
0b010_1001 0o051 41 0x29 )
0b010_1010 0o052 42 0x2A *
0b010_1011 0o053 43 0x2B +
0b010_1100 0o054 44 0x2C ,
0b010_1101 0o055 45 0x2D -
0b010_1110 0o056 46 0x2E .
0b010_1111 0o057 47 0x2F /
0b011_0000 0o060 48 0x30 0
0b011_0001 0o061 49 0x31 1
0b011_0010 0o062 50 0x32 2
0b011_0011 0o063 51 0x33 3
0b011_0100 0o064 52 0x34 4
0b011_0101 0o065 53 0x35 5
0b011_0110 0o066 54 0x36 6
0b011_0111 0o067 55 0x37 7
0b011_1000 0o070 56 0x38 8
0b011_1001 0o071 57 0x39 9
0b011_1010 0o072 58 0x3A :
0b011_1011 0o073 59 0x3B ;
0b011_1100 0o074 60 0x3C <
0b011_1101 0o075 61 0x3D =
0b011_1110 0o076 62 0x3E >
0b011_1111 0o077 63 0x3F ?
0b100_0000 0o100 64 0x40 @
0b100_0001 0o101 65 0x41 A
0b100_0010 0o102 66 0x42 B
0b100_0011 0o103 67 0x43 C
0b100_0100 0o104 68 0x44 D
0b100_0101 0o105 69 0x45 E
0b100_0110 0o106 70 0x46 F
0b100_0111 0o107 71 0x47 G
0b100_1000 0o110 72 0x48 H
0b100_1001 0o111 73 0x49 I
0b100_1010 0o112 74 0x4A J
0b100_1011 0o113 75 0x4B K
0b100_1100 0o114 76 0x4C L
0b100_1101 0o115 77 0x4D M
0b100_1110 0o116 78 0x4E N
0b100_1111 0o117 79 0x4F O
0b101_0000 0o120 80 0x50 P
0b101_0001 0o121 81 0x51 Q
0b101_0010 0o122 82 0x52 R
0b101_0011 0o123 83 0x53 S
0b101_0100 0o124 84 0x54 T
0b101_0101 0o125 85 0x55 U
0b101_0110 0o126 86 0x56 V
0b101_0111 0o127 87 0x57 W
0b101_1000 0o130 88 0x58 X
0b101_1001 0o131 89 0x59 Y
0b101_1010 0o132 90 0x5A Z
0b101_1011 0o133 91 0x5B [
0b101_1100 0o134 92 0x5C \
0b101_1101 0o135 93 0x5D ]
0b101_1110 0o136 94 0x5E ^
0b101_1111 0o137 95 0x5F _
0b110_0000 0o140 96 0x60 `
0b110_0001 0o141 97 0x61 a
0b110_0010 0o142 98 0x62 b
0b110_0011 0o143 99 0x63 c
0b110_0100 0o144 100 0x64 d
0b110_0101 0o145 101 0x65 e
0b110_0110 0o146 102 0x66 f
0b110_0111 0o147 103 0x67 g
0b110_1000 0o150 104 0x68 h
0b110_1001 0o151 105 0x69 i
0b110_1010 0o152 106 0x6A j
0b110_1011 0o153 107 0x6B k
0b110_1100 0o154 108 0x6C l
0b110_1101 0o155 109 0x6D m
0b110_1110 0o156 110 0x6E n
0b110_1111 0o157 111 0x6F o
0b111_0000 0o160 112 0x70 p
0b111_0001 0o161 113 0x71 q
0b111_0010 0o162 114 0x72 r
0b111_0011 0o163 115 0x73 s
0b111_0100 0o164 116 0x74 t
0b111_0101 0o165 117 0x75 u
0b111_0110 0o166 118 0x76 v
0b111_0111 0o167 119 0x77 w
0b111_1000 0o170 120 0x78 x
0b111_1001 0o171 121 0x79 y
0b111_1010 0o172 122 0x7A z
0b111_1011 0o173 123 0x7B {
0b111_1100 0o174 124 0x7C
0b111_1101 0o175 125 0x7D }
0b111_1110 0o176 126 0x7E ~
0b111_1111 0o177 127 0x7F DEL ^? Delete

ASCII literals are delimited by matching single quotes. The backslash character is used to escape characters that are unprintable or otherwise have special meaning, such as a newline, backslash itself, or the single quote character.

2.8.4 String interpolation

String interpolation uses square brackets: [].

name = "Tungsten"
puts "Hello [name]"

2.8.5 String literal concatenation

Multiple adjacent string literals (delimited by whitespace) are allowed, and their meaning is the same as their concatenation. Thus, "hello" "world" is equivalent to "helloworld". This feature can be used to split long strings or to add comments to parts of strings.

Note: although this feature is defined at the syntactic level, it is implemented at compile time. The "+" operator must be used to concatenate string expressions at run time.

2.8.6 Here documents

HereDocument         = "<<-" NAME ... NAME .
IndentedHereDocument = "<<~" NAME ... NAME .

2.8.7 ByteStrings

A Tungsten ByteString represents a sequence of bytes and is described by the following lexical definition:

Hex               = "0"…"9" | "A"…"F" | "a"…"f" .
LiteralByteString = "<<" [Hex Hex] { "," Hex Hex } ">>" .

Example:

<<84,117,110,103,115,116,101,110>>

2.8.8 Symbols

A Tungsten Symbol represents a named token. They allow for Ruby-like DSLs.

Letter = "a"…"z" .
Digit  = "0"…"9" .
LiteralSymbol = ":" Letter { Letter | Digit | "_" } .

Because strings in Tungsten are immutable, symbols are less useful than in Ruby.

2.9 Numeric literals

There are 4 types of numeric literals: integers, decimals, floating point numbers, and imaginary numbers. There are no complex literals (complex numbers can be formed by adding a real number and an imaginary number).

Note that numeric literals do not include a sign; a phrase like −1 is actually an expression composed of the unary operator - and the literal 1.

2.9.1 Integers

Integer literals are described by the following lexical definitions:

Integer       = IntegerBase2 | IntegerBase8 | IntegerBase10 | IntegerBase16 | IntegerBase20 .

DigitBase2    = "0"…"1" .
DigitBase8    = "0"…"7" .
DigitBase10   = "0"…"9" .
DigitBase16   = "0"…"9" | "a"…"f" | "A"…"F" .
DigitBase20   = "0"…"9" | "a"…"j" | "A"…"J" .

IntegerBase2  = "0b" DigitBase2   { ["_"] DigitBase2  } .
IntegerBase8  = "0o" DigitBase8   { ["_"] DigitBase8  } .
IntegerBase10 =      DigitBase10  { ["_"] DigitBase10 } .
IntegerBase16 = "0x" DigitBase16  { ["_"] DigitBase16 } . # Unsigned integers
IntegerBase20 = "0v" DigitBase20  { ["_"] DigitBase20 } .

There is no limit for the length of integer literals apart from what can be stored in available memory.

Numerical constants can contain underscores for readability. Integers can be created as decimal (no prefix), binary (0b), octal (0o), and hexadecimal (0x).

Note: non-zero decimal literals may have leading zeros, as the octal literals have the 0o prefix. Note: leading and trailing underscores are not allowed.

Unsigned literals are described by the following lexical definitions:

Hex     = "0"…"9" | "A"…"F" | "a"…"f" .

Int8U   = "0x" Hex .
Int16U  = "0x" Hex Hex .
Int32U  = "0x" Hex Hex Hex Hex .
Int64U  = "0x" Hex Hex Hex Hex ["_"] Hex Hex Hex Hex .
Int128U = "0x" Hex Hex Hex Hex ["_"] Hex Hex Hex Hex ["_"] Hex Hex Hex Hex ["_"] Hex Hex Hex Hex .

# Should "0b" and "0o" literals also be unsigned?

Integer Types

Type Signed? Bits Min value Max value
Int8 8 −2⁷ 2⁷ − 1
Int8U 8 0 2⁸ − 1
Int16 16 −2¹⁵ 2¹⁵ − 1
Int16U 16 0 2¹⁶ − 1
Int32 32 −2³¹ 2³¹ − 1
Int32U 32 0 2³² − 1
Int64 64 −2⁶³ 2⁶³ − 1
Int64U 64 0 2⁶⁴ − 1
Int128 128 −2¹²⁷ 2¹²⁷ − 1
Int128U 128 0 2¹²⁸ − 1
BigInt −∞

The default type for an integer literal is 64-bits:

wit> 1.class
Int64

Unsigned integers are input and output using the 0x prefix and hexadecimal (base 16) digits 0–9a–f (the capitalized digits A–F also work). The size of the unsigned value is determined by the number of hex digits used:

wit> 0x1.class
Int8U

wit> 0x123.class
Int16U

wit> 0x1234567.class
Int32U

wit> 0x123456789abcdef.class
Int64U

This behavior is based on the observation that when one uses unsigned hex literals for integer values, one typically is using them to represent a fixed numeric byte sequence, rather than just an integer value.

Binary and octal literals are also supported:

wit> 0b10
0x02

wit> 0b10.class
Int8U

wit> 0o10
0x08

wit> 0o10.class
Int8U

The minimum and maximum representable values of primitive numeric types such as integers are given by the min/0 and max/0 methods:

# wit> (Int32: min max)
# wit> Int32{min max}
wit> (Int32.min, Int32.max)
(−2147483648, 2147483647)

wit> [Int8, Int16, Int32, Int64, Int128, Int8U, Int16U, Int32U, Int64U, Int128U].each do |type|
       puts "[type.lpad(7)]: ([type.min], [type.max])"

   Int8: (-128, 127)
  Int16: (-32768, 32767)
  Int32: (-2147483648, 2147483647)
  Int64: (-9223372036854775808, 9223372036854775807)
 Int128: (-170141183460469231731687303715884105728, 170141183460469231731687303715884105727)
  UInt8: (0, 255)
 UInt16: (0, 65535)
 UInt32: (0, 4294967295)
 UInt64: (0, 18446744073709551615)
UInt128: (0, 340282366920938463463374607431768211455)

The values returned by min/0 and max/0 are always of the receiver's type.

2.9.2 Decimals

Decimal literals (or rationals) are described by the following lexical definitions:

(* @todo Roman numeral characters U+2160–217F, counting rods U+1D360 to U+1D37F *)

Digit      = "0"…"9" . 
Letter     = "\p{letter}" .
Letters    = Letter { Letter | "_" } .
Currency   = "\p{currency symbol}" .
SuperNZ    =       "¹" | "²" | "³" | "⁴" | "⁵" | "⁶" | "⁷" | "⁸" | "⁹" .
Super      = "⁰" | "¹" | "²" | "³" | "⁴" | "⁵" | "⁶" | "⁷" | "⁸" | "⁹" .
Supers     = ["⁻" | "⁺"] SuperNZ { Super } .

Prefix     = Currency .
Suffix     = Units | Degrees | Percents .

Unit       = (Letters ["-" Letters] | [Letters] "/" Letters) [ "^" Exponent | Supers ] .
Units      = ["⋅"] Unit { "⋅" Unit } .
Degrees    = "℃" | "℉" | "°C" | "°F" | "°" [Letter] .
Percents   = "%" | "‰" | "‱" | "٪" | "؉" | "؊" | "﹪" | "%" | "percent" .

Exponent   = ["+" | "-" | "−"] Digits .

Scientific = "x10^" Exponent
           | "×10^" Exponent
           | "x10"  Supers
           | "×10"  Supers
           | "e"    Exponent
           | "E"    Exponent
           .

Precision  = "±" Digits ["." Digits]
           | "±" Digits "/" Digits
           | "(" Digits ")"
           .

Digits     = Digit { Digit | "_" } .

Decimal    = [Prefix] Digits  "." Digits  [Precision]            [Suffix]
           | [Prefix] Digits  "/" Digits  [Precision]            [Suffix]
           | [Prefix] Digits              [Precision]             Suffix
           |  Prefix  Digits              [Precision]            [Suffix]
           | [Prefix] Digits ["." Digits] [Precision] Scientific [Suffix]
           .

Literal    = Decimal " "
           | "ℎ" (* Planck's constant *)
           | "ℏ" (* Reduced Planck constant *)
           | "ℇ" (* Eulers constant, irrational *)
           | "π" (* Pi, irrational *)
           | "ϕ" (* Phi, irrational *)
           .

Tungsten decimal literals can be annotated with semantic meaning that is available at run-time:

Tungsten allows you to create new units of measurement, auto-generating the conversions to other units.

Tungsten ships with all dimensions from the International System of Units, abbreviated SI from the French Le Système International d'Unités.

Example: literal definition of Planck's constant: ℎ = 6.626_069_57(29)×10²³J·s.

Note: Trailing zeros are meaningful, as they indicate the precision associated with the number. Decimal literals will be normalized to a standard form before returned: e.g., "x10^2" => "×10²".

Decimal literals can be defined with semantic meaning:

0.000_000_000_1
0.08
0.0800

wit> 22/7
  => 22/7

wit> 22/2
  => 11

wit> $3.50 - 25¢
  => $3.25

wit> $499 - 15%          # woah, same as $499 * 0.85
  => $424.15

wit> 20% - 15%
  => 5%

wit> 1cm * 1cm * 1cm
  => 1mL

wit> 10ft * 10ft
  => 100sqft

wit> 1ft + 12inches
  => 2ft

wit> 2ft.to_s
  => "2 feet"

wit> 299_792_458m/s

wit> 10ft·lbs

wit> 2m + 2lbs
error UnitMismatch

wit> 3ft - 1m
  => -3+3/8inches

wit> 73/100±1/100

wit> 1.602_176_487(40)x10^-19C
wit> 1.602_176_487±0.000_000_040x10^-19C

wit> V = (1kg * 1m^2) / (1amp * 1s^3)

wit> 100MV # megavoltage of lightning

wit> rate = 1/s

wit> 40°20′50″

wit> 3′5″ # 3 feet 5 inches (of length), or 3 minutes and 5 seconds (of time)
wit> 3m5s

wit> ℎ
  => 6.626_069_57(29)×10²³J·s

wit> ℏ
  => 1.054_571_726(47)×10³⁴J·s

# exact calculations using irrational numbers
wit> 2π - 1π - 1π
  => 0

wit> 3π / 2
  => 1.5π

wit> 512GiB %% bytes
  => 549_755_813_888·bytes

Tungsten.register_unit "km", alias: "kilometer",     equals: 1000m
Tungsten.register_unit "cm", alias: "centimeter",    equals: 1m⁻²
Tungsten.register_unit "in", aliases: ["inch", "\N[DOUBLE PRIME]"], equals: 2.54cm, as: "inch"

SI Base Units (meter kilogram second)

Name Abbr Measure
metre m length
kilogram kg mass
second s time
ampere A electric current
kelvin K thermodynamic temperature
mole mol amount of substance
candela cd luminous intensity

Note: The kilogram is the only prefixed SI Base Unit. The prefixes for mass refer to the gram as their base.

CGS Base Units (centimeter gram second)

Name Abbr Measure
centimeter cm length
gram g mass
second s time
centimeter per second cm/s velocity
gal Gal acceleration
dyne dyn force
erg erg energy
erg per second erg/s power
barye Ba pressure
poise P dynamic viscosity
stokes St kinematic viscosity
kayser cm⁻¹ wavenumber

Source: wikipedia.org/wiki/CGS

Significant figures

A significant figure is a digit in a number that adds to its precision. This includes all nonzero numbers, zeroes between significant digits, and zeroes indicated to be significant. Leading and trailing zeroes are not significant because they exist only to show the scale of the number. Therefore, 1,230,400 has five significant figures—1, 2, 3, 0, and 4; the two zeroes serve only as placeholders and add no precision to the original number.

When a number is converted into normalized scientific notation, it is scaled down to a number between 1 and 10. All of the significant digits remain, but all of the place holding zeroes are incorporated into the exponent. Following these rules, 1,230,400 becomes 1.2304 x 10⁶.

Ambiguity of the last digit

It is customary in scientific measurements to record all the significant digits from the measurements, and to guess one additional digit if there is any information at all available to the observer to make a guess. The resulting number is considered more valuable than it would be without that extra digit, and it is considered a significant digit because it contains some information leading to greater precision in measurements and in aggregations of measurements (e.g., when adding them or multiplying them together).

Additional information about precision can be conveyed through additional notations. In some cases, it may be useful to know how exact the final significant digit is. For instance, the accepted value of the unit of elementary charge can properly be expressed as 1.602_176_487(40)x10^-19C, which is shorthand for 1.602_176_487±0.000_000_040x10^-19C.

Metric prefixes

Prefix Symbol 1000m 10n Decimal English word Since
yotta Y 1000⁸ 10²⁴ 1 000 000 000 000 000 000 000 000 septillion 1991
zetta Z 1000⁷ 10²¹ 1 000 000 000 000 000 000 000 sextillion 1991
exa E 1000⁶ 10¹⁸ 1 000 000 000 000 000 000 quintillion 1975
peta P 1000⁵ 10¹⁵ 1 000 000 000 000 000 quadrillion 1975
tera T 1000⁴ 10¹² 1 000 000 000 000 trillion 1960
giga G 1000³ 10⁹ 1 000 000 000 billion 1960
mega M 1000² 10⁶ 1 000 000 million 1960
kilo k 1000¹ 10³ 1 000 thousand 1795
hecto h 10002/3 10² 1 00 hundred 1795
deca da 10001/3 10¹ 1 0 ten 1795
1000⁰ 10⁰ 1 one -
Prefix Symbol 1000m 10n Decimal English word Since
1000⁰ 10⁰ 1 one -
deci d 1000-1/3 10⁻¹ 0.1 tenth 1795
centi c 1000-2/3 10⁻² 0.01 hundredth 1795
milli m 1000⁻¹ 10⁻³ 0.001 thousandth 1795
micro µ, mc 1000⁻² 10⁻⁶ 0.000 001 millionth 1960
nano n 1000⁻³ 10⁻⁹ 0.000 000 001 billionth 1960
pico p 1000⁻⁴ 10⁻¹² 0.000 000 000 001 trillionth 1960
femto f 1000⁻⁵ 10⁻¹⁵ 0.000 000 000 000 001 quadrillionth 1964
atto a 1000⁻⁶ 10⁻¹⁸ 0.000 000 000 000 000 001 quitillionth 1964
zepto z 1000⁻⁷ 10⁻²¹ 0.000 000 000 000 000 000 001 sextillionth 1991
yocto y 1000⁻⁸ 10⁻²⁴ 0.000 000 000 000 000 000 000 001 septillionth 1991

Note: dag,dkg: decagram, mcg: microgram, megagram: tonne (t), megatonne or megaton -> teragram (Tg)

Binary prefixes

Prefix Symbol 2n Derivation Decimal
kibi Ki 2¹⁰ kilo: (10³)¹ 1 024
mebi Mi 2²⁰ mega: (10³)² 1 048 576
gibi Gi 2³⁰ giga: (10³)³ 1 073 741 824
tebi Ti 2⁴⁰ tera: (10³)⁴ 1 099 511 627 776
pebi Pi 2⁵⁰ peta: (10³)⁵ 1 125 899 906 842 624
exbi Ei 2⁶⁰ exa: (10³)⁶ 1 152 921 504 606 846 976
zebi Zi 2⁷⁰ zetta: (10³)⁷ 1 180 591 620 717 411 303 424
yobi Yi 2⁸⁰ yobi: (10³)⁸ 1 208 925 819 614 629 174 706 176

2.9.3 Floating points

Floating point literals are described by the following lexical definitions:

Digit    = "0"…"9" .
Digits   = Digit { ["_"] Digit } .
Float    = "~" Digits ["." Digits] Exponent .
Exponent = ("e" | "E") ["+" | "-" | "−"] Digit { Digit } .

Floating-point Types

Type Precision Bits IEEE 754 sn exp sig
Float16 half 16 binary16 1 5 11
Float32 single 32 binary32 1 8 23
Float64 double 64 binary64 1 11 52
Float128 quad 128 binary128 1 15 112
Float256 octuple 256 binary256 1 19 236

IEEE 754 IEEE 754 Standard

Floating-point zero

Floating-point numbers have two zeros, positive zero and negative zero. They are equal to each other but have different binary representations.

wit> ~+0.0e0 == ~-0.0e0
true

wit> ~+0.0e0.bits
"0000000000000000000000000000000000000000000000000000000000000000"

wit> ~-0.0e0.bits
"1000000000000000000000000000000000000000000000000000000000000000"

wit> ~210.0e0
~2.1e2

Special floating-point values

There are three specified standard floating-point values that do not correspond to any point on the real number line:

Float16 Float32 Float64 Float128 Float256 Name Description
Inf16 Inf32 Inf, ∞ Inf128 Inf256 positive infinity A value greater than all finite floating-point values
−Inf16 −Inf32 −Inf, −∞ -Inf128 -Inf256 negative infinity A value less than all finite floating-point values
NaN16 NaN32 NaN NaN128 NaN256 not a number A value not equal to any floating-point value (including itself)

For further discussion of how these non-finite floating-point values are ordered with respect to each other and other floats, see Numeric Comparisons. By the IEEE 754 standard, these floating-point values are the results of certain arithmetic operations.

wit> 1 / Inf
~0.0e0

wit> ~1.0 / 0
Inf

wit> −~1.0 / 0
−Inf

wit> ~0.1 / 0
Inf

wit> ~0.0 / 0
NaN

wit> ~1.0 + Inf
Inf

wit> ~1.0 − Inf
−Inf

wit> Inf + Inf
Inf

wit> Inf − Inf
NaN

wit> Inf * Inf
Inf

wit> Inf / Inf
NaN

wit> ~0.0 * Inf
NaN

The #min and #max methods are available for floating-point types:

wit> (Float16.min, Float16.max)
(−Inf16, Inf16)

wit> (Float32.min, Float32.max)
(−Inf32, Inf32)

wit> (Float64.min, Float64.max)
(−Inf, Inf)
(−∞, ∞)

Machine epsilon

Most real numbers cannot be represented exactly with floating-point numbers, and so for many purposes it is important to know the distance between two adjacent representable floating-point numbers, which is often known as machine epsilon.

Tungsten provides .eps, which gives the distance between 1.0 and the next larger representable floating-point value:

wit> Float32.eps
~1.1920929e-7

wit> Float64.eps
~2.220446049250313e-16

These values are ~2.0^-23 and ~2.0^-52 as Float32 and Float64 values, respectively. The #eps method is also available on instances of floating-point numbers and gives the absolute difference between that value and the next representable floating-point value. That is, x.eps yields a value of the same type as x such that x + x.eps is the next representable floating-point value larger than x:

wit> ~1.0.eps
~2.220446049250313e-16

wit> ~1000.0.eps
~1.1368683772161603e-13

wit> ~1e-27.eps
~1.793662034335766e-43

wit> ~0.0.eps
~5.0e-324

The distance between two adjacent representable floating-point values is not constant, but is smaller for smaller values and larger for larger values. In other words, the representable floating-point numbers are densest in the real number line near zero, and grow sparser exponentially as one moves farther away from zero. By definition, 1.0.eps is the same as Float64.eps since 1.0 is a 64-bit floating-point value.

Tungsten also provides the #next and #prev methods which return the next larger or smaller representable floating-point number to the receiver, respectively:

wit> x = ~1.25e0
~1.25e0

wit> x.next
~1.2500001e0

wit> x.prev
~1.2499999e0

wit> x.prev.bits
"00111111100111111111111111111111"

wit> x.bits
"00111111101000000000000000000000"

wit> x.next.bits
"00111111101000000000000000000001"

This example highlights the general principal that the adjacent representable floating-point numbers also have adjacent binary integer representations.

Rounding modes

If a number doesn't have an exact floating-point representation, it must be rounded to an appropriate representable value, however, if wanted, the manner in which this rounding is done can be changed according to the rounding modes presented in the IEEE 754 standard:

wit> ~1.1e0 + ~1.0e-1
~1.2000000000000002

wit> with_rounding(Float64, :round_down) -> ~1.1e0 + ~1.0e-1
~1.2

The default mode used is always :round_nearest, which rounds to the nearest representable value, with ties rounded towards the nearest value with an even least significant bit.

Background and References

Floating-point arithmetic entails many subtleties which can be surprising to users who are unfamiliar with the low-level implementation details. However, these subtleties are described in detail in most books on scientific computation, and also in the following references:

2.9.4 Imaginary literals

Imaginary literals are described by the following lexical definitions:

Imaginary = Float "i" .

An imaginary literal yields a complex number with a real part of 0.0. Complex numbers are represented as a pair of floating point numbers and have the same restrictions on their range. To create a complex number with a nonzero real part, add a floating point number to it, e.g., (3 + 4i).

2.9.5 Literal zero and one

Tungsten provides methods which return literal 0 and 1 corresponding to a specified type or the type of a given variable.

| Method | Description                                      |
| ------ | ------------------------------------------------ |
| x.zero | Literal zero of type `x` or type of variable `x` |
| x.one  | Literal one of type `x` or type of variable `x`  |

Examples:

wit> Float32.zero
~0.0e0

wit> ~1.0.zero
~0.0

wit> Int32.one
1

wit> BigFloat.one
~1e+00 with 256 bits of precision

2.10 Temporal, network, and structured literals

Besides strings and numbers, the lexer recognizes several families of domain literals — colors, dates, network addresses, and durations — each as a single token that yields a value of a dedicated built-in type. Because these forms overlap syntactically with comments, subtraction, and method chains, the lexer disambiguates them by strict adjacency (no interior spaces) and, in most cases, by digit count.

Note: The reference interpreter is the authoritative implementation of these literals. The self-hosted native compiler recognizes the common cases but does not yet lex every form — MAC addresses, microsecond and ISO-8601 durations, and the fully-expanded (non-::) IPv6 form are reference-interpreter-only — and it validates digit counts and octet ranges without range-checking calendar or clock fields. Divergences are noted per form below.

2.10.1 Color literals

A color literal is a # immediately followed by exactly three, four, six, or eight hexadecimal digits, and not followed by a further hexadecimal digit or identifier character.

Hex   = "0"…"9" | "a"…"f" | "A"…"F" .
Color = "#" ( Hex Hex Hex
            | Hex Hex Hex Hex
            | Hex Hex Hex Hex Hex Hex
            | Hex Hex Hex Hex Hex Hex Hex Hex ) .

The three- and four-digit forms are shorthand: each nibble is doubled (#RGB becomes #RRGGBB, #RGBA becomes #RRGGBBAA). Six digits are RRGGBB; eight are RRGGBBAA. A literal with no alpha channel is fully opaque (α = 255).

#FF0000       # => Color [255, 0, 0, 255]
#F00          # => Color [255, 0, 0, 255]   shorthand
#FF000080     # => Color [255, 0, 0, 128]   with alpha
#F008         # => Color [255, 0, 0, 136]   shorthand with alpha

Because a color begins with # — the comment character (§2.5) — the two are told apart by what follows: a run of exactly 3, 4, 6, or 8 hexadecimal digits not glued to further word characters is a color; every other #… is a comment. Thus #FF (two digits) and #FFFFF (five) are comments, and #FF0000abcd is a comment because the trailing letters run past eight hex digits.

Token: COLOR. Runtime type: Color.

2.10.2 Date and month literals

A date literal is a four-digit year, a hyphen, and then either a two-digit month with a two-digit day (a calendar date) or a three-digit ordinal day-of-year.

Year    = Digit Digit Digit Digit .
Month   = Digit Digit .
Day     = Digit Digit .
Ordinal = Digit Digit Digit .
Date    = Year "-" ( Month "-" Day | Ordinal ) .
MonthOf = Year "-" Month .                # no day component

YYYY-MM-DD    # => Date     calendar date
YYYY-DDD      # => Date     ordinal day-of-year
YYYY-MM       # => Month    year and month

A year followed by - and a two-digit month but no -DD yields a Month value rather than a Date.

Disambiguation from subtraction. The date scanner fires only when each hyphen is immediately adjacent to the digits on both sides. YYYY-MM-DD is a date; YYYY - MM - DD, with spaces around the operators, is integer subtraction (§2.3).

Compiler divergence: the reference interpreter range-checks the fields (months 0112, days 0131, ordinals 001366) and additionally accepts ISO week dates such as YYYY-Www-D; the native compiler checks only digit counts, so it accepts an out-of-range date like YYYY-99-99.

Token: DATE, or MONTH for the day-less form. Runtime types: Date, Month.

2.10.3 DateTime literals

A datetime literal is a calendar date, the letter T, and a time. Hours and minutes are required; seconds, fractional seconds, and a timezone are optional.

Time     = Hour ":" Minute [ ":" Second [ "." Fraction ] ] [ Zone ] .
Zone     = "Z" | ( "+" | "-" ) Hour [ ":" Minute ] .
DateTime = Date "T" Time .

YYYY-MM-DDT14:30            # date and time, no zone
YYYY-MM-DDT14:30:00Z        # UTC
YYYY-MM-DDT09:00:00-08:00   # with offset
YYYY-MM-DDT14:30:00.500+05:30

Compiler divergence: the reference interpreter range-checks the clock fields (hours 0023, minutes 0059, seconds 0060 for leap seconds), caps fractional seconds at three digits, and accepts the 24:00 end-of-day form; the native compiler checks only digit counts on the time and leaves the fraction unbounded.

Token: DATETIME.

2.10.4 IP-address literals

An IPv4 literal is four dot-separated octets, each in the range 0–255, with an optional :port (0–65535).

Octet = Digit [ Digit [ Digit ] ] .       # value 0…255
Port  = Digits .                          # value 0…65535
IPv4  = Octet "." Octet "." Octet "." Octet [ ":" Port ] .

192.168.1.1
10.0.0.1:8080     # with port
255.255.255.0

The IPv4 scanner runs before the floating-point path: a one-to-three-digit integer ≤ 255 immediately followed by . and a digit begins an address attempt, which succeeds only when exactly four octets are present. A three-part form such as 1.2.3 therefore backtracks to a decimal 1.2 followed by .3 (see §2.10.8).

Both engines also recognize IPv6 literals (RFC 5952) in ::-compressed form (::1, 2001:db8::1, bare ::) and the IPv4-mapped form (::ffff:1.2.3.4); the native compiler prints them fully expanded (::10:0:0:0:0:0:0:1). Per RFC 5952 §4.3 an IPv6 literal must be lowercase: fe80::1 is an address, FE80::1 is not — reserving an uppercase leading letter for class references (Tungsten:JSON). An IPv6 literal also never follows a word character, so Foo::Bar stays a name/scope form, not an address. Reference-interpreter-only: the fully-expanded input form with no :: (2001:db8:0:0:0:0:0:1) — which the compiler leaves as colon-separated fragments to avoid mis-lexing hash keys and namespaces — plus zone identifiers, bracketed-with-port forms, and MAC addresses.

Token: IP4 / IP6. Runtime types: IPv4, IPv6.

2.10.5 CIDR literals

A CIDR literal is an IPv4 address, a slash, and a prefix length of 0–32.

CIDR = IPv4 "/" Prefix .                   # Prefix 0…32

10.0.0.0/8
192.168.0.0/24
0.0.0.0/0

A prefix greater than 32 is not a CIDR; the /prefix is left as a division operator applied to the address.

Both engines also recognize IPv6 CIDR (2001:db8::/32, ::/0, prefix 0–128) for the ::-compressed address forms.

Token: CIDR4 (reference-only CIDR6). Runtime type: CIDR.

2.10.6 Duration literals

A duration literal is a compact sequence of number-and-unit components in descending order of magnitude, drawn from y, mo, w, d, h, m, s, ms, and ns.

Unit     = "y" | "mo" | "w" | "d" | "h" | "m" | "s" | "ms" | "ns" .
Duration = ( Digits Unit ) { Digits Unit } .   # components largest to smallest

5m30s        # 5 minutes, 30 seconds
2h30m
1y2mo3d
500ms        # a single component is a duration only for ms, ns, mo

Two or more components always form a duration. A single component whose unit is ambiguous with a unit-of-measurement (y, w, d, h, m, s) is instead a Quantity (§2.9.2); only the unambiguous single units ms, ns, and mo form a one-component duration. Components must be written from largest to smallest.

Reference-interpreter-only: microsecond durations (µs / μs) and ISO-8601 durations (P1Y2M3DT4H5M6S, PT1.5H, P3W) are recognized by the reference lexer only.

Angle literals combining degrees, arcminutes, and arcseconds — 40°20′50″ — are not recognized: ° is a unit character (§2.9.2), but and are not lexed.

Token: DURATION. Runtime type: Duration.

2.10.7 UUID literals

A UUID literal is the canonical hyphenated 8-4-4-4-12 hexadecimal form, with a version nibble of 1–8 and an RFC 4122 variant nibble.

Version = "1"…"8" .
Variant = "8" | "9" | "a" | "A" | "b" | "B" .
UUID    = Hex⁸ "-" Hex⁴ "-" Version Hex³ "-" Variant Hex³ "-" Hex¹² .

550e8400-e29b-41d4-a716-446655440000

Both engines recognize UUID literals. A sequence that is not a valid UUID — wrong field lengths, or a version nibble outside 1–8 — is instead read as hexadecimal integers joined by - operators.

Token: UUID. Runtime type: UUID.

2.10.8 A note on version-like sequences

Tungsten has no version or semantic-version literal. A sequence such as 1.2.3 is read by the ordinary number machinery as a decimal 1.2 followed by .3 (a call to member 3), and cannot form an IPv4 address because that path requires four octets (§2.10.4).

2.11 Boolean literals

Tungsten represents boolean values with two objects literals: true and false.

Boolean = True | False .
True    = "true"  | "on"  | yes" .
False   = "false" | "off" | "no" .

Token: BOOLEAN

2.12 Nil literal

The Nil type has only one possible value: nil.

Nil = "nil" .

Token: NIL

2.13 Regular expression literals

Regex = "/" Characters "/" .

Regular expressions participate in pattern matching through =~ and through regex arms in case expressions.

On a successful match, $1, $2, ... denote the corresponding parenthesized capture groups. Capture variables are scoped to the same statement as the regex literal that introduced the match. A newline or semicolon ends the capture-variable lexical scope.

Examples:

if /^--(.+)=(.+)$/ =~ arg then [$1.to_sym, $2]

case arg
  /^--(.+)=(.+)$/ => [$1.to_sym, $2]

2.14 Collection literals

2.14.1 Tuples

Tungsten tuple literals are described by the following lexical definitions:

Tuple = "(" Expression { "," Expression } ")" .

2.14.2 Arrays

Tungsten array literals are described by the following lexical definitions:

Array = "[" Expression { "," Expression } "]" .

2.14.3 Hashes

Tungsten hash literals are described by the following lexical definitions:

Hash  = "{" Pair { "," Pair } "}" .
Pair  = Key ":" Expression | '"' Key '"' Space ":" Space Expression .
Key   = .
Space = " " { " " } .

Examples

hash = { one: 1, two: 2 }
hash = { "one" : 1, "two" : 2 }
hash = {
  one: 1
  two: 2
}

2.14.4 Sets

Set = "<(" Expression { "," Expression } ")>" .

2.14.5 Multisets

Multiset = "<{" Expression { "," Expression } "}>" .

2.14.6 Word and symbol arrays

Two percent-literal forms build arrays of short strings or symbols without quotes or commas. Only the [ ] delimiter is accepted, and elements are separated by whitespace (spaces, tabs, or newlines).

WordArray   = "%w[" { Whitespace Word } Whitespace "]" .
SymbolArray = "%i[" { Whitespace Word } Whitespace "]" .

%w[red green blue]     # => ["red", "green", "blue"]
%i[get post put]       # => [:get, :post, :put]

Multi-line forms are allowed, with newlines acting as separators. There is no escape mechanism, so an element cannot itself contain ].

2.15 Operators and delimiters

The following character sequences are operators and/or punctuation:

. , ; : .. ... … ` ! @ # $ ? + - * / % ** // %% ^^ -- ++ ~~ && || ~ & | ^ <- -> => #-> #->>
= == === !== != ≠ =~ !~ !~~ < > <= >= ≤ ≥ <=> += -= /= *= %= ^= &= |= ~= &&= ||=
{ } ( ) [ ] << >> <" "> <[ ]> <( )>
→ ←

Certain symbols serve more than one purpose in the grammar.

The augmented assignment operators, serve lexically as delimiters, but also perform an operation.

Any printing ASCII character not listed above as an operator, delimiter, or literal introducer is unused by Tungsten; its occurrence outside string literals and comments is an error.

A physical line is a sequence of characters terminated by an end-of-line sequence. In source files, any of the standard platform line termination sequences can be used – Unix (LF), Windows (CR LF), or the old Macintosh (CR). All line termination sequences can be used interchangeably, regardless of platform.