2. Lexical Analysis
A Tungsten program is read by a lexical analyzer, or lexer, which converts an input stream of Unicode characters into a stream of tokens. If more than one token can match a sequence of characters in the source file, the lexer will form the longest possible lexical element. The stream of tokens output from the lexer is processed by a parser. Some tokens are discarded after processing.
This chapter describes how the lexical analyzer breaks a file into tokens.
Besides NL, SP, INDENT, and DEDENT; the following categories of tokens exist: identifiers, keywords, literals, operators, delimiters, and comments.
When defining lexical syntax, all whitespace is described explicitly.
2.1 Source code
Tungsten code must be encoded as UTF-8.
Tungsten code may exist in source files, or passed as a string to the eval function. In either case, the code is a sequence of Unicode characters processed by a lexer. Lexical analysis of the character stream, according to the grammar defined in this chapter, results in a stream of tokens. These tokens form the input of the parser grammar defined in later chapters of this specification.
If a file cannot be decoded as UTF-8, an Encoding<Error> must be raised.
Note: a UTF-8 byte order mark (BOM) U+FEFF ZERO WIDTH NO-BREAK SPACE may be the first character present, but is neither required nor recommended.
Text in source files must not be canonicalized or normalized by the lexer. For simplicity, this document will use the unqualified term character to refer to a single Unicode code point.
Tungsten source files may have the following extensions: .w, .wc, .ws, .wd.
2.2 Line structure
A Tungsten program is divided into one or more logical lines.
2.2.1 Logical lines
The end of a logical line is represented by the token NL. Statements cannot cross logical line boundaries except where NL is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules.
2.2.2 Physical lines
A physical line is a sequence of characters terminated by a line terminator. In Tungsten, the only valid line terminators are the U+000A LINE FEED character and the end of file (or end of input).
NL = (U+0A | EOF) .
These must not be recognized as line terminators:
VT:U+000B LINE TABULATIONFF:U+000C FORM FEEDCR:U+000D CARRIAGE RETURNLS:U+2028 LINE SEPARATORPS:U+2029 PARAGRAPH SEPARATORNEL:U+0085 NEXT LINECR LF:CRfollowed byLFLF CR:LFfollowed byCR
Tungsten does not impose any limits on the length of a line.
2.2.3 Explicit line joining
When a physical line begins with zero or more spaces followed by a period ^\s*\. it will be joined with the preceding logical line, removing the whitespace and any comments in between.
# This code
list.select &.nonzero?
.uniq # only one of each
.sort
# will be interpreted as
list.select(&:nonzero?).uniq.sort
Note: Joining lines with a backslash is not supported as it frequently results in hard to read code.
2.2.4 Implicit line joining
Expressions contained within the following pairs can be split over more than one physical line:
( … ) parentheses
[ … ] square brackets
{ … } curly braces
<[ … ]> angle square pairs
<( … )> angle parentheses
<< … >> double angle brackets
%i[ ] array of symbols
%w[ ] array of words
%wc[ ] array of words for case
months = [ 'January', 'February', 'March' # List of month names
, 'April', 'May', 'June'
, 'July', 'August', 'September'
, 'October', 'November', 'December'
]
Implicitly continued lines can be commented. The indentation of the continuation lines is not important. Blank continuation lines are allowed. There is no NL token between implicitly continued lines.
2.2.5 Blank Lines
A physical line that contains only whitespace with an optional comment is ignored, i.e., no NL token is generated.
2.2.6 End of File
Tungsten source is terminated by whichever comes first:
- Physical end of file
- U+0000
- U+001A
An EOF token is used to indicate the end of file.
2.3 Whitespace
Tungsten's grammar is more particular about whitespace than most other languages.
SP = "\U{20}" .
One or more spaces (U+0020 SPACE) are collapsed into a single SP token. In many places, the grammar is disambiguated by adding whitespace.
Example: 10m/s^2 is a decimal literal defining an amount of acceleration, 10m/s ^ 2 means the 2nd power of 10m/s.
Infix operators must be surrounded by whitespace characters.
Tab characters (U+0009 CHARACTER TABULATION) are only allowed within string literals.
2.4 Indentation
Indentation in source files must be two spaces. Lines in the same scope must have the same indent. Changes in indentation produce INDENT and DEDENT tokens.
2.5 Comments
A comment starts with an unquoted hash character # followed by a space or bang, and terminates at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules apply. Comments are ignored by the syntax; they do not emit tokens.
Comment = "#" (SP | "!") { ~ NL } NL .
2.6 Preprocessing Directives
Preprocessing directives are governed by tokens described by the following lexical definition:
Letter = "A"…"Z" | "_" .
Token = "#" Letter { Letter } .
Boolean = 'true' | 'false' .
Rule = Token "=" Boolean .
Note: Preprocessing tokens beginning with W_ are reserved for use by the implementation.
puts "starting [Time.now]" #W_DEBUG
puts "loaded [file]" #W_VERBOSE
# TODO: Refine this syntax
#[development]
#![profile]
2.7 Identifiers and Keywords
Identifiers (also referred to as names) are described by the following lexical definitions.
The syntax of identifiers in Tungsten is based on [UAX #31: Unicode Identifier and Pattern Syntax][tr31], with elaboration and changes as defined below:
Within the ASCII range U+0001…U+007F, the valid characters for identifiers are the uppercase letters A…Z, the lowercase letters a…z, the underscore _ and, except as an identifier start, the digits 0…9.
Identifiers are unlimited in length. Case is significant.
Identifier = XID_Start { XID_Continue } .
ID_Start = (* all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property *) .
ID_Continue = (* all characters in ID_Start, plus characters in the categories Mn, Mc, Nd, Pc, and others with the Other_ID_Continue property *) .
XID_Start = (* all characters in ID_Start whose NFKC normalization is in "ID_Start { XID_Continue }" *) .
XID_Continue = (* all characters in ID_Continue whose NFKC normalization is in "ID_Continue" *) .
Token: ID
The Unicode category codes mentioned above stand for:
* Lu uppercase letters
* Ll lowercase letters
* Lt titlecase letters
* Lm modifier letters
* Lo other letters
* Nl letter numbers
* Mn nonspacing marks
* Mc spacing combining marks
* Nd decimal numbers
* Pc connector punctuations
* Other_ID_Start explicit list of characters in PropList.txt to support backwards compatibility
* Other_ID_Continue likewise
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
Characters in the category Currency_Symbol (Sc) are reserved for use by decimal literals.
2.7.1 Keywords
The following identifiers are used as reserved words, or keywords of the language, and cannot be used as identifiers.
They must be spelled exactly as written here:
break
case continue
else elsif exit
false
if in
next nil
raise redo rescue retry return
self super
trait true
unless until use
when while
yield
__DIR__
__FILE__
__LINE__
__METHOD__
__MODULE__
The following tokens are reserved for future expansion of the Tungsten language:
asm async await
macro
of out
ptr
secret sync
type
uniq
with
mut mod freeze
safe unsafe
always and as assert assigns at
bad begin by
class compare
do
end ensure error every export extends extern
fn for from
is
ln
module
noop not
or
private protected public
reraise rm
then
abort abstract alias align always args asm assert assigns async atomic auto await
base begin binding bitstype body bool byte bytetype
cache cast catch char clone compile const continue
debug default defer deferred defined? del delegate delete delta deprecated done dynamic
eager ensure enum eps eval event every except exec exit export external
factory fail fallthrough field final finally for foreach foreign from function
get global goto guard
immutable implements implicit import imports include inherit inline interface internal invariant involatile item
lambda lazy let library load local loop
macro map match me mixin mutable
namespace new none nothrow null
object of on operator out override
package packed parallel parse part perform pragma privately proc property pub pure
raises range record ref repeat require restrict resume rethrow
safe scope sealed set shadow shared sizeof static struct suspend switch sync synchronized
template test this throw throws to trait transient trap try type typealias typedef typeof
undef undefined union unreachable unsafe use using
val var version void volatile
with without
INFINITY ∞
NAN
quietly
on off
yes no
good bad
# maybe
done
elif extern
# magic methods
@@caller
@@message
# Unused Ruby keywords
BEGIN
END
__ENCODING__
__END__
Backquote-enclosed strings can be used if you really need to use a reserved word as an identifier.
crop.`yield` * 1⋅bushel⋅acre⁻¹
2.7.2 Context-Dependent Constants
| Constant | Description |
|---|---|
__DIR__ |
The directory name of the current script |
__FILE__ |
The file name of the current script |
__LINE__ |
The number of the current line in the file |
2.7.3 Reserved classes of identifiers
@todo Python reserves _*, __*__, and __* http://docs.python.org/3.4/reference/lexical_analysis.html
2.8 String literals
Literals are notations for constant values of some built-in types.
2.8.1 Characters
A Character represents one Unicode code point.
Character literals are described by the following lexical definition:
Hex = "0"…"9" | "A"…"F" .
LiteralCharacter = "U+" [Hex] [Hex] Hex Hex Hex Hex .
Examples:
# Range of allowed code points
U+0000…U+10FFFF
# LATIN CAPITAL LETTER A
U+0041
U+000041
U+0041.class
=> CodePoint
U+0041 == "\u{41}".codepoints.first == "A".codepoints.first
A character may also be written with a :- prefix followed by a single character, or a backslash escape. The value of the literal is the code point of that character:
:-) # => the character ")" (code point 41)
:-( # => the character "(" (code point 40)
:-A # => the character "A" (code point 65)
:-\n # => LINE FEED (code point 10)
The :- must be followed by a non-whitespace character; :- (with a space) is not a character literal. This is a general character-literal form, not a fixed table of emoticons — :-X always denotes the character X. The recognized backslash escapes are \0, \n, \r, \t, \s, \\, \', and \".
2.8.2 Strings
A String represents a sequence of Unicode code points.
String literals are described by the following lexical definitions:
LiteralString = '"' { StringItem } '"' .
StringItem = StringChar | StringEscape | StringExp .
StringChar = (* any Unicode character except "\" or newline or '"' *) .
Character = "\U{00}"…"\U{10FFFF}" .
StringExp = "[" Expression "]" .
StringEscape =
| "\0"
| "\a"
| "\b"
| "\c"
| "\e"
| "\f"
| "\l"
| "\n"
| "\r"
| "\s"
| "\t"
| "\v"
| "\""
| "\'"
| "\["
| "\\"
| "\x" Hex Hex
| "\o" Octal Octal Octal
| "\u" Hex Hex Hex Hex
| "\U[" { Hex Hex [Hex Hex] [Hex Hex] } "]"
| "\N[" Characters { "," Characters } "]"
| "\P[" Characters [ "=" Characters ] "]"
| "\" Character
.
String literals are delimited by matching double quotes. The backslash character is used to escape characters that are unprintable or otherwise have special meaning, such as a newline, backslash itself, or the double quote character.
The following escape sequences are recognized:
| Escape sequence | Control | Unicode | Abbr | Character name | Description, C0 of ISO 646 |
|---|---|---|---|---|---|
\0 |
\@ |
U+0000 | NUL | NULL | A control character used to accomplish media-fill or time-fill. Null characters may be inserted into or removed from a stream of data without affecting the information content of that stream. But then the addition or removal of these characters may affect the information layout and/or the control of equipment. |
\^A |
U+0001 | SOH | START OF HEADING | A transmission control character used as the first character of a heading of an information message. | |
\^B |
U+0002 | STX | START OF TEXT | A transmission control character which precedes a text and which is used to terminate a heading. | |
\^C |
U+0003 | ETX | END OF TEXT | A transmission control character which terminates a text. | |
\^D |
U+0004 | EOT | END OF TRANSMISSION | A transmission control character used to indicate the conclusion of the transmission of one or more texts. | |
\^E |
U+0005 | ENQ | ENQUIRY | A transmission control character used as a request for a response from a remote station; the response may include station identification and/or station status. When a "Who are you" function is required on the general switched transmission network, the first use of ENQ after the connection is established shall have the meaning "Who are you" (station identification). Subsequent use of ENQ may, or may not, include the function "Who are you", as determined by agreement. | |
\^F |
U+0006 | ACK | ACKNOWLEDGE | A transmission control character transmitted by a receiver as an affirmative response to the sender. | |
\a |
\^G |
U+0007 | BEL | ALERT | A control character that is used when there is a need to call for attention; it may control alarm or attention devices. |
\b |
\^H |
U+0008 | BS | BACKSPACE | A format effector which moves the active position one character position backwards on the same line. |
\t |
\^I |
U+0009 | TAB | CHARACTER TABULATION | A format effector which advances the active position to the next pre-determined character position on the same line. |
\n, \l |
\^J |
U+000A | LF | LINE FEED | A format effector which advances the active position to the same character position of the next line. |
\v |
\^K |
U+000B | VT | LINE TABULATION | A format effector which advances the active position to the same character position on the next pre-determined line. |
\f |
\^L |
U+000C | FF | FORM FEED | A format effector which advances the active position to the same character position on a pre-determined line of the next form or page. |
\r, \c |
\^M |
U+000D | CR | CARRIAGE RETURN | A format effector which moves the active position to the first character position on the same line. |
\^N |
U+000E | SO | SHIFT OUT | A control character which is used in conjunction with SHIFT IN and ESCAPE to extend the graphic character set of the code. It may alter the meaning of octets 33 - 126 (dec). The effect of this character when using code extension techniques is described in International Standard ISO 2022. | |
\^O |
U+000F | SI | SHIFT IN | A control character which is ued in conjunction with SHIFT OUT and ESCAPE to extend the graphic character set of the code. It may reinstate the standard meanings of the octets which follow it. The effect of this character when using code extension techniques is described in International Standard ISO 2022. | |
\^P |
U+0010 | DLE | DATA LINK ESCAPE | A transmission control character which will change the meaning of a limited number of contiguously following characters. It is used exclusively to provide supplementary data transmission control functions. Only graphic characters and trensmission control characters can be used in DLE sequences. | |
\^Q |
U+0011 | DC1 | DEVICE CONTROL ONE | A device control character which is primarily intended for turning on or starting an ancillary device. If it is not required for this purpose, it may be used to restore a device to the basic mode of operation (see also DC2 and DC3), or for any other device control function not provided by other DCs. | |
\^R |
U+0012 | DC2 | DEVICE CONTROL TWO | A device control character which is primarily intended for turning on or starting an ancillary device. If it is not required for this purpose, it may be used to set a device to a special mode of operation (in which case DC1 is used to restore normal operation), or for any other device control function not provided by other DCs. | |
\^S |
U+0013 | DC3 | DEVICE CONTROL THREE | A device control character which is primarily intended for turning off or stopping an ancillary device. This function may be a secondary level stop, for example, wait, pause, stand-by or halt (in which case DC1 is used to restore normal operation). If it is not required for this purpose, it may be used for any other device control function not provided by other DCs. | |
\^T |
U+0014 | DC4 | DEVICE CONTROL FOUR | A device control character which is primarily intended for turning off, stopping or interrupting an ancillary device. If it is not required for this purpose, it may be used for any other device control function not provided by other DCs. | |
\^U |
U+0015 | NAK | NEGATIVE ACKNOWLEDGE | A transmission control character transmitted by a receiver as a negative response to the sender. | |
\^V |
U+0016 | SYN | SYNCHRONOUS IDLE | A transmission control character used by a synchronous transmission system in the absence of any other character (idle condition) to provide a signal from which synchronism may be achieved or retained between data terminal equipment. | |
\^W |
U+0017 | ETB | END OF TRANSMISSION BLOCK | A transmission control character used to indicate the end of a transmission block of data where data is divided into such blocks for transmission purposes. | |
\^X |
U+0018 | CAN | CANCEL | A character, or the first character of a sequence, indicating that the data preceding it is in error. As a result, this data is to be ignored. The specific meaning of this character must be defined for each application and/or between sender and recipient. | |
\^Y |
U+0019 | EM | END OF MEDIUM | A control character that may be used to identify the physical end of a medium, or the end of the used portion of a medium, or the end of the wanted portion of data recorded on a medium. The position of this character does not necessarily correspond to the physical end of the medium. | |
\^Z |
U+001A | SUB | SUBSTITUTE | A control character used in the place of a character that has been found to be invalid or in error. SUB is intended to be introduced by automatic means. | |
\e |
\^[ |
U+001B | ESC | ESCAPE | A control character which is used to provide additional control functions. It alters the meaning of a limited number of contiguously following bit combinations. The use of this character is specified in Internation Standard ISO 2022. |
\^\ |
U+001C | FS | INFORMATION SEPARATOR FOUR | A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a file. | |
\^] |
U+001D | GS | INFORMATION SEPARATOR THREE | A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a group. | |
\^^ |
U+001E | RS | INFORMATION SEPARATOR TWO | A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a record. | |
\^_ |
U+001F | US | INFORMATION SEPARATOR ONE | A control character used to separate and qualify data logically; its specific meaning has to be specified for each application. If this character is used in hierarchical order, it delimits a data item called a unit. | |
\s |
U+0020 | SP | SPACE | ||
\" |
U+0022 | QUOTATION MARK | |||
\' |
U+0027 | APOSTROPHE | |||
\[ |
U+005B | LEFT SQUARE BRACKET | |||
\\ |
U+005C | REVERSE SOLIDUS | |||
\` |
U+0060 | GRAVE ACCENT | |||
\d |
\? |
U+007F | DEL | DELETE | |
[expression] |
Interpolate value of expression | ||||
\oddd |
Unicode codepoint with octal value 'ddd' | ||||
\xhh |
Unicode codepoint with hex value 'hh' | ||||
\uhhhh |
Unicode codepoint with hex value 'hhhh' | ||||
\U[xx xxxx xxxxxx] |
1 or more Unicode codepoints, by hex value | ||||
\N[NAME 1, NAME 2] |
1 or more Unicode codepoints, by name | ||||
\P[prop=value] |
1 or more Unicode codepoints, by property | ||||
\x |
x |
2.8.3 ASCII-only Strings
An ASCII string represents a sequence of ASCII characters.
ASCII string literals are described by the following lexical definitions:
LiteralASCII = "'" { ASCIIPart } "'" .
ASCIIPart = ASCIIChar | ASCIIEscape | ASCIIExp .
ASCIIChar = (* any ASCII character except "\" or newline or "'" *) .
Character = "\U{00}"…"\U{10FFFF}" .
ASCIIExp = "[" Expression "]" .
ASCIIEscape =
| "\0"
| "\a"
| "\b"
| "\c"
| "\e"
| "\f"
| "\l"
| "\n"
| "\r"
| "\s"
| "\t"
| "\v"
| "\""
| "\'"
| "\["
| "\\"
| "\x" Hex Hex
| "\o" Octal Octal Octal
| "\" Character
.
|b5 b6 b7 ---------> |000|001|010|011|100|101|110|111|
|b4 |b3 |b2 |b1 |r\c | - | - | - | - | - | - | - | - |
| 0 | 0 | 0 | 0 | 0 |NUL|DLE|SP | 0 | @ | P | ` | p |
| 0 | 0 | 0 | 1 | 1 |SOH|DC1| ! | 1 | A | Q | a | q |
| 0 | 0 | 1 | 0 | 2 |STX|DC2| " | 2 | B | R | b | r |
| 0 | 0 | 1 | 1 | 3 |ETX|DC3| # | 3 | C | S | c | s |
| 0 | 1 | 0 | 0 | 4 |EOT|DC4| $ | 4 | D | T | d | t |
| 0 | 1 | 0 | 1 | 5 |ENQ|NAK| % | 5 | E | U | e | u |
| 0 | 1 | 1 | 0 | 6 |ACK|SYN| & | 6 | F | V | f | v |
| 0 | 1 | 1 | 1 | 7 |BEL|ETB| ' | 7 | G | W | g | w |
| 1 | 0 | 0 | 0 | 8 | BS|CAN| ( | 8 | H | X | h | x |
| 1 | 0 | 0 | 1 | 9 | HT| EM| ) | 9 | I | Y | i | y |
| 1 | 0 | 1 | 0 | 10 | LF|SUB| * | : | J | Z | j | z |
| 1 | 0 | 1 | 1 | 11 | VT|ESC| + | ; | K | [ | k | { |
| 1 | 1 | 0 | 0 | 12 | FF| FS| , | < | L | \ | l | | |
| 1 | 1 | 0 | 1 | 13 | CR| GS| - | = | M | ] | m | } |
| 1 | 1 | 1 | 0 | 14 | SO| RS| . | > | N | ^ | n | ~ |
| 1 | 1 | 1 | 1 | 15 | SI| US| / | ? | O | _ | o |DEL|
| Binary | Oct | Dec | Hex | Abr | C | Name | |
|---|---|---|---|---|---|---|---|
| 0b000_0000 | 0o000 | 0 | 0x00 | NUL | ^@ | \0 | Null |
| 0b000_0001 | 0o001 | 1 | 0x01 | SOH | ^A | Start of Heading | |
| 0b000_0010 | 0o002 | 2 | 0x02 | STX | ^B | Start of Text | |
| 0b000_0011 | 0o003 | 3 | 0x03 | ETX | ^C | End of Text | |
| 0b000_0100 | 0o004 | 4 | 0x04 | EOT | ^D | End of Transmission | |
| 0b000_0101 | 0o005 | 5 | 0x05 | ENQ | ^E | Enquiry | |
| 0b000_0110 | 0o006 | 6 | 0x06 | ACK | ^F | Acknowledgement | |
| 0b000_0111 | 0o007 | 7 | 0x07 | BEL | ^G | \a | Bell |
| 0b000_1000 | 0o010 | 8 | 0x08 | BS | ^H | \b | Backspace |
| 0b000_1001 | 0o011 | 9 | 0x09 | HT | ^I | \t | Horizontal Tab |
| 0b000_1010 | 0o012 | 10 | 0x0A | LF | ^J | \n | Line Feed |
| 0b000_1011 | 0o013 | 11 | 0x0B | VT | ^K | \v | Vertical Tab |
| 0b000_1100 | 0o014 | 12 | 0x0C | FF | ^L | \f | Form Feed |
| 0b000_1101 | 0o015 | 13 | 0x0D | CR | ^M | \r | Carraige Return |
| 0b000_1110 | 0o016 | 14 | 0x0E | SO | ^N | Shift Out | |
| 0b000_1111 | 0o017 | 15 | 0x0F | SI | ^O | Shift In | |
| 0b001_0000 | 0o020 | 16 | 0x10 | DLE | ^P | Data Link Escape | |
| 0b001_0001 | 0o021 | 17 | 0x11 | DC1 | ^Q | Device Control 1 (often XON) | |
| 0b001_0010 | 0o022 | 18 | 0x12 | DC2 | ^R | Device Control 2 | |
| 0b001_0011 | 0o023 | 19 | 0x13 | DC3 | ^S | Device Control 3 (often XOFF) | |
| 0b001_0100 | 0o024 | 20 | 0x14 | DC4 | ^T | Device Control 4 | |
| 0b001_0101 | 0o025 | 21 | 0x15 | NAK | ^U | Negative Acknowledgement | |
| 0b001_0110 | 0o026 | 22 | 0x16 | SYN | ^V | Synchronous Idle | |
| 0b001_0111 | 0o027 | 23 | 0x17 | ETB | ^W | End of Transmission Block | |
| 0b001_1000 | 0o030 | 24 | 0x18 | CAN | ^X | Cancel | |
| 0b001_1001 | 0o031 | 25 | 0x19 | EM | ^Y | End of Medium | |
| 0b001_1010 | 0o032 | 26 | 0x1A | SUB | ^Z | Substitute | |
| 0b001_1011 | 0o033 | 27 | 0x1B | ESC | ^[ | \e | Escape |
| 0b001_1100 | 0o034 | 28 | 0x1C | FS | ^\ | File Separator | |
| 0b001_1101 | 0o035 | 29 | 0x1D | GS | ^] | Group Separator | |
| 0b001_1110 | 0o036 | 30 | 0x1E | RS | ^^ | Record Separator | |
| 0b001_1111 | 0o037 | 31 | 0x1F | US | ^_ | Unit Separator | |
| 0b010_0000 | 0o040 | 32 | 0x20 | ||||
| 0b010_0001 | 0o041 | 33 | 0x21 | ! | |||
| 0b010_0010 | 0o042 | 34 | 0x22 | " | |||
| 0b010_0011 | 0o043 | 35 | 0x23 | # | |||
| 0b010_0100 | 0o044 | 36 | 0x24 | $ | |||
| 0b010_0101 | 0o045 | 37 | 0x25 | % | |||
| 0b010_0110 | 0o046 | 38 | 0x26 | & | |||
| 0b010_0111 | 0o047 | 39 | 0x27 | ' | |||
| 0b010_1000 | 0o050 | 40 | 0x28 | ( | |||
| 0b010_1001 | 0o051 | 41 | 0x29 | ) | |||
| 0b010_1010 | 0o052 | 42 | 0x2A | * | |||
| 0b010_1011 | 0o053 | 43 | 0x2B | + | |||
| 0b010_1100 | 0o054 | 44 | 0x2C | , | |||
| 0b010_1101 | 0o055 | 45 | 0x2D | - | |||
| 0b010_1110 | 0o056 | 46 | 0x2E | . | |||
| 0b010_1111 | 0o057 | 47 | 0x2F | / | |||
| 0b011_0000 | 0o060 | 48 | 0x30 | 0 | |||
| 0b011_0001 | 0o061 | 49 | 0x31 | 1 | |||
| 0b011_0010 | 0o062 | 50 | 0x32 | 2 | |||
| 0b011_0011 | 0o063 | 51 | 0x33 | 3 | |||
| 0b011_0100 | 0o064 | 52 | 0x34 | 4 | |||
| 0b011_0101 | 0o065 | 53 | 0x35 | 5 | |||
| 0b011_0110 | 0o066 | 54 | 0x36 | 6 | |||
| 0b011_0111 | 0o067 | 55 | 0x37 | 7 | |||
| 0b011_1000 | 0o070 | 56 | 0x38 | 8 | |||
| 0b011_1001 | 0o071 | 57 | 0x39 | 9 | |||
| 0b011_1010 | 0o072 | 58 | 0x3A | : | |||
| 0b011_1011 | 0o073 | 59 | 0x3B | ; | |||
| 0b011_1100 | 0o074 | 60 | 0x3C | < | |||
| 0b011_1101 | 0o075 | 61 | 0x3D | = | |||
| 0b011_1110 | 0o076 | 62 | 0x3E | > | |||
| 0b011_1111 | 0o077 | 63 | 0x3F | ? | |||
| 0b100_0000 | 0o100 | 64 | 0x40 | @ | |||
| 0b100_0001 | 0o101 | 65 | 0x41 | A | |||
| 0b100_0010 | 0o102 | 66 | 0x42 | B | |||
| 0b100_0011 | 0o103 | 67 | 0x43 | C | |||
| 0b100_0100 | 0o104 | 68 | 0x44 | D | |||
| 0b100_0101 | 0o105 | 69 | 0x45 | E | |||
| 0b100_0110 | 0o106 | 70 | 0x46 | F | |||
| 0b100_0111 | 0o107 | 71 | 0x47 | G | |||
| 0b100_1000 | 0o110 | 72 | 0x48 | H | |||
| 0b100_1001 | 0o111 | 73 | 0x49 | I | |||
| 0b100_1010 | 0o112 | 74 | 0x4A | J | |||
| 0b100_1011 | 0o113 | 75 | 0x4B | K | |||
| 0b100_1100 | 0o114 | 76 | 0x4C | L | |||
| 0b100_1101 | 0o115 | 77 | 0x4D | M | |||
| 0b100_1110 | 0o116 | 78 | 0x4E | N | |||
| 0b100_1111 | 0o117 | 79 | 0x4F | O | |||
| 0b101_0000 | 0o120 | 80 | 0x50 | P | |||
| 0b101_0001 | 0o121 | 81 | 0x51 | Q | |||
| 0b101_0010 | 0o122 | 82 | 0x52 | R | |||
| 0b101_0011 | 0o123 | 83 | 0x53 | S | |||
| 0b101_0100 | 0o124 | 84 | 0x54 | T | |||
| 0b101_0101 | 0o125 | 85 | 0x55 | U | |||
| 0b101_0110 | 0o126 | 86 | 0x56 | V | |||
| 0b101_0111 | 0o127 | 87 | 0x57 | W | |||
| 0b101_1000 | 0o130 | 88 | 0x58 | X | |||
| 0b101_1001 | 0o131 | 89 | 0x59 | Y | |||
| 0b101_1010 | 0o132 | 90 | 0x5A | Z | |||
| 0b101_1011 | 0o133 | 91 | 0x5B | [ | |||
| 0b101_1100 | 0o134 | 92 | 0x5C | \ | |||
| 0b101_1101 | 0o135 | 93 | 0x5D | ] | |||
| 0b101_1110 | 0o136 | 94 | 0x5E | ^ | |||
| 0b101_1111 | 0o137 | 95 | 0x5F | _ | |||
| 0b110_0000 | 0o140 | 96 | 0x60 | ` | |||
| 0b110_0001 | 0o141 | 97 | 0x61 | a | |||
| 0b110_0010 | 0o142 | 98 | 0x62 | b | |||
| 0b110_0011 | 0o143 | 99 | 0x63 | c | |||
| 0b110_0100 | 0o144 | 100 | 0x64 | d | |||
| 0b110_0101 | 0o145 | 101 | 0x65 | e | |||
| 0b110_0110 | 0o146 | 102 | 0x66 | f | |||
| 0b110_0111 | 0o147 | 103 | 0x67 | g | |||
| 0b110_1000 | 0o150 | 104 | 0x68 | h | |||
| 0b110_1001 | 0o151 | 105 | 0x69 | i | |||
| 0b110_1010 | 0o152 | 106 | 0x6A | j | |||
| 0b110_1011 | 0o153 | 107 | 0x6B | k | |||
| 0b110_1100 | 0o154 | 108 | 0x6C | l | |||
| 0b110_1101 | 0o155 | 109 | 0x6D | m | |||
| 0b110_1110 | 0o156 | 110 | 0x6E | n | |||
| 0b110_1111 | 0o157 | 111 | 0x6F | o | |||
| 0b111_0000 | 0o160 | 112 | 0x70 | p | |||
| 0b111_0001 | 0o161 | 113 | 0x71 | q | |||
| 0b111_0010 | 0o162 | 114 | 0x72 | r | |||
| 0b111_0011 | 0o163 | 115 | 0x73 | s | |||
| 0b111_0100 | 0o164 | 116 | 0x74 | t | |||
| 0b111_0101 | 0o165 | 117 | 0x75 | u | |||
| 0b111_0110 | 0o166 | 118 | 0x76 | v | |||
| 0b111_0111 | 0o167 | 119 | 0x77 | w | |||
| 0b111_1000 | 0o170 | 120 | 0x78 | x | |||
| 0b111_1001 | 0o171 | 121 | 0x79 | y | |||
| 0b111_1010 | 0o172 | 122 | 0x7A | z | |||
| 0b111_1011 | 0o173 | 123 | 0x7B | { | |||
| 0b111_1100 | 0o174 | 124 | 0x7C | ||||
| 0b111_1101 | 0o175 | 125 | 0x7D | } | |||
| 0b111_1110 | 0o176 | 126 | 0x7E | ~ | |||
| 0b111_1111 | 0o177 | 127 | 0x7F | DEL | ^? | Delete |
ASCII literals are delimited by matching single quotes. The backslash character is used to escape characters that are unprintable or otherwise have special meaning, such as a newline, backslash itself, or the single quote character.
2.8.4 String interpolation
String interpolation uses square brackets: [].
name = "Tungsten"
puts "Hello [name]"
2.8.5 String literal concatenation
Multiple adjacent string literals (delimited by whitespace) are allowed, and their meaning is the same as their concatenation. Thus, "hello" "world" is equivalent to "helloworld". This feature can be used to split long strings or to add comments to parts of strings.
Note: although this feature is defined at the syntactic level, it is implemented at compile time. The "+" operator must be used to concatenate string expressions at run time.
2.8.6 Here documents
HereDocument = "<<-" NAME ... NAME .
IndentedHereDocument = "<<~" NAME ... NAME .
2.8.7 ByteStrings
A Tungsten ByteString represents a sequence of bytes and is described by the following lexical definition:
Hex = "0"…"9" | "A"…"F" | "a"…"f" .
LiteralByteString = "<<" [Hex Hex] { "," Hex Hex } ">>" .
Example:
<<84,117,110,103,115,116,101,110>>
2.8.8 Symbols
A Tungsten Symbol represents a named token. They allow for Ruby-like DSLs.
Letter = "a"…"z" .
Digit = "0"…"9" .
LiteralSymbol = ":" Letter { Letter | Digit | "_" } .
Because strings in Tungsten are immutable, symbols are less useful than in Ruby.
2.9 Numeric literals
There are 4 types of numeric literals: integers, decimals, floating point numbers, and imaginary numbers. There are no complex literals (complex numbers can be formed by adding a real number and an imaginary number).
Note that numeric literals do not include a sign; a phrase like −1 is actually an expression composed of the unary operator - and the literal 1.
2.9.1 Integers
Integer literals are described by the following lexical definitions:
Integer = IntegerBase2 | IntegerBase8 | IntegerBase10 | IntegerBase16 | IntegerBase20 .
DigitBase2 = "0"…"1" .
DigitBase8 = "0"…"7" .
DigitBase10 = "0"…"9" .
DigitBase16 = "0"…"9" | "a"…"f" | "A"…"F" .
DigitBase20 = "0"…"9" | "a"…"j" | "A"…"J" .
IntegerBase2 = "0b" DigitBase2 { ["_"] DigitBase2 } .
IntegerBase8 = "0o" DigitBase8 { ["_"] DigitBase8 } .
IntegerBase10 = DigitBase10 { ["_"] DigitBase10 } .
IntegerBase16 = "0x" DigitBase16 { ["_"] DigitBase16 } . # Unsigned integers
IntegerBase20 = "0v" DigitBase20 { ["_"] DigitBase20 } .
There is no limit for the length of integer literals apart from what can be stored in available memory.
Numerical constants can contain underscores for readability. Integers can be created as decimal (no prefix), binary (0b), octal (0o), and hexadecimal (0x).
Note: non-zero decimal literals may have leading zeros, as the octal literals have the 0o prefix. Note: leading and trailing underscores are not allowed.
Unsigned literals are described by the following lexical definitions:
Hex = "0"…"9" | "A"…"F" | "a"…"f" .
Int8U = "0x" Hex .
Int16U = "0x" Hex Hex .
Int32U = "0x" Hex Hex Hex Hex .
Int64U = "0x" Hex Hex Hex Hex ["_"] Hex Hex Hex Hex .
Int128U = "0x" Hex Hex Hex Hex ["_"] Hex Hex Hex Hex ["_"] Hex Hex Hex Hex ["_"] Hex Hex Hex Hex .
# Should "0b" and "0o" literals also be unsigned?
Integer Types
| Type | Signed? | Bits | Min value | Max value |
|---|---|---|---|---|
| Int8 | ✓ | 8 | −2⁷ | 2⁷ − 1 |
| Int8U | 8 | 0 | 2⁸ − 1 | |
| Int16 | ✓ | 16 | −2¹⁵ | 2¹⁵ − 1 |
| Int16U | 16 | 0 | 2¹⁶ − 1 | |
| Int32 | ✓ | 32 | −2³¹ | 2³¹ − 1 |
| Int32U | 32 | 0 | 2³² − 1 | |
| Int64 | ✓ | 64 | −2⁶³ | 2⁶³ − 1 |
| Int64U | 64 | 0 | 2⁶⁴ − 1 | |
| Int128 | ✓ | 128 | −2¹²⁷ | 2¹²⁷ − 1 |
| Int128U | 128 | 0 | 2¹²⁸ − 1 | |
| BigInt | ✓ | ∞ | −∞ | ∞ |
The default type for an integer literal is 64-bits:
wit> 1.class
Int64
Unsigned integers are input and output using the 0x prefix and hexadecimal (base 16) digits 0–9a–f (the capitalized digits A–F also work). The size of the unsigned value is determined by the number of hex digits used:
wit> 0x1.class
Int8U
wit> 0x123.class
Int16U
wit> 0x1234567.class
Int32U
wit> 0x123456789abcdef.class
Int64U
This behavior is based on the observation that when one uses unsigned hex literals for integer values, one typically is using them to represent a fixed numeric byte sequence, rather than just an integer value.
Binary and octal literals are also supported:
wit> 0b10
0x02
wit> 0b10.class
Int8U
wit> 0o10
0x08
wit> 0o10.class
Int8U
The minimum and maximum representable values of primitive numeric types such as integers are given by the min/0 and max/0 methods:
# wit> (Int32: min max)
# wit> Int32{min max}
wit> (Int32.min, Int32.max)
(−2147483648, 2147483647)
wit> [Int8, Int16, Int32, Int64, Int128, Int8U, Int16U, Int32U, Int64U, Int128U].each do |type|
puts "[type.lpad(7)]: ([type.min], [type.max])"
Int8: (-128, 127)
Int16: (-32768, 32767)
Int32: (-2147483648, 2147483647)
Int64: (-9223372036854775808, 9223372036854775807)
Int128: (-170141183460469231731687303715884105728, 170141183460469231731687303715884105727)
UInt8: (0, 255)
UInt16: (0, 65535)
UInt32: (0, 4294967295)
UInt64: (0, 18446744073709551615)
UInt128: (0, 340282366920938463463374607431768211455)
The values returned by min/0 and max/0 are always of the receiver's type.
2.9.2 Decimals
Decimal literals (or rationals) are described by the following lexical definitions:
(* @todo Roman numeral characters U+2160–217F, counting rods U+1D360 to U+1D37F *)
Digit = "0"…"9" .
Letter = "\p{letter}" .
Letters = Letter { Letter | "_" } .
Currency = "\p{currency symbol}" .
SuperNZ = "¹" | "²" | "³" | "⁴" | "⁵" | "⁶" | "⁷" | "⁸" | "⁹" .
Super = "⁰" | "¹" | "²" | "³" | "⁴" | "⁵" | "⁶" | "⁷" | "⁸" | "⁹" .
Supers = ["⁻" | "⁺"] SuperNZ { Super } .
Prefix = Currency .
Suffix = Units | Degrees | Percents .
Unit = (Letters ["-" Letters] | [Letters] "/" Letters) [ "^" Exponent | Supers ] .
Units = ["⋅"] Unit { "⋅" Unit } .
Degrees = "℃" | "℉" | "°C" | "°F" | "°" [Letter] .
Percents = "%" | "‰" | "‱" | "٪" | "؉" | "؊" | "﹪" | "%" | "percent" .
Exponent = ["+" | "-" | "−"] Digits .
Scientific = "x10^" Exponent
| "×10^" Exponent
| "x10" Supers
| "×10" Supers
| "e" Exponent
| "E" Exponent
.
Precision = "±" Digits ["." Digits]
| "±" Digits "/" Digits
| "(" Digits ")"
.
Digits = Digit { Digit | "_" } .
Decimal = [Prefix] Digits "." Digits [Precision] [Suffix]
| [Prefix] Digits "/" Digits [Precision] [Suffix]
| [Prefix] Digits [Precision] Suffix
| Prefix Digits [Precision] [Suffix]
| [Prefix] Digits ["." Digits] [Precision] Scientific [Suffix]
.
Literal = Decimal " "
| "ℎ" (* Planck's constant *)
| "ℏ" (* Reduced Planck constant *)
| "ℇ" (* Eulers constant, irrational *)
| "π" (* Pi, irrational *)
| "ϕ" (* Phi, irrational *)
.
Tungsten decimal literals can be annotated with semantic meaning that is available at run-time:
- units of measurement (available at run-time)
- precision (or error)
- currency
- percents
Tungsten allows you to create new units of measurement, auto-generating the conversions to other units.
Tungsten ships with all dimensions from the International System of Units, abbreviated SI from the French Le Système International d'Unités.
Example: literal definition of Planck's constant: ℎ = 6.626_069_57(29)×10²³J·s.
Note: Trailing zeros are meaningful, as they indicate the precision associated with the number. Decimal literals will be normalized to a standard form before returned: e.g., "x10^2" => "×10²".
Decimal literals can be defined with semantic meaning:
0.000_000_000_1
0.08
0.0800
wit> 22/7
=> 22/7
wit> 22/2
=> 11
wit> $3.50 - 25¢
=> $3.25
wit> $499 - 15% # woah, same as $499 * 0.85
=> $424.15
wit> 20% - 15%
=> 5%
wit> 1cm * 1cm * 1cm
=> 1mL
wit> 10ft * 10ft
=> 100sqft
wit> 1ft + 12inches
=> 2ft
wit> 2ft.to_s
=> "2 feet"
wit> 299_792_458m/s
wit> 10ft·lbs
wit> 2m + 2lbs
error UnitMismatch
wit> 3ft - 1m
=> -3+3/8inches
wit> 73/100±1/100
wit> 1.602_176_487(40)x10^-19C
wit> 1.602_176_487±0.000_000_040x10^-19C
wit> V = (1kg * 1m^2) / (1amp * 1s^3)
wit> 100MV # megavoltage of lightning
wit> rate = 1/s
wit> 40°20′50″
wit> 3′5″ # 3 feet 5 inches (of length), or 3 minutes and 5 seconds (of time)
wit> 3m5s
wit> ℎ
=> 6.626_069_57(29)×10²³J·s
wit> ℏ
=> 1.054_571_726(47)×10³⁴J·s
# exact calculations using irrational numbers
wit> 2π - 1π - 1π
=> 0
wit> 3π / 2
=> 1.5π
wit> 512GiB %% bytes
=> 549_755_813_888·bytes
Tungsten.register_unit "km", alias: "kilometer", equals: 1000m
Tungsten.register_unit "cm", alias: "centimeter", equals: 1m⁻²
Tungsten.register_unit "in", aliases: ["inch", "\N[DOUBLE PRIME]"], equals: 2.54cm, as: "inch"
SI Base Units (meter kilogram second)
| Name | Abbr | Measure |
|---|---|---|
| metre | m | length |
| kilogram | kg | mass |
| second | s | time |
| ampere | A | electric current |
| kelvin | K | thermodynamic temperature |
| mole | mol | amount of substance |
| candela | cd | luminous intensity |
Note: The kilogram is the only prefixed SI Base Unit. The prefixes for mass refer to the gram as their base.
CGS Base Units (centimeter gram second)
| Name | Abbr | Measure |
|---|---|---|
| centimeter | cm | length |
| gram | g | mass |
| second | s | time |
| centimeter per second | cm/s | velocity |
| gal | Gal | acceleration |
| dyne | dyn | force |
| erg | erg | energy |
| erg per second | erg/s | power |
| barye | Ba | pressure |
| poise | P | dynamic viscosity |
| stokes | St | kinematic viscosity |
| kayser | cm⁻¹ | wavenumber |
Source: wikipedia.org/wiki/CGS
Significant figures
A significant figure is a digit in a number that adds to its precision. This includes all nonzero numbers, zeroes between significant digits, and zeroes indicated to be significant. Leading and trailing zeroes are not significant because they exist only to show the scale of the number. Therefore, 1,230,400 has five significant figures—1, 2, 3, 0, and 4; the two zeroes serve only as placeholders and add no precision to the original number.
When a number is converted into normalized scientific notation, it is scaled down to a number between 1 and 10. All of the significant digits remain, but all of the place holding zeroes are incorporated into the exponent. Following these rules, 1,230,400 becomes 1.2304 x 10⁶.
Ambiguity of the last digit
It is customary in scientific measurements to record all the significant digits from the measurements, and to guess one additional digit if there is any information at all available to the observer to make a guess. The resulting number is considered more valuable than it would be without that extra digit, and it is considered a significant digit because it contains some information leading to greater precision in measurements and in aggregations of measurements (e.g., when adding them or multiplying them together).
Additional information about precision can be conveyed through additional notations. In some cases, it may be useful to know how exact the final significant digit is. For instance, the accepted value of the unit of elementary charge can properly be expressed as 1.602_176_487(40)x10^-19C, which is shorthand for 1.602_176_487±0.000_000_040x10^-19C.
Metric prefixes
| Prefix | Symbol | 1000m | 10n | Decimal | English word | Since |
|---|---|---|---|---|---|---|
| yotta | Y | 1000⁸ | 10²⁴ | 1 000 000 000 000 000 000 000 000 | septillion | 1991 |
| zetta | Z | 1000⁷ | 10²¹ | 1 000 000 000 000 000 000 000 | sextillion | 1991 |
| exa | E | 1000⁶ | 10¹⁸ | 1 000 000 000 000 000 000 | quintillion | 1975 |
| peta | P | 1000⁵ | 10¹⁵ | 1 000 000 000 000 000 | quadrillion | 1975 |
| tera | T | 1000⁴ | 10¹² | 1 000 000 000 000 | trillion | 1960 |
| giga | G | 1000³ | 10⁹ | 1 000 000 000 | billion | 1960 |
| mega | M | 1000² | 10⁶ | 1 000 000 | million | 1960 |
| kilo | k | 1000¹ | 10³ | 1 000 | thousand | 1795 |
| hecto | h | 10002/3 | 10² | 1 00 | hundred | 1795 |
| deca | da | 10001/3 | 10¹ | 1 0 | ten | 1795 |
| 1000⁰ | 10⁰ | 1 | one | - |
| Prefix | Symbol | 1000m | 10n | Decimal | English word | Since |
|---|---|---|---|---|---|---|
| 1000⁰ | 10⁰ | 1 | one | - | ||
| deci | d | 1000-1/3 | 10⁻¹ | 0.1 | tenth | 1795 |
| centi | c | 1000-2/3 | 10⁻² | 0.01 | hundredth | 1795 |
| milli | m | 1000⁻¹ | 10⁻³ | 0.001 | thousandth | 1795 |
| micro | µ, mc | 1000⁻² | 10⁻⁶ | 0.000 001 | millionth | 1960 |
| nano | n | 1000⁻³ | 10⁻⁹ | 0.000 000 001 | billionth | 1960 |
| pico | p | 1000⁻⁴ | 10⁻¹² | 0.000 000 000 001 | trillionth | 1960 |
| femto | f | 1000⁻⁵ | 10⁻¹⁵ | 0.000 000 000 000 001 | quadrillionth | 1964 |
| atto | a | 1000⁻⁶ | 10⁻¹⁸ | 0.000 000 000 000 000 001 | quitillionth | 1964 |
| zepto | z | 1000⁻⁷ | 10⁻²¹ | 0.000 000 000 000 000 000 001 | sextillionth | 1991 |
| yocto | y | 1000⁻⁸ | 10⁻²⁴ | 0.000 000 000 000 000 000 000 001 | septillionth | 1991 |
Note: dag,dkg: decagram, mcg: microgram, megagram: tonne (t), megatonne or megaton -> teragram (Tg)
Binary prefixes
| Prefix | Symbol | 2n | Derivation | Decimal |
|---|---|---|---|---|
| kibi | Ki | 2¹⁰ | kilo: (10³)¹ | 1 024 |
| mebi | Mi | 2²⁰ | mega: (10³)² | 1 048 576 |
| gibi | Gi | 2³⁰ | giga: (10³)³ | 1 073 741 824 |
| tebi | Ti | 2⁴⁰ | tera: (10³)⁴ | 1 099 511 627 776 |
| pebi | Pi | 2⁵⁰ | peta: (10³)⁵ | 1 125 899 906 842 624 |
| exbi | Ei | 2⁶⁰ | exa: (10³)⁶ | 1 152 921 504 606 846 976 |
| zebi | Zi | 2⁷⁰ | zetta: (10³)⁷ | 1 180 591 620 717 411 303 424 |
| yobi | Yi | 2⁸⁰ | yobi: (10³)⁸ | 1 208 925 819 614 629 174 706 176 |
2.9.3 Floating points
Floating point literals are described by the following lexical definitions:
Digit = "0"…"9" .
Digits = Digit { ["_"] Digit } .
Float = "~" Digits ["." Digits] Exponent .
Exponent = ("e" | "E") ["+" | "-" | "−"] Digit { Digit } .
Floating-point Types
| Type | Precision | Bits | IEEE 754 | sn | exp | sig |
|---|---|---|---|---|---|---|
| Float16 | half | 16 | binary16 | 1 | 5 | 11 |
| Float32 | single | 32 | binary32 | 1 | 8 | 23 |
| Float64 | double | 64 | binary64 | 1 | 11 | 52 |
| Float128 | quad | 128 | binary128 | 1 | 15 | 112 |
| Float256 | octuple | 256 | binary256 | 1 | 19 | 236 |
Floating-point zero
Floating-point numbers have two zeros, positive zero and negative zero. They are equal to each other but have different binary representations.
wit> ~+0.0e0 == ~-0.0e0
true
wit> ~+0.0e0.bits
"0000000000000000000000000000000000000000000000000000000000000000"
wit> ~-0.0e0.bits
"1000000000000000000000000000000000000000000000000000000000000000"
wit> ~210.0e0
~2.1e2
Special floating-point values
There are three specified standard floating-point values that do not correspond to any point on the real number line:
| Float16 | Float32 | Float64 | Float128 | Float256 | Name | Description |
|---|---|---|---|---|---|---|
| Inf16 | Inf32 | Inf, ∞ | Inf128 | Inf256 | positive infinity | A value greater than all finite floating-point values |
| −Inf16 | −Inf32 | −Inf, −∞ | -Inf128 | -Inf256 | negative infinity | A value less than all finite floating-point values |
| NaN16 | NaN32 | NaN | NaN128 | NaN256 | not a number | A value not equal to any floating-point value (including itself) |
For further discussion of how these non-finite floating-point values are ordered with respect to each other and other floats, see Numeric Comparisons. By the IEEE 754 standard, these floating-point values are the results of certain arithmetic operations.
wit> 1 / Inf
~0.0e0
wit> ~1.0 / 0
Inf
wit> −~1.0 / 0
−Inf
wit> ~0.1 / 0
Inf
wit> ~0.0 / 0
NaN
wit> ~1.0 + Inf
Inf
wit> ~1.0 − Inf
−Inf
wit> Inf + Inf
Inf
wit> Inf − Inf
NaN
wit> Inf * Inf
Inf
wit> Inf / Inf
NaN
wit> ~0.0 * Inf
NaN
The #min and #max methods are available for floating-point types:
wit> (Float16.min, Float16.max)
(−Inf16, Inf16)
wit> (Float32.min, Float32.max)
(−Inf32, Inf32)
wit> (Float64.min, Float64.max)
(−Inf, Inf)
(−∞, ∞)
Machine epsilon
Most real numbers cannot be represented exactly with floating-point numbers, and so for many purposes it is important to know the distance between two adjacent representable floating-point numbers, which is often known as machine epsilon.
Tungsten provides .eps, which gives the distance between 1.0 and the next larger representable floating-point value:
wit> Float32.eps
~1.1920929e-7
wit> Float64.eps
~2.220446049250313e-16
These values are ~2.0^-23 and ~2.0^-52 as Float32 and Float64 values, respectively. The #eps method is also available on instances of floating-point numbers and gives the absolute difference between that value and the next representable floating-point value. That is, x.eps yields a value of the same type as x such that x + x.eps is the next representable floating-point value larger than x:
wit> ~1.0.eps
~2.220446049250313e-16
wit> ~1000.0.eps
~1.1368683772161603e-13
wit> ~1e-27.eps
~1.793662034335766e-43
wit> ~0.0.eps
~5.0e-324
The distance between two adjacent representable floating-point values is not constant, but is smaller for smaller values and larger for larger values. In other words, the representable floating-point numbers are densest in the real number line near zero, and grow sparser exponentially as one moves farther away from zero. By definition, 1.0.eps is the same as Float64.eps since 1.0 is a 64-bit floating-point value.
Tungsten also provides the #next and #prev methods which return the next larger or smaller representable floating-point number to the receiver, respectively:
wit> x = ~1.25e0
~1.25e0
wit> x.next
~1.2500001e0
wit> x.prev
~1.2499999e0
wit> x.prev.bits
"00111111100111111111111111111111"
wit> x.bits
"00111111101000000000000000000000"
wit> x.next.bits
"00111111101000000000000000000001"
This example highlights the general principal that the adjacent representable floating-point numbers also have adjacent binary integer representations.
Rounding modes
If a number doesn't have an exact floating-point representation, it must be rounded to an appropriate representable value, however, if wanted, the manner in which this rounding is done can be changed according to the rounding modes presented in the IEEE 754 standard:
wit> ~1.1e0 + ~1.0e-1
~1.2000000000000002
wit> with_rounding(Float64, :round_down) -> ~1.1e0 + ~1.0e-1
~1.2
The default mode used is always :round_nearest, which rounds to the nearest representable value, with ties rounded towards the nearest value with an even least significant bit.
Background and References
Floating-point arithmetic entails many subtleties which can be surprising to users who are unfamiliar with the low-level implementation details. However, these subtleties are described in detail in most books on scientific computation, and also in the following references:
- The definitive guide to floating-point arithmetic is the IEEE 754 Standard; however, it is not available for free online.
- For a brief but lucid presentation of how floating-point numbers are presented, see John D. Cook's floating-point articles on the subject.
- Also recommended is Bruce Dawson's series of blog posts on floating-point numbers.
- For an excellent, in-depth discussion of floating-point numbers and issues of numerical accuracy encountered when computing with them, see David Goldberg's paper What Every Computer Scientist Should Know About Floating-Point Arithmetic.
- For even more extensive documentation of the history of, rationale for, and issues with floating-point numbers, as well as discussion of many other topics in numerical computing, see the collected writings of William Kahan, commonly known as the "Father of Floating-Point". Of particular interest may be An Interview with the Old Man of Floating-Point.
2.9.4 Imaginary literals
Imaginary literals are described by the following lexical definitions:
Imaginary = Float "i" .
An imaginary literal yields a complex number with a real part of 0.0. Complex numbers are represented as a pair of floating point numbers and have the same restrictions on their range. To create a complex number with a nonzero real part, add a floating point number to it, e.g., (3 + 4i).
2.9.5 Literal zero and one
Tungsten provides methods which return literal 0 and 1 corresponding to a specified type or the type of a given variable.
| Method | Description |
| ------ | ------------------------------------------------ |
| x.zero | Literal zero of type `x` or type of variable `x` |
| x.one | Literal one of type `x` or type of variable `x` |
Examples:
wit> Float32.zero
~0.0e0
wit> ~1.0.zero
~0.0
wit> Int32.one
1
wit> BigFloat.one
~1e+00 with 256 bits of precision
2.10 Temporal, network, and structured literals
Besides strings and numbers, the lexer recognizes several families of domain literals — colors, dates, network addresses, and durations — each as a single token that yields a value of a dedicated built-in type. Because these forms overlap syntactically with comments, subtraction, and method chains, the lexer disambiguates them by strict adjacency (no interior spaces) and, in most cases, by digit count.
Note: The reference interpreter is the authoritative implementation of these literals. The self-hosted native compiler recognizes the common cases but does not yet lex every form — MAC addresses, microsecond and ISO-8601 durations, and the fully-expanded (non-::) IPv6 form are reference-interpreter-only — and it validates digit counts and octet ranges without range-checking calendar or clock fields. Divergences are noted per form below.
2.10.1 Color literals
A color literal is a # immediately followed by exactly three, four, six, or eight hexadecimal digits, and not followed by a further hexadecimal digit or identifier character.
Hex = "0"…"9" | "a"…"f" | "A"…"F" .
Color = "#" ( Hex Hex Hex
| Hex Hex Hex Hex
| Hex Hex Hex Hex Hex Hex
| Hex Hex Hex Hex Hex Hex Hex Hex ) .
The three- and four-digit forms are shorthand: each nibble is doubled (#RGB becomes #RRGGBB, #RGBA becomes #RRGGBBAA). Six digits are RRGGBB; eight are RRGGBBAA. A literal with no alpha channel is fully opaque (α = 255).
#FF0000 # => Color [255, 0, 0, 255]
#F00 # => Color [255, 0, 0, 255] shorthand
#FF000080 # => Color [255, 0, 0, 128] with alpha
#F008 # => Color [255, 0, 0, 136] shorthand with alpha
Because a color begins with # — the comment character (§2.5) — the two are told apart by what follows: a run of exactly 3, 4, 6, or 8 hexadecimal digits not glued to further word characters is a color; every other #… is a comment. Thus #FF (two digits) and #FFFFF (five) are comments, and #FF0000abcd is a comment because the trailing letters run past eight hex digits.
Token: COLOR. Runtime type: Color.
2.10.2 Date and month literals
A date literal is a four-digit year, a hyphen, and then either a two-digit month with a two-digit day (a calendar date) or a three-digit ordinal day-of-year.
Year = Digit Digit Digit Digit .
Month = Digit Digit .
Day = Digit Digit .
Ordinal = Digit Digit Digit .
Date = Year "-" ( Month "-" Day | Ordinal ) .
MonthOf = Year "-" Month . # no day component
YYYY-MM-DD # => Date calendar date
YYYY-DDD # => Date ordinal day-of-year
YYYY-MM # => Month year and month
A year followed by - and a two-digit month but no -DD yields a Month value rather than a Date.
Disambiguation from subtraction. The date scanner fires only when each hyphen is immediately adjacent to the digits on both sides. YYYY-MM-DD is a date; YYYY - MM - DD, with spaces around the operators, is integer subtraction (§2.3).
Compiler divergence: the reference interpreter range-checks the fields (months 01–12, days 01–31, ordinals 001–366) and additionally accepts ISO week dates such as YYYY-Www-D; the native compiler checks only digit counts, so it accepts an out-of-range date like YYYY-99-99.
Token: DATE, or MONTH for the day-less form. Runtime types: Date, Month.
2.10.3 DateTime literals
A datetime literal is a calendar date, the letter T, and a time. Hours and minutes are required; seconds, fractional seconds, and a timezone are optional.
Time = Hour ":" Minute [ ":" Second [ "." Fraction ] ] [ Zone ] .
Zone = "Z" | ( "+" | "-" ) Hour [ ":" Minute ] .
DateTime = Date "T" Time .
YYYY-MM-DDT14:30 # date and time, no zone
YYYY-MM-DDT14:30:00Z # UTC
YYYY-MM-DDT09:00:00-08:00 # with offset
YYYY-MM-DDT14:30:00.500+05:30
Compiler divergence: the reference interpreter range-checks the clock fields (hours 00–23, minutes 00–59, seconds 00–60 for leap seconds), caps fractional seconds at three digits, and accepts the 24:00 end-of-day form; the native compiler checks only digit counts on the time and leaves the fraction unbounded.
Token: DATETIME.
2.10.4 IP-address literals
An IPv4 literal is four dot-separated octets, each in the range 0–255, with an optional :port (0–65535).
Octet = Digit [ Digit [ Digit ] ] . # value 0…255
Port = Digits . # value 0…65535
IPv4 = Octet "." Octet "." Octet "." Octet [ ":" Port ] .
192.168.1.1
10.0.0.1:8080 # with port
255.255.255.0
The IPv4 scanner runs before the floating-point path: a one-to-three-digit integer ≤ 255 immediately followed by . and a digit begins an address attempt, which succeeds only when exactly four octets are present. A three-part form such as 1.2.3 therefore backtracks to a decimal 1.2 followed by .3 (see §2.10.8).
Both engines also recognize IPv6 literals (RFC 5952) in ::-compressed form (::1, 2001:db8::1, bare ::) and the IPv4-mapped form (::ffff:1.2.3.4); the native compiler prints them fully expanded (::1 → 0:0:0:0:0:0:0:1). Per RFC 5952 §4.3 an IPv6 literal must be lowercase: fe80::1 is an address, FE80::1 is not — reserving an uppercase leading letter for class references (Tungsten:JSON). An IPv6 literal also never follows a word character, so Foo::Bar stays a name/scope form, not an address. Reference-interpreter-only: the fully-expanded input form with no :: (2001:db8:0:0:0:0:0:1) — which the compiler leaves as colon-separated fragments to avoid mis-lexing hash keys and namespaces — plus zone identifiers, bracketed-with-port forms, and MAC addresses.
Token: IP4 / IP6. Runtime types: IPv4, IPv6.
2.10.5 CIDR literals
A CIDR literal is an IPv4 address, a slash, and a prefix length of 0–32.
CIDR = IPv4 "/" Prefix . # Prefix 0…32
10.0.0.0/8
192.168.0.0/24
0.0.0.0/0
A prefix greater than 32 is not a CIDR; the /prefix is left as a division operator applied to the address.
Both engines also recognize IPv6 CIDR (2001:db8::/32, ::/0, prefix 0–128) for the ::-compressed address forms.
Token: CIDR4 (reference-only CIDR6). Runtime type: CIDR.
2.10.6 Duration literals
A duration literal is a compact sequence of number-and-unit components in descending order of magnitude, drawn from y, mo, w, d, h, m, s, ms, and ns.
Unit = "y" | "mo" | "w" | "d" | "h" | "m" | "s" | "ms" | "ns" .
Duration = ( Digits Unit ) { Digits Unit } . # components largest to smallest
5m30s # 5 minutes, 30 seconds
2h30m
1y2mo3d
500ms # a single component is a duration only for ms, ns, mo
Two or more components always form a duration. A single component whose unit is ambiguous with a unit-of-measurement (y, w, d, h, m, s) is instead a Quantity (§2.9.2); only the unambiguous single units ms, ns, and mo form a one-component duration. Components must be written from largest to smallest.
Reference-interpreter-only: microsecond durations (µs / μs) and ISO-8601 durations (P1Y2M3DT4H5M6S, PT1.5H, P3W) are recognized by the reference lexer only.
Angle literals combining degrees, arcminutes, and arcseconds — 40°20′50″ — are not recognized: ° is a unit character (§2.9.2), but ′ and ″ are not lexed.
Token: DURATION. Runtime type: Duration.
2.10.7 UUID literals
A UUID literal is the canonical hyphenated 8-4-4-4-12 hexadecimal form, with a version nibble of 1–8 and an RFC 4122 variant nibble.
Version = "1"…"8" .
Variant = "8" | "9" | "a" | "A" | "b" | "B" .
UUID = Hex⁸ "-" Hex⁴ "-" Version Hex³ "-" Variant Hex³ "-" Hex¹² .
550e8400-e29b-41d4-a716-446655440000
Both engines recognize UUID literals. A sequence that is not a valid UUID — wrong field lengths, or a version nibble outside 1–8 — is instead read as hexadecimal integers joined by - operators.
Token: UUID. Runtime type: UUID.
2.10.8 A note on version-like sequences
Tungsten has no version or semantic-version literal. A sequence such as 1.2.3 is read by the ordinary number machinery as a decimal 1.2 followed by .3 (a call to member 3), and cannot form an IPv4 address because that path requires four octets (§2.10.4).
2.11 Boolean literals
Tungsten represents boolean values with two objects literals: true and false.
Boolean = True | False .
True = "true" | "on" | yes" .
False = "false" | "off" | "no" .
Token: BOOLEAN
2.12 Nil literal
The Nil type has only one possible value: nil.
Nil = "nil" .
Token: NIL
2.13 Regular expression literals
Regex = "/" Characters "/" .
Regular expressions participate in pattern matching through =~ and through regex arms in case expressions.
On a successful match, $1, $2, ... denote the corresponding parenthesized capture groups. Capture variables are scoped to the same statement as the regex literal that introduced the match. A newline or semicolon ends the capture-variable lexical scope.
Examples:
if /^--(.+)=(.+)$/ =~ arg then [$1.to_sym, $2]
case arg
/^--(.+)=(.+)$/ => [$1.to_sym, $2]
2.14 Collection literals
2.14.1 Tuples
Tungsten tuple literals are described by the following lexical definitions:
Tuple = "(" Expression { "," Expression } ")" .
2.14.2 Arrays
Tungsten array literals are described by the following lexical definitions:
Array = "[" Expression { "," Expression } "]" .
2.14.3 Hashes
Tungsten hash literals are described by the following lexical definitions:
Hash = "{" Pair { "," Pair } "}" .
Pair = Key ":" Expression | '"' Key '"' Space ":" Space Expression .
Key = .
Space = " " { " " } .
Examples
hash = { one: 1, two: 2 }
hash = { "one" : 1, "two" : 2 }
hash = {
one: 1
two: 2
}
2.14.4 Sets
Set = "<(" Expression { "," Expression } ")>" .
2.14.5 Multisets
Multiset = "<{" Expression { "," Expression } "}>" .
2.14.6 Word and symbol arrays
Two percent-literal forms build arrays of short strings or symbols without quotes or commas. Only the [ ] delimiter is accepted, and elements are separated by whitespace (spaces, tabs, or newlines).
WordArray = "%w[" { Whitespace Word } Whitespace "]" .
SymbolArray = "%i[" { Whitespace Word } Whitespace "]" .
%w[red green blue] # => ["red", "green", "blue"]
%i[get post put] # => [:get, :post, :put]
Multi-line forms are allowed, with newlines acting as separators. There is no escape mechanism, so an element cannot itself contain ].
2.15 Operators and delimiters
The following character sequences are operators and/or punctuation:
. , ; : .. ... … ` ! @ # $ ? + - * / % ** // %% ^^ -- ++ ~~ && || ~ & | ^ <- -> => #-> #->>
= == === !== != ≠ =~ !~ !~~ < > <= >= ≤ ≥ <=> += -= /= *= %= ^= &= |= ~= &&= ||=
{ } ( ) [ ] << >> <" "> <[ ]> <( )>
→ ←
Certain symbols serve more than one purpose in the grammar.
The augmented assignment operators, serve lexically as delimiters, but also perform an operation.
Any printing ASCII character not listed above as an operator, delimiter, or literal introducer is unused by Tungsten; its occurrence outside string literals and comments is an error.
A physical line is a sequence of characters terminated by an end-of-line sequence. In source files, any of the standard platform line termination sequences can be used – Unix (LF), Windows (CR LF), or the old Macintosh (CR). All line termination sequences can be used interchangeably, regardless of platform.