Encoding

The Internet Object format uses UTF-8 as its default encoding, and UTF-8 is the only encoding that every implementation must support. This keeps data interchangeable across platforms, systems, and programming languages.

UTF-8 Requirement

UTF-8 encoding is mandatory for all Internet Object implementations for the following reasons:

Universal Compatibility: UTF-8 is supported by virtually all modern systems and programming languages
ASCII Compatibility: All ASCII characters (0-127) are represented identically in UTF-8
Full Unicode Support: Can represent all Unicode characters and symbols
Byte-Order Independence: No endianness concerns unlike UTF-16 and UTF-32
Self-Synchronizing: A corrupted byte damages only the character it belongs to; a decoder can resynchronize at the next character boundary and parse the following characters normally
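
Three of these properties can be checked directly. The following is a minimal sketch in Python, using only the built-in codecs; the helper name next_boundary is hypothetical and not part of any Internet Object API.

```python
# ASCII compatibility: ASCII text yields identical bytes in ASCII and UTF-8.
text = "name: Alice"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Byte-order independence: UTF-16 has two possible byte orders, UTF-8 has one.
euro = "\u20ac"  # EURO SIGN
utf16_le = euro.encode("utf-16-le")  # bytes AC 20
utf16_be = euro.encode("utf-16-be")  # bytes 20 AC
utf8 = euro.encode("utf-8")          # bytes E2 82 AC, regardless of platform


def next_boundary(data: bytes, start: int) -> int:
    """Return the index of the first byte at or after `start` that is not a
    UTF-8 continuation byte (continuation bytes match the bit pattern 10xxxxxx)."""
    i = start
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i
```

Starting in the middle of a multi-byte character, next_boundary skips the remaining continuation bytes and lands on the first byte of the next character, which is what lets a decoder recover after a damaged sequence.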

Alternative Encodings

While UTF-8 is mandatory, implementations may optionally support additional encodings for specific use cases:

| Encoding   | Support Level | Notes                                               |
|------------|---------------|-----------------------------------------------------|
| UTF-8      | Mandatory     | Default and required for all implementations        |
| UTF-16     | Optional      | May be useful for systems with native UTF-16 support |
| UTF-32     | Optional      | Fixed-width encoding, larger file sizes             |
| ASCII      | Optional      | Subset compatibility (limited to basic characters)  |
| ISO-8859-1 | Optional      | Legacy support for Latin-1 character set            |

If UTF-8 does not suit your needs, choose the alternative encoding that best fits your situation while keeping your data compatible with other implementations. If your implementation does not support that encoding directly, use an external library or tool to convert between it and UTF-8.

All parsers and serializers must ensure that UTF-8 is supported, as it is the only mandatory encoding for Internet Objects.
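
As an illustration of such a conversion, the sketch below transcodes ISO-8859-1 bytes to UTF-8 with Python's built-in codecs. The function name to_utf8 is hypothetical; the codec names are standard Python codec aliases.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode `data` from `source_encoding`, then re-encode it as UTF-8.

    Raises UnicodeDecodeError if `data` is not valid in the source encoding.
    """
    return data.decode(source_encoding).encode("utf-8")


latin1_bytes = "caf\u00e9".encode("iso-8859-1")   # one byte (E9) for U+00E9
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")  # two bytes (C3 A9) for U+00E9
```

Converting at the boundary like this lets a legacy system keep its native encoding while the Internet Object data itself stays in UTF-8.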

Unicode Support

Internet Object fully supports the Unicode character set through UTF-8 encoding:

Character Range

Basic Multilingual Plane (BMP): U+0000 to U+FFFF
Supplementary Planes: U+10000 to U+10FFFF
Private Use Areas: Supported but implementation-specific
Control Characters: Handled according to Unicode standards
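
The ranges above correspond to different UTF-8 byte lengths: BMP characters take one to three bytes, and supplementary-plane characters take four. A small sketch, with the hypothetical helper utf8_length:

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode `code_point`."""
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and cannot be encoded in UTF-8.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return 1  # U+0000 to U+007F (ASCII)
    if code_point < 0x800:
        return 2  # U+0080 to U+07FF
    if code_point < 0x10000:
        return 3  # U+0800 to U+FFFF (rest of the BMP)
    if code_point <= 0x10FFFF:
        return 4  # U+10000 to U+10FFFF (supplementary planes)
    raise ValueError("code point is outside the Unicode range")
```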

Unicode Normalization

Recommended Form: NFC (Normalization Form C, canonical composition)
Parser Behavior: Should handle input in any normalization form correctly
String Comparison: Implementations should normalize for consistent comparison
Storage: Internal representation may use any normalized form
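
To see why normalization matters for comparison, consider "é", which can be written as one precomposed code point or as "e" followed by a combining accent. The sketch below uses Python's standard unicodedata module; nfc_equal is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT (two code points)


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings render identically but compare unequal code point by code point; after NFC normalization they compare equal.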

Byte Order Mark (BOM)

BOM Handling

UTF-8 BOM: The byte sequence EF BB BF (the UTF-8 encoding of U+FEFF) at the beginning of a file
Parser Behavior: BOM is treated as whitespace and ignored during parsing
Recommendation: BOM is optional and not required for UTF-8
Compatibility: Including BOM won't cause parsing errors

Example

# File with BOM (invisible to parser)
{
  name: "Example with BOM",
  content: "This file starts with UTF-8 BOM"
}
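
One way to make the BOM invisible to the rest of a parser is to drop it before decoding. This is a minimal sketch in Python, not the reference implementation; strip_utf8_bom is a hypothetical name, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Python's "utf-8-sig" codec does the same thing while decoding.
with_bom = codecs.BOM_UTF8 + b'{ name: "Example with BOM" }'
decoded = with_bom.decode("utf-8-sig")
```

Only a BOM at the very start of the input is removed; the same bytes appearing later in the data are left untouched.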

Implementation Guidelines

For Parser Implementers

UTF-8 Detection: Automatically detect UTF-8 encoding
BOM Handling: Skip UTF-8 BOM if present at file start
Error Handling: Provide clear errors for invalid UTF-8 sequences
Normalization: Consider Unicode normalization for string operations
Validation: Validate character sequences according to Unicode standards
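
The BOM and error-handling guidelines can be combined in a single decoding step. The sketch below assumes the parser receives raw bytes; decode_document is a hypothetical function name, and the error message format is only an example.

```python
import codecs


def decode_document(data: bytes) -> str:
    """Decode raw bytes as strict UTF-8, skipping a leading BOM if present.

    Raises ValueError with the absolute byte offset of the first invalid sequence.
    """
    offset = len(codecs.BOM_UTF8) if data.startswith(codecs.BOM_UTF8) else 0
    try:
        return data[offset:].decode("utf-8")  # strict mode: invalid bytes raise
    except UnicodeDecodeError as exc:
        # exc.start is relative to the slice, so add the BOM length back.
        position = offset + exc.start
        raise ValueError(
            f"invalid UTF-8 at byte {position}: {exc.reason}"
        ) from exc
```

Reporting the byte offset rather than silently replacing bad bytes gives users a concrete location to inspect.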

For Generator Implementers

UTF-8 Output: Always generate valid UTF-8 sequences
BOM Decision: Consistently include or exclude BOM based on target system
Character Encoding: Properly encode all Unicode characters
Escape Sequences: Generate escape sequences for control characters when needed
Validation: Verify output contains valid UTF-8 before writing
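
A generator might escape control characters and validate its output along these lines. This sketch assumes JSON-style escapes (\n, \t, \uXXXX); the escape syntax is an assumption made for illustration, so check the Internet Object string rules for the exact forms. Both function names are hypothetical, and escaping of quotes and backslashes is out of scope here.

```python
# Assumed short escapes; the actual set is defined by the string syntax, not here.
SHORT_ESCAPES = {"\n": "\\n", "\r": "\\r", "\t": "\\t", "\f": "\\f"}


def escape_control_chars(text: str) -> str:
    """Replace control characters with escape sequences (assumed JSON-style)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif cp < 0x20 or 0x7F <= cp <= 0x9F:
            out.append(f"\\u{cp:04X}")  # remaining C0, DEL, and C1 controls
        else:
            out.append(ch)
    return "".join(out)


def encode_output(text: str) -> bytes:
    """Encode as UTF-8; raises UnicodeEncodeError for unencodable text
    such as lone surrogates, so invalid output is never written."""
    return text.encode("utf-8")
```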

Character Handling

Whitespace Characters

Internet Object recognizes these whitespace characters:

Space: U+0020 (ASCII space)
Tab: U+0009 (horizontal tab)
Line Feed: U+000A (LF, \n)
Carriage Return: U+000D (CR, \r)
Form Feed: U+000C (FF, \f)
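
In a tokenizer, this list typically becomes a small lookup set. A minimal sketch with hypothetical names (WHITESPACE, is_whitespace, skip_whitespace):

```python
# The five whitespace characters listed above.
WHITESPACE = frozenset("\u0020\u0009\u000A\u000D\u000C")


def is_whitespace(ch: str) -> bool:
    """Return True if `ch` is one of the listed whitespace characters."""
    return ch in WHITESPACE


def skip_whitespace(text: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after `pos`."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos
```

Using an explicit set instead of a general-purpose test such as Python's str.isspace keeps other Unicode spaces, for example U+00A0, from being treated as whitespace by accident.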

Line Endings

All standard line ending conventions are supported:

Unix/Linux: LF (\n)
Windows: CRLF (\r\n)
Classic Mac: CR (\r)
Mixed: Files that combine these conventions are handled gracefully
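
A line splitter that accepts all three conventions, including mixed input, can be sketched with one regular expression. The name split_lines is hypothetical. Python's built-in str.splitlines is avoided here because it also breaks on other characters, such as form feed and U+2028.

```python
import re

# Try CRLF first so that "\r\n" counts as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split `text` on LF, CRLF, or CR, in any combination."""
    return _LINE_BREAK.split(text)
```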

Control Characters

Printable Range: U+0020 to U+007E (ASCII printable)
Control Characters: U+0000 to U+001F, U+007F to U+009F
Handling: Control characters in strings must be properly escaped
Null Character: U+0000 should be avoided in text content
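
The control ranges above translate directly into a check that a validator could run on string content. A sketch with hypothetical names (is_control, find_controls):

```python
def is_control(ch: str) -> bool:
    """Return True for U+0000 to U+001F and U+007F to U+009F."""
    cp = ord(ch)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def find_controls(value: str) -> list:
    """Return (index, code point label) pairs for each raw control character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(value)
        if is_control(ch)
    ]
```

A result such as [(1, "U+0000")] tells the caller exactly which character needs escaping or removal.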

Best Practices

File Storage

Encoding Declaration: Specify UTF-8 encoding in file metadata when possible
File Extension: Use .io extension to indicate Internet Object format
BOM Consistency: Be consistent about BOM usage within a project
Validation: Validate UTF-8 encoding before processing

Cross-Platform Compatibility

Always UTF-8: Use UTF-8 for maximum compatibility
Line Endings: Allow any line ending style during parsing
Character Validation: Validate character sequences are legal Unicode
Encoding Detection: Implement robust encoding detection mechanisms
Escape Sequences: Use standard escape sequences for maximum interoperability

Character Validation

Surrogate Pairs: Properly handle UTF-16 surrogate pairs in escape sequences
Private Use: Define handling of private use area characters
Noncharacters: Decide treatment of Unicode noncharacter code points
Overlong Encoding: Reject overlong UTF-8 sequences
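
Three of these checks are sketched below in Python. The function names are hypothetical. Note that Python's strict UTF-8 decoder already rejects overlong sequences, so is_valid_utf8 relies on that behaviour rather than implementing the check by hand.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair (e.g. from two \\uXXXX escapes)
    into a single supplementary-plane code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0 to U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8 (overlong forms are rejected)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, the bytes C0 AF are an overlong encoding of "/" (U+002F) and fail validation, while the single byte 2F passes.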
