Files
datum/docs/string.md
Marco Cetica f8a77a6056
All checks were successful
clang-build / clang-build (push) Successful in 30s
gcc-build / gcc-build (push) Successful in 19s
Added String type documentation
2026-03-16 09:42:15 +01:00

85 lines
4.3 KiB
Markdown

# String Technical Details
In this document you can find a quick overview of the technical aspects (internal design, memory layout, etc.) of the `String` data structure.
`String` is an immutable, null-terminated string data type with partial UTF-8 support.
This means that methods return a new string instance rather than modifying the string in-place.
Internally, this data structure is represented by the following layout:
```c
typedef struct {
char *data;
size_t byte_size;
size_t byte_capacity;
size_t char_count;
} string_t;
```
where the `data` field represent the actual string, `byte_size`
indicates the actual size (in bytes), `byte_capacity` represents the total number
of allocated memory (in bytes) and `char_count` represents the number of symbols.
As mentioned earlier, this data type provides partial UTF-8 support. It is able
to recognize UTF-8 byte sequences as individual Unicode code points and has
full support for Unicode symbols such as emojis. However, it does not support
localization. In particular, it does not perform local-aware conversions. For instance, uppercase/lowercase transformations are limited to ASCII characters only. As a result, the German scharfes S (`ß`) is not convert to `SS`, the Spanish `Ñ` is not converted to `ñ` and the Italian `é` (and its variants) is not treated as a single symbol but rather as a base letter combined with an accent.
At the time being, `String` supports the following methods:
- `string_result_t string_new(c_str)`: create a new string;
- `string_result_t string_clone(str)`: clone an existing string;
- `string_result_t string_concat(x, y)`: concatenate two strings together;
- `string_result_t string_contains(haystack, needle)`: search whether the `haystack` string contains `needle`;
- `string_result_t string_slice(str, start, end)`: return a slice (a new string) from `str` between `start` and `end` indices (inclusive);
- `string_result_t string_eq(x, y, case_sensitive)`: check whether `x` and `y` are equal;
- `string_result_t string_get_at(str, position)`: get the UTF-8 symbol indexed by `position` from `str`;
- `string_result_t string_set_at(str, position, utf8_char)`: write a UTF-8 symbol into `str` at index `position`;
- `string_result_t string_to_lower(str)`: convert a string to lowercase;
- `string_result_t string_to_upper(str)`: convert a string to uppercase;
- `string_result_t string_reverse(str)`: reverse a string;
- `string_result_t string_trim(str)`: remove leading and trailing white space from a string;
- `string_result_t string_split(str, delim)`: split a string into an array of `string_t` by specifying a separator;
- `string_result_t string_destroy(str)`: remove a string from memory;
- `string_result_t string_split_destroy(split, count)`: remove an array of strings from memory;
- `size_t string_size(str)`: return string character count.
As you can see from the previous function signatures, most methods that operate on the `String`
data type return a custom type called `string_result_t` which is defined as follows:
```c
typedef enum {
STRING_OK = 0x0,
STRING_ERR_ALLOCATE,
STRING_ERR_INVALID,
STRING_ERR_INVALID_UTF8,
STRING_ERR_OVERFLOW
} string_status_t;
typedef struct {
string_status_t status;
uint8_t message[RESULT_MSG_SIZE];
union {
string_t *string;
char *symbol;
int64_t idx;
bool is_equ;
struct {
string_t **strings;
size_t count;
} split;
} value;
} string_result_t;
```
Each method that returns such type indicates whether the operation was successful or not
by setting the `status` field and by providing a descriptive message on the `message`
field. If the operation was successful (that is, `status == STRING_OK`) you can either
move on with the rest of your program or read the returned value from the sum data type.
The sum data type (i.e., the `value` union) defines five different variables.
Each of them has an unique scope as described below:
- `string`: result of `new`, `clone`, `slice`, `reverse` and `trim` functions;
- `symbol`: result of `get_at` function;
- `idx`: result of `contains` function;
- `is_eq`: result of `equ` function. It's true when two strings are equal, false otherwise;
- `split`: result of `split` function. It contains an array of `string_t` and its number of elements.