Ruby  3.4.0dev (2024-11-05 revision 348a53415339076afc4a02fcd09f3ae36e9c4c61)
string.h File Reference

Routines to manipulate encodings of strings.

#include "ruby/internal/dllexport.h"
#include "ruby/internal/value.h"
#include "ruby/internal/encoding/encoding.h"
#include "ruby/internal/attr/nonnull.h"
#include "ruby/internal/intern/string.h"

Functions

VALUE rb_enc_str_new (const char *ptr, long len, rb_encoding *enc)
 Identical to rb_str_new(), except it additionally takes an encoding.

VALUE rb_enc_str_new_cstr (const char *ptr, rb_encoding *enc)
 Identical to rb_enc_str_new(), except it assumes the passed pointer is a pointer to a C string.

VALUE rb_enc_str_new_static (const char *ptr, long len, rb_encoding *enc)
 Identical to rb_enc_str_new(), except it takes a C string literal.

VALUE rb_enc_interned_str (const char *ptr, long len, rb_encoding *enc)
 Identical to rb_enc_str_new(), except it returns a "f"string.

VALUE rb_enc_interned_str_cstr (const char *ptr, rb_encoding *enc)
 Identical to rb_enc_str_new_cstr(), except it returns a "f"string.

long rb_enc_strlen (const char *head, const char *tail, rb_encoding *enc)
 Counts the number of characters of the passed string, according to the passed encoding.

char * rb_enc_nth (const char *head, const char *tail, long nth, rb_encoding *enc)
 Queries the n-th character.

VALUE rb_obj_encoding (VALUE obj)
 Identical to rb_enc_get_index(), except the return type.

VALUE rb_enc_str_buf_cat (VALUE str, const char *ptr, long len, rb_encoding *enc)
 Identical to rb_str_cat(), except it additionally takes an encoding.

VALUE rb_enc_uint_chr (unsigned int code, rb_encoding *enc)
 Encodes the passed code point into a series of bytes.

VALUE rb_external_str_new_with_enc (const char *ptr, long len, rb_encoding *enc)
 Identical to rb_external_str_new(), except it additionally takes an encoding.

VALUE rb_str_export_to_enc (VALUE obj, rb_encoding *enc)
 Identical to rb_str_export(), except it additionally takes an encoding.

VALUE rb_str_conv_enc (VALUE str, rb_encoding *from, rb_encoding *to)
 Encoding conversion main routine.

VALUE rb_str_conv_enc_opts (VALUE str, rb_encoding *from, rb_encoding *to, int ecflags, VALUE ecopts)
 Identical to rb_str_conv_enc(), except it additionally takes IO encoder options.

int rb_enc_str_coderange (VALUE str)
 Scans the passed string to collect its code range.

long rb_str_coderange_scan_restartable (const char *str, const char *end, rb_encoding *enc, int *cr)
 Scans the passed string until it finds something odd.

int rb_enc_str_asciionly_p (VALUE str)
 Queries if the passed string is "ASCII only".

long rb_memsearch (const void *x, long m, const void *y, long n, rb_encoding *enc)
 Looks for the passed string in the passed buffer.
 

Detailed Description

Routines to manipulate encodings of strings.

Author
Ruby developers ruby-core@ruby-lang.org
Warning
Symbols prefixed with either RBIMPL or rbimpl are implementation details. Don't take them as canon. They could rapidly appear then vanish. The name (path) of this header file is also an implementation detail. Do not expect it to persist at the place it is now. Developers are free to move it anywhere anytime at will.
Note
To ruby-core: remember that this header can be included, possibly recursively, from extension libraries written in C++. Do not expect, for instance, that __VA_ARGS__ is always available. We assume C99 for ruby itself, but we make no assumption about the language of extension libraries. They could be written in C++98.

Definition in file string.h.

Function Documentation

◆ rb_enc_interned_str()

VALUE rb_enc_interned_str ( const char *  ptr,
long  len,
rb_encoding *  enc
)

Identical to rb_enc_str_new(), except it returns a "f"string, that is, a frozen, deduplicated (interned) string.

It can also be seen as a routine identical to rb_interned_str(), except it additionally takes an encoding.

Parameters
[in]  ptr  A memory region len bytes long.
[in]  len  Length of ptr, in bytes, not including the terminating NUL character.
[in]  enc  Encoding of ptr.
Exceptions
rb_eArgError  len is negative.
Returns
A found or created instance of rb_cString, of len bytes length, of enc encoding, whose contents are identical to those of ptr.
Precondition
At least len bytes of contiguous memory shall be accessible via ptr.
Note
enc can be a null pointer.

Definition at line 12506 of file string.c.

Referenced by rb_enc_interned_str_cstr().

◆ rb_enc_interned_str_cstr()

VALUE rb_enc_interned_str_cstr ( const char *  ptr,
rb_encoding enc 
)

Identical to rb_enc_str_new_cstr(), except it returns a "f"string, that is, a frozen, deduplicated (interned) string.

It can also be seen as a routine identical to rb_interned_str_cstr(), except it additionally takes an encoding.

Parameters
[in]  ptr  A C string.
[in]  enc  Encoding of ptr.
Returns
A found or created instance of rb_cString of enc encoding, whose contents are identical to those of ptr.
Precondition
ptr must point to a NUL-terminated C string; this function takes no len argument.
Note
enc can be a null pointer.

Definition at line 12528 of file string.c.

◆ rb_enc_nth()

char* rb_enc_nth ( const char *  head,
const char *  tail,
long  nth,
rb_encoding *  enc
)

Queries the n-th character.

Like rb_enc_strlen() this function can be fast or slow depending on the contents. Don't expect characters to be uniformly distributed across the entire string.

Parameters
[in]  head  Leftmost pointer to the string.
[in]  tail  Rightmost pointer to the string.
[in]  nth   Requested index of characters.
[in]  enc   Encoding of the string.
Returns
Pointer to the first byte of the character that is nth characters ahead of head, or tail if there is no such character (for example, when nth is out of bounds). The definition of "character" depends on the passed enc.
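
The return contract above can be illustrated with a minimal sketch for a single encoding. This is not Ruby's implementation: the helper name utf8_nth_sketch is hypothetical, it assumes well-formed UTF-8, and it takes no enc argument because it only understands UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, UTF-8 only: returns a pointer to the first byte of the
 * character that is nth characters ahead of head, or tail when there is no
 * such character. Assumes the input is well-formed UTF-8. */
static const char *
utf8_nth_sketch(const char *head, const char *tail, long nth)
{
    const char *p = head;
    assert(head <= tail && nth >= 0);
    while (p < tail && nth > 0) {
        p++;                                        /* step over the lead byte */
        while (p < tail && ((unsigned char)*p & 0xC0) == 0x80) {
            p++;                                    /* skip continuation bytes */
        }
        nth--;
    }
    return p;
}
```

Because the walk has to visit every character before the requested one, the cost grows with nth whenever multibyte characters are present, which is the "fast or slow depending on the contents" behaviour described above.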

Definition at line 2921 of file string.c.

Referenced by rb_str_ellipsize(), and rb_str_format().

◆ rb_enc_str_asciionly_p()

int rb_enc_str_asciionly_p ( VALUE  str)

Queries if the passed string is "ASCII only".

An ASCII only string is a string that has no non-ASCII characters at all. This doesn't necessarily mean the string is in ASCII encoding. For instance, a String of CP932 encoding can well be ASCII only, depending on its contents.

Parameters
[in]  str  String in question.
Return values
1  It doesn't have non-ASCII characters.
0  It has characters that are out of ASCII.
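
The distinction between "ASCII only" and "ASCII encoded" can be sketched on raw bytes. The helper below is hypothetical and is not how Ruby implements the check; it assumes an ASCII-compatible encoding such as CP932 or UTF-8, where every ASCII character is a single byte below 0x80.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if no byte in ptr[0..len) has its high bit
 * set, 0 otherwise. Only meaningful for ASCII-compatible encodings. */
static int
bytes_ascii_only_sketch(const char *ptr, size_t len)
{
    assert(ptr != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)ptr[i] & 0x80) return 0;
    }
    return 1;
}
```

Under that assumption, the CP932 bytes of "hello" pass the check, while the two bytes 0x82 0xA0 (HIRAGANA LETTER A in CP932) do not.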

Definition at line 899 of file string.c.

Referenced by rb_inspect(), and rb_reg_quote().

◆ rb_enc_str_buf_cat()

VALUE rb_enc_str_buf_cat ( VALUE  str,
const char *  ptr,
long  len,
rb_encoding *  enc
)

Identical to rb_str_cat(), except it additionally takes an encoding.

Parameters
[out]  str  Destination object.
[in]   ptr  Contents to append.
[in]   len  Length of ptr, in bytes.
[in]   enc  Encoding of ptr.
Exceptions
rb_eArgError        len is negative.
rb_eEncCompatError  enc is not compatible with str.
Returns
The passed str.
Postcondition
The contents of ptr are copied, transcoded into str's encoding, then appended to the end of str.

Definition at line 3597 of file string.c.

Referenced by rb_reg_regsub().

◆ rb_enc_str_coderange()

int rb_enc_str_coderange ( VALUE  str)

Scans the passed string to collect its code range.

Because Ruby's strings are mutable, their contents change from time to time, and so do their code ranges. A long-lived string tends to fall back to RUBY_ENC_CODERANGE_UNKNOWN. This API scans it and re-assigns a fine-grained code range constant.

Parameters
[out]  str  A string.
Returns
An enum ruby_coderange_type.

Definition at line 880 of file string.c.

Referenced by rb_econv_append(), rb_str_buf_append(), and rb_str_comparable().

◆ rb_enc_str_new()

VALUE rb_enc_str_new ( const char *  ptr,
long  len,
rb_encoding *  enc
)

Identical to rb_str_new(), except it additionally takes an encoding.

Parameters
[in]  ptr  A memory region len bytes long.
[in]  len  Length of ptr, in bytes, not including the terminating NUL character.
[in]  enc  Encoding of ptr.
Exceptions
rb_eNoMemError  Failed to allocate len+1 bytes.
rb_eArgError    len is negative.
Returns
An instance of rb_cString, of len bytes length, of enc encoding, whose contents are a verbatim copy of ptr.
Precondition
At least len bytes of contiguous memory shall be accessible via ptr.
Note
enc can be a null pointer. It can also be seen as a routine identical to rb_usascii_str_new() then.

Definition at line 1042 of file string.c.

Referenced by rb_enc_str_new_cstr(), rb_enc_uint_chr(), rb_external_str_new_with_enc(), and rb_intern3().

◆ rb_enc_str_new_cstr()

VALUE rb_enc_str_new_cstr ( const char *  ptr,
rb_encoding *  enc
)

Identical to rb_enc_str_new(), except it assumes the passed pointer is a pointer to a C string.

It can also be seen as a routine identical to rb_str_new_cstr(), except it additionally takes an encoding.

Parameters
[in]  ptr  A C string.
[in]  enc  Encoding of ptr.
Exceptions
rb_eNoMemError  Failed to allocate memory.
Returns
An instance of rb_cString, of enc encoding, whose contents are a verbatim copy of ptr.
Precondition
ptr must not be a null pointer.
Because ptr is a C string it makes no sense for enc to be something like UTF-32.
Note
enc can be a null pointer. It can also be seen as a routine identical to rb_usascii_str_new_cstr() then.

Definition at line 1082 of file string.c.

◆ rb_enc_str_new_static()

VALUE rb_enc_str_new_static ( const char *  ptr,
long  len,
rb_encoding *  enc
)

Identical to rb_enc_str_new(), except it takes a C string literal.

It can also be seen as a routine identical to rb_str_new_static(), except it additionally takes an encoding.

Parameters
[in]  ptr  A C string literal.
[in]  len  strlen(ptr).
[in]  enc  Encoding of ptr.
Exceptions
rb_eArgError  len is out of range of size_t.
Precondition
ptr must be a C string constant.
Returns
An instance of rb_cString, of enc encoding, whose backend storage is the passed C string literal.
Warning
It is a very bad idea to write to a C string literal (this often causes an immediate SEGV). Consider return values of this function to be read-only.
Note
enc can be a null pointer. It can also be seen as a routine identical to rb_usascii_str_new_static() then.

Definition at line 1135 of file string.c.

◆ rb_enc_strlen()

long rb_enc_strlen ( const char *  head,
const char *  tail,
rb_encoding *  enc
)

Counts the number of characters of the passed string, according to the passed encoding.

This has to be complicated. The passed string could be invalid and/or broken. This routine scans from the beginning to the end, byte by byte, to seek out character boundaries. It can be very slow.

Parameters
[in]  head  Leftmost pointer to the string.
[in]  tail  Rightmost pointer to the string.
[in]  enc   Encoding of the string.
Returns
Number of characters that exist in head .. tail. The definition of "character" depends on the passed enc.
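
To see why counting has to walk the bytes, here is a minimal sketch for one encoding only. It is not Ruby's implementation: the helper name utf8_strlen_sketch is hypothetical, it understands UTF-8 only (so it takes no enc argument), and it simply counts every byte that is not a continuation byte, which means stray continuation bytes in broken input are not counted at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, UTF-8 only: counts characters in [head, tail).
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a character. */
static long
utf8_strlen_sketch(const char *head, const char *tail)
{
    long count = 0;
    assert(head <= tail);
    for (const char *p = head; p < tail; p++) {
        unsigned char c = (unsigned char)*p;
        if ((c & 0xC0) != 0x80) count++;
    }
    return count;
}
```

Even in this simplified form the scan is linear in the number of bytes, since character boundaries cannot be known without looking at the contents.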

Definition at line 2251 of file string.c.

Referenced by rb_str_format().

◆ rb_enc_uint_chr()

VALUE rb_enc_uint_chr ( unsigned int  code,
rb_encoding *  enc
)

Encodes the passed code point into a series of bytes.

Parameters
[in]  code  Code point.
[in]  enc   Target encoding scheme.
Exceptions
rb_eRangeError  code cannot be represented in enc.
Returns
An instance of rb_cString, of enc encoding, whose sole content is code represented in enc.
Note
No way to encode code points bigger than UINT_MAX.
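
As an illustration of what "encoding a code point into a series of bytes" means, the following sketch does it for UTF-8 only. It is not Ruby's implementation: utf8_encode_sketch is a hypothetical name, it writes into a caller-supplied buffer rather than returning a String, and its 0 return value is merely this sketch's analogue of the rb_eRangeError case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, UTF-8 only: writes the encoding of code into buf
 * (which must hold at least 4 bytes) and returns the number of bytes written,
 * or 0 when code is not encodable (a surrogate, or beyond U+10FFFF). */
static size_t
utf8_encode_sketch(unsigned int code, unsigned char *buf)
{
    assert(buf != NULL);
    if (code < 0x80) {
        buf[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (code >> 6));
        buf[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code < 0x10000) {
        if (code >= 0xD800 && code <= 0xDFFF) return 0;  /* surrogates */
        buf[0] = (unsigned char)(0xE0 | (code >> 12));
        buf[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    if (code <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (code >> 18));
        buf[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (code & 0x3F));
        return 4;
    }
    return 0;  /* out of range for this encoding */
}
```

For example, U+3042 comes out as the three bytes E3 81 82, and 0x110000 is rejected.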

Definition at line 3803 of file numeric.c.

Referenced by rb_io_ungetc().

◆ rb_external_str_new_with_enc()

VALUE rb_external_str_new_with_enc ( const char *  ptr,
long  len,
rb_encoding enc 
)

Identical to rb_external_str_new(), except it additionally takes an encoding.

However, the whole point of rb_external_str_new() is to encode a string into the default external encoding. Being able to specify an arbitrary encoding arguably defeats the purpose the function was designed for.

Parameters
[in]  ptr  A memory region len bytes long.
[in]  len  Length of ptr, in bytes, not including the terminating NUL character.
[in]  enc  Target encoding scheme.
Exceptions
rb_eArgError  len is negative.
Returns
An instance of rb_cString. If the encoding conversion from "default internal" to enc is fully defined over the given contents, the return value is a string of enc encoding whose contents are the converted ones. Otherwise the string is junk.
Warning
It doesn't raise on a conversion failure and silently ends up with corrupted output. You can detect the failure by querying valid_encoding? on the result object.

Definition at line 1276 of file string.c.

Referenced by rb_external_str_new(), rb_external_str_new_cstr(), rb_filesystem_str_new(), rb_filesystem_str_new_cstr(), rb_locale_str_new(), and rb_locale_str_new_cstr().

◆ rb_memsearch()

long rb_memsearch ( const void *  x,
long  m,
const void *  y,
long  n,
rb_encoding *  enc
)

Looks for the passed string in the passed buffer.

Parameters
[in]  x    Query string.
[in]  m    Number of bytes of x.
[in]  y    Buffer that potentially includes x.
[in]  n    Number of bytes of y.
[in]  enc  Encoding of both x and y.
Return values
-1         Not found.
otherwise  Found index in y.
Note
This API can match at a non-character-boundary.
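
What a match at a non-character boundary looks like can be shown with a naive byte-wise search. This sketch is not the algorithm Ruby uses: naive_memsearch_sketch is a hypothetical name and it ignores the encoding entirely, which is exactly why it can report a hit in the middle of a multibyte character.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: naive byte-wise search for x (m bytes) in y (n bytes).
 * Returns the byte index of the first match in y, or -1 when not found. */
static long
naive_memsearch_sketch(const void *x, long m, const void *y, long n)
{
    const unsigned char *ys = (const unsigned char *)y;
    assert(m >= 0 && n >= 0);
    for (long i = 0; i + m <= n; i++) {
        if (memcmp(ys + i, x, (size_t)m) == 0) return i;
    }
    return -1;
}
```

With the six UTF-8 bytes E3 81 82 E3 81 84 (two three-byte characters) as y and the two bytes 82 E3 as x, the search reports index 2, which straddles the boundary between the two characters. Callers that need character-aligned results have to verify the returned index themselves.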

Definition at line 252 of file re.c.

◆ rb_obj_encoding()

VALUE rb_obj_encoding ( VALUE  obj)

Identical to rb_enc_get_index(), except the return type.

Parameters
[in]  obj  Object in question.
Exceptions
rb_eTypeError  obj is incapable of having an encoding.
Returns
obj's encoding.

Definition at line 1148 of file encoding.c.

◆ rb_str_coderange_scan_restartable()

long rb_str_coderange_scan_restartable ( const char *  str,
const char *  end,
rb_encoding *  enc,
int *  cr 
)

Scans the passed string until it finds something odd.

Returns the number of bytes scanned. As the name implies, this is suitable for repeated calls. One of its applications is IO#readlines. The method reads from its receiver's read buffer, maybe more than once, looking for newlines. But "newline" can differ among encodings. This API is used to detect broken contents and properly mark them as such.

Parameters
[in]   str  String to scan.
[in]   end  End of str.
[in]   enc  str's encoding.
[out]  cr   Return buffer.
Returns
Distance, in bytes, between str and the first byte where the contents are broken.
Postcondition
cr has the code range type.
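
The calling pattern (scan a chunk, get back a byte count, carry the code range in cr to the next call) can be sketched as follows. This is not Ruby's implementation and makes several assumptions: the enum constants and the function name are hypothetical stand-ins, only UTF-8 is handled, only the shape of each byte sequence is checked (overlong forms and surrogates are not rejected), and a sequence truncated at the end of the chunk is treated as broken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for enum ruby_coderange_type; values are made up. */
enum sketch_coderange { SKETCH_CR_7BIT, SKETCH_CR_VALID, SKETCH_CR_BROKEN };

/* Hypothetical helper, UTF-8 only. Scans [str, end) and returns the number of
 * bytes before the first malformed sequence, or the full length if none.
 * *cr carries state across calls; pass SKETCH_CR_7BIT for the first chunk. */
static long
utf8_scan_sketch(const char *str, const char *end, int *cr)
{
    const unsigned char *p = (const unsigned char *)str;
    const unsigned char *e = (const unsigned char *)end;
    assert(str <= end && cr != NULL);
    if (*cr == SKETCH_CR_BROKEN) return 0;     /* already known to be broken */
    while (p < e) {
        int extra;                             /* continuation bytes expected */
        if (*p < 0x80) { p++; continue; }      /* plain ASCII byte */
        if ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else extra = -1;                       /* stray or invalid lead byte */
        if (extra < 0 || e - p <= extra) goto broken;
        for (int i = 1; i <= extra; i++) {
            if ((p[i] & 0xC0) != 0x80) goto broken;
        }
        p += extra + 1;
        *cr = SKETCH_CR_VALID;                 /* saw a non-ASCII character */
    }
    return (long)(p - (const unsigned char *)str);
broken:
    *cr = SKETCH_CR_BROKEN;
    return (long)(p - (const unsigned char *)str);
}
```

A caller in this model feeds successive chunks with the same cr variable: the value stays at the 7-bit constant while only ASCII has been seen, moves to the valid constant after the first multibyte character, and sticks at the broken constant once a malformed sequence is found.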

Definition at line 764 of file string.c.

Referenced by rb_econv_append(), and rb_str_set_len().

◆ rb_str_conv_enc()

VALUE rb_str_conv_enc ( VALUE  str,
rb_encoding *  from,
rb_encoding *  to
)

Encoding conversion main routine.

Parameters
[in]  str   String to convert.
[in]  from  Source encoding.
[in]  to    Destination encoding.
Returns
A copy of str, with conversion from from to to applied.
Note
from can be a null pointer. str's encoding is taken then.
to can be a null pointer. No-op then.

Definition at line 1270 of file string.c.

Referenced by rb_dir_getwd(), rb_str_encode_ospath(), and rb_str_export_to_enc().

◆ rb_str_conv_enc_opts()

VALUE rb_str_conv_enc_opts ( VALUE  str,
rb_encoding *  from,
rb_encoding *  to,
int  ecflags,
VALUE  ecopts 
)

Identical to rb_str_conv_enc(), except it additionally takes IO encoder options.

The extra arguments can be constructed using io_extract_modeenc() etc.

Parameters
[in]  str      String to convert.
[in]  from     Source encoding.
[in]  to       Destination encoding.
[in]  ecflags  A set of enum ruby_econv_flag_type.
[in]  ecopts   Optional hash.
Returns
A copy of str, with conversion from from to to applied.
Note
from can be a null pointer. str's encoding is taken then.
to can be a null pointer. No-op then.
ecopts can be RUBY_Qnil, which is equivalent to passing an empty hash.

Definition at line 1154 of file string.c.

Referenced by rb_str_conv_enc().

◆ rb_str_export_to_enc()

VALUE rb_str_export_to_enc ( VALUE  obj,
rb_encoding *  enc
)

Identical to rb_str_export(), except it additionally takes an encoding.

Parameters
[in]  obj  Target object.
[in]  enc  Target encoding.
Exceptions
rb_eTypeError  No implicit conversion to String.
Returns
Converted Ruby string of enc encoding.

Definition at line 1375 of file string.c.

Referenced by rb_str_export(), and rb_str_export_locale().