.\"
.\" This file and its contents are supplied under the terms of the
.\" Common Development and Distribution License ("CDDL"), version 1.0.
.\" You may only use this file in accordance with the terms of version
.\" 1.0 of the CDDL.
.\"
.\" A full copy of the text of the CDDL should have accompanied this
.\" source. A copy of the CDDL is also available via the Internet at
.\" http://www.illumos.org/license/CDDL.
.\"
.\"
.\" Copyright 2020 Robert Mustacchi
.\"
.Dd September 20, 2021
.Dt MBRTOC16 3C
.Os
.Sh NAME
.Nm mbrtoc16 ,
.Nm mbrtoc32 ,
.Nm mbrtowc ,
.Nm mbrtowc_l
.Nd convert characters to wide characters
.Sh SYNOPSIS
.In wchar.h
.Ft size_t
.Fo mbrtowc
.Fa "wchar_t *restrict pwc"
.Fa "const char *restrict str"
.Fa "size_t len"
.Fa "mbstate_t *restrict ps"
.Fc
.In wchar.h
.In xlocale.h
.Ft size_t
.Fo mbrtowc_l
.Fa "wchar_t *restrict pwc"
.Fa "const char *restrict str"
.Fa "size_t len"
.Fa "mbstate_t *restrict ps"
.Fa "locale_t loc"
.Fc
.In uchar.h
.Ft size_t
.Fo mbrtoc16
.Fa "char16_t *restrict p16c"
.Fa "const char *restrict str"
.Fa "size_t len"
.Fa "mbstate_t *restrict ps"
.Fc
.Ft size_t
.Fo mbrtoc32
.Fa "char32_t *restrict p32c"
.Fa "const char *restrict str"
.Fa "size_t len"
.Fa "mbstate_t *restrict ps"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtoc16 ,
.Fn mbrtoc32 ,
.Fn mbrtowc ,
and
.Fn mbrtowc_l
functions convert multi-byte character sequences into one of several
wide character formats.
Each function produces the following output format:
.Bl -tag -width mbrtowc_l
.It Fn mbrtoc16
A UTF-16 code sequence, where every code point is represented by one or
two
.Vt char16_t .
UTF-16 encodes code points outside the Basic Multilingual Plane as a pair
of 16-bit code units, commonly referred to as a surrogate pair.
.It Fn mbrtoc32
A UTF-32 code sequence, where every code point is represented by a
single
.Vt char32_t .
.It Fn mbrtowc , Fn mbrtowc_l
Wide characters, being a 32-bit value where every code point is
represented by a single
.Vt wchar_t .
While
.Vt wchar_t
and
.Vt char32_t
are different types, in this implementation they use similar encodings.
.El
.Pp
The functions consume up to
.Fa len
bytes from the string
.Fa str
and accumulate them in
.Fa ps
until a valid character is found.
What constitutes a valid character is determined by the
.Dv LC_CTYPE
category of the current locale.
For example, in the
.Sy C
locale, only ASCII characters are recognized, while in a
.Sy UTF-8
based locale like
.Sy en_US.UTF-8 ,
UTF-8 multi-byte character sequences that represent Unicode code points
are recognized.
The
.Fn mbrtowc_l
function uses the locale passed in
.Fa loc
rather than the locale of the current thread.
.Pp
When a valid character sequence has been found, it is converted to
a wide character for
.Fn mbrtowc
and
.Fn mbrtowc_l ,
a 16-bit character sequence for
.Fn mbrtoc16 ,
or a 32-bit character sequence for
.Fn mbrtoc32 ,
and is stored in
.Fa pwc ,
.Fa p16c ,
or
.Fa p32c
respectively.
.Pp
The
.Fa ps
argument represents a multi-byte conversion state which can be used
across multiple calls to a given function
.Pq but not mixed between functions .
This allows a single character to be assembled from bytes that are spread
across several buffers, that is, different values of
.Fa str .
The functions may be called from multiple threads as long as they use
unique values for
.Fa ps .
If
.Fa ps
is
.Dv NULL ,
then a function-specific buffer will be used for the conversion state;
however, this buffer is shared between all threads and its use is not
recommended.
.Pp
When using these functions, more than one character may be output for a
given set of consumed input characters.
An example of this is when a given code point is represented as a
surrogate pair in UTF-16, which requires two 16-bit code units to
represent a code point.
When this occurs, the call that outputs a character without consuming
any input returns the special return value
.Sy (size_t)-3 .
.Pp
The functions all have a special behavior when
.Dv NULL
is passed for
.Fa str .
They instead behave as though
.Fa pwc ,
.Fa p16c ,
or
.Fa p32c
were
.Dv NULL ,
.Fa str
were the empty string, "", and
.Fa len
were 1.
In other words, the functions would be called as:
.Bd -literal -offset indent
mbrtowc(NULL, "", 1, ps)
mbrtowc_l(NULL, "", 1, ps, loc)
mbrtoc16(NULL, "", 1, ps)
mbrtoc32(NULL, "", 1, ps)
.Ed
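.Pp
One possible use of this behavior, sketched below, is checking at the end
of the input whether a partial character is still pending in the
conversion state.
The helper name input_ended_cleanly is hypothetical.
The sketch assumes that an implementation reports a pending partial
sequence followed by the null byte as an error rather than returning 0;
the standard
.Fn mbsinit
function is an alternative that only inspects the state.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if no partial character is pending in
 * *ps once all input has been supplied, and 0 otherwise. The state should
 * not be reused after a 0 return without being reset.
 */
static int
input_ended_cleanly(mbstate_t *ps)
{
	/* A NULL str makes this equivalent to mbrtowc(NULL, "", 1, ps). */
	return (mbrtowc(NULL, NULL, 0, ps) == 0);
}
```

With a freshly zeroed
.Vt mbstate_t ,
the call returns 0 and the helper reports a clean end of input.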
.Ss Locale Details
Not all locales in the system are Unicode based locales.
For example, ISO 8859 family locales have code points with values that
do not match their counterparts in Unicode.
When using these functions with non-Unicode based locales, the code
points returned will be those determined by the locale.
They will not be converted to the corresponding Unicode code point.
For example, if using the Euro sign in ISO 8859-15, these functions
might return the code point 0xa4 and not the Unicode value 0x20ac.
.Pp
Regardless of the locale, the characters returned will be encoded as
though the code point were the corresponding value in Unicode.
This means that if a locale returns a value that would require a
surrogate pair in the UTF-16 encoding,
.Fn mbrtoc16
will still encode it as a surrogate pair.
.Pp
This behavior of the
.Fn mbrtoc16
and
.Fn mbrtoc32
functions should not be relied upon, is not portable, and is subject to
change for non-Unicode locales.
.Sh RETURN VALUES
The
.Fn mbrtoc16 ,
.Fn mbrtoc32 ,
.Fn mbrtowc ,
and
.Fn mbrtowc_l
functions return the following values:
.Bl -tag -width (size_t)-3
.It Sy 0
.Fa len
or fewer bytes of
.Fa str
were consumed and the null wide character was written into the wide
character buffer
.Po
.Fa pwc ,
.Fa p16c ,
.Fa p32c
.Pc .
.It Sy between 1 and len
The specified number of bytes were consumed and a single character was
written into the wide character buffer
.Po
.Fa pwc ,
.Fa p16c ,
.Fa p32c
.Pc .
.It Sy (size_t)-1
An encoding error has occurred.
The next
.Fa len
bytes of
.Fa str
do not contribute to a valid character.
.Va errno
has been set to
.Er EILSEQ .
No data was written into the wide character buffer
.Po
.Fa pwc ,
.Fa p16c ,
.Fa p32c
.Pc .
.It Sy (size_t)-2
.Fa len
bytes of
.Fa str
were consumed, but a complete multi-byte character sequence has not been
found and no data was written into the wide character buffer
.Po
.Fa pwc ,
.Fa p16c ,
.Fa p32c
.Pc .
.It Sy (size_t)-3
A character has been written into the wide character buffer
.Po
.Fa pwc ,
.Fa p16c ,
.Fa p32c
.Pc .
This character was from a previous call (such as another part of a
UTF-16 surrogate pair) and no input was consumed.
This is limited to the
.Fn mbrtoc16
and
.Fn mbrtoc32
functions.
.El
.Sh EXAMPLES
.Sy Example 1
Using the
.Fn mbrtoc32
function to convert a multibyte string.
.Bd -literal
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>
#include <stdio.h>
#include <uchar.h>

int
main(void)
{
	mbstate_t mbs;
	char32_t out;
	size_t ret;
	/* UTF-8 encoding of U+5149 */
	const char *uchar_str = "\exe5\ex85\ex89";

	/* Start from the initial conversion state. */
	(void) memset(&mbs, 0, sizeof (mbs));
	(void) setlocale(LC_CTYPE, "en_US.UTF-8");
	ret = mbrtoc32(&out, uchar_str, strlen(uchar_str), &mbs);
	if (ret != strlen(uchar_str)) {
		errx(EXIT_FAILURE, "failed to convert string, got %zd",
		    ret);
	}

	(void) printf("Converted %zu bytes into UTF-32 character "
	    "0x%x\en", ret, out);
	return (0);
}
.Ed
.Pp
When compiled and run, this produces:
.Bd -literal -offset indent
$ ./a.out
Converted 3 bytes into UTF-32 character 0x5149
.Ed
.Pp
.Sy Example 2
Handling surrogate pairs from the
.Fn mbrtoc16
function.
.Bd -literal
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>
#include <stdio.h>
#include <uchar.h>

int
main(void)
{
	mbstate_t mbs;
	char16_t first, second;
	size_t ret;
	/* UTF-8 encoding of U+1F4A9 */
	const char *uchar_str = "\exf0\ex9f\ex92\exa9";

	/* Start from the initial conversion state. */
	(void) memset(&mbs, 0, sizeof (mbs));
	(void) setlocale(LC_CTYPE, "en_US.UTF-8");
	/* The first call consumes the input and stores the first surrogate. */
	ret = mbrtoc16(&first, uchar_str, strlen(uchar_str), &mbs);
	if (ret != strlen(uchar_str)) {
		errx(EXIT_FAILURE, "failed to convert string, got %zd",
		    ret);
	}

	/* The second call consumes nothing and stores the second surrogate. */
	ret = mbrtoc16(&second, "", 0, &mbs);
	if (ret != (size_t)-3) {
		errx(EXIT_FAILURE, "didn't get second surrogate, "
		    "got %zd", ret);
	}

	(void) printf("UTF-16 surrogates: 0x%x 0x%x\en", first, second);
	return (0);
}
.Ed
.Pp
When compiled and run, this produces:
.Bd -literal -offset indent
$ ./a.out
UTF-16 surrogates: 0xd83d 0xdca9
.Ed
.Sh ERRORS
The
.Fn mbrtoc16 ,
.Fn mbrtoc32 ,
.Fn mbrtowc ,
and
.Fn mbrtowc_l
functions will fail if:
.Bl -tag -width Er
.It Er EINVAL
The conversion state in
.Fa ps
is invalid.
.It Er EILSEQ
An invalid character sequence has been detected.
.El
.Sh MT-LEVEL
The
.Fn mbrtoc16 ,
.Fn mbrtoc32 ,
.Fn mbrtowc ,
and
.Fn mbrtowc_l
functions are
.Sy MT-Safe
as long as different
.Vt mbstate_t
structures are passed in
.Fa ps .
If
.Fa ps
is
.Dv NULL
or different threads use the same value for
.Fa ps ,
then the functions are
.Sy Unsafe .
.Sh INTERFACE STABILITY
.Sy Committed
.Sh SEE ALSO
.Xr c16rtomb 3C ,
.Xr c32rtomb 3C ,
.Xr newlocale 3C ,
.Xr setlocale 3C ,
.Xr uselocale 3C ,
.Xr wcrtomb 3C ,
.Xr uchar.h 3HEAD ,
.Xr environ 5