[rabbitmq-discuss] Ascii text for strings
Tony Garnock-Jones
tonyg at lshift.net
Fri Jan 11 09:55:25 GMT 2008
Hi,
Berlin Brown wrote:
> According to the amqp protocol, are ascii strings supported or for
> most operations:
If your string contains only 7-bit ASCII characters, i.e. codepoints
less than 128, then a UTF-8 encoding of that string is the same byte
sequence as the raw ASCII encoding.
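A minimal Python sketch of that equivalence. The sample values and the helper name `is_seven_bit` are illustrative assumptions, not anything taken from the AMQP spec:

```python
def is_seven_bit(text: str) -> bool:
    """Return True if every codepoint in text is below 128."""
    return all(ord(ch) < 128 for ch in text)


# Illustrative values only; any 7-bit string behaves the same way.
for sample in ("en_US", "PLAIN", "guest"):
    assert is_seven_bit(sample)
    # For pure ASCII text the two encodings yield identical bytes.
    assert sample.encode("ascii") == sample.encode("utf-8")

# A codepoint of 128 or above has no ASCII encoding, and its UTF-8
# encoding takes more than one byte.
assert not is_seven_bit("é")
assert len("é".encode("utf-8")) == 2
```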
> For example, in the Connection.start_ok AMQP operation:
> You have to send the locale and the login properties (shortstr):
You can pretty much choose to send ASCII anytime UTF-8 is expected.
You will of course have to deal with potentially *receiving* byte
sequences encoding characters with codepoints greater than or equal to
128... but, depending on your application, this could be as simple as a
gentleman's agreement between all participants to avoid non-ASCII
codepoints...
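On the receiving side, one way to enforce such an agreement is to reject anything that is not 7-bit clean. This is only a sketch; `decode_ascii_only` is a hypothetical helper, not part of any AMQP client library:

```python
def decode_ascii_only(data: bytes) -> str:
    """Decode received string bytes, refusing any byte >= 128.

    Hypothetical helper: it enforces an ASCII-only agreement between
    participants rather than anything the protocol itself requires.
    """
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        raise ValueError("peer sent a non-ASCII byte sequence") from exc
```

An application that prefers to tolerate non-ASCII input could decode as UTF-8 instead, since every ASCII byte sequence is also valid UTF-8.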
> Must that data be sent as UTF-8 string?
It must. §4.2.5.3 of AMQP 0-8 reads:
"
AMQP strings are variable length and represented by an integer length
followed by zero or more octets of data. AMQP defines two string types:
 - Short strings, stored as an 8-bit unsigned integer length followed
   by zero or more octets of data. Short strings can carry up to 255
   octets of UTF-8 data, but may not contain binary zero octets.
 - Long strings, stored as a 32-bit unsigned integer length followed by
   zero or more octets of data. Long strings can contain any data.
"
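The two layouts described in that passage could be sketched in Python roughly as follows. The function names are hypothetical, and the sketch assumes the 32-bit length is sent in network (big-endian) byte order, which the quoted text does not itself state:

```python
import struct


def encode_shortstr(text: str) -> bytes:
    """Encode text as an 8-bit length followed by UTF-8 octets."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("shortstr is limited to 255 octets")
    if b"\x00" in data:
        raise ValueError("shortstr may not contain binary zero octets")
    return struct.pack("B", len(data)) + data


def encode_longstr(data: bytes) -> bytes:
    """Encode arbitrary octets as a 32-bit length followed by the data.

    Assumption: the length is an unsigned big-endian integer.
    """
    return struct.pack(">I", len(data)) + data
```

Note that the 255 limit applies to encoded octets, not characters, so a string with non-ASCII characters can hit the limit with fewer than 255 characters.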
> The protocol spec seems to be
> light detailing unicode support.
You're right about this. Generally, whenever longstr is used *as a
string*, the consensus among the current implementors seems to be to
assume UTF-8. In other places, of course, it's used as a chunk of binary
data, with no assumed encoding at all.
> Or does it matter, can the server
> detect the unicode type?
It *could*, if certain restrictions were placed on the content, but in
practice, this is not done.
One way of doing it (probably best applicable to longstr, leaving
shortstr unambiguously restricted to UTF-8) would be to first assume
Unicode (because otherwise we're completely doomed), and examine the
bytes for the presence of a UTF-32, UTF-16 or UTF-8 BOM. If present,
these give a clear indication of the coding and endianness. If absent,
one could either heuristically guess the encoding based on the presence
and position of NULs (not recommended) or simply assume UTF-8.
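A rough Python sketch of the BOM approach, falling back to UTF-8 when no BOM is present. The function name `decode_longstr_text` is hypothetical, and nothing here is mandated by the AMQP spec:

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_longstr_text(data: bytes) -> str:
    """Decode a longstr used as text, using a BOM if one is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode the rest with the indicated coding.
            return data[len(bom):].decode(encoding)
    # No BOM: simply assume UTF-8 rather than guessing from NUL positions.
    return data.decode("utf-8")
```

One residual ambiguity remains: UTF-16 LE text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32 LE BOM, and this sketch would treat it as UTF-32.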
Tony
--
[][][] Tony Garnock-Jones | Mob: +44 (0)7905 974 211
[][] LShift Ltd | Tel: +44 (0)20 7729 7060
[] [] http://www.lshift.net/ | Email: tonyg at lshift.net