[rabbitmq-discuss] Ascii text for strings

Fri Jan 11 09:55:25 GMT 2008

Hi,

Berlin Brown wrote:
> According to the amqp protocol, are ascii strings supported or for
> most operations:

If your string contains only 7-bit ASCII characters, i.e. codepoints
less than 128, then a UTF-8 encoding of that string is the same byte
sequence as the raw ASCII encoding.
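
A quick way to convince yourself of this, sketched in Python (the
sample strings are just illustrations, not anything AMQP mandates):

```python
# A string whose code points are all below 128 encodes to the same
# bytes under ASCII and under UTF-8.
text = "en_US"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# Once a code point of 128 or above appears, UTF-8 needs more than one
# byte for it, and the ASCII codec refuses the string outright.
non_ascii = "caf\u00e9"  # "café"
utf8_length = len(non_ascii.encode("utf-8"))
try:
    non_ascii.encode("ascii")
    ascii_rejected = False
except UnicodeEncodeError:
    ascii_rejected = True
```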

> For example, in the Connection.start_ok AMQP operation:
> You have to send the locale and the login properties (shortstr):

You can pretty much choose to send ASCII anytime UTF-8 is expected.

You will of course have to deal with potentially *receiving* byte
sequences encoding characters with codepoints greater than or equal to
128... but, depending on your application, this could be as simple as a
gentleman's agreement between all participants to avoid non-ASCII
codepoints...
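
If you go the agreement route, it may still be worth checking on the
receiving side. A minimal sketch in Python, where require_ascii is a
hypothetical helper name of my own and not part of any client library:

```python
def require_ascii(data: bytes) -> str:
    """Decode received bytes, rejecting anything outside 7-bit ASCII."""
    # Any byte of 128 or above means a peer broke the ASCII-only agreement.
    if any(byte >= 128 for byte in data):
        raise ValueError("received non-ASCII bytes")
    return data.decode("ascii")
```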

> Must that data be sent as UTF-8 string?

It must. §4.2.5.3 of AMQP 0-8 reads:

"
AMQP strings are variable length and represented by an integer length
followed by zero or more octets of data. AMQP defines two string types:

    Short strings, stored as an 8-bit unsigned integer length followed
    by zero or more octets of data. Short strings can carry up to 255
    octets of UTF-8 data, but may not contain binary zero octets.

    Long strings, stored as a 32-bit unsigned integer length followed by
    zero or more octets of data. Long strings can contain any data.
"

> The protocol spec seems to be
> light detailing unicode support.

You're right about this. Generally, whenever longstr is used *as a
string*, the consensus among the current implementors seems to be to
assume UTF-8. In other places, of course, it's used as a chunk of binary
data, with no assumed encoding at all.
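
To make the two wire formats from §4.2.5.3 concrete, here is a rough
sketch in Python. The function names are hypothetical, and the length
prefix is packed in network (big-endian) byte order, as AMQP does for
its integers. Treating longstr text as UTF-8 is only the convention
described above, not something the spec states.

```python
import struct

def encode_shortstr(text: str) -> bytes:
    """8-bit unsigned length followed by up to 255 octets of UTF-8."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("shortstr is limited to 255 octets")
    if b"\x00" in data:
        raise ValueError("shortstr may not contain binary zero octets")
    return struct.pack("B", len(data)) + data

def encode_longstr(data: bytes) -> bytes:
    """32-bit unsigned length followed by arbitrary octets."""
    return struct.pack(">I", len(data)) + data

def encode_longstr_text(text: str) -> bytes:
    """longstr used *as a string*: UTF-8 by convention (assumption)."""
    return encode_longstr(text.encode("utf-8"))
```

Note that the 255 limit counts octets, not characters, so a shortstr
holding non-ASCII text fits fewer than 255 characters.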

>  Or does it matter, can the server
> detect the unicode type?

It *could*, if certain restrictions were placed on the content, but in
practice, this is not done.

One way of doing it (probably best applicable to longstr, leaving
shortstr unambiguously restricted to UTF-8) would be to first assume
Unicode (because otherwise we're completely doomed), and examine the
bytes for the presence of a UTF-32, UTF-16 or UTF-8 BOM. If present,
the BOM gives a clear indication of the encoding and, for UTF-32 and
UTF-16, the endianness. If absent, one could either heuristically guess
the encoding based on the presence and position of NULs (not
recommended) or simply assume UTF-8.
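
Purely as an illustration of that idea (no broker or client does this,
as far as I know), the BOM check might look like the following Python
sketch. decode_longstr_text is a hypothetical name.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
# This means UTF-16-LE text whose first character is U+0000 would be
# misread as UTF-32-LE; the sketch accepts that ambiguity.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_longstr_text(data: bytes) -> str:
    """Decode longstr payload bytes, using a BOM if one is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode with the encoding it announces.
            return data[len(bom):].decode(encoding)
    # No BOM: simply assume UTF-8 rather than guessing from NULs.
    return data.decode("utf-8")
```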

Tony
-- 
 [][][] Tony Garnock-Jones     | Mob: +44 (0)7905 974 211
   [][] LShift Ltd             | Tel: +44 (0)20 7729 7060
 []  [] http://www.lshift.net/ | Email: tonyg at lshift.net