Message body encodings

Background

EMG tries to forward messages as transparently as possible. However, just as phone numbers sometimes need to be rewritten into the format expected by the downstream server, so does the message body. Originally this was done using mapping files, but as a growing proportion of messages required the Unicode character set, or were simply encoded using UTF-8, that strategy eventually became too complex to be usable. As of version 7.2.25, EMG therefore uses a more future-proof and flexible strategy.

Main ideas

Higher priority is now given to making sure that the message is shown correctly on the recipients’ handsets, and lower priority to forwarding the same byte values. When a message is received, EMG now converts the message body to the Unicode character set, encoded using UTF-8. When the message is sent, EMG converts it to the encoding that yields the fewest bytes, chosen among the encodings supported by the protocol used on that connector. Binary messages are of course sent as is.
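The selection rule described above can be sketched as follows. This is only an illustration of the idea and not EMG's actual implementation: the function name pick_smallest_encoding is hypothetical, Python codec names stand in for the encodings a protocol might allow, and GSM-7 is left out because Python has no built-in codec for it.

```python
def pick_smallest_encoding(text, candidates):
    """Return (codec, payload) for the candidate that yields the fewest bytes.

    Candidates that cannot represent the text are skipped. On a tie, the
    candidate listed first wins (an arbitrary choice made for this sketch).
    """
    best = None
    for codec in candidates:
        try:
            payload = text.encode(codec)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent every character
        if best is None or len(payload) < len(best[1]):
            best = (codec, payload)
    if best is None:
        raise ValueError("no candidate encoding can represent the text")
    return best


if __name__ == "__main__":
    # "utf-16-be" plays the role of UCS-2 here.
    for body in ("Hello", "Grüße", "日本語"):
        codec, payload = pick_smallest_encoding(
            body, ["latin-1", "utf-8", "utf-16-be"]
        )
        print(body, codec, len(payload))
```

Text that fits in Latin-1 stays at one byte per character, while text such as Japanese is shorter in UTF-16 than in UTF-8, which is why the outcome depends on both the message and the encodings the protocol offers.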

The encoding used by the sender is stored along with the message and logged in the connector log file as option 226. This way the connector option DISABLE_UCS2_DOWNGRADE can still be supported: it prevents messages from being sent as GSM-7 or Latin-1 if they were received as UCS-2 or UTF-8. Other messages are not affected. Normally this option should not be needed.

Some characters in GSM-7 cannot be encoded using Latin-1, and vice versa. So, for the SMS protocols which only support one of these encodings (OIS, UCP, GSM), messages with such characters need to be encoded using UCS-2 in order to avoid data loss. As this could lead to a higher number of PDUs and a higher cost, this feature must be enabled using the new connector option ENABLE_UCS2_UPGRADE. Previously the conversion between GSM-7 and Latin-1 was done anyway, replacing the missing characters with a space character. The same data loss occurs when setting the global option DEFAULT_CHARCODE_TEXT to either GSM-7 or Latin-1, which is why this option is now ignored.
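To make the mismatch between the two character sets concrete, the sketch below lists which characters of a message are missing from GSM-7 or from Latin-1. It is an illustration only: the function names are hypothetical, the character table is a hand-written copy of the GSM 7-bit default alphabet and its default extension table, and lossy_to_gsm7 merely imitates the space replacement described above rather than reproducing EMG's code.

```python
# GSM 7-bit default alphabet plus the default extension table (written out
# by hand for this sketch; order is irrelevant because it becomes a set).
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "ΔΦΓΛΩΠΨΣΘΞÆæßÉ_"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENSION = "^{}\\[~]|€\f"
GSM7_CHARS = set(GSM7_BASIC + GSM7_EXTENSION)


def missing_in_gsm7(text):
    """Characters of text that GSM-7 cannot represent."""
    return sorted({ch for ch in text if ch not in GSM7_CHARS})


def missing_in_latin1(text):
    """Characters of text that Latin-1 (code points 0 to 255) cannot represent."""
    return sorted({ch for ch in text if ord(ch) > 0xFF})


def lossy_to_gsm7(text):
    """Imitate the old behaviour: replace unsupported characters with a space."""
    return "".join(ch if ch in GSM7_CHARS else " " for ch in text)


if __name__ == "__main__":
    body = "Fête à 5€"
    print(missing_in_gsm7(body))    # characters lost on a GSM-7-only protocol
    print(missing_in_latin1(body))  # characters lost on a Latin-1-only protocol
    print(lossy_to_gsm7(body))
```

In this example "ê" exists in Latin-1 but not in GSM-7, while "€" exists in GSM-7 but not in Latin-1, so neither encoding alone carries the message without loss; sending it as UCS-2 does.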

In order to support emojis, EMG actually uses UTF-16 and not UCS-2. These encodings are identical except for the values between 0xD800 and 0xDFFF, which UTF-16 uses in pairs (surrogates) to encode characters between 0x10000 and 0x10FFFF. We still use the name “UCS-2” in this text, though.
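The difference can be seen by looking at the 16-bit code units that UTF-16 produces. The following sketch uses only standard Python; the helper names utf16_code_units and to_surrogates are made up for this illustration.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogates(code_point):
    """Split a code point in 0x10000..0x10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # "A" fits in one code unit; U+1F600 (grinning face) needs two.
    print([hex(u) for u in utf16_code_units("A")])
    print([hex(u) for u in utf16_code_units("\U0001F600")])
    print([hex(u) for u in to_surrogates(0x1F600)])
```

A character below 0x10000 occupies a single code unit, exactly as in UCS-2, whereas an emoji such as U+1F600 occupies two code units and therefore four bytes.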

Supported encodings

The encodings EMG will select from when sending a message are shown below. The HEX encoding is used for binary messages. If the connector option DEFAULT_CHARCODE_TEXT is set to LATIN1, GSM-7 is disabled, and vice versa. This is another case where the connector option ENABLE_UCS2_UPGRADE is useful.

Protocol(s)                  Encodings
DLL, SMTP                    Latin-1, UTF-8
HTTP-JSON                    HEX, UTF-8
HTTP (all other variants)    Latin-1, UTF-8
CIMD2, SMPP                  GSM-7, Latin-1, UCS-2
OIS, UCP, GSM                GSM-7, UCS-2
All MMS protocols            Latin-1, UCS-2
EBE                          HEX, Latin-1, UTF-8, UCS-2
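As a rough illustration, the table above and the effect of DEFAULT_CHARCODE_TEXT could be modelled as a simple lookup. Everything in this sketch is an assumption made for the example: the dictionary keys "HTTP" and "MMS" stand for "all other HTTP variants" and "all MMS protocols", the value "GSM7" is a guessed spelling (only LATIN1 is named in the text), and the function text_encodings is hypothetical rather than part of EMG.

```python
# Encodings per protocol, copied from the table above.
SUPPORTED = {
    "DLL": ["Latin-1", "UTF-8"],
    "SMTP": ["Latin-1", "UTF-8"],
    "HTTP-JSON": ["HEX", "UTF-8"],
    "HTTP": ["Latin-1", "UTF-8"],           # all other HTTP variants
    "CIMD2": ["GSM-7", "Latin-1", "UCS-2"],
    "SMPP": ["GSM-7", "Latin-1", "UCS-2"],
    "OIS": ["GSM-7", "UCS-2"],
    "UCP": ["GSM-7", "UCS-2"],
    "GSM": ["GSM-7", "UCS-2"],
    "MMS": ["Latin-1", "UCS-2"],            # all MMS protocols
    "EBE": ["HEX", "Latin-1", "UTF-8", "UCS-2"],
}


def text_encodings(protocol, default_charcode_text=None):
    """Text encodings to choose from for a protocol.

    HEX is dropped because it is reserved for binary messages.
    default_charcode_text models the connector option: "LATIN1" disables
    GSM-7, and "GSM7" (assumed spelling) disables Latin-1.
    """
    encodings = [e for e in SUPPORTED[protocol] if e != "HEX"]
    if default_charcode_text == "LATIN1":
        encodings = [e for e in encodings if e != "GSM-7"]
    elif default_charcode_text == "GSM7":
        encodings = [e for e in encodings if e != "Latin-1"]
    return encodings


if __name__ == "__main__":
    print(text_encodings("SMPP"))
    print(text_encodings("SMPP", "LATIN1"))
    print(text_encodings("OIS", "LATIN1"))
```

Under these assumptions, a protocol that offers only GSM-7 and UCS-2 is left with UCS-2 alone once GSM-7 is disabled, which suggests why ENABLE_UCS2_UPGRADE matters in that configuration.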

As before, the connector options DCS_FOR_GSM7=n and DCS_FOR_LATIN1=n can be used to tell EMG which value to use for the data_coding field.

Future ideas

The GSM specification 23.038 lists several alternative extension tables containing various national characters. These could be used instead of UCS-2 to possibly reduce the number of PDUs. However, as we have not yet found an operator that actually supports them, EMG does not support them yet either.

Similarly, other character encodings can now be added relatively easily in future versions of EMG, at least those whose values stay in the range from 0 to 255.