Character Sets
==============

**The good news**: in general Bots handles character sets well. Incoming files are read using the character-set specified in the ``incoming`` syntax, outgoing files are written using the character-set specified in the ``outgoing`` syntax. You do not need to do anything in the mapping, Bots does the character-set conversion automatically.

* Lots of information about this topic can be found on `wikipedia `_.
* Information about the different character-sets in python is at the `python site `_.
* Outgoing character-sets can be set in the syntax; this can be done on envelope level, messagetype level or per partner (see the grammar sketch further down this section).
* Sometimes character-set conversion is not possible, eg utf-8 -> ascii, as utf-8 can contain characters that do not exist in ascii (illustrated further down this section).
* For edifact there is an option to do character-set mapping (eg é->e); this is also possible for x12 (a bit more work).
* Note that x12 can be tricky: the separators used can not occur in the data. There is no good workaround for this; the best way is to change the separators used.

.. rubric:: Specifics of different character-sets

* Most used in edi are ascii, iso-8859-1 and utf-8 (xml!).
* Most familiar is ascii. Note that ascii has only 128 characters! One character is one byte.
* The extended ascii character-sets:

  * One character is always one byte.
  * The upper 128 (above ascii) are used as special characters (eg éëè).
  * These different character-sets are about displaying (on screen, print, etc). If the text is in iso-8859-1 but displayed as eg IBM850, it looks *wrong*. Note that the content is still the same, it is only the display that is different.
  * In general these extended ascii character-sets will not be problematic in translations, as segment ID's/tags are in ascii and one character is one byte. In Bots the content of the message will just be fetched and passed to the outgoing message. So if an iso-8859-3 file is handled as iso-8859-1, that will generally not be problematic.
  * Examples: iso-8859-1, windows-1252, iso-8859-2, iso-8859-3, IBM850, UNOC, latin1.

* Unicode:

  * Examples: utf-8, utf-16, utf-32, UCS-2.
  * Unicode is designed to accommodate many more characters, think of eg Greek, Japanese, Chinese and Klingon characters.
  * One character is not always one byte (the different Unicode character-sets use different representation schemes).
  * Much used is utf-8. An advantage of utf-8 is that the first 128 ascii characters are one byte; this is why utf-8 is *upward compatible* with ascii: a file with only ascii characters is the same in ascii and utf-8.
  * In Microsoft environments utf-16 and utf-32 are often used.
  * Fixed files can not use the utf-8 character-set, as one character might be more than one byte.

* EBCDIC: related to (extended) ascii: one character is one byte. There are some variations of EBCDIC (eg extended EBCDIC).

A nasty situation is eg when one partner sends Unicode (eg utf-8) and another sends extended ascii (eg iso-8859-1). These character-sets work quite differently: Bots has to know what character-set is used before reading the files. The best way to solve this is to receive these files via different routes (or route-parts). Bots does no *guessing of character-sets*, as this is not appropriate for business data.
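Setting a character-set in the syntax comes down to (re)defining the ``charset`` key in a grammar's ``syntax`` dictionary. The excerpt below is only a sketch; check the key names and other syntax parameters against the grammars shipped with your Bots version before copying it::

    # Hypothetical excerpt from a grammar in usersys/grammars/ (envelope or messagetype level).
    # Only the character-set setting is shown; a real grammar also contains the other
    # syntax parameters, structure, recorddefs, etc.
    syntax = {
        'charset': 'iso-8859-1',   # character-set used for reading/writing this edi format
        }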
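As a standalone illustration (plain Python, independent of Bots) of two points above: conversion to a smaller character-set can fail, and in utf-8 one character is not always one byte::

    # utf-8 -> ascii fails when the data contains characters outside ascii:
    text = 'Müller & Céline'            # as decoded from a utf-8 encoded file
    try:
        text.encode('ascii')            # what writing an ascii outgoing file amounts to
    except UnicodeEncodeError as error:
        print('conversion not possible:', error)
    text.encode('iso-8859-1')           # iso-8859-1 does contain these characters: OK

    # in utf-8 one character is not always one byte; this is why fixed (positional)
    # files can not use utf-8:
    print(len('é'), len('é'.encode('utf-8')))        # 1 character, 2 bytes
    print(len('é'), len('é'.encode('iso-8859-1')))   # 1 character, 1 byte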
.. rubric:: X12 Incoming

The character-set as specified in the envelope syntax is used (if not set there, the default value of ``grammar.py`` is used: ascii). Some typical issues and solutions:

* If a partner sends x12 as eg iso-8859-1, just specify this in the syntax of the envelope (``usersys/grammars/x12/x12.py``). Note that this is a system-wide setting: it will be used for all incoming x12. This should not be a problem, as iso-8859-1 is a superset of ascii. The character-set of outgoing files should also be set to handle the extended characters. Also check your ERP software: what can it handle?
* The same solution works for utf-8.

.. rubric:: EDIFACT Incoming

Edifact has its own character-sets: UNOA, UNOB, UNOC, etc. In the default Bots setup:

* UNOA and UNOB have their own character-set mapping in ``usersys/charsets``.
* Other edifact character-sets are aliased in ``config/bots.ini``, section charsets.
* The default UNOA and UNOB are not 100% strict but allow some extra characters.
* There are some variations of the default UNOA/UNOB in the `sourceforge downloads `_.

Some typical issues and solutions:

* Edifact files have UNOA in them, but in fact the partner sends more (UNOC). Solution: use ``unoa_like_unoc.py`` from downloads, save it in ``usersys/charsets`` and rename it to ``unoa.py``.
* Partner sends the UNOC character-set, but my system can only handle ascii. Solution: use ``unoa_like_unoc.py`` from downloads; if you open this file you can see a character mapping (note that the incoming mapping is different from the outgoing mapping). You can map eg incoming é->e (etc); a sketch of the idea follows below.
* Our system uses iso-8859-1, but the partner can handle only UNOB. Solution: use ``unoa_like_unoc.py`` from downloads; if you open this file you can see a character mapping (note that the incoming mapping is different from the outgoing mapping). You can map eg outgoing é->e (etc).
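The character mapping in files like ``unoa_like_unoc.py`` is essentially a per-character substitution table, with separate tables for incoming and outgoing data. The snippet below only sketches that idea in plain Python (``map_characters`` is an illustrative helper, not Bots code); the real files in ``usersys/charsets`` have their own layout, so edit the tables in the downloaded file itself rather than copying this::

    # Illustration only: replace characters the receiving side can not handle
    # by plain ascii equivalents (extend the table as needed).
    mapping = {'é': 'e', 'è': 'e', 'ë': 'e', 'ö': 'o', 'ü': 'u'}

    def map_characters(text, table):
        return ''.join(table.get(char, char) for char in text)

    print(map_characters('Céline Müller', mapping))   # -> 'Celine Muller'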