The character set determines how the bytes that represent the text of your HTML document are translated into readable characters. Numeric and hexadecimal character references (such as "&#12345;" or "&#x1234;") are interpreted as ISO 10646 code points, so their meaning is consistent and independent of the selected character set.
To display HTML pages correctly, the browser must know which character set to use.
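To make this concrete, here is a minimal Python sketch (Python is used only for illustration; the helper name `decode_as` is hypothetical and not part of any browser API). It decodes the same two bytes under two character sets, then resolves two numeric character references with the standard library's `html.unescape`.

```python
import html

# The two bytes that UTF-8 uses to store the single character "é".
raw = "é".encode("utf-8")


def decode_as(data: bytes, charset: str) -> str:
    """Decode the same bytes under the given character set."""
    return data.decode(charset)


# Decoding with the character set the bytes were written in gives "é".
print(decode_as(raw, "utf-8"))

# Decoding with a different character set gives two unrelated characters.
print(decode_as(raw, "iso-8859-1"))

# Numeric character references name code points directly, so they do not
# depend on the character set the page is stored in.
decimal_ref = html.unescape("&#12345;")
hex_ref = html.unescape("&#x1234;")
print(hex(ord(decimal_ref)), hex(ord(hex_ref)))
```

The bytes never change between the two `decode_as` calls; only the assumed character set does, which is why the browser has to be told which one to use.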
The character set used in the early days of the World Wide Web was ASCII. ASCII supports the digits 0-9, the uppercase and lowercase English alphabet, and some special characters.
Complete ASCII Reference Manual.
Since many countries use characters that do not belong to ASCII, the default character set of modern browsers is ISO-8859-1.
Complete ISO-8859-1 Reference Manual.
If a web page uses a character set other than ISO-8859-1, that character set should be specified in the <meta> tag.
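As a rough illustration of how a declared character set can be read back out of a page, the following Python sketch uses the standard library's `html.parser`. The class `MetaCharsetParser` and the function `declared_charset` are hypothetical names introduced here. The sketch handles only the two common declaration forms and is not how browsers actually sniff encodings.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetParser(HTMLParser):
    """Remembers the first character set declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="UTF-8">
            self.charset = attributes["charset"].strip().lower()
            return
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = (attributes.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            # Long form:
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            value = content.split("charset=", 1)[1]
            self.charset = value.split(";", 1)[0].strip()


def declared_charset(html_text: str) -> Optional[str]:
    """Return the lowercased charset named in a <meta> tag, or None."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


print(declared_charset('<head><meta charset="UTF-8"></head>'))
print(declared_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">'
))
```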
ISO character sets are standard character sets defined by the International Organization for Standardization (ISO) for different alphabets and languages.
The following lists different character sets used worldwide:
Character Set | Description | Scope of use |
---|---|---|
ISO-8859-1 | Latin alphabet part 1 | North America, Western Europe, Latin America, Caribbean, Canada, Africa |
ISO-8859-2 | Latin alphabet part 2 | Eastern Europe |
ISO-8859-3 | Latin alphabet part 3 | Southeastern Europe, Esperanto, and miscellaneous others |
ISO-8859-4 | Latin alphabet part 4 | Scandinavia and the Baltics (and other languages not covered by ISO-8859-1) |
ISO-8859-5 | Latin/Cyrillic part 5 | Languages that use the Cyrillic alphabet, such as Bulgarian, Belarusian, Russian, and Macedonian |
ISO-8859-6 | Latin/Arabic part 6 | Languages that use the Arabic alphabet |
ISO-8859-7 | Latin/Greek part 7 | Modern Greek, as well as mathematical symbols derived from Greek |
ISO-8859-8 | Latin/Hebrew part 8 | Languages that use the Hebrew alphabet |
ISO-8859-9 | Latin 5 part 9 | Turkish. Identical to ISO-8859-1 except that Turkish characters replace the Icelandic letters. |
ISO-8859-10 | Latin 6 | Sami (Lappish), Nordic, and Inuit languages |
ISO-8859-15 | Latin 9 (also known as Latin 0) | Similar to ISO-8859-1, but the euro sign and a few other characters replace some less commonly used symbols |
ISO-2022-JP | Latin/Japanese part 1 | Japanese |
ISO-2022-JP-2 | Latin/Japanese part 2 | Japanese |
ISO-2022-KR | Latin/Korean part 1 | Korean |
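
The differences between the ISO-8859 parts in the table can be seen by decoding one byte value under several of them. This is a small Python sketch, assuming Python's built-in codecs for these character sets; `decode_byte` is a hypothetical helper name.

```python
def decode_byte(value: int, charset: str) -> str:
    """Decode a single byte value under the given character set."""
    return bytes([value]).decode(charset)


# Byte 0xA4 is the generic currency sign in ISO-8859-1
# but the euro sign in ISO-8859-15.
currency_latin1 = decode_byte(0xA4, "iso-8859-1")
currency_latin9 = decode_byte(0xA4, "iso-8859-15")

# Byte 0xD0 is the Icelandic letter eth in ISO-8859-1
# but a Turkish letter (G with breve) in ISO-8859-9.
letter_latin1 = decode_byte(0xD0, "iso-8859-1")
letter_latin5 = decode_byte(0xD0, "iso-8859-9")

print(currency_latin1, currency_latin9, letter_latin1, letter_latin5)
```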
Because all the character sets listed above are limited in size and do not work in multilingual environments, the Unicode Consortium developed the Unicode standard.
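The size limit is easy to observe in practice. In the Python sketch below (`encodable` is a hypothetical helper and the sample words are arbitrary), each single-byte ISO character set can store text in some languages but not others, while UTF-8 can store all of the samples.

```python
def encodable(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the character set."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "French": "caf\u00e9",
    "Greek": "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1",
    "Russian": "\u043f\u0440\u0438\u0432\u0435\u0442",
}

for charset in ("iso-8859-1", "iso-8859-5", "iso-8859-7", "utf-8"):
    supported = [name for name, text in samples.items() if encodable(text, charset)]
    print(charset, supported)
```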
The Unicode standard covers almost all of the characters, punctuation marks, and symbols in the world.
Unicode allows text to be processed, stored, and exchanged regardless of platform, program, or language.
The consortium's goal is to replace the existing character sets with its standard Unicode Transformation Format (UTF).
The Unicode standard has been successful and is implemented in XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, and WML. Unicode is also supported in many operating systems and all modern browsers.
The Unicode Consortium collaborates with leading standards development organizations, such as ISO, W3C, and ECMA.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16:
Character Set | Description |
---|---|
UTF-8 | A character in UTF-8 can be 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard and is backward compatible with ASCII. UTF-8 is the preferred encoding for web pages and email. |
UTF-16 | The 16-bit Unicode Transformation Format is a variable-length character encoding that can encode the entire Unicode repertoire. UTF-16 is mainly used in operating systems and environments such as Microsoft Windows 2000/XP/2003/Vista/CE and the Java and .NET bytecode environments. |
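
The variable lengths described in the table can be checked with a short Python sketch. The helper `encoded_lengths` and the four sample characters are illustrative choices. The "utf-16-le" codec is used so that no byte order mark is added to the count.

```python
from typing import Tuple


def encoded_lengths(ch: str) -> Tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Latin letter A, e with acute accent, euro sign, and an emoji outside
# the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [encoded_lengths(ch) for ch in samples]
print(lengths)

# Backward compatibility: pure ASCII text has identical bytes in UTF-8.
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(ascii_compatible)
```

UTF-8 grows from 1 to 4 bytes across the samples, while UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane.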
Tip: The first 256 characters of Unicode correspond to the 256 characters of ISO-8859-1.
Tip: All HTML 4 processors support UTF-8, and all XHTML and XML processors support UTF-8 and UTF-16!
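This correspondence can be verified with a few lines of Python. The sketch relies on Python's built-in "iso-8859-1" codec, which maps every byte value, including the control ranges, to the code point with the same number.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decode them as ISO-8859-1 and look up each character's Unicode code point.
decoded = all_bytes.decode("iso-8859-1")
code_points = [ord(ch) for ch in decoded]

# Each byte value maps to the Unicode code point with the same number.
matches = code_points == list(range(256))
print(matches)
```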