Bits and bytes are two units for storing information. A bit can be thought of as a single slot that holds one of two values: 0 or 1.
A byte is a group of eight bits. Mathematically, a byte can represent 256 different values (2^8).
Let’s think about a language, say English. It has some characters (a, b, c, and so on), which a computer represents as bytes. English has far fewer than 256 characters, so each character can be represented by a distinct 8-bit sequence.
Strings are simply sequences of characters. Normally, PHP string functions operate on strings of single-byte characters. For example, you may want to compare the strings “Hello” and “Hi”. With strcmp(), the two strings are compared on the assumption that every character takes one byte.
But think about a language that has more than 256 characters (for example, Japanese), or a case where we want to represent characters from multiple languages at the same time. One byte of storage per character is not enough. This is where the multibyte concept comes in.
With a string of Japanese text, byte-oriented functions such as strcmp() may return wrong or misleading values, since the assumption that one byte represents one character no longer holds. Multibyte-encoded strings need special functions rather than the common single-byte string functions. In PHP, the mbstring extension provides these multibyte-specific string functions.
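The difference between counting bytes and counting characters can be seen in a minimal sketch. It assumes the script file is saved as UTF-8 and that the mbstring extension is loaded; the sample text is only an illustration.

```php
<?php
// Three Japanese characters; each one occupies 3 bytes in UTF-8.
$text = "日本語";

$byteCount = strlen($text);             // strlen() counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // mb_strlen() counts characters

echo "strlen:    $byteCount\n"; // 9
echo "mb_strlen: $charCount\n"; // 3
```

The single-byte function reports 9 because it sees nine bytes, while the multibyte function reports the three characters a reader would expect.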
UTF stands for Unicode Transformation Format and is an encoding system that aims to represent every character in every language in one character set. There are different versions of UTF, some of which are shown below:
Encoding Format | Description |
UTF-1 | Compatible with ISO-2022; obsolete and no longer part of the Unicode Standard. |
UTF-7 | 7-bit encoding system, mainly used in e-mail; not part of the Unicode Standard. |
UTF-8 | 8-bit encoding system, variable-width, and is ASCII-compatible. |
UTF-EBCDIC | 8-bit encoding system, variable-width, and is EBCDIC-compatible. |
UTF-16 | 16-bit encoding system, variable width. |
UTF-32 | 32-bit encoding system, fixed-width. |
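To make the fixed-width and variable-width descriptions in the table concrete, the sketch below measures how many bytes a single character takes in three of these encodings. The helper name encodedSizes() is hypothetical, the sample characters are arbitrary, and the code assumes a UTF-8 source file with mbstring loaded.

```php
<?php
// Hypothetical helper: byte length of one UTF-8 character in several encodings.
function encodedSizes(string $char): array
{
    $sizes = [];
    foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
        // strlen() counts bytes, which is exactly what we want to measure here.
        $sizes[$encoding] = strlen(mb_convert_encoding($char, $encoding, 'UTF-8'));
    }
    return $sizes;
}

foreach (['a', 'é', '大', '😀'] as $char) {
    $s = encodedSizes($char);
    printf("%s  UTF-8: %d  UTF-16: %d  UTF-32: %d\n",
        $char, $s['UTF-8'], $s['UTF-16BE'], $s['UTF-32BE']);
}
```

UTF-32 always uses four bytes, while UTF-8 grows from one to four bytes and UTF-16 uses either two or four.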
We find ourselves using UTF-8 most of the time when working with multibyte text, so let’s focus on that for a moment. UTF-8 encodes each character in one or more bytes.
So how does a decoder know whether a character is stored in one byte or in several? It looks at the high-order bits of the first byte.
Code | Meaning |
0xxxxxxx | A single-byte code |
110xxxxx | One more byte follows this byte |
1110xxxx | Two more bytes follow this byte |
11110xxx | Three more bytes follow this byte |
111110xx | Four more bytes follow this byte (original design only; no longer valid since RFC 3629) |
1111110x | Five more bytes follow this byte (original design only; no longer valid since RFC 3629) |
10xxxxxx | Continuation of a multibyte character |
Every continuation byte in a multibyte sequence starts with the bits 1 and 0 in its two highest-order positions, which provides a way to detect corrupt data.
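The lead-byte rules in the table above can be sketched as a small function. The name utf8SequenceLength() is hypothetical, the input is assumed to be a non-empty string, and the sketch covers only the one- to four-byte forms.

```php
<?php
// Hypothetical helper: how many bytes the UTF-8 sequence starting with
// this byte occupies, judged only from the byte's high-order bits.
function utf8SequenceLength(string $firstByte): int
{
    $byte = ord($firstByte); // ord() looks at the first byte only

    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx

    return 0; // 10xxxxxx continuation byte, or a lead byte not handled here
}

echo utf8SequenceLength("a"), "\n";    // 1
echo utf8SequenceLength("é"), "\n";    // 2
echo utf8SequenceLength("大"), "\n";   // 3
echo utf8SequenceLength("\xA4"), "\n"; // 0 (a continuation byte)
```

A return value of 0 for a byte that should start a character is one simple signal that the data is corrupt or that decoding started in the middle of a sequence.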
For commonly used string functions, such as strlen(), strpos(), and substr(), there are multibyte equivalents. You should use these equivalents when working with multibyte strings.
Table 4: Single-byte string functions and their multibyte equivalents
Single byte | Multibyte | Description |
strlen() | mb_strlen() | Get string length |
strpos() | mb_strpos() | Find position of first occurrence of string in a string |
substr() | mb_substr() | Return part of a string |
strtolower() | mb_strtolower() | Make a string lowercase |
strtoupper() | mb_strtoupper() | Make a string uppercase |
substr_count() | mb_substr_count() | Count the number of substring occurrences |
split() | mb_split() | Split string into array by regular expression |
mail() | mb_send_mail() | Send encoded mail |
ereg() | mb_ereg() | Regular expression match |
eregi() | mb_eregi() | Case insensitive regular expression match |
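A few of the pairs in the table can be compared side by side. This is a minimal sketch that assumes a UTF-8 source file and a loaded mbstring extension; the sample string is arbitrary.

```php
<?php
$text = "こんにちは世界"; // 7 characters, 21 bytes in UTF-8

$bytePos = strpos($text, "世界");                // offset in bytes
$charPos = mb_strpos($text, "世界", 0, 'UTF-8'); // offset in characters

$firstFive = mb_substr($text, 0, 5, 'UTF-8');    // first five characters
$brokenCut = substr($text, 0, 5);                // first five bytes only

echo $bytePos, "\n";   // 15
echo $charPos, "\n";   // 5
echo $firstFive, "\n"; // こんにちは
// substr() cut through the middle of the second character,
// so the result is no longer valid UTF-8.
var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // bool(false)
```

The single-byte functions are not wrong about the bytes, but their offsets and lengths do not line up with characters, which is why the multibyte versions are the safer choice for text like this.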
mb_strlen() takes an optional encoding parameter, which names the character encoding of the string.
Example code: here is how to use the mb_strlen() function. The input string is a Chinese word, and three different character encodings are passed.
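<?php
$str = "大大"; // assumes this source file is saved as UTF-8

echo mb_strlen($str, 'UTF-8'), "\n";  // 2: the bytes decode as two UTF-8 characters
echo mb_strlen($str, 'GBK'), "\n";    // the same bytes interpreted as GBK
echo mb_strlen($str, 'GB2312'), "\n"; // the same bytes interpreted as GB2312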
Constraints: UTF-8 has some constraints, like-
Enable mbstring from php.ini:
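Whether the extension is actually available can be checked from a script. This is a minimal sketch that uses only the standard extension_loaded() function; the variable name and the messages are illustrative.

```php
<?php
// extension_loaded() returns true when the named extension is available.
$mbstringEnabled = extension_loaded('mbstring');

if ($mbstringEnabled) {
    echo "mbstring is enabled\n";
} else {
    echo "mbstring is not enabled; check the extension settings in php.ini\n";
}
```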
Runtime Configuration: To enable some mbstring functions, a few more settings need to be changed.
Table 5: Configurations in php.ini
Name | Default Value | Changeable Option |
mbstring.language | neutral | PHP_INI_SYSTEM or PHP_INI_PERDIR |
mbstring.detect_order | NULL | PHP_INI_ALL |
mbstring.http_input | pass | PHP_INI_ALL |
mbstring.http_output | pass | PHP_INI_ALL |
mbstring.internal_encoding | NULL | PHP_INI_ALL |
mbstring.script_encoding | NULL | PHP_INI_ALL |
mbstring.substitute_character | NULL | PHP_INI_ALL |
mbstring.func_overload | 0 | PHP_INI_SYSTEM or PHP_INI_PERDIR |
mbstring.encoding_translation | 0 | PHP_INI_SYSTEM or PHP_INI_PERDIR |
Explanation of the configuration options:
The “Changeable Option” column gives the change mode of each setting. It describes how and from where the mbstring option can be changed. The mode values have the following meanings:
Table 6: Change modes
Mode | Meaning |
PHP_INI_SYSTEM | We can set the entry using php.ini or httpd.conf |
PHP_INI_PERDIR | We can set the entry using php.ini, .htaccess, httpd.conf or .user.ini |
PHP_INI_ALL | We can set the entry from anywhere |
PHP_INI_USER | We can set the entry from a user script. |
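The change mode of each setting can also be inspected at runtime. The sketch below relies on ini_get_all(), which reports an access bitmask per setting, and on PHP's INI_USER, INI_PERDIR and INI_SYSTEM constants. The helper name accessModes() is hypothetical, and the values printed depend on the PHP version in use.

```php
<?php
// Hypothetical helper: translate an access bitmask into mode names.
function accessModes(int $access): array
{
    $modes = [];
    if ($access & INI_USER)   { $modes[] = 'PHP_INI_USER'; }
    if ($access & INI_PERDIR) { $modes[] = 'PHP_INI_PERDIR'; }
    if ($access & INI_SYSTEM) { $modes[] = 'PHP_INI_SYSTEM'; }
    return $modes;
}

// Only query the settings when the extension is present.
if (extension_loaded('mbstring')) {
    foreach (ini_get_all('mbstring') as $name => $info) {
        printf("%-32s %s\n", $name, implode(' | ', accessModes($info['access'])));
    }
}
```

A setting that lists all three modes corresponds to PHP_INI_ALL in the table above.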
How to change a setting from a user script:
We can use the following code to set the internal encoding of mbstring from a user script:
<?php ini_set('mbstring.internal_encoding', 'UTF-8'); ?>
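An alternative is the mb_internal_encoding() function, which sets the same value at runtime. It is worth considering because the mbstring.internal_encoding ini setting has been marked as deprecated since PHP 5.6. A minimal sketch, assuming a UTF-8 source file:

```php
<?php
// Set the internal encoding used by mb_* functions when none is passed.
mb_internal_encoding('UTF-8');

// Called without arguments, the function returns the current setting.
$current = mb_internal_encoding();
echo $current, "\n"; // UTF-8

// No explicit encoding argument is needed now.
$length = mb_strlen("大大");
echo $length, "\n"; // 2
```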
How to change from php.ini:
We can edit the php.ini file to set mbstring options.
; Set default language
mbstring.language = Neutral ; Set default language to Neutral (UTF-8) (default)
mbstring.language = English ; Set default language to English

; Enable HTTP input encoding translation
mbstring.encoding_translation = On

; Set default HTTP input character encoding
mbstring.http_input = pass ; No conversion
mbstring.http_input = auto ; Set HTTP input to auto
Using mbstring functions can sometimes cause trouble. Here I will discuss some problems with multibyte function overloading. Consider the following scenario.
Suppose you have enabled the mbstring.func_overload option in your php.ini file, so that the single-byte string functions are overloaded by their multibyte counterparts. Your own code works fine. But what happens if you need an external library that makes frequent use of string functions and expects single-byte behavior? (Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.)
There is a partial solution to this problem: mbstring.internal_encoding. When you call the external library, you switch to a single-byte encoding, and when control returns to your project, you switch back to the multibyte encoding. But what happens if there is a callback between your project and the external library? The approach fails there.
So you have to keep these issues in mind when using mbstring options.
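The idea of switching the internal encoding around a library call can be sketched as follows. The wrapper callWithEncoding() is a hypothetical helper, and the closure merely stands in for a call into an external library. The sketch assumes a UTF-8 source file and a loaded mbstring extension.

```php
<?php
// Hypothetical wrapper: run $fn with a temporary internal encoding,
// then restore the previous one even if $fn throws.
function callWithEncoding(string $encoding, callable $fn)
{
    $previous = mb_internal_encoding();
    mb_internal_encoding($encoding);
    try {
        return $fn();
    } finally {
        mb_internal_encoding($previous);
    }
}

mb_internal_encoding('UTF-8');

// Stand-in for library code that expects one byte per character.
$length = callWithEncoding('ISO-8859-1', function () {
    return mb_strlen("大大");
});

echo $length, "\n";                // 6: every byte counted as one character
echo mb_internal_encoding(), "\n"; // UTF-8: the project's encoding is restored
```

The wrapper does not solve the callback case. If the library calls back into project code while the wrapper is active, that code still runs with the single-byte encoding.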
To develop an international web application, mbstring is a must. Otherwise your application will be limited to certain nations and languages. As a developer, I suggest you build some knowledge in this area to become more effective as a web programmer.