I want to extract Urdu phrases from a user-submitted string in PHP. To do this, I tried the following test code:
$pattern = "#([\x{0600}-\x{06FF}]+\s*)+#u";
if (preg_match_all($pattern, $string, $matches, PREG_SET_ORDER)) {
    print_r($matches);
} else {
    echo 'No matches.';
}
Now if, for example, $string contains the following text:
In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.
I get the following output:
Array
(
    [0] => Array
        (
            [0] => دنیا گول ہے
            [1] => ہے
        )

    [1] => Array
        (
            [0] => آوارہ گرد کی ڈائری
            [1] => ڈائری
        )

    [2] => Array
        (
            [0] => ابن بطوطہ کے تعاقب میں
            [1] => میں
        )

)
Even though I get my desired matches (دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), I also get undesired ones (ہے, ڈائری, and میں -- each of which is actually the last word of its phrase). Can anyone please point out how I can avoid the undesired matches?
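
For reference, the extra entries appear to come from the capturing parentheses in the pattern: a repeated capturing group keeps only the text of its last iteration, which would explain why each [1] entry is the final word of its phrase. A minimal sketch of one possible way around this, assuming only the full phrases are wanted, is to make the group non-capturing with (?: ... ) and read the full matches from $matches[0]. The variable name $phrases and the trim() step are my own additions for illustration and are not part of the original test code.

```php
<?php
// Sample input taken from the example above.
$string = 'In his books (some of which include دنیا گول ہے, آوارہ گرد کی ڈائری, and ابن بطوطہ کے تعاقب میں), Ibn-e-Insha has told amusing stories of his travels.';

// (?: ... ) groups without capturing, so no per-group sub-match is recorded.
// Single quotes keep PHP from touching the backslashes before PCRE sees them.
$pattern = '#(?:[\x{0600}-\x{06FF}]+\s*)+#u';

$phrases = []; // hypothetical name, used only in this sketch
if (preg_match_all($pattern, $string, $matches)) {
    // With the default flag (PREG_PATTERN_ORDER), $matches[0] holds every full match.
    // trim() drops whitespace that the trailing \s* may pick up after a phrase.
    $phrases = array_map('trim', $matches[0]);
}

print_r($phrases);
```

With the sample string this should leave three entries in $phrases, one per phrase. Keeping PREG_SET_ORDER would also work with the non-capturing group; each inner array would then contain only index 0.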