[solved] Parsing strings with regular expressions - Match word with character #

Posted: Mon Dec 29, 2008 3:06 am
by nhaack
Hi there,

I am currently working on some enhancements for the Twitter plug-in. I want it to turn URLs, usernames and hashtags into links using regular expressions. My solution works fine for URLs; for usernames and hashtags, however, I run into the same problem in both cases. I am already halfway there, I just can't put my finger on this one.

I wrote a little function for that:

Code: Select all


function twitter_message_enhancer($message){
	// Turn URLs into links.
	$twitter_message_enhancer = eregi_replace("([[:alnum:]]+)://([^[:space:]]*)([[:alnum:]#?/&=])","<a href=\"\\1://\\2\\3\" target=\"_blank\">\\1://\\2\\3</a>", $message);
	// Turn @usernames into links to the Twitter profile.
	$twitter_message_enhancer = eregi_replace('([@]([[:alnum:]]+))', '<a href="http://twitter.com/\2" target="_blank">\1</a>', $twitter_message_enhancer);
	// Turn #hashtags into links to the Twitter search.
	$twitter_message_enhancer = eregi_replace('([#]([[:alnum:]]+))', '<a href="http://search.twitter.com/search?q=%23\2" target="_blank">\1</a>', $twitter_message_enhancer);
	return $twitter_message_enhancer;
}

It already works in most cases. My problem is that I don't know how to match only when the word being checked begins with a "#" or "@".

Code: Select all


([#]([[:alnum:]]+)) 

just matches any part of the string that begins with the character "#" (e.g. in "word1 word2 #word3" it replaces #word3, but in "wor#d1 word2 #word3" #d1 gets replaced as well). I thought of using a leading space as a matching criterion, but then what if a message begins with a "#" and there is no leading white space?

I tried using ^ to indicate that the string should start with "#" as follows:

Code: Select all


(^[#]([[:alnum:]]+)) or 
([^#]([[:alnum:]]+)) 

but it doesn't work. I think this is the wrong approach, since a hashtag is usually a sub-element somewhere inside the string rather than at its very start. Is there a way to indicate the beginning of a word in the parsing process?
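For what it's worth, outside of brackets `^` anchors the match to the start of the whole string, and `[^#]` is a negated class that matches any single character except "#", so neither of them means "start of a word". One way to express that is a negative lookbehind. The sketch below assumes the PCRE functions (`preg_replace`) rather than `eregi_replace`, which was deprecated in PHP 5.3 and removed in PHP 7.0; the function name `link_hashtags` is made up for illustration.

```php
<?php
// Sketch only: uses PCRE (preg_*) instead of the eregi_* calls from the post.
// The function name link_hashtags is hypothetical.
function link_hashtags($message) {
    // (?<![[:alnum:]]) is a negative lookbehind: the "#" must not be preceded
    // by a letter or digit. It therefore matches at the start of the string
    // or after a space or punctuation, but not inside "wor#d1".
    return preg_replace(
        '/(?<![[:alnum:]])#([[:alnum:]]+)/',
        '<a href="http://search.twitter.com/search?q=%23$1" target="_blank">#$1</a>',
        $message
    );
}

echo link_hashtags("wor#d1 word2 #word3"), "\n";
```

With the example message, only #word3 should end up linked. The same lookbehind with "@" instead of "#" would presumably cover usernames.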

I am clueless about this at the moment  ???

As said, it works fine in most cases, but in rare situations Twitter users post URLs with a "#" inside. In that case, the first rule matches the URL and the hashtag rule then replaces the part after the "#" with another link. You end up with two interwoven links, which obviously doesn't work. (The same applies to @username text elements and posted e-mail addresses.)
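One possible way around the interwoven links is to do everything in a single pass, so that a "#" or "@" inside a URL is consumed by the URL match and never looked at again. This is only a sketch under some assumptions: it uses `preg_replace_callback` (PCRE, with an anonymous function, so PHP 5.3 or later) instead of `eregi_replace`, and the function name `link_tweet` is hypothetical. The character classes are the ones from the function above.

```php
<?php
// Sketch only: one pass over the message, with three alternatives.
// The function name link_tweet is hypothetical.
function link_tweet($message) {
    $pattern = '~([[:alnum:]]+://[^[:space:]]*[[:alnum:]#?/&=])' // group 1: URL
             . '|(?<![[:alnum:]])@([[:alnum:]]+)'                // group 2: @username
             . '|(?<![[:alnum:]])#([[:alnum:]]+)~';              // group 3: #hashtag

    return preg_replace_callback($pattern, function ($m) {
        if ($m[1] !== '') {
            // The whole URL, including any "#" or "@" in it, becomes one link.
            return '<a href="' . $m[1] . '" target="_blank">' . $m[1] . '</a>';
        }
        if (isset($m[2]) && $m[2] !== '') {
            return '<a href="http://twitter.com/' . $m[2] . '" target="_blank">@' . $m[2] . '</a>';
        }
        // Neither group 1 nor group 2 matched, so it must be a hashtag.
        return '<a href="http://search.twitter.com/search?q=%23' . $m[3]
             . '" target="_blank">#' . $m[3] . '</a>';
    }, $message);
}

echo link_tweet("see http://example.com/page#top and #tag by @nils"), "\n";
```

The lookbehind in front of "@" should also leave e-mail addresses like me@example.com alone, because the "@" there follows a letter.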

As a Plan B, I could explode the message at spaces, walk through the array of words and replace where matched, then implode again with a space as glue, but that is not very elegant ;)
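Just to illustrate, Plan B could look roughly like the sketch below. It assumes `preg_replace` (PCRE) instead of `eregi_replace`, and the function name `link_words` is made up. Because every word is checked on its own, `^` really does mean "start of the word" here.

```php
<?php
// Sketch of "Plan B": split on spaces, link words that start with "#" or "@",
// then glue the words back together. link_words is a hypothetical name.
function link_words($message) {
    $words = explode(' ', $message);
    foreach ($words as $i => $word) {
        // ^ is safe here because each $word is its own string.
        $word = preg_replace(
            '/^#([[:alnum:]]+)/',
            '<a href="http://search.twitter.com/search?q=%23$1" target="_blank">#$1</a>',
            $word
        );
        // After a hashtag replacement the word starts with "<", so this
        // second rule cannot match the same word again.
        $word = preg_replace(
            '/^@([[:alnum:]]+)/',
            '<a href="http://twitter.com/$1" target="_blank">@$1</a>',
            $word
        );
        $words[$i] = $word;
    }
    return implode(' ', $words);
}

echo link_words("wor#d1 #word3 mail me@example.com @nils"), "\n";
```

A URL containing a "#" is left untouched by these two rules, since such a word starts with the scheme and not with "#". The sketch only splits on plain spaces, so tabs and line breaks would need extra handling.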

Best
Nils

Re: Parsing strings with regular expressions - Match word beginning with character #

Posted: Mon Dec 29, 2008 11:01 am
by JeremyBASS
just a thought based off "wor#d1 word2 #word3"

Code: Select all


/^[#][[:alnum:]]+$/

so

Code: Select all


(^[#]([[:alnum:]]+$))


??

I didn't get to test it..

cheers

Re: Parsing strings with regular expressions - Match word beginning with character #

Posted: Mon Dec 29, 2008 4:41 pm
by nhaack
Hi Jeremy,

Thanks for the hint. I tried it with the $ at the end, but then it seemed to be too strict. It pointed me in the right direction, though ;)

In the end, I'll do two eregi_replace calls each for the hash tags and for the "@" in front of usernames.

First

Code: Select all


( [#]([[:alnum:]]+))

to match hash tags or user names within the text that begin as a "new" word (when they have a space in front).

Second

Code: Select all


(^[#]([[:alnum:]]+))

to match if there is a match right at the beginning of the string.
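The two patterns above (one anchored with `^`, one with a leading space) could probably be merged into one by using an alternation, `(^|[[:space:]])`, in front of the "#". This is a sketch only: it assumes `preg_replace` (PCRE) instead of `eregi_replace`, and `link_hashtags_combined` is a made-up name.

```php
<?php
// Sketch only: one pattern for "start of string" and "after white space".
// The function name link_hashtags_combined is hypothetical.
function link_hashtags_combined($message) {
    // Group 1 is either the empty match at the start of the string or one
    // white-space character; $1 in the replacement puts it back unchanged.
    // Group 2 is the tag without the leading "#".
    return preg_replace(
        '/(^|[[:space:]])#([[:alnum:]]+)/',
        '$1<a href="http://search.twitter.com/search?q=%23$2" target="_blank">#$2</a>',
        $message
    );
}

echo link_hashtags_combined("#first word #second wor#d"), "\n";
```

Here #first and #second should be linked while wor#d stays as it is. Unlike a lookbehind, this variant does not link a tag that directly follows punctuation such as "(".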

I found a great site for evaluating regular expressions (it is in German though): http://regexp-evaluator.de/

Thanks and best
Nils