String Replacement With Dictionary, Complications With Punctuation

December 24, 2023

I'm trying to write a function process(s, d) that replaces abbreviations in a string with their full meaning by using a dictionary, where s is the input string and d is the dictionary.

Solution 1:

Here is a way to do it with a single regex:

In [23]: import re

In [24]: d = {'ASAP':'as soon as possible', 'AFAIK': 'as far as I know'}

In [25]: s = 'I will do this ASAP, AFAIK.  Regards, X'

In [26]: re.sub(r'\b(?:' + '|'.join(map(re.escape, d)) + r')\b', lambda m: d[m.group(0)], s)
Out[26]: 'I will do this as soon as possible, as far as I know.  Regards, X'

Unlike versions based on str.replace(), this observes word boundaries and therefore won't replace abbreviations that happen to appear in the middle of other words (e.g. "etc" in "fetch").

Also, unlike most (all?) other solutions presented thus far, it iterates over the input string just once, regardless of how many search terms there are in the dictionary.
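The single-pass idea can be wrapped into the process(s, d) function the question asks for. What follows is only a sketch under a few assumptions that are not in the original answer: the keys are escaped with re.escape() in case they contain regex metacharacters, the alternation is wrapped in a non-capturing group so that \b applies to every key, and the keys are sorted longest first as a precaution.

```python
import re

def process(s, d):
    """Replace every whole-word key of d found in s with its value, in one pass."""
    if not d:
        return s
    # Longest keys first, so a shorter key never shadows a longer one sharing a prefix.
    keys = sorted(d, key=len, reverse=True)
    # The non-capturing group makes \b apply to each alternative, not just the ends.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')
    return pattern.sub(lambda m: d[m.group(0)], s)

d = {'ASAP': 'as soon as possible', 'AFAIK': 'as far as I know', 'etc': 'et cetera'}
print(process('I will do this ASAP, AFAIK.  Regards, X', d))
print(process('fetch the logs, etc', d))
```

The second call leaves "fetch" untouched and expands only the standalone "etc".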

Solution 2:

You can do something like this:

def process(s, d):
    # Replace every occurrence of each key, including inside longer words
    for key in d:
        s = s.replace(key, d[key])
    return s

Solution 3:

Here is a working solution: use re.split(), and split by word boundaries (preserving the interstitial characters):

''.join(d.get(word, word) for word in re.split(r'(\W+)', s))

One significant difference that this code has from Vaughn's or Sheena's answer is that this code takes advantage of the O(1) lookup time of the dictionary, while their solutions look at every key in the dictionary. This means that when s is short and d is very large, their code will take significantly longer to run. Furthermore, parts of words will still be replaced in their solutions: if d = { "lol": "laugh out loud" } and s="lollipop" their solutions will incorrectly produce "laugh out loudlipop".
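The difference is easy to see side by side. The sketch below uses two hypothetical helper names, process_split and process_replace, for the split-based approach and the plain str.replace() loop. It assumes that every key consists only of word characters; a key such as "e.g." would be broken up by the split and never looked up as a whole.

```python
import re

def process_split(s, d):
    # The capturing group keeps the separators (spaces, punctuation) in the list.
    tokens = re.split(r'(\W+)', s)
    # One dictionary lookup per token; unknown tokens pass through unchanged.
    return ''.join(d.get(token, token) for token in tokens)

def process_replace(s, d):
    # Naive version for comparison: loops over every key and ignores word boundaries.
    for key in d:
        s = s.replace(key, d[key])
    return s

d = {'lol': 'laugh out loud'}
print(process_split('lol, a lollipop', d))    # laugh out loud, a lollipop
print(process_replace('lol, a lollipop', d))  # laugh out loud, a laugh out loudlipop
```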

Solution 4:

Use regular expressions:

re.sub(pattern,replacement,s)

In your application:

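import re

def process(s, d):
    ret = s
    for key in d:
        # \b on both sides restricts the match to whole words
        ret = re.sub(r'\b' + re.escape(key) + r'\b', d[key], ret)
    return ret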

\b matches the beginning or end of a word. Thanks to Paul for the comment.

Solution 5:

Instead of splitting by spaces, use:

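re.split(r"\W", s)

It splits on any character that cannot be part of a word, that is, anything other than a letter, a digit, or an underscore. Note that this is re.split(); str.split() does not accept a regular expression.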
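One caveat when using this for replacement: a plain split throws the punctuation away, so the original string cannot be rebuilt from the pieces. A small sketch of the difference, using an example sentence chosen here for illustration:

```python
import re

s = 'I will do this ASAP, AFAIK.'

# Without a capturing group the separators are discarded.
words = re.split(r'\W+', s)
# With a capturing group they are kept as separate list items.
parts = re.split(r'(\W+)', s)

print(words)
print(parts)
print(''.join(parts) == s)
```

Because the second form keeps every separator, joining its items reproduces the input exactly, which is what makes it suitable for substituting words in place.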
