
Extract All Domains From Text

I need to extract domains from a string. I have a valid regex that has been tested, but I cannot get it to work with the following code. Probably something obvious that I'm missing.

Solution 1:

Remove the anchors, and make the groups not capture:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

The ^ and $ locked your expression to match whole strings only. re.findall() also changes behaviour when the pattern contains capturing groups; you want to list the whole match here which requires there to be no such groups. (...) is a capturing group, (?:...) is a non-capturing group.

Demo:

>>> myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
>>> re.findall(myregex, mytext)
['foo.com', 'bar.net', 'foobar.net']
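To make the capturing-group point concrete, here is a minimal sketch comparing the two forms of the pattern. The sample text is hypothetical (the question's original input is not shown), and the capturing variant is an assumption about what the original regex looked like once the anchors are stripped.

```python
import re

# Hypothetical sample text; the input from the question is not shown.
sample = "Visit foo.com or bar.net, then mail admin@foobar.net today"

# Assumed shape of the original pattern after removing ^ and $:
# both groups still capture.
capturing = r'([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# The pattern from this answer: same logic, non-capturing groups.
non_capturing = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# With capturing groups, findall returns one tuple of group contents per match
# instead of the matched text itself.
with_groups = re.findall(capturing, sample)

# Without capturing groups, findall returns the whole matches.
whole_matches = re.findall(non_capturing, sample)

print(with_groups)
print(whole_matches)  # ['foo.com', 'bar.net', 'foobar.net']
```

If the capturing groups have to stay for some other reason, iterating with re.finditer() and reading m.group(0) from each match also yields the full matched text.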

Solution 2:

The problem here is that your regex includes ^ at the beginning and $ at the end, meaning it only matches a domain that both starts and ends the string (i.e. the string is just a domain).

For example, it will match "www.stackoverflow.com" but not "this is a question on www.stackoverflow.com" or "www.stackoverflow.com is great".

It should work fine if you just remove ^ and $ from the regex.
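The effect of the anchors can be checked with a short sketch. The anchored pattern below is an assumption about the regex in the question, which is not shown in full; the three test strings are the ones mentioned above.

```python
import re

# Assumed anchored form of the question's regex (not shown in the original).
anchored = r'^(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$'

# Drop the leading ^ and the trailing $.
unanchored = anchored[1:-1]

candidates = [
    "www.stackoverflow.com",
    "this is a question on www.stackoverflow.com",
    "www.stackoverflow.com is great",
]

# With anchors, only a string that is nothing but a domain matches.
anchored_hits = [re.search(anchored, s) is not None for s in candidates]

# Without anchors, the domain is found wherever it sits in the string.
unanchored_hits = []
for s in candidates:
    m = re.search(unanchored, s)
    unanchored_hits.append(m.group(0) if m else None)

print(anchored_hits)    # [True, False, False]
print(unanchored_hits)  # ['www.stackoverflow.com'] repeated three times
```

re.search() is used here because it reports the whole match through group(0) whether or not the pattern contains capturing groups.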

Solution 3:

The problem is the inclusion of ^ at the start and $ at the end of the regex. This makes it match only when the domain is the entire string. Here you want to find matches within the string. Try changing it like so:

myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

EDIT

@Martijn pointed out that non-capturing groups needed to be used here to get the specified output.
