Extract All Domains From Text
Solution 1:
Remove the anchors, and make the groups not capture:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
The ^ and $ anchors locked your expression to match whole strings only. re.findall() also changes behaviour when the pattern contains capturing groups: it returns the contents of the groups instead of the whole match. You want to list the whole match here, which requires there to be no such groups. (...) is a capturing group, (?:...) is a non-capturing group.
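To make the findall() point concrete, here is a minimal sketch. The sample text and variable names are made up for illustration, and the capturing pattern is only an assumption about what the original expression looked like once its anchors are removed.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "visit foo.com or www.bar.net today"

# Assumed shape of the original pattern (anchors removed, groups still capturing).
capturing = r'([a-zA-Z0-9]([a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
# Same pattern with every group made non-capturing.
non_capturing = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# With two capturing groups, findall() returns one 2-tuple of group
# contents per match rather than the matched text itself.
with_groups = re.findall(capturing, text)
print(with_groups)

# With no capturing groups, findall() returns the whole matches.
without_groups = re.findall(non_capturing, text)
print(without_groups)  # ['foo.com', 'www.bar.net']

# If the groups have to stay capturing, finditer() still gives access
# to the whole match through group(0).
whole_matches = [m.group(0) for m in re.finditer(capturing, text)]
print(whole_matches)  # ['foo.com', 'www.bar.net']
```

The finditer() variant is an alternative rather than part of the answer above; it can be handy when the groups are needed for something else.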
Demo:
>>> myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
>>> re.findall(myregex, mytext)
['foo.com', 'bar.net', 'foobar.net']
Solution 2:
The problem here is that your regex includes ^ at the beginning and $ at the end, meaning it only matches when the domain both starts and ends the string (i.e. the string is just a domain).
For example, it will match "www.stackoverflow.com" but not "this is a question on www.stackoverflow.com" or "www.stackoverflow.com is great".
It should work fine if you just remove ^ and $ from the regex.
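As a rough illustration of the effect of the anchors, the sketch below runs the three strings mentioned above through an anchored and an unanchored version of the pattern. The pattern is written with non-capturing groups (an assumption made here so that re.findall() returns whole matches), and the variable names are hypothetical.

```python
import re

# Unanchored pattern, with non-capturing groups so findall() returns whole matches.
pattern = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
# Same pattern locked to the whole string, as in the question.
anchored = '^' + pattern + '$'

samples = [
    "www.stackoverflow.com",
    "this is a question on www.stackoverflow.com",
    "www.stackoverflow.com is great",
]

# The anchored version only matches when the string is nothing but a domain.
anchored_hits = [bool(re.search(anchored, s)) for s in samples]
print(anchored_hits)  # [True, False, False]

# The unanchored version finds the domain wherever it appears.
found = [re.findall(pattern, s) for s in samples]
print(found)  # [['www.stackoverflow.com'], ['www.stackoverflow.com'], ['www.stackoverflow.com']]
```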
Solution 3:
The problem is the inclusion of ^ at the start and $ at the end of the regex. This makes it match only when the domain is the entire string. Here you want to see matches within the string. Try changing it like so:
myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
EDIT
@Martijn pointed out that non-capturing groups need to be used here to get the specified output.