Python: How To Get String Between Matches?

I have

FILE = open('file.txt', 'r') #long text file
TEXT = FILE.read()
#long identification code with dots (.) and dashes (-)
regex = 'process \d\d\d\d\d\d\d\-\d\d\.\d\d\d\d\.\d+

Solution 1:

You can use this regex:

(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)

Match information

MATCH 1
1.  [0-105] `Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern `
MATCH 2
1.  [105-168]   `Process 2234567-89.1234.12431242.12.1234 : chars and more text `
MATCH 3
1.  [168-221]   `Process 3234567-89.1234.12431242.12.1234 - more text `
MATCH 4
2.  [221-267]   `Process 3234567-89.1234.12431242.12.1234 (...)`

You can use this code:

import re
sample_input = "Process 1234567-89.1234.12431242.12.1234 -  text title and long text description with no assured pattern Process 2234567-89.1234.12431242.12.1234 : chars and more text Process 3234567-89.1234.12431242.12.1234 - more text Process 3234567-89.1234.12431242.12.1234 (...)"
# re.match only tries the pattern at the start of the string, so m holds the first record only
m = re.match(r"(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*?)(?=Process)|(Process \d{7}\-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}.*)", sample_input)
m.group(1)       # The first parenthesized subgroup: the text of the first record
m.groups()       # A tuple of all subgroups of this match, from 1 up to however many groups are in the pattern; group 2 is None here
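
Since re.match only tests the beginning of the string, the snippet above yields just the first record. A minimal sketch of collecting every record with re.finditer follows. It is a variation on the regex above, not the same pattern: the lookahead stops at the next full identifier (or the end of the text) rather than at the bare word "Process". The names ID, RECORD and split_records are made up for illustration, and re.DOTALL is an assumption that descriptions may span several lines.

```python
import re

# Identifier shape taken from the regex in this answer.
ID = r"\d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}"

# Assumption: a record runs until the next "Process <id>" or the end of the text.
# re.DOTALL lets "." cross line breaks, in case a description spans several lines.
RECORD = re.compile(rf"Process {ID}.*?(?=Process {ID}|\Z)", re.DOTALL)


def split_records(text):
    """Return every 'Process <id> ...' record found in text, whitespace-trimmed."""
    return [m.group(0).strip() for m in RECORD.finditer(text)]


sample = (
    "Process 1234567-89.1234.1.12.1234 - first "
    "Process 2234567-89.1234.22.12.1234 : second\nline "
    "Process 3234567-89.1234.333.12.1234 (...)"
)
for record in split_records(sample):
    print(repr(record))
```

Requiring a full identifier in the lookahead means a description that happens to contain the word "Process" should not be cut short, though that depends on the real data matching this identifier shape.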

Solution 2:

Suppose you have a string some_str = 'abcARelevant_SubstringAcba' and you want the string between the first A and the second A; i.e. the desired output is 'Relevant_Substring'.

You can find the indices of occurrences of A in some_str with the following line: inds = [a.start() for a in re.finditer('A', some_str)]

So now inds = [3, 22]. Now some_str[inds[0]+1:inds[1]] will contain 'Relevant_Substring'.

This should be extensible to your issue.
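
As a rough sketch of how the index idea might carry over to the question, the start offset of each identifier can serve as a cut point, with the text sliced between consecutive offsets. The identifier pattern is borrowed from Solution 1; ID_PATTERN and slice_records are hypothetical names, and the sample string is invented for the example.

```python
import re

# Identifier pattern borrowed from Solution 1 (assumed to fit the real file).
ID_PATTERN = re.compile(r"Process \d{7}-\d{2}\.\d{4}\.\d+\.\d{2}\.\d{4}")


def slice_records(text):
    """Slice text between the start offsets of consecutive identifiers."""
    starts = [m.start() for m in ID_PATTERN.finditer(text)]
    # Each record ends where the next one starts; the last one ends at len(text).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


sample = (
    "Process 1234567-89.1234.1.12.1234 - first "
    "Process 2234567-89.1234.22.12.1234 : second "
    "Process 3234567-89.1234.333.12.1234 (...)"
)
for record in slice_records(sample):
    print(record)
```

Anything before the first identifier is dropped, and an input with no identifiers gives an empty list.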

EDIT: Here's a concrete example.

Suppose you have a file "file.txt" that contains the following text:

Stuff I don't want.
0
Stuff I do want.
1
More stuff I don't want.

You want to use all digits (0-9) as separators. Therefore, both 0 and 1 above will act as separators. Try the following code:

import re
with open("file.txt", "r") as file:
    data = file.read()
patt = re.compile('[0-9]')
inds = [a.start() for a in patt.finditer(data)]
# strip() drops the line breaks that surround the text between the two digits
print(data[inds[0]+1:inds[1]].strip())

This should print out Stuff I do want.

Solution 3:

You don't need re to find a string between two chars:

some_str = 'abcARelevant_SubstringAcba'
print(some_str.split("A", 2)[1])
Relevant_Substring
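
Note that split("A", 2)[1] raises IndexError when the string contains no "A" at all. A small sketch of a more forgiving variant built on str.partition is shown below; the helper name between and its None return value are choices made for this example, not part of the answer.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found, rest = text.partition(start)
    if not found:
        return None  # opening delimiter is missing
    middle, found, _ = rest.partition(end)
    return middle if found else None  # None if the closing delimiter is missing


print(between('abcARelevant_SubstringAcba', 'A', 'A'))
```

Unlike split, this also works when the opening and closing delimiters differ, for example between("x[y]z", "[", "]").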
