Searching Txt Files In Python

November 17, 2024

I am a new programmer, and we are working on a graduate English project in which we are trying to parse a gigantic dictionary text file (500 MB). The file is set up with HTML-like tags.

Solution 1:

After opening the file, iterate through the lines like this:

with open('huge_file.txt', 'r') as input_file:
    for input_line in input_file:
        # process the line however you need - consider learning some basic regular expressions
        pass  # placeholder so the loop body is valid Python

This lets you process the file one line at a time as needed, rather than loading it all into memory at once.
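
As a rough sketch of what "process the line" could look like with a regular expression: the example below assumes, purely for illustration, that entries are wrapped in <A>...</A> tags and that each entry fits on one line. The tag name, the function name extract_entries and the sample text are hypothetical, not taken from the question.

```python
import io
import re

# Hypothetical tag pattern: the text between <A> and </A> on a single line.
ENTRY_PATTERN = re.compile(r'<A>(.*?)</A>')


def extract_entries(lines):
    """Yield the text of every <A>...</A> pair, reading one line at a time."""
    for line in lines:
        for match in ENTRY_PATTERN.finditer(line):
            yield match.group(1)


# A file object is an iterable of lines, so io.StringIO stands in for open(...) here.
sample = io.StringIO('<A>aardvark</A> noun\nno tags here\n<A>abacus</A><A>abbey</A>\n')
entries = list(extract_entries(sample))
print(entries)  # ['aardvark', 'abacus', 'abbey']
```

Because extract_entries is a generator, passing it a real file object keeps only the current line in memory, which is the point of iterating rather than calling readlines().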

Solution 2:

I don't know regular expressions well, but you can solve this problem without them, using the string method find() and string slicing.

start_tag = '[A>]'
end_tag = '[/W]'

with open('yourFile.txt', 'r') as open_file, open('output_file', 'w') as output_file:
    for each_line in open_file:
        # find() returns -1 when the tag is missing, so compare explicitly
        start_position = each_line.find(start_tag)
        if start_position != -1:
            # skip over the start tag itself
            start_position = start_position + len(start_tag)
            # look for the end tag only after the start tag
            end_position = each_line.find(end_tag, start_position)
            if end_position != -1:
                answer = each_line[start_position:end_position] + '\n'
                output_file.write(answer)

Let me explain what is happening:

Store the two tags in variables (start_tag and end_tag) so their lengths never have to be hard-coded.
Use the with statement. This allows you to open your file under a name (I chose open_file) and ensures the file is closed automatically whether or not your program runs correctly.
We use the 'for line in file:' idiom to tackle the file one line at a time. The loop variable can be named anything (e.g. for x in file, for pizza in file) and will always contain the current line as a string. When it gets to the end of the file, the loop stops automatically.
The 'if start_position != -1:' statement tests whether the starting tag is in that line. find() returns -1 when the tag is missing and 0 when the tag is at the very start of the line, so the result has to be compared with -1 rather than used directly as a condition. If the tag is not there, none of the indented code that follows will run, and the loop will move on to the next line.
We use string slicing to cut out the part of the string we want. We find the position of the first tag (which we already know is in this line), then find the position of the stop tag. Once we have both, we can simply cut out the part between them.
I adjusted the positions in two ways. First, I added the length of the start tag to the start position so the slice skips over the [A>]; instead of giving '[A>] THIS IS MY STRING...' it just gives ' THIS IS MY STRING...'. Second, I searched for the end position starting from just after the [A>] tag, in case the [/W] tag occurs more than once in a line.
We set the answer to the string slice plus a newline character ('\n') so each string appears on its own line. We use the file method .write('stringToWrite') to write each string, one at a time.
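
The find-and-slice steps can also be packaged as a small helper, which makes them easy to try on single lines. This is a minimal sketch; extract_between is a hypothetical name. It compares the result of find() with -1 explicitly, because find() returns -1 (which is truthy) for a miss and 0 (which is falsy) for a match at the start of the line.

```python
def extract_between(line, start_tag, end_tag):
    """Return the text between start_tag and the next end_tag, or None."""
    start = line.find(start_tag)
    if start == -1:
        return None  # start tag not on this line
    start += len(start_tag)  # move past the start tag itself
    end = line.find(end_tag, start)  # search only after the start tag
    if end == -1:
        return None  # no closing tag after the start tag
    return line[start:end]


print(extract_between('[A>]THIS IS MY STRING[/W] and more', '[A>]', '[/W]'))
print(extract_between('no tags on this line', '[A>]', '[/W]'))
```

The first call prints THIS IS MY STRING and the second prints None. Passing start as the second argument to find() keeps the returned index relative to the whole line, so it can be used directly in the slice.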

Solution 3:

You're getting a memory error with readlines() because, given the file size, you're likely reading in more data than your memory can reasonably handle. Since this file is an XML file, you should be able to read through it with iterparse() (available in xml.etree.ElementTree, for example), which will parse the XML lazily without taking up excess memory. Here's some code I used to parse Wikipedia dumps:

# Assumed to be set up before the loop (not shown in the original):
# parser yields both 'start' and 'end' events, root = None, count = 0,
# and namespace holds the dump's XML namespace prefix.
for event, elem in parser:
    if event == 'start' and root is None:
        root = elem
    elif event == 'end' and elem.tag == namespace + 'title':
        page_title = elem.text
        # This clears bits of the tree we no longer use.
        elem.clear()
    elif event == 'end' and elem.tag == namespace + 'text':
        page_text = elem.text
        # Clear bits of the tree we no longer use.
        elem.clear()

        # Now let's grab all of the outgoing links and store them in a list.
        key_vals = []


        # Eliminate duplicate outgoing links.
        key_vals = list(set(key_vals))

        count += 1
        if count % 1000 == 0:
            print(str(count) + ' records processed.')
    elif event == 'end' and elem.tag == namespace + 'page':
        root.clear()

Here's roughly how it works:

We create a parser to progress through the document.
As we loop through each element of the document, we look for elements with the tag you are looking for (in your example it was 'A').
We store that data and process it. We clear any element we are done processing, because otherwise everything parsed so far stays in memory as we go through the document.
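
The three steps above can be sketched in a self-contained form using the standard library's xml.etree.ElementTree.iterparse. Everything about the sample is an assumption made for illustration: the <dictionary>, <entry> and <def> element names, the sample text, and the function name iter_headwords are hypothetical, and the sketch only works if the real file is well-formed XML.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document; a real run would pass a file path instead.
SAMPLE = b"""<dictionary>
  <entry><A>aardvark</A><def>a burrowing mammal</def></entry>
  <entry><A>abacus</A><def>a counting frame</def></entry>
</dictionary>"""


def iter_headwords(source, tag='A'):
    """Yield the text of each <tag> element while keeping memory use low."""
    root = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start' and root is None:
            root = elem  # the first 'start' event belongs to the root element
        elif event == 'end' and elem.tag == tag:
            yield elem.text  # the element is complete at its 'end' event
        elif event == 'end' and elem.tag == 'entry':
            # This entry is finished: drop everything attached to the root so far.
            root.clear()


headwords = list(iter_headwords(io.BytesIO(SAMPLE)))
print(headwords)  # ['aardvark', 'abacus']
```

Clearing the root after each completed entry is what keeps the in-memory tree from growing with the file; without it, iterparse still builds the whole tree as it goes.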

Solution 4:

You should look into a tool called grep. You can give it a pattern to match and a file, and it will print out the occurrences in the file, with line numbers if you want. It is very useful and can be called from Python, for example through the subprocess module.
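
One way to call grep from Python is through the standard subprocess module. The following is only a sketch under some assumptions: grep must be installed and on the PATH, the helper name grep_lines and the sample file are hypothetical, and -F is used so that the pattern is treated as a fixed string rather than a regular expression.

```python
import os
import shutil
import subprocess
import tempfile


def grep_lines(pattern, path):
    """Run `grep -n -F pattern path` and return the matching lines as a list."""
    result = subprocess.run(
        ['grep', '-n', '-F', '--', pattern, path],
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 on no match, and a higher code on error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()


if shutil.which('grep'):  # only try this where grep is available
    with tempfile.TemporaryDirectory() as folder:
        sample_path = os.path.join(folder, 'sample.txt')
        with open(sample_path, 'w') as handle:
            handle.write('<A>aardvark</A>\nplain line\n<A>abacus</A>\n')
        for match in grep_lines('<A>', sample_path):
            print(match)  # each match is prefixed with its line number
```

Passing the arguments as a list avoids shell quoting problems with characters such as < and >, which matters for tag-like patterns.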

Solution 5:

Instead of parsing the file by hand, why not parse it as XML to have better control of the data? You mentioned that the data is HTML-like, so I assume it can be parsed as an XML document.
