Python get lines between two strings

This answer applies if you want to keep the start and end lines (the keywords) while extracting the lines between two strings.

Below is the code snippet I used to extract SQL statements from a shell script.

def process_lines(in_filename, out_filename, start_kw, end_kw):
    try:
        inp = open(in_filename, 'r', encoding='utf-8', errors='ignore')
        out = open(out_filename, 'w+', encoding='utf-8', errors='ignore')
    except FileNotFoundError as err:
        print(f"File {in_filename} not found", err)
        raise
    except OSError as err:
        print(f"OS error occurred trying to open {in_filename}", err)
        raise
    except Exception as err:
        print(f"Unexpected error opening {in_filename} is",  repr(err))
        raise
    else:
        with inp, out:
            copy = False
            for line in inp:
                # start_kw must be lowercase: the line is lowercased before matching
                starts = line.lstrip().lower().startswith(start_kw)
                ends = line.rstrip().endswith(end_kw)
                if starts:
                    copy = True  # a statement begins here; keep the start line
                if copy:
                    out.write(line)
                if ends:
                    # the end line has been written above (if we were copying);
                    # this also covers start and end on the same line
                    copy = False


if __name__ == '__main__':
    infile = "/Users/testuser/Downloads/testdir/BTEQ_TEST.sh"
    outfile = f"{infile}.sql"
    statement_start_list = ['database', 'create', 'insert', 'delete', 'update', 'merge']
    statement_end = ";"
    process_lines(infile, outfile, tuple(statement_start_list), statement_end)


    A common approach is a state machine: read the text until the start marker is encountered, switch to a “recording mode”, and extract lines until the end marker is encountered. The process repeats if several sections appear in the file and all have to be extracted.

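    def lines_between(file, start_marker, end_marker):
        # start_marker and end_marker are placeholders: the literal marker
        # strings are missing from the snippet as posted.
        inRecordingMode = False
        for line in file:
            if not inRecordingMode:
                if line.startswith(start_marker):
                    inRecordingMode = True
            elif line.startswith(end_marker):
                inRecordingMode = False
            else:
                yield line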
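
The snippet above yields a flat stream of lines, so consecutive sections run together. If each section needs to stay separate, one possible variation is to collect the lines into a list per start/end pair. This is only a sketch: `iter_sections` and the `BEGIN`/`END` markers are made-up names for illustration, and a section that never reaches its end marker is silently dropped here.

```python
def iter_sections(lines, start_marker, end_marker):
    """Yield one list of lines per start/end pair, markers excluded."""
    section = None  # None means "not recording"
    for line in lines:
        if section is None:
            if line.startswith(start_marker):
                section = []          # enter recording mode
        elif line.startswith(end_marker):
            yield section             # section finished, hand it over
            section = None            # leave recording mode
        else:
            section.append(line)
    # An unterminated section at end of input is dropped (assumption).


# Hypothetical input with two complete sections and one unterminated one.
sample = ["skip\n", "BEGIN\n", "a\n", "b\n", "END\n",
          "skip\n", "BEGIN\n", "c\n", "END\n", "BEGIN\n", "d\n"]
sections = list(iter_sections(sample, "BEGIN", "END"))
```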
    

    For simple cases, this could also be solved with a regular expression.
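
As a rough illustration of the regular-expression route, assuming the whole file fits in memory and the markers are plain strings: the function name `extract_between` and the `BEGIN`/`END` markers below are hypothetical. `re.escape` keeps the markers literal, the non-greedy `.*?` stops at the first end marker, and `re.DOTALL` lets the match span several lines.

```python
import re


def extract_between(text, start_marker, end_marker):
    """Return every non-overlapping span of text between the two markers."""
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker),
        re.DOTALL,  # let "." match newlines too
    )
    return pattern.findall(text)


# Hypothetical sample text with two marked spans.
found = extract_between("x BEGIN one END y BEGIN two\nlines END", "BEGIN", "END")
```

Unlike the state machine, this matches markers anywhere in the text, not only at the start of a line.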

    • Extract Values between two strings in a text file using python
    • printing lines between start and end point
    • python - Read file from and to specific lines of text

    I have a log file which is output by a script, the log file is rotated daily. It will contain the strings

    Transfer started at timestamp 
    

    and

    Transfer completed successfully at timestamp
    

    repeatedly, as the transfer takes place hourly. The timestamps will have been created beforehand with the date command.

    • I want to capture the last instance of these two strings, and everything in between, into a separate file.
    • If the started string is found near the end of the log file, with no following completed string, I want to capture everything up to EOF and output an error message to say that the end string was not found.

    I'm guessing I'll need to use sed or awk but am really inexperienced with them. I want to use the command in a bash script, and understand what each part is doing, so some explanation would be very useful.

    An example chunk of log file:

    ERROR - Second tech sync failed with rsync error code 255 at Fri May 27 13:50:4$
    --------------------------------------------------------------------
    After_sync script completed successfully with no errors.
    Main script finished at Fri May 27 13:50:43 BST 2016 with PID of 18808.
    --------------------------------------------------------------------
    Transfer started at Fri May 27 13:50:45 BST 2016
    Logs transferred successfully.
    Images transferred successfully.
    Hashes transferred successfully.
    37 approvals pending.
    Transfer completed successfully at Fri May 27 14:05:16 BST 2016
    --------------------------------------------------------------------
    Local repository verification started at Fri May 27 14:35:02 BST 2016
    ...
    

    The desired output:

    Transfer started at Fri May 27 13:50:45 BST 2016
    Logs transferred successfully.
    Images transferred successfully.
    Hashes transferred successfully.
    37 approvals pending.
    Transfer completed successfully at Fri May 27 14:05:16 BST 2016
    

    However, if the log file was like this:

    ERROR - Second tech sync failed with rsync error code 255 at Fri May 27 13:50:4$
    --------------------------------------------------------------------
    After_sync script completed successfully with no errors.
    Main script finished at Fri May 27 13:50:43 BST 2016 with PID of 18808.
    --------------------------------------------------------------------
    Transfer started at Fri May 27 13:50:45 BST 2016
    Logs transferred successfully.
    Images transferred successfully.
    Hashes transferred successfully.
    

    I would want to output:

    Transfer started at Fri May 27 13:50:45 BST 2016
    Logs transferred successfully.
    Images transferred successfully.
    Hashes transferred successfully.
    ERROR: transfer not complete by end of log file
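
The question asks for sed or awk, but the two requirements can also be sketched in Python, the language used earlier on this page. This is not a definitive solution, just one reading of the requirements: every new start line discards whatever was captured before, so only the last block survives, and if no end line follows it the error message is appended. The function name `extract_last_transfer` and the constant `ERROR_LINE` are made up for illustration, and the marker lines are assumed to begin at column 0 of the log.

```python
ERROR_LINE = "ERROR: transfer not complete by end of log file\n"


def extract_last_transfer(lines,
                          start_marker="Transfer started at",
                          end_marker="Transfer completed successfully at"):
    """Return the last start..end block, or start..EOF plus an error line."""
    block = []        # lines of the most recent block seen so far
    complete = False  # True once that block's end marker has been seen
    for line in lines:
        if line.startswith(start_marker):
            block = [line]        # a newer start replaces any earlier block
            complete = False
        elif block and not complete:
            block.append(line)
            complete = line.startswith(end_marker)
    if block and not complete:
        if not block[-1].endswith("\n"):
            block[-1] += "\n"     # keep the error message on its own line
        block.append(ERROR_LINE)
    return block
```

With a real log this could be called as `sys.stdout.writelines(extract_last_transfer(open(path)))`, where `path` is the log file; the result is an empty list if no start line is found at all.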