
I have a syslog file of about 4 GB per minute (roughly 4.5 million lines each minute), and below are 2 of those lines.

I want to generate a new file with only a few columns, eventtime|srcip|dstip, so the result would be as follows:

1548531299|X.X.X.X|X.X.X.X

Please note that the position of the fields within a line is random.
I've tried some filters, but it still takes more than 40 minutes to handle one file on a powerful VM with 4 cores and 16 GB of RAM.

So is there a fast way to handle such a big file and extract only the required columns?

{Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="ERB-03" devid="5KDTB18800169" logid="0000000011" type="traffic" subtype="forward" level="warning" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=3XXXX srcintf="GGI-cer.405" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="hh-BB.100" dstintfrole="undefined" sessionid=xxxxxxx proto=6 action="ip" policyid=5 policytype="policy" service="HTTP" appcat="unscanned" crscore=5 craction=xxxxxx crlevel="low"

Jan 26 22:35:00 172.20.23.148 date=2019-01-26 time=22:34:59 devname="XXX-XX-FGT-03" devid="XX-XXXXXXXX" logid="0000000013" type="traffic" subtype="forward" level="notice" vd="Users" eventtime=1548531299 srcip=X.X.X.X srcport=XXXXX srcintf="XXX-Core.123" srcintfrole="undefined" dstip=X.X.X.X dstport=XX dstintf="sXX-CC.100" dstintfrole="undefined" sessionid=1234567 cvdpkt=0 appcat="unscanned" crscore=5 craction=123456 crlevel="low"}
pa4080

4 Answers


Perl to the rescue

Save the following script as filter.pl and make it executable (chmod +x):

#!/usr/bin/env perl

use strict;
use warnings;

while( <> ) {
    # the three lookaheads pick out eventtime, srcip, and dstip wherever they appear on the line
    if ( /^(?=.*eventtime=(\S+))(?=.*srcip=(\S+))(?=.*dstip=(\S+)).*$/ ) {
        print "$1|$2|$3\n";
    }
}

Then run

pduck@ubuntu:~> time ./filter.pl < input.txt > output.txt

real    0m44,984s
user    0m43,965s
sys     0m0,973s

The regex uses a lookaround pattern, in this case a positive lookahead, to match the three values eventtime, srcip, and dstip in any order.
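For instance, because every lookahead starts matching from the beginning of the line again, the script prints the same result even when the fields are shuffled (a quick check with made-up values):

$ echo 'dstip=2.2.2.2 eventtime=99 srcip=1.1.1.1' | ./filter.pl
99|1.1.1.1|2.2.2.2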

I duplicated your two input lines until I got a file with 4 GB and approximately 9 million lines. I ran the code on an SSD.
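(For reference, a minimal sketch of one way to build such a test file, assuming the two sample lines are saved in sample.txt; the file names are made up:)

cp sample.txt input.txt
# keep doubling the file until it reaches roughly 4 GB
while [ "$(stat -c %s input.txt)" -lt $((4 * 1024 * 1024 * 1024)) ]; do
    cat input.txt input.txt > tmp.txt && mv tmp.txt input.txt
done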

PerlDuck

If you want a really fast solution, I suggest the flex tool. Flex generates C. The following program handles examples like the one presented and accepts the fields in any order. Create a file named f.fl with the following content:

%option main
%{
#include <string.h>   /* for strcpy */
%}
%%
  char e[100] = "", s[100] = "", d[100] = "";

eventtime=[^ \n]*   { strcpy(e,yytext+10); }
srcip=[^ \n]*       { strcpy(s,yytext+6);  }
dstip=[^ \n]*       { strcpy(d,yytext+6);  }
\n                  { if (e[0] && s[0] && d[0]) printf("%s|%s|%s\n",e,s,d);
                      e[0]=s[0]=d[0]=0; }
.                   {}
%%

To test it, try:

$ flex -f -o f.c f.fl 
$ cc -O2 -o f f.c
$ ./f < input > output
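For the two sample lines in the question (both carry all three fields), the output file should contain:

1548531299|X.X.X.X|X.X.X.X
1548531299|X.X.X.X|X.X.X.X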

Here is the time comparison:

$ time ./f < 13.5-milion-lines-3.9G-in-file > out-file
real    0m35.689s
user    0m34.705s
sys     0m0.908s
JJoao

I duplicated your two 'input' lines to a file of size 3867148288 bytes (3.7 GiB), and I could process it with grep in 8 minutes and 24 seconds (reading from and writing to an HDD; it should be faster with an SSD or a ramdisk).

In order to minimize the time used, I used only standard features of grep and did not post-process the output, so the format is not what you specify, but it might be useful anyway. You can test this command:

time grep -oE -e 'eventtime=[0-9]* ' \
 -e 'srcip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
 -e 'dstip=[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]\.[[:alnum:]]' \
infile > outfile

Output from your two lines:

$ cat outfile
eventtime=1548531298 
srcip=X.X.X.Y
dstip=X.X.X.X
eventtime=1548531299 
srcip=X.X.X.Z
dstip=X.X.X.Y

The output file contains 25165824 lines, corresponding to 8388608 (8.4 million) lines in the input file.

$ wc -l outfile
25165824 outfile
$ <<< '25165824/3' bc
8388608

My test indicates that grep can process approximately 1 million lines per minute.

Unless your computer is much faster than mine, this is not fast enough, and I think you have to consider something that is several times faster, probably filtering before the log file is written; but it would be best to completely avoid writing output that is not necessary (and so avoid the filtering altogether).
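As a purely hypothetical sketch of 'filtering before writing': if the messages can be tapped as a stream while they arrive (here with tail -F on the growing file; the path is made up), the same grep patterns could be applied on the fly instead of re-reading the finished file afterwards:

tail -F /var/log/remote-syslog.log | grep --line-buffered -oE \
 -e 'eventtime=[0-9]* ' \
 -e 'srcip=[^ ]*' \
 -e 'dstip=[^ ]*' >> filtered-out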


The input file was made by duplication, and maybe the system 'remembers' that it has seen the same lines before and speeds things up, so I don't know how fast it will work with a real, big file containing all the unpredictable variations. You have to test it.


Edit 1: I ran the same task on a Dell M4800 with a 4th-generation Intel i7 processor and an SSD. It finished in 4 minutes and 36 seconds, at almost double the speed, 1.82 million lines per minute.

$ <<< 'scale=2;25165824/3/(4*60+36)*60/10^6' bc
1.82

Still too slow.


Edit 2: I simplified the grep patterns and ran the test again on the Dell.

time grep -oE -e 'eventtime=[^\ ]*' \
 -e 'srcip=[^\ ]*' \
 -e 'dstip=[^\ ]*' \
infile > out

It finished after 4 minutes and 11 seconds, a small improvement, to 2.00 million lines per minute.

$ <<< 'scale=2;25165824/3/(4*60+11)*60/10^6' bc
2.00

Edit 3: @JJoao's suggestion to use grep's Perl regex extension speeds grep up to 39 seconds, corresponding to 12.90 million lines per minute on the computer where ordinary grep reads 1 million lines per minute (reading from and writing to an HDD).

$ time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

real    0m38,699s
user    0m31,466s
sys     0m2,843s

This Perl extension is experimental according to info grep, but it works in my Lubuntu 18.04.1 LTS.

‘-P’ ‘--perl-regexp’ Interpret the pattern as a Perl-compatible regular expression (PCRE). This is experimental, particularly when combined with the ‘-z’ (‘--null-data’) option, and ‘grep -P’ may warn of unimplemented features. *Note Other Options::.
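When the three fields appear in the order eventtime, srcip, dstip on every line (they do in the two sample lines, but the question says the order may vary, so check this first), the one-value-per-line output of this command can be folded back into the requested format with paste; each group of three consecutive matches then becomes one eventtime|srcip|dstip line. For example:

$ grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile | paste -d'|' - - -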

I also compiled a C program according to @JJoao's flex method, and it finished in 53 seconds, corresponding to 9.49 million lines per minute on the same computer (reading from and writing to an HDD). Both methods are fast, but grep with the Perl extension is the fastest.

$ time ./filt.jjoao < infile > out-flex-JJoao

real    0m53,440s
user    0m48,789s
sys     0m3,104s

Edit 3.1: On the Dell M4800 with an SSD I got the following results:

time ./filt.jjouo < infile > out-flex-JJoao

real    0m25,611s
user    0m24,794s
sys     0m0,592s

time grep -oP '\b(eventtime|srcip|dstip)=\K\S+' infile >out-grep-JJoao

real    0m18,375s
user    0m17,654s
sys     0m0,500s

This corresponds to

  • 19.66 million lines per minute for the flex application
  • 27.35 million lines per minute for grep with the perl extension

Edit 3.2: On the Dell M4800 with an SSD I got the following results when I used the -f option with the flex preprocessor,

flex -f -o filt.c filt.flex
cc -O2 -o filt.jjoao filt.c

The result improved, and now the flex application shows the highest speed:

flex -f ...

$ time ./filt.jjoao < infile > out-flex-JJoao

real    0m15,952s
user    0m15,318s
sys     0m0,628s

This corresponds to

  • 31.55 million lines per minute for the flex application.
sudodus

Here is one possible solution, based on this answer, provided by @PerlDuck a while ago:

#!/bin/bash
# Read the log line by line, turn each key=value pair into a shell variable
# via eval, then print the three wanted fields.
while IFS= read -r LINE
do
    if [[ ! -z ${LINE} ]]
    then
        # strip the surrounding braces, put one key=value pair per line,
        # keep only the lines containing '=', and eval them as assignments
        eval $(echo "$LINE" | sed -e 's/\({\|}\)//g' -e 's/ /\n/g' | sed -ne '/=/p')
        echo "$eventtime|$srcip|$dstip"
    fi
done < "$1"

I do not know how it will behave on such a large file; IMO an awk solution would be much faster (a rough sketch is appended after the timing below). Here is how the script works with the provided input file example:

$ ./script.sh in-file
1548531299|X.X.X.X|X.X.X.X
1548531299|X.X.X.X|X.X.X.X

Here is the result of a timing test, performed on a regular i7 machine equipped with an SSD and 16 GB of RAM:

$ time ./script.sh 160000-lines-in-file > out-file

real    4m49.620s
user    6m15.875s
sys     1m50.254s
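As a rough illustration of the awk alternative mentioned above (an untested sketch; the field-name lengths are hard-coded and the file names are placeholders):

awk '{
    e = s = d = ""                          # reset for every line
    for (i = 1; i <= NF; i++) {             # scan the whitespace-separated fields
        if ($i ~ /^eventtime=/) e = substr($i, 11)
        if ($i ~ /^srcip=/)     s = substr($i, 7)
        if ($i ~ /^dstip=/)     d = substr($i, 7)
    }
    if (e != "" && s != "" && d != "") print e "|" s "|" d
}' in-file > out-file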
pa4080