Sunday, January 5, 2014

How to split a file into parts based on a regex

We have the following file.
 
# cat text-to-split.txt
       1  aaa
       2  b1
       3  bb2
       4  bbb3
       5  c
       6  c1
       7  cc2
       8  aaa
       9  b1
      10  bb2
      11  c
      12  c1
      13  cc2
      14  aaa
      15  b1
      16  c
      17  cc2
      18  aaa
      19  b1
      20  aaa
      21  c1
      22  ccc

Problem

How can we split a file into parts based on its content?

Analysis and description of results

Example 1: a single split pattern for the whole file

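The file is split at each line that matches a single pattern.
The leading '%aaa%' skips to the first matching line without writing a piece, '/aaa/' splits just before the next match, and '{*}' repeats that split as many times as possible.
The -k option keeps the pieces already written if csplit hits an error.
The numbers csplit prints are the sizes in bytes of the pieces it writes.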
 
# csplit -k text-to-split.txt '%aaa%' '/aaa/' '{*}'
74
62
41
22
35
root@perf1:~/split# for i in xx0*; do echo $i; cat -n $i; done
xx00
     1       1  aaa
     2       2  b1
     3       3  bb2
     4       4  bbb3
     5       5  c
     6       6  c1
     7       7  cc2
xx01
     1       8  aaa
     2       9  b1
     3      10  bb2
     4      11  c
     5      12  c1
     6      13  cc2
xx02
     1      14  aaa
     2      15  b1
     3      16  c
     4      17  cc2
xx03
     1      18  aaa
     2      19  b1
xx04
     1      20  aaa
     2      21  c1
     3      22  ccc
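
Example 1 can be replayed in a scratch directory with the sketch below. It rests on a few assumptions that are not in the original post: the sample file is rebuilt as `cat -n` output of 22 short lines (so byte sizes may differ slightly from the ones printed above), GNU csplit is installed, and the prefixes skip_, plain_ and elide_ are made-up names. It also shows what happens when the leading '%aaa%' is left out: line 1 already matches, so the first piece comes out empty unless the GNU-only -z option removes it.

```shell
# Work in a scratch directory so the pieces do not clutter the current one.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Assumption: the sample file is `cat -n` output of 22 short lines.
printf '%s\n' aaa b1 bb2 bbb3 c c1 cc2 aaa b1 bb2 c c1 cc2 \
    aaa b1 c cc2 aaa b1 aaa c1 ccc | cat -n > text-to-split.txt

# The command from Example 1; -s hides the byte counts, -f sets the prefix.
csplit -s -k -f skip_ text-to-split.txt '%aaa%' '/aaa/' '{*}'

# Without the leading %aaa%, line 1 matches at once and the first piece is empty.
csplit -s -k -f plain_ text-to-split.txt '/aaa/' '{*}'

# -z (a GNU extension) deletes empty pieces instead of keeping them.
csplit -s -k -z -f elide_ text-to-split.txt '/aaa/' '{*}'

# Show the line count of every piece from the three runs.
wc -l skip_* plain_* elide_*
```

The skip_ and elide_ runs should give the same five pieces, while the plain_ run has one extra, empty plain_00 in front.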

Example 2: multiple split patterns

csplit takes a variable number of patterns.
It scans the file and, when a line matches the current pattern, splits the file just before that line.
It then moves on to the next pattern and continues scanning the rest of the file.
When a match is found, the file is split again at that point.
A trailing '{*}' repeats the preceding pattern as many times as possible; whatever remains after the last match goes into the final piece.

In this example we:
  • Skip to the first line containing b1 (the %...% form discards the skipped lines instead of saving them).
  • Continue searching for aaa and split when it is found.
  • Continue searching for c1 and split when it is found.
  • Repeat the last pattern (c1) for as long as the file still has data.
# csplit -k text-to-split.txt '%b1%' '/aaa/' '/c1/' '{*}'
63
41
96
23
root@perf1:~/split# for i in xx0*; do echo $i; cat -n $i; done
xx00
     1       2  b1
     2       3  bb2
     3       4  bbb3
     4       5  c
     5       6  c1
     6       7  cc2
xx01
     1       8  aaa
     2       9  b1
     3      10  bb2
     4      11  c
xx02
     1      12  c1
     2      13  cc2
     3      14  aaa
     4      15  b1
     5      16  c
     6      17  cc2
     7      18  aaa
     8      19  b1
     9      20  aaa
xx03
     1      21  c1
     2      22  ccc
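
The left-to-right use of the patterns can be checked with a small sketch. As before, it assumes GNU csplit and a sample file rebuilt as `cat -n` output; the part_ prefix is a made-up name. Printing the first line of every piece shows which line opened it.

```shell
# Work in a scratch directory so the pieces do not clutter the current one.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Assumption: the sample file is `cat -n` output of 22 short lines.
printf '%s\n' aaa b1 bb2 bbb3 c c1 cc2 aaa b1 bb2 c c1 cc2 \
    aaa b1 c cc2 aaa b1 aaa c1 ccc | cat -n > text-to-split.txt

# Patterns are used left to right; '{*}' repeats only the one before it (/c1/).
csplit -s -k -f part_ text-to-split.txt '%b1%' '/aaa/' '/c1/' '{*}'

# Print the first line of each piece, i.e. the line on which it was opened.
for f in part_*; do
    printf '%s: %s\n' "$f" "$(head -n 1 "$f")"
done
```

The pieces should open on lines 2 (b1), 8 (aaa), 12 (c1) and 21 (c1), as in the listing above. The aaa lines at 14, 18 and 20 do not start new pieces because by then only c1 is being searched for.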

