First a little background information. I wanted to import an 8 Gig Wikipedia table dump for a project that I was working on. The table was too big to import all of the records on my Linode server. I only needed certain rows and columns, so I tried parsing the dump file in PHP. This worked, but it was much too slow. So then I went looking for text processor programs. This turned up awk, grep, lex and sed. awk, grep and sed are all line based processors. These would not work because the dump file has multiple database records per line. That left lex and its successor flex.
Flex is a lexical analyzer that looks for patterns and outputs whatever you want when a pattern is matched. The output code is written in C and Flex generates a C program that you compile into an executable. I wanted to convert the dump file to a tab-separated value (TSV) file because that is easy to import in PHP. I also wanted to only read certain rows and columns, so I piped the output into awk to strip rows and columns.
See the attached "Mysl Flex" file below for the flex configuration.
Some things to note about the configuration:
- The standard C fileio routines only support a maximum file size of 2 Gig. The line #define _FILE_OFFSET_BITS 64 enables large file support.
- Three types of output are supported: 1) A filename based on the table name. 2) A specified filename. 3) stdout
- The {TABLESTART} pattern section sets the state to INTABLE and preps the output file if necessary.
- The <INTABLE> state section controls the table row processing.
- The<INROW> state section controls the row field processing.
- The <INSTRING> state section controls the string value ' escaping.
- The . and \n patterns cause the non-record insert data in the file to be ignored.
The command to generate the C code is: flex -Cf -8 mysql.flex
- The -Cf optimizes for speed.
- The -8 enables 8-bit support. This is needed for Unicode and hi-ASCII characters.
The command to compile the C code is: gcc lex.yy.c -lfl -o mysqlparse
Usage: ./mysqlparse <inputfile> <outputfile>
- <inputfile> - Mysqldump file to convert to tsv formatted rows
- <outputfile> - (Optional) tsv output file. If not specified, output files are written to the same directory as <inputfile> and are named using the mysql table name + .tsv; A dash (-) means write to stdout
See the attached "Template Links Awk" file below for the awk configuration.
Some things to note about the configuration:
- The BEGIN { FS = "\t"; OFS = "\t" } line sets the input and output field delimiters to a tab character.
- The {if ($2 == 10) print $1, $3} line outputs the first and third fields when the second field equals 10.
Example command chain: ./mysqlparse dump.sql - | awk -f templatelinks.awk - >output.tsv