Avoid querying the position in the text stream using Qt's pos() function
to update the progress dialog. Instead keep track of the stream position
manually. This is possible here because we don't ever seek in the file.
As a result, this speeds up the CSV import dramatically.
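The gist of it, as a sketch only (a QTextStream-based reader with an
illustrative chunk size and helper name; not the actual import code):

    // Sketch: keep our own position counter instead of calling
    // QTextStream::pos(), which can be slow because the stream may have
    // to flush and re-scan its internal buffers to answer it.
    #include <QFile>
    #include <QTextStream>

    void importWithProgress(const QString& fileName)
    {
        QFile file(fileName);
        if(!file.open(QIODevice::ReadOnly | QIODevice::Text))
            return;

        QTextStream stream(&file);
        qint64 approxPosition = 0;          // our own counter

        while(!stream.atEnd())
        {
            const QString chunk = stream.read(4096);
            approxPosition += chunk.size(); // characters read so far;
                                            // good enough for a progress bar

            // ... parse the chunk and feed approxPosition to the
            // progress dialog ...
        }
    }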
This commit bundles a number of smaller optimisations in the CSV parser
and import code. They do add up to a noticeable speed gain though (at
least on some systems and configurations).
We were separating the CSV import into two steps: parsing the CSV file
and inserting the parsed data. This had the advantages of keeping the
parsing code and the database code nicely separated and of giving us
full knowledge of the CSV file before we start inserting the data into
the database. However, it made it necessary to keep the entire parser
result in RAM. For large CSV files this uses enormous amounts of
memory.
This commit changes the import to parse the first 20 lines and analyse
them. This should give us a good impression of what to expect from the
rest of the file. Based on that information we then parse the file row
by row and insert each row into the database as soon as it is parsed.
This means we only have to keep one row at a time in memory while
more or less retaining the ability to analyse the file before
inserting data.
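Roughly, the flow now looks like the following sketch; parseRow() and
insertRow() are hypothetical stand-ins for the real parser and
database code, and the row limit of 20 matches the analysis window
described above:

    #include <QStringList>
    #include <QTextStream>
    #include <QVector>
    #include <functional>

    void importCsv(QTextStream& stream,
                   const std::function<QStringList(QTextStream&)>& parseRow,  // parses one row
                   const std::function<void(const QStringList&)>& insertRow)  // inserts one row
    {
        // Parse and buffer only the first 20 rows to analyse the file
        // (column count, field names, ...).
        QVector<QStringList> preview;
        while(preview.size() < 20 && !stream.atEnd())
            preview.append(parseRow(stream));

        // ... analyse 'preview' and prepare the INSERT statement here ...

        // Insert the buffered rows, then continue row by row so only
        // one row at a time has to be kept in memory.
        for(const QStringList& row : preview)
            insertRow(row);
        while(!stream.atEnd())
            insertRow(parseRow(stream));
    }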
On my system this does seem to change the runtime for small files,
which now take a little longer (<5%), though these measurements aren't
conclusive. For large files, however, it changes memory consumption
from using all memory and starting to swap within seconds to almost no
memory consumption at all. And not having to swap speeds things up a
lot.
When parsing a CSV file we used to check the column count for each row
and track the highest number of columns that we found. This information
could then be used to create an INSERT statement large enough for all
the data.
This commit removes that column number tracking code. Instead it
analyses only the first 20 rows, and it does that while generating the
field list.
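Something along these lines, where the preview rows are the (up to) 20
rows mentioned above and the generic field names are only a
placeholder for what the real code does:

    #include <QStringList>
    #include <QVector>

    QStringList generateFieldList(const QVector<QStringList>& previewRows)
    {
        // The widest of the preview rows determines the column count.
        int columns = 0;
        for(const QStringList& row : previewRows)
            if(row.size() > columns)
                columns = static_cast<int>(row.size());

        // Build generic field names; the real code would prefer names
        // taken from the header row where one is available.
        QStringList fields;
        for(int i = 0; i < columns; ++i)
            fields.append(QString("field%1").arg(i + 1));
        return fields;
    }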
Performance-wise this should take a (very) little longer, but it makes
it easier to improve the performance in other ways later, which should
more than compensate for this commit.
Feature-wise this should fix some (technically invalid) corner-case CSV
files with fewer fields in the title row than in the other rows. It
should also break some other (technically invalid) corner-case CSV files
if they are imported into an existing table and have fewer columns
than the existing table in their first 20 rows but exactly the same
number later on. Neither case, I think, matters too much.
We're reading CSV files not all at once but in chunks. And when we
encounter a \r char we check whether it is followed by a \n char. So
far so good. But it might happen that we hit a \r char right at the
end of the current buffer. In that case the lookahead check doesn't
work as expected because there is no more data available yet.
This commit fixes the issue by checking for these conditions and loading
an extra byte when needed.
See issue #1033.
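A hypothetical sketch of that check, assuming a QString buffer and a
readMoreData() helper that appends the next chunk to it (both names
are made up for illustration):

    #include <QString>
    #include <functional>

    // Returns true if the '\r' at position 'pos' is followed by '\n',
    // loading more data first if '\r' is the buffer's last character.
    bool isCrLf(QString& buffer, int pos,
                const std::function<bool(QString&)>& readMoreData) // false at EOF
    {
        if(pos == buffer.size() - 1)
            readMoreData(buffer);   // the lookahead would fail otherwise

        return pos + 1 < buffer.size() &&
               buffer.at(pos + 1) == QLatin1Char('\n');
    }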