Update validation function to support name mapping during primary key
validation, allowing users to specify primary keys using either
original CSV column names or mapped target column names when using
the --map option. Ensure validation is skipped when a schema file is
provided, since the --pk and --schema parameters are mutually
exclusive.
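A minimal sketch of the name-mapping check, in Go; the helper name and
signature are illustrative stand-ins, not the actual Dolt code:

    // pkMatchesColumn reports whether a user-supplied --pk name refers
    // to the given file column, either directly or through --map.
    func pkMatchesColumn(pk, fileCol string, nameMap map[string]string) bool {
        if pk == fileCol {
            return true // pk given as the original CSV column name
        }
        // pk given as the mapped target column name via --map
        mapped, ok := nameMap[fileCol]
        return ok && mapped == pk
    }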
Update test expectations in import-create-tables.bats to match the
existing detailed error message format that includes available columns,
rather than reverting the error messages to satisfy the old test
expectations. Move the new validation tests from import-tables.bats to
import-create-tables.bats as requested in code review.
The two failing tests "try to table import with nonexistent --pk arg"
and "try to table import with one valid and one nonexistent --pk arg"
now pass because our early validation catches these errors before
InferSchema, producing more helpful error messages that include
available columns. Updated the test expectations to match our
validation's error format instead of the old generic "column not found"
message.
Refs: #1083
Add early validation to check if specified primary keys exist in the
import file's schema before processing rows. This prevents users from
waiting for large files to be processed only to discover that their
primary key column names are invalid.
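A rough sketch of the shape of that check (the function name comes from
this change; the signature and exact message here are illustrative, and
the real code validates against the inferred file schema):

    import (
        "fmt"
        "strings"
    )

    func validatePrimaryKeysAgainstSchema(pks, cols []string) error {
        colSet := make(map[string]struct{}, len(cols))
        for _, c := range cols {
            colSet[c] = struct{}{}
        }
        var missing []string
        for _, pk := range pks {
            if _, ok := colSet[pk]; !ok {
                missing = append(missing, pk)
            }
        }
        if len(missing) > 0 {
            // List the available columns so users can correct --pk
            // without re-running the import.
            return fmt.Errorf("primary key(s) %s not found in import file; available columns: %s",
                strings.Join(missing, ", "), strings.Join(cols, ", "))
        }
        return nil
    }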
Changes:
- Add validatePrimaryKeysAgainstSchema function to check primary key
existence against file schema
- Integrate validation into newImportDataReader for create operations
- Provide helpful error messages listing available columns when
primary keys are not found
- Add unit tests covering various validation scenarios
- Add BATS integration tests for CSV, PSV, and large file scenarios
The validation only runs for create operations when primary keys are
explicitly specified and no schema file is provided. This ensures
fast failure while maintaining backward compatibility.
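Sketched gating condition (the parameter names are stand-ins, not the
actual identifiers in newImportDataReader):

    // Only create operations with an explicit --pk and no --schema
    // file go through the early check.
    func shouldValidatePks(isCreate bool, pks []string, schemaFile string) bool {
        return isCreate && len(pks) > 0 && schemaFile == ""
    }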
Before: Users waited minutes for large files to process before seeing
"provided primary key not found" errors
After: Users get immediate feedback with helpful column suggestions
Refs: #1083
A push to a remote works by uploading the missing content and then adding
references to it in the remote datastore. If the remote is running a GC during
the push, it is possible for the newly added data to be collected and no longer
be available when the references are added.
This should cause a transient failure which is safe to retry. There were a
couple of bugs which could instead cause a panic. This makes some changes to
safeguard against those cases.
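As a sketch of the intended caller behavior, only the transient GC race
should be retried; the sentinel error and function below are
hypothetical, not Dolt's actual API:

    import "errors"

    // Hypothetical sentinel for "uploaded chunks were collected by a
    // remote GC before the references were added".
    var errGCTransient = errors.New("uploaded chunks collected by remote GC")

    func pushWithRetry(push func() error, maxAttempts int) error {
        var err error
        for i := 0; i < maxAttempts; i++ {
            if err = push(); err == nil {
                return nil
            }
            if !errors.Is(err, errGCTransient) {
                return err // non-transient failures are not retried
            }
        }
        return err
    }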