Data Import Log Retriever Guide#

The Data Import Log Retriever (get-di-logs) is a diagnostic tool that retrieves error logs and associated MARC records from failed Data Import jobs by querying FOLIO’s PostgreSQL database directly.

Overview#

When MARC records fail to import through FOLIO’s Data Import system, the error information is stored in the database but can be difficult to access through the UI. This tool:

  • Queries the PostgreSQL database for failed records

  • Extracts the original MARC record that caused the error

  • Generates a TSV error report with error messages

  • Creates a MARC file containing all failed records for reprocessing

  • Supports SSH tunneling for secure database access

Note

This tool requires direct PostgreSQL database access, which is typically only available to system administrators or in hosted environments with proper credentials.

Prerequisites#

PostgreSQL Dependencies#

The get-di-logs command requires the PostgreSQL optional dependencies:

# Using pip
pip install 'folio_data_import[postgres]'

# Using uv
uv add 'folio_data_import[postgres]'

Required Access#

  1. FOLIO API credentials - Username/password with read access

  2. PostgreSQL database access - Direct connection to the FOLIO database

  3. SSH access (optional) - If the database is only accessible via a bastion/jump host

Job Execution IDs#

You need the Job Execution IDs from the Data Import jobs you want to analyze. These can be found:

  • In FOLIO’s Data Import logs UI

  • In the marc_import_job_ids.txt file generated by the marc command

  • Via the FOLIO API at /change-manager/jobExecutions

Basic Usage#

Command Line#

folio-data-import get-di-logs \
  --folio-url https://okapi.folio.example.com \
  --folio-tenant my_tenant \
  --folio-username admin \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt

Using the Standalone Script#

folio-di-logs \
  --folio-url https://okapi.folio.example.com \
  --folio-tenant my_tenant \
  --folio-username admin \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt

Configuration Files#

Database Configuration (db_config.json)#

The database configuration file specifies how to connect to the PostgreSQL database:

{
  "host": "folio-db.cluster-abc123.us-east-1.rds.amazonaws.com",
  "port": 5432,
  "database": "folio",
  "user": "folio_readonly",
  "password": "your_password_here"
}

Field

Required

Default

Description

host

Yes

-

PostgreSQL server hostname

port

No

5432

PostgreSQL port

database

Yes

-

Database name (usually folio)

user

Yes

-

Database username

password

No

-

Database password (can be omitted for passwordless auth)

SSH Tunnel Configuration (ssh_config.json)#

If the database is only accessible through an SSH bastion host, provide an SSH configuration:

{
  "ssh_tunnel": true,
  "ssh_host": "bastion.example.com",
  "ssh_user": "tunnel_user",
  "ssh_private_key_path": "~/.ssh/id_rsa",
  "use_ssh_config": true,
  "ssh_path": "/usr/bin/ssh"
}

Field

Required

Default

Description

ssh_tunnel

Yes

false

Enable SSH tunneling

ssh_host

If tunneling

-

SSH bastion/jump host

ssh_user

No

-

SSH username (uses SSH config if not set)

ssh_private_key_path

No

-

Path to SSH private key

use_ssh_config

No

false

Use ~/.ssh/config for connection settings

ssh_path

No

ssh

Path to SSH binary

AWS SSM Session Manager Example#

For AWS environments using SSM Session Manager:

{
  "ssh_tunnel": true,
  "ssh_host": "i-0abc123def456789",
  "use_ssh_config": true
}

With corresponding ~/.ssh/config:

Host i-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
    User ec2-user

Job IDs File#

A plain text file with one Job Execution ID per line:

a1b2c3d4-e5f6-7890-abcd-ef1234567890
b2c3d4e5-f6a7-8901-bcde-f12345678901
c3d4e5f6-a7b8-9012-cdef-123456789012

Lines starting with # or empty lines are ignored.

Command-Line Parameters#

Connection Parameters#

Parameter

Environment Variable

Description

--folio-url

FOLIO_URL

FOLIO Gateway/Okapi URL

--folio-tenant

FOLIO_TENANT

FOLIO tenant ID

--folio-username

FOLIO_USERNAME

FOLIO username

--folio-password

FOLIO_PASSWORD

FOLIO password

Database Parameters#

Parameter

Default

Description

--db-config

Required

Path to database configuration JSON file

--ssh-config

(none)

Path to SSH tunnel configuration JSON file

Input/Output Parameters#

Parameter

Default

Description

--job-ids-file

marc_import_job_ids.txt

Path to file containing Job Execution IDs

--report-file-path

di_error_report.tsv

Path for the error report TSV output

--marc-file-path

di_error_records.mrc

Path for the MARC records output

Other Options#

Parameter

Default

Description

--no-progress

false

Disable progress bar display

--debug

false

Enable debug logging

Output Files#

Error Report (TSV)#

A tab-separated file with error details:

'Error Log'	'MARC Record'
'"Error processing record: Invalid indicator value"'	'00732nam a2200241...'
'"Duplicate record detected"'	'00845cam a2200289...'

The report can be opened in Excel or any spreadsheet application for analysis.

MARC Records File#

A binary MARC21 file (.mrc) containing all the records that failed to import. This file can be:

  • Edited to fix issues using MarcEdit or similar tools

  • Reprocessed through Data Import after corrections

  • Analyzed with pymarc or other MARC processing tools

Workflow Examples#

Basic Error Retrieval#

# After running a MARC import that had errors
folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt \
  --report-file-path errors_2024.tsv \
  --marc-file-path failed_records_2024.mrc

With SSH Tunnel#

folio-data-import get-di-logs \
  --db-config db_config.json \
  --ssh-config ssh_config.json \
  --job-ids-file job_ids.txt

Using Environment Variables#

export FOLIO_URL="https://okapi.folio.example.com"
export FOLIO_TENANT="my_tenant"
export FOLIO_USERNAME="admin"
export FOLIO_PASSWORD="secret"

folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file job_ids.txt

Complete Migration Error Analysis Workflow#

# 1. Run the MARC import
folio-data-import marc \
  --marc-file-path migration_records.mrc \
  --import-profile-name "Default - Create Instance and SRS MARC Bib" \
  --job-ids-file-path migration_job_ids.txt

# 2. After import completes, retrieve any errors
folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file migration_job_ids.txt \
  --report-file-path migration_errors.tsv \
  --marc-file-path migration_failed.mrc

# 3. Review errors, fix MARC records, and re-import
folio-data-import marc \
  --marc-file-path migration_failed_fixed.mrc \
  --import-profile-name "Default - Create Instance and SRS MARC Bib"

Programmatic Usage#

The DILogRetriever class can be used directly in Python scripts:

import folioclient
from folio_data_import.DILogRetriever import DILogRetriever
from folio_data_import._postgres import PostgresConfig, SSHTunnelConfig
from folio_data_import._progress import NoOpProgressReporter

# Create FOLIO client
folio_client = folioclient.FolioClient(
    gateway_url="https://okapi.folio.example.com",
    tenant_id="my_tenant",
    username="admin",
    password="secret",
)

# Database configuration
db_config = PostgresConfig(
    host="folio-db.example.com",
    port=5432,
    database="folio",
    user="folio",
    password="db_password",
)

# SSH tunnel (optional)
ssh_config = SSHTunnelConfig(ssh_tunnel=False)

# Create retriever
retriever = DILogRetriever(
    folio_client=folio_client,
    db_config=db_config,
    ssh_tunnel_config=ssh_config,
    progress_reporter=NoOpProgressReporter(),
)

# Retrieve errors
job_ids = ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"]
error_logs = retriever.retrieve_errors_with_marc(job_ids)

# Process results
for error_message, marc_record in error_logs:
    print(f"Error: {error_message}")
    print(f"Title: {marc_record.title}")

# Or generate report files
retriever.generate_error_report_and_marc_file(
    error_logs=error_logs,
    report_file_path="errors.tsv",
    marc_file_path="failed.mrc",
)

Troubleshooting#

“PostgreSQL support requires the ‘postgres’ optional dependencies”#

Install the PostgreSQL dependencies:

pip install 'folio_data_import[postgres]'

Connection Refused / Timeout#

  • Verify the database host and port are correct

  • Check firewall rules allow the connection

  • If using SSH tunneling, verify the SSH configuration

  • Ensure the database is accepting connections from your IP

“ssh_host is required when ssh_tunnel is enabled”#

Your ssh_config.json has "ssh_tunnel": true but is missing ssh_host. Add the bastion host:

{
  "ssh_tunnel": true,
  "ssh_host": "your-bastion-host.example.com"
}

No Records Found#

  • Verify the Job Execution IDs are correct

  • Check that the jobs actually had errors (not just warnings)

  • Ensure you’re connecting to the correct tenant’s database

Permission Denied#

The database user needs SELECT permissions on:

  • {tenant}_mod_source_record_manager.journal_records

  • {tenant}_mod_source_record_manager.incoming_records

Security Considerations#

Warning

This tool requires direct database access. Follow your organization’s security policies:

  • Store credentials securely (use environment variables or secrets management)

  • Use read-only database accounts when possible

  • Prefer SSH tunneling over direct database exposure

  • Rotate passwords regularly

  • Audit access to database credentials

See Also#