# Data Import Log Retriever Guide

The Data Import Log Retriever (`get-di-logs`) is a diagnostic tool that retrieves error logs and associated MARC records from failed Data Import jobs by querying FOLIO's PostgreSQL database directly.

## Overview

When MARC records fail to import through FOLIO's Data Import system, the error information is stored in the database but can be difficult to access through the UI. This tool:

- Queries the PostgreSQL database for failed records
- Extracts the original MARC record that caused the error
- Generates a TSV error report with error messages
- Creates a MARC file containing all failed records for reprocessing
- Supports SSH tunneling for secure database access

```{note}
This tool requires direct PostgreSQL database access, which is typically only available to system administrators or in hosted environments with proper credentials.
```

## Prerequisites

### PostgreSQL Dependencies

The `get-di-logs` command requires the PostgreSQL optional dependencies:

```bash
# Using pip
pip install 'folio_data_import[postgres]'

# Using uv
uv add 'folio_data_import[postgres]'
```

### Required Access

1. **FOLIO API credentials** - Username/password with read access
2. **PostgreSQL database access** - Direct connection to the FOLIO database
3. **SSH access** (optional) - If the database is only accessible via a bastion/jump host

### Job Execution IDs

You need the Job Execution IDs from the Data Import jobs you want to analyze. These can be found:

- In FOLIO's Data Import logs UI
- In the `marc_import_job_ids.txt` file generated by the `marc` command
- Via the FOLIO API at `/change-manager/jobExecutions`

## Basic Usage

### Command Line

```bash
folio-data-import get-di-logs \
  --folio-url https://okapi.folio.example.com \
  --folio-tenant my_tenant \
  --folio-username admin \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt
```

### Using the Standalone Script

```bash
folio-di-logs \
  --folio-url https://okapi.folio.example.com \
  --folio-tenant my_tenant \
  --folio-username admin \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt
```

## Configuration Files

### Database Configuration (`db_config.json`)

The database configuration file specifies how to connect to the PostgreSQL database:

```json
{
  "host": "folio-db.cluster-abc123.us-east-1.rds.amazonaws.com",
  "port": 5432,
  "database": "folio",
  "user": "folio_readonly",
  "password": "your_password_here"
}
```

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `host` | Yes | - | PostgreSQL server hostname |
| `port` | No | `5432` | PostgreSQL port |
| `database` | Yes | - | Database name (usually `folio`) |
| `user` | Yes | - | Database username |
| `password` | No | - | Database password (can be omitted for passwordless auth) |

### SSH Tunnel Configuration (`ssh_config.json`)

If the database is only accessible through an SSH bastion host, provide an SSH configuration:

```json
{
  "ssh_tunnel": true,
  "ssh_host": "bastion.example.com",
  "ssh_user": "tunnel_user",
  "ssh_private_key_path": "~/.ssh/id_rsa",
  "use_ssh_config": true,
  "ssh_path": "/usr/bin/ssh"
}
```

| Field | Required | Default | Description |
|-------|----------|---------|-------------|
| `ssh_tunnel` | Yes | `false` | Enable SSH tunneling |
| `ssh_host` | If tunneling | - | SSH bastion/jump host |
| `ssh_user` | No | - | SSH username (uses SSH config if not set) |
| `ssh_private_key_path` | No | - | Path to SSH private key |
| `use_ssh_config` | No | `false` | Use `~/.ssh/config` for connection settings |
| `ssh_path` | No | `ssh` | Path to SSH binary |

#### AWS SSM Session Manager Example

For AWS environments using SSM Session Manager:

```json
{
  "ssh_tunnel": true,
  "ssh_host": "i-0abc123def456789",
  "use_ssh_config": true
}
```

With corresponding `~/.ssh/config`:

```
Host i-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
    User ec2-user
```

### Job IDs File

A plain text file with one Job Execution ID per line:

```text
a1b2c3d4-e5f6-7890-abcd-ef1234567890
b2c3d4e5-f6a7-8901-bcde-f12345678901
c3d4e5f6-a7b8-9012-cdef-123456789012
```

Lines starting with `#` or empty lines are ignored.

## Command-Line Parameters

### Connection Parameters

| Parameter | Environment Variable | Description |
|-----------|---------------------|-------------|
| `--folio-url` | `FOLIO_URL` | FOLIO Gateway/Okapi URL |
| `--folio-tenant` | `FOLIO_TENANT` | FOLIO tenant ID |
| `--folio-username` | `FOLIO_USERNAME` | FOLIO username |
| `--folio-password` | `FOLIO_PASSWORD` | FOLIO password |

### Database Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--db-config` | **Required** | Path to database configuration JSON file |
| `--ssh-config` | (none) | Path to SSH tunnel configuration JSON file |

### Input/Output Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--job-ids-file` | `marc_import_job_ids.txt` | Path to file containing Job Execution IDs |
| `--report-file-path` | `di_error_report.tsv` | Path for the error report TSV output |
| `--marc-file-path` | `di_error_records.mrc` | Path for the MARC records output |

### Other Options

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--no-progress` | `false` | Disable progress bar display |
| `--debug` | `false` | Enable debug logging |

## Output Files

### Error Report (TSV)

A tab-separated file with error details:

```
'Error Log'	'MARC Record'
'"Error processing record: Invalid indicator value"'	'00732nam a2200241...'
'"Duplicate record detected"'	'00845cam a2200289...'
```

The report can be opened in Excel or any spreadsheet application for analysis.

### MARC Records File

A binary MARC21 file (`.mrc`) containing all the records that failed to import. This file can be:

- Edited to fix issues using MarcEdit or similar tools
- Reprocessed through Data Import after corrections
- Analyzed with pymarc or other MARC processing tools

## Workflow Examples

### Basic Error Retrieval

```bash
# After running a MARC import that had errors
folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file marc_import_job_ids.txt \
  --report-file-path errors_2024.tsv \
  --marc-file-path failed_records_2024.mrc
```

### With SSH Tunnel

```bash
folio-data-import get-di-logs \
  --db-config db_config.json \
  --ssh-config ssh_config.json \
  --job-ids-file job_ids.txt
```

### Using Environment Variables

```bash
export FOLIO_URL="https://okapi.folio.example.com"
export FOLIO_TENANT="my_tenant"
export FOLIO_USERNAME="admin"
export FOLIO_PASSWORD="secret"

folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file job_ids.txt
```

### Complete Migration Error Analysis Workflow

```bash
# 1. Run the MARC import
folio-data-import marc \
  --marc-file-path migration_records.mrc \
  --import-profile-name "Default - Create Instance and SRS MARC Bib" \
  --job-ids-file-path migration_job_ids.txt

# 2. After import completes, retrieve any errors
folio-data-import get-di-logs \
  --db-config db_config.json \
  --job-ids-file migration_job_ids.txt \
  --report-file-path migration_errors.tsv \
  --marc-file-path migration_failed.mrc

# 3. Review errors, fix MARC records, and re-import
folio-data-import marc \
  --marc-file-path migration_failed_fixed.mrc \
  --import-profile-name "Default - Create Instance and SRS MARC Bib"
```

## Programmatic Usage

The `DILogRetriever` class can be used directly in Python scripts:

```python
import folioclient
from folio_data_import.DILogRetriever import DILogRetriever
from folio_data_import._postgres import PostgresConfig, SSHTunnelConfig
from folio_data_import._progress import NoOpProgressReporter

# Create FOLIO client
folio_client = folioclient.FolioClient(
    gateway_url="https://okapi.folio.example.com",
    tenant_id="my_tenant",
    username="admin",
    password="secret",
)

# Database configuration
db_config = PostgresConfig(
    host="folio-db.example.com",
    port=5432,
    database="folio",
    user="folio",
    password="db_password",
)

# SSH tunnel (optional)
ssh_config = SSHTunnelConfig(ssh_tunnel=False)

# Create retriever
retriever = DILogRetriever(
    folio_client=folio_client,
    db_config=db_config,
    ssh_tunnel_config=ssh_config,
    progress_reporter=NoOpProgressReporter(),
)

# Retrieve errors
job_ids = ["a1b2c3d4-e5f6-7890-abcd-ef1234567890"]
error_logs = retriever.retrieve_errors_with_marc(job_ids)

# Process results
for error_message, marc_record in error_logs:
    print(f"Error: {error_message}")
    print(f"Title: {marc_record.title}")

# Or generate report files
retriever.generate_error_report_and_marc_file(
    error_logs=error_logs,
    report_file_path="errors.tsv",
    marc_file_path="failed.mrc",
)
```

## Troubleshooting

### "PostgreSQL support requires the 'postgres' optional dependencies"

Install the PostgreSQL dependencies:

```bash
pip install 'folio_data_import[postgres]'
```

### Connection Refused / Timeout

- Verify the database host and port are correct
- Check firewall rules allow the connection
- If using SSH tunneling, verify the SSH configuration
- Ensure the database is accepting connections from your IP

### "ssh_host is required when ssh_tunnel is enabled"

Your `ssh_config.json` has `"ssh_tunnel": true` but is missing `ssh_host`. Add the bastion host:

```json
{
  "ssh_tunnel": true,
  "ssh_host": "your-bastion-host.example.com"
}
```

### No Records Found

- Verify the Job Execution IDs are correct
- Check that the jobs actually had errors (not just warnings)
- Ensure you're connecting to the correct tenant's database

### Permission Denied

The database user needs SELECT permissions on:

- `{tenant}_mod_source_record_manager.journal_records`
- `{tenant}_mod_source_record_manager.incoming_records`

## Security Considerations

```{warning}
This tool requires direct database access. Follow your organization's security policies:

- Store credentials securely (use environment variables or secrets management)
- Use read-only database accounts when possible
- Prefer SSH tunneling over direct database exposure
- Rotate passwords regularly
- Audit access to database credentials
```

## See Also

- {doc}`marc_data_import_guide` - MARC import documentation
- {doc}`troubleshooting` - General troubleshooting guide
