Files (CSV, JSON, Excel, Feather, Parquet)

This page contains the setup guide and reference information for the Files source connector.

Prerequisites

  • A file hosted on AWS S3, GCS, HTTPS, or an SFTP server

Setup guide

For Airbyte Cloud users: Please note that locally stored files cannot be used as a source in Airbyte Cloud.

Step 1: Set up the connector in Airbyte

  1. From the Airbyte UI, click the Sources tab, then click + New source and select Files (CSV, JSON, Excel, Feather, Parquet) from the list of available sources.
  2. Enter a Source name of your choosing.
  3. For Dataset Name, enter the name of the final table to replicate this file into (should include letters, numbers, dashes and underscores only).
  4. For File Format, select the format of the file to replicate from the dropdown menu (Warning: some formats may be experimental. Please refer to the table of supported formats).
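The Dataset Name rule in step 3 can be checked before you paste a value into the form. The sketch below is only an illustration: it assumes "letters, numbers, dashes and underscores" means ASCII characters, and the helper names `is_valid_dataset_name` and `suggest_dataset_name` are hypothetical, not part of Airbyte.

```python
import re

# Assumption: allowed characters are ASCII letters, digits, "-" and "_".
_VALID_DATASET_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the whole name uses only the allowed characters."""
    return _VALID_DATASET_NAME.fullmatch(name) is not None


def suggest_dataset_name(raw: str) -> str:
    """Replace each run of disallowed characters with a single underscore."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", raw).strip("_")


print(is_valid_dataset_name("hr_and_financials"))
print(suggest_dataset_name("events/2019-09-14.export.csv"))
```

Deriving the name from the file path this way keeps the destination table name predictable even when the URL contains slashes or dots.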

Step 2: Select the provider and set provider-specific configurations:

  1. For Storage Provider, use the dropdown menu to select the Storage Provider or Location of the file(s) which should be replicated, then configure the provider-specific fields as needed:

HTTPS: Public Web [Default]

  • User-Agent (Optional)

Set this to active if you want to add the User-Agent header to requests (inactive by default).

GCS: Google Cloud Storage

  • Service Account JSON (Required for private buckets)

To access private buckets stored on Google Cloud, this connector requires a service account JSON credentials file with the appropriate permissions. A detailed breakdown of this topic can be found at the Google Cloud service accounts page. Please generate the "credentials.json" file and copy its content to this field, ensuring it is in JSON format. If you are accessing publicly available data, this field is not required.

S3: Amazon Web Services

  • AWS Access Key ID (Required for private buckets)
  • AWS Secret Access Key (Required for private buckets)

To access private buckets stored on AWS S3, this connector requires valid credentials with the necessary permissions. To access these keys, refer to the AWS IAM documentation. More information on setting permissions in AWS can be found here. If you are accessing publicly available data, these fields are not required.

AzBlob: Azure Blob Storage

  • Storage Account (Required)

This is the globally unique name of the storage account that the desired blob sits within. See the Azure documentation for more details.

If you are accessing private storage, you must also provide one of the following security credentials with the necessary permissions:

SSH: Secure Shell / SCP: Secure Copy Protocol / SFTP: Secure File Transfer Protocol

  • Host (Required)

Enter the hostname or IP address of the remote server where the file transfer will take place.

  • User (Required)

Enter the username associated with your account on the remote server.

  • Password (Optional)

If required by the remote server, enter the password associated with your user account. Otherwise, leave this field blank.

  • Port (Optional)

Specify the port number to use for the connection. The default port is usually 22. However, if your remote server uses a non-standard port, you can enter the appropriate port number here.

Local Filesystem (Airbyte Open Source only)

  • Storage

Caution: Currently, the local storage URL for reading must start with the local mount "/local/".

Please note that if you are replicating data from a locally stored file on Windows OS, you will need to open the .env file in your local Airbyte root folder and change the values for:

  • LOCAL_ROOT
  • LOCAL_DOCKER_MOUNT
  • HACK_LOCAL_ROOT_PARENT

Please set these to an existing absolute path on your machine. Colons in the path need to be replaced with a double forward slash, //. LOCAL_ROOT & LOCAL_DOCKER_MOUNT should be set to the same value, and HACK_LOCAL_ROOT_PARENT should be set to their parent directory.

Step 3: Complete the connector setup

  1. For URL, enter the URL path of the file to be replicated.

Note: When connecting to a file located in Google Drive, use the Download URL format: https://drive.google.com/uc?export=download&id=[DRIVE_FILE_ID]. Replace [DRIVE_FILE_ID] with the unique string found in the file's Share URL, which has the form https://drive.google.com/file/d/[DRIVE_FILE_ID]/view?usp=sharing.

When connecting to a file in Azure Blob Storage, the connector already accounts for the base URL, so you only need to enter the path to your specific file (e.g., container/file.csv).
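The Google Drive conversion described above is a simple string rewrite. Here is a minimal sketch, assuming the Share URL has exactly the form shown in the note. The function name `drive_download_url` and the file ID `abc123` are hypothetical placeholders.

```python
import re

# Matches the Share URL form given in the note and captures the file ID.
_SHARE_URL = re.compile(r"https://drive\.google\.com/file/d/([^/?#]+)")


def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive Share URL into the Download URL format."""
    match = _SHARE_URL.match(share_url)
    if match is None:
        raise ValueError(f"Not a recognized Google Drive share URL: {share_url}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


# "abc123" is a made-up ID used only for illustration.
print(drive_download_url("https://drive.google.com/file/d/abc123/view?usp=sharing"))
```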

  2. For Reader Options (Optional), you may choose to enter a string in JSON format. Depending on the file format of your source, this will provide additional options and tune the Reader's behavior. Please refer to the next section for a breakdown of the possible inputs. This field may be left blank if you do not wish to configure custom Reader options.
  3. Click Set up source and wait for the tests to complete.
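Because Reader Options is entered as a JSON string, it can help to check the string locally before saving the source. The following is a rough sketch of such a check, not the connector's own validation code. The function name `parse_reader_options` is hypothetical.

```python
import json


def parse_reader_options(raw: str) -> dict:
    """Parse a Reader Options string; a blank field means no custom options."""
    if not raw.strip():
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reader_options is not valid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        # A JSON list or scalar parses fine but cannot hold named options.
        raise ValueError("reader_options must be a JSON object such as {\"sep\": \";\"}")
    return parsed


print(parse_reader_options('{"sep": ";"}'))
```

Note that JSON uses `null`, `true`, and `false`, not Python's `None`, `True`, and `False`.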

Reader Options

The Reader that loads the file is currently based on pandas IO Tools. You can customize how the file is loaded into a pandas DataFrame through reader_options, which must be in JSON format. The valid options depend on the chosen file format, so see the pandas documentation for the reader that matches your format.

For example, if the CSV format is selected, then options from the read_csv function are available.

  • You can customize the delimiter (or sep), for example to "\t" for tab-separated files.
  • An existing header line can be skipped with header=0 and replaced with custom column names via names.
  • If a file has no header, you must set header=null; otherwise, the first record will be missing.
  • Dates can be parsed in specified columns with parse_dates.
  • etc.

For a tab-separated file without a header row, you would therefore provide the following JSON in reader_options:

{ "sep" : "\t", "header" : null, "names": ["column1", "column2"], "parse_dates": ["column2"]}
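To see what these options do, you can try them against pandas directly. This is a minimal sketch that assumes the options are passed to `pandas.read_csv` as keyword arguments. The two-row sample data is made up for illustration.

```python
import io
import json

import pandas as pd

# The reader_options string from above; the raw string keeps "\t" as a JSON escape.
reader_options = r'{"sep": "\t", "header": null, "names": ["column1", "column2"], "parse_dates": ["column2"]}'
kwargs = json.loads(reader_options)  # JSON null becomes Python None

# Made-up tab-separated sample with no header row.
sample = "alpha\t2024-01-01\nbeta\t2024-01-02\n"

df = pd.read_csv(io.StringIO(sample), **kwargs)
print(df.dtypes)
print(df)
```

With `header` set to null, both rows are kept as data and the column names come from `names`.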

If you select the JSON format, then options from the read_json reader are available.

For example, you can use {"orient" : "records"} to tell the reader how the data is oriented (here, data of the form [{column -> value}, … , {column -> value}]).
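As an illustration of the records orientation, the sketch below loads a small made-up payload with `pandas.read_json`. It assumes the option is forwarded to pandas unchanged.

```python
import io

import pandas as pd

# Made-up payload in "records" orientation: a list of {column -> value} objects.
payload = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'

df = pd.read_json(io.StringIO(payload), orient="records")
print(df)
```

Each object in the list becomes one row, and the keys become the column names.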

If you need to read an Excel Binary Workbook, select the excel_binary format in the File Format dropdown.

Caution: This connector does not support syncing unstructured data files such as raw text, audio, or videos.

Supported sync modes

| Feature | Supported? |
|---|---|
| Full Refresh Sync | Yes |
| Incremental Sync | No |
| Replicate Incremental Deletes | No |
| Replicate Folders (multiple Files) | No |
| Replicate Glob Patterns (multiple Files) | No |

Note: This source produces a single table, because it currently replicates only one file at a time. You must provide the dataset_name, which determines how the table is identified in the destination (the URL cannot be used for this, since it may contain complex characters).

File / Stream Compression

| Compression | Supported? |
|---|---|
| Gzip | Yes |
| Zip | Yes |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |

Storage Providers

| Storage Providers | Supported? |
|---|---|
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| Local filesystem | Local use only (inaccessible for Airbyte Cloud) |

File Formats

| Format | Supported? |
|---|---|
| CSV | Yes |
| JSON/JSONL | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Fixed Width File | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
| YAML | Yes |

Changing data types of source columns

Normally, Airbyte tries to infer the data type from the source, but you can use reader_options to force specific data types. If you input {"dtype":"string"}, all columns will be forced to be parsed as strings. If you only want a specific column to be parsed as a string, simply use {"dtype" : {"column name": "string"}}.
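A common reason to force a string type is to keep leading zeros that type inference would drop. The sketch below shows the effect with pandas alone, assuming the dtype option is passed through to `pandas.read_csv`. The column names and values are made up.

```python
import io
import json

import pandas as pd

# Made-up sample: zip_code looks numeric, so inference turns it into integers.
sample = "order_id,zip_code\n1,02134\n2,10001\n"

inferred = pd.read_csv(io.StringIO(sample))
print(inferred["zip_code"].tolist())  # [2134, 10001]

# Same file, with one column forced to string via reader_options.
reader_options = '{"dtype": {"zip_code": "string"}}'
forced = pd.read_csv(io.StringIO(sample), **json.loads(reader_options))
print(forced["zip_code"].tolist())  # ['02134', '10001']
```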

Examples

Here is a list of examples of possible file inputs:

| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
|---|---|---|---|---|---|
| epidemiology | HTTPS | https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv | | | COVID-19 Public dataset on BigQuery |
| hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | Data from a private bucket; a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |

Examples with reader options:

| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
|---|---|---|---|---|---|
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data from AWS Open Data, separated by tabs and without a header row |
| server_logs | local | /local/logs.log | | {"sep": ";"} | Requires a local text file at /tmp/airbyte_local/logs.log containing server logs delimited by ';' |

Example for SFTP:

| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
|---|---|---|---|---|---|---|---|
| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the python engine for read_csv in order to handle a delimiter of more than 1 character while providing our own column names. |

For further usage examples, please see (or add to) airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py.

Performance Considerations and Notes

To read large files from a remote location, this connector uses the smart_open library. You can switch to the GCSFS or S3FS implementations instead, since pandas supports them natively, by setting the optional reader_impl parameter.

  • For the local filesystem, the file probably has to be stored somewhere in the /tmp/airbyte_local folder, with the same limitations as the CSV Destination, so the URL should also start with /local/.
  • Please make sure that Docker Desktop has access to /tmp (and, on macOS, to /private, because /tmp is a symlink that points to /private). It will not work otherwise. To allow access, go to Settings -> Resources -> File sharing, add the folder or folders above, and click the "Apply & restart" button.
  • The JSON implementation needs to be tweaked in order to produce more complex catalogs and is still in an experimental state. Simple JSON schemas should work at this point, but files with multiple layers of nesting may not be handled well.
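The relationship between the host folder and the URL entered in the connector can be sketched as a simple path rewrite. This assumes the default roots mentioned above (/tmp/airbyte_local on the host, /local/ in the URL). If you changed LOCAL_ROOT, the host root will differ. The function name `to_connector_url` is hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: default host folder and mount prefix as described in this page.
HOST_ROOT = PurePosixPath("/tmp/airbyte_local")
MOUNT_ROOT = PurePosixPath("/local")


def to_connector_url(host_path: str) -> str:
    """Map a file under the host folder to the URL to enter in the connector."""
    # relative_to raises ValueError when the file is outside HOST_ROOT.
    relative = PurePosixPath(host_path).relative_to(HOST_ROOT)
    return str(MOUNT_ROOT / relative)


print(to_connector_url("/tmp/airbyte_local/logs.log"))
```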

Reference

Config fields reference

| Property name | Type |
|---|---|
| dataset_name | string |
| format | string |
| url | string |
| provider | object |
| reader_options | string |
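Put together, a configuration using these fields might look like the sketch below, which reuses the GDELT example from this page. The contents of the provider object (the "storage" key) and the lowercase format value are assumptions made for illustration, so check the connector spec for the exact values. Note that reader_options is a string holding JSON, not a nested object.

```python
import json

config = {
    "dataset_name": "GDELT",
    "format": "csv",  # assumption: lowercase format identifier
    "url": "s3://gdelt-open-data/events/20190914.export.csv",
    "provider": {"storage": "S3"},  # assumption: shape of the provider object
    # reader_options has type string, so the options are JSON-encoded first.
    "reader_options": json.dumps({"sep": "\t", "header": None}),
}

print(json.dumps(config, indent=2))
```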

Changelog

| Version | Date | Pull Request | Subject |
|---|---|---|---|
| 0.5.3 | 2024-06-26 | 40215 | Replaced deprecated AirbyteLogger with logging.Logger |
| 0.5.2 | 2024-06-06 | 39192 | [autopull] Upgrade base image to v1.2.2 |
| 0.5.1 | 2024-05-03 | 37799 | Add fastparquet engine for parquet file reader. |
| 0.5.0 | 2024-03-19 | 36267 | Pin airbyte-cdk version to ^0 |
| 0.4.1 | 2024-03-04 | 35800 | Add PyAirbyte support on Python 3.11 |
| 0.4.0 | 2024-02-15 | 32354 | Add Zip File Support |
| 0.3.17 | 2024-02-13 | 34678 | Add Fixed-Width File Support |
| 0.3.16 | 2024-02-12 | 35186 | Manage dependencies with Poetry |
| 0.3.15 | 2023-10-19 | 31599 | Upgrade to airbyte/python-connector-base:1.0.1 |
| 0.3.14 | 2023-10-13 | 30984 | Prevent local file usage on cloud |
| 0.3.13 | 2023-10-12 | 31341 | Build from airbyte/python-connector-base:1.0.0 |
| 0.3.12 | 2023-09-19 | 30579 | Add ParserError handling for discovery |
| 0.3.11 | 2023-06-08 | 27157 | Force smart open log level to ERROR |
| 0.3.10 | 2023-06-07 | 27107 | Make source-file testable in our new airbyte-ci pipelines |
| 0.3.9 | 2023-05-18 | 26275 | Add ParserError handling |
| 0.3.8 | 2023-05-17 | 26210 | Bugfix for https://github.com/airbytehq/airbyte/pull/26115 |
| 0.3.7 | 2023-05-16 | 26131 | Re-release source-file to be in sync with source-file-secure |
| 0.3.6 | 2023-05-16 | 26115 | Add retry on SSHException('Error reading SSH protocol banner') |
| 0.3.5 | 2023-05-16 | 26117 | Check if reader options is a valid JSON object |
| 0.3.4 | 2023-05-10 | 25965 | fix Pandas date-time parsing to airbyte type |
| 0.3.3 | 2023-05-04 | 25819 | GCP service_account_json is a secret |
| 0.3.2 | 2023-05-01 | 25641 | Handle network errors |
| 0.3.1 | 2023-04-27 | 25575 | Fix OOM; read Excel files in chunks using openpyxl |
| 0.3.0 | 2023-04-24 | 25445 | Add datatime format parsing support for csv files |
| 0.2.38 | 2023-04-12 | 23759 | Fix column data types for numerical values |
| 0.2.37 | 2023-04-06 | 24525 | Fix examples in spec |
| 0.2.36 | 2023-03-27 | 24588 | Remove traceback from user messages. |
| 0.2.35 | 2023-03-03 | 24278 | Read only file header when checking connectivity; read only a single chunk when discovering the schema. |
| 0.2.34 | 2023-03-03 | 23723 | Update description in spec, make user-friendly error messages and docs. |
| 0.2.33 | 2023-01-04 | 21012 | Fix special characters bug |
| 0.2.32 | 2022-12-21 | 20740 | Source File: increase SSH timeout to 60s |
| 0.2.31 | 2022-11-17 | 19567 | Source File: bump 0.2.31 |
| 0.2.30 | 2022-11-10 | 19222 | Use AirbyteConnectionStatus for "check" command |
| 0.2.29 | 2022-11-08 | 18587 | Fix pandas read_csv header none issue. |
| 0.2.28 | 2022-10-27 | 18428 | Add retry logic for Connection reset error - 104 |
| 0.2.27 | 2022-10-26 | 18481 | Fix check for wrong format |
| 0.2.26 | 2022-10-18 | 18116 | Transform Dropbox shared link |
| 0.2.25 | 2022-10-14 | 17994 | Handle UnicodeDecodeError during discover step. |
| 0.2.24 | 2022-10-03 | 17504 | Validate data for HTTPS while check_connection |
| 0.2.23 | 2022-09-28 | 17304 | Migrate to per-stream state. |
| 0.2.22 | 2022-09-15 | 16772 | Fix schema generation for JSON files containing arrays |
| 0.2.21 | 2022-08-26 | 15568 | Specify pyxlsb library for Excel Binary Workbook files |
| 0.2.20 | 2022-08-23 | 15870 | Fix CSV schema discovery |
| 0.2.19 | 2022-08-19 | 15768 | Convert 'nan' to 'null' |
| 0.2.18 | 2022-08-16 | 15698 | Cache binary stream to file for discover |
| 0.2.17 | 2022-08-11 | 15501 | Cache binary stream to file |
| 0.2.16 | 2022-08-10 | 15293 | Add support for encoding reader option |
| 0.2.15 | 2022-08-05 | 15269 | Bump smart-open version to 6.0.0 |
| 0.2.12 | 2022-07-12 | 14535 | Fix invalid schema generation for JSON files |
| 0.2.11 | 2022-07-12 | 9974 | Add support to YAML format |
| 0.2.9 | 2022-02-01 | 9974 | Update airbyte-cdk 0.1.47 |
| 0.2.8 | 2021-12-06 | 8524 | Update connector fields title/description |
| 0.2.7 | 2021-10-28 | 7387 | Migrate source to CDK structure, add SAT testing. |
| 0.2.6 | 2021-08-26 | 5613 | Add support to xlsb format |
| 0.2.5 | 2021-07-26 | 4953 | Allow non-default port for SFTP type |
| 0.2.4 | 2021-06-09 | 3973 | Add AIRBYTE_ENTRYPOINT for Kubernetes support |
| 0.2.3 | 2021-06-01 | 3771 | Add Azure Storage Blob Files option |
| 0.2.2 | 2021-04-16 | 2883 | Fix CSV discovery memory consumption |
| 0.2.1 | 2021-04-03 | 2726 | Fix base connector versioning |
| 0.2.0 | 2021-03-09 | 2238 | Protocol allows future/unknown properties |
| 0.1.10 | 2021-02-18 | 2118 | Support JSONL format |
| 0.1.9 | 2021-02-02 | 1768 | Add test cases for all formats |
| 0.1.8 | 2021-01-27 | 1738 | Adopt connector best practices |
| 0.1.7 | 2020-12-16 | 1331 | Refactor Python base connector |
| 0.1.6 | 2020-12-08 | 1249 | Handle NaN values |
| 0.1.5 | 2020-11-30 | 1046 | Add connectors using an index YAML file |