
Chroma

This page guides you through the process of setting up the Chroma destination connector.

Features

Feature | Supported? (Yes/No) | Notes
Full Refresh Sync | Yes |
Incremental - Append Sync | Yes |
Incremental - Append + Deduped | Yes |

Output Schema

Data from all source streams is collected into a single stream, stored in one Chroma collection. You define the collection name, and Airbyte validates and, if necessary, corrects it.

For each record, a UUID string is generated and used as the document id. The embeddings produced by the configured embedding method are stored as embeddings. Data in the text fields is stored as documents, and data in the metadata fields is stored as metadata.
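The mapping above can be sketched in Python. This is only an illustration, not the connector's actual code: `build_add_payload` and `toy_embed` are hypothetical names, joining text fields with a newline is an assumption, and the toy embedding stands in for whichever embedding method is configured.

```python
import uuid
from typing import Any, Callable, Dict, List


def build_add_payload(
    records: List[Dict[str, Any]],
    text_fields: List[str],
    metadata_fields: List[str],
    embed: Callable[[str], List[float]],  # hypothetical embedder: text -> vector
) -> Dict[str, list]:
    """Turn source records into the four parallel lists stored per document."""
    payload: Dict[str, list] = {
        "ids": [], "embeddings": [], "documents": [], "metadatas": []
    }
    for record in records:
        # Assumption: text fields are joined with newlines to form the document.
        document = "\n".join(str(record[f]) for f in text_fields if f in record)
        payload["ids"].append(str(uuid.uuid4()))  # one UUID string per record
        payload["embeddings"].append(embed(document))
        payload["documents"].append(document)
        payload["metadatas"].append(
            {f: record[f] for f in metadata_fields if f in record}
        )
    return payload


def toy_embed(text: str) -> List[float]:
    # Placeholder only, NOT a real embedding model: length and vowel count.
    return [float(len(text)), float(sum(text.count(v) for v in "aeiou"))]


payload = build_add_payload(
    [{"title": "Hello", "body": "World", "author": "ann"}],
    text_fields=["title", "body"],
    metadata_fields=["author"],
    embed=toy_embed,
)
```

With the `chromadb` Python client, a payload shaped like this could be passed to `collection.add(**payload)`, since `Collection.add` accepts `ids`, `embeddings`, `documents`, and `metadatas`.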

Getting Started (Airbyte Open Source)

You can connect to a Chroma instance either in client/server mode or in a local persistent mode. For the local persistent mode, the database file will be saved in the path defined in the path config parameter. Note that path must be an absolute path, prefixed with /local.
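For orientation, the two connection modes correspond to two client constructors in the `chromadb` Python package, `HttpClient` and `PersistentClient`. The sketch below is an assumption about how one might pick between them; `make_client` and the mode labels `"client_server"` and `"persistent"` are hypothetical and are not the connector's own names.

```python
def make_client(mode: str, **settings):
    """Return a Chroma client for the chosen connection mode (illustrative)."""
    # Imported lazily so this function can be defined without chromadb installed.
    import chromadb

    if mode == "client_server":
        # Talks to a running Chroma server over HTTP.
        return chromadb.HttpClient(host=settings["host"], port=settings["port"])
    if mode == "persistent":
        path = settings["path"]
        # The path must be absolute and prefixed with /local.
        if not (path == "/local" or path.startswith("/local/")):
            raise ValueError("path must be an absolute path prefixed with /local")
        # Stores the database on disk at the given path.
        return chromadb.PersistentClient(path=path)
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `make_client("client_server", host="localhost", port=8000)` would target a local server, while `make_client("persistent", path="/local/chroma")` would use an on-disk database.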

Danger: Persistent Client mode is not supported on Kubernetes.

By default, the LOCAL_ROOT env variable in the .env file is set to /tmp/airbyte_local.

Docker mounts the local mount onto LOCAL_ROOT, so the /local prefix of a configured path is replaced by /tmp/airbyte_local by default. For example, /local/chroma resolves to /tmp/airbyte_local/chroma on the host.
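The substitution can be expressed as a small helper for working out where the database file ends up on the host. This is a sketch under the defaults stated above; `host_path` is a hypothetical name and not part of Airbyte.

```python
import os
from pathlib import PurePosixPath
from typing import Optional


def host_path(connector_path: str, local_root: Optional[str] = None) -> str:
    """Map a /local/... path from the connector config to its host location."""
    # Fall back to the LOCAL_ROOT env variable, then to the documented default.
    root = local_root or os.environ.get("LOCAL_ROOT", "/tmp/airbyte_local")
    path = PurePosixPath(connector_path)
    if path.parts[:2] != ("/", "local"):
        raise ValueError("path must be absolute and prefixed with /local")
    # Replace the leading /local with LOCAL_ROOT and keep the remainder.
    return str(PurePosixPath(root, *path.parts[2:]))
```

Calling `host_path("/local/chroma")` with the default LOCAL_ROOT gives the host directory to inspect after a sync.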

Caution: Make sure that Docker Desktop has access to /tmp (and, on macOS, to /private, because /tmp is a symlink that points to /private). The connector will not work otherwise. To grant access, go to Settings -> Resources -> File sharing, add the one or two folders above, and click the "Apply & restart" button.

Requirements

To use the Chroma destination, you'll need:

  • An account with API access for OpenAI or Cohere, depending on which embedding method you want to use, or neither if you want to use the default Chroma embedding function
  • A Chroma db instance (client/server mode or persistent mode)
  • Credentials (for client/server mode)
  • Local File path (for Persistent mode)

Configure Network Access

Make sure your Chroma database can be accessed by Airbyte. If your database is within a VPC, you may need to allow access from the IP you're using to expose Airbyte.
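A quick way to confirm reachability is a plain TCP connection test, run from the machine or container that hosts Airbyte. The helper below is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical name, and the check only confirms that the port accepts connections, not that Chroma is serving on it or that credentials are valid.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_reach("localhost", 8000)` tests the example host and port used later on this page.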

Set up the Chroma Destination in Airbyte

You should now have all the requirements needed to configure Chroma as a destination in the UI. You'll need the following information to configure the Chroma destination:

  • (Required) Text fields to embed
  • (Optional) Text splitter: options for configuring the chunking process, which is provided by the LangChain Python library.
  • (Required) Fields to store as metadata
  • (Required) Collection: the name of the collection in Chroma db to store your data
  • (Required) Authentication method
    • For client/server mode
      • Host, for example localhost
      • Port, for example 8000
      • Username (Optional)
      • Password (Optional)
    • For persistent mode
      • Path: the path to the local database file. Note that the path must be an absolute path, prefixed with /local.
  • (Optional) Embedding
    • OpenAI API key if using OpenAI for embedding
    • Cohere API key if using Cohere for embedding
    • Embedding Field name and Embedding dimensions if getting the embeddings from stream records
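To make the required and optional settings above concrete, here is a rough pre-flight check written as plain Python. The key names (`text_fields`, `metadata_fields`, `collection`, `auth`, `mode`) and the function `validate_config` are hypothetical and chosen for illustration only; they are not the connector's actual spec.

```python
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a (hypothetical) destination config."""
    problems: List[str] = []
    # Settings marked (Required) in the list above.
    for key in ("text_fields", "metadata_fields", "collection", "auth"):
        if not config.get(key):
            problems.append(f"missing required setting: {key}")

    auth = config.get("auth") or {}
    mode = auth.get("mode")
    if mode == "client_server":
        # Username and password are optional; host and port are not.
        for key in ("host", "port"):
            if not auth.get(key):
                problems.append(f"client/server mode needs: {key}")
    elif mode == "persistent":
        path = auth.get("path", "")
        if not (path == "/local" or path.startswith("/local/")):
            problems.append(
                "persistent mode path must be absolute and prefixed with /local"
            )
    elif auth:
        problems.append(f"unknown authentication mode: {mode!r}")
    return problems


example = {
    "text_fields": ["title", "body"],
    "metadata_fields": ["author"],
    "collection": "articles",
    "auth": {"mode": "client_server", "host": "localhost", "port": 8000},
}
```

An empty list from `validate_config(example)` means the sketch found nothing missing. The Airbyte UI performs its own validation, so this is only a way to reason about which settings belong together.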

Changelog

Version | Date | Pull Request | Subject
0.0.13 | 2024-06-26 | #40215 | Replaced deprecated AirbyteLogger with logging.Logger
0.0.12 | 2024-06-23 | #40222 | Update dependencies
0.0.11 | 2024-06-22 | #40068 | Update dependencies
0.0.10 | 2024-04-15 | #37333 | Updated CDK & pytest version to fix security vulnerabilities
0.0.9 | 2023-12-11 | #33303 | Fix bug with embedding special tokens
0.0.8 | 2023-12-01 | #32697 | Allow omitting raw text
0.0.7 | 2023-11-16 | #32608 | Support deleting records for CDC sources
0.0.6 | 2023-11-13 | #32357 | Improve spec schema
0.0.5 | 2023-10-23 | #31563 | Add field mapping option
0.0.4 | 2023-10-15 | #31329 | Add OpenAI-compatible embedder option
0.0.3 | 2023-10-04 | #31075 | Fix OpenAI embedder batch size
0.0.2 | 2023-09-29 | #30820 | Update CDK
0.0.1 | 2023-09-08 | #30023 | 🎉 New Destination: Chroma (Vector Database)