# Exporting Query Results
PuppyGraph supports exporting query results to cloud storage services.
## Supported storage types and file formats
### Storage Types
- Amazon S3
- Azure Data Lake Storage (Gen2)
- Google Cloud Storage
- MinIO
### File Formats
- Iceberg tables (stored as Parquet)
- Parquet files
- CSV files
## Syntax
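Export options are supplied as `with()` configuration steps on the traversal source, following the pattern of the Azure examples later on this page. A minimal sketch (the trailing traversal is arbitrary; any query's results are exported):

```groovy
g.with("exportTo", "<target_path>")   // destination URI or table identifier; required
 .with("fileType", "csv")             // optional; csv (default), parquet, or table
 .with("<key>", "<value>")            // further parameters from the table below
 .V().values("name")                  // any traversal; its results are exported
```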
### Parameters
| Key | Description | Default Value |
|---|---|---|
| `fileType` | Target file format (`csv`, `parquet`, or `table`) | `csv` |
| `storageType` | Cloud storage type (inferred from the export path if not provided) | - |
| `catalog` | Catalog name defined in the schema for storage configuration | - |
| `identifier` | Cloud storage access identifier | - |
| `credential` | Cloud storage access credential | - |
| `endpoint` | S3-compatible storage endpoint | - |
| `region` | S3/S3-compatible storage region | - |
| `useComputeEngineService` | Use the GCS Compute Engine service account | `false` |
| `serviceAccountEmail` | GCS service account email | - |
| `useManagedIdentity` | Use Azure VM managed identity | `false` |
| `tenantId` | Azure tenant ID | - |
| `clientId` | Azure client ID | - |
| `clientSecret` | Azure client secret | - |
## Export using schema catalog configuration
When the export destination matches the schema's catalog configuration, you can reuse the existing storage settings. Before exporting to this location, make sure you have write access to it.
### Export to cloud storage
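A minimal sketch reusing the catalog's storage settings (parameter names are taken from the table above; the traversal itself is arbitrary):

```groovy
g.with("exportTo", "<target_path>")    // must match the catalog's storage scheme
 .with("catalog", "<catalog name>")    // reuse the schema catalog's storage settings
 .V().values("name")
```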
- `<target_path>`: Storage path URI matching the catalog's scheme (e.g., `s3://my_bucket/subfolder`)
- `<catalog name>`: Name of the catalog defined in the graph schema
### Export as Iceberg table
It is also possible to store query results as an Iceberg table. This feature is currently experimental.
To store results as an Iceberg table:
- Configure the catalog type in the graph schema as Iceberg.
- Ensure the Iceberg catalog service aligns with the schema configuration.
- Confirm you have `CREATE TABLE` privileges on the target Iceberg database (schema).
- Check that you have write permissions for the designated storage path.
Query results will be stored as Parquet files in a new Iceberg table.
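A minimal sketch, assuming the table identifier is passed via `exportTo` just as paths are in the other export modes:

```groovy
g.with("exportTo", "<target_database>.<target_table>")  // assumed: table identifier via exportTo
 .with("catalog", "<catalog name>")                     // must be an Iceberg-type catalog
 .with("fileType", "table")                             // required for Iceberg table export
 .V().values("name")
```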
- `<target_database>.<target_table>`: The database and table to save to; the database name must be provided (e.g., `mydatabase.result_table`)
- `<catalog name>`: Name of the catalog defined in the graph schema. It must be of Iceberg type.
- `fileType` parameter: Must be set to `table`
## Export using a separate configuration
When the export destination storage does not match the schema's catalog configuration, you can provide separate storage settings.
### Amazon S3
Before exporting to Amazon S3, ensure that:
- The provided credentials have been granted write permissions to the target S3 bucket
- The IAM policy includes proper authorization for `s3:PutObject` actions
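A minimal sketch, assuming the access key pair maps onto the generic `identifier`/`credential` parameters from the table above (the traversal is arbitrary):

```groovy
g.with("exportTo", "s3://bucket/path/")           // or s3a://bucket/path/
 .with("region", "<region>")
 .with("identifier", "<aws_access_key_id>")       // assumed mapping: access key ID
 .with("credential", "<aws_secret_access_key>")   // assumed mapping: secret access key
 .V().values("name")
```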
- `<target_path>`: Destination directory path URI (format: `s3://bucket/path/` or `s3a://bucket/path/`). The path must use either the `s3` or `s3a` scheme and is always treated as a directory.
- `<region>`: AWS region where the target S3 bucket is located
- `<aws_access_key_id>`: AWS access key ID for S3 authentication
- `<aws_secret_access_key>`: Corresponding secret access key for AWS authentication
### MinIO
Before saving data to this location, make sure you have write access to it.
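A minimal sketch, assuming the MinIO username and password map onto the generic `identifier`/`credential` parameters from the table above:

```groovy
g.with("exportTo", "s3://bucket/path/")   // scheme must be s3
 .with("storageType", "minio")            // must be set explicitly for MinIO
 .with("endpoint", "<minio_endpoint>")
 .with("identifier", "<user>")            // assumed mapping: MinIO username
 .with("credential", "<password>")        // assumed mapping: MinIO password
 .V().values("name")
```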
- `<target_path>`: Destination URI (scheme must be `s3`) specifying the directory path for the exported data
- `<minio_endpoint>`: Endpoint URL of the MinIO service
- `<user>`: Authentication username for MinIO
- `<password>`: Authentication password for MinIO
- The `storageType` parameter must be explicitly set to `minio`
### Google Cloud Storage
Before saving results to GCS, ensure the provided credentials have been granted write permissions.
#### Using a Key JSON File
To authenticate with a JSON key file:
- Mount the JSON key file during container creation
- Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to the mounted file path within the container
See *Authentication for PuppyGraph to access Google Cloud resources* for more information on how to configure this.
After completing these steps, you can export results to GCS using the following queries:
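A minimal sketch (the bucket path is a placeholder; credentials are picked up from the mounted key file, so no extra parameters are needed):

```groovy
g.with("exportTo", "gs://bucket/path/")   // credentials come from GOOGLE_APPLICATION_CREDENTIALS
 .V().values("name")
```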
#### Using an attached Service Account on Compute Engine
Before using a Service Account on Compute Engine, ensure the following prerequisites are met:
- The VM instance running PuppyGraph is associated with a service account
- The instance's access scope includes appropriate permissions for storage operations
Reference: service accounts
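A minimal sketch using the VM's attached service account, based on the `useComputeEngineService` parameter from the table above (the traversal is arbitrary):

```groovy
g.with("exportTo", "gs://bucket/path/")
 .with("useComputeEngineService", true)   // authenticate via the VM's attached service account
 .V().values("name")
```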
#### Using service account identifier and credential pair
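A minimal sketch, assuming the email, key ID, and secret map onto the `serviceAccountEmail`, `identifier`, and `credential` parameters from the table above:

```groovy
g.with("exportTo", "gs://bucket/path/")
 .with("serviceAccountEmail", "<email>")   // assumed mapping: service account email
 .with("identifier", "<key_id>")           // assumed mapping: private key ID
 .with("credential", "<secret_key>")       // assumed mapping: private key
 .V().values("name")
```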
- `<target_path>`: Destination path URI for exported files (must use the `gs://` scheme for Google Cloud Storage). The path is treated as a directory.
- `<email>`: Service account email address associated with your Google Cloud project
- `<key_id>`: Unique identifier for the service account's private key
- `<secret_key>`: Service account's private key for authentication
### Azure Data Lake Storage Gen2
Before saving data to Azure, ensure the provided credentials have write permissions to the target storage location.
#### Using Managed Identity
g.with("exportTo", "<target_path>")
.with("useManagedIdentity", true)
.with("tenantId", "<tenant-id-of-identity>")
.with("clientId": "<client-id-of-identity>")
.V().as("s")
.in("created").as("p")
.group().by(select("s")).by(values('age').sum())
.unfold()
.project('name', 'totalAge')
.by(select(keys).values('name'))
.by(select(values))
#### Using Storage Account Access Key
g.with("exportTo", "<target_path>")
.with("identifier", "<storage-account-name>")
.with("credential": "<storage-account-access-key>")
.V().as("s")
.in("created").as("p")
.group().by(select("s")).by(values('age').sum())
.unfold()
.project('name', 'totalAge')
.by(select(keys).values('name'))
.by(select(values))
- `<target_path>`: Target directory path URI (scheme must be `abfs://` or `abfss://`). The path is always treated as a directory.
- `<storage-account-name>`: Name of your Azure Storage Account
- `<storage-account-access-key>`: Access key for the storage account
- `<tenant-id-of-identity>`: Tenant ID of the managed identity
- `<client-id-of-identity>`: Client ID (application ID) of the managed identity