Data Sources API - Importing Data and Managing your Data Sources¶
The Data Sources API enables you to create and manage your Data Sources, as well as import data into them.
To use the Data Sources API, you must use an Auth token with the right permissions, depending on whether you want to CREATE, APPEND, READ, or DROP (or a combination of those).
Importing data into Tinybird Analytics¶
Tinybird Analytics is specifically designed to ingest, process and analyze data in CSV, NDJSON, and Parquet format. CSV files must have one line for each row of data and comma-separated fields, with the column headers in the first row. See the API reference to learn how to ingest NDJSON and Parquet files.
You can import your data into Tinybird Analytics by creating a new Data Source. Tinybird Analytics will automatically detect and optimize your column types so you don’t have to worry about anything and can start analyzing your data right away.
curl \
-H "Authorization: Bearer <import_token>" \
-X POST "https://api.tinybird.co/v0/datasources?url=https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv"
Advanced users can explicitly analyze their data before importing.
If you are looking for information on how to create a new Pipe, take a look at the Pipes API reference.
- POST /v0/datasources/?¶
There are three mode operations to import data into a Data Source: create, append, and replace:
create: Creates the Data Source with its schema.
append: Loads data into a Data Source that already exists. If the Data Source has dependent Materialized Views, data will be appended in cascade.
replace: Selectively replaces data in an existing Data Source, or completely replaces its content. If the Data Source has dependent Materialized Views, these will be replaced in cascade.
When creating a Data Source, it’s necessary to define the schema of the Data Source. It can be done in two different ways:
Automatically: When you have a CSV file or a URL pointing to a CSV file, you can create a Data Source from the file itself; Tinybird will take care of guessing the schema. Setting type_guessing=false disables guessing, and all the columns will be created as String (see the sketch after this list).
Manually: When you already know your schema, or you want to change or optimize a schema.
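A minimal sketch of the automatic path with guessing disabled, reusing the example file from the introduction and the type_guessing parameter documented in the Request parameters table below:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?type_guessing=false" \
--data-urlencode "url=https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2018-12.csv"
Every column of the resulting Data Source will be created as String, which you can then explore and cast with SQL.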
You can guess the best schema for a CSV, NDJSON, or Parquet file (either local or remote) with the Analyze API or using the UI.
In the first case, you are actually doing two operations in one: first, a create operation using the guessed schema, and then an append operation that ingests the data.
Creating a Data Source from a schema
To create an empty Data Source, you must pass a schema with your desired column names and types, and leave the url parameter empty.
When you want more granular control over the Data Source schema, you can specify it directly when creating the Data Source. These are some cases where setting the schema explicitly is the best option:
You are already familiar with your data and know the types in advance.
For optimization purposes. For example, you know a column is a 32-bit integer instead of a 64-bit one.
Making some columns optional by defining them as Nullable.
When Tinybird’s guessing fails and incorrectly identifies a column data type. We do our best, but sometimes we get it wrong!
Creating a CSV Data Source from a schema¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "name=stocks" \
-d "schema=symbol String, date Date, close Float32"
Quarantine rows
Imports do not stop when Tinybird finds rows that do not match the Data Source schema; instead, those rows are stored in a “quarantine” Data Source. This quarantine Data Source is automatically created along with each Data Source and contains the same columns as the original, but with Nullable(String) as the data type. Those records can be processed later to recover them. Learn more about quarantine Data Sources.
Importing data from a CSV
You import data by specifying the URL of a CSV file or by sending the CSV data in the request body.
Tinybird guesses the Data Source schema (the columns and their data types) and the partition and sorting keys (how the data is stored) from the CSV contents. Once the Data Source has been created, it is recommended not to include the CSV header anymore, for performance reasons. Check out our guide for tuning CSVs for fast ingestion. However, if the header is included and contains all the names present in the Data Source schema, the ingestion will work even if the columns follow a different order than the one used when the Data Source was created.
Using a URL
When using a URL for the CSV file, the URL must be remotely accessible. If the server supports HTTP Range headers, the import process will be parallelized. The URL must be encoded.
Creating a Data Source from a remote CSV file¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
--data-urlencode "url=http://example.com/file.csv"
When importing a CSV via a URL, the response will not be the final result of the import but a Job. You can check the job status and progress using the Jobs API, as sketched below. In the response, id, job_id, and import_id should have the same value.
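A minimal sketch of polling that Job (the job_url format is the one shown in the Jobs responses later in this page; <job_id> is the id returned by the import request):
curl \
-H "Authorization: Bearer <your Auth token>" \
-X GET "https://api.tinybird.co/v0/jobs/<job_id>"
The response includes the job status (waiting, working, done or error), so you can poll until it reaches done.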
Using a local file
You can also use this endpoint to import data to a Data Source from a local file.
Creating a Data Source from local CSV files¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-F csv=@local_file.csv
Successful response¶{ "id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [], "type": "csv" }, "error": false }
In this case, you will receive the final import result in the response, so you won’t need to check a job status. You can also track the progress by using the progress=true parameter.
Since the default mode is create, if you want to append or replace data in a Data Source that already exists, you must specify both the mode and the name:
Appending data to a Data Source from a local CSV file¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?mode=append&name=<data_source_name>" \
-F csv=@local_file.csv
Replacing a Data Source from a URL¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d mode='replace' \
-d name='data_source_name' \
--data-urlencode "url=http://example.com/file.csv"
There’s a debug option, which is false by default. You can specify which debug information you want to receive: blocks, block_log, and/or hook_log.
Using debug parameter¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources?debug=blocks,block_log" \
-F csv=@local_file.csv
Response example:
Successful response¶{ "import_id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [] }, "error": false, "blocks": [{ "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "process_return": [{ "lines": 100, "parser": "python", "quarantine": 0, "time": 0.0274507999420166, "invalid_lines": 0, "empty_lines": 1, "bytes": 1234, }], "processing_time": 0.13909363746643066, "processing_error": null, "processing_error_type": null }], "block_log": [{ "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "status": "processing", "timestamp": 1594742501.624614 }, { "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "status": "guessing", "timestamp": 1594742501.624639 }, { "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "status": "inserting_chunk:0", "timestamp": 1594742501.747998 }, { "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "status": "done_inserting_chunk:0", "timestamp": 1594742501.753025 }, { "block_id": "0a261696-c531-4f33-bd95-b871a73eaa04", "status": "done", "timestamp": 1594742501.753836 }] }
Importing data from NDJSON or Parquet
You import data by specifying the URL of an NDJSON or Parquet file, or by sending the NDJSON/Parquet data in the request body, but before that you have to create the Data Source manually.
The process consists of:
(Optionally) Use the Analyze API to get the suggested schema and JSONPaths for creating the Data Source.
Create the Data Source with mode=create, using the schema and JSONPaths from the previous step or written by you manually.
Ingest data, passing the file with mode=append (or mode=replace).
Creating a Data Source using NDJSON/Parquet Data
To create a Data Source using NDJSON/Parquet data, you must pass a schema with your desired column names, types and JSONPaths, and indicate format=ndjson or format=parquet as appropriate. Let’s see an example:
Creating a Data Source using NDJSON¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "format=ndjson" \
-d "name=events" \
-d "mode=create" \
--data-urlencode "schema=date DateTime \`json:$.date\`, event String \`json:$.event\`"
Note that we are escaping the backticks with a backslash only because this is a curl command; otherwise it’s not needed.
JSONPaths
When creating a Data Source using NDJSON/Parquet data, you have to provide a JSONPath for each column in the schema, using the JSONPath syntax.
Let’s see a complete example. Given this NDJSON object:
{ "field": "test", "nested": { "nested_field": "bla" }, "an_array": [1, 2, 3], "a_nested_array": { "nested_array": [1, 2, 3] } }
The schema would be something like this:
schema with jsonpath¶
a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`
Our JSONPath syntax support has some limitations: it supports nested objects at multiple levels, but it only supports nested arrays at the first level, as in the example above. To ingest and transform more complex JSON objects, use the root object JSONPath syntax as described in the next section.
JSONPaths and the root object
Defining a column as “column_name String `json:$`” in the Data Source schema will ingest each line of the NDJSON file as a String in the column_name column.
This is very handy in these scenarios:
When you have nested arrays, such as polygons:
Nested arrays¶{ "id": 49518, "polygon": [ [ [30.471785843000134,-1.066836591999916], [30.463855835000118,-1.075127054999925], [30.456156047000093,-1.086082457999908], [30.453003785000135,-1.097347919999962], [30.456311076000134,-1.108096617999891], [30.471785843000134,-1.066836591999916] ] ] }
You can parse the id and then store the whole JSON string in the root column, so you can later extract the polygon with JSON functions.
schema definition¶
id String `json:$.id`, root String `json:$`
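As a hedged sketch of that extraction, assuming a hypothetical Data Source named polygons with the schema above and using the Query API together with ClickHouse’s JSONExtractRaw function:
curl \
-H "Authorization: Bearer <your Auth token>" \
-G "https://api.tinybird.co/v0/sql" \
--data-urlencode "q=SELECT id, JSONExtractRaw(root, 'polygon') AS polygon FROM polygons LIMIT 10"
The same kind of query can be used in a Pipe instead of being sent to the Query API directly.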
When you have complex objects:
Complex JSON objects¶{ "elem": { "payments": [ { "users": [ { "user_id": "Admin_XXXXXXXXX", "value": 4 } ] } ] } }
The same applies to more complex objects; in this case, we just push the whole JSON string into the root column:
schema definition¶root String `json:$`
Schemaless events:
Schemaless events¶{ "user_id": "1", "data": { "whatever": "bla", "whatever2": "bla" } } { "user_id": "1", "data": [1, 2, 3] }
schema definition¶root String `json:$`
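A minimal sketch of creating such a schemaless Data Source (the schemaless_events name is just illustrative; the pattern is the same as the NDJSON creation example above):
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "format=ndjson" \
-d "name=schemaless_events" \
-d "mode=create" \
--data-urlencode "schema=root String \`json:$\`"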
Using the Analyze API
You don’t need to remember the exact JSONPath notation when creating an NDJSON/Parquet Data Source; just rely on the Analyze API and it’ll guess a valid schema based on a sample NDJSON/Parquet file:
Analyze an NDJSON file to get a valid schema¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/analyze" \
-F "ndjson=@local_file_path"
Successful analyze response¶{ "analysis": { "columns": [ { "path": "$.a_nested_array.nested_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "a_nested_array_nested_array" }, { "path": "$.an_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "an_array" }, { "path": "$.field", "recommended_type": "String", "present_pct": 1, "name": "field" }, { "path": "$.nested.nested_field", "recommended_type": "String", "present_pct": 1, "name": "nested_nested_field" } ], "schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`" }, "preview": { "meta": [ { "name": "a_nested_array_nested_array", "type": "Array(Int16)" }, { "name": "an_array", "type": "Array(Int16)" }, { "name": "field", "type": "String" }, { "name": "nested_nested_field", "type": "String" } ], "data": [ { "a_nested_array_nested_array": [ 1, 2, 3 ], "an_array": [ 1, 2, 3 ], "field": "test", "nested_nested_field": "bla" } ], "rows": 1, "statistics": { "elapsed": 0.00032175, "rows_read": 2, "bytes_read": 142 } } }
Now you can create the Data Source using the schema in the analyze response, as in the sketch below.
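A minimal sketch of that creation call (the events_example name is illustrative; the schema is the one returned by the Analyze API above):
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "format=ndjson" \
-d "name=events_example" \
-d "mode=create" \
--data-urlencode "schema=a_nested_array_nested_array Array(Int16) \`json:$.a_nested_array.nested_array[:]\`, an_array Array(Int16) \`json:$.an_array[:]\`, field String \`json:$.field\`, nested_nested_field String \`json:$.nested.nested_field\`"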
Using a URL
When using a URL for the NDJSON/Parquet file, the URL must be remotely accessible and the url param must be encoded.
Creating a Data Source from a remote NDJSON file¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "mode=append" \
-d "format=ndjson" \
-d "name=events" \
--data-urlencode "url=http://example.com/file.ndjson"
When importing an NDJSON/Parquet file via a URL, the response will not be the final result of the import but a Job. You can check the job status and progress using the Jobs API. In the response, id, job_id, and import_id should have the same value.
Using a local NDJSON file
You can also use this endpoint to import data to a Data Source from a local file.
Creating a Data Source from local NDJSON files¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "mode=append" \
-d "format=ndjson" \
-d "name=events" \
-F ndjson=@local_file.ndjson
Successful response¶{ "id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [], "type": "ndjson" }, "error": false }
In this case, you will receive the final import result in the response, so you won’t need to check a job status.
Data Source Engines
Tinybird uses ClickHouse as the underlying storage technology. ClickHouse features different strategies to store data; these strategies define not only where and how the data is stored, but also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a Table Engine that determines those factors.
Tinybird allows you to choose the ClickHouse Table Engine for your Data Source. To make things simpler, the API hides some of the complexity around ClickHouse’s Table Engines. For instance, it simplifies the replication configuration, so you don’t have to worry about it.
Currently, Tinybird supports the following Engines:
MergeTree
ReplacingMergeTree
SummingMergeTree
AggregatingMergeTree
CollapsingMergeTree
VersionedCollapsingMergeTree
Join
Null
If you need to use any other Table Engine, get in touch with us.
You can use the engine parameter to specify the name of any of the available engines, e.g. engine=ReplacingMergeTree. To specify the engine parameters and the engine options, you can use as many engine_* request parameters as needed.
Engine parameters and options¶
                                Engine options
                                      ↓
                                -------------
ReplacingMergeTree(insert_date) ORDER BY (pk)
                   -----------
                        ↑
                Engine parameters
Find the different parameters for each engine in the following table:
Engine parameters¶
ENGINE | SIGNATURE | PARAMETER | DESCRIPTION
---|---|---|---
ReplacingMergeTree | ([ver]) | engine_ver | Optional. The column with the version.
SummingMergeTree | ([columns]) | engine_columns | Optional. The names of columns where values will be summarized.
CollapsingMergeTree | (sign) | engine_sign | Name of the column for computing the state.
VersionedCollapsingMergeTree | (sign, version) | engine_sign | Name of the column for computing the state.
 | | engine_version | Name of the column with the version of the object state.
Join | (join_strictness, join_type, columns) | engine_join_strictness | How to perform the JOIN. Valid values include: ANY, ALL, SEMI, and ANTI.
 | | engine_join_type | How to combine the columns of the tables. Valid values include: LEFT, INNER, RIGHT, and FULL.
 | | engine_key_columns | column1[,column2], key columns from the USING clause that the JOIN operation is made with.
Creating a Data Source with a Join engine¶
curl \
-X POST \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
"https://api.tinybird.co/v0/datasources" \
-d "schema=suppkey Int16, name String, region String" \
-d "name=region_by_suppkey" \
-d "engine=Join" \
-d "engine_join_strictness=ANY" \
-d "engine_join_type=LEFT" \
-d "engine_key_columns=suppkey"
The engine options, in particular the MergeTree engine options, match ClickHouse terminology: engine_partition_key, engine_sorting_key, engine_primary_key, engine_sampling_key, engine_ttl, and engine_settings. Check the ClickHouse documentation for a detailed explanation of the different engine options available.
Creating a Data Source with a ReplacingMergeTree engine¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "schema=pk UInt64, insert_date Date, close Float32" \
-d "engine=ReplacingMergeTree" \
-d "engine_sorting_key=pk" \
-d "engine_ver=insert_date" \
-d "name=test123"
Replacing data
You can completely, or selectively, replace the content of a Data Source using the mode=replace parameter. By default, if you do not specify any condition, all the content of the Data Source will be destroyed and replaced by the new content you provide. The operation is atomic.
Replacing a Data Source from a URL¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d mode='replace' \
-d name='data_source_name' \
--data-urlencode "url=http://example.com/file.csv"
Alternatively, you might want to replace just certain rows of your Data Source. In that case, you also have to use the replace_condition parameter. This parameter defines the filter that will be applied: all rows matching the condition are deleted before ingesting the new file, and only the rows of the new file that match the condition are ingested. Note that if the source file contains rows that do not match the condition, those rows will not be ingested. This operation is atomic.
Selectively replacing rows in a Data Source from a URL¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d mode='replace' \
-d name='data_source_name' \
-d replace_condition='replacement_condition' \
--data-urlencode "url=http://example.com/file.csv"
Replacements are made by partition, so it is mandatory that the replace_condition contains the partition field.
It is important to understand that all the dependencies of the Data Source (Materialized Views) will be recomputed, so your data will be consistent after the replacement. If you have n-level dependencies, they will also be updated by this operation. Taking the example A –> B –> C, if you replace data in A, Data Sources B and C will automatically be updated accordingly. Take into account that the Partition Key of Data Source C must also be compatible with Data Source A.
You can find some examples of how to selectively replace data in our guide, and a minimal sketch follows.
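Assuming a hypothetical Data Source named events whose partition key is based on a date column, a selective replacement of a single day could look like this:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d mode='replace' \
-d name='events' \
-d "replace_condition=date = '2018-12-01'" \
--data-urlencode "url=http://example.com/file.csv"
Only the rows of file.csv whose date is 2018-12-01 would be ingested, and only the existing rows matching that condition would be removed.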
Important notes
It’s not possible to run an append operation on a Data Source with a Join engine, or on a Data Source that has a dependent Materialized View materializing into a Data Source with a Join engine. To append data in these cases, do a replace using mode=replace, since this mode also inserts new data.
Replacements are atomic, but we cannot assure data consistency if you continue appending data to any related Data Source while the replacement takes place. The new incoming data will be discarded. This is an important limitation and we are working to provide a more flexible alternative.
Handling errors
When ingesting data using create, append, and replace operations, it’s possible to encounter errors; for example, if you’re trying to ingest a file via a URL that doesn’t exist. While in most cases we return an HTTP error, the following situation may also occur: as explained before, it’s possible that, while importing a file, some data is appended and some data goes to quarantine.
In this case, we return a 200 HTTP status code. In addition, the response includes an error message, along with the invalid_lines and the quarantine_rows:
Successful ingestion with errors¶
{ "import_id": "e9ae235f-f139-43a6-7ad5-a1e17c0071c2", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "cluster": null, "tags": {}, "created_at": "2019-03-12 17:45:04", "updated_at": "2019-03-12 17:45:04", "statistics": { "bytes": 1397, "row_count": 4 }, "replicated": false, "version": 0, "project": null, "used_by": [] }, "error": "There was an error with file contents: 2 rows in quarantine and 2 invalid lines", "quarantine_rows": 2, "invalid_lines": 2 }
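When that happens, one way to inspect the quarantined rows is to query the quarantine Data Source (same name plus the “_quarantine” suffix, as noted in the truncate section below). A hedged sketch using the Query API:
curl \
-H "Authorization: Bearer <your Auth token>" \
-G "https://api.tinybird.co/v0/sql" \
--data-urlencode "q=SELECT * FROM your_datasource_name_quarantine LIMIT 10"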
Request parameters¶
KEY | TYPE | DESCRIPTION
---|---|---
mode | String | Default: create. Other modes: append and replace. The create mode creates a new Data Source and attempts to import the data of the CSV if a URL is provided or the body contains any data. The append mode inserts the new rows provided into an existing Data Source (it will also create it if it does not exist yet). The replace mode will remove the previous Data Source and its data and replace it with the new one; Pipes or queries pointing to this Data Source will immediately start returning data from the new one, without disruption, once the replace operation is complete. The create mode will automatically name the Data Source if no name parameter is provided; for the append and replace modes to work, the name parameter must be provided and the schema must be compatible.
name | String | Optional. Name of the Data Source to create, append to or replace data in. This parameter is mandatory when using the append or replace modes.
url | String | Optional. The URL of the CSV with the data to be imported.
dialect_delimiter | String | Optional. The one-character string separating the fields. We try to guess the delimiter based on the CSV contents using some statistics, but sometimes we fail to identify the correct delimiter. If you know your CSV’s field delimiter, you can use this parameter to explicitly define it.
dialect_escapechar | String | Optional. The escapechar removes any special meaning from the following character. This is useful if the CSV does not use double quotes to encapsulate a column but uses double quotes in the content of a column and it is escaped with, e.g., a backslash.
schema | String | Optional. Data Source schema in the format ‘column_name Type, column_name_2 Type2, …’. When creating a Data Source with format ndjson, the schema must include the jsonpath for each column; see the JSONPaths section for more details.
engine | String | Optional. Engine for the underlying data. Requires the schema parameter.
engine_* | String | Optional. Engine parameters and options; check the Engines section for more details.
progress | String | Default: false. When using true and sending the data in the request body, Tinybird will return block status while loading, using line-delimited JSON.
token | String | Auth token with create or append permissions. Required only if no Bearer Authorization header is found.
type_guessing | String | Default: true. The type_guessing parameter is not taken into account when replacing or appending data to an existing Data Source. When using false, all columns are created as String; otherwise Tinybird tries to guess the column types based on the CSV contents. Sometimes you are not familiar with the data and the first step is to get familiar with it: by disabling type guessing, you can quickly import everything as strings, explore it with SQL, and cast it to the right type or shape in whatever way you see fit via a Pipe.
debug | String | Optional. Enables returning debug information from logs. It can include blocks, block_log and/or hook_log.
replace_condition | String | Optional. When used in combination with the replace mode, it allows you to replace a portion of your Data Source that matches the replace_condition SQL statement with the contents of the url or query passed as a parameter. See this guide to learn more.
format | String | Default: csv. Indicates the format of the data to be ingested in the Data Source. By default it is csv; specify format=ndjson for NDJSON format, and format=parquet for Parquet files.
- GET /v0/datasources/?¶
- Getting a list of your Data Sources¶
curl \
-H "Authorization: Bearer <DATASOURCES:READ token>" \
-X GET "https://api.tinybird.co/v0/datasources"
Get a list of the Data Sources in your account.
The token you use to query the available Data Sources will determine what Data Sources get returned: only those accessible with the token you are using will be returned in the response.
Successful response¶
{ "datasources": [{ "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name", "statistics": { "bytes": 430833, "row_count": 3980 } }] }
Request parameters¶
Key | Type | Description
---|---|---
attrs | String | Comma-separated list of the Data Source attributes to return in the response. Example: attrs=name,id,engine. Leave empty to return a full response.
Note that the statistics’ bytes and row_count attributes might be null depending on how the Data Source was created.
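For instance, to return only a few attributes for each Data Source (a sketch using the attrs parameter documented above):
curl \
-H "Authorization: Bearer <DATASOURCES:READ token>" \
-X GET "https://api.tinybird.co/v0/datasources?attrs=name,id,engine"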
- POST /v0/datasources/(.+)/alter¶
Modify the Data Source schema.
This endpoint supports altering the following fields of a Data Source:
Request parameters¶
Key | Type | Description
---|---|---
schema | String | Optional. The whole schema; it can add new columns after the existing ones of the Data Source.
description | String | Optional. Sets the description of the Data Source.
kafka_store_raw_value | Boolean | Optional. Default: false. When set to true, the ‘value’ column of a Kafka Data Source will save the JSON as a raw string.
ttl | String | Optional. Set to any value accepted in ClickHouse for a TTL, or to ‘false’ to remove the TTL.
dry | Boolean | Optional. Default: false. Set to true to show what would be modified in the Data Source, without running any modification at all.
The schema parameter can be used to add new columns at the end of the existing ones in a Data Source.
Be aware that currently we don’t validate whether the change will affect the existing Materialized Views (MVs) attached to the Data Source being modified, so this change may break existing MVs. For example, avoid changing a Data Source that has an MV created with something like SELECT * FROM Data Source .... If you want MVs that remain forward compatible with column additions, create them specifying the columns explicitly instead of using the * operator.
Also, take into account that, for now, the only engines that support adding new columns are those in the MergeTree family.
To add a column to a Data Source, call this endpoint with the Data Source name and the new schema definition.
For example, having a Data Source created like this:
Creating a Data Source from a schema¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources" \
-d "name=stocks" \
-d "schema=symbol String, date Date, close Float32"
If you want to add a new column ‘concept String’, you need to call this endpoint with the new schema:
Adding a new column to an existing Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "schema=symbol String, date Date, close Float32, concept String"
If everything goes well, you will get the operations performed in the response:
ADD COLUMN operation resulted from the schema change.¶{ "operations": [ "ADD COLUMN `concept` String" ] }
You can also view the inferred operations without executing them by adding dry=true to the parameters, as in the sketch below.
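A minimal dry-run sketch, reusing the stocks example above (only the new column is illustrative; dry is the parameter documented in the table above):
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "dry=true" \
-d "schema=symbol String, date Date, close Float32, concept String"
The response lists the operations that would be executed, without modifying the Data Source.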
To modify the description of a Data Source:
Modifying the description of a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "description=My new description"
To save the JSON as a raw string in the “value” column of a Kafka Data Source:
Saving the raw string in the value column of a Kafka Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "kafka_store_raw_value=true"
To modify the TTL of a Data Source:
Modifying the TTL of a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "ttl=12 hours"
To remove the TTL of a Data Source:
Removing the TTL of a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/stocks/alter" \
-d "ttl=false"
You can also alter Data Sources of type ndjson/parquet; in that case, you have to specify the JSONPath in the schema in the same way as when you created the Data Source.
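A hedged sketch, reusing the events NDJSON Data Source created earlier and adding a hypothetical city column with its JSONPath:
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/events/alter" \
--data-urlencode "schema=date DateTime \`json:$.date\`, event String \`json:$.event\`, city String \`json:$.city\`"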
- POST /v0/datasources/(.+)/truncate¶
Truncates a Data Source in your account. If the Data Source has dependent Materialized Views, those won’t be truncated in cascade; if you want to delete data from the dependent Materialized Views as well, you’ll have to make a subsequent call to this method for each of them. The Auth token in use must have the DATASOURCES:CREATE scope.
Truncating a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/datasources/name/truncate"
This also works for the quarantine table of a Data Source. Remember that the quarantine table for a Data Source has the same name but with the “_quarantine” suffix.
Truncating the quarantine table from a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:DROP token>" \
-X POST "https://api.tinybird.co/v0/datasources/:name_quarantine/truncate"
- POST /v0/datasources/(.+)/delete¶
Deletes rows from a Data Source in your account given a SQL condition. The Auth token in use must have the DATASOURCES:CREATE scope.
Deleting rows from a Data Source given a SQL condition¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
--data "delete_condition=(country='ES')" \
"https://api.tinybird.co/v0/datasources/:name/delete"
When deleting rows from a Data Source, the response will not be the final result of the deletion but a Job. You can check the job status and progress using the Jobs API. In the response, id, job_id, and delete_id should have the same value:
Job created and waiting to start¶
{ "id": "job_id", "job_id": "job_id", "delete_id": "job_id", "job_url": "http://api.tinybird.co/v0/jobs/job_id", "job": { "kind": "delete_data", "...": "Full job representation as described below" }, "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name" }, "status": "waiting" }
The Job status can be one of the following:
waiting: The initial status of a job. When created, a job has to wait if there are other jobs running.
working: The job operation has started.
done: The job has finished successfully.
error: The job has finished with an error.
Job has finished successfully¶{ "id": "c8ae13ef-e739-40b6-8bd5-b1e07c8671c2", "job_id": "c8ae13ef-e739-40b6-8bd5-b1e07c8671c2", "kind": "delete_data", "status": "done", "delete_condition": "column_01=1", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name" } }
If there’s been an error in the delete operation, the job response will also include a detailed error:
Job has finished with errors¶{ "id": "1f6a5a3d-cfcb-4244-ba0b-0bfa1d1752fb", "job_id": "1f6a5a3d-cfcb-4244-ba0b-0bfa1d1752fb", "kind": "delete_data", "status": "error", "delete_condition": "column_01=1", "datasource": { "id": "t_0ab7a11969fa4f67985cec481f71a5c2", "name": "your_datasource_name" }, "error": "[Error] Missing columns: 'random_column' while processing query" }
Data Source engines supported
Tinybird uses ClickHouse as the underlying storage technology. ClickHouse features different strategies to store data; these strategies define not only where and how the data is stored, but also what kind of data access, queries, and availability your data has. In ClickHouse terms, a Tinybird Data Source uses a Table Engine that determines those factors.
Currently, Tinybird supports deleting data for data sources with the following Engines:
MergeTree
ReplacingMergeTree
SummingMergeTree
AggregatingMergeTree
CollapsingMergeTree
VersionedCollapsingMergeTree
Dependent views deletion
If the Data Source has dependent Materialized Views, those won’t be deleted in cascade. If you want to delete data from the dependent Materialized Views as well, you’ll have to make a subsequent call to this method for each affected view with a proper delete_condition. This also applies to the associated quarantine Data Source.
Request parameters¶
KEY | TYPE | DESCRIPTION
---|---|---
delete_condition | String | Mandatory. A string representing the WHERE SQL clause you’d add to a regular DELETE FROM <table> WHERE <delete_condition> statement. Most of the time you might want to write a simple delete_condition such as column_name=value, but any valid SQL expression including conditional operators is valid.
- GET /v0/datasources/(.+)¶
- Getting information about a particular Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:READ token>" \
-X GET "https://api.tinybird.co/v0/datasources/datasource_name"
Get Data Source information and stats. The token provided must have read access to the Data Source.
Successful response¶
{ "id": "t_bd1c62b5e67142bd9bf9a7f113a2b6ea", "name": "datasource_name", "statistics": { "bytes": 430833, "row_count": 3980 }, "used_by": [{ "id": "t_efdc62b5e67142bd9bf9a7f113a34353", "name": "pipe_using_datasource_name" }], "updated_at": "2018-09-07 23:50:32.322461", "created_at": "2018-11-28 23:50:32.322461", "type": "csv" }
id and name are two ways to refer to the Data Source in SQL queries and API endpoints. The only difference is that the id never changes; it will keep working even if you change the name (which is the name used to display the Data Source in the UI). In general, you can use id or name interchangeably.
Using the above response as an example:
select count(1) from datasource_name
is equivalent to
select count(1) from t_bd1c62b5e67142bd9bf9a7f113a2b6ea
The id t_bd1c62b5e67142bd9bf9a7f113a2b6ea is not a descriptive name, so you can add a description like t_my_events_datasource.bd1c62b5e67142bd9bf9a7f113a2b6ea
The statistics property contains information about the table. Those numbers are an estimation: bytes is the estimated data size on disk and row_count the estimated number of rows. These statistics are updated whenever data is appended to the Data Source.
The used_by property contains the list of Pipes that are using this Data Source. Only the Pipe id and name are sent.
The type property indicates the format used when the Data Source was created. Available formats are csv, ndjson, and parquet. The Data Source type indicates what file format you can use to ingest data.
- DELETE /v0/datasources/(.+)¶
- Dropping a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:DROP token>" \
-X DELETE "https://api.tinybird.co/v0/datasources/:name"
Drops a Data Source from your account. The Auth token in use must have the DROP:datasource_name scope.
- PUT /v0/datasources/(.+)¶
Update Data Source attributes
Updating the name of a Data Source¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X PUT "https://api.tinybird.co/v0/datasources/:name?name=new_name"
Request parameters¶
Key | Type | Description
---|---|---
name | String | New name for the Data Source.
token | String | Auth token. Only required if no Bearer Authorization header is sent. It should have the DATASOURCES:CREATE scope for the given Data Source.
- POST /v0/analyze/?¶
The Analyze API takes a sample of a supported file (csv, ndjson, parquet) and guesses the file format, schema, columns, types, nullable columns, and JSONPaths (in the case of NDJSON/Parquet files).
This is a helper endpoint to create Data Sources without having to write the schema manually.
Take into account that Tinybird’s guessing algorithm is not deterministic, since it analyzes a random portion of the file passed to the endpoint; that means it can guess different types or nullable columns depending on the sample analyzed. We recommend double-checking the guessed schema in case you have to make some manual adjustments.
Analyze a local file¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-X POST "https://api.tinybird.co/v0/analyze" \
-F "file=@path_to_local_file"
Analyze a remote file¶
curl \
-H "Authorization: Bearer <DATASOURCES:CREATE token>" \
-G -X POST "https://api.tinybird.co/v0/analyze" \
--data-urlencode "url=https://example.com/file"
Analyze response¶{ "analysis": { "columns": [ { "path": "$.a_nested_array.nested_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "a_nested_array_nested_array" }, { "path": "$.an_array[:]", "recommended_type": "Array(Int16)", "present_pct": 3, "name": "an_array" }, { "path": "$.field", "recommended_type": "String", "present_pct": 1, "name": "field" }, { "path": "$.nested.nested_field", "recommended_type": "String", "present_pct": 1, "name": "nested_nested_field" } ], "schema": "a_nested_array_nested_array Array(Int16) `json:$.a_nested_array.nested_array[:]`, an_array Array(Int16) `json:$.an_array[:]`, field String `json:$.field`, nested_nested_field String `json:$.nested.nested_field`" }, "preview": { "meta": [ { "name": "a_nested_array_nested_array", "type": "Array(Int16)" }, { "name": "an_array", "type": "Array(Int16)" }, { "name": "field", "type": "String" }, { "name": "nested_nested_field", "type": "String" } ], "data": [ { "a_nested_array_nested_array": [ 1, 2, 3 ], "an_array": [ 1, 2, 3 ], "field": "test", "nested_nested_field": "bla" } ], "rows": 1, "statistics": { "elapsed": 0.000310539, "rows_read": 2, "bytes_read": 142 } } }
The columns attribute contains the guessed columns, and for each one:
path: The JSONPath syntax, in the case of NDJSON/Parquet files.
recommended_type: The guessed database type.
present_pct: If the value is lower than 1, there were nulls in the sample used for guessing.
name: The recommended column name.
The schema attribute is ready to be used in the Data Sources API.
The preview attribute contains up to 10 rows of the content of the file.
POST /v0/events¶
This endpoint is currently a feature preview and is considered experimental.
You can send individual events by just sending the JSON event in the request body.
curl \
-H "Authorization: Bearer <import_token>" \
-d '{"date": "2020-04-05 00:05:38", "city": "Chicago"}' \
'https://api.tinybird.co/v0/events?name=events_test'
You can also send more than one event per request, using \n to separate them. This is far more efficient than sending one request per event: if you can batch events, ingestion will be faster.
Note the $ before the JSON events. It’s needed in order for Bash to replace the \n; curl doesn’t do it automatically.¶
curl \
-H "Authorization: Bearer <import_token>" \
-d $'{"date": "2020-04-05 00:05:38", "city": "Chicago"}\n{"date": "2020-04-05 00:07:22", "city": "Madrid"}\n' \
'https://api.tinybird.co/v0/events?name=events_test'
Appends several events in NDJSON format to a JSON Data Source. If the Data Source doesn’t exist, it is created with a guessed schema. Requests to this endpoint are not restricted by the regular rate limits.
Key | Type | Description
---|---|---
name | String | Name of the target Data Source to append data to.
wait | Boolean | Set to ‘false’ to avoid waiting until the write is acknowledged.
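For instance, a sketch of a batch append that does not wait for the write to be acknowledged, using the wait parameter documented above:
curl \
-H "Authorization: Bearer <import_token>" \
-d $'{"date": "2020-04-05 00:05:38", "city": "Chicago"}\n{"date": "2020-04-05 00:07:22", "city": "Madrid"}\n' \
'https://api.tinybird.co/v0/events?name=events_test&wait=false'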