Uploading Data to a Dataset

This guide covers the V4 dataset upload flow: requesting a presigned URL, uploading a file, and polling the import job until processing completes.

For V3 (legacy CSV/ZIP) uploads, see Working with Datasets.

Overview

V4 uploads follow a three-step flow:

1. Request a presigned URL
   Call GET /datasets/{dataset_id}/upload-url/{filename}. The service creates an import job and returns a presigned S3 URL and an import_id.

2. Upload the file to S3
   PUT the file directly to S3 using the presigned URL. The file never passes through the Prolific API.

3. Poll the import job
   Poll GET /datasets/{dataset_id}/imports/{import_id} until a terminal status is reached.

Step 1: Request a presigned URL

GET /api/v1/data-collection/datasets/{dataset_id}/upload-url/{filename}

Example:

curl -H "Authorization: Token {api_token}" \
  "https://api.prolific.com/api/v1/data-collection/datasets/0192a3b5-e8f9-7a0b-1c2d-3e4f5a6b7c8d/upload-url/reviews.jsonl"

Response:

{
  "upload_url": "https://s3.amazonaws.com/raw-datasets/0192a3b5.../01935c2d.../reviews.jsonl?X-Amz-Signature=...",
  "http_method": "PUT",
  "import_id": "01935c2d-1a2b-3c4d-5e6f-7a8b9c0d1e2f"
}

Save the import_id — you will need it to poll for status.
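
As a rough illustration, the same request could be made from Python using only the standard library. The helper names (upload_url_endpoint, request_upload_url) and the decision to percent-encode the filename are assumptions made for this sketch; they are not part of the API reference.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.prolific.com/api/v1/data-collection"


def upload_url_endpoint(dataset_id: str, filename: str) -> str:
    """Build the Step 1 endpoint URL.

    Percent-encoding the filename is a precaution assumed here, not a
    documented requirement.
    """
    safe_name = urllib.parse.quote(filename, safe="")
    return f"{API_BASE}/datasets/{dataset_id}/upload-url/{safe_name}"


def request_upload_url(dataset_id: str, filename: str, api_token: str) -> dict:
    """GET the presigned URL and return the parsed JSON response body."""
    req = urllib.request.Request(
        upload_url_endpoint(dataset_id, filename),
        headers={"Authorization": f"Token {api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A caller would typically keep both info["upload_url"] and info["import_id"] from the returned dictionary, since the first is needed for Step 2 and the second for Step 3.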

Step 2: Upload the file to S3

PUT the file directly to the presigned URL. No authentication headers are required for the S3 request, because the presigned URL already carries its signature in the query string.

Use the content_type value from the upload URL response as the Content-Type header — for example, application/x-ndjson for JSONL files.

curl -X PUT \
  -H "Content-Type: {content_type}" \
  --data-binary @reviews.jsonl \
  "{upload_url}"

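The upload itself might look like the following in Python. This is a minimal sketch: build_s3_put and upload_file are hypothetical helper names, and reading the whole file into memory is a simplification that may not suit very large files.

```python
import urllib.request


def build_s3_put(upload_url: str, payload: bytes, content_type: str) -> urllib.request.Request:
    """Prepare the PUT request for the presigned URL.

    No Authorization header is added: the presigned URL is the credential.
    """
    return urllib.request.Request(
        upload_url,
        data=payload,
        method="PUT",
        headers={"Content-Type": content_type},
    )


def upload_file(upload_url: str, path: str, content_type: str) -> int:
    """Send the file bytes to the presigned URL and return the HTTP status code."""
    with open(path, "rb") as fh:
        payload = fh.read()  # simplification: loads the entire file into memory
    with urllib.request.urlopen(build_s3_put(upload_url, payload, content_type)) as resp:
        return resp.status
```

Keeping the request construction separate from the network call makes it possible to inspect the method and headers before anything is sent.
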
Step 3: Poll the import job

GET /api/v1/data-collection/datasets/{dataset_id}/imports/{import_id}

Example:

curl -H "Authorization: Token {api_token}" \
  "https://api.prolific.com/api/v1/data-collection/datasets/0192a3b5.../imports/01935c2d..."

Poll every few seconds until the status is terminal (complete, partial, failed, or pending_schema).

Import job statuses

| Status | Terminal | Description |
| --- | --- | --- |
| uninitialised | No | Import job created; waiting for S3 upload |
| queued | No | File received; queued for extraction |
| processing | No | Extraction in progress |
| complete | Yes | All records accepted |
| partial | Yes | Some records accepted, some rejected |
| failed | Yes | Extraction failed entirely |
| pending_schema | Yes | Dataset has no schema; upload paused |
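
A polling loop built on the terminal statuses listed above could be sketched as follows. The function name poll_import, the 3-second interval, and the attempt cap are assumptions for illustration; fetch_status stands in for whatever code performs the authenticated GET and returns the parsed JSON body.

```python
import time
from typing import Callable

# Statuses documented as terminal for an import job.
TERMINAL_STATUSES = {"complete", "partial", "failed", "pending_schema"}


def poll_import(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 3.0,  # assumed value for "every few seconds"
    max_attempts: int = 200,        # assumed safety cap, not an API limit
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reports a terminal status, then return it."""
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        sleep(interval_seconds)
    raise TimeoutError("import job did not reach a terminal status")
```

Passing the fetch and sleep functions in as parameters keeps the loop independent of any particular HTTP client and lets it be exercised without real waiting.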

Handling outcomes

Complete

All records were accepted. The dataset is ready to be used.

{
  "import_id": "01935c2d-1a2b-3c4d-5e6f-7a8b9c0d1e2f",
  "status": "complete",
  "accepted_count": 1000
}

Partial

Some records were accepted and some were rejected. The accepted records are available immediately — you do not need to re-upload the entire file. Review the errors array to understand which records were rejected and why.

{
  "import_id": "01935c2d-1a2b-3c4d-5e6f-7a8b9c0d1e2f",
  "status": "partial",
  "accepted_count": 997,
  "rejected_count": 3,
  "errors": [
    { "record_index": 47, "field": "review_text", "reason": "Value exceeds maximum length" },
    { "record_index": 312, "field": null, "reason": "Record missing required field: product_name" },
    { "record_index": 891, "field": "image_url", "reason": "Value is not a valid URL" }
  ]
}

Fix the rejected records in a new file and upload it separately using the same three-step flow. The new records will be appended to the existing accepted records.
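
One way to prepare that follow-up file is to pull the rejected records out of the original upload using the errors array. The sketch below assumes that record_index refers to a record's position in the uploaded file and that it is zero-based; neither is stated in this guide, so the index_base parameter is there to adjust if the assumption is wrong. Both function names are hypothetical.

```python
def rejected_lines(jsonl_lines: list[str], errors: list[dict], index_base: int = 0) -> list[str]:
    """Return the original lines named by each error's record_index, in file order.

    index_base=0 assumes zero-based record indexes (an unverified assumption).
    """
    wanted = sorted({err["record_index"] - index_base for err in errors})
    return [jsonl_lines[i] for i in wanted if 0 <= i < len(jsonl_lines)]


def rejection_summary(errors: list[dict]) -> dict[str, int]:
    """Count rejections per reason to show which problems are most common."""
    counts: dict[str, int] = {}
    for err in errors:
        counts[err["reason"]] = counts.get(err["reason"], 0) + 1
    return counts
```

The extracted lines still need to be corrected by hand or by script before they are written to the new file and uploaded.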

Failed

The file could not be parsed at all. Check the reason field and correct the file before re-uploading.

{
  "import_id": "01935c2d-1a2b-3c4d-5e6f-7a8b9c0d1e2f",
  "status": "failed",
  "reason": "File could not be parsed as JSONL: invalid JSON on line 1"
}
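
A local check before uploading can catch this kind of parse failure early. The following sketch only verifies that each non-blank line is valid JSON; it does not know the dataset schema, and the service's own parser may be stricter. The function name invalid_jsonl_lines is an assumption for illustration.

```python
import json


def invalid_jsonl_lines(text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, parser message) for each line that is not valid JSON."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are skipped here; the service may treat them differently
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
    return problems
```

An empty result means every line parsed locally, which makes a whole-file failure less likely but does not guarantee that every record will be accepted.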

Pending schema

The dataset does not yet have a schema defined. Set a schema on the dataset and then re-upload the file.

Uploading multiple files

You can upload multiple files to the same dataset. Each upload creates a new import job with its own import_id. Uploads for the same dataset are processed one at a time in arrival order, so a second upload will wait in queued status while the first is still processing.

# Upload first file
GET /datasets/{dataset_id}/upload-url/batch1.jsonl
# → import_id: "aaa..."

# Upload second file (while first may still be processing)
GET /datasets/{dataset_id}/upload-url/batch2.jsonl
# → import_id: "bbb..."

# Poll both independently
GET /datasets/{dataset_id}/imports/aaa...
GET /datasets/{dataset_id}/imports/bbb...

The total_datapoint_count field on the dataset reflects the cumulative count across all completed imports.
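
Tracking several import jobs at once could be sketched as below. The function name wait_for_imports, the polling interval, and the round cap are assumptions; fetch_status is a placeholder for code that performs the authenticated GET for a given import_id and returns the parsed JSON body.

```python
import time
from typing import Callable

# Statuses documented as terminal for an import job.
TERMINAL_STATUSES = {"complete", "partial", "failed", "pending_schema"}


def wait_for_imports(
    import_ids: list[str],
    fetch_status: Callable[[str], dict],
    interval_seconds: float = 3.0,  # assumed value for "every few seconds"
    max_rounds: int = 200,          # assumed safety cap, not an API limit
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, dict]:
    """Poll each unfinished import once per round until all are terminal."""
    pending = set(import_ids)
    finished: dict[str, dict] = {}
    for _ in range(max_rounds):
        for import_id in import_ids:
            if import_id not in pending:
                continue  # already terminal; no need to poll again
            job = fetch_status(import_id)
            if job.get("status") in TERMINAL_STATUSES:
                finished[import_id] = job
                pending.discard(import_id)
        if not pending:
            return finished
        sleep(interval_seconds)
    raise TimeoutError("some import jobs did not reach a terminal status")
```

Because uploads for one dataset are processed in arrival order, later jobs will usually finish after earlier ones, but the loop does not rely on that and simply stops polling each job once it is terminal.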