PicoAudio2 Online Inference

Definitions

TCC (Temporal Coarse Caption):
A brief text description of the overall audio scene.
Example: a dog barks

TDC (Temporal Detailed Caption):
A caption with timestamp information for each event.
It allows precise temporal control over when events happen in the generated audio.
Example: a dog barks(1.0-2.0, 3.0-4.0); a man speaks(5.0-6.0)
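
As an illustration of the TDC layout, the following minimal sketch builds a TDC string from a mapping of events to time spans. The helper name format_tdc and the dictionary input are hypothetical and are not part of PicoAudio2; only the output format comes from the example above.

```python
def format_tdc(events):
    """Build a TDC string from {event_name: [(start, end), ...]}.

    Hypothetical helper for illustration only; it is not part of PicoAudio2.
    Times are in seconds and are printed with one decimal place.
    """
    parts = []
    for name, spans in events.items():
        # Join every (start, end) pair of one event as "start-end, start-end".
        times = ", ".join(f"{start:.1f}-{end:.1f}" for start, end in spans)
        parts.append(f"{name}({times})")
    # Events are separated by "; ".
    return "; ".join(parts)


if __name__ == "__main__":
    print(format_tdc({
        "a dog barks": [(1.0, 2.0), (3.0, 4.0)],
        "a man speaks": [(5.0, 6.0)],
    }))
```

Run as a script, this prints the same string as the TDC example above.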

Input Requirements & Format

TCC is required for audio generation.
TDC is optional. If provided, it should follow the format: event1(start1-end1, start2-end2); event2(start1-end1, ...)
Length (in seconds) is optional, but recommended for temporal control. The length defaults to 10.0 seconds.
Enable Time Control: Tick this option to use the TDC and length for precise event timing.
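
To check a TDC string before submitting it, a small validator along the following lines may help. This is a sketch under stated assumptions, not the parser PicoAudio2 itself uses: the function name parse_tdc is hypothetical, and the rule that every span must satisfy 0 <= start < end <= length is an assumption made here for illustration.

```python
import re

# One event: a name followed by a parenthesized list of spans.
_EVENT_RE = re.compile(r"^\s*(?P<name>[^()]+?)\s*\((?P<spans>[^()]*)\)\s*$")
# One span: "start-end", both non-negative decimal numbers.
_SPAN_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_tdc(tdc, length=10.0):
    """Parse 'event1(s-e, s-e); event2(s-e)' into {event: [(s, e), ...]}.

    Hypothetical helper, not PicoAudio2's own parser. Raises ValueError on
    malformed input. The bounds check against `length` is an assumption.
    """
    events = {}
    for chunk in tdc.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing ';'
        event = _EVENT_RE.match(chunk)
        if event is None:
            raise ValueError(f"malformed event: {chunk!r}")
        for piece in event.group("spans").split(","):
            span = _SPAN_RE.match(piece)
            if span is None:
                raise ValueError(f"malformed time span: {piece!r}")
            start, end = float(span.group(1)), float(span.group(2))
            if not (0.0 <= start < end <= length):
                raise ValueError(f"span out of range: {piece.strip()!r}")
            events.setdefault(event.group("name"), []).append((start, end))
    if not events:
        raise ValueError("no events found")
    return events


if __name__ == "__main__":
    print(parse_tdc("a dog barks(1.0-2.0, 3.0-4.0); a man speaks(5.0-6.0)"))
```

The default length of 10.0 mirrors the default stated above; pass a different value when generating longer or shorter clips.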

Notes

If the TDC format is incorrect or the length is missing, the model generates audio without precise temporal control.
For general audio generation, it is recommended to enter random as the TDC.
You may leave TDC blank to let the LLM generate timestamps automatically (subject to API quota).