PicoAudio2 Online Inference

Definitions

TCC (Temporal Coarse Caption):
A brief text description of the overall audio scene.
Example: a dog barks

TDC (Temporal Detailed Caption):
A caption with timestamp information for each event.
It allows precise temporal control over when events happen in the generated audio.
Example: a dog barks(1.0-2.0, 3.0-4.0); a man speaks(5.0-6.0)
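
The TDC syntax above can be read mechanically. The following is a minimal sketch of a parser, assuming the format is exactly as shown in the example (events separated by semicolons, each followed by comma-separated start-end spans in parentheses). The name parse_tdc is hypothetical and is not part of PicoAudio2.

```python
import re

# Hypothetical helper, not part of PicoAudio2: parses a TDC string into a
# mapping of event description -> list of (start, end) pairs in seconds.
_EVENT_RE = re.compile(r"^\s*(?P<name>[^()]+?)\s*\((?P<spans>[^()]*)\)\s*$")
_SPAN_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_tdc(tdc: str) -> dict[str, list[tuple[float, float]]]:
    events: dict[str, list[tuple[float, float]]] = {}
    for chunk in tdc.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing semicolon
        match = _EVENT_RE.match(chunk)
        if match is None:
            raise ValueError(f"malformed event: {chunk!r}")
        for raw in match.group("spans").split(","):
            span = _SPAN_RE.match(raw)
            if span is None:
                raise ValueError(f"malformed time span: {raw!r}")
            start, end = float(span.group(1)), float(span.group(2))
            if start >= end:
                raise ValueError(f"start must precede end: {raw!r}")
            events.setdefault(match.group("name"), []).append((start, end))
    return events


print(parse_tdc("a dog barks(1.0-2.0, 3.0-4.0); a man speaks(5.0-6.0)"))
```

Rejecting spans where the start is not before the end is an extra check assumed here for illustration; the source does not say how the model treats such spans.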


Input Requirements & Format

  • TCC is required for audio generation.
  • TDC is optional. If provided, it should follow the format: event1(start1-end1, start2-end2); event2(start1-end1, ...)
  • Length (in seconds) is optional but recommended for temporal control. It defaults to 10.0 seconds.
  • Enable Time Control: tick this box to use the TDC and length for precise event timing.
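
The requirements above can be summarized as a small input-preparation step. This is a sketch only: GenerationInputs and make_inputs are hypothetical names chosen for illustration and do not reflect the demo's actual code. The only facts taken from the list are that TCC is required, TDC is optional, and the length defaults to 10.0 seconds.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LENGTH_S = 10.0  # default length stated in the requirements


@dataclass
class GenerationInputs:  # hypothetical container, not a PicoAudio2 class
    tcc: str
    tdc: Optional[str] = None
    length_s: float = DEFAULT_LENGTH_S
    time_control: bool = False


def make_inputs(tcc, tdc=None, length_s=None, time_control=False):
    """Apply the stated rules: TCC required, TDC optional, length defaults."""
    if not tcc or not tcc.strip():
        raise ValueError("TCC is required for audio generation")
    if length_s is None:
        length_s = DEFAULT_LENGTH_S
    elif length_s <= 0:
        # Assumption for illustration: a non-positive length is rejected.
        raise ValueError("length must be positive")
    cleaned_tdc = tdc.strip() if tdc and tdc.strip() else None
    return GenerationInputs(tcc.strip(), cleaned_tdc, float(length_s), time_control)


inputs = make_inputs("a dog barks")
print(inputs.length_s, inputs.tdc)
```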

Notes

  • If the TDC format is incorrect or the length is missing, the model will generate audio without precise temporal control.
  • For general audio generation, entering random as the TDC is recommended.
  • You may leave TDC blank to let the LLM generate timestamps automatically (subject to API quota).
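
The notes above describe four outcomes depending on what the TDC field contains. The sketch below classifies a TDC value accordingly. It is an illustration under assumptions: tdc_mode and the returned labels are hypothetical, the word random is assumed to be entered literally, and the real demo may decide differently (for example, by also considering the length and the Enable Time Control box).

```python
import re

# Pattern for the documented shape: event(start-end, ...); event(start-end, ...)
_SPAN = r"\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?"
_EVENT = rf"[^();]+\(\s*{_SPAN}(?:\s*,\s*{_SPAN})*\s*\)"
_TDC_RE = re.compile(rf"\s*{_EVENT}(?:\s*;\s*{_EVENT})*\s*;?\s*")


def tdc_mode(tdc):
    """Return a hypothetical label for how a TDC value would be handled."""
    text = (tdc or "").strip()
    if not text:
        return "llm_timestamps"   # blank: the LLM generates timestamps
    if text.lower() == "random":
        return "general"          # general audio generation
    if _TDC_RE.fullmatch(text):
        return "time_controlled"  # well-formed TDC: precise timing
    return "no_time_control"      # malformed TDC: no precise timing


for value in ["", "random", "a dog barks(1.0-2.0, 3.0-4.0)", "a dog barks 1-2"]:
    print(repr(value), "->", tdc_mode(value))
```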