The ccextractor project is a fast closed caption extractor for MPEG files.
ccextractor is mostly a mildly optimized C port of McPoodle's excellent but painfully slow Perl script SCC_RIP. It lets you rip the raw closed caption (read: subtitle) data from a number of sources, such as DVD or ReplayTV.
As an added bonus compared to the original SCC_RIP, ccextractor can extract subtitles from the HDTV transport streams that are becoming more common.
At this point ccextractor extracts the line 21 captions (which must legally be present for a number of years until the transition to digital is complete). Note that most .ts files you can find carry subtitle data for both analog (EIA-608) and digital (EIA-708) decoders. AFAIK there are no freely available EIA-708 rippers.
Anyway, since line 21 captions will be available for some time, we have time to build a decent 708 ripper.
Basic Usage
For details on CC, please go to McPoodle's page:
http://www.geocities.com/mcpoodle43/SCC_TOOLS/DOCS/SCC_TOOLS.HTML
You will need his tools to use ccextractor's output.
The basic idea is that you get the raw closed caption dump from ccextractor.
Then you need other tools (which vary depending on what you want to do) to continue processing.
To get an .srt transcript from a .ts file (I assume this will be the most common use), do this:
ccextractor -12 input_file
-12 means "extract both subtitle tracks" (the technical name is "fields", but "tracks" is easier to understand). 1 is almost always English. 2 is Spanish on HBO (at least in the few samples I've seen) but could be anything. Just extract both of them and check.
Example: ccextractor -12 house315.ts
ccextractor will create two files, called house315_1.bin and house315_2.bin.
Then use McPoodle's RAW2SCC to create a temporary SCC file (SCC stands for Scenarist Closed Caption; it was originally the native format of the Scenarist authoring program, but that is not important here).
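If you script around ccextractor, it helps to know in advance which files to look for. The sketch below only mirrors the naming seen in the example above (input name without its extension, plus _1.bin and _2.bin); the helper name expected_outputs is made up for illustration, and the naming rule is an assumption drawn from this one example, not a documented guarantee.

```python
from pathlib import Path

def expected_outputs(input_file, fields=(1, 2)):
    """Guess the raw caption files ccextractor -12 would write.

    Assumption: outputs are named <input stem>_<field>.bin and sit next to
    the input file, as in the house315.ts example.
    """
    path = Path(input_file)
    return [str(path.with_name(f"{path.stem}_{n}.bin")) for n in fields]

print(expected_outputs("house315.ts"))
```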
raw2scc house315_1.bin
This creates house315_1.scc
From this .scc file, you can get the final .srt by using McPoodle's CCASDI:
ccasdi -s house315_1.srt
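The three steps can be chained from a small script. The following is only a sketch: the command names and arguments are copied as they appear in this text and have not been checked against the tools' own documentation, the function names build_commands and run_pipeline are invented here, and all files are assumed to live in the current directory.

```python
import subprocess
from pathlib import Path

def build_commands(input_file, field=1):
    """Return the three command lines from the walkthrough for one field.

    Assumption: intermediate files follow the <stem>_<field>.<ext> naming
    used in the house315 example.
    """
    stem = Path(input_file).stem
    return [
        ["ccextractor", "-12", input_file],       # raw dump of both fields
        ["raw2scc", f"{stem}_{field}.bin"],       # .bin -> .scc
        ["ccasdi", "-s", f"{stem}_{field}.srt"],  # final .srt, as written above
    ]

def run_pipeline(input_file, field=1):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(input_file, field):
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("house315.ts") requires ccextractor, raw2scc and ccasdi to be on your PATH; build_commands alone can be used to preview what would be run. Either way, the end product is the house315_1.srt file.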
The resulting .srt file looks like this (just three sample entries shown):
514
00:24:07,400 --> 00:24:09,300
They've got another trial
going on at Duke.
515
00:24:09,367 --> 00:24:12,567
15% extend their lives
beyond five years.
516
00:24:12,634 --> 00:24:13,701
If you're positive
for protein PHF--
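Each entry in the sample is a counter, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and the caption text. As a quick sanity check on the output, the timing line can be turned into milliseconds. This is a minimal sketch that handles only the timestamp layout shown in the sample; the function names to_ms and cue_duration_ms are illustrative and not part of any tool mentioned here.

```python
import re

# Matches timestamps such as 00:24:07,400 (hours:minutes:seconds,milliseconds).
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")

def to_ms(stamp):
    """Convert one .srt timestamp to milliseconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an .srt timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis

def cue_duration_ms(timing_line):
    """Return how long a caption stays on screen, given its timing line."""
    start, end = timing_line.split("-->")
    return to_ms(end) - to_ms(start)

print(cue_duration_ms("00:24:07,400 --> 00:24:09,300"))  # 1900
```

For entry 514 above this gives 1900 ms, so the caption is displayed for just under two seconds.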
· This release adds support for DVR-MS files.
· It improves the CC decoder.
· There are several bugfixes, a major speed boost (20%-40%), improved timing for non-TS files, improved format autodetection, and other minor improvements.