Years ago I had the idea of adding subtitles to game videos that don't have any.
The conditions weren't ripe back then, but they now seem to be.

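Whisper is OpenAI's open-source speech recognition software.
There is a .NET version of it, and with a few small changes to that version it can transcribe the speech in game videos into SRT format.
After that, run the SRT files through an online batch translator, make a few adjustments, and the Chinese localization is done.
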
The repository is here:
https://github.com/sandrohanea/whisper.net

It is best to build it with VS2022; otherwise you will run into a lot of .NET SDK version problems.

Once it is built, there are a few things to watch out for:

<0> Change the model file to the large model, ggml-large.bin; this model gives better results.
Of course it also takes longer: I estimate that converting a batch of files will need several hours, possibly tens of hours.

<1> Language should be set to "english".

    // Original: the language came from opt.Language.
    // var builder = factory.CreateBuilder()
    //     .WithLanguage(opt.Language);
    // Changed: always use English.
    var builder = factory.CreateBuilder()
        .WithLanguage("english");

<2> By default it seems to support only WAV input, and only at a 16 kHz sample rate, so files need to be converted to that format beforehand or it will fail.

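One way to do the conversion is to call ffmpeg from C#. This is only a sketch: it assumes ffmpeg is installed and on the PATH, and the mono and 16-bit PCM settings are my assumptions, since all that is known here is that the input must be 16 kHz WAV. BuildFfmpegArgs and ConvertToWav16k are hypothetical helper names, not part of Whisper.net.

```csharp
using System;
using System.Diagnostics;

// Build the ffmpeg argument list for a 16 kHz WAV file.
// Mono and 16-bit PCM are assumptions, not something the post confirms.
static string[] BuildFfmpegArgs(string input, string output) => new[]
{
    "-y",                 // overwrite the output file if it already exists
    "-i", input,          // source video or audio file
    "-vn",                // ignore any video stream
    "-ar", "16000",       // resample to 16 kHz
    "-ac", "1",           // downmix to mono (assumption)
    "-c:a", "pcm_s16le",  // 16-bit little-endian PCM (assumption)
    output,
};

// Run ffmpeg and wait for it to finish; returns true when it exits with code 0.
// Process.Start throws if the ffmpeg executable cannot be found.
static bool ConvertToWav16k(string input, string output)
{
    var psi = new ProcessStartInfo("ffmpeg") { UseShellExecute = false };
    foreach (var arg in BuildFfmpegArgs(input, output))
        psi.ArgumentList.Add(arg);

    using var process = Process.Start(psi);
    if (process is null) return false;
    process.WaitForExit();
    return process.ExitCode == 0;
}

// Show the command line that would be run for one file.
Console.WriteLine("ffmpeg " + string.Join(' ', BuildFfmpegArgs("intro.avi", "intro.wav")));
```

Calling ConvertToWav16k("intro.avi", "intro.wav") would then run the conversion; the file names are placeholders.
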
<3> The sample only converts a single example WAV file, so it needs to be changed to work in batches
(iterating over all files in a directory).

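A rough sketch of the batch loop, using only the standard library. PlanBatch and RunBatch are hypothetical names, and the transcribe callback stands in for whatever the modified sample does for a single file. Skipping files that already have an .srt is my own addition, so that a run of many hours can be resumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// List every .wav file directly inside the directory, sorted by name,
// paired with the .srt path its subtitles would be written to.
static List<(string Wav, string Srt)> PlanBatch(string directory) =>
    Directory.EnumerateFiles(directory, "*.wav")
        .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
        .Select(path => (Wav: path, Srt: Path.ChangeExtension(path, ".srt")))
        .ToList();

// Run a caller-supplied transcription step over the whole batch.
// The callback receives the input .wav path and the target .srt path.
static void RunBatch(string directory, Action<string, string> transcribe)
{
    foreach (var (wav, srt) in PlanBatch(directory))
    {
        if (File.Exists(srt)) continue;   // already done in an earlier run
        Console.WriteLine($"Transcribing {wav}");
        transcribe(wav, srt);
    }
}

// Example call (the directory and the callback body are placeholders):
// RunBatch(@"D:\game\wav", (wav, srt) => { /* run Whisper on wav, write srt */ });
```
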
<4> The output needs a little tidying up to conform to the SRT format.

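For reference, an SRT cue is a running number, a time range written as hh:mm:ss,mmm --> hh:mm:ss,mmm, the text, and then a blank line. Below is a minimal sketch of a writer for that layout. ToSrtTime and ToSrt are hypothetical helper names, and the segments are plain tuples rather than any Whisper.net type.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Format a TimeSpan as an SRT timestamp, e.g. 00:00:02,760.
// "fff" truncates to milliseconds; "hh" covers videos shorter than 24 hours.
static string ToSrtTime(TimeSpan t) => t.ToString(@"hh\:mm\:ss\,fff");

// Render segments as SRT cues: index, time range, text, blank line.
static string ToSrt(IEnumerable<(TimeSpan Start, TimeSpan End, string Text)> segments)
{
    var sb = new StringBuilder();
    var index = 1;
    foreach (var (start, end, text) in segments)
    {
        sb.Append(index++).Append('\n');
        sb.Append(ToSrtTime(start)).Append(" --> ").Append(ToSrtTime(end)).Append('\n');
        sb.Append(text.Trim()).Append('\n').Append('\n');
    }
    return sb.ToString();
}

// Two segments written out as cues.
var demo = new[]
{
    (new TimeSpan(0, 0, 0, 0, 0), new TimeSpan(0, 0, 0, 2, 760), "(birds chirping)"),
    (new TimeSpan(0, 0, 0, 3, 660), new TimeSpan(0, 0, 0, 5, 900), "(exhaling)"),
};
Console.Write(ToSrt(demo));
```

The sketch writes LF line endings; some players may prefer CRLF, in which case the '\n' appends can be swapped for "\r\n".
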
Below is the console output for one WAV file (the opening cinematic of 幽魂):

    whisper_init_from_file_no_state: loading model from 'ggml-large.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab = 51865
    whisper_model_load: n_audio_ctx = 1500
    whisper_model_load: n_audio_state = 1280
    whisper_model_load: n_audio_head = 20
    whisper_model_load: n_audio_layer = 32
    whisper_model_load: n_text_ctx = 448
    whisper_model_load: n_text_state = 1280
    whisper_model_load: n_text_head = 20
    whisper_model_load: n_text_layer = 32
    whisper_model_load: n_mels = 80
    whisper_model_load: ftype = 1
    whisper_model_load: qntvr = 0
    whisper_model_load: type = 5
    whisper_model_load: mem required = 3557.00 MB (+ 71.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx = 2951.27 MB
    whisper_model_load: model size = 2950.66 MB
    whisper_init_state: kv self size = 70.00 MB
    whisper_init_state: kv cross size = 234.38 MB
    New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    New Segment: 00:00:03.6600000 ==> 00:00:05.9000000 : (exhaling)
    New Segment: 00:00:05.9000000 ==> 00:00:08.6600000 : (birds chirping)
    New Segment: 00:00:08.6600000 ==> 00:00:35.1200000 : (gun firing)
    New Segment: 00:00:36.1200000 ==> 00:00:38.5400000 : (gun firing)
    New Segment: 00:00:39.0600000 ==> 00:00:41.4800000 : (gun firing)
    New Segment: 00:00:41.4800000 ==> 00:00:49.4000000 : (tires screeching)
    New Segment: 00:00:49.4000000 ==> 00:00:58.5800000 : (glass shattering)
    New Segment: 00:00:58.5800000 ==> 00:01:07.7400000 : (singing in foreign language)
    New Segment: 00:01:07.7400000 ==> 00:01:11.5800000 : (singing in foreign language)
    New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)
    New Segment: 00:01:17 ==> 00:01:24.8400000 : (singing in foreign language)
    New Segment: 00:01:24.8400000 ==> 00:01:28.6400000 : (panting)
    New Segment: 00:01:36.7800000 ==> 00:01:39.2000000 : (gun firing)
    New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.
    New Segment: 00:01:43.4600000 ==> 00:01:45.6200000 : - Oh God.
    New Segment: 00:01:45.6200000 ==> 00:01:48.2000000 : - What's the matter sweetheart?
    New Segment: 00:01:48.2000000 ==> 00:01:50.4200000 : Oh.
    New Segment: 00:01:50.4200000 ==> 00:01:53.4600000 : - Oh it's horrible.
    New Segment: 00:01:53.4600000 ==> 00:01:55.3000000 : - Shh.
    New Segment: 00:01:55.3000000 ==> 00:02:02.3400000 : It was just a bad dream.
    New Segment: 00:02:05.4200000 ==> 00:02:09.8800000 : - You don't ever have to be afraid of anything.
    New Segment: 00:02:09.8800000 ==> 00:02:12.8000000 : I'll always be here to protect you.
    New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)
    New Segment: 00:02:16.4800000 ==> 00:02:19.0600000 : (gentle music)
    New Segment: 00:02:19.0600000 ==> 00:02:21.6400000 : (gentle music)
    New Segment: 00:02:21.6400000 ==> 00:02:24.2200000 : (gentle music)
    New Segment: 00:02:24.5400000 ==> 00:02:27.1200000 : (gentle music)
    New Segment: 00:02:27.1200000 ==> 00:02:29.7000000 : (gentle music)
    New Segment: 00:02:29.7000000 ==> 00:02:33.1800000 : [Music]
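
If the console output has already been saved to a text file, the "New Segment" lines can be parsed back into start time, end time and text, and everything else skipped. This is a sketch that assumes the exact line layout shown above; ParseSegments is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Pull (start, end, text) out of lines like
// "New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)".
static List<(TimeSpan Start, TimeSpan End, string Text)> ParseSegments(IEnumerable<string> lines)
{
    var pattern = new Regex(@"^New Segment: (\S+) ==> (\S+) : (.*)$");
    var segments = new List<(TimeSpan Start, TimeSpan End, string Text)>();
    foreach (var line in lines)
    {
        var match = pattern.Match(line.Trim());
        if (!match.Success) continue;   // model-loading lines, blanks, etc.

        // TimeSpan.Parse accepts both "00:01:17" and "00:00:02.7600000".
        var start = TimeSpan.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
        var end = TimeSpan.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);
        segments.Add((start, end, match.Groups[3].Value.Trim()));
    }
    return segments;
}

// Parse a few lines taken from the output above and print what was found.
var sample = new[]
{
    "whisper_model_load: n_mels = 80",
    "New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)",
    "New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.",
};
foreach (var segment in ParseSegments(sample))
    Console.WriteLine($"{segment.Start} | {segment.End} | {segment.Text}");
```

The leading "- " on dialogue lines is part of the recognized text, so the sketch keeps it; whether to strip it, or to drop sound annotations such as (gentle music), is a matter of taste.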