Years ago I had the idea of adding subtitles to game videos that ship without them.
The conditions weren't there at the time, but they now seem to be.
Whisper is OpenAI's open-source speech recognition software.
It has a .NET version, and with a few small modifications to that version, the speech in a game video can be transcribed into SRT-format subtitles.
After that, run the SRT file through an online batch translator, make a few adjustments, and the localization work is done.
The repository is here:
https://github.com/sandrohanea/whisper.net

It is best to build with VS2022; otherwise you will hit many .NET SDK version problems.

Once it is built, there are a few things to note:
<0> Change the model file to the large model, ggml-large.bin; this model gives better results.
Of course, it also takes more time: converting a batch of files will probably take several hours, or even dozens of hours.
<1> Language must be set to "english".
    // Original call, which read the language from opt:
    /* var builder = factory.CreateBuilder()
        .WithLanguage(opt.Language);*/
    // Modified call, with the language hard-coded:
    var builder = factory.CreateBuilder()
        .WithLanguage("english");
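<2> By default it seems to accept only WAV input, and the audio must have a 16 kHz sample rate. Files need to be converted to this format beforehand, otherwise an error occurs.

<3> The sample as shipped only converts a single example WAV file; this needs to be changed to batch processing
(iterating over all the files in a directory).

<4> The output needs a little tidying to conform to the SRT format.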
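
For points <2> and <3>, here is a rough sketch of how the pre-conversion could be batched outside the C# program. It is not what I changed in Whisper.net itself. It assumes ffmpeg is installed and on the PATH (ffmpeg is my choice here, any converter would do), and the names build_ffmpeg_cmd, plan_batch and convert_all are made up for this example. Mono output is also an assumption on my part; the only hard requirement mentioned above is 16 kHz WAV.

```python
from pathlib import Path
import subprocess


def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", str(src),       # input video or audio file
        "-vn",                # ignore any video stream
        "-ac", "1",           # mono (assumption, see note above)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]


def plan_batch(in_dir: Path, out_dir: Path,
               exts=(".mp4", ".avi", ".wav")) -> list[tuple[Path, Path]]:
    """Pair every matching file in in_dir with a .wav target in out_dir."""
    sources = sorted(
        p for p in in_dir.iterdir()
        if p.is_file() and p.suffix.lower() in exts
    )
    return [(src, out_dir / (src.stem + ".wav")) for src in sources]


def convert_all(in_dir: Path, out_dir: Path) -> None:
    """Convert every matching file; raises if ffmpeg reports an error."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src, dst in plan_batch(in_dir, out_dir):
        subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Something like convert_all(Path("videos"), Path("wav16k")) would then fill wav16k with files ready for recognition. Keep the output directory separate from the input one, so that a source .wav is never overwritten by its own converted copy.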

Below is the console output for one WAV file (the opening cinematic of 幽魂, i.e. Phantasmagoria):

    whisper_init_from_file_no_state: loading model from 'ggml-large.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab = 51865
    whisper_model_load: n_audio_ctx = 1500
    whisper_model_load: n_audio_state = 1280
    whisper_model_load: n_audio_head = 20
    whisper_model_load: n_audio_layer = 32
    whisper_model_load: n_text_ctx = 448
    whisper_model_load: n_text_state = 1280
    whisper_model_load: n_text_head = 20
    whisper_model_load: n_text_layer = 32
    whisper_model_load: n_mels = 80
    whisper_model_load: ftype = 1
    whisper_model_load: qntvr = 0
    whisper_model_load: type = 5
    whisper_model_load: mem required = 3557.00 MB (+ 71.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx = 2951.27 MB
    whisper_model_load: model size = 2950.66 MB
    whisper_init_state: kv self size = 70.00 MB
    whisper_init_state: kv cross size = 234.38 MB
    New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    New Segment: 00:00:03.6600000 ==> 00:00:05.9000000 : (exhaling)
    New Segment: 00:00:05.9000000 ==> 00:00:08.6600000 : (birds chirping)
    New Segment: 00:00:08.6600000 ==> 00:00:35.1200000 : (gun firing)
    New Segment: 00:00:36.1200000 ==> 00:00:38.5400000 : (gun firing)
    New Segment: 00:00:39.0600000 ==> 00:00:41.4800000 : (gun firing)
    New Segment: 00:00:41.4800000 ==> 00:00:49.4000000 : (tires screeching)
    New Segment: 00:00:49.4000000 ==> 00:00:58.5800000 : (glass shattering)
    New Segment: 00:00:58.5800000 ==> 00:01:07.7400000 : (singing in foreign language)
    New Segment: 00:01:07.7400000 ==> 00:01:11.5800000 : (singing in foreign language)
    New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)
    New Segment: 00:01:17 ==> 00:01:24.8400000 : (singing in foreign language)
    New Segment: 00:01:24.8400000 ==> 00:01:28.6400000 : (panting)
    New Segment: 00:01:36.7800000 ==> 00:01:39.2000000 : (gun firing)
    New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.
    New Segment: 00:01:43.4600000 ==> 00:01:45.6200000 : - Oh God.
    New Segment: 00:01:45.6200000 ==> 00:01:48.2000000 : - What's the matter sweetheart?
    New Segment: 00:01:48.2000000 ==> 00:01:50.4200000 : Oh.
    New Segment: 00:01:50.4200000 ==> 00:01:53.4600000 : - Oh it's horrible.
    New Segment: 00:01:53.4600000 ==> 00:01:55.3000000 : - Shh.
    New Segment: 00:01:55.3000000 ==> 00:02:02.3400000 : It was just a bad dream.
    New Segment: 00:02:05.4200000 ==> 00:02:09.8800000 : - You don't ever have to be afraid of anything.
    New Segment: 00:02:09.8800000 ==> 00:02:12.8000000 : I'll always be here to protect you.
    New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)
    New Segment: 00:02:16.4800000 ==> 00:02:19.0600000 : (gentle music)
    New Segment: 00:02:19.0600000 ==> 00:02:21.6400000 : (gentle music)
    New Segment: 00:02:21.6400000 ==> 00:02:24.2200000 : (gentle music)
    New Segment: 00:02:24.5400000 ==> 00:02:27.1200000 : (gentle music)
    New Segment: 00:02:27.1200000 ==> 00:02:29.7000000 : (gentle music)
    New Segment: 00:02:29.7000000 ==> 00:02:33.1800000 : [Music]
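
As an illustration of the tidying mentioned in point <4>, here is a minimal sketch that turns console output like the above into SRT text after the fact. It is a separate script, not the change I made inside the C# program, and the names SEGMENT_RE, to_srt_time and log_to_srt are invented for this example. It assumes every subtitle line has the shape "New Segment: start ==> end : text" and that no clip runs longer than 24 hours.

```python
import re

# Matches e.g. "New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian."
SEGMENT_RE = re.compile(r"New Segment:\s*(\S+)\s*==>\s*(\S+)\s*:\s?(.*)$")


def to_srt_time(ts: str) -> str:
    """Convert '00:01:43.4600000' or '00:01:17' to SRT's 'HH:MM:SS,mmm'."""
    hms, _, frac = ts.partition(".")
    millis = (frac + "000")[:3]  # pad or truncate the fraction to 3 digits
    return f"{hms},{millis}"


def log_to_srt(log_text: str) -> str:
    """Build SRT text from the 'New Segment' lines, skipping everything else."""
    blocks = []
    for line in log_text.splitlines():
        match = SEGMENT_RE.search(line)
        if not match:
            continue  # model-loading lines and other noise
        start, end, text = match.groups()
        index = len(blocks) + 1  # SRT cues are numbered from 1
        blocks.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)  # one blank line between cues
```

Sound-effect cues such as "(gun firing)" are kept as they are; if you do not want them in the final subtitles, they are easy to filter out before translation.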