Years ago I had the idea of adding subtitles to game videos that ship without them.
The conditions were not ripe back then, but by now they seem to be.

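Whisper is OpenAI's open-source speech recognition software.
It has a .NET version; with a few small modifications to that version, the speech in a game video can be transcribed into a subtitle file in SRT format.
After that, run the SRT file through an online batch translation, make a few manual adjustments, and the localization is done.
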
The project is here:
https://github.com/sandrohanea/whisper.net

It is best to build it with VS2022; otherwise you will run into a lot of .NET SDK version problems.
Once it builds, there are a few points to note:

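<0> Change the model file to the large model, ggml-large.bin; this model gives better results.
Of course, it also takes more time: converting a batch of files will probably take several hours, or even tens of hours.
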
<1> Language must be set to "english".
    // Hard-code English instead of using opt.Language.
    /* var builder = factory.CreateBuilder()
           .WithLanguage(opt.Language); */
    var builder = factory.CreateBuilder()
        .WithLanguage("english");
<2> By default only the WAV format seems to be supported, and it has to be sampled at 16 kHz; files need to be converted to this format beforehand, otherwise an error occurs.
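One way to do the conversion ahead of time is to call ffmpeg from C#. The following is only a minimal sketch: it assumes ffmpeg is installed and on the PATH, the function names are made up for illustration, and the mono / 16-bit PCM settings are my assumptions (the only requirement stated above is the 16 kHz sample rate).

```csharp
using System;
using System.Diagnostics;

// Build the ffmpeg invocation that writes a 16 kHz WAV file.
// Mono and 16-bit PCM are assumptions, not requirements stated in the post.
static ProcessStartInfo BuildFfmpegCommand(string inputPath, string outputWavPath)
{
    var psi = new ProcessStartInfo("ffmpeg") { UseShellExecute = false };
    foreach (var arg in new[]
    {
        "-y",                 // overwrite the output file if it already exists
        "-i", inputPath,      // source video or audio file
        "-vn",                // drop the video stream
        "-ac", "1",           // mono (assumption)
        "-ar", "16000",       // 16 kHz sample rate
        "-c:a", "pcm_s16le",  // 16-bit PCM (assumption)
        outputWavPath,
    })
    {
        psi.ArgumentList.Add(arg);
    }
    return psi;
}

// Run ffmpeg and fail loudly if the conversion did not succeed.
static void ConvertToWav16k(string inputPath, string outputWavPath)
{
    using var process = Process.Start(BuildFfmpegCommand(inputPath, outputWavPath))
        ?? throw new InvalidOperationException("Could not start ffmpeg.");
    process.WaitForExit();
    if (process.ExitCode != 0)
        throw new InvalidOperationException($"ffmpeg exited with code {process.ExitCode}.");
}

// Print the command for a hypothetical file; convert for real when two paths are given.
var command = BuildFfmpegCommand("intro.avi", "intro.wav");
Console.WriteLine($"{command.FileName} {string.Join(" ", command.ArgumentList)}");
if (args.Length == 2)
    ConvertToWav16k(args[0], args[1]);
```

Passing the arguments through ArgumentList rather than one long string avoids quoting problems with file names that contain spaces.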
<3> By default the program only converts a single sample WAV file; this needs to be changed to batch mode (iterating over all the files in a directory).

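The batch change could look roughly like the sketch below. It only shows the directory traversal; `transcribe` is a hypothetical callback standing in for the existing single-file code, and skipping files that already have an .srt next to them is my own addition (handy when a run takes many hours and gets interrupted), not something the original program does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Pair every .wav file in a directory with the .srt path to write next to it.
// Sorting keeps the processing order reproducible between runs.
static IEnumerable<(string Wav, string Srt)> PlanBatch(string directory)
{
    return Directory.EnumerateFiles(directory, "*.wav")
        .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
        .Select(path => (Wav: path, Srt: Path.ChangeExtension(path, ".srt")));
}

// Process a whole directory; `transcribe` is a placeholder for the single-file logic.
static void TranscribeDirectory(string directory, Action<string, string> transcribe)
{
    foreach (var (wav, srt) in PlanBatch(directory))
    {
        if (File.Exists(srt)) continue; // already done in an earlier run
        transcribe(wav, srt);
    }
}

// Dry run: list what would be processed in the given (or current) directory.
var inputDirectory = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
TranscribeDirectory(inputDirectory, (wav, srt) => Console.WriteLine($"{wav} -> {srt}"));
```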
<4> The output needs a little tidying up to conform to the SRT format.
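For reference, an SRT cue consists of a running number, a line of the form "HH:MM:SS,mmm --> HH:MM:SS,mmm", the text, and a blank line. A minimal sketch of writing segments in that shape follows; the `ToSrt` name and the tuple layout are illustrative assumptions, and the sample values are two lines from the opening dialogue of the video.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Render segments as SRT cues: 1-based counter, timing line with
// comma-separated milliseconds, the text, then an empty line.
static string ToSrt(IEnumerable<(TimeSpan Start, TimeSpan End, string Text)> segments)
{
    const string stamp = @"hh\:mm\:ss\,fff"; // custom TimeSpan format, literals escaped
    var srt = new StringBuilder();
    var index = 1;
    foreach (var (start, end, text) in segments)
    {
        srt.Append(index++).Append('\n');
        srt.Append(start.ToString(stamp)).Append(" --> ").Append(end.ToString(stamp)).Append('\n');
        srt.Append(text.Trim()).Append("\n\n");
    }
    return srt.ToString();
}

// Two segments (days, hours, minutes, seconds, milliseconds).
var demoSegments = new[]
{
    (new TimeSpan(0, 0, 1, 39, 200), new TimeSpan(0, 0, 1, 43, 460), "- Adrian."),
    (new TimeSpan(0, 0, 1, 43, 460), new TimeSpan(0, 0, 1, 45, 620), "- Oh God."),
};
Console.Write(ToSrt(demoSegments));
```

Note that the "hh" specifier only covers 0 to 23 hours, which is enough for game cutscenes but would need changing for recordings longer than a day.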

Below is the console output for one WAV file (the opening cinematic of 幽魂):
    whisper_init_from_file_no_state: loading model from 'ggml-large.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab = 51865
    whisper_model_load: n_audio_ctx = 1500
    whisper_model_load: n_audio_state = 1280
    whisper_model_load: n_audio_head = 20
    whisper_model_load: n_audio_layer = 32
    whisper_model_load: n_text_ctx = 448
    whisper_model_load: n_text_state = 1280
    whisper_model_load: n_text_head = 20
    whisper_model_load: n_text_layer = 32
    whisper_model_load: n_mels = 80
    whisper_model_load: ftype = 1
    whisper_model_load: qntvr = 0
    whisper_model_load: type = 5
    whisper_model_load: mem required = 3557.00 MB (+ 71.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx = 2951.27 MB
    whisper_model_load: model size = 2950.66 MB
    whisper_init_state: kv self size = 70.00 MB
    whisper_init_state: kv cross size = 234.38 MB
    New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    New Segment: 00:00:03.6600000 ==> 00:00:05.9000000 : (exhaling)
    New Segment: 00:00:05.9000000 ==> 00:00:08.6600000 : (birds chirping)
    New Segment: 00:00:08.6600000 ==> 00:00:35.1200000 : (gun firing)
    New Segment: 00:00:36.1200000 ==> 00:00:38.5400000 : (gun firing)
    New Segment: 00:00:39.0600000 ==> 00:00:41.4800000 : (gun firing)
    New Segment: 00:00:41.4800000 ==> 00:00:49.4000000 : (tires screeching)
    New Segment: 00:00:49.4000000 ==> 00:00:58.5800000 : (glass shattering)
    New Segment: 00:00:58.5800000 ==> 00:01:07.7400000 : (singing in foreign language)
    New Segment: 00:01:07.7400000 ==> 00:01:11.5800000 : (singing in foreign language)
    New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)
    New Segment: 00:01:17 ==> 00:01:24.8400000 : (singing in foreign language)
    New Segment: 00:01:24.8400000 ==> 00:01:28.6400000 : (panting)
    New Segment: 00:01:36.7800000 ==> 00:01:39.2000000 : (gun firing)
    New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.
    New Segment: 00:01:43.4600000 ==> 00:01:45.6200000 : - Oh God.
    New Segment: 00:01:45.6200000 ==> 00:01:48.2000000 : - What's the matter sweetheart?
    New Segment: 00:01:48.2000000 ==> 00:01:50.4200000 : Oh.
    New Segment: 00:01:50.4200000 ==> 00:01:53.4600000 : - Oh it's horrible.
    New Segment: 00:01:53.4600000 ==> 00:01:55.3000000 : - Shh.
    New Segment: 00:01:55.3000000 ==> 00:02:02.3400000 : It was just a bad dream.
    New Segment: 00:02:05.4200000 ==> 00:02:09.8800000 : - You don't ever have to be afraid of anything.
    New Segment: 00:02:09.8800000 ==> 00:02:12.8000000 : I'll always be here to protect you.
    New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)
    New Segment: 00:02:16.4800000 ==> 00:02:19.0600000 : (gentle music)
    New Segment: 00:02:19.0600000 ==> 00:02:21.6400000 : (gentle music)
    New Segment: 00:02:21.6400000 ==> 00:02:24.2200000 : (gentle music)
    New Segment: 00:02:24.5400000 ==> 00:02:27.1200000 : (gentle music)
    New Segment: 00:02:27.1200000 ==> 00:02:29.7000000 : (gentle music)
    New Segment: 00:02:29.7000000 ==> 00:02:33.1800000 : [Music]
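If the console text itself is used as the starting point, each "New Segment" line can be split back into a start time, an end time and the text. The sketch below is one possible way to do that, assuming the line layout shown in the output above; the helper names are made up, and treating fully bracketed text such as "(gentle music)" or "[Music]" as non-speech is my own heuristic, not part of Whisper or Whisper.net.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Parse one "New Segment: <start> ==> <end> : <text>" line.
// Returns null for anything else, such as the model-loading messages.
static (TimeSpan Start, TimeSpan End, string Text)? ParseSegmentLine(string line)
{
    var match = Regex.Match(line, @"^\s*New Segment: (\S+) ==> (\S+) : (.*)$");
    if (!match.Success) return null;
    var start = TimeSpan.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
    var end = TimeSpan.Parse(match.Groups[2].Value, CultureInfo.InvariantCulture);
    return (start, end, match.Groups[3].Value.Trim());
}

// Heuristic (assumption): text wrapped entirely in () or [] is a sound annotation.
static bool IsNonSpeech(string text)
{
    return (text.StartsWith("(") && text.EndsWith(")"))
        || (text.StartsWith("[") && text.EndsWith("]"));
}

var sampleLines = new[]
{
    "whisper_model_load: n_mels = 80",
    "New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)",
    "New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.",
    "New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)",
};
foreach (var sampleLine in sampleLines)
{
    var parsed = ParseSegmentLine(sampleLine);
    if (!parsed.HasValue) continue;            // not a segment line
    var segment = parsed.Value;
    var kind = IsNonSpeech(segment.Text) ? "sound" : "speech";
    Console.WriteLine($"{kind}: {segment.Start} -> {segment.End} {segment.Text}");
}
```

Whether the sound annotations should be dropped or kept and translated is a matter of taste; the flag only makes that choice easy during the manual adjustment pass.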