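Years ago I already had the idea of adding subtitles to game videos that ship without them.
The conditions were not ripe back then, but they now seem to be.

Whisper is OpenAI's open-source speech recognition software.
A .NET version is available; with a few small modifications to it, the speech in a game video can be recognized and written out as subtitles in SRT format.
After that, batch-translate the SRT file with an online translator, make a few minor adjustments, and the Chinese localization is done.
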
The repository is here:
https://github.com/sandrohanea/whisper.net

It is best to build with VS2022; otherwise you will run into many .NET SDK version problems.

Once it is built, there are a few things to note:

<0> Change the model file to the large model, ggml-large.bin; this model gives better results.
Of course, it also takes longer: converting a batch of files may take several hours, or even tens of hours.

<1> Language must be set to "english":
    // Original: the language is taken from the options object.
    /* var builder = factory.CreateBuilder()
        .WithLanguage(opt.Language); */
    // Modified: always recognize English.
    var builder = factory.CreateBuilder()
        .WithLanguage("english");

<2> By default it seems to support only WAV input, and only at a 16 kHz sample rate; files must be converted to this format first, otherwise an error occurs.
<3> By default it only converts a single sample WAV file; this needs to be changed to batch processing (iterating over all the files in a directory).

<4> The output needs a little tidying up to conform to the SRT format.
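
Notes <2> and <3> could be sketched roughly as follows. This is only an illustration, not the code from the post: it assumes ffmpeg is installed and on the PATH (the post does not say which converter was used), it assumes mono 16-bit PCM in addition to the 16 kHz rate mentioned above, and the names `BatchConvert`, `ConvertTo16kWav`, `ProcessDirectory` and the `transcribe` callback are made up for the example. The callback stands in for whatever recognition code the modified sample runs on each WAV file.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class BatchConvert
{
    // Builds ffmpeg arguments for a 16 kHz, mono, 16-bit PCM WAV file.
    // Mono and 16-bit PCM are assumptions; the post only mentions 16 kHz WAV.
    public static string[] BuildFfmpegArgs(string input, string output) => new[]
    {
        "-y",                 // overwrite the output file without asking
        "-i", input,          // source video or audio file
        "-vn",                // ignore any video stream
        "-ar", "16000",       // resample to 16 kHz
        "-ac", "1",           // downmix to mono
        "-c:a", "pcm_s16le",  // 16-bit little-endian PCM
        output
    };

    // Runs ffmpeg (assumed to be on the PATH) and fails loudly on error.
    public static void ConvertTo16kWav(string input, string output)
    {
        var psi = new ProcessStartInfo("ffmpeg") { UseShellExecute = false };
        foreach (var arg in BuildFfmpegArgs(input, output))
            psi.ArgumentList.Add(arg);

        using var process = Process.Start(psi)
            ?? throw new InvalidOperationException("Could not start ffmpeg.");
        process.WaitForExit();
        if (process.ExitCode != 0)
            throw new InvalidOperationException(
                $"ffmpeg failed for {input} (exit code {process.ExitCode}).");
    }

    // Converts every file in inputDir into wavDir, then hands each WAV file
    // to the caller-supplied transcribe callback (hypothetical).
    public static void ProcessDirectory(string inputDir, string wavDir, Action<string> transcribe)
    {
        Directory.CreateDirectory(wavDir);
        foreach (var file in Directory.EnumerateFiles(inputDir))
        {
            var wav = Path.Combine(wavDir, Path.GetFileNameWithoutExtension(file) + ".wav");
            ConvertTo16kWav(file, wav);
            transcribe(wav);
        }
    }
}
```

Keeping `wavDir` separate from `inputDir` avoids overwriting a source file that is already a WAV and avoids picking up freshly written files during the loop.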
Below is the console output for one WAV file (the opening cinematic of 幽魂, i.e. Phantasmagoria):

    whisper_init_from_file_no_state: loading model from 'ggml-large.bin'
    whisper_model_load: loading model
    whisper_model_load: n_vocab = 51865
    whisper_model_load: n_audio_ctx = 1500
    whisper_model_load: n_audio_state = 1280
    whisper_model_load: n_audio_head = 20
    whisper_model_load: n_audio_layer = 32
    whisper_model_load: n_text_ctx = 448
    whisper_model_load: n_text_state = 1280
    whisper_model_load: n_text_head = 20
    whisper_model_load: n_text_layer = 32
    whisper_model_load: n_mels = 80
    whisper_model_load: ftype = 1
    whisper_model_load: qntvr = 0
    whisper_model_load: type = 5
    whisper_model_load: mem required = 3557.00 MB (+ 71.00 MB per decoder)
    whisper_model_load: adding 1608 extra tokens
    whisper_model_load: model ctx = 2951.27 MB
    whisper_model_load: model size = 2950.66 MB
    whisper_init_state: kv self size = 70.00 MB
    whisper_init_state: kv cross size = 234.38 MB
    New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    New Segment: 00:00:03.6600000 ==> 00:00:05.9000000 : (exhaling)
    New Segment: 00:00:05.9000000 ==> 00:00:08.6600000 : (birds chirping)
    New Segment: 00:00:08.6600000 ==> 00:00:35.1200000 : (gun firing)
    New Segment: 00:00:36.1200000 ==> 00:00:38.5400000 : (gun firing)
    New Segment: 00:00:39.0600000 ==> 00:00:41.4800000 : (gun firing)
    New Segment: 00:00:41.4800000 ==> 00:00:49.4000000 : (tires screeching)
    New Segment: 00:00:49.4000000 ==> 00:00:58.5800000 : (glass shattering)
    New Segment: 00:00:58.5800000 ==> 00:01:07.7400000 : (singing in foreign language)
    New Segment: 00:01:07.7400000 ==> 00:01:11.5800000 : (singing in foreign language)
    New Segment: 00:01:11.5800000 ==> 00:01:17 : (tires screeching)
    New Segment: 00:01:17 ==> 00:01:24.8400000 : (singing in foreign language)
    New Segment: 00:01:24.8400000 ==> 00:01:28.6400000 : (panting)
    New Segment: 00:01:36.7800000 ==> 00:01:39.2000000 : (gun firing)
    New Segment: 00:01:39.2000000 ==> 00:01:43.4600000 : - Adrian.
    New Segment: 00:01:43.4600000 ==> 00:01:45.6200000 : - Oh God.
    New Segment: 00:01:45.6200000 ==> 00:01:48.2000000 : - What's the matter sweetheart?
    New Segment: 00:01:48.2000000 ==> 00:01:50.4200000 : Oh.
    New Segment: 00:01:50.4200000 ==> 00:01:53.4600000 : - Oh it's horrible.
    New Segment: 00:01:53.4600000 ==> 00:01:55.3000000 : - Shh.
    New Segment: 00:01:55.3000000 ==> 00:02:02.3400000 : It was just a bad dream.
    New Segment: 00:02:05.4200000 ==> 00:02:09.8800000 : - You don't ever have to be afraid of anything.
    New Segment: 00:02:09.8800000 ==> 00:02:12.8000000 : I'll always be here to protect you.
    New Segment: 00:02:12.9200000 ==> 00:02:15.5000000 : (gentle music)
    New Segment: 00:02:16.4800000 ==> 00:02:19.0600000 : (gentle music)
    New Segment: 00:02:19.0600000 ==> 00:02:21.6400000 : (gentle music)
    New Segment: 00:02:21.6400000 ==> 00:02:24.2200000 : (gentle music)
    New Segment: 00:02:24.5400000 ==> 00:02:27.1200000 : (gentle music)
    New Segment: 00:02:27.1200000 ==> 00:02:29.7000000 : (gentle music)
    New Segment: 00:02:29.7000000 ==> 00:02:33.1800000 : [Music]
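
One possible way to tidy console output of this shape into SRT is sketched below. It is an illustration rather than the method used in the post: it assumes the `New Segment: start ==> end : text` line layout shown in the console output, and the names `SrtFromConsole`, `FormatTimestamp` and `ToSrt` are invented for the example. Lines that are not segments, such as the `whisper_model_load` messages, are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class SrtFromConsole
{
    // Matches lines such as:
    //   New Segment: 00:00:00 ==> 00:00:02.7600000 : (birds chirping)
    private static readonly Regex SegmentLine = new Regex(
        @"New Segment:\s*(?<start>\S+)\s*==>\s*(?<end>\S+)\s*:\s*(?<text>.*)$");

    // SRT timestamps use the form HH:MM:SS,mmm (comma before the milliseconds).
    public static string FormatTimestamp(TimeSpan t) =>
        string.Format(CultureInfo.InvariantCulture, "{0:00}:{1:00}:{2:00},{3:000}",
            (int)t.TotalHours, t.Minutes, t.Seconds, t.Milliseconds);

    // Turns console lines into SRT text: a running index, a time range,
    // the subtitle text, and a blank line after each entry.
    public static string ToSrt(IEnumerable<string> consoleLines)
    {
        var srt = new StringBuilder();
        var index = 1;
        foreach (var line in consoleLines)
        {
            var match = SegmentLine.Match(line);
            if (!match.Success)
                continue; // not a segment line (for example model loading messages)

            if (!TimeSpan.TryParse(match.Groups["start"].Value, CultureInfo.InvariantCulture, out var start) ||
                !TimeSpan.TryParse(match.Groups["end"].Value, CultureInfo.InvariantCulture, out var end))
                continue; // timestamps not in the expected form

            srt.AppendLine(index.ToString(CultureInfo.InvariantCulture));
            srt.AppendLine($"{FormatTimestamp(start)} --> {FormatTimestamp(end)}");
            srt.AppendLine(match.Groups["text"].Value.Trim());
            srt.AppendLine();
            index++;
        }
        return srt.ToString();
    }
}
```

With the console output saved to a text file (the file names here are placeholders), `File.WriteAllText("intro.srt", SrtFromConsole.ToSrt(File.ReadLines("intro.log")))` would write the subtitle file; the first segment above becomes entry 1 with the time range `00:00:00,000 --> 00:00:02,760`.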