Playstation CPU reversing [56k Warning]
RLE control progress: found 2 counters:
1) first is downcounter which takes number of empty RLE values (LEN in no$psx docs) and count them down to 0. it generates 3 signals:
- current cunter value not zero
- input LEN value not zero
- counter finish counting (current value zero but start value was not zero)
This counter directly manages inputs to unit05 where decoded RLE stored. If this counter is working then all result zeroed through NANDs.
2) second counter just works no matter the other signals exept clock. The outpit is 3 main signals:
- counter not zero
- counter not overflow
- current counter value
This counter can count from 0 to 63 and incremented with each decoded value.
But there are a lot more control structures.
1) first is downcounter which takes number of empty RLE values (LEN in no$psx docs) and count them down to 0. it generates 3 signals:
- current cunter value not zero
- input LEN value not zero
- counter finish counting (current value zero but start value was not zero)
This counter directly manages inputs to unit05 where decoded RLE stored. If this counter is working then all result zeroed through NANDs.
2) second counter just works no matter the other signals exept clock. The outpit is 3 main signals:
- counter not zero
- counter not overflow
- current counter value
This counter can count from 0 to 63 and incremented with each decoded value.
But there are a lot more control structures.
Just finish reversing RLE address counter. This is counter that sets up addres for store result of RLE decoding. It must be in zigzag order, but real result are:
00 01 08 10 09 02 03 0a
11 18 20 19 12 0b 04 05
0c 13 1a 21 28 30 29 22
1b 14 0d 06 07 0e 15 1c
23 2a 31 38 39 32 2b 24
1d 16 0f 17 1e 25 2c 33
3a 3b 34 2d 26 1f 27 2e
35 3c 3d 36 2f 37 3e 3f
i don't think this is error. Most likely algorythm is more puzzling than it seems
00 01 08 10 09 02 03 0a
11 18 20 19 12 0b 04 05
0c 13 1a 21 28 30 29 22
1b 14 0d 06 07 0e 15 1c
23 2a 31 38 39 32 2b 24
1d 16 0f 17 1e 25 2c 33
3a 3b 34 2d 26 1f 27 2e
35 3c 3d 36 2f 37 3e 3f
i don't think this is error. Most likely algorythm is more puzzling than it seems
That's easy: It looks like zagzig:
for i=0 to 63, zagzig[zigzag]=i, next i
for i=0 to 63, zagzig[zigzag]=i, next i
nocash wrote:That's easy: It looks like zagzig:
for i=0 to 63, zagzig[zigzag]=i, next i
Ow I got it =)
I never see that there are zagzig in function in your documents. I always read it as zigzag =)
Thanks.
Now I have only one question, but this is for futher exploration.
RLE finished.
While I busy with RLE algorythm description i want to tell you one error i found in your RLE decoding:
Error: if quantization factor in DCT == 0 then zagzig order not used. All data stored directly one after another.
Code: Select all
rl_decode_block(blk,src,qt)
for i=0 to 63, blk[i]=0, next i ;initially zerofill all entries (for skip)
@@skip:
n=[src], src=src+2, k=0 ;get first entry, init dest addr k=0
if n=FE00h then @@skip ;ignore padding (FE00h as first halfword)
q_scale=(n SHR 10) AND 3Fh ;contains scale value (not "skip" value)
val=signed10bit(n AND 3FFh)*qt[k] ;calc first value (without q_scale/8) (?)
@@lop:
if q_scale=0 then val=signed10bit(n AND 3FFh)*2 ;special mode without qt[k]
val=minmax(val,-400h,+3FFh) ;saturate to signed 11bit range
val=val*scalezag[i] ;<-- for "fast_idct_core" only
blk[zagzig[k]]=val ;store entry
n=[src], src=src+2 ;get next entry (or FE00h end code)
k=k+((n SHR 10) AND 3Fh)+1 ;skip zerofilled entries
val=(signed10bit(n AND 3FFh)*qt[k]*q_scale+4)/8 ;calc value for next entry
if k<=63 then jump @@lop ;should end with n=FE00h (that sets k>63)
idct_core(blk)
return (with "src" address advanced)
Holy crap dude for whatever this is good, great work.
-
Administrator Verified
- Admin / PSXDEV
- Posts: 2689
- Joined: Dec 31, 2012
- I am a: Shadow
- PlayStation Model: H2000/5502
Absolutely amazing. Excellent work. I have changed your rank.
Development Console: SCPH-5502 with 8MB RAM, MM3 Modchip, PAL 60 Colour Modification (for NTSC), PSIO Switch Board, DB-9 breakout headers for both RGB and Serial output and an Xplorer with CAETLA 0.34.
PlayStation Development PC: Windows 98 SE, Pentium 3 at 400MHz, 128MB SDRAM, DTL-H2000, DTL-H2010, DTL-H201A, DTL-S2020 (with 4GB SCSI-2 HDD), 21" Sony G420, CD-R burner, 3.25" and 5.25" Floppy Diskette Drives, ZIP 100 Diskette Drive and an IBM Model M keyboard.
PlayStation Development PC: Windows 98 SE, Pentium 3 at 400MHz, 128MB SDRAM, DTL-H2000, DTL-H2010, DTL-H201A, DTL-S2020 (with 4GB SCSI-2 HDD), 21" Sony G420, CD-R burner, 3.25" and 5.25" Floppy Diskette Drives, ZIP 100 Diskette Drive and an IBM Model M keyboard.
mousefx wrote:Holy crap dude for whatever this is good, great work.
Thanks. Now I'm working on overall RLE control which links RLE with IDCT. Soon I'll post new screens )Shadow wrote:Absolutely amazing. Excellent work. I have changed your rank.
-
- What is PSXDEV?
- Posts: 3
- Joined: Aug 25, 2013
Very nice progess, Akari. I really love this project.
Anyway, I think that I now understand most of the RLE circuit, so let me report my findings, maybe it can be of help to everyone as well. I will refer to this version of Akari's RLE circuit schematic that I structured into its components. Of course, my descriptions are tentative and may contain errors, so feel free to correct me!
The counters are mostly self-explanatory, so I will omit them here. At the bottom, we have the control circuit that outputs the clocks for most other components. It basically contains a finite state machine that implements a two-level pipeline: one that enters a new stage every 4 cycles (right shift register), and one that changes every cycle (left shift register), but repeats every 4 cycles. This is done so that the components can be orchestrated in a pipelined fashion, where each component can start in an arbitrary cycle and be pipelined either every 4 or every cycle. For example, the multipliers are clocked every master clock cycle but output the result only every four cycles to the next pipeline stage. This is because the multipliers need four adds to get the result, so every cycle an add must be executed, and every 4 cycles the product is ready. Whether the state machine enters the next state is controlled mostly by the counter flags (full, zero, etc.), but also by some other flags (e.g. data == 0xfe00).
In the center of the schematic we have the quantization table multiplier that multiplies the data with the elements of the quantization table/matrix. I am fairly sure that this is a variant of the Booth multiplication algorithm where instead of bit slices of length 2, we look at bit slices of length 3. This particular multiplier works according to this algorithm. I can elaborate if desired. (By the way, I noticed that the multiplier of the IDCT is exactly the same algorithm, just a different, more parallelized, hardware implementation). At the end, before the result is written to the next stage, the result is clamped to 15 bits (the multiplier itself calculates the the full 19=11+8 bits of the result).
Next is the quantization scale multiplier. It's again the same algorithm, but implemented a bit differently. Because every element of our block needs to be multiplied by QS, the shift register that selects the next 3-bit-slice is cyclic and repeats every 4 cycles. What I tagged as "constant factor" is actually the factor 8 that is used with the very first element (it's probably 8 in order to compensate for the division by 8 in the following steps). The result of the multiplication drops bits 0 and 1, which is equivalent to a right shift by 2, or an integer division by 4 (the last perceived division by 2 probably stems from the very last step, but I'm not too sure; additional input would from others would be appreciated).
Finally, the result is clamped to 12 bits, and then, as Akari mentioned, rounded to the next even number closer to zero (except -1, which stays the same). I do not know why this is done; does -1 have a special meaning? I am not too knowledgeable about JPEG compression. But I guess this is where the last perceived division by 2 comes from; it's not actually a division (that's why I wrote "perceived"), but the precision is effectively reduced by one bit.
So overall the steps should be:
1. res = clamp_to_15_bits(val * qt[k])
2. res = clamp_to_12_bits((res * qs) >> 2)
3. res = res == -1 ? -1 : round_to_even_number_closer_to_zero(res)
Anyway, I think that I now understand most of the RLE circuit, so let me report my findings, maybe it can be of help to everyone as well. I will refer to this version of Akari's RLE circuit schematic that I structured into its components. Of course, my descriptions are tentative and may contain errors, so feel free to correct me!
The counters are mostly self-explanatory, so I will omit them here. At the bottom, we have the control circuit that outputs the clocks for most other components. It basically contains a finite state machine that implements a two-level pipeline: one that enters a new stage every 4 cycles (right shift register), and one that changes every cycle (left shift register), but repeats every 4 cycles. This is done so that the components can be orchestrated in a pipelined fashion, where each component can start in an arbitrary cycle and be pipelined either every 4 or every cycle. For example, the multipliers are clocked every master clock cycle but output the result only every four cycles to the next pipeline stage. This is because the multipliers need four adds to get the result, so every cycle an add must be executed, and every 4 cycles the product is ready. Whether the state machine enters the next state is controlled mostly by the counter flags (full, zero, etc.), but also by some other flags (e.g. data == 0xfe00).
In the center of the schematic we have the quantization table multiplier that multiplies the data with the elements of the quantization table/matrix. I am fairly sure that this is a variant of the Booth multiplication algorithm where instead of bit slices of length 2, we look at bit slices of length 3. This particular multiplier works according to this algorithm. I can elaborate if desired. (By the way, I noticed that the multiplier of the IDCT is exactly the same algorithm, just a different, more parallelized, hardware implementation). At the end, before the result is written to the next stage, the result is clamped to 15 bits (the multiplier itself calculates the the full 19=11+8 bits of the result).
Next is the quantization scale multiplier. It's again the same algorithm, but implemented a bit differently. Because every element of our block needs to be multiplied by QS, the shift register that selects the next 3-bit-slice is cyclic and repeats every 4 cycles. What I tagged as "constant factor" is actually the factor 8 that is used with the very first element (it's probably 8 in order to compensate for the division by 8 in the following steps). The result of the multiplication drops bits 0 and 1, which is equivalent to a right shift by 2, or an integer division by 4 (the last perceived division by 2 probably stems from the very last step, but I'm not too sure; additional input would from others would be appreciated).
Finally, the result is clamped to 12 bits, and then, as Akari mentioned, rounded to the next even number closer to zero (except -1, which stays the same). I do not know why this is done; does -1 have a special meaning? I am not too knowledgeable about JPEG compression. But I guess this is where the last perceived division by 2 comes from; it's not actually a division (that's why I wrote "perceived"), but the precision is effectively reduced by one bit.
So overall the steps should be:
1. res = clamp_to_15_bits(val * qt[k])
2. res = clamp_to_12_bits((res * qs) >> 2)
3. res = res == -1 ? -1 : round_to_even_number_closer_to_zero(res)
Thanks a lot, you help me fix last error in first multiplication circuit. Now it works correctly with all values. Did you use Logisim when you work with circuit? If so - i upload fixed circuit to svn (https://code.google.com/p/psxdev/source ... /IDCT.circ)LostTemplar wrote:In the center of the schematic we have the quantization table multiplier that multiplies the data with the elements of the quantization table/matrix. I am fairly sure that this is a variant of the Booth multiplication algorithm where instead of bit slices of length 2, we look at bit slices of length 3. This particular multiplier works according to this algorithm. I can elaborate if desired. (By the way, I noticed that the multiplier of the IDCT is exactly the same algorithm, just a different, more parallelized, hardware implementation). At the end, before the result is written to the next stage, the result is clamped to 15 bits (the multiplier itself calculates the the full 19=11+8 bits of the result).
By the way we can give you access to wiki where you can write all your findings related to circuits.
I have algorithm of RLE that I wrote when I try to understand how circuit works. Decoding is split into 4 stages. After each stage data went to next stage, while previous stage receive next data to process. Each stage has 4 substages.
Code: Select all
STAGE 0
sub XX1X_XXX1:
Set ADDRESS_COUNTER_4_Q as address for UNIT4 and set bit A5 (all other substages has bit A5 set to 0 and address taken from ADDRESS_COUNTER_4_D).
clk17 - read quant data from UNIT4 (8 bit 0xFF) into start triggers of QUANT_DIVIDER.
sub X1XX_XXX1:
set QUANT_DIVIDER to read data from start triggers (all other substages it takes shifted data from main triggers).
clk20 - read quant data from start triggers of QUANT_DIVIDER into main triggers.
clk16 - read data (less significant 10 bit 0x03FF) into start triggers of SUMMATOR1 (start triggers also are 1/3 direct value triggers).
if ADDRESS_COUNTER_4_Q equal to zero (ac4_q_n0 == 0, we decoding first value ie DCT) then:
set Q_SCALE_DIVIDER read from input (all other situation it takes shifted data from main triggers).
clk23:
write padding_s (check that data equals to 0xFE00) from data into trigger.
write qsc_n0_s into trigger (check that q_scale != 0).
clk21 - write q_scale (most significant 6 bit 0xFC00) to q_scale counter triggers for SUMMATOR2.
if ADDRESS_COUNTER_4_Q not equal to zero (ac4_q_n0 != 0, we decoding not first value ie RLE) then clk19:
read len (most significant 6 bit 0xFC00) into LEN_COUNTER, if current value 0, else decrement counter by 1.
sub 1XXX_XXX1:
if padding_s equals zero (data not 0xFE00):
set SUMMATOR1 to use data from start triggers.
clk24:
write check that ADDRESS_COUNTER_4_Q not equal zero to ac4_q_n0_s in trigger1/2.
write check that LEN_COUNTER equal zero to lc_0_s in trigger 1/3.
calculate and write nnull_s to trigger1/3.
increment ADDRESS_COUNTER_4_Q by 1.
clk15 - read SUMMATOR1 result into result triggers.
clk20 - QUANT_DIVIDER tick.
if ac4_d_lock not equal 1:
if LEN_COUNTER not working (lc_n0 == 0) or current data falue equal 0xFE00) then clk37:
increment ADDRESS_COUNTER_4_D.
STAGE 1
sub XXX0_XX1X:
clk15 - read SUMMATOR1 result into result triggers.
clk20 - QUANT_DIVIDER tick.
sub XX1X_XX1X:
clk15 - read SUMMATOR1 result into result triggers.
clk20 - QUANT_DIVIDER tick.
if qsc_n0_s == 0 (q_scale == 0) then clk32 - read data into second direct value triggers.
sub X1XX_XX1X:
clk15 - read SUMMATOR1 result into result triggers.
sub 1XXX_XX1X:
clk22:
write ac4_q_n0_s into trigger 2/2.
write lc_0_s into trigger 2/3.
write nnull_s into trigger 2/3.
write clamped value of SUMMATOR1 result into start triggers of SUMMATOR2.
STAGE 2
sub XXX0_X1XX:
clk18 - read SUMMATOR2 result into result triggers.
clk21 - Q_SCALE_DIVIDER tick.
if qsc_n0_s == 0 (q_scale == 0) then clk32 - read data into third direct value triggers.
sub XX1X_X1XX:
clk18 - read SUMMATOR2 result into result triggers.
clk21 - Q_SCALE_DIVIDER tick.
sub X1XX_X1XX:
clk18 - read SUMMATOR2 result into result triggers.
clk21 - Q_SCALE_DIVIDER tick.
sub 1XXX_X1XX:
clk27:
write lc_0_s into trigger 3/3.
write nnull_s into trigger 3/3.
clk18 - read SUMMATOR2 result into result triggers.
clk21 - Q_SCALE_DIVIDER tick.
STAGE 3
sub XXX0_1XXX:
clk36 - write result to final triggers:
if ndirect_s == 0 then result taken from direct triggers (if lc_0_s == 0 then result nullified)
if ndirect_s != 0 then result taken from SUMMATOR2 (with final clamping and roundings))
id ac5_d_nfull == 0 then save 1 into trigger -709- (on next clock it will reset LEN_COUNTER and switch output trigger for UNIT05 with manipulates A5.
sub XX1X_1XXX:
set IE1 or IE2 for UNIT05 to write result from final triggers.
sub 1XXX_1XXX:
clk35 - increment ADDRESS_COUNTER_5_D by 1.
clock for ADDRESS_COUNTER_5_Z according to inner condition it will be clk33 and/or clk34).
Yes, I can confirm that first value (DCT) left shifted by 1instead of right shifting by 2 as usual values. So first value are 8 times bigger than others.LostTemplar wrote:What I tagged as "constant factor" is actually the factor 8 that is used with the very first element (it's probably 8 in order to compensate for the division by 8 in the following steps).
I found one more error in my circuit, fix it and simulate second multiplication. It works as you described. It just multiplies two numbers and drop two less significant bits.(the last perceived division by 2 probably stems from the very last step, but I'm not too sure; additional input would from others would be appreciated).
Looss like whole circuit works now.
A bit of progress on YUV to RGB conversion.
Data from UNIT20 (result from IDCT) expanded to 10 bits by duplicate bit 4 and 6. And then used in some multiplication. I can't get second multiplier yet.
Data from UNIT20 (result from IDCT) expanded to 10 bits by duplicate bit 4 and 6. And then used in some multiplication. I can't get second multiplier yet.
New update: still trqacing YUV to RGB. Last week it was a bit part of control logic. Now all cells are in upper 05 part, near RLE.
Real numbers used for YUVtoRGB conversion are:
Code: Select all
G=(-88/256 * B)+(-311/256 * R)
R=(359/256 * R)
B=(454/256 * B)
Quick review about progress: YUV to RGB almost finished. Now I try to finish citcuit that pack converted data to 32bit buffer. Circuit integrated with IDCT control and uses stream of data from IDCT to speedup converting process.
Thats overall progress for now )
Thats overall progress for now )
Are you sure? That G formula doesn't look correct to me. According to the tests I ran a year or two ago on my SCPH 7002 unit, it should be:Akari wrote:Real numbers used for YUVtoRGB conversion are:
Code: Select all
G=(-88/256 * B)+(-311/256 * R) R=(359/256 * R) B=(454/256 * B)
G=(-88/256 * B)+(-183/256 * R)
Sorry, my mistake. Of course it's -183.AmiDog wrote:Are you sure? That G formula doesn't look correct to me. According to the tests I ran a year or two ago on my SCPH 7002 unit, it should be:Akari wrote:Real numbers used for YUVtoRGB conversion are:
Code: Select all
G=(-88/256 * B)+(-311/256 * R) R=(359/256 * R) B=(454/256 * B)
G=(-88/256 * B)+(-183/256 * R)
In bits this is:
0101100111 = 359
0111000110 = 454
1110101000 = -88
1101001001 = -183
This is the circuit that creates those constants. Less significant bit is on the left.
After small break i finish circuit that packes result bits of yuv to rgb conversion into 32bits words of result buffer. Circuit uses depth output bits from control register.
Found where register 1F801824 written and 1F801820 read from data bus.
and where
Found where register 1F801824 written and 1F801820 read from data bus.
Code: Select all
1F801820h - MDEC0 - MDEC Command/Parameter Register (W)
31-0 Command or Parameters
Used to send command word, followed by parameter words to the MDEC (usually, only the command word is written to this register, and the parameter words are transferred via DMA0).
1F801824h - MDEC1 - MDEC Status Register (R)
31 Data-Out Fifo Empty (0=No, 1=Empty)
30 Data-In Fifo Full (0=No, 1=Full, or Last word received)
29 Command Busy (0=Ready, 1=Busy receiving or processing parameters)
28 Data-In Request (set when DMA0 enabled and ready to receive data)
27 Data-Out Request (set when DMA1 enabled and ready to send data)
26-25 Data Output Depth (0=4bit, 1=8bit, 2=24bit, 3=15bit) ;CMD.28-27
24 Data Output Signed (0=Unsigned, 1=Signed) ;CMD.26
23 Data Output Bit15 (0=Clear, 1=Set) (for 15bit depth only) ;CMD.25
22-19 Not used (seems to be always zero)
18-16 Current Block (0..3=Y1..Y4, 4=Cr, 5=Cb) (or for mono: always 4=Y)
15-0 Number of Parameter Words remaining minus 1 (FFFFh=None) ;CMD.Bit0-15
If there's data in the output fifo, then the Current Block bits are always set to the current output block number (ie. Y1..Y4; or Y for mono) (this information is apparently passed to the DMA1 controller, so that it knows if and how it must re-order the data in RAM). If the output fifo is empty, then the bits indicate the currently processsed incoming block (ie. Cr,Cb,Y1..Y4; or Y for mono).
Reverse how cpu handles data that writen to register 1F801820h and readed from register 1F801824h
1F801820h register that controls MDEC.
1 - Decode Macroblock
2 - Set Quant Table
3 - Set Scale Table
Martin's description is right:
Only things that i can add is that in case of command 2 and 3 Number of Parameter Words (биты 0-15) forcely set to value 64. In case of others commands data readed from bits 0-15.
Bit 0 from command 2 (Color) setted from databus only in case of command 2. For all other commands it is set to 1 and ignore bit 0. (I think it is not used though).
Also phrase "should be zero" not nessesary.. Bit's 0-15 or 1-15 can has any value. It will be ignored anyway.
Current progress can be seen on picture. Light green are newest
1F801820h register that controls MDEC.
1 - Decode Macroblock
2 - Set Quant Table
3 - Set Scale Table
Martin's description is right:
Code: Select all
MDEC(1) - Decode Macroblock(s)
31-29 Command (1=decode_macroblock)
28-27 Data Output Depth (0=4bit, 1=8bit, 2=24bit, 3=15bit) ;STAT.26-25
26 Data Output Signed (0=Unsigned, 1=Signed) ;STAT.24
25 Data Output Bit15 (0=Clear, 1=Set) (for 15bit depth only) ;STAT.23
24-16 Not used (should be zero)
15-0 Number of Parameter Words (size of compressed data)
This command is followed by one or more Macroblock parameters (usually, all macroblocks for the whole image are sent at once).
MDEC(2) - Set Quant Table(s)
31-29 Command (2=set_iqtab)
28-1 Not used (should be zero) ;Bit25-28 are copied to STAT.23-26 though
0 Color (0=Luminance only, 1=Luminance and Color)
The command word is followed by 64 unsigned parameter bytes for the Luminance Quant Table (used for Y1..Y4), and if Command.Bit0 was set, by another 64 unsigned parameter bytes for the Color Quant Table (used for Cb and Cr).
MDEC(3) - Set Scale Table
31-29 Command (3=set_scale)
28-0 Not used (should be zero) ;Bit25-28 are copied to STAT.23-26 though
Bit 0 from command 2 (Color) setted from databus only in case of command 2. For all other commands it is set to 1 and ignore bit 0. (I think it is not used though).
Also phrase "should be zero" not nessesary.. Bit's 0-15 or 1-15 can has any value. It will be ignored anyway.
Current progress can be seen on picture. Light green are newest
Who is online
Users browsing this forum: No registered users and 0 guests