16 July 2012

TCBs and SRBs and FRRs, oh my! - part 2

Debugging the SRB and associated FRR

UPDATE: See the end for a nice piece of helpful information that can remove hours, gray hairs, and wrinkles from the debugging process!


Debugging an SRB and an associated FRR can be a real PITA. You can't get a traditional SYSUDUMP/SYSABEND; the only thing you can do is try to get an SVC dump, as they are known. These are binary dumps of storage, similar to SYSMDUMP, and have to be processed with IPCS (or AMDPRDMP in the old days).

So, the first problem I ran into was that the SRB was abending with an S0C4. No problem, look at the FRR…but there's no FRR WTO output. There's also no WTO output from the SRB, even the one near the very beginning. Hmmm…so I do what all programmers do – I start limiting the code. I reduce my SRB to just the WTO. Now it shows up. I eye suspiciously my accessing of the product CVT via GPR1, which is the PARM= parameter of the IEAMSCHD macro. Oops. Apparently, I had a bout of fuzzy logic. I was interpreting GPR1 as being a pointer to the fullword specified in PARM=. Nope. GPR1 has the value specified in the area pointed to by PARM=. So from the example from part 1, GPR1 will have the value of GPR10, which was stored in SRBEPARM, which is then pointed to by the PARM=. So I fixed that, and I got the WTO. 


Now I want to drive the FRR, so I added a good old DC H'0' to force an S0C1. Instead, I got an S0C4. I also noticed garbage in the job name/job number fields for the WTO in the hardcopy log. These fields are obtained from the product CVT, so I realized that the product CVT might either be zero or initialized or otherwise garbage. I changed the code so that if the product CVT pointer was invalid, then use a eye-catching value for the job name and job number. That helped, but then I was still not getting the WTOs with the PSW and other information. I then noticed that a system dump had been produced from the WTO. I saved that off, but then I realized that I needed to make sure that the PARM= was actually the product CVT. I added a WTO to print the contents of the area and return, and, sure enough. it was fine. But I was still getting an S0C4 and my FRR was not firing…why was this happening? Then I thought that it must be occurring upon return from the FRR. But after a few more tests, I noticed that the FRR WTO was firing randomly.


During this time, I was involved in an IM conversation with my friend "R",  discussing my frustrations. He reminded me that WTOs are not synchronous by default, and in an SRB this can lead to timing issues. I added SYNCH=YES, and I also added a parameter to do testing inside my address space, not the external address space, as this will help reduce issues and exposure to screwing up the system. And then begins the part of debugging I hate the most – things suddenly start working.


I ran with an option that sets the FRR to be the wonderful z/XDC product from ColeSoft. And it hits my forced S0C1. That's good. I set it back to my FRR. It doesn't hit. I then begin to notice that the LPAR is having issues, and it looks like we will need to IPL. Time to quit for the weekend, the highlight of which was seeing the blues band that two of my high school classmates are in playing in the East Bay. Another classmate, who was on vacation, sat in as well. (All three, along with a fourth classmate, had had their own band during our high school era.)


Just another manic Monday…wish it were Sunday…

So, after a weekend of not looking at this, back to figuring out what's going on. But on Sunday, The Shower Principle hit again; I realized that by scheduling the SRB for testing purposes in the same address space, I can use GTF. Unfortunately, a SLIP Instruction Fetch trap has a limitation of 1M on the RANGE= keyword, so I have to think about to handle this.


Meanwhile, when I looked again at the S0C4 I had on Friday, I realized that it's not dying in the FRR, but on a QSAM (Queued Sequential Access Method) PUT. Since it doesn't occur when z/XDC is in the way, this is when GTF becomes useful. I moved to our sandbox LPAR so if my testing is messing with the LPAR, I will only shoot myself. Working with SRBs is like working with fire - you need to be very careful, because if you make a mistake, you can burn not only yourself, but the system. (And, as it turns out, I was. A wayward store in *MASTER* or in other common storage can have far reaching negative consequences.) Therefore, you need to be testing these types of routines on a sandbox system so you burn only yourself, and in case you need to IPL, you can quickly do so.

So I started GTF and reran the job…and of course, because this is how it's going, it sort of works. It finished with CC 16. That may prove useful, though, and I have output. Oh, wait, I forgot that the LPAR was IPLed on Friday night, so there is no APF authorization. I issue the appropriate APF commands, running a convenient batch job that issues operator commands that I run after an IPL. And the job completes successfully. The FRR fires, but the return information from the IEAMSCHD shows an S0C4. Remember, I have a forced S0C1 to drive the FRR, so the FRR itself must be abending with the S0C4.


Now I download it to my PC (remember, binary!) because I have a Perl script which parses a GTF trace data set and creates a CSV file that can be loaded into Excel for easier perusal and analysis. (I may make it available for download on my web site.) Two things cropped up, though; 1) MATCHLIM was too small; 2) HSM reallocated my TRACE data set and I only collected about 2500 records. So I fix those and try again, more to prove that my GTF parameters work, and they do.


Now how do I trace the GTF? Listings show that the ESQA allocation address is different every time, so I can't really guarantee a range. (ESQA allocation is volatile, so it's not surprising, really.) z/XDC can show me where it's allocated, so I make a test run under z/XDC but without using it as the SRB FRR. For various reasons which I won't go into, this doesn't work. And I'm beginning to think that maybe the S0C4 in the FRR isn't always triggering, and other times random storage is being overlaid. But it appears that I cannot reliably get a GTF trace of the SRB and FRR.


Since I'm dealing with something that is apparently random, I started looking at the FRR code after the two WTOs that I'm seeing. And, oh, boy. I made a horrible coding mistake. I can lay it at the foot of "it's been a while since I worked with SRBs and FRRs," but, still, I should not have made this.


What did I do that is worthy of a *headdesk*? The coding error was in the SETRP macro. If you do not specify WKAREA=, it defaults to assuming that GPR1 points to the SDWA. And which register is used by the WTO macro for the service's parameter list? Yep, you guessed it. And that is why it was purely random, as the address of the SRB code and data area moved around consistently.


Unfortunately, there also seems to be some sort of conflict with z/XDC and my SRB and FRR, so I'm going to take a suggestion that my friend "C" gave me during a conversation last week. Instead of obtaining ESQA and moving the code into the storage, I can split the SRB and FRR into its own load module. I can then use a LOAD GLOBAL=YES,EOM=YES and I then should be able to use GTF to trace any activity there. But there is a question of how I can set up the SLIP trap, because the only way to specify modules are LPA or private. LOAD GLOBAL loads them into CSA, which can be addressed by RANGE. But I don't know where it will be loaded until after the LOAD has executed, which makes it rather difficult to set up the SLIP, unless I do something like coding a WTOR with the address in it. I'd rather not do this, but it may come to that. But back to the head-scratching…


I'm beginning to suspect that the parameter coming in via GPR2 to the FRR isn't quite what I expect it to be. I added a WTO that writes out the value of GPR2 coming in, and my first run looks suspicious – in fact, it's X'0000C00'. That looks rather strange, so I run it again. It's the same. I check with a couple of people…and yes, that is a legitimate address. The FRR stack is located between X'800' and X'1000'. So something else is happening; maybe the SRB is building this 24-byte area wrong.


I added WTOs dumping the 24 bytes pointed to by GPR2. (Note that SDWAPARM also has this address.) And this shows what I've done. Somewhere along the way, the instruction to store the address of the product CVT has disappeared from the SRB, so the FRR was loading 0 instead. Big oops. Big, big oops. And this may explain a lot of the randomness as well.


So I fix that and rerun, and now the 24-byte area looks correct. But things still don't seem to be working right. I add a WTO here and there, and I realize that I've miscoded an NI instruction to turn off a recursion-prevention flag, forgetting the 255– part; thus it thought it was recursing. I fixed that and the FRR is working for the code I have there. I now add the full FRR code (dump the register/PSW info out of the SDWA), but something is still breaking. I narrow it down to a load of another part of the FRR parm area. Now I haven't purposely munched GPR2, but I add some WTOs to see if maybe branch entry WTO is changing it (the documentation says it does not). Nope, that's not it. 


I start an IM session with "R"; I can't understand why it fails if I include an instruction and works if I skip it. Finally, I move the code up to the very beginning, and suddenly my WTOs become garbage. I then look and I realize that I am munching my constant area pointer with one of these loads. This was left over from when I started working on this code and my register settings were in flux. I change the register and voilà, it works.


*headdesk* *headdesk* *headdesk*


Part of what made this so hard is that I was so familiar with the code, it looked right. I'd lost track of the fact that the original use of that register had changed. This is why when you are really at a loss, a second set of eyes can help immensely. The second set is not familiar with the code, and they can spot anomalies pretty quickly. Back in university, I became known for the person who looked at the code and could spot the issue quickly, and it was because I was the second set of eyes. I did pass this along to others, though, and gradually people picked up on this. Never be afraid to ask others to look at your code, and never be embarrassed if they find something really, really stupid. You just didn't see it, because your brain interpreted the code as you thought you wrote it, not how you did write it.


But, after all this, I have a working FRR routine. Now I can fully test the SRB code, which is connecting an ETX…and it works perfectly the first time out. (This is always suspicious, but it worked when it was not a part of the SRB, so I suspect it is fine.) Now I will need to add the destroy ETX code. Once this is working, I will pass this version onwards so my colleague can begin testing, and then I will code the ESTAEX, which is the subject of part 3.  I will then make some enhancements, including the aforementioned LOAD GLOBAL for the SRB, and also adding LOAD GLOBAL for the PC code.


So, a post-mortem on this pretty much lays it on three things which had overwriting consequences: 

  1. incorrectly interpreting the value in R1 when the SRB gets control;
  2. forgetting the WKAREA= on the SETRP macro; and
  3. wiping out my constant base register.
Oh, one piece of advice I received from "C" that I should not neglect: if you do move the FRR directly into SQA, put it on a doubleword boundary.

Part three, detailing adventures in ESTAEXing, will probably come later this week or early next.


UPDATE: Jim Mulder, who is one of the z/OS people in Pok, reminded me that normally when an FRR abends, a record is written to LOGREC. You can print this off and it will give you lots of information. I know this would have saved me several hours of teeth-gnashing.

No comments:

Post a Comment

Feel free to leave a comment or ask questions.