Stacktraces in ASE

From SybaseWiki
Revision as of 10:39, 17 May 2007 by Psap (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Sometimes your Sybase ASE will produce a stacktrace in the errorlog. At first glance the messages within the stacktrace are obscure and seem very hard to read. In this document some tips and techniques are given to get information out of the stacktrace so you can isolate the problem that caused the stacktrace.

What is a stacktrace?

In my opinion a stacktrace is the result of hitting a bug within the Sybase server. A stacktrace should never occur and it should be reported to Sybase so they can resolve the problem and make a fix (EBF) available (when you have a support contract). However, it does help that you also spent some time working on a stacktrace, because EBF's can take quite some time to become available. In the meantime you will probably need a workaround. The stacktrace itself is a list of all the functions called within ASE from the point where your session connected to the server up to the moment you hit the bug. While reading a stacktrace from top to bottom you will see that the stacktrace is a so called traceback. It starts at the point where you hit the bug and then each function is shown up to the point that your session connected to the server. As we will see later, there are also some other messages given, such as database id.

How to recognize a stacktrace.

As said earlier, a stacktrace is reported in the errorlog. When your client connection suddenly dies you should suspect a stacktrace, especially when you see the message "The SQL Server is terminating this process". Take a look in the errorlog or ask your DBA to take a look.

Sample stack trace.

For readability the first part (cpu number, datetime, etc) of all lines, has been omitted.

kernel current process (0x1f090038) infected with 11
kernel Address 0x0036a39c (expand_ALL_nodes+0x1b4), siginfo (code, address) = (1, 0x0000000c)
kernel ************************************
kernel SQL causing error : exec @error = grid_exceptions @system_id with recompile
kernel ************************************
server SQL Text: exec @error = grid_exceptions @system_id with recompile
kernel curdb = 10 pstat = 0x40010000 lasterror = 0
kernel preverror = 0 transtate = 1
kernel curcmd = 224 program = isql
kernel pc: 0x0064ec6c pcstkwalk+0x24(0x81d1b458, 0x00000000, 0x0000270f, 0x00000002, 0xfffffff8)
kernel pc: 0x0064eb78 ucstkgentrace+0x194(0x1f090038, 0x821f4468, 0x821f4468, 0x81e0c460, 0x00000000)
kernel pc: 0x0061c0d8 ucbacktrace+0xa8(0x81e0c460, 0x00000001, 0x1f090038, 0x821a3ed8, 0x00000000)
kernel pc: 0x000f6238 terminate_process+0xd70(0x00bedc60, 0x000054e8, 0x000054e4, 0xffffffff, 0x00005000)
kernel pc: 0x00631868 kisignal+0x1e0(0x81d1bf94, 0x81d1bcfc, 0x0000000b, 0x81d1bcd0, 0x81d1bf88)
kernel pc: 0xff34905c sigaction+0x740(0x0000000b, 0x81d1bf88, 0x81d1bcd0, 0xff35e000, 0x0000000b)
kernel pc: 0xff34bbcc _getfp+0x220(0x0000000b, 0x81d1bf88, 0x81d1bcd0, 0xff348f78, 0x00bf2128)
kernel pc: 0xff3483b8 _fork+0x850(0x0000000b, 0x00bf2090, 0x00000000, 0x00000000, 0x00000000)
kernel pc: 0x001b2810 lock_logical+0x284(0x00000000, 0x81d1c0f0, 0x00000000, 0x00000003, 0x821f4468)
kernel pc: 0x003668c0 colnames+0x498(0x821f45f0, 0x84173cf0, 0x00000000, 0x821f4468, 0x81d1c1c8)
kernel pc: 0x00366338 colnames_recurse+0xa8(0x000001, 0x000000, 0x85393058, 0x84173dc4, 0x84173d50)
kernel pc: 0x00366244 colnames_driver+0x40(0x84173cf0, 0x84173d50, 0x00000000, 0x84173cf0, 0x87609000)
kernel pc: 0x003a32e0 s_normquery+0x58(0x00006c00, 0x84173d50, 0x821f4468, 0x00000000, 0x00000001)
kernel pc: 0x003a24fc s_normalize+0x274(0x00000001, 0x84173d50, 0x00bdf504, 0x00000000, 0x00000000)
kernel pc: 0x003a3d64 s_renormalize+0xa4(0x00005000, 0x00bdf400, 0x00005388, 0x821f4468, 0x00000000)
kernel pc: 0x003a3a84 s_recompile+0x1b8(0x00000001, 0x873b5348, 0x00005000, 0x00005388, 0x821f4468)
kernel [Handler pc: 0x003a4594 s_handle installed by the following function:-]
kernel pc: 0x003a1100 sequencer+0xc0c(0x00bdf400, 0x00bee800, 0x00005000, 0x00005388, 0x821f4468)
kernel pc: 0x00391938 execproc+0x79c(0x00005388, 0x00007000, 0x821f4468, 0x00005000, 0x00000011)
kernel pc: 0x0038ab14 s_execute+0x291c(0x00bedc00, 0x00be1400, 0x81816340, 0x00000008, 0x00000000)
kernel [Handler pc: 0x003a4594 s_handle installed by the following function:-]
kernel pc: 0x003a15cc sequencer+0x10d8(0x00bdf400, 0x00bedc00, 0x81816340, 0x00005388, 0x821f4468)
kernel pc: 0x0011ce34 tdsrecv_language+0xb4(0x00bdf400, 0x00bee800, 0x000001, 0x00005000, 0x00005400)
kernel end of stack trace, spid 10, kpid 520683576, suid 68

Let's use this stacktrace as an example.

Stacktrace Header (error while executing SQL)

Let's call the first part of the stacktrace the header. There are two types, one for internal server faults and another one that occurred while executing some SQL statement. Let's take a look at this one first.

current process (0x1f090038) infected with 11
Address 0x0036a39c (expand_ALL_nodes+0x1b4), siginfo (code, address) = (1, 0x0000000c)
************************************
SQL causing error : exec @error = grid_exceptions @system_id with recompile
************************************
SQL Text: exec @error = grid_exceptions @system_id with recompile
curdb = 10 pstat = 0x40010000 lasterror = 0
preverror = 0 transtate = 1
curcmd = 224 program = isql

Legend

  • current process: Shows the kernel process id in hexadecimal. The decimal value is 520683576 and is the same as the kpid shown on the last line of this stacktrace.
  • infected with: Reflects the cause of the problem within ASE at the level of the operating system. Most common are 10 and 11, meaning "Bus error" and "Segmentation fault" respectively.
  • SQL causing error: Points to the query that caused the problem. Please note that the bug is ofcourse within ASE but a workaround can be applied in the code shown. Since the introduction of dynamic SQL in version 12.0 you cannot trust this information when your application uses dynamic SQL. The "SQL causing error" will not point to the right SQL code that caused the problem.
  • curdb: Points to the dbid of the current database of the session. Use "select db_name(<curdb>)" (db_name) to retrieve the name of the database.
  • lasterror and preverror: Shows the error number (if any) that is accompanied with the stacktrace. This error can be shown in the errorlog just above the stacktrace and will yield important information to find a workaround for the problem. When the error is not shown you can retrieve the text for the error with "select description from master..sysmessages where error = ...". Alternatively, look in the Troubleshooting and Error Messages Guide
  • transtate: Shows the state of the transaction when the stacktrace occured. The values are the same as for the global variable @@transtate. Possible value are: 0=transaction in progress, 1=transaction succeeded, completed and committed, 2=previous statement aborted, transaction still active, 3=transaction aborted and rolled back.
  • curcmd: Reflects the command that was executing when the stacktrace occurred. I don't have a definitive list for these numbers yet, but this might help: 193 and 196: execute, 195 and 224: select, 197 update.
  • program: The name of the client that connected to the server. It's equivalent to the column program_name in the table master..sysprocesses.

Stacktrace Header (Internal server error)

A different header will be shown when an internal error occurred. An example:

current process (0x518002f) infected with 10
Address 0x00179978 (bufdlink+0x44), siginfo (code, address) = (1, 0x000091c1)
Spinlocks held by kpid 85458991

Spinlock temp_cache at address 81815dc0 owned by 85458991
End of spinlock display.

In this example the stacktrace was caused by the invalid handling of a spinlock. The name of the cache is shown (temp_cache). These errors can be very hard to reproduce and a restart of the server is usually recommended (in this case the server committed suicide).

Legend

  • current process: Shows the kernel process id in hexadecimal. The decimal value is 520683576 and is the same as the kpid shown on the last line of this stacktrace.
  • infected with: Reflects the cause of the problem within ASE at the level of the operating system. Most common are 10 and 11, meaning "Bus error" and "Segmentation fault" respectively.
  • kpid: All server processes have a spid and also a kernel process id. It can be retrieved by using a "select * from master..sysprocesses".

Termination of the server process

The error that caused the problem within the server (either caused by SQL code or an internal error) is catched within the server, the server process is terminated and the stacktrace is reported in the errorlog.

pc: 0x0064ec6c pcstkwalk+0x24(0x81d1b458, 0x00000000, 0x0000270f, 0x00000002, 0xfffffff8)
pc: 0x0064eb78 ucstkgentrace+0x194(0x1f090038, 0x821f4468, 0x821f4468, 0x81e0c460, 0x00000000)
pc: 0x0061c0d8 ucbacktrace+0xa8(0x81e0c460, 0x00000001, 0x1f090038, 0x821a3ed8, 0x00000000)
pc: 0x000f6238 terminate_process+0xd70(0x00bedc60, 0x000054e8, 0x000054e4, 0xffffffff, 0x00005000)
pc: 0x00631868 kisignal+0x1e0(0x81d1bf94, 0x81d1bcfc, 0x0000000b, 0x81d1bcd0, 0x81d1bf88)
pc: 0xff34905c sigaction+0x740(0x0000000b, 0x81d1bf88, 0x81d1bcd0, 0xff35e000, 0x0000000b)
pc: 0xff34bbcc _getfp+0x220(0x0000000b, 0x81d1bf88, 0x81d1bcd0, 0xff348f78, 0x00bf2128)
pc: 0xff3483b8 _fork+0x850(0x0000000b, 0x00bf2090, 0x00000000, 0x00000000, 0x00000000)

Don't get confused by the all the hexadecimal numbers but just look at the names of the functions (as shown in bold here). These functions handle the termination of the process. The information shown is of little use so let's carry straight on to the next part that details the problem within the server.

What went wrong?

pc: 0x001b2810 lock_logical+0x284(0x00000000, 0x81d1c0f0, 0x00000000, 0x00000003, 0x821f4468)
pc: 0x003668c0 colnames+0x498(0x821f45f0, 0x84173cf0, 0x00000000, 0x821f4468, 0x81d1c1c8)
pc: 0x00366338 colnames_recurse+0xa8(0x00000001, 0x00000000, 0x85393058, 0x84173dc4, 0x84173d50)
pc: 0x00366244 colnames_driver+0x40(0x84173cf0, 0x84173d50, 0x00000000, 0x84173cf0, 0x87609000)
pc: 0x003a32e0 s_normquery+0x58(0x00006c00, 0x84173d50, 0x821f4468, 0x00000000, 0x00000001)
pc: 0x003a24fc s_normalize+0x274(0x00000001, 0x84173d50, 0x00bdf504, 0x00000000, 0x00000000)
pc: 0x003a3d64 s_renormalize+0xa4(0x00005000, 0x00bdf400, 0x00005388, 0x821f4468, 0x00000000)
pc: 0x003a3a84 s_recompile+0x1b8(0x00000001, 0x873b5348, 0x00005000, 0x00005388, 0x821f4468)
[Handler pc: 0x003a4594 s_handle installed by the following function:-]
pc: 0x003a1100 sequencer+0xc0c(0x00bdf400, 0x00bee800, 0x00005000, 0x00005388, 0x821f4468)
pc: 0x00391938 execproc+0x79c(0x00005388, 0x00007000, 0x821f4468, 0x00005000, 0x00000011)
pc: 0x0038ab14 s_execute+0x291c(0x00bedc00, 0x00be1400, 0x81816340, 0x00000008, 0x00000000)
[Handler pc: 0x003a4594 s_handle installed by the following function:-]
pc: 0x003a15cc sequencer+0x10d8(0x00bdf400, 0x00bedc00, 0x81816340, 0x00005388, 0x821f4468)
pc: 0x0011ce34 tdsrecv_language+0xb4(0x00bdf400, 0x00bee800, 0x00000001, 0x00005000, 0x00005400)

You only have to look at the names of the internal functions and try to figure out what was happening at the time of the crash. Let's look at some: lock_logical is the first and that has probably something to do with locking, colnames, colnames_recurse and colnames_driver have most likely something to do with determining column names. The next four s_normquery, s_normalize, s_renormalize and s_recompile are probably coming from the execution of a stored procedure or trigger that had to be recompiled. There's further proof for that because somewhat later on we see s_handle, sequencer, execproc and s_execute. Now that really is proof of the execution of a stored procedure.

Who can we blame?

At the very last line we see

end of stack trace, spid 10, kpid 520683576, suid 68

Legend

  • spid: The spid (server process id) is shown here. It's a bit overdone because at the start of each line in the errorlog it is also shown. If you look at the full stacktrace example you'll see cpu-number, family-id, spid and datetime.
  • kpid: All server processes have a spid and also a kernel process id. It can be retrieved by using a "select * from master..sysprocesses".
  • suid: The suid (server user id) can be traced back to a user name with "select suser_name(<suid>)". suser_name()

When users are using the same login names is can be very hard to determine which user caused the problem. When you set traceflag 4013 with "dbcc traceon(4013)" dbcc traceon() each logon will be shown in the errorlog together with the name of the client, so that can help you to trace back to the user causing the problem. You can then talk to the person and try to figure out what he/she was doing.

Reproducing the problem

The reproduction of the problem will help you solve the problem. At first glance it seems to be a locking problem (lock_logical leads us to that conclusion). When I was working on this problem it turned out not be a locking problem because when rerunning the query the stack trace re-appeared. Debugging the code lead me to a point where a temporary table was created and also queried in the same stored procedure but in the query the column names were not specified, but just a "select *". Now that makes sense because we also had these colnames, colnames_recurse and colnames_driver function names. Changing the "*" in the select to a named list solved the problem.

Conclusion

When you see a stacktrace it seems very difficult to read but it does give some information to pinpoint the problem. It was also give you some extra understanding about the internals of ASE. When you keep a list of your stacktraces (hopefully a short list) you can use that list to see if you had the problem before and what the reason was for the stacktrace.