Troubleshooting EPICS  
General:

A variety of difficulties arose while we tailored the original CEBAF HV code. In some cases, solving these problems took several days. For documentation purposes, the problems and our solutions are described below.

Problem Listing:

5/13/99 The 'LS' Problem

Source Code Files Affected:

Cause of Problem:
You are using an older version of the EPICS code or an older set of LeCroy mainframe EPROMs. The problem originated as a bug in the mainframe's Arcnet support software. When LeCroy fixed the error, we had to alter our EPICS support code to be compatible with the new output.

Fix:
Make sure you are using a mainframe with the proper EPROMs. The affected chips are U19 and U22; both should be Version 3.13 or later. Also make sure you have up-to-date versions of the files listed above.

Detailed Description:
The mainframe command LS returns a series of hexadecimal numbers corresponding to the summary of changes to each logical unit within a mainframe. A bug coded into the mainframe's EPROMs caused LS to return an error statement in place of the logical unit summary for mainframes containing more than 25 logical units. This wasn't a problem before, because only mainframes using 1469s could possibly contain this many logical units (every LeCroy module is considered to be one logical unit, except for 1469s, which have 2). The error message was sent because Arcnet could not handle a character string as long as the response for more than 25 logical units would require. Since our boot sequence and asynchronous tasks depend on the LS command, we could not control a mainframe with 13 or more 1469 modules via Arcnet.
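
The counting rule above can be sketched in a few lines of Python. This is only an illustration: the helper names and the use of model-number strings are assumptions, not part of the EPICS support code.

```python
# Hypothetical helpers, for illustration only.
# Per the text: every module is one logical unit, except the 1469, which is two.
UNITS_PER_MODULE = {"1469": 2}  # any model not listed counts as 1


def logical_units(modules):
    """Return the total logical-unit count for a list of module model names."""
    return sum(UNITS_PER_MODULE.get(model, 1) for model in modules)


def hits_ls_bug(modules, limit=25):
    """True if the old EPROMs would return an error instead of the LS summary."""
    return logical_units(modules) > limit
```

With this rule, thirteen 1469 modules give 26 logical units and trip the bug, while twelve give 24 and do not.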

Our Solution:
We contacted LeCroy and informed them of the bug. They sent out updated EPROMs, which split the response into 2 lines so that it can be sent over Arcnet. The first line begins with the character 'C', a flag to other software that a continuation line follows. The second line contains no flag character and lacks the standard command label (all other responses from the mainframe are prefixed with the command that caused the response). The support software that handles responses from the mainframes was modified to handle this special case. Currently, it assumes that the very next line of response from that mainframe will be the continuation line. Also note: LeCroy introduced another bug in the new EPROM software: for the case of 24 logical units, the LS response is repeated. They have promised to repair this bug.
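
A minimal sketch of the continuation handling, written in Python for brevity. The function name is hypothetical, and the assumption that the two halves can simply be concatenated once the flag is dropped may not match the real response layout.

```python
def join_ls_response(lines):
    """Merge each 'C'-flagged line with the line that follows it.

    `lines` holds the response strings from ONE mainframe, in arrival order.
    Assumption: the flag is the literal first character of the first line.
    """
    merged = []
    it = iter(lines)
    for line in it:
        if line.startswith("C"):
            # As in the text, assume the very next line is the continuation.
            continuation = next(it, "")
            merged.append(line[1:] + continuation)
        else:
            merged.append(line)
    return merged
```

Like the approach described above, this relies on the continuation arriving immediately after the flagged line; if another response from the same mainframe were interleaved, it would be merged by mistake.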

5/13/99 The '1471' Problem

Source Code Files Affected:

Cause of Problem:
You are most likely using a version of the EPICS software which doesn't support 1471 modules.

Fix:
You need to get an updated version of the above files and recompile.

Detailed Description:
The original hiv EPICS record was based on 1461 modules. The 1471s contained three properties not present in the 1461 -- Measured Peak Current, Peak Current Trip, and Ramp Trip Enable. These properties had to be reflected in the hiv EPICS record, and be included in the support software.

Our Solution:
The three properties were added to the software. The hiv record was expanded to include these properties; those modules without them still have the same fields, but they are initialized to zero and never accessed or changed.

Of special note is the Measured Peak Current. As it is a measured quantity, it must be checked in the seq task (located in seqArcnet.c). This involves adding a code segment that checks the current checksum against the last checksum retrieved and, if a difference is detected, issues a command to retrieve the new value of the measured property.
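
The checksum comparison can be sketched as follows. This is not the code from seqArcnet.c; the function name, the `state` dictionary, and the two callables are hypothetical stand-ins for whatever reads the checksum and fetches the measured value.

```python
def poll_measured(state, read_checksum, fetch_value):
    """Refresh state['value'] only when the checksum has changed.

    Returns True if a new value was fetched, False otherwise.
    """
    checksum = read_checksum()
    if checksum == state.get("checksum"):
        return False  # nothing changed since the last poll
    state["checksum"] = checksum
    state["value"] = fetch_value()  # issue the (more expensive) read command
    return True
```

The point of the pattern is that the measured value is read only when the checksum says something changed, rather than on every pass.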

In general, it is best to duplicate the format used for the existing properties when coding new properties. We advise anyone adding new properties to search the above files for the places that were changed for the 1471 (use 1471 as a keyword) and then make similar changes for the new properties. It may be necessary, for memory reasons, to create a new version of the hiv record, but we have thus far avoided doing this.

5/13/99 The 'scan Task Access Fault' Problem

Source Code Files Affected:

Cause of Problem:
There are two possibilities. The first is that the large number of records is causing the initialization tasks to time out before finishing (see The 'Too Many Records' Problem below).

The other possibility is that the program you are using is issuing commands to the IOC so fast that a race condition develops.

Fix:
If using your own code, try to slow down the rate at which commands are issued to the IOC. A good way of doing this is to change one field at a time, then verify that it has been changed before changing the next field.

Detailed Description:
After we issued a command that changed a large group of HV properties, the IOC would report Access Faults in one or more of the 'scan' tasks that sweep the EPICS records periodically (note that the hiv records are processed both periodically and passively, depending on whether the fields in question are being read or set). We uncovered this problem when testing 1469 modules for the first time. At that point, we were unaware of the 'Too Many Records' problem and believed that the trouble was caused solely by a race condition in the HV test_stand code. Now it seems likely that the 'Too Many Records' problem contributed as well.

Our Solution:
We added a section of code to those subroutines that change the HV properties in large groups. The new section of code forces the program to verify that each change is made before changing the next one. Unfortunately, this slows down the code by quite a bit. The code sections can be disabled by undefining the variable _GROUPVERIFY_ in hv_group.cc. Now that the 'Too Many Records' problem has been discovered and remedied, it may be possible to remove/disable these verification routines.
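
The change-then-verify pattern can be sketched as below. This is not the code in hv_group.cc; `put` and `get` are hypothetical caller-supplied callables that write and read one field, and the retry count and delay are arbitrary.

```python
import time


def set_and_verify(put, get, changes, retries=10, delay=0.1):
    """Apply (field, value) pairs one at a time, confirming each before the next.

    Raises TimeoutError if a field never reads back the requested value.
    """
    for field, value in changes:
        put(field, value)
        for _ in range(retries):
            if get(field) == value:
                break  # confirmed; move on to the next field
            time.sleep(delay)
        else:
            raise TimeoutError(f"{field} did not reach {value!r}")
```

Each change costs at least one read-back, which is why this approach is noticeably slower than issuing the whole group at once.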

5/13/99 The 'Too Many Records' Problem

Source Code Files Affected:

Cause of Problem:
Because of the large number of records, some tasks in the boot sequence responsible for initialization are getting timed out before finishing. Specifically, we have seen Alarm status fields get Undefined conditions, which prevents the record from functioning normally.

Fix:
One of the timeouts defined in HiV.h, ASYNC_TIME_OUT, needs to be increased. This will prevent records from being given an Undefined status. The other TIME_OUTs should be examined to see if they are causing similar problems.

Detailed Description:
Previously, this condition did not yield any sort of warning or error message. When a program tries to access uninitialized fields, it can react strangely. A common reaction is 'scan' tasks having Access Faults and being suspended. Also, MEDM will not be able to display some fields.

Our Solution:
We increased ASYNC_TIME_OUT from 60 to 300. This seemed to be enough for the 700+ records we were testing with. We have noticed no adverse effects from the increase. We have also added an error message to the code, should the timeouts occur again.

5/13/99 The 'hv2db' Problem

Source Code Files Affected:

Cause of Problem:
You are using an older version of hv2db to generate a .db file which has more than 500 hiv records.

Fix:
Update your version of hv2db, or comment out the lines:

# if ($nrec == 501) {
# print OUTPUT "}\n";
# close(OUTPUT);
# print "Database continued in $ARGV[0]_2.db!!!!\n";
# open(OUTPUT,">$ARGV[0]_2.db");
# print OUTPUT "database($ARGV[0]_2) { nowhere() {\n";
# print OUTPUT "}\n";
# }

in your hv2db file. Rebuild your .db file with the new hv2db script.

You could also just make sure that the second .db file, phhv_2.db, was loaded on to the IOC with the original phhv.db.

Detailed Description:
The hv2db Perl script reads .dat files and outputs an EPICS .db file. In an older version of EPICS, the DB was loaded with a binary file, and took so long that spawned tasks on the IOC would time out. The people at CEBAF thus limited their .db files to 500 records or fewer. The remaining records were placed in a file called phhv_2.db. A problem arose for large databases when this second file was not loaded onto the IOC, so many records simply weren't present in EPICS.

Our Solution:
The code that split the records into two files was obsolete and therefore was removed.

5/13/99 The 'MEDM csh script' Problem

Source Code Files Affected:

Cause of Problem:
The IOC doesn't have enough memory to buffer all of the commands it is receiving.

Fix:
There are 3 general ways to deal with this:

  1. Get more memory for the IOC
  2. Free up existing memory on the IOC
  3. Speed up command execution so that the command buffers will empty faster than they are filled.
Option one can only be done by purchasing more RAM.
Option two is probably the best solution: reduce the number of records that the IOC is responsible for. If this is done, the IOC runs a great deal more smoothly and faster. Unfortunately, this isn't always possible.
Option three is a temporary solution -- it becomes increasingly difficult to ensure fast enough execution as the database gets larger. One way of speeding things up is to increase the size of the hash table used by EPICS to store process variables, via the command dbPvdTableSize(table_size). This command is issued in the load script immediately after the iocCore is loaded. The default table size is 512, and it can have any power of 2 value between 256 and 65536. Our tests show that, with our current available memory, large table sizes usually don't leave enough memory for standard IOC operations.
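
The size constraint stated above (a power of 2 between 256 and 65536) is easy to check before editing the load script. The helper below is a hypothetical convenience, not part of EPICS.

```python
def valid_pvd_table_size(n):
    """True if n is a power of 2 in the range 256..65536 inclusive."""
    # A positive integer is a power of 2 exactly when it has a single set bit.
    return 256 <= n <= 65536 and (n & (n - 1)) == 0
```

For example, 512 and 1024 pass this check, while 1000 (not a power of 2) and 128 (too small) do not.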

Detailed Description:
The MEDM csh functions and others like them send a series of commands to the IOC to be executed, and the IOC buffers these commands. If the IOC runs out of buffer space, this problem occurs.

Our Solution:
We are currently looking into purchasing more memory for the IOC. For now, the dbPvdTableSize has been increased to 1024.

Ryan Roth (rothr@db.erau.edu)
Last  modified: 17 May, 1999