Last Update: Jul 1st, 2012

Troubleshooting (v3.x and v4.x)

ET/BWMGR Software and Appliance - Common Problems

Following are common problems and potential solutions for legacy (v3.x and v4.x) ET/BWMGR systems. If you have a current (v5.x) appliance, see the V5 Troubleshooting guide.

WARNING: Outside interface not set. Limiting Disabled

The Outside Interface setting tells the ET/BWMGR which of your bridge ports is the "outside," that is, connected to your upstream network. It's a required setting: as the error message indicates, you cannot do any bandwidth limiting until you've defined an outside interface.

If you've used 'etip' to set up networking, you should not see this message. But if you have done the setup manually, or changed it after the fact, you may have cleared this setting.

To set the "outside" flag, go to the "Interfaces" tab in the ET/BWMGR, select the interface using the check-box to the left of the interface name, then click "Edit."

Select the first check-box, "Outside," and then click "Save" to apply the setting.

Loop Messages on console and/or in /var/log/messages

A bridge configuration depends on each MAC address on your network being accessible via only one port of the bridge. A LOOP occurs when any MAC address can be reached on both sides of the bridge. This is not necessarily a problem if you get one or two isolated messages - especially during testing when you may be moving machines around or plugging them into different ports. If you see a screen full of these messages, this means that two or more bridged ports on the appliance are plugged into the same switch or hub. Specifically, the message tells you that the MAC address was received on both of the listed interfaces. Constant looping can either halt your system or make it painfully slow, and must be resolved. It indicates a serious flaw in your network setup.

In a Bridge configuration, packets cannot pass or you get a lot of errors.

This can be caused by a number of problems.

(If you know you are getting errors on 1 or more interfaces, go to #2)

1) First, check your configuration. Using "bwmgr showbridges", verify that the 2 ports are both in the same bridge group. Make certain that the 2 devices (one on one side of the bridge and one on the other) are both on the same logical network. Make sure you have no rules defined on either interface. Also verify that only the primary bridge interface (shown by showbridges) has an IP address on the logical network that you are bridging. Typically secondary interfaces will have no IP address assigned. You should be able to access the machine from both sides of the device (using a device on the same logical network, of course). First try pinging on the primary interface wire, then on the other. If neither works, then you most likely have a logical setup problem.

2) If that doesn't work, check for errors on the interfaces. Use "netstat -i" in FreeBSD or "ifconfig interface" in LINUX. In the example below, em2 and em3 are the bridge ports, and there are 17 input errors on em3.

# netstat -i

em0   1500            00:25:90:91:9b:d8   240788   0  0   119620  0  0
em0   1500  10.0.0.0  10.0.0.115          225397   -  -   119639  -  -
em1*  1500            00:25:90:91:9b:d9        0   0  0        0  0  0
em2   1500            00:e0:ed:1f:06:fe  2353298   0  0  3194009  0  0
em3   1500            00:e0:ed:1f:06:ff  3194009  17  0  2353298  0  0
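
To pick the interfaces with errors out of a longer listing, something like the following sketch may help. It assumes every data row ends with the six counters Ipkts, Ierrs, Idrop, Opkts, Oerrs and Coll, as in the sample above; older netstat versions without the Idrop column would need different field positions. The function name iface_errors is made up for this illustration.

```sh
#!/bin/sh
# Print interfaces whose input or output error counters are non-zero.
# Assumption: each data row ends with Ipkts Ierrs Idrop Opkts Oerrs Coll,
# so Ierrs is the 5th field from the end and Oerrs the 2nd from the end.
iface_errors() {
    awk 'NF >= 7 {
        ierrs = $(NF-4)
        oerrs = $(NF-1)
        # Skip header rows and rows that show "-" instead of a counter.
        if (ierrs ~ /^[0-9]+$/ && oerrs ~ /^[0-9]+$/ && (ierrs + oerrs) > 0)
            print $1, "ierrs=" ierrs, "oerrs=" oerrs
    }'
}

# Usage (on the appliance):
#   netstat -i | iface_errors
```

Running it twice a few seconds apart while traffic is flowing shows whether the counters are still increasing.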

If you are getting an increasing number of input or output errors while passing data, you may have a wiring problem. Using crossover cables direct to Cisco equipment is a known problem area, as Ciscos generally do not NWAY (i.e., negotiate links) correctly. If you are getting errors, you can usually solve the problem by forcing the interface on the switch and the ET/BWMGR system to the same setting. You can use ifconfig in FreeBSD (see the man page for your interface driver for details on the command and options), and mii-tool in LINUX. If possible, try to use the bwmgr box setting and force the switch or router. If that doesn't work, try to force both. If you can't get that to work, you can put a small switch in between, which will allow separate negotiation by each device. We've found that a cheap switch can often solve the problem.

To set the interface in FreeBSD:
# ifconfig fxp0 media 100baseTX mediaopt full-duplex

would set fxp0 interface to 100Mb/s, Full Duplex. See the fxp man page ('man fxp' on the console) for a list of options.

Bridge won't pass packets - System/Appliance

If you do NOT have a Failover-equipped appliance, one possibility is that the secondary port is not connected. The primary ethernet port (eth0/fxp0) is part of the motherboard and cannot come loose. The secondary port(s) are located in PCI expansion card slots internally, and there is a small chance that they may shift enough during shipping to come out of the slot. If this happens, typically the interface will not be shown in the system at all. From the command line, you can issue the command "bwmgr showbridges". If only fxp0 is listed, this is your likely culprit. You can also check with the "ifconfig" command from the command line. For example:

# ifconfig fxp1

If the system indicates that the device cannot be found, then the second port (the ethernet card in the box) is probably unseated. If you suspect that a board has become unseated, you need to take off the cover and reseat the board. If you do so, make certain that you contact Emerging Technologies support beforehand; otherwise you may void your warranty. Note that this procedure is only required when the port cannot be found; if the port is shown via ifconfig and in the ET/ADMIN (Networking->Network Configuration->Configure Interfaces), reseating the card should not be necessary.

If the interface is present, see the bridge configuration and interface error checks above.

ET/BWMGR is Limiting Too Much

Check your Interrupts / Second

Reducing the number of interrupts per second that your NICs can receive can reduce CPU usage, but it also adds delays to packets. For connections with servers that are many hops away, the extra time can be significant. If you're using the Intel em driver, you can check the interrupt moderation setting with sysctl:

bwmgr# sysctl -a | grep max_ints

dev.em.0.max_ints_min: 500
dev.em.0.max_ints_max: 8000
dev.em.0.max_ints: 4000
dev.em.1.max_ints_min: 500
dev.em.1.max_ints_max: 8000
dev.em.1.max_ints: 4000

max_ints tells you that the controller is limited to 4000 interrupts per second. This means that interrupts are at least 1/4000th of a second apart. The max_ints_min and max_ints_max settings are only used if AIM (Auto Interrupt Moderation) is enabled.
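
To put a number on that delay, the minimum gap between interrupts can be worked out with shell arithmetic. This is only a sketch of the calculation; min_gap_us is a helper name invented here, not part of the ET/BWMGR tools.

```sh
#!/bin/sh
# Minimum spacing between interrupts, in microseconds, for a given
# interrupts-per-second limit (the max_ints value shown by sysctl).
min_gap_us() {
    echo $((1000000 / $1))
}

min_gap_us 4000   # prints 250
```

At 4000 interrupts per second, a packet may therefore wait up to roughly 250 microseconds at the NIC before it is serviced.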

bwmgr# sysctl -a | grep enable_aim
dev.em.0.enable_aim: 0
dev.em.1.enable_aim: 0

AIM should generally be disabled. If your CPU isn't normally loaded, you can safely increase the max_ints setting. While you can do it from the command line with sysctl, the proper way is to change the setting in /boot/loader.conf. You can play with different settings by changing it from the command line as follows (6000 is just an example value):

bwmgr# sysctl dev.em.0.max_ints=6000
bwmgr# sysctl dev.em.1.max_ints=6000

To verify that you're not getting more interrupts per second than you think, or if you're using another controller, you can use systat:

bwmgr# systat -vmstat 1

[Screenshot: systat -vmstat display]

The screen updates every second; the numbers next to em0 and em1 are the interrupts per second for the NICs. Even though max_ints is 4000, the system is only generating 2600 interrupts per second.

Adjust your Shaping Settings

If you have a limit set to (for example) 256000, and you can't get a local application to use that much, these are the likely causes: One, you could be losing packets. Check your interface for errors and look for drops on the rule. You could also have a TCP window problem. Try using different settings for tcpwindow to keep the window from being set too low. Try 5000 to start. A setting of 64000 effectively disables window shaping. There have also been reports that running ET generated kernels on AMD Athlon CPUs results in some timing errors. Rebuilding a custom kernel on the machine seems to fix the problem.

mbuf clusters exhausted error message

If you get an error indicating that mbufs are exhausted, it basically means that your system has run out of system memory and there is no memory available to receive new packets. This can occur when your system is receiving packets faster than it can process them for an extended period of time. This can be caused by an attack, or it may just mean that your system doesn't have enough memory allocated to the kernel to fulfill the settings. mbuf clusters are allocated on the fly, so just because you've allocated 20K buffers doesn't mean that there is enough memory to actually use that many. It's difficult to tell exactly how much kernel memory is in use or how much is left, so the best you can do is try to increase your kernel allocation. Our appliances use an algorithm to decide how much memory to allocate based on the amount of RAM in the system and the most common requirements. You can manually override this by placing a setting in the /boot/loader.conf config file.

First, determine how much memory you have. Before you can decide how much to allocate, you need to know how much you currently have, and how much is currently allocated. This info is displayed (and saved in your /var/log/messages file) on system boot. You can snarf it out with the following commands:

# grep Using /var/log/messages | tail

Kernel Using 150000000 bytes

Now, enter:

# grep "real memory" /var/log/messages | tail
real memory = 528482304 (516096K bytes)

This tells you that you have 512M of RAM in the system.
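
If you'd rather not convert the byte count by hand, a small filter can do it. The sketch below assumes the log line has the "real memory = N (...)" form shown above; real_mem_mb is a name chosen for this example.

```sh
#!/bin/sh
# Read log lines on stdin, take the last "real memory = N" entry,
# and print N converted from bytes to megabytes (1 MB = 1048576 bytes).
real_mem_mb() {
    sed -n 's/.*real memory *= *\([0-9][0-9]*\).*/\1/p' | tail -n 1 |
    while read -r bytes; do
        echo $(($bytes / 1048576))
    done
}

# Usage (on the appliance):
#   real_mem_mb < /var/log/messages
```

For the sample line above this prints 504, slightly less than the nominal 512M installed.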

Next, set the kernel RAM allocation

Typically, you only have 1 user on a bandwidth management appliance at a time, so you don't need a lot of user memory unless you're also using the system as a server or if you're running squid. So on this system, we can set the kernel allocation much higher. To set it to 300M (not K), we can put the following line into /boot/loader.conf:

kern.vm.kmem.size="300000000"

Then you'll need to reboot the system. You've just doubled the amount of memory available to the bandwidth management application.

What if you still get the "mbuf clusters exhausted" message?

If you still get the message, make sure you have your bandwidth manager "maxbuffers" setting several hundred buffers below your clusters setting. You can get your clusters setting with the following command:

sysctl -a | grep clusters
kern.ipc.nmbclusters: 20000

At the time of this writing, 20000 was the default setting. You should have your maxbuffers set to 15000 or so. If this is the case, and you still get the error message, we recommend that you increase your clusters by 20%. You can do this by inserting a line in /boot/loader.conf:

kern.ipc.nmbclusters="24000"

Then reboot the system.
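
The 20% increase is simple integer arithmetic, sketched below in case your current value differs from the default. grow_by_percent is a hypothetical helper name; the only setting it refers to is kern.ipc.nmbclusters from the text above.

```sh
#!/bin/sh
# Print a value increased by a given percentage (integer arithmetic).
# $1 = current value, $2 = percentage increase
grow_by_percent() {
    echo $(($1 * (100 + $2) / 100))
}

new=$(grow_by_percent 20000 20)
# Line to place in /boot/loader.conf:
echo "kern.ipc.nmbclusters=\"$new\""
```

With the default of 20000 clusters this yields the 24000 used in the example line.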

Graph Problems

ET/BWMGR v3.33c and older

If you have a problem viewing graphs in version 3.33c or below, there is likely some problem interfacing to your database or a configuration problem. If the graph is not created or you get a broken graphic symbol on your browser, then you should check your HTTP_ROOT and default graph directory in your ET/BWMGR defaults settings. You can also view the HTML and check the URL that the system is building from the info in your defaults.

ET/BWMGR version 4.0

If you have a broken icon, you can right-click on the icon and "open image in new window". This should print out any error messages. Usually the error is fairly self-explanatory. If you get a "Server cannot be found", then you have a problem with your web server. First, check to see if apache is running on your system. You can just start it and it will complain if it's already running:

# apachectl start

If apache is running and you still can't access the server, then you'll have to debug the connection. Check the address in the browser window, and check your httpd config file, which is /usr/local/www/conf/httpd.conf. By default, your document root should be /usr/local/www/et. If you find that you need to make changes to your config file, you'll need to restart apache with the following command:

# apachectl restart

Problems with bwmgrd on all versions

If you have empty graphs and you are getting hits on the rules that are supposed to be graphed, then the data is not being put into the database for one reason or another. Things to check:

On the main GUI page, bwmgrd status is shown. The status must indicate "Running".

If bwmgrd is not running, go to Administration->System and Server Status, click on Bwmgrd Stats Daemon, and check the current status for the reason. The status only shows the last reported error. For more extensive analysis, you'll need to check /var/log/bwmgrd.log for additional errors. From the console:

# tail -25 /var/log/bwmgrd.log

will show the last 25 lines in the log. If you have a lot of errors since bwmgrd started, check your settings using the "edit defaults" button from the GUI.

When bwmgrd starts, it prints a startup message in the log. So look for the last start and look at the messages after that. Sometimes the error message will tell you what's wrong right away. Some common messages:

Cannot Open MySQL Database

If the database can't be opened, then either mySQL isn't running, or you have a problem with your permissions. You can check mySQL the same way you checked to see if bwmgrd was running in System and Server Status. If mySQL is running, check your defaults and check your database settings and password.

Can't insert system info: Duplicate entry '1226272680' for key 'PRIMARY'

(PRE V5) A duplicate entry message usually means that the same time has occurred twice. This can happen when you set the clock back on your system. This is an issue during daylight saving time changes, and there is no workaround. If you only get this once (or can trace it to a change of the system clock), then you can ignore it. If you get duplicate errors constantly, you may have multiple instances of bwmgrd running.

Failed to Add Data Record (mail): Table './etbwmgr/bwdata' is marked as crashed and should be repaired

If you are getting errors like the above, you likely need to repair your database. The procedure is described below.

MySQL Problems and Database Repair

Before repairing the MySQL database, it's good to have an idea of what the problem is. If you have already checked /var/log/bwmgrd.log, then you may already have identified the error. If not, you should look at /usr/local/var/mysql/HOSTNAME.err (HOSTNAME should be replaced with the hostname you have assigned your appliance.)

On appliances, there is a command-line utility that will attempt to automatically repair the database. You must be the super-user to run this utility. Before running fixdb, you should check the available disk space. If you have less than 60% available on your /usr partition, then you may not be able to repair the database due to insufficient disk space.

# df -h

Check the available space on the /usr partition first. If you have more than 50% available on /usr, then you may proceed with the check and repair.
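
To read the percentage available directly instead of estimating it from the df listing, a small filter along these lines may be useful. It is a sketch that assumes df prints one header line followed by a single data line with the used and available blocks in the third and fourth columns; pct_available is a name invented for this example.

```sh
#!/bin/sh
# Read `df -k <mountpoint>` output on stdin and print the percentage of
# space still available, computed as avail / (used + avail).
# Assumption: line 2 is the only data line; $3 = used, $4 = available.
pct_available() {
    awk 'NR == 2 {
        used = $3; avail = $4
        if (used + avail > 0)
            printf "%d\n", avail * 100 / (used + avail)
    }'
}

# Usage (on the appliance):
#   df -k /usr | pct_available
```

Compare the printed figure against the free-space requirement above before running the repair.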

# fixdb

The 'fixdb' command will shut down bwmgrd and the MySQL database, and attempt to repair your database tables. MySQLd will then be restarted. This can be a slow process, especially with large databases. When the process is complete, if the repair was successful you should see the following line:

Starting mysqld daemon with databases from /usr/local/var/mysql

If you see this line and no further error messages, you can then restart bwmgrd.

# /usr/local/sbin/bwmgrd

If you continue to have database problems after running 'fixdb', then use the manual method below:

First, change your directory to the location of the 'etbwmgr' database files, and list the files.

# cd /usr/local/var/mysql/etbwmgr
# ls -la

drwx------ 2 mysql mysql 512 Oct 11 14:55 .
drwx------ 4 mysql mysql 512 Oct 14 11:24 ..
-rw-rw---- 1 mysql mysql 107696 Oct 14 14:00 bwdata.MYD
-rw-rw---- 1 mysql mysql 24576 Oct 14 14:00 bwdata.MYI
-rw-rw---- 1 mysql mysql 9042 Oct 11 14:55 bwdata.frm
-rw-rw---- 1 mysql mysql 67 Oct 14 14:00 markers.MYD
-rw-rw---- 1 mysql mysql 2048 Oct 14 11:25 markers.MYI
-rw-rw---- 1 mysql mysql 8710 Oct 11 14:55 markers.frm

You should see a listing similar to the above, although the filesizes will be different. If you do not have the same files, and instead see "bwdata.ISD" and "bwdata.ISM", then instead of running "myisamchk" in the below examples, you must run "isamchk" instead.

The next step is to check your database for errors. Below is the output from an uncorrupted database.

# myisamchk bwdata

Checking MyISAM file: bwdata
Data records: 849 Deleted blocks: 0
- check file-size
- check key delete-chain
- check record delete-chain
- check index reference
- check data record references index: 1
- check data record references index: 2
- check data record references index: 3

Please note that if you see the following lines, this does NOT indicate a serious database corruption.

myisamchk: warning: 1 clients is using or hasn't closed the table properly
MyISAM-table 'bwdata' is usable but should be fixed

The key lines to look for in order to determine whether a repair is needed are the last two lines of output:

MyISAM-table 'bwdata' is corrupted
Fix it using switch "-r" or "-o"

If you see errors listed, the next step is to attempt repair.
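
If you are checking several tables, the distinction between the harmless warning and real corruption can be scripted. The sketch below only looks for the "is corrupted" text quoted above in the myisamchk output; needs_repair is a made-up helper name, and the check is no substitute for reading the full output.

```sh
#!/bin/sh
# Succeed (exit 0) if myisamchk output on stdin reports a corrupted table.
# The "usable but should be fixed" warning does not match.
needs_repair() {
    grep -q "is corrupted"
}

# Usage (on the appliance, from the etbwmgr database directory):
#   myisamchk bwdata 2>&1 | needs_repair && echo "bwdata needs repair"
```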

First, shut down bwmgrd and the MySQL server:

# killall bwmgrd
# mysqladmin -p -u root shutdown

(you will be prompted for the password to complete this step.)

Next, back up the /usr/local/var/mysql directory manually or using the "Backup" feature of the ET/Admin.

Then, begin the repair operation. If this fails, it may not be possible to recover any information, unless the failure yields more information about the underlying problem.

# myisamchk -r bwdata

If you have an appliance or your mySQL distribution is built using /var as the default directory, you may not have enough space in the partition to repair your database. In this case, create a temp directory in your /usr partition if you don't already have one, and specify it as the temp directory as follows:

# mkdir /usr/local/temp
# myisamchk --tmpdir=/usr/local/temp -r bwdata

If the repair is successful then you will be able to restart the MySQL server and you are done. If the repair is not successful your only reliable option is a restore from your last backup or to re-create an empty database (You do have backups, right?).

Restart the MySQL server, then bwmgrd:

# /usr/local/bin/safe_mysqld --user=mysql &
# /usr/local/sbin/bwmgrd

If you see big spikes in your Graph Data when rebooting (v4)

If you see big spikes in your graphs that correspond to a reboot, it probably means that bwmgrd was started before mysqld, either because you started them in the wrong order or because of timing issues with system threads. Make certain that you start mysqld before bwmgrd, and that you allow at least 2 seconds in between for mysqld to get its act together. You can do this with a "sleep", or by running something else in-between.

You'll need to manually remove the "spikes" from the database with an SQL DELETE. Figure out approximately what the data size is (from the graphs, noting a duration of 300 seconds) and look for data that far exceeds a normal reading for the graph. So, for example, if the normal high reading is 120kbs for incoming data, that equates to a "bytes_in" value of about 4.5 million bytes (120,000/8 * 300). You could search the database on the given date for values over 8 million, and you should be able to locate your spike data. Just delete the row, as the reading is invalid.
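
The same conversion can be done for your own graphs with a one-line calculation. This sketch treats the graph reading as bits per second, as the worked example does; interval_bytes is a name invented here.

```sh
#!/bin/sh
# Bytes transferred in one graph interval at a given rate.
# $1 = rate in bits per second, $2 = interval length in seconds
interval_bytes() {
    echo $(($1 / 8 * $2))
}

interval_bytes 120000 300   # prints 4500000
```

Anything well above the figure for your normal peak rate, such as the 8 million used above, is a reasonable cutoff when searching for spike rows.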

"Can't Get Statistics" Error Messages

If you see "can't get statistics for rulename" messages in your /var/log/bwmgrd.log file, it means that a rule was deleted (or failed on startup) that the statistical system still thinks should be there. When you delete a rule gracefully from the GUI, the marker file that bwmgrd looks for is removed. If the rule was removed purposefully, you can get rid of this message by deleting the associated file in the /usr/local/etc/bwmgr/config directory.

What to do if your system/appliance doesn't power up properly

If the unit is completely unresponsive (ie, no fan noise, nothing on the screen, no beeping), check all power connections, as well as all switches. The ET/R1500 series have a main power switch on the power supply as well as a "power-on" switch on the front panel. For users with the 2U enclosure option, make sure that you have the correct voltage (see power supply requirements). If the outlet and power cord test good (test on a monitor or other appliance with a standard AC input), and there's still absolutely no response from the unit, contact Emerging Technologies for technical support or RMA service.

If the unit powers up, but freezes before the OS boots, then it's possible that the CPU fan/heatsink has popped off its mount during shipping. Please make a note of what's on the monitor, then power-off the machine and contact Emerging Technologies' technical support.

(LINUX Only) If the boot stops at the message "Starting system logger:", this is likely due to an incorrect or missing DNS setup. You must wait for the current program (syslogd) to time out while trying to get the hostname. This may take up to 3 minutes, so be patient. Once the machine has booted, make sure DNS is enabled and set up properly.

If the unit displays the power-on self-test (POST), but does not find a bootable device, it's possible (although very unlikely) that the IDE cable has come loose from either the hard drive or the motherboard. Please notify Emerging Technologies' support staff before opening the box!

If the monitor remains blank, but the fans start and you hear a series of beeps from the unit, this indicates a problem with the memory. The ET/R1500 units use standard DIMM RAM modules. Either the module has come loose from its seating, or has failed completely. Contact Emerging Technologies' support staff before opening the box and attempting to re-seat the RAM. If re-seating the RAM does not work, we will likely issue an RMA.

Disaster Recovery

This section deals with a situation wherein your appliance does not boot, either due to a crash that fsck (the UNIX "chkdsk" or "scandisk" equivalent) cannot deal with gracefully, or a panic during the boot process. In either case, you can either use the ET/Recovery CD to fix the problem, or take manual control of the appliance at boot time.

If you do not have a Recovery CD, then you must follow the step-by-step instructions below. If you do have a CD, boot the appliance with the CD-ROM in the drive, and use the "fix" command to repair and mount the appliance filesystems automatically. If you are experiencing a panic, you can make the necessary changes after running "fix", since the appliance filesystem will be accessible in the "/mnt" directory. See the ET/Recovery Manual for more information.

Manual Instructions:

FreeBSD

Hit "F1" at the boot menu to select FreeBSD. After a few seconds, you will see the text "kernel= " as the kernel is loaded, followed by a 3-second countdown. Press the spacebar (or any key besides enter) to interrupt the boot. You will then see a "boot>" prompt. Enter the following command to boot into single-user mode:

boot> boot -s

Alternately, if you are loading a debug kernel, you must instead do this:

boot> unload
boot> load kernel.dbg
boot> boot

This will load the debug kernel for a single boot.

You will be prompted to enter the shell for root, if you are entering single-user mode. Simply hit "enter" to accept the default of /bin/sh. Now you should have a root prompt - key in the following series of commands:

# /sbin/fsck -y /
# /sbin/fsck -y /var
# /sbin/fsck -y /usr

This last command should take a few minutes to complete, at which time you can either continue the boot, or you can make appropriate changes to your startup files. If you need to make any changes, you must first enable read/write access to your filesystems:

# mount -a

If you know exactly what is causing the problem, then you can take specific action to fix it. If you suspect a BWMGR rule is causing problems, but don't know which one, then you can bypass starting the ET/BWMGR like this:

# mv /etc/rc.bwmgr /etc/rc.bwmgr.sav
# mv /etc/rc.bridge /etc/rc.bridge.save

# exit

LINUX:

Hit "F2" at the boot menu to select Linux. Next will appear the "LILO:" prompt. Type " linux -s " at the prompt and press enter.
You may have to enter the root password to get a shell prompt. At the prompt, type the following commands:

# /sbin/fsck -y /
# /sbin/fsck -y /usr

This last command should take a few minutes to complete, at which time you can either continue the boot, or you can make appropriate changes to your startup files. If you need to make any changes, you must first enable read/write access to your filesystems:

# mount -a

If you know exactly what is causing the problem, then you can take specific action to fix it. If you suspect a BWMGR rule is causing problems, but don't know which one, then you can bypass starting the ET/BWMGR like this:

# mv /etc/rc.bwmgr /etc/rc.bwmgr.sav
# mv /etc/rc.bridge /etc/rc.bridge.save

# exit

Hopefully you will be able to boot after performing this procedure. If not, please contact Emerging Technologies for technical assistance.
