Anaconda/NetworkIssues

From FedoraProject

Latest revision as of 02:58, 31 December 2009


[edit] NOTE: This information is still in draft format

The text of the wiki page has been sent to the Anaconda List for additional review on 20070325. Greg

Several problems can occur when Anaconda installs the operating system in some network environments. Symptoms and messages include:

pump times out
pump told us: No DHCP reply received

Basically, the DHCP message is telling Anaconda that it can't renew the DHCP lease. The error message can be confusing because the kickstart client has definitely received an IP address from the DHCP server, which is what allowed it to boot via PXE. After that initial PXE boot, Anaconda begins to load the driver modules, and the installation then halts at the prompt for networking information.

Here are several things to check, collected from the Anaconda mailing lists, to help you resolve these issues.

[edit] File Format

The first thing that you may want to check is the file format of your Anaconda kickstart file, ks.cfg. If you have not developed the file yourself, and do not know which operating system it came from, then it may not have the correct file format. Anaconda ks.cfg files should be Linux formatted files, with a single linefeed at the end of each line. In an editor like vim, use

:set ff=unix
:wq

to make sure the file has been converted from a DOS file to a Linux file. The open source notepad2 text editor for MS Windows can also save the ks.cfg file as a Linux/UNIX formatted text file.
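The same check can be sketched from the shell. The helper functions below are hypothetical (the names are illustrative and not part of Anaconda); they assume a POSIX shell with grep and tr available. One reports whether a kickstart file still contains DOS carriage returns, the other strips them.

```shell
# Hypothetical helpers; the function names are illustrative only.

# Return success if the file contains at least one carriage return (DOS line endings).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Rewrite the file in place with every carriage return removed.
to_unix() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

A typical call would be `has_crlf ks.cfg && to_unix ks.cfg`. Note that tr removes every carriage return, not only those at line ends, which should be harmless for a kickstart file.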

[edit] SOHO and ISP's DNS

A Small Office/Home Office (SOHO) site may rely on its ISP's DNS servers for name resolution. This has two notable results. After a user types in all the network information in the Anaconda configuration screens, or Anaconda uses the information loaded from the kickstart file, there will be a noticeable delay. Anaconda is trying to make sure that the network information entered resolves via the DNS servers supplied. The SOHO environment may be relying on the ISP's DNS and a set of /etc/hosts files for name resolution. Anaconda is trying to make sure it creates an /etc/resolv.conf and /etc/hosts file that will not break the network installation after the computer reboots.

Along with the long delay during NIC start up, the resulting network environment will have an /etc/hosts file that may look like this

127.0.0.1       localhost.localdomain   localhost   desiredhostname.xxx.org

What you were really expecting was

127.0.0.1       localhost.localdomain     localhost
192.96.1.1      desiredhostname.xxx.org   desiredhostname


If you are in this SOHO situation, then you will need to rewrite your /etc/hosts file. You can also use a kickstart file to generate the /etc/hosts file.

You may find this irritating, but it is increasingly important for Anaconda to safely write out a correct /etc/hosts and /etc/resolv.conf file. Think of all of the software tools that rely on network configuration, even if only through the 127.0.0.1 localhost name: X, sendmail, web servers, web clients, etc. Moreover, Linux hotplug allows network cards to be nonexistent during the boot process. Bastion hosts may not have access to a DNS server. If you are on a reliable network but lose connectivity to DNS, you still want network software on the local host to function properly. As a last sanity check, Anaconda attempts to make sure it can resolve the host name that you typed in. If Anaconda cannot resolve the host name, then the /etc/hosts file is written with both localhost and your desired host name on the same /etc/hosts file record.
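As a minimal sketch of generating the file from a kickstart, the hypothetical helper below prints the two expected records. It assumes it is called from the kickstart %post section, which runs as a shell script in the installed system; the function name and argument order are illustrative only.

```shell
# Hypothetical helper: print the two /etc/hosts records for a statically addressed host.
# $1 = IP address, $2 = fully qualified host name.
hosts_records() {
    short=${2%%.*}   # host name up to the first dot
    printf '127.0.0.1\tlocalhost.localdomain\tlocalhost\n'
    printf '%s\t%s\t%s\n' "$1" "$2" "$short"
}
```

With the example values from this section, the call would be `hosts_records 192.96.1.1 desiredhostname.xxx.org > /etc/hosts`.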

[edit] Mixed NIC Cards and Drivers

Perhaps you have a 10/100 LAN on motherboard (LOM) chip, but want to use gigabit. Let's say that the BIOS loses its configuration settings. These settings may have disabled the LOM 10/100 connection, so now the 10/100 LOM NIC is the first adapter that Anaconda sees. If you are using a kickstart file, then a.) the wrong driver may be present for your installation environment; b.) your Anaconda ks.cfg may be applying the wrong network driver to what you think is the first network adapter in your system; or c.) multiple NICs may be a requirement for your configuration. If so, then you may need to supply the desired NIC for installation on the boot line. The boot option is centered around the ksdevice options.

This is from the anaconda 11.0.5 docs/command-line.txt file. ksdevice takes one of 4 types of arguments which tells install what network device to use for kickstart from network:

  • An argument like 'eth0' naming a specific interface.
  • An argument like 00:12:34:56:78:9a indicating the MAC address of a specific interface.
  • The keyword 'link' indicating the first interface with link up.
  • The keyword 'bootif' indicating that the MAC address indicated by the BOOTIF command line option will be used to locate the boot interface. BOOTIF is automagically supplied by pxelinux when you include the option 'IPAPPEND 2' in your pxelinux.cfg file
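To relate the MAC address form to the 'bootif' form: pxelinux writes BOOTIF as a hardware type prefix followed by the MAC address with dashes, for example 01-00-12-34-56-78-9a. The hypothetical helper below (the name is illustrative only) converts such a value into the colon-separated form that ksdevice accepts.

```shell
# Hypothetical helper: convert a pxelinux BOOTIF value to a colon-separated MAC address.
# BOOTIF looks like <hardware type>-<MAC with dashes>, e.g. 01-00-12-34-56-78-9a.
bootif_to_mac() {
    printf '%s\n' "$1" | cut -d- -f2- | tr '-' ':'
}
```

For example, `bootif_to_mac 01-00-12-34-56-78-9a` yields the address used in the list above.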

Another supporting Anaconda option is the device command. The device command allows a system build script to specify either SCSI or Ethernet drivers to be probed and loaded in a specific order during the installation. You would use a command similar to

device eth e1000:tg3:e100

in your Anaconda kickstart file. In addition, you will need the

boot: Linux noprobe nonet

boot command line arguments. This will allow you to tell what order the various Ethernet device drivers are loaded in your system. Moreover, it can be used to specify which driver would be used to set eth0.

[edit] ks.cfg Typos

You may just have a typo in your ks.cfg file if you have multiple NICs present. You edited your kickstart file to say eth0, when it should be eth1. Check and correct the file, if required.
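A quick way to review which interfaces a kickstart file refers to is sketched below. The helper is hypothetical and assumes the interface is given with the --device option of the kickstart network command, written either as --device=eth0 or --device eth0.

```shell
# Hypothetical helper: print the --device value of each "network" line in a kickstart file.
ks_devices() {
    sed -n 's/^[[:space:]]*network[[:space:]].*--device[= ]\([^[:space:]]*\).*/\1/p' "$1"
}
```

Running `ks_devices ks.cfg` lists one interface name per matching line, which makes an eth0/eth1 mix-up easier to spot.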

[edit] DHCP Service Issue

A DHCPD service may not be running on your network as you believe it to be. The service could be down temporarily. Contact your network support folks to remedy this problem.

[edit] Anaconda Version Issues

Both the anaconda and distribution name and version are important when asking questions on mailing lists. Either Anaconda or Kernel bugs could be a factor in causing problems. These should be researched if any of the steps above have not helped to solve the problems.

[edit] Bus Enumeration Issues

RHEL 4 enumerates the bus in a different order than RHEL 3, so if there are any NICs on expansion cards, this could cause problems.

[edit] NIC Cycled Three Times

The NIC is cycled and performs three re-negotiations during a system build with anaconda. The first one is during power on which would affect PXE installations. The second NIC cycle is when the kickstart file is loaded. The third NIC cycle is when anaconda starts the build process.

Network switch equipment may cause problems because of the number of times that the NIC is cycled during an Anaconda installation. If the portfast setting has been disabled on the switch port that your NIC is attached to, then there will be STP delays that are well past the configured anaconda threshold.

[edit] Example Anaconda and NIC Cycle Issue

Here's an example of how this network interaction works:

I'm trying to kickstart a node in my render farm. The node is connected into an Alcatel 6600 series L2+ switch, connected via LACP aggregate to main switch closet. Everything is on the same VLAN and subnet.

If I start the bootstrap install procedure, if I have my machine connected to the 6600 switch, I get no link. Nada. Nothing. Zip. No DHCP broadcast is seen by the server (on the same subnet as the affected machine). I've even tried connecting the DHCP server physically to the same switch, and again, it doesn't even see a request from Kickstart.

What makes it weird is that there are 14 other machines on this switch also using DHCP (although not from Kickstart), and they have no problems getting leases and communicating with the DHCP server.

If I connect the affected machine to the main switch and try again, it will get an IP properly, but installation always crashes without fail while trying to install one of the OpenOffice RPMs. The RPM itself is fine, mind you, but Kickstart doesn't like it, and fails consistently at the same byte each and every time.

If I get the machine to grab a lease by connecting it to the main switch, and then swap it back on to the Alcatel, it continues to communicate without trouble.

I've tried a second NIC, with the same result. I've tried a different HDD with the same result.

Even when connected to the Alcatel 6600 switch during the DHCP negotiation phase, I can see the NIC light up, I see it actually activate, and then it gets shut down again immediately after.

This is extremely confusing. I have never seen a situation like this before, and I am running out of possible explanations.

[edit] Buffer Switch

You may be able to put a low-end network switch between your managed switch and the Linux install target. The low-end network switch may buffer the effects of the NIC being cycled three times so that Anaconda will be able to successfully build the host. Remove the temporary switch once the box is ready for production.

[edit] Pass DHCP IP

You may be able to determine the DHCP IP address and pass the IP information at the boot prompt.
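As a small sketch, the hypothetical helper below assembles the static network options in the same form used by the boot line examples on this page (ksdevice, ip, netmask, gateway). The function name and argument order are illustrative only.

```shell
# Hypothetical helper: assemble static network boot options.
# $1 = interface, $2 = IP address, $3 = netmask, $4 = gateway.
static_boot_args() {
    printf 'ksdevice=%s ip=%s netmask=%s gateway=%s\n' "$1" "$2" "$3" "$4"
}
```

The printed string would be typed after the boot label at the boot: prompt, for example the result of `static_boot_args eth0 10.196.254.122 255.255.255.0 10.196.254.1`.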

[edit] Timeout Extension

It may be possible to extend the timeout by adding /etc/pump_device.conf to the initrd ram disk. You will have to configure acceptable values for your network environment with the 'timeout <some number>' option.

[edit] Check Spanning Tree Settings

Check the spanning tree settings on the affected port and even switch-wide. Spanning tree negotiation can take a long time. The spanning tree delay will cause the NIC connected to the spanning tree port to think that the port is up, when the port isn't. Packets sent from the NIC to the port while it is in this state will essentially be eaten.
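Because the link looks up while packets are still being dropped, one way to measure the delay on a given port is to keep retrying a network command until it succeeds. The sketch below is a generic, hypothetical retry helper (not an Anaconda feature) that could be run from any shell on the affected port.

```shell
# Hypothetical helper: run a command repeatedly, pausing one second after each
# failure, until it succeeds or the given number of attempts is used up.
# $1 = maximum attempts, remaining arguments = command to run.
retry_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

For example, `retry_until 60 ping -c 1 10.196.254.1` keeps pinging a gateway (the address here is only an example) for up to about a minute after the link comes up.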

I'm not as familiar with the Alcatel switch, I'm a Cisco guy, but the problem you are describing regarding DHCP sounds like the portfast issue that we always see on this list. Basically, the NIC cycles itself 3 different times during a build. The first time is during pxeboot. After the pxe information is sent, the NIC is recycled and must re-negotiate with the switch. If portfast is not enabled, then the request times out before spanning tree allows the NIC to begin communications.

[edit] Spanning Tree Work Around

If network policies will not let the switch be reconfigured, then you can try workaround steps that include assigning the IP address at the boot prompt instead of via DHCP.


[edit] Avoid Autonegotiation Delays

Autonegotiation delays can cause problems during an Anaconda system build. Use ethtool boot options to avoid autonegotiation delays. The order of the options is important.

100 megabit connections would use this stanza.

boot: Linux ksdevice=eth0 eth0_ethtool="autoneg off speed 100 duplex full"

1000 or gigabit connections would use this stanza.

boot: Linux ksdevice=eth0 eth0_ethtool="autoneg off speed 1000 duplex full"

Some gigabit environments may require autoneg on. If off does not work correctly, then please try on.

NOTE: The Ethernet settings may have to be put in the sysconfig database. Edit /etc/sysconfig/network-scripts/ifcfg-eth0 and add the eth0_ethtool options. The name value pair changes from

eth0_ethtool="autoneg off speed 100 duplex full"

as used on the boot line to

ETHTOOL_OPTS="autoneg off speed 100 duplex full"

in the sysconfig database for the eth0 interface. Adjust the eth0 interface as required for your installation.
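The rename described above can be sketched as a one-line shell helper. The function is hypothetical (the name is illustrative only); it strips the interface-specific prefix from the boot option and emits the corresponding ifcfg line.

```shell
# Hypothetical helper: turn a boot-line ethtool option into an ifcfg line.
# $1 = boot option such as: eth0_ethtool="autoneg off speed 100 duplex full"
bootarg_to_ifcfg() {
    printf 'ETHTOOL_OPTS=%s\n' "${1#*_ethtool=}"
}
```

The output could be appended to the matching ifcfg file, for example `bootarg_to_ifcfg 'eth0_ethtool="autoneg off speed 100 duplex full"' >> /etc/sysconfig/network-scripts/ifcfg-eth0`.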



That's exactly what nicdelay is supposed to help you with. The problem is, it takes ~ 30 seconds for the NIC to negotiate with the switch. By delaying bringing the NIC up until enough time has passed that the negotiation has taken place, you should then be able to configure the NIC, get your kickstart file, and be off and running.

I managed to create a kickstart environment that utilizes PXE boot.

I tried to kick off a Dell 1950 and a 2950. I tried to boot off of both NICs to no avail. I connected a crossover cable from the kickstart client directly to the DHCP/PXE server and got the same results. An initial IP is assigned the first time for a PXE boot, but the client doesn't subsequently get one during the anaconda install phase.

Anthony,


I have a series of switches, in racks, and one of them seems to be a bit slow, so when kickstart goes for its DHCP request, pump times out. The initial PXE takes much more time to get an "ack" at the initial contact than the same boxes in other racks with other switches but then finally starts chugging away -- it's the next DHCP request that fails, times out.

Today I took the same box, draped a cat6 cable to another, less populated switch, and it kicked fine.

So my question is, how or where can I place an argument to tell kickstart's "pump" request to try longer before timing out?


[edit] Cisco Switch Issues

Consider the type of managed switch environment that you are trying to build an anaconda system in. Many older Cisco switches have STP on by default. If you set the port(s) to 'portfast' the links come up immediately rather than waiting ~45 seconds. I haven't encountered any other switch vendors that have this 'feature' enabled by default. You can tell if STP is on without even logging into the switch. Check to see if the link light for the port spends a long time in the orange color before turning green. Allegedly more modern Cisco switches ship with this feature off by default.

I'm not a Cisco guy so you will want to research these command sequences. The idea goes something like this at the Cisco command prompt

int (interface)
spanning-tree portfast
^Z
wr mem

On old 3500s (IOS 12.0 is the latest they'll run)

int fa0/5
spanning-tree portfast

You may also change this on a per-VLAN basis, for example:

no spanning-tree vlan 1


The network team here does not want to adjust STP and PortFast settings on an individual basis. They have legitimate reasons, but in the end it means I need to find a way to make pump handle STP better -- ISC does it.

I encourage everyone to poke at RH to get working on this bug:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=189795



I think there are quite reasonable arguments to be made for not enabling STP negotiations on ports that don't need it (ie, enable STP only on an individual port basis). If the device connected to a port can talk STP (a Linux box?) and take part in the negotiations as a rogue, can't it create havoc in an otherwise perfectly tuned network leading to a DoS?

Binand

Yes, I agree that STP doesn't need to be on by default. Their current policy is just not to make individual port settings changes any longer, as systems here tend to get moved without notice, causing issues.

Anyways, it will take some time to get that policy changed, so in the short term I am trying to see what can be done to get pump to work within this environment.

Ray

I am experiencing a similar problem; however, my situation is slightly different.

I have HP DL385's connected to a 48 port 10/100 module in a Cisco 6509 (with Sup 2 routing module). I'm using the virtual CDROM via the HP ILO (Integrated Lights Out) interface to perform the initial boot. I mount a customized version of RHEL3U5 CD1. My kickstart file is specified as an url. The system boots and Anaconda starts, but before the link negotiation can complete Anaconda times-out and I am dumped into an interactive install.

A couple of data points:

1) The HP DL385 on-board NICs are manufactured by Broadcom. I've read about a known bug with link negotiation between the Broadcom NICs and Cisco 48-port line cards. The link negotiation works, it just takes a "longer" time than with any other kind of NIC.

2) Setting the switch port NIC to enable "spanning-tree portfast" does shorten the link negotiation time, but not enough to get around my problem.

Work Arounds:

1) I will prebuild the kickstart configs and place them on my customized RHEL3U5 CD. I've been unsuccessful at reading the kickstart config from the CDROM. The driver supporting the ILO virtual cdrom does not name the cdrom device as "cdrom". Still reading up on this.

We build HP DL 380/385/580/585 servers every day while connected to Cisco switches that we don't have any access to. Are you using RHEL 3 or 4? If 4, do you have spare NICs in the box? RHEL 4 enumerates the bus backwards from RHEL 3, so if you have a NIC, it's going to be eth0 instead of the onboard...

In our normal data centers, we build HP DL385/585 daily as well. The data center with issues is from an acquired company with older Cisco switches (5 years +).


Portfast is enabled.

[edit] nicdelay and linksleep May Be of Assistance

Use nicdelay and linksleep boot options to resolve problems. These options are supposed to delay the driver load while the physical negotiation takes place. The problem is the 30 seconds or so delay for the NIC to negotiate with the switch. By delaying the time when the NIC is brought up, there has been enough time for the auto negotiation to take place. The delay should allow you to configure the NIC, get your kickstart file, and be off and running.

Please remember that some environments will not be successful with these two options. The nicdelay and linksleep parameters are supposed to be able to help with the STP issue. However, the Ethernet link is bounced several times during kickstart. These options can introduce long delays for each bounce of the Ethernet link, which is annoying.

It has been reported that boot options of

nicdelay=50 linksleep=50

can be used to solve Anaconda NIC "time-out" issues. It was not known what caused the NIC time-outs, when the nicdelay/linksleep options were successful. The http system builds were failing to contact the apache server. With these boot options in place, you should see Anaconda waiting about 50 seconds while the NIC initialized. The build will then complete normally.

Here are some sample boot line recipes that use the nicdelay and linksleep options. You will need to modify these to meet your network and hardware environment.

If you are connected to a 100Mb switch try:

boot: Linux ap10903-02-net ksdevice=eth0 ip=10.196.254.122 netmask=255.255.255.0 gateway=10.196.254.1 nicdelay=50 linksleep=50 eth0_ethtool="autoneg off speed 100 duplex full"

If you are connected to a gigabit switch try:

boot: Linux ap10903-02-net ksdevice=eth0 ip=10.196.254.122 netmask=255.255.255.0 gateway=10.196.254.1 nicdelay=50 linksleep=50 eth0_ethtool="autoneg off speed 1000 duplex full"

Some gigabit environments may require autoneg on. If autoneg off does not work, then please try on. Please remember that the order of the eth0_ethtool options is important.

NOTE: you may have to alter the delay numbers for your network environment. You may want to try some increasing scheme like 5, 50, 500, or 50000.

NOTE: The Ethernet settings may have to be put in the sysconfig database. Edit /etc/sysconfig/network-scripts/ifcfg-eth0 and add the eth0_ethtool options. The name value pair changes from

eth0_ethtool="autoneg off speed 100 duplex full"

as used on the boot line to

ETHTOOL_OPTS="autoneg off speed 100 duplex full"

in the sysconfig database for the eth0 interface. Adjust the eth0 interface and speed as required for your installation.

[edit] Embed ks.cfg File

If the nicdelay and linksleep options do not help, then you may need to embed the ks.cfg into the initrd file. There are two steps required to use an embedded kickstart file. The first step is to build an initrd with the embedded file. The second step is to reference the embedded file with the ks= boot option.

[edit] Example Procedure

Here is a small script to insert kickstart files into the initrd. Place the script in the isolinux folder, and ks.cfg files in a subfolder called configs.

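#!/bin/sh
# Insert kickstart files from ./configs into a gzip-compressed, loop-mountable initrd.img.
# Run as root from the isolinux folder. Stop at the first failing step.
set -e
# Uncompress the initrd image.
mv initrd.img initrd.img.gz
gunzip initrd.img.gz
# Loop-mount the image and copy the kickstart files into its root.
mkdir -p /mnt/init
mount -o loop initrd.img /mnt/init/
# Newer coreutils no longer accept --reply=yes; -f forces the overwrite instead.
cp -f configs/* /mnt/init/
umount /mnt/init/
# Recompress and restore the original file name.
gzip initrd.img
mv initrd.img.gz initrd.img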

[edit] Proper ks Reference

ks=file:filename

If you ever want to make a "generic" boot cd, you can assign the IP address on the boot: line, then have just what you need to get the machine up in the embedded ks.cfg, and use wget, or nfs (not sure about ftp/tftp) to pull in the system specific config, then use %include to add it into the ks.cfg. This is the way we are moving to. I did build the 20 configs and wrapped them in the initrd.img – and that worked. I did have to edit all 20 configs for a cut/paste line-wrap error. But now the 20 systems are built.

[edit] hp ILO CDROM

Problems have been experienced reading the ks.cfg from a virtually mounted cdrom over the ILO port. System build environments may include client hardware of HP DL385 and RHEL3U5. The boot prompt option used was

Boot: ap10903-02-net ksdevice=eth0 ip=10.196.254.122 netmask=255.255.255.0 gateway=10.196.254.1  <nicdelay/linksleep=options> ks=cdrom:/ks.cfg

while the isolinux.cfg contains

label ap10903-02-net
kernel vmlinuz
append ks=http://10.1.181.252/cgi-bin/avamar/dl385_data_node initrd=initrd.img text

You need to understand that an ISO mounted via the Virtual Media appears to the OS as a USB Mass Storage device. /var/log/messages shows

Kernel: usb 2-1: new full speed USB device using address 2
Kernel: Initializing USB Mass Storage driver
Kernel: scsi2: SCSI emulation for USB Mass Storage devices
kernel: Vendor: HP Model: Virtual CD-ROM
Kernel: Type: CD-ROM
Kernel: usbcore: registered new driver usb-storage
Kernel: USB Mass Storage support registered
Scsi.agent: cdrom at /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1:1.0/host2/target2:0:0/2:0:0:0
Kernel: sr0: scsi2-mmc drive 12x/12x cd/rw tray
Kernel: Uniform CD-ROM driver revision: 3.20
Fstab-sync: added mount point /media/cdrom for /dev/scd0

Correct iLO system build issues by embedding the kickstart file in the initrd and then using ks=file:filename.



[edit] Attribution

GregMorgan would like to thank the following people for their contribution of information for this page.

David Mackintosh, Chip Shabazian, William Ramthun, Vivek Kalia, Greg Caetano, Jason Dixon (RHCE, Dixon Group Consulting, http://www.dixongroup.net)