Microsoft Networking Problems at Carnegie Mellon

Erikas Aras Napjus
Network Architect
erikas@cmu.edu


Introduction

Since the introduction of Windows NT to the Carnegie Mellon campus in early 1994, we have noted a number of serious and not-so-serious network problems related to Windows NT and Windows 95. Many of these problems are related to Microsoft's NetBIOS-based peer-to-peer networking, but others stem from the network stacks themselves.

We would love to work with Microsoft or, if necessary, work on our own to resolve these problems. For now, however, these difficulties have severely restricted our ability to support these machines in our heterogeneous computing environment.

Network Neighborhood / Browsing Problems

Browsing is our biggest problem area with peer-to-peer networking. On some network segments, most often the smaller routed segments in the residence halls, users will suddenly (and seemingly at random) stop seeing most or all machines in their Network Neighborhood window. This appears to happen when the master browser for a given network segment and protocol isn't properly configured to talk to the WINS servers for additional browsing information (e.g. Samba servers without WINS support, Windows for Workgroups servers under NetBEUI, etc.).

The proposed solution to this problem is to place three machines with sufficiently high browse priorities on every segment to ensure that no improperly configured nodes take control of browsing. This is very costly. An alternative would be configuring all machines properly and administratively locking out users from making changes, but that isn't practical in our environment.

It looks like Windows NT 4.0 Server's browse server might be unstable. Recently, we've seen a few instances where the Network Neighborhood becomes effectively segmented. Machines attached to the same segment as our WINS servers (which are "super" master browsers for the campus) see one another, and machines on remote networks see one another, but the connection between the two groups disappears. Name resolution continues to work, however, indicating that the WINS server processes are functioning properly.

WINS Server Name Conflicts

The Microsoft WINS server doesn't appear to handle name conflicts properly. If we have a long-standing server with a WINS name "lease" and a new machine comes along and registers the same name, the new machine "wins" (no pun intended) and takes control of the name, effectively taking the server off the network unless it changes names. This has even happened with the WINS servers themselves and effectively prevents individuals from being able to run stable services.
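The difference between the behavior we observe and the behavior we would expect can be sketched with a toy name table. This is only an illustration, not a model of the WINS protocol or of Microsoft's implementation: the NameRegistry class, the challenge_owner switch, the reachable set, and the addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NameRegistry:
    """Toy name table (hypothetical; not the real WINS wire protocol)."""
    challenge_owner: bool                         # hypothetical policy switch
    owners: dict = field(default_factory=dict)    # name -> address of current owner
    reachable: set = field(default_factory=set)   # addresses that would answer a challenge

    def register(self, name: str, address: str) -> bool:
        """Return True if `address` ends up owning `name`."""
        current = self.owners.get(name)
        if current is None or current == address:
            self.owners[name] = address           # new name, or a refresh by the owner
            return True
        if self.challenge_owner and current in self.reachable:
            return False                          # existing owner is alive: refuse newcomer
        self.owners[name] = address               # newcomer takes over the name
        return True


# Behavior as we observe it: the most recent registration takes the name.
observed = NameRegistry(challenge_owner=False)
observed.register("FILESRV", "192.0.2.10")        # long-standing server
observed.register("FILESRV", "192.0.2.99")        # new machine, same name
print(observed.owners["FILESRV"])                 # 192.0.2.99

# Behavior we would expect: a live owner keeps its name.
expected = NameRegistry(challenge_owner=True, reachable={"192.0.2.10"})
expected.register("FILESRV", "192.0.2.10")
expected.register("FILESRV", "192.0.2.99")
print(expected.owners["FILESRV"])                 # 192.0.2.10
```

With the challenge policy, the newcomer would only take over the name once the original owner stopped answering.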

Broadcast Packets

All three NetBIOS transports for peer-to-peer networking are fairly broadcast (or multicast) intensive while users are browsing the network or resolving names. We've restricted users to only NBT (NetBIOS over TCP/IP), so we've only been looking at that transport in detail. Although Microsoft documentation indicates that a properly configured Windows 95/NT machine that talks to a WINS server will never broadcast on the local wire, we still see fairly frequent broadcasts. NT Server appears to send out significantly more broadcasts than NT Workstation or Windows 95.
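A comparison like the one above can be made by tallying NBT broadcasts per source address. The following is a minimal sketch, assuming the capture has already been decoded into (source address, destination address, UDP destination port) tuples; that record format, the function name, and the sample addresses are assumptions for illustration. UDP ports 137 and 138 are the NetBIOS name and datagram services.

```python
from collections import Counter
import ipaddress

NBT_UDP_PORTS = {137, 138}  # NetBIOS name service and datagram service


def nbt_broadcasts_by_source(packets, subnet):
    """Count NBT broadcast packets per source address.

    `packets` is an iterable of (src_ip, dst_ip, udp_dst_port) tuples,
    a hypothetical format for already-decoded capture data.
    """
    network = ipaddress.ip_network(subnet)
    broadcast_addrs = {
        network.broadcast_address,                  # subnet-directed broadcast
        ipaddress.ip_address("255.255.255.255"),    # limited broadcast
    }
    counts = Counter()
    for src, dst, port in packets:
        if port in NBT_UDP_PORTS and ipaddress.ip_address(dst) in broadcast_addrs:
            counts[src] += 1
    return counts


# Made-up sample records, not measurements from our network.
sample = [
    ("192.0.2.10", "192.0.2.255", 138),
    ("192.0.2.10", "192.0.2.255", 137),
    ("192.0.2.20", "192.0.2.7", 137),         # unicast, e.g. a query sent to a WINS server
    ("192.0.2.30", "255.255.255.255", 138),
    ("192.0.2.30", "192.0.2.255", 67),        # broadcast, but not NBT
]
for source, count in nbt_broadcasts_by_source(sample, "192.0.2.0/24").most_common():
    print(source, count)
```

Sorting the tally by count makes it easy to see which machines keep broadcasting even though they are configured to use a WINS server.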

Furthermore, we seem to be seeing an incredible number of browser elections on our networks. These are probably caused by misconfigured machines trying to take control of browsing (or losing control of browsing). These elections involve large numbers of broadcasts from many machines on our network. In the past, browser elections have caused user machines to stop responding due to the number of IP broadcasts on the wire.

In a bridged environment like ours, broadcast control is particularly critical. Adding an extra 1,000 nodes running peer-to-peer to the campus network would quite possibly break our current architecture. (For example, when a bug in Cisco's routing code caused broadcasts from the residence halls to be forwarded onto the campus network, we ran into severe broadcast-related performance problems for some users; that's a good example of what might happen if another two or three hundred machines appeared on the campus network...)

Workgroup/Domain Scaling Problems

During both browser elections and regular operation of our network, master browsers send out broadcast update messages, one for each workgroup or domain registered in the Network Neighborhood. With individual users defining their own workgroups or domains, the number of broadcast packets began to severely impact performance. Therefore, we've attempted to reduce the number of workgroups and domains attached to the network and tried to standardize on one primary domain for all machines.
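The scaling effect is simple multiplication, which the following sketch makes explicit. It assumes that every master browser sends one broadcast per known workgroup or domain in each update cycle, as described above; the function name, the cycle rate, and all the numbers are hypothetical and are not measurements from our network.

```python
def update_broadcasts_per_hour(workgroups, master_browsers, cycles_per_hour):
    """Broadcast update messages per hour under the assumption of one
    broadcast per workgroup/domain, per master browser, per update cycle."""
    return workgroups * master_browsers * cycles_per_hour


# Hypothetical figures: 10 master browsers, 4 update cycles per hour.
many_workgroups = update_broadcasts_per_hour(200, 10, 4)
one_domain = update_broadcasts_per_hour(1, 10, 4)
print(many_workgroups)  # 8000
print(one_domain)       # 40
```

Under these assumptions the broadcast load grows linearly with the number of workgroups, which is why collapsing everything into one primary domain helps so much.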

Furthermore, since browsing appears to be organized per segment, protocol, and workgroup/domain, you need three stable master browsers for each domain; otherwise the domain may not be stable, because random machines get selected as master browsers. Therefore, anyone who wants a stable domain needs a minimum of three servers. We'd much rather see support for providing master browsing services for multiple domains from the same machine so we can centrally "seed" the network with workgroup and domain information (preferably without a broadcast per workgroup or domain as described above).
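To give a rough sense of the cost, the sketch below counts the dedicated browser machines needed with and without multi-domain support. It assumes three machines per segment and domain, as discussed above, and further assumes that one machine can cover every protocol on its segment; the function name and the segment and domain counts are hypothetical.

```python
BROWSERS_PER_SCOPE = 3  # the "three stable master browsers" figure from the text


def machines_needed(segments, domains, multi_domain=False):
    """Dedicated browser machines needed to keep browsing stable.

    Assumes one machine covers all protocols on its own segment.
    multi_domain=False: each machine serves a single domain (the current
    situation as we understand it).
    multi_domain=True: each machine serves every domain on its segment
    (the support we would like to see).
    """
    domains_per_machine_set = 1 if multi_domain else domains
    return BROWSERS_PER_SCOPE * segments * domains_per_machine_set


# Hypothetical campus: 40 segments, 5 domains.
print(machines_needed(40, 5))                     # 600
print(machines_needed(40, 5, multi_domain=True))  # 120
```

In this simplified model, multi-domain support divides the hardware requirement by the number of domains.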

IPX SAP Advertisement

Windows NT Workstation and Server boxes appear to advertise themselves via IPX/SAP, rapidly populating the SAP tables on our IPX routers. Although this should be configurable, we haven't been able to find a reliable way of preventing NT machines from advertising themselves. SAP table explosion on routers isn't a good thing, particularly when these machines really don't need to be advertising themselves in the first place.

Windows NT Server machines that are configured to serve information via IPX (I believe -- this might be more general) have been known to attract all IPX clients on a given network segment to themselves, effectively causing IPX to break for all clients on the network. This also causes machines with automatic IPX stacks, such as Windows 95, to take a few minutes to boot while the connection to the IPX server times out. We haven't been able to configure away this problem.

RAS Server DHCP Proxy

The RAS Server functionality built into Windows NT appears to use a non-standard form of DHCP proxy to request addresses for remote clients. These malformed DHCP packets not only contribute to additional broadcast load on the network (since the requests never time out and are retried fairly aggressively), but can also confuse DHCP servers that receive them.

Conclusions

We have had some indications from Microsoft that a number of these difficulties are going to be resolved in future versions of Windows NT. Until those difficulties are resolved or significant development resources are thrown at these problems, it will be difficult if not impossible to support all aspects of Windows NT networking in our environment.

