http://www.intelligententerprise.com/010327/feat2_1.jhtml

Untangling the Web

SOAP uses XML as a simple and elegant solution that automates B2B transactions

By Greg Barish

Most of today's Web applications are built for human consumption. Because real people interact with these applications, information must be presented in a visually appealing way. Users fill out HTML forms and then receive static or dynamic HTML output in response. Increasingly, however, the "user" is a program. Metacatalogs, for example, automatically query hundreds of existing online catalogs on behalf of users who enter a query in a single interface. In recent years, more and more such software agents - not people - have been interacting with these Web applications. The long-term view of Web-based B2B is based on such automation. In fact, it is likely that the network traffic generated by such automation will eventually dwarf the traffic generated by human interaction.

THE ORIGINS OF SOAP
From IBM rejection to W3C recognition

SOAP was first proposed by Microsoft as a means for heterogeneous software objects to communicate over a network. The protocol's Microsoft origins may seem surprising considering that it is not tied directly to any Microsoft technology - rather, it is a proposal for an open standard. However, the truth is that the original 1998 proposal (which involved Microsoft, UserLand, and DevelopMentor Inc.) did emphasize an approach that favored what has become BizTalk - Microsoft's SOAP strategy. It was only after the input of IBM, which initially rejected it, that the proposal began to distance itself from its original Microsoft bent, evolving into something more open. Sun also initially rejected the proposal and only recently (June 2000) changed its tune, whispering support for the version that the W3C acknowledged in May of 2000. Several other B2B companies (Ariba, CommerceOne Corp, and Lotus among them) also supported the proposal submitted to the W3C.

While a nice visual interface is an asset when it comes to enabling humans to interact with machines, it is an unnecessary obstacle when machines communicate with each other. What B2B really needs is an easy way to integrate the back-end systems of participating organizations. And we're not just talking about a solution that involves each business maintaining multiple interfaces to that data. That's the way things work today and, to a large extent, visual interfaces have often proved to be unwieldy solutions. IT managers want a way to consolidate their data and functionality in one system that can be accessed over the Web by real people or automatically by software agents.

The Simple Object Access Protocol, better known as SOAP, is aimed squarely at this data consolidation problem. Recently acknowledged by the World Wide Web Consortium (W3C), SOAP uses XML and HTTP to define a component interoperability standard on the Web. SOAP enables Web applications to communicate with each other in a flexible, descriptive manner while enjoying the built-in network optimization and security of an HTTP-based messaging protocol. SOAP's foundations come from attempts to establish an XML-based form of RPC as well as Microsoft's own efforts to push its DCOM technology beyond Windows.

SOAP increases the utility of Web applications by defining a standard for how information should be requested by remote components and how it should be described upon delivery. The key to achieving both of these goals is the use of XML to provide names not only for the functions and parameters being requested, but also for the data being returned.

Why SOAP?

As it exists today, Web-based distributed computing is not widely practical. IT managers have just two ways to go about enabling components to talk to each other over the Internet. One method is to use what HTTP provides, which means marshalling input and output as part of a POST or GET request/reply scenario. The other way is to use existing component technologies (integrating as necessary) between servers. In the latter scenario, objects communicate using a binary protocol over TCP/IP, but not as HTTP.

Let's take the HTTP-based solution first. Under this approach, components invoke functionality on other remote components by issuing POST or GET requests and processing associated HTML replies. However, this process is not general; it is inherently inflexible and, at times, can be just plain ugly. To understand why, let's consider an example.

Mix and Match

Suppose your company is trying to match sellers to buyers. You have established partnerships with several seller Web sites, each one different and each one providing access to its catalog via the Web. Now, suppose your company wants to integrate essentially all of these Web sites into one virtual catalog, so that when users query for some product, your system can match the query against those sellers that have the requested product. The problem is that the seller catalogs are huge, highly dynamic, and the sellers vary widely on how they store their data. Thus, downloading catalogs in their native format on a periodic basis is not always practical because (a) it is not always possible, (b) it may mean significant integration costs, and (c) it usually forces the need for very large, redundant databases. Since each seller distributes its catalog via the Web already, it would be far less costly if the B2B company could simply extract that data from those pages - possibly even extract it on the fly (per query).

However, there is no simple solution to this problem of extraction. For example, extraction implies that your company either develop technology that allows for the data to be extracted (or "scraped") from the seller's Web pages or that the seller provide an alternative, easy-to-parse interface to the data. Obviously, the root of the problem here is that the existing Web site data is prepared for human - not machine - consumption. Although useful data exists on Web pages, it is embedded between another type of data (HTML tags) that is used purely to facilitate browsers and provide a visual representation. However, inflexibility is another problem with querying via HTTP. If the client or server wants to communicate more complex data types (such as a list of catalog items, each of which has a list of colors and/or features), some ad hoc method for encoding those data structures must be developed.
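To make the "ad hoc encoding" problem concrete, here is a minimal Python sketch of what two partners might invent to squeeze a list of catalog items, each with its own list of colors, into a single HTTP query parameter. The item data and the separator convention (";" between items, ":" between fields, "," between colors) are invented for illustration; nothing about them is standard, which is exactly the point.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical catalog items, each carrying its own list of colors.
items = [
    {"name": "widget", "colors": ["red", "blue"]},
    {"name": "gadget", "colors": ["black"]},
]

def encode_items(items):
    """Pack nested data into one query parameter using a private convention."""
    packed = ";".join(f"{it['name']}:{','.join(it['colors'])}" for it in items)
    return urlencode({"items": packed})

def decode_items(query):
    """Undo the convention; the receiver must know the same separators."""
    packed = parse_qs(query)["items"][0]
    result = []
    for chunk in packed.split(";"):
        name, colors = chunk.split(":")
        result.append({"name": name, "colors": colors.split(",")})
    return result

query = encode_items(items)
print(query)
print(decode_items(query))
```

The scheme works only as long as both sides agree on the separators and no product name ever contains one of them. Every new pair of partners ends up negotiating a similar private format.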

Intricate Integration

One way around all of these issues is to use an existing distributed object technology, such as DCOM or the CORBA-equivalent, IIOP. Although this technology solves some problems, it ends up creating others. Foremost among these difficulties is the need to rely on special network ports for communication. Because IIOP and DCOM traffic is not encoded as HTTP requests and is not communicated between Web servers, specific unused network ports must be dedicated on both the client and server side. Thus integration is nonstandard and more complex than in the Web case. Also, IIOP/DCOM traffic forgoes many important built-in benefits of HTTP-based communication, such as persistent connections and some of the network-level Web caching or content distribution optimizations built around HTTP-based communication.

Still more troublesome are the problems caused by firewalls. Organizations that have firewalls are often limited to exchanging HTTP traffic only. Using a proprietary port introduces a security risk, not to mention causing a bureaucratic and integration headache. In some cases, IT managers may not be able to bridge component systems together because network administrators simply refuse to allow the security risk (and with good reason). As if all of these issues were not enough, there can still be headaches caused by the integration of back-end systems that speak disparate languages (for example, integrating a CORBA and a DCOM system).

The SOAP Solution

SOAP simply and elegantly solves the major problems with both the HTML-based and DCOM/CORBA approaches by using XML over existing HTTP technology. Use of XML yields three important benefits:

  • XML makes the data self-describing and easy to parse.
  • Because XML and XSL separate data from presentation, useful data is distinguished from the rendering metadata. Thus, pages used as data sources for software agents can be reused for human consumption, eliminating the need for redundant data views.
  • XML enables complicated data structures (such as lists or lists of lists) to be easily encoded using flexible serialization rules.
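As a rough illustration of the first and third points, the following Python sketch parses a small XML catalog in which each item carries its own list of colors. The element names (catalog, item, name, price, colors, color) and the values are invented for this example and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog document: a list of items, each with a list of colors.
catalog_xml = """
<catalog>
  <item>
    <name>widget</name>
    <price>9.95</price>
    <colors><color>red</color><color>blue</color></colors>
  </item>
  <item>
    <name>gadget</name>
    <price>24.50</price>
    <colors><color>black</color></colors>
  </item>
</catalog>
"""

def parse_catalog(text):
    """Turn the XML into plain Python data by following the element names."""
    root = ET.fromstring(text)
    items = []
    for item in root.findall("item"):
        items.append({
            "name": item.findtext("name"),
            "price": float(item.findtext("price")),
            "colors": [c.text for c in item.findall("colors/color")],
        })
    return items

catalog = parse_catalog(catalog_xml)
print(catalog)
```

Because every value is labeled, the receiver needs no private separator convention and no HTML scraping: a generic XML parser is enough.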

Using XML for encoding data also represents an alternative to ANSI-based Electronic Data Interchange (EDI). While EDI has been successfully used for years, it does have its problems. For example, it is cryptic and difficult to debug. Also, it is more expensive and requires the server and client to have special software installed to handle the format. What's more, EDI over HTTP is problematic: It doesn't completely support important HTTP encryption and authentication standards, and thus secure transactions are limited or simply not possible.

In contrast, SOAP keeps things simple. It is extensible, its data is self-describing and simple to debug, and it can take advantage of HTTP-based security methods. While a SOAP message requires more bandwidth than an EDI message, bandwidth has become less of a concern as the Internet itself becomes faster - particularly between businesses that can afford high-speed network access.

Finally, you can deploy SOAP over a number of protocols, including HTTP. This capability is important because it allows the firewall issues to be avoided and retains the optimizations that have been built into HTTP.

How SOAP Came to Be

Although they are not explicitly cited as an inspiration, there is no doubt that previous attempts at intranet component standardization - most notably COM and CORBA - pointed strongly to the need for an Internet-based standard as well. The battles between COM and CORBA for the foundation of back-end Web applications are well known. While both have their merits in terms of being scalable component-based solutions to application development, they both have their baggage as well. COM has been seen as the typical Microsoft-only, take-no-prisoners approach to component development, while CORBA has looked good on paper but has suffered from the classic problems with overstandardization (committees for everything) and an inability to command a leadership role in the component turf war.

The battle took a major turn when Remote Method Invocation (RMI) and Enterprise Java Beans (EJBs) were let loose. All of a sudden, COM and CORBA got pushed to the back burner. Java already solved the platform incongruency issues; RMI and EJBs defined an RPC and component infrastructure over that virtual operating system. While engineers applauded these more recent technologies that made scalable components easy to develop, they suffered from being an all-Java solution. In particular, RMI is a binary protocol that does not obey an open standard. Also, as Sun Microsystems itself cannot decide what to do as far as independent control of Java standardization, EJBs and RMI-based applications have been doomed to meet resistance as the type of open protocol that could be proposed for applications across the Internet.

An Open Protocol

SOAP avoids trying to declare a winner in this war for component design behind the firewall. Instead, it advocates a simple, open solution for application communication between firewalls. It also updates the priorities of components. Developers rarely laud the platform-independent features of CORBA these days, yet it was only a few years ago that development managers talked about how they linked together old AS/400 COBOL components with C++ components. This change is not because back-end legacy integration is no longer an issue: it still is, and technologies like CORBA work fine for that need.

But the focus has shifted to integration among Internet systems. This shift does not mean that COM and CORBA camps did not prepare for that eventuality. DCOM and IIOP were attempts to address that issue. But, at the same time, IT managers have realized that the Internet itself has enforced a de facto form of platform independence. Now, all you need is a standard on top of that platform - and here is exactly where SOAP steps in.

In all of this raving about SOAP, I should be clear that SOAP obviously does not address component technology itself. You still need CORBA or EJBs or COM objects on the server side to do the actual processing. Instead, SOAP merely defines an interoperability standard between them.

How SOAP Works

So, now that we know what SOAP is and why it is compelling, how does it all work?

Simply put, SOAP communication consists of messages exchanged between client and server. While SOAP can be used with any message-based networking protocol, the one that is obviously of interest to most people is HTTP. So, our discussion will assume we are using SOAP over this protocol.

As shown in Figure 1, clients post SOAP request messages to servers. These messages contain information about the remote method being invoked, and any input data to that method (in serialized format). Servers reply with SOAP messages that contain the output values framed in a method response. Figure 1 shows a client that is requesting prices for an old arcade game and a server that responds.
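On the wire, the client side of this exchange is an ordinary HTTP POST. The Python sketch below prepares such a request following SOAP 1.1 conventions (a text/xml content type and a SOAPAction header). The endpoint URL and the SOAPAction value are hypothetical, and the request is only constructed, not sent, because the endpoint does not exist.

```python
import urllib.request

# Hypothetical endpoint and SOAPAction value, invented for illustration.
ENDPOINT = "http://example.com/soap/prices"
SOAP_ACTION = "urn:example-prices#GetPrice"

def build_soap_post(envelope_xml: str) -> urllib.request.Request:
    """Wrap a serialized SOAP envelope in an HTTP POST request."""
    return urllib.request.Request(
        ENDPOINT,
        data=envelope_xml.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            # SOAP 1.1 expects the intent of the request in a SOAPAction header.
            "SOAPAction": f'"{SOAP_ACTION}"',
        },
        method="POST",
    )

# A placeholder body stands in for a real SOAP envelope here.
request = build_soap_post("<placeholder/>")
print(request.get_method(), request.full_url)

# Sending would be a single call, omitted because the endpoint is fictional:
# with urllib.request.urlopen(request) as reply:
#     reply_xml = reply.read()
```

Since this is plain HTTP, the request passes through the same ports, proxies, and firewalls as any other Web traffic.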

SOAP messages are essentially XML documents that each contain a SOAP envelope. The envelope consists of an optional header and a required body. A SOAP header typically contains metadata about the exchange (for example, a transaction identifier). The header is also a means for extending the protocol, but in a decentralized manner. The body focuses on the data itself, namely:

  • The remote method name (or response name)
  • The request (or reply) parameters
  • The serialized data.
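The envelope structure described above can be sketched in a few lines of Python. The envelope namespace is the one defined by SOAP 1.1; the application namespace, the GetPrice method, the game parameter, and the Transaction header entry are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example-prices"  # hypothetical application namespace

def build_envelope(method, params, transaction_id=None):
    """Build an Envelope with an optional Header and a required Body."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    if transaction_id is not None:
        # Optional header: metadata about the exchange, not the data itself.
        header = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
        ET.SubElement(header, f"{{{APP_NS}}}Transaction").text = str(transaction_id)
    # Required body: the method name wraps its serialized parameters.
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return envelope

envelope = build_envelope("GetPrice", {"game": "Asteroids"}, transaction_id=5)
print(ET.tostring(envelope, encoding="unicode"))
```

Omitting transaction_id produces an envelope with a body only, which is still a valid message because the header is optional.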

Figure 2 shows the specifics of the communication sketched out in Figure 1. Notice that the communication is phrased in XML and that the data is self-describing and serialized. While the example shows the simple marshalling of strings and floating point types between client and server, SOAP does support more complex (compound) data types, such as arrays.
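For readers without the figure at hand, the Python sketch below shows what the server's half of such an exchange might look like and how a client could unmarshal it. The response document is invented for illustration: the GetPriceResponse element, the urn:example-prices namespace, and the price value are assumptions, not the contents of Figure 2.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example-prices"  # hypothetical application namespace

# Hypothetical reply: the output value is framed in a method response element.
RESPONSE = b"""<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <m:GetPriceResponse xmlns:m="urn:example-prices">
      <price>450.0</price>
    </m:GetPriceResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

def parse_price(response: bytes) -> float:
    """Locate the response element by its qualified name and read the float."""
    root = ET.fromstring(response)
    namespaces = {"env": SOAP_ENV, "m": APP_NS}
    node = root.find("env:Body/m:GetPriceResponse/price", namespaces)
    if node is None:
        raise ValueError("no price element in SOAP response")
    return float(node.text)

print(parse_price(RESPONSE))
```

The client never inspects positions or markup meant for display; it asks for the element named price and converts its text to a floating point value.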

SOAP Details

Figure 2 leaves out a few details peripheral to the general process, but worth mentioning. For example, the fact that SOAP contains support for XML namespaces (including a default namespace) that can be used to fully qualify data names is not shown. Also, SOAP messages can declare that specific encoding rules be used. (A default rule can be found at schemas.xmlsoap.org/soap/encoding.)
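Both details can be seen in a short Python sketch. It qualifies the envelope elements with the SOAP 1.1 envelope namespace and declares the default encoding rules through an encodingStyle attribute. The SOAP-ENV prefix is only a common convention; any prefix bound to the same namespace would be equivalent.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # envelope namespace
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"  # default encoding rules

# Ask ElementTree to serialize the envelope namespace with a familiar prefix.
ET.register_namespace("SOAP-ENV", SOAP_ENV)

envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
# The encodingStyle attribute names the serialization rules the message uses.
envelope.set(f"{{{SOAP_ENV}}}encodingStyle", SOAP_ENC)
ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")

text = ET.tostring(envelope, encoding="unicode")
print(text)
```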

While SOAP messages consist of XML-compliant encoding, they can also be communicated via alternative transport mechanisms, such as RPC. Communication via RPC points back to the history of SOAP in its XML-RPC form. XML-based RPC cuts to the chase: It says, "Let's forget all this stuff about Web servers and Web clients, we just want distributed objects to be interoperable between disparate systems." SOAP over HTTP, in contrast, is a more general form of object-to-object (or agent-to-agent) communication over the Internet. It assumes what is minimally necessary: that objects are accessible via HTTP and that the data they return is self-describing.

What Lies Ahead

Right now, SOAP is in a very early stage. It has only recently been acknowledged by the W3C, and it remains to be seen how many of the major Web application players will endorse it and work toward interoperability. Microsoft and IBM, co-designers of the protocol, are obviously behind it and have a vested interest in its success. This alliance and the W3C blessing should give SOAP considerable firepower. CORBA vendor Iona Technologies Inc. and B2B players Ariba Inc. and Commerce One Inc. are also supporting the protocol. Other supporters include Hewlett-Packard, Compaq Computer Corp., Lotus Development Corp., and SAP, as well as UserLand Software Inc. (one of the major forces behind SOAP's development).

Still, noticeably quiet on the issue are Sun Microsystems and Oracle. Although Sun recently changed its tune regarding its support for SOAP, it has not been terribly enthusiastic. Oracle, on the other hand, probably does not like the smell of Microsoft being involved. Still, the support of these companies - while useful - is not critical. If enough B2B and CORBA vendors support SOAP, CORBA components plugged into Oracle databases running on Solaris boxes will still be able to access COM objects running on an NT box that accesses SQL Server.

Obviously, SOAP has immediate implications for B2B-style applications. It bodes especially well for Intranet applications and distributed corporate portals where automated data exchange is necessary and cost-effective. Companies will be able to write applications that rely on data stored at various company locations, all using the existing messaging infrastructure that connects them today. SOAP will also obviously ease the integration challenges of merging companies. Today, many technology companies that merge must integrate their back-end systems, each built on different component technologies. With SOAP, you have the opportunity to keep these systems running while providing a simple means for integration.

SOAP also suggests an emerging trend: practical deployments of sophisticated software agents and a global network of components for wide-area, distributed computing. Information agents will not only be empowered to issue sophisticated queries to each other using a simple, yet flexible protocol - they will be more robust. Furthermore, component discovery seems to be a logical next step and has already been suggested by the SOAP specification. In the far future, automatically locating functionality on the Web could approach the simplicity of using a kind of Internet search engine that many people manually use today.

SOAPy Waters

It all seems great, right? At least for large-scale, heterogeneous, distributed computing, SOAP is a good step forward. But data integration problems still loom from the days when federated systems were first proposed. Questions remain such as, "How do we give real meaning to the data?" Sure, we can label the data. But, short of creating a global ontology, or way to represent data, how can we integrate the meaning of the data? Perhaps the answer to that will be riding in on the next wave of Web technology.

RESOURCES

Apache SOAP: xml.apache.org/soap/index.html
DevelopMentor Inc.: www.develop.com/soap/
Microsoft: www.microsoft.com/biztalk/default.htm
UserLand Software Inc.: www.userland.com
WWW Consortium (W3C) Spec: www.w3.org/TR/SOAP

Related articles on IntelligentEnterprise.com

"IT and the New Economy," January 30, 2001:
www.intelligententerprise.com/010130/feat1.jhtml
"Pillar of the Community," August 18, 2000:
www.intelligententerprise.com/000818/feat.jhtml



Greg Barish (greg@ultralogic.com) is a consultant specializing in building scalable distributed information systems. He has previously held engineering positions at Healtheon/WebMD Corp. and Oracle.
