Introduction
Stream processing saves programs from having to load data
entirely into memory.
Instead of the actual data, a program gets a handle on a stream. The program
asks the stream for a chunk of the data, does the required processing on that
chunk, and then asks for another chunk. This goes on until the entire stream
has been read and processed.
The important advantage is that the entire message never has to be loaded
into the program’s memory space at once. By reading data in chunks, only the
chunk that is currently being processed is held in memory.
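The read loop described above can be sketched in a few lines. BizTalk
components are written in .NET, so the Python below is only a language-neutral
illustration; `process_in_chunks`, `CHUNK_SIZE`, and the byte counting that
stands in for "processing" are assumptions made for the example.

```python
import io

CHUNK_SIZE = 4096  # hypothetical chunk size; real systems tune this


def process_in_chunks(stream, chunk_size=CHUNK_SIZE):
    """Read a stream chunk by chunk and return (total_bytes, largest_chunk)."""
    total = 0
    largest = 0
    while True:
        chunk = stream.read(chunk_size)  # ask the stream for the next chunk
        if not chunk:                    # an empty result means end of stream
            break
        total += len(chunk)              # stand-in for the real processing
        largest = max(largest, len(chunk))
    return total, largest


message = io.BytesIO(b"x" * 10_000)
total, largest = process_in_chunks(message)
```

However large the message is, the loop never holds more than one chunk of
`CHUNK_SIZE` bytes at a time.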
In systems under load, and where scalability is a concern, this can have a
huge influence on the success of those systems.
This post explains how stream processing is implemented in BizTalk Server,
and how to take advantage of streaming to build better BizTalk applications.
Stream Processing Support in BizTalk
BizTalk implements a forward-only streaming design.
Forward-only means that the stream can only be read from start to end; it
cannot seek back to data that has already been read.
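To make "forward-only" concrete, here is a minimal sketch of a stream that can
be read but not rewound. It is a Python illustration of the concept, not
BizTalk code; the class name `ForwardOnlyStream` is an assumption made for the
example.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Wraps a source stream and exposes it as readable but non-seekable."""

    def __init__(self, source):
        super().__init__()
        self._source = source

    def readable(self):
        return True

    def seekable(self):
        return False  # consumers may not rewind or skip around

    def readinto(self, buffer):
        data = self._source.read(len(buffer))
        buffer[:len(data)] = data
        return len(data)


stream = ForwardOnlyStream(io.BytesIO(b"hello world"))
first = stream.read(5)      # consumes the first five bytes
try:
    stream.seek(0)          # attempt to go back to the start
    rewound = True
except io.UnsupportedOperation:
    rewound = False
rest = stream.read()        # continues from where the last read stopped
```

A consumer that needs to look at earlier data again has to buffer it itself,
which is exactly the cost that forward-only designs try to avoid.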
Stream processing is used across BizTalk:
--Out-of-the-box pipelines are stream-based; they use streaming when fetching
data from the adapter and as data passes from one stage to another. For
example, out-of-the-box components such as the Flat-File Disassembler and the
XML Disassembler are built with streaming in mind.
When writing custom pipeline components, you should use the streaming
approach, although it is harder to implement, instead of loading the entire
message into memory, especially for systems that will be under heavy load.
This post takes an in-depth look at pipeline stream programming.
--Orchestrations also support streaming. When calling .NET components from
within an orchestration, streaming can also be used to improve orchestration
performance. Since this discussion requires an understanding of the
orchestration engine, I will discuss streaming support in orchestrations in a
future post.
--Mapping also supports streaming. XSLT typically requires loading the whole
message into memory to perform a transformation, which obviously contradicts
the BizTalk streaming model. To deal with this, BizTalk implements virtual
streaming for maps. If a message is larger than 1 MB, BizTalk virtual
streaming writes the data to a temporary file, and this file is used as the
stream source instead of memory. This introduces I/O overhead, but that is
still much better than any potential memory issues. This might not be very
clear yet; I discuss virtual streaming later in this post.
--BAM introduces significant overhead on BizTalk because it is constantly
saving data into the BAM database. To reduce this overhead, BAM contains a
set of event streams.
Streaming in Pipelines
When used in pipeline components, stream processing solves two major issues:
--First, as I said before, it reduces memory consumption by not loading the
entire message into memory.
--Second, it speeds up processing, because the next component does not have
to wait for the current component to finish processing the entire message.
Instead, the current component passes to the next component the portion of
the stream it has finished processing, and starts working on the next
portion.
The No-Streaming Approach
To best appreciate the value of streaming, let's assume that you have built a
custom pipeline, and in its various stages you have created custom pipeline
components without adopting the streaming approach.
--An adapter, which is listening for data on the wire, creates a new BizTalk
message of type IBaseMessage and, if there are any message parts, also
creates parts of type IBaseMessagePart. It then attaches the data stream to
this message and promotes context properties related to the receive endpoint.
The adapter then submits the message to the pipeline.
Before going on, make sure you understand this: the adapter has not actually
read the data from the wire. It just created a message type that BizTalk can
understand and attached the data stream to that message; no data has been
read yet.
--Now the first component of the first stage in the pipeline gets the
message. At this moment, the component reads the entire data stream attached
to the message at once. This effectively loads the entire message into
memory. The component does whatever processing is required, then creates
another stream composed of the new data and passes it along to the next
component in the pipeline chain.
--This next component again reads the entire data stream at once, causing the
entire message to be loaded into memory again, does the required processing,
creates a new stream, and passes it along to the next component.
This process continues until all components of the pipeline have executed and
the final message is submitted by the Message Agent to the MessageBox.
This should give you an idea of how badly such a design performs, especially
for large messages and systems under high load. This alone can cause your
system to break.
Yet this is the approach most used by pipeline developers, since it’s the
easiest.
What is also bad is that, although out-of-the-box pipeline components such as
the XMLDisassembler or XmlValidator implement streaming, if you create a
custom pipeline and mix these components with your own custom components that
do not use streaming, you break the streaming chain and again load the entire
message into memory.
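The cost of the non-streaming approach can be illustrated with a small sketch.
This is Python rather than BizTalk code, and `buffering_component` is a
hypothetical stand-in for a pipeline component that reads the whole body
before passing a new stream along.

```python
import io


def buffering_component(body, buffered_sizes):
    """A non-streaming component: reads the whole body, then re-wraps it."""
    data = body.read()                 # pulls the entire message into memory
    buffered_sizes.append(len(data))   # record how much this stage held at once
    return io.BytesIO(data)            # hand a brand-new stream to the next stage


message_size = 1_000_000
body = io.BytesIO(b"a" * message_size)  # stands in for the adapter's stream
buffered_sizes = []
for _ in range(3):                      # three components in the pipeline
    body = buffering_component(body, buffered_sizes)
```

Every stage ends up holding the full message, so the buffering cost is paid
once per component rather than once per pipeline.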
The Streaming Approach
Now let's see how streaming changes the way data is processed, and how this
affects overall performance.
--The adapter attaches the data stream to the IBaseMessage and passes it to
the pipeline.
--The first component receives the message. Again, remember that the message
has the data stream attached to it, but the data has not actually been read
yet. This component wraps the message's stream with a custom stream, without
reading any part of it. The message, with the new stream attached, is then
passed on to the next component.
--The next component does the same: it wraps yet another
new stream around the one passed to it, and again passes on the message to the
next stage.
Now the message has a chain of two streams attached to it, and still no data
has been read.
The message goes through the same process in every remaining pipeline
component.
--Finally, it’s the Endpoint Manager (EPM) that calls the stream’s Read()
method. At this moment, the call is propagated through all the custom streams
that the pipeline components have wrapped around the message, all the way to
the main data stream of the adapter.
--The adapter then performs the first actual data read
and passes it along the stream to the first component.
--Each pipeline component then executes its own
implementation of the Read() method on that specific stream chunk it receives.
--After passing through all components, this stream chunk
is then handed to the EPM which hands it over to the Message Agent.
This continues until the stream is read entirely and
the message is stored in the MessageBox.
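The order of events in this walkthrough (wrap first, read later, with each
read travelling inward through the wrappers) can be demonstrated with a small
sketch. It is written in Python purely as an illustration; `TracingStream` and
the names "adapter", "component1", and "component2" are assumptions, not
BizTalk types.

```python
import io

events = []  # records the order in which read calls travel through the chain


class TracingStream(io.RawIOBase):
    """Wrapper stream that logs each read before delegating to the inner stream."""

    def __init__(self, name, inner):
        super().__init__()
        self._name = name
        self._inner = inner

    def readable(self):
        return True

    def readinto(self, buffer):
        events.append(self._name)             # this wrapper was asked for data
        data = self._inner.read(len(buffer))  # pull from the stream it wraps
        buffer[:len(data)] = data
        return len(data)


adapter_stream = TracingStream("adapter", io.BytesIO(b"0123456789"))
body = TracingStream("component1", adapter_stream)   # first component wraps
body = TracingStream("component2", body)             # second component wraps
events_after_wrapping = list(events)                 # nothing has been read yet

chunk = body.read(4)                                 # the final consumer pulls
```

Wrapping alone triggers no reads. The first read by the final consumer reaches
the outermost wrapper, then the inner one, then the source, and the chunk
flows back out through the same chain.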
Once the message is stored entirely in the MessageBox, the EPM notifies the
adapter so that it can clean up resources; for example, a File adapter
deletes the file and a SQL adapter disposes of the connection.
On the send side, the Message Agent hands the adapter a reference to the
message. The adapter then starts pulling the stream from the MessageBox and
through the send pipeline. Data is then sent to the wire in a streaming
fashion.
So why should you care about understanding streams? Beyond the fact that it's
cool to understand things, it affects how you design your pipeline
components.
Streaming Sample
To test the performance difference between the streaming and non-streaming
approaches, I created a pipeline component that uses the in-memory
(non-streaming) approach. There is nothing special about its Execute method;
it just acts as a pass-through component.
Next, I created a custom pipeline component that uses streaming.
First, I created a custom stream class that derives from System.IO.Stream.
The reason for this is that I need to provide my own implementation of the
Read method.
Recall from the explanation of stream processing that the Read method gets
called by the EPM to trigger the start of stream processing.
It's within this Read method that you provide the logic you want to perform;
remember, however, that here you are processing just a chunk of the stream
and not the entire message.
Again, for this demo I am just passing the data along with no processing, so
I simply return the same set of bytes unchanged.
In the Execute method of the IComponent interface, unlike the previous
component where the entire message is loaded into memory, I just wrap the
incoming body stream with my custom stream implementation and then replace
the message's stream with that wrapper.
Remember that on the first pass through the pipeline components, the Execute
method just creates a new wrapper stream.
Only after all components have done so is the Read method finally executed,
at which point each stream is processed and the data is updated.
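The shape of such a component can be sketched as follows. This is not the
original C# sample; it is a Python approximation of the same idea, and
`PassThroughStream`, `Message`, `execute`, and `process_chunk` are all
hypothetical names introduced for illustration.

```python
import io
from dataclasses import dataclass


class PassThroughStream(io.RawIOBase):
    """Wrapper whose per-chunk hook returns the bytes unchanged."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner
        self.bytes_seen = 0

    def readable(self):
        return True

    def process_chunk(self, chunk):
        # Per-chunk logic goes here; it sees one chunk, never the whole message.
        self.bytes_seen += len(chunk)
        return chunk

    def readinto(self, buffer):
        chunk = self.process_chunk(self._inner.read(len(buffer)))
        buffer[:len(chunk)] = chunk
        return len(chunk)


@dataclass
class Message:          # hypothetical stand-in for a pipeline message
    body: io.IOBase


def execute(message):
    """Mimics a component's Execute step: wrap the body, read nothing."""
    message.body = PassThroughStream(message.body)
    return message


msg = execute(Message(body=io.BytesIO(b"payload")))
seen_before_read = msg.body.bytes_seen   # still 0: wrapping did not read
result = msg.body.read()                 # the eventual consumer triggers reads
```

The key design point is that `execute` returns immediately; all of the work
happens later, inside the wrapper's read path, one chunk at a time.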
To run the sample, in my BizTalk application, I have a receive location
which is configured to use the pipeline using the in-memory approach.
On my file system, I created a large message of around 120 MB.
Before dropping this file into the in folder, I examined how much memory the
BizTalk host instance process was consuming; it was nearly 18 K.
When I dropped the file, I examined the memory consumed while the host
instance was pulling the file. I could see the memory increasing until it
nearly hit 300 K, so it went from 18 K to 300 K for consuming a single 120 MB
file.
I then changed the pipeline used in my receive location to the one that uses
streaming.
After dropping the file again and checking the host instance process, I could
see its memory rising to nearly 54 K, a difference of around 250 K between
the first and second approach.
Now imagine what this means for a project that receives thousands of
messages, which in the real world can be much larger than 120 MB.
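A similar comparison can be reproduced outside BizTalk. The sketch below is
Python, uses an 8 MB temporary file instead of the 120 MB message, and
measures peak allocations with `tracemalloc`; the sizes and function names are
assumptions, and the numbers it produces say nothing about BizTalk host
instances themselves.

```python
import os
import tempfile
import tracemalloc

SIZE = 8 * 1024 * 1024   # 8 MB test payload (assumption; the post used ~120 MB)
CHUNK = 64 * 1024


def peak_memory(consume, path):
    """Return the peak traced allocation, in bytes, while consume(path) runs."""
    tracemalloc.start()
    try:
        consume(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def read_all(path):
    with open(path, "rb") as f:
        return len(f.read())            # whole file held in memory at once


def read_chunked(path):
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):   # only one chunk alive at a time
            total += len(chunk)
    return total


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * SIZE)
    path = tmp.name
try:
    peak_in_memory = peak_memory(read_all, path)
    peak_streaming = peak_memory(read_chunked, path)
finally:
    os.remove(path)
```

The in-memory peak grows with the file size, while the chunked peak stays
close to the chunk size no matter how large the file is.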
Out of the Box Streams
Besides writing your own stream implementations, you can use the
out-of-the-box stream implementations that BizTalk provides in the
Microsoft.BizTalk.Streaming assembly:
--EventingReadStream is an abstract stream class that allows you to hook up
event handlers to certain events, such as the first and last stream chunk
reads. Classes such as ForwardOnlyEventingReadStream extend this class.
--VirtualStream is probably the most used out-of-the-box stream because it's
simple to use. Here, data is still held in memory, as if you were using the
in-memory approach, but only until the data size exceeds a configured value,
at which point it is written to a disk file. From that moment on, the disk
file becomes the stream source.
You can think of VirtualStream as being somewhat similar to MemoryStream,
except that it switches to a disk file for data storage once the data
outgrows the configured threshold.
So VirtualStream solves the problem of loading large data into memory, at the
expense of I/O overhead. Despite this I/O overhead, it's still much better
than any memory problems.
If implementing custom streams is, for whatever reason, not an option, or at
least a very difficult one, then this stream is your safest bet.
--The ReadOnlySeekableStream provides seekable read-only
access to a stream.
VirtualStream and ReadOnlySeekableStream can be used together to wrap the
original stream.
This results in the VirtualStream saving data to a temporary file, and the
ReadOnlySeekableStream providing seekable, read-only access to the stream.
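The spill-to-disk idea behind virtual streaming can be sketched independently
of BizTalk. The Python class below is not VirtualStream and does not mirror
its API; `SpillToDiskBuffer`, its `threshold` parameter, and the sizes used
are assumptions chosen to show the mechanism.

```python
import io
import tempfile


class SpillToDiskBuffer:
    """Keeps written data in memory until `threshold` bytes, then moves to disk."""

    def __init__(self, threshold):
        self._threshold = threshold
        self._store = io.BytesIO()
        self.on_disk = False

    def write(self, data):
        if not self.on_disk and self._store.tell() + len(data) > self._threshold:
            spill = tempfile.TemporaryFile()      # deleted automatically on close
            spill.write(self._store.getvalue())   # move what was buffered so far
            self._store.close()
            self._store = spill
            self.on_disk = True
        return self._store.write(data)

    def rewind(self):
        self._store.seek(0)   # the backing store is seekable in both modes

    def read(self, size=-1):
        return self._store.read(size)

    def close(self):
        self._store.close()


small = SpillToDiskBuffer(threshold=1024)
small.write(b"a" * 100)                 # stays in memory

large = SpillToDiskBuffer(threshold=1024)
for _ in range(4):
    large.write(b"b" * 512)             # crosses the threshold on the third write
large.rewind()
large_data = large.read()

small.close()
large.close()
```

Because the backing store is seekable whether it lives in memory or on disk,
a consumer can rewind and re-read it, which is roughly the capability the
seekable wrapper adds in the pairing described above. Python's standard
library offers the same idea ready-made as `tempfile.SpooledTemporaryFile`.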
You can see all stream implementations inside the
Microsoft.BizTalk.Streaming assembly using the following URL:
Best Practices Accessing Stream Data
We already discussed how streams are a much better option in pipeline
components than XmlDocument, which loads the entire message into memory.
So, since we cannot use the familiar XmlDocument methods such as SelectNodes
and SelectSingleNode, what options does BizTalk Server give us natively to
navigate the message?
The Microsoft.BizTalk.XPathReader assembly provides the XPathReader and
XPathCollection classes.
These classes can be used to navigate the stream after it has been wrapped by
the VirtualStream and ReadOnlySeekableStream.
Of course, you can use classes such as XPathReader and XPathCollection on any
stream, including custom streams such as the one I showed you before.
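The general technique, matching a path while reading XML forward instead of
loading a full document, can be illustrated without the BizTalk classes. The
sketch below uses Python's `xml.etree.ElementTree.iterparse`; it is an analogy
to XPathReader, not its API, and `stream_matches` and the sample XML are
assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_matches(stream, target_path):
    """Yield the text of every element whose path equals target_path."""
    path = []                                   # names of currently open elements
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            if "/" + "/".join(path) == target_path:
                yield elem.text
            path.pop()
            elem.clear()                        # drop content already processed


xml = b"<Orders><Order><Id>1</Id></Order><Order><Id>2</Id></Order></Orders>"
ids = list(stream_matches(io.BytesIO(xml), "/Orders/Order/Id"))
```

Only simple absolute paths are handled here. The point is that matches are
produced while the stream is still being read, so the consumer never needs
the whole document in memory.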
Streaming Support in Orchestrations
The other place where streaming plays an important role, and affects the way
you design your code, is inside orchestrations.
However, since this requires knowledge of the orchestration XLANG engine,
streaming in orchestrations will be discussed in a future post.