XTC Short Introduction

Introduction
Quick start
XTC in detail
Technical notes

Introduction

XTC (Xml Tree Compare) is a diff tool for XML files. It is intended as a 'change detection' tool for two versions of a file (for example, an old and a new one).

The compare process is kept as generic as possible. The only requirement is that the XML documents are well-formed, so XTC can be used for any XML-related format such as SVG, XVL, etc. The result of the compare process is written into a file (the result file). There is also a result visualization that shows the XML structure as trees with the changes marked.

XTC is useful for:

Producing a result file from the compare process that can be used in further processes (e.g. XSLT, see also http://xmldifftool.com/xtc_xslt_en.html).
Comparing XML files 'on the fly' and seeing the changes immediately in the visualization.
Using the server edition for long-running compare processes with large files (the file size is limited only by the available memory on your machine).
Integrating XTC into other processes by using the server edition like a batch process.

Quick Start

A) GUI version:
1. Start XTC by double-clicking on the program icon or by selecting the menu entry in the 'programs' menu of your Windows installation. The program's main window appears. Select the two XML files to be compared by using the buttons 'XML file 1' and 'XML file 2'. Once the files have been selected, their paths are shown to the left of the buttons in the white line edit boxes.

2. Press the large button 'Diff'.
3. Depending on the sizes of the selected files and the hardware of your computer the comparison may take some seconds. XTC will turn the mouse pointer into an hourglass symbol to indicate that the process is running.
4. A message box informs you as soon as the comparison is finished.
5. After the compare process has finished the result can be viewed by pressing the leftmost button on the tool bar (showing the tree symbol) or by choosing the menu entry File->Visualization. A separate window opens showing the XML structure of XML file 1 on the left hand side and the structure of XML file 2 with changes on the right side. The visualization is interactive: If an element in one of the trees is selected, the corresponding element in the other tree and the selected one are highlighted (note that added or deleted elements don't have a corresponding element).

B) Server edition:
1. Open a shell (DOS box on Windows) and enter the directory where the XTC executable is stored.
2. Type 'xtc.exe xmlfile1 xmlfile2 -batch' where 'xmlfile1' is the path to the first of the two XML files and 'xmlfile2' is the path to the second XML file.
3. Depending on the sizes of the selected files and the hardware of your computer the comparison may take some seconds.
4. After the process has finished, a result file can be found in the directory where 'xmlfile2' is stored. If you can't find the result file, please check the configuration file (xtc_cfg.xml in the current directory or, alternatively, in your home directory) for the <writeresultfile> entry. It must look like this:
<xtcparam_bool name="writeresultfile">1</xtcparam_bool>.
Or check the log file xtclog.txt if an error occurred during the comparison.
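When the server edition is driven from another program, the call from step 2 can be wrapped in a small script. The following is only a minimal sketch in Python, assuming that xtc.exe can be started under the given name; the helper names build_xtc_command and run_xtc are made up for this illustration, and nothing is assumed about the meaning of XTC's return code.

```python
import subprocess


def build_xtc_command(file_a, file_b, executable="xtc.exe"):
    """Return the argument list for the documented batch call."""
    # Mirrors 'xtc.exe xmlfile1 xmlfile2 -batch' from step 2.
    return [executable, str(file_a), str(file_b), "-batch"]


def run_xtc(file_a, file_b, executable="xtc.exe"):
    """Start the comparison and wait until the process has finished."""
    # The return code is handed back unchanged because its meaning is not
    # documented here; xtclog.txt remains the place to look for errors.
    completed = subprocess.run(build_xtc_command(file_a, file_b, executable))
    return completed.returncode


if __name__ == "__main__":
    # Only prints the command line; call run_xtc() to actually start XTC.
    print(build_xtc_command("book1.xml", "book2.xml"))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces.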
Result file example (fragment):

XTC in detail

Overview

Motivation

The comparison of XML documents can have different motivations, such as taking a quick look at what has changed since the last revision of a document, or tracking changes in a further step of your XML processing, for example if changes have to be marked in a rendered version of the XML file. Quality assurance may also require the comparison of XML, for example if drawings are stored in an XML-related format (such as SVG). A comparison of two versions can reveal all changes, even very small ones that are not visible in a graphical editor.

Change Marks

Changes are marked in the result file by adding change marks. A change mark is a processing instruction that describes the compare result for the XML element that follows it. For each XML element in file B the compare process results in one of the following ten change marks:

complete match
position change
content change
attribute change
attribute and content change
position and content change
position and attribute change
position and content and attribute change
element added
element deleted

Four change marks (position change, position and content change, position and attribute change, position and content and attribute change) indicate that the element has moved. The distance of the move (relative to the parent element) is indicated by a number added to the change mark; it gives the position difference to the element's former position in file A. A number greater than zero indicates a 'move forward', meaning that the distance to the parent element is greater than before (e.g. because a new element has been added in between). A number smaller than zero indicates that the element is now nearer to its parent element (e.g. because an element has been deleted in between).
Example:

file A:

<parent>
<one/>
<two/>
<three/>
</parent>

file B:

<parent>
<one/>
<two/>
<hundred/>
<three/>
</parent>

The result will be:
<parent>
<one/>
<two/>
<?XTC element added?>
<hundred/>
<?XTC position change 1?>
<three/>
</parent>

(In this example 'complete match' change marks have been omitted for the sake of simplicity).
Note that the change marks (here <?XTC element added?>) do not count when calculating the distance of a move, because they are not 'real' content of the document. Other processing instructions (non XTC processing instructions) count as normal content.
The change marks are also used in the visualization.
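Because the change marks are ordinary processing instructions, a result file can be evaluated with any XML parser that keeps them. The following is a minimal sketch in Python using xml.dom.minidom. It assumes the target name 'XTC' and the mark texts shown in the example above (the texts are configurable, so they may differ in your setup); the function names list_change_marks and move_distance are made up for this illustration.

```python
import re
from xml.dom import minidom


def list_change_marks(xml_text, target="XTC"):
    """Return (element name, change mark text) pairs in document order."""
    marks = []

    def walk(node):
        pending = None  # text of the last XTC mark not yet assigned
        for child in node.childNodes:
            if (child.nodeType == child.PROCESSING_INSTRUCTION_NODE
                    and child.target == target):
                pending = child.data.strip()
            elif child.nodeType == child.ELEMENT_NODE:
                # A mark describes the element that follows it.
                if pending is not None:
                    marks.append((child.tagName, pending))
                    pending = None
                walk(child)

    walk(minidom.parseString(xml_text))
    return marks


def move_distance(mark_text):
    """Return the trailing number of a mark text, or None if there is none."""
    match = re.search(r"(-?\d+)\s*$", mark_text)
    return int(match.group(1)) if match else None


# The result document from the example above.
RESULT = """<parent>
<one/>
<two/>
<?XTC element added?>
<hundred/>
<?XTC position change 1?>
<three/>
</parent>"""

for name, mark in list_change_marks(RESULT):
    print(name, "|", mark, "|", move_distance(mark))
```

Processing instructions with a different target are skipped, which matches the rule that only XTC's own marks are not 'real' content.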

Configuration

To cover a large variety of use cases, XTC has been designed as a generic tool. Being well-formed is the only requirement for the files that are to be compared. As with any generic tool, XTC has to be configured to serve the user's needs as well as possible. XTC's configuration options are explained in the following text.

Note: the following text assumes that you use the GUI version. If you have the command line version only, you must configure XTC by editing XTC's configuration file xtc_cfg.xml. The configuration file is located in the current directory or in your home directory. XTC looks in the current directory first and, if no xtc_cfg.xml is found there, in the user's home directory.
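The lookup order for the configuration file can be reproduced in a script, for example to check a setting before a batch run. The following is a minimal sketch in Python. The function names find_config and writes_result_file are made up for this illustration, and the overall layout of xtc_cfg.xml is not assumed: the sketch only searches for the xtcparam_bool entry named 'writeresultfile' anywhere in the file.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_config(current_dir, home_dir, name="xtc_cfg.xml"):
    """Return the first existing config path, current directory first."""
    for directory in (current_dir, home_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def writes_result_file(config_xml):
    """Return True if the 'writeresultfile' entry is present and set to 1."""
    root = ET.fromstring(config_xml)
    for entry in root.iter("xtcparam_bool"):
        if entry.get("name") == "writeresultfile":
            return (entry.text or "").strip() == "1"
    return False  # entry not present in this file


if __name__ == "__main__":
    path = find_config(Path.cwd(), Path.home())
    if path is None:
        print("no xtc_cfg.xml found")
    else:
        print(path, writes_result_file(path.read_bytes()))
```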

Start the XTC GUI (see Quick start). Press the second button on the toolbar or choose the Edit->configuration menu. A dialog box opens showing a tabbed dialog consisting of four tabs ('General Settings', 'Anchor Elements', 'Text Diffing' and 'Change Marks').

The 'General Settings' tab:

The 'Diff mode' group configures the basic features of the comparison.

normalize spaces: Reduces multiple spaces to one and eliminates carriage returns and tabs inside texts (during the compare process only). This is useful since some editor programs add unwanted spaces and carriage returns.
shift mode: In situations where XML elements of the same parent sharing the same name have changed both their positions and their contents, the change situation may be ambiguous. Due to this lack of information the compare process can't always decide whether an element has changed or has been added as new (sometimes even a human reader would not be sure). If 'Shift mode' is toggled, the program tries to match the 'most similar' element (using a smallest distance algorithm). This works well in many cases, but it can sometimes lead to unwanted results. Without the shift mode such elements are marked as 'added' (and their possible counterparts as 'deleted'). See also section 'Anchor elements' for a more sophisticated approach to deal with the problem.
content mode: Using the content mode, the algorithm tries to find the appropriate counterpart of an element by examining its textual content (all text nodes in the subtree, without the attributes). The counterpart will be the element with the most similar text. Example:

File A:
<para>some text</para>
<para>old story</para>
<para>a lot of new stuff</para>

File B:
<para>new story</para>
<para>a lot of newest stuff</para>
<para>some text</para>

In the example we have three elements with the same name. In file B the order of the elements has changed and some text changes have happened. The content mode algorithm now examines the texts and will find the right counterparts by text similarity:

<para>old story</para> ------------- <para>new story</para>
<para>a lot of new stuff</para> --- <para>a lot of newest stuff</para>
<para>some text</para> ------------ <para>some text</para>

Note: In rare cases the content mode may lead to unexpected results. If you get a strange-looking result, try the comparison without the content mode.
Note: The content mode works also on large elements containing lots of subelements, so it can extend the program's run time.
write result file: Untoggle this if you don't need the result file and you are interested only in the visual representation.
Note: Writing large files to the file system can be time consuming. With large XML files the time for writing the result file to disk may take longer than the compare process itself. This is not a problem of XTC but of the underlying file system.
add deleted nodes to result file: If toggled, elements which are part of file A but not part of file B are added to the result file, marked as 'deleted'. Important note: This may lead to a result file that is no longer a valid XML document (the result file is always well-formed, but depending on your DTD or schema this option can lead to an invalid result document)!
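The idea behind the content mode (pairing same-named elements by the similarity of their texts) can be illustrated with a short script. The following is only a sketch in Python: it uses the similarity ratio of difflib.SequenceMatcher as a stand-in measure, which is not necessarily the measure XTC uses, and the function name match_by_content is made up for this illustration.

```python
from difflib import SequenceMatcher


def match_by_content(texts_a, texts_b):
    """Pair texts of file A and file B, most similar pairs first."""
    scored = []
    for i, old in enumerate(texts_a):
        for j, new in enumerate(texts_b):
            ratio = SequenceMatcher(None, old, new, autojunk=False).ratio()
            scored.append((ratio, i, j))
    # Best-scoring pairs are assigned first; every text is used only once.
    scored.sort(key=lambda item: -item[0])
    used_a, used_b, pairs = set(), set(), []
    for ratio, i, j in scored:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((texts_a[i], texts_b[j]))
    return pairs


# The <para> texts from the content mode example.
FILE_A = ["some text", "old story", "a lot of new stuff"]
FILE_B = ["new story", "a lot of newest stuff", "some text"]

for old, new in match_by_content(FILE_A, FILE_B):
    print(old, "->", new)
```

The sketch has no similarity threshold, so leftover texts are paired even if they have little in common; a real implementation would have to decide when a pair is too dissimilar and mark the elements as added and deleted instead.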

'Root elements' defines how XTC will handle the XML root elements, since they are not part of the diff process (this is sometimes needed for the use of XTC in batch mode).

ignore root elements: Choosing this option enables the comparison of files with different root elements.
only names of root elements must match: the XML documents must have root elements with identical names, but attributes may differ.
root elements must match completely: names and attributes of the root elements must be the same, the only difference allowed is the attribute order.
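The three root element options can be read as three checks of increasing strictness. The following is a minimal sketch in Python; the mode names 'ignore', 'names' and 'complete' and the function name roots_compatible are made up for this illustration and are not XTC configuration values.

```python
import xml.etree.ElementTree as ET


def roots_compatible(xml_a, xml_b, mode="names"):
    """Check the root elements of two documents under one of three modes."""
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    if mode == "ignore":
        # Different root elements are accepted.
        return True
    if mode == "names":
        # Names must be identical, attributes may differ.
        return root_a.tag == root_b.tag
    if mode == "complete":
        # attrib is a dict, so the attribute order plays no role here.
        return root_a.tag == root_b.tag and root_a.attrib == root_b.attrib
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    a = '<book id="1" lang="en"/>'
    b = '<book lang="en" id="1"/>'
    print(roots_compatible(a, b, "complete"))
```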

'File handling' sets parameters for the writing of the result file

file name supplement: The result file is a copy of the second XML file ('file B') supplemented by the change marks and it is located in the same place. To prevent overwriting, the file name is extended by the additional string. Example: Assuming file name of file B (second of the XML files) is 'book2.xml' and the name addition is '_xtc', the result file's name will be 'book2_xtc.xml'.
Indent step: Indicates the number of spaces by which nested elements are indented in the result file (applies only to the 'human readable' save mode).
save mode: 'human readable' writes the XML tree using a separate line for each element, indents nested elements, etc. 'no whitespace' saves the XML tree in one single line, with no spaces, carriage returns or tabs between the elements. This can make further processing (e.g. by XSLT) easier since no dealing with whitespace is necessary.
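A script that post-processes the result file has to derive its name from the name of file B. The following is a minimal sketch in Python of the naming rule described under 'file name supplement'; the function name result_file_path is made up for this illustration, and '_xtc' is only the supplement used in the example above, so pass the value from your own configuration.

```python
from pathlib import Path


def result_file_path(file_b, supplement="_xtc"):
    """Return the path of the result file written next to file B."""
    path = Path(file_b)
    # The supplement is inserted between the base name and the extension.
    return path.with_name(path.stem + supplement + path.suffix)


if __name__ == "__main__":
    print(result_file_path("book2.xml"))
```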

The 'Attribute' tab:

ignore attributes: if toggled, attributes are excluded from the comparison
ignore attribute order: if toggled, attributes are regarded as unchanged if a different order is the only difference; if untoggled, the attribute order is a matter of change detection
compare attribute contents: If toggled, an element's attributes are compared to the attributes of its counterpart element. Each attribute is examined to determine whether it is unchanged (complete match), its content has changed (value change), it has been newly added to the element (attribute added) or it has been deleted (attribute deleted).
To mark the changes the attributes are altered by adding 'Attribute change marks' to the attribute name. In case of a content change (value change) a textual comparison is applied to the attribute's value (see also the tab 'Text diffing', 'min length of common substring' and the text insertion marks are used here too).
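The four attribute states can be illustrated with a short script. The following is a minimal sketch in Python that classifies the attributes of an element and its counterpart, both given as dictionaries; the function name classify_attributes is made up for this illustration. The sketch ignores attribute order and does not reproduce how XTC writes the attribute change marks into the attribute names.

```python
def classify_attributes(old_attrs, new_attrs):
    """Map each attribute name to one of the four states described above."""
    result = {}
    for name, value in new_attrs.items():
        if name not in old_attrs:
            result[name] = "attribute added"
        elif old_attrs[name] == value:
            result[name] = "complete match"
        else:
            result[name] = "value change"
    for name in old_attrs:
        if name not in new_attrs:
            result[name] = "attribute deleted"
    return result


if __name__ == "__main__":
    old = {"id": "3", "lang": "en", "rev": "a"}
    new = {"id": "3", "lang": "de", "status": "draft"}
    for name, state in classify_attributes(old, new).items():
        print(name, "->", state)
```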

The 'Anchor elements' tab:

Anchor elements are used to 'navigate' through the XML tree during the comparison. They serve as a 'hint' to the algorithm and can force the program to find the right counterpart of an element. An anchor is an XML node or an attribute that does not change its content, thus it is identical in both XML documents. If so, the element can serve as an anchor to its parent.

Example:
File A:

<chapter>
<section>
<sectionmeta>
<sectionid>3</sectionid>
<sectionname>.....</sectionname>
</sectionmeta>
<para>....</para>
...
</section>

File B:

<section>
<sectionmeta>
<sectionid>4</sectionid>
<sectionname>.....</sectionname>
</sectionmeta>
<para>....</para>
...
</section>
<section>
...
</section>
</chapter>

Here the element <sectionid> can be defined as an anchor. If the order of the 'section' elements has changed in the second XML document and the element contents (apart from the 'sectionid' element of course) have changed too, the program can now find the suitable counterelement 'section' by searching for its subelement 'sectionid' and comparing its contents. The matching of the anchor elements is a 'match of element content', here the text child node of <sectionid>. The compare functionality compares the texts of the whole subtree of the anchor element (but omits attributes).
In this example the anchor definition would be:

section sectionmeta/sectionid
(see picture)

The path to the anchor element is separated by / (slash).

Also attributes can be defined as anchors. An example would be:
<chapter>
<section>
<sectionmeta id="3">
<sectionname>.....</sectionname>
</sectionmeta>
<para>....</para>
...
</section>

<section>
<sectionmeta id="4">
<sectionname>.....</sectionname>
</sectionmeta>
<para>....</para>
...
</section>
<section>
...
</section>
</chapter>

Here the anchor definition would be:

section sectionmeta/@id

The '@' indicates that the anchor is an attribute.
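How an anchor definition such as 'section sectionmeta/sectionid' or 'section sectionmeta/@id' leads to a counterpart can be sketched in a few lines. The following is only an illustration in Python using xml.etree.ElementTree, not XTC's implementation; the function names anchor_value and match_by_anchor and the sample documents are made up for this purpose.

```python
import xml.etree.ElementTree as ET


def anchor_value(element, anchor_path):
    """Return the anchor content of element, or None if it has no anchor."""
    *steps, last = anchor_path.split("/")
    if last.startswith("@"):
        # Attribute anchor: walk to the owning element, then read the attribute.
        owner = element.find("/".join(steps)) if steps else element
        return None if owner is None else owner.get(last[1:])
    target = element.find(anchor_path)
    # Element anchor: all text of the anchor's subtree, attributes left out.
    return None if target is None else "".join(target.itertext())


def match_by_anchor(parent_a, parent_b, name, anchor_path):
    """Pair same-named children of two parents whose anchor contents agree."""
    by_anchor = {}
    for child in parent_a.findall(name):
        value = anchor_value(child, anchor_path)
        if value is not None:
            by_anchor[value] = child
    pairs = []
    for child in parent_b.findall(name):
        value = anchor_value(child, anchor_path)
        if value is not None and value in by_anchor:
            pairs.append((by_anchor[value], child))
    return pairs


# Two made-up documents in which the sections have swapped places.
DOC_A = ET.fromstring(
    "<chapter>"
    "<section><sectionmeta><sectionid>3</sectionid></sectionmeta><para>old three</para></section>"
    "<section><sectionmeta><sectionid>4</sectionid></sectionmeta><para>old four</para></section>"
    "</chapter>")
DOC_B = ET.fromstring(
    "<chapter>"
    "<section><sectionmeta><sectionid>4</sectionid></sectionmeta><para>new four</para></section>"
    "<section><sectionmeta><sectionid>3</sectionid></sectionmeta><para>new three</para></section>"
    "</chapter>")

for old, new in match_by_anchor(DOC_A, DOC_B, "section", "sectionmeta/sectionid"):
    print(old.findtext("para"), "->", new.findtext("para"))
```

Elements without the defined anchor are simply left unpaired in this sketch; how XTC treats them depends on the 'absolute' checkbox.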

To define a new anchor for an XML element, push the 'Add element' button.
A small dialog window will appear. Enter the element's name and the path to the element's anchor.

Only one anchor should be defined for an XML element. If more than one anchor is defined, only the last one is used. The anchor functionality for attributes works even when 'ignore attributes' is activated. Anchor elements that are defined but not found in the XML structure will be ignored.

The 'absolute' checkbox:

As mentioned before, an anchor serves as a way-finding mechanism to the element's counterpart. If an appropriate counterelement can't be found, two reasons can be distinguished:
1. The anchor's content didn't match any anchor content from the other side (anchor mismatch).
2. The element hasn't got the anchor that has been defined in the configuration.
When the 'absolute' checkbox is checked, elements with anchor mismatches are tagged as 'added' and 'deleted', while elements without the defined anchor remain untouched. If the checkbox is not activated no such distinction is made and all elements for which the anchor property is defined but no counterpart could be found remain untouched.

The 'normalize spaces' checkbox:

Since the anchor's value is an element's content (a textual representation of it), a whole subtree can be an anchor's content. If checked, multiple spaces are reduced to one and carriage returns and tabs inside texts are eliminated. This is useful since some editor programs add unwanted spaces and carriage returns.
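The effect of space normalization on a text can be approximated in a script, for example when anchor contents are compared outside of XTC. The following is a minimal sketch in Python. It assumes that carriage returns, line feeds and tabs are treated like spaces before runs of spaces are reduced to one; whether XTC replaces or removes these characters is not stated here. The function name normalize_spaces is made up for this illustration.

```python
import re


def normalize_spaces(text):
    """Collapse runs of spaces, tabs and line breaks into a single space."""
    # Assumption: line breaks and tabs count as spaces (see the note above).
    return re.sub(r"[ \t\r\n]+", " ", text)


if __name__ == "__main__":
    print(normalize_spaces("some   text\r\n\twith  breaks"))
```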

The 'Text diffing' tab

A textual diff function can be applied to text elements whose contents have changed. The 'enable text diff' checkbox switches the text diff on and off. The algorithm classifies the text into three categories:

unchanged (not marked)
inserted (marked with text insertion start / text insertion end)
deleted (marked with text deletion start / text deletion end)

The Settings:

case sensitive: if toggled, the text comparison is case sensitive (the general XML comparison in XTC is case sensitive by default).
Min length of common substring: The core functionality of the text diff relies on the lcss (longest common substring) algorithm. When two texts are compared, several common substrings of different lengths can often be found. This parameter defines the minimum length a common substring must have to be regarded as a 'valid', useful substring. Examples:
a)
text1: “xml” text2: “xtc”

if the minimum lcss length is set to 1 (the smallest value allowed) the result looks like:

x[-]ml[/-][+]tc[/+]

“x” is the lcss and remained unchanged. “ml” has been deleted, “tc” has been added.
If the minimum lcss length is set to 2 the result looks like:

[-]xml[/-][+]xtc[/+]

No lcss with a length equal to or greater than 2 can be found now, so the lcss “x” is insufficient and is discarded.

b)
text1: “saturday” text2: “sunday”

if the minimum lcss length is set to 1:

s[-]at[/-]u[-]r[/-][+]n[/+]day

if the minimum lcss length is set to 2 (or 3):

[-]satur[/-][+]sun[/+]day

if the minimum lcss length is set to 4:

[-]saturday[/-][+]sunday[/+]

For real (natural language) texts a minimum length for the LCSS of 3 or 4 is useful.
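The effect of the minimum lcss length can be tried out with a short script. The following is a sketch in Python of a recursive longest-common-substring diff that reproduces the examples above. It is not XTC's implementation: it uses difflib.SequenceMatcher.find_longest_match from the Python standard library to find the lcss, and the function name text_diff is made up for this illustration.

```python
from difflib import SequenceMatcher

DEFAULT_MARKS = ("[+]", "[/+]", "[-]", "[/-]")


def text_diff(old, new, min_len=1, marks=DEFAULT_MARKS):
    """Mark deleted and inserted parts of new relative to old."""
    ins_start, ins_end, del_start, del_end = marks
    if not old and not new:
        return ""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    m = matcher.find_longest_match(0, len(old), 0, len(new))
    if m.size < max(min_len, 1):
        # No sufficiently long common substring: everything is a change.
        out = ""
        if old:
            out += del_start + old + del_end
        if new:
            out += ins_start + new + ins_end
        return out
    # Keep the common substring and diff the texts to its left and right.
    left = text_diff(old[:m.a], new[:m.b], min_len, marks)
    right = text_diff(old[m.a + m.size:], new[m.b + m.size:], min_len, marks)
    return left + old[m.a:m.a + m.size] + right


if __name__ == "__main__":
    for min_len in (1, 2, 4):
        print(min_len, text_diff("saturday", "sunday", min_len))
```

With a minimum length of 1 even single shared letters are kept as unchanged text, which explains why small values produce fragmented results on natural language texts.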
Finally, the text marks for insertion and deletion can be entered. The default values are:

[+] text insertion start
[/+] text insertion end
[-] text deletion start
[/-] text deletion end

The 'Change Marks' tab:

This tab provides the possibility to enter your own texts for the change marks.
The checkboxes indicate if a change mark will appear in the result document. Every change mark can be switched on or off.

Technical notes

XTC is programmed in C++ using the Qt library (originally from Trolltech, later Nokia).

A special XML-API has been developed to assure flexibility and very good performance of the tool.

XTC runs under Windows 2000, XP, Vista and Windows 7.
Versions for other operating systems will be delivered on demand.

http://xmldifftool.com

Questions:
info@xmldifftool.com

Copyright © 2009-2012 Martin Achtziger