* HTML stands for Hyper Text Markup Language
* An HTML file is a text file containing small markup tags
* The markup tags tell the Web browser how to display the page
* An HTML file must have an htm or html file extension
* An HTML file can be created using a simple text editor
Here's an example:
<strong>Hello</strong>, this is a web page!
<font color=red>This is some red text</font>
* XML stands for EXtensible Markup Language
* XML is a markup language much like HTML
* XML was designed to describe data
* XML tags are not predefined. You must define your own tags
* XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
* XML with a DTD or XML Schema is designed to be self-descriptive
* XML is a W3C Recommendation
If you use an RSS reader to access blogs or other newsfeeds, you use XML.
Really Simple Syndication (RSS) produces XML-based feeds summarizing frequently
updated content.
Here's a real example from nytimes.com:
<?xml version="1.0" encoding="UTF-8"?>
<title>NYT > Sunday Book Review</title>
<copyright>Copyright 2007 The New York Times Company</copyright>
<lastBuildDate>Fri, 24 Aug 2007 20:05:02 GMT</lastBuildDate>
<title>NYT > Sunday Book Review</title>
<title>On the Road Again</title>
<description>The novel that “On the Road” became was inarguably the book that young people needed in 1957, but the sparse and unassuming scroll is the living version for our time.</description>
<pubDate>Sun, 19 Aug 2007 02:56:43 GMT</pubDate>
Elements are surrounded by tags. Tags come in pairs. The open tag
identifies the beginning of the element and the closing tag, denoted by the /
before the tag name, identifies the end of the element. Tag names are
generally fairly intuitive descriptions of the data that will be contained in
the element. For example, as you might expect, the text between the author
tags is the name of an author. In some cases, you may find empty elements
that look as follows: <description/>.
Elements may contain text, other elements, and attributes (discussed
below). In the example above, the rss element contains one element, channel.
The channel element contains elements title, link, description, language,
copyright, lastBuildDate, image, and item.
Elements form a tree structure or hierarchy. We'll talk about trees toward
the end of the semester, but following is some relevant tree terminology:
- root - The root of a tree is the outermost element, in
this case rss.
- child - The children of an element are the elements it
contains. The element channel is a child of rss. The element copyright is
a child of channel. The element author is a child of item.
- sibling - The siblings of an element are the elements
that share its parent. The element image is a sibling of item.
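The tree terms above map onto the DOM API described later in these notes. The following is a minimal sketch, assuming a hand-written miniature feed (the SAMPLE string and the TreeTerms class are made up for illustration and are not the nytimes.com feed). It finds the root, walks to a child, and steps across siblings.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; all names here are illustrative.
final class TreeTerms {
    // Hand-written miniature feed, not the real nytimes.com document.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<title>Sample Feed</title>"
        + "<image><title>Logo</title></image>"
        + "<item><title>First Story</title><author>A. Writer</author></item>"
        + "</channel></rss>";

    // Parses the XML text and returns the root element of the tree.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // First child of parent that is an element (text nodes are skipped), or null.
    static Element firstChildElement(Node parent) {
        return nextElement(parent.getFirstChild());
    }

    // Next sibling of node that is an element, or null.
    static Element nextSiblingElement(Node node) {
        return nextElement(node.getNextSibling());
    }

    // Walks the sibling chain starting at n until an element is found.
    private static Element nextElement(Node n) {
        while (n != null && !(n instanceof Element)) {
            n = n.getNextSibling();
        }
        return (Element) n;
    }

    public static void main(String[] args) {
        Element rss = root(SAMPLE);
        Element channel = firstChildElement(rss);      // child of rss
        Element title = firstChildElement(channel);    // child of channel
        Element image = nextSiblingElement(title);     // sibling of title
        Element item = nextSiblingElement(image);      // sibling of image
        System.out.println("root: " + rss.getTagName());
        System.out.println("child of root: " + channel.getTagName());
        System.out.println("sibling of image: " + item.getTagName());
        System.out.println("parent of item: " + item.getParentNode().getNodeName());
    }
}
```

The helper methods skip non-element nodes because a DOM tree also stores text (including whitespace between tags) as child nodes.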
Attributes are name-value pairs that provide some information about the
characteristics of an element. In the example above, the element rss has an
attribute version. The version attribute has a value of 2.0. An element may
have multiple attributes.
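As a rough illustration of reading attributes with the DOM API, the sketch below lists the attributes of a document's root element. The AttributeDemo class and the example.com URL are assumptions made for this example; the length and type values mirror the enclosure element in the NPR feed shown later in these notes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative.
final class AttributeDemo {
    // Returns the root element's attributes as name -> value, sorted by name.
    static Map<String, String> rootAttributes(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        // A TreeMap is used so the result has a predictable order.
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            result.put(attr.getNodeName(), attr.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // One attribute: version.
        System.out.println(rootAttributes("<rss version=\"2.0\"/>"));
        // Several attributes on one element (the URL is a placeholder).
        System.out.println(rootAttributes(
            "<enclosure url=\"http://example.com/story.mp3\""
            + " length=\"6456767\" type=\"audio/mpeg\"/>"));
    }
}
```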
There are two models for parsing XML: DOM and SAX.
DOM - Document Object Model
A DOM parser reads an XML document, for example from a file, and builds a
tree in memory. The programmer can then access and manipulate the information
stored in the document by traversing the tree structure. Essentially, the job
of the parser is to identify where elements start and end, and build objects
to represent each element.
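To make the idea of traversing the in-memory tree concrete, here is a small sketch that parses a string and produces an indented outline of its elements. The DomOutline class and its method names are hypothetical, chosen only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative.
final class DomOutline {
    // Builds the DOM tree, then returns one line per element, indented by depth.
    static String outline(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    // Depth-first traversal: record this node if it is an element, then visit children.
    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<rss><channel><title>Feed</title><item><title>Story</title></item></channel></rss>"));
    }
}
```

Because the whole tree is held in memory, the program can visit nodes in any order and as many times as it likes, at the cost of memory proportional to the document size.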
SAX - Simple API for XML
A SAX parser reads an XML document and generates events when elements are
found. The user defines the actions to be taken as different types of elements
are encountered.
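The event-driven style can be sketched as follows. This is only an illustration, assuming we want the text of every title element; the SaxTitles class is hypothetical, and the handler reacts to three kinds of events (element start, character data, element end).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is illustrative.
final class SaxTitles {
    // Collects the text of every <title> element, in document order.
    static List<String> titles(String xml) {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("title")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so it is accumulated.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title") && text != null) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(titles(
            "<channel><title>NYT &gt; Sunday Book Review</title>"
            + "<item><title>On the Road Again</title></item></channel>"));
    }
}
```

Unlike the DOM approach, no tree is built: the handler sees each piece of the document once, as it is read, and keeps only what it chooses to store.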
If you take a look at the NPR
Story of the Day, you'll notice that the XML looks a bit different.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/include/xsl/podcast.xsl"?>
<rss version="2.0" xmlns:npr="http://www.npr.org/rss/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<title>NPR: Story of the Day</title>
<description>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</description>
<copyright>Copyright 2007 NPR - For Personal Use Only</copyright>
<generator>NPR/RSS Generator 2.0</generator>
<lastBuildDate>Thu, 30 Aug 2007 01:06:17 EDT</lastBuildDate>
<itunes:summary>Funny, moving, exceptional, or just offbeat -- the NPR story people will be talking about tomorrow. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:summary>
<itunes:subtitle>Editors' Pick. The best of Morning Edition, All Things Considered and other award-winning NPR programs.</itunes:subtitle>
<itunes:author>National Public Radio</itunes:author>
<itunes:keywords>story,of,the,day,NPR,National Public Radio,Story of the Day,Morning Edition,All Things Considered,Fresh Air</itunes:keywords>
<title>Story of the Day</title>
<itunes:category text="Society &amp; Culture"/>
<title>New Orleans Suffers Crisis in Mental Health Care</title>
<description>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</description>
<pubDate>Thu, 30 Aug 2007 01:06:08 EDT</pubDate>
<itunes:summary>Two years after Hurricane Katrina, many New Orleans residents need mental health care, but there are few resources and almost no psychiatric beds. With nowhere to turn, people in the city have been forced to take drastic steps.</itunes:summary>
<itunes:keywords>NPR,National Public Radio,New Orleans Suffers Crisis in Mental Health Care,</itunes:keywords>
<enclosure url="http://podcastdownload.npr.org/anon.npr-podcasts/podcast/1090/14042689/npr_14042689.mp3" length="6456767" type="audio/mpeg"/>
Among other things, you see a set of tags that have the prefix
itunes. As you might imagine, the elements with tags beginning with
itunes provide information that can be used by the iTunes program
when it processes the feed. A standard RSS reader can process this same feed,
but may ignore any elements with tags in the itunes namespace.
The web page: http://www.feedforall.com/directory-namespace.htm
lists some other common namespaces. Notice that the same tag suffix may
appear in multiple namespaces. For example, two name spaces may support a
summary tag. However, using the namespace prefix enables the
developer to distinguish between, say, itunes:summary and summary in another
namespace.
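One way to tell same-named tags apart in Java is to make the DOM parser namespace-aware and look elements up by namespace URI. The sketch below assumes a made-up second namespace (http://example.com/other) and a hand-written sample document; only the itunes namespace URI is taken from the NPR feed above. The NamespaceDemo class is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative.
final class NamespaceDemo {
    // Namespace URI declared by the NPR feed for the itunes prefix.
    static final String ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd";
    // Made-up namespace, used only to show a clashing summary tag.
    static final String OTHER_NS = "http://example.com/other";

    // Hand-written sample with a summary element in each namespace.
    static final String SAMPLE =
        "<rss version=\"2.0\" xmlns:itunes=\"" + ITUNES_NS + "\""
        + " xmlns:other=\"" + OTHER_NS + "\"><channel>"
        + "<itunes:summary>From iTunes</itunes:summary>"
        + "<other:summary>From elsewhere</other:summary>"
        + "</channel></rss>";

    // Text of the first summary element in the given namespace, or null if none.
    static String summaryIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace support is off by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, "summary");
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(summaryIn(SAMPLE, ITUNES_NS));
        System.out.println(summaryIn(SAMPLE, OTHER_NS));
    }
}
```

The lookup uses the namespace URI, not the prefix, so it would still work if a feed bound the same URI to a different prefix.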
XML and Java
XMLTester.java - a very simple
Java provides both DOM and SAX parsers in the javax.xml.parsers package.
The DOM parser produces a Document object, where Document is in the
org.w3c.dom package. The Document represents the entire XML tree, which is
composed of Node objects. The Node interface provides an API to traverse the
tree. Node has several subinterfaces, the most notable of which are Text and
Element. All components in the tree are Nodes, but some are Elements and some
are Text, and there are a few other subinterfaces as well. Below are a few of
the most relevant APIs. For a full listing, see the Java API.
DocumentBuilderFactory - Defines a factory API that
enables applications to obtain a parser that produces DOM object trees from
XML documents.
- static DocumentBuilderFactory newInstance() - Obtain a new
instance of a DocumentBuilderFactory.
- DocumentBuilder newDocumentBuilder() - Creates a new
instance of a DocumentBuilder using the currently configured
parameters.
DocumentBuilder - Defines the API to obtain DOM Document
instances from an XML document. Using this class, an application programmer
can obtain a Document from XML.
- Document parse(File f) - Parse the content of the
given file as an XML document and return a new DOM Document object.
- abstract Document parse(InputSource is) - Parse the
content of the given input source as an XML document and return a new DOM
Document object.
- Document parse(InputStream is) - Parse the content of
the given InputStream as an XML document and return a new DOM Document
object.
- Document parse(InputStream is, String systemId) -
Parse the content of the given InputStream as an XML document and return
a new DOM Document object.
- Document parse(String uri) - Parse the content of
the given URI as an XML document and return a new DOM Document object.
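Putting the factory and builder together, a minimal sketch of the usual sequence (obtain a factory, create a builder, parse) might look like this. It uses the parse(InputStream) overload on an in-memory string so that it runs without any files; the ParseDemo class and the feed.xml file name in the comment are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative.
final class ParseDemo {
    // Parses XML text into a DOM Document via the parse(InputStream) overload.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            // For a file on disk, builder.parse(new File("feed.xml")) would be
            // used instead; "feed.xml" is a made-up name.
            return builder.parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Name of the root element of the parsed document.
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<rss version=\"2.0\"><channel/></rss>";
        System.out.println(rootName(xml));
    }
}
```

All of the parse overloads return the same kind of Document; they differ only in where the XML text comes from.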
Node - The primary datatype of the DOM; represents a single node in the
document tree.
- NodeList getChildNodes() - A NodeList that contains
all children of this node.
- Node getFirstChild() - The first child of this node.
- Node getLastChild() - The last child of this node.
- Node getNextSibling() - The node immediately following
this node.
- String getNodeName() - The name of this node,
depending on its type: the tag name for an Element, "#text" for a Text node.
- String getNodeValue() - The value of this node,
depending on its type: the text content for a Text node, null for an Element.
Document - Represents the entire XML document; the root of the document tree.
- NodeList getElementsByTagName(String tagname) -
Returns a NodeList of all the Elements with a given tag name that are
contained in the document, in document order.
Element - Represents an element in an XML document.
- String getAttribute(String name) - Retrieves an
attribute value by name.
- String getTagName() - The name of the element.
NodeList - An ordered collection of nodes.
- int getLength() - The number of nodes in the list.
- Node item(int index) - Returns the indexth item in the
collection.
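The methods listed above are typically used together. The sketch below is one possible combination, assuming a hand-written sample feed: getElementsByTagName finds the matching elements, getLength and item walk the resulting NodeList, getFirstChild and getNodeValue read an element's text, and getAttribute reads an attribute. The FeedQuery class, its method names, and the sample data are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative.
final class FeedQuery {
    // Hand-written sample; the URL is a placeholder.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel><title>Feed</title>"
        + "<item><title>Story</title>"
        + "<enclosure url=\"http://example.com/story.mp3\" type=\"audio/mpeg\"/>"
        + "</item></channel></rss>";

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Text of every element with the given tag name, in document order.
    // Assumes each match holds plain text as its first child; empty elements give "".
    static List<String> textOf(String xml, String tagName) {
        NodeList matches = parse(xml).getElementsByTagName(tagName);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            Node first = matches.item(i).getFirstChild();
            result.add(first == null ? "" : first.getNodeValue());
        }
        return result;
    }

    // Value of the named attribute on the first element with the given tag name,
    // or null if there is no such element.
    static String attributeOf(String xml, String tagName, String attrName) {
        NodeList matches = parse(xml).getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return null;
        }
        Element element = (Element) matches.item(0);
        return element.getAttribute(attrName);
    }

    public static void main(String[] args) {
        System.out.println(textOf(SAMPLE, "title"));
        System.out.println(attributeOf(SAMPLE, "enclosure", "type"));
    }
}
```

Note that getElementsByTagName searches the whole document, so the channel title and the item title are both returned; narrowing the search to one item would mean calling the Element version of the method on that item instead.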