Website Parse Template
Infobox file format
name = Website Parse Template
icon =
extension = .icdl
mime =
type code =
uniform type =
magic =
owner = [http://www.omfica.org/ OMFICA]
genre = Website Parse Template
container for = ICDL Crawling
extended from = XML
extended to =
standard =
url = [http://www.omfica.org/npo_website_template.php WPT]

Website Parse Template (WPT) is an XML-based open format that describes the HTML structure of website pages. The WPT format allows web crawlers to generate Semantic Web RDF descriptions of web pages. WPT is compatible with the existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL specifications.

WPT Syntax
A Website Parse Template consists of the following sections:
* "Ontology", where the publisher defines the concepts and relations that are used in the website.
* "Templates", where the publisher provides templates for groups of web pages that are similar in content category and structure. The publisher provides the XPath or TagIDs of the HTML elements and links them with the website Ontology concepts.
* "URLs", where the publisher provides URL patterns that collect a group of web pages and link them to a "Parse Template". In the URLs section the publisher can also separate out a part of a URL as a concept and link it to the website Ontology.

A Website Parse Template begins with an opening <icdl> tag and ends with a closing </icdl> tag. A single Website Parse Template refers to one host, while a single host may have several Website Parse Templates describing its HTML structure. The host must be specified at the beginning of the template, in the <icdl> tag.

WPT Ontology
The Ontology section enumerates and defines all concepts used in the website. The listed concepts must be enclosed within <ontology> and </ontology> tags. The ontology name (any meaningful string) must be specified, together with the supported language ("icdl:ontology", "owl" or "") that is used to specify the concepts.

Example 1. Concepts used in Yahoo! Music for the "artist" object

Each concept definition starts with a <concept> tag and ends with a </concept> tag. The <inherit> tag expresses inheritance relations and the <has> tag expresses attribute relations between two concepts. Every defined concept has a default attribute, the object identifier (id), which web crawlers use to correlate the attributes of the same object that appear on different pages of the same website.

Website Parse Template provides several predefined concepts that are common to all kinds of websites:
* "Menu" - navigation bar/menu
* "Logo" - design element/logo
* "Content" - element that contains the main textual content of the page
* "Advertisement" - advertisement/banner
* "External Link" - element that contains external links

WPT Templates
The Templates section contains a number of templates for groups of similarly structured web pages. Each template refers to a single group of similarly structured web pages. XPath references or TagIDs of HTML elements are used to link structured content with the defined concepts. A template description starts with an opening <template> tag and ends with a closing </template> tag. The <template> tag must specify the template name and the language used for the template description. Any string can be chosen as the template name, but the language must be one of the supported language types, e.g. "icdl:template", "rdf" or "".

Example 2. Simple template for a single artist page on Yahoo! Music

A web page may contain structured repeatable content () included in one main HTML element (), which are specified as follows:

Example 3. Repeatable content representation
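
The idea of linking XPath references to concepts, including repeatable content inside one main element, can be illustrated outside WPT itself. The following is a minimal Python sketch, not WPT syntax: the XHTML fragment, the concept names `artist_name` and `album`, the `TEMPLATE` mapping, and the `extract` function are all hypothetical and chosen only for illustration. It uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module, so it assumes a well-formed page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment standing in for an artist page.
PAGE = """
<html>
  <body>
    <div id="artist">
      <h1>Example Artist</h1>
      <ul id="albums">
        <li>First Album</li>
        <li>Second Album</li>
      </ul>
    </div>
  </body>
</html>
"""

# Hypothetical mapping: concept -> (XPath of main element, XPath of repeated item).
# A repeated-item path of None means the main element itself holds the value.
TEMPLATE = {
    "artist_name": (".//div[@id='artist']/h1", None),
    "album": (".//ul[@id='albums']", "li"),
}


def extract(page_xml, template):
    """Return {concept: [values]} for every concept whose main element is found."""
    root = ET.fromstring(page_xml)
    result = {}
    for concept, (container_path, item_path) in template.items():
        container = root.find(container_path)
        if container is None:
            continue  # this page does not contain the element
        if item_path is None:
            result[concept] = [(container.text or "").strip()]
        else:
            # Repeatable content: one value per matching child of the main element.
            result[concept] = [
                (item.text or "").strip() for item in container.findall(item_path)
            ]
    return result


if __name__ == "__main__":
    print(extract(PAGE, TEMPLATE))
```

Real-world HTML is often not well-formed XML, so a crawler would probably need a tolerant HTML parser in place of `ElementTree`; that choice is outside the scope of this sketch.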
If a complex HTML element is already described by another template, a tag can be used to point to that template block. This makes it possible to create hierarchical relations between WPT templates, so that web crawlers can use the specified reference(s) to identify the same object on different pages of a given website.

Example 4. Hierarchical relations between WPT Templates
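
How a crawler might use object identifiers to recognize the same object across pages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the WPT format: the record layout `(concept, id, attributes)`, the sample values, and the function `merge_objects` are hypothetical.

```python
from collections import defaultdict


def merge_objects(records):
    """Merge attribute dicts that share the same (concept, id) key.

    Each record is a hypothetical (concept, object_id, attributes) triple,
    as a crawler might produce after applying a template to one page.
    """
    merged = defaultdict(dict)
    for concept, object_id, attributes in records:
        merged[(concept, object_id)].update(attributes)
    return dict(merged)


# Hypothetical records extracted from three different pages of one website.
records = [
    ("artist", "42", {"fullname": "Example Artist"}),
    ("artist", "42", {"genre": "Jazz"}),
    ("artist", "7", {"fullname": "Another Artist"}),
]

if __name__ == "__main__":
    for key, attributes in merge_objects(records).items():
        print(key, attributes)
```

Keying on the pair of concept and id, rather than on the id alone, is a design choice of this sketch; it avoids merging unrelated objects that happen to share an identifier.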
URLs section
This section defines the URLs and URL patterns that correspond to the groups of similarly structured web pages described in the Templates section. Like the Templates section, the URLs section may consist of several blocks, each of which starts with a <urls> tag and ends with a </urls> tag.

Example 5. URLs/URL patterns
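
As a rough illustration of how a crawler might apply such blocks, the Python sketch below (not WPT syntax) models each URLs block as a template name paired with a regular expression whose named groups stand for ontology concepts. The host `music.example.com`, the template name `artist_page`, the URL layout, and the function `route` are assumptions made up for this sketch; only the concept names "id" and "fullname" are taken from the article.

```python
import re

# Hypothetical URLs blocks: (template name, URL pattern).
# Named groups play the role of concepts defined in the Ontology section.
URL_BLOCKS = [
    (
        "artist_page",
        re.compile(
            r"^http://music\.example\.com/(?P<fullname>[a-z0-9-]+)/(?P<id>\d+)/?$"
        ),
    ),
]


def route(url):
    """Return (template_name, concepts) for the first matching block, else None."""
    for template_name, pattern in URL_BLOCKS:
        match = pattern.match(url)
        if match:
            return template_name, match.groupdict()
    return None


if __name__ == "__main__":
    print(route("http://music.example.com/example-artist/42"))
    print(route("http://music.example.com/about"))
```

A URL that matches no block returns `None`, which a crawler could treat as a page that is not described by any template.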
Any string can be chosen as the URLs block name, but the template must be the name of a template described in the previous section. The URL pattern provided in "Example 5" also includes the real URL that it represents. Regular expression (RegExp) syntax is used to describe URL patterns. The concepts needed for a URL pattern definition (such as "id" and "fullname") must be defined beforehand in the Ontology section.

See also
* ICDL Crawling
* Open Market For Internet Content Accessibility
* Semantic Web
* World Wide Web Consortium
* RDF
* OWL
* Regular Expressions
* Universal Networking Language

External links
* [http://www.omfica.org OMFICA]
* [http://www.omfica.org/editor/index.php ICDL Editor]
* [http://www.w3.org W3C]
* [http://www.regular-expressions.info Regular Expressions]
* [http://www.undl.org UNDL]
Wikimedia Foundation. 2010.