http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/ git-svn-id: https://svn.apache.org/repos/asf/xml/commons/trunk@226242 13f79535-47bb-0310-9956-ffa450edef68
224 lines
9.3 KiB
HTML
224 lines
9.3 KiB
HTML
<!DOCTYPE html PUBLIC
|
|
"-//W3C//DTD HTML 4.01 Transitional//EN"
|
|
"http://www.w3.org/TR/html4/loose.dtd">
|
|
<!--
|
|
Generated: Wed Apr 07 13:11:22 EDT 2004 jfouffa.w3.org
|
|
-->
|
|
<html lang='en-US'>
|
|
<head>
|
|
<title>Accessing code point boundaries</title>
|
|
<link rel='stylesheet' type='text/css' href='./spec.css'>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
|
<link rel='stylesheet' type='text/css' href='W3C-REC.css'>
|
|
<link rel='next' href='idl-definitions.html'>
|
|
<link rel='contents' href='Overview.html#contents'>
|
|
<link rel='copyright' href='copyright-notice.html'>
|
|
<link rel='glossary' href='glossary.html'>
|
|
<link rel='Start' href='Overview.html'>
|
|
<link rel='index' href='def-index.html'>
|
|
<link rel='author' href='mailto:www-dom@w3.org'>
|
|
<link rel='help' href='http://www.w3.org/DOM/'>
|
|
<link rel='prev' href='configuration-settings.html'>
|
|
</head>
|
|
<body>
|
|
<div class='navbar' style='text-align: center'>
|
|
<map id='navbar-top' name='navbar-top' title='Navigation Bar'><p>
|
|
[<a title='Configuration Settings' accesskey='p' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
|
|
[<a title='IDL Definitions' accesskey='n' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] [<a title='Table of Contents' accesskey='c' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] [<a title='Index'
|
|
accesskey='i' href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
|
|
<hr title='Navigation area separator'>
|
|
</map></div>
|
|
<div class='noprint' style='text-align: right'>
|
|
<p style='font-family: monospace;font-size:small'>07 April 2004</p>
|
|
</div>
|
|
|
|
<div class='div1'><a name='i18n'></a>
|
|
<h1 id='i18n-h1' class='adiv1'>Appendix E: Accessing code point boundaries</h1>
|
|
<dl>
|
|
<dd>Mark Davis, IBM</dd>
|
|
<dd>Lauren Wood, SoftQuad Software Inc.</dd>
|
|
</dl>
|
|
<div class='noprint'>
|
|
<h2 id='table-of-contents'>Table of contents</h2>
|
|
<ul class='toc'>
|
|
<li class='tocline3'><a class='tocxref' href='#i18n-introduction'>E.1 Introduction</a>
|
|
</li>
|
|
<li class='tocline3'><a class='tocxref' href='#i18n-methods'>E.2 Methods</a>
|
|
<ul class='toc'>
|
|
<li class='tocline4'><a href='#i18n-methods-StringExtend'>StringExtend</a></ul></li>
|
|
</ul>
|
|
</div>
|
|
|
|
<div class='div2'><a name='i18n-introduction'></a>
|
|
<h2 id='i18n-introduction-h2' class='adiv2'>E.1 Introduction</h2>
|
|
<p>
|
|
This appendix is an informative, not a normative, part of the Level 3 DOM
|
|
specification.
|
|
<p>
|
|
Characters are represented in Unicode by numbers called <i>code
|
|
points</i> (also called <i>scalar values</i>). These numbers can range
|
|
from 0 up to 1,114,111 = 10FFFF<sub>16</sub> (although some of these values are
|
|
illegal). Each code point can be directly encoded with a 32-bit code unit.
|
|
This encoding is termed UCS-4 (or UTF-32).
|
|
The DOM specification, however, uses UTF-16, in which the most frequent
|
|
characters (which have values less than FFFF<sub>16</sub>) are represented
|
|
by a single 16-bit code unit, while characters above FFFF<sub>16</sub>
|
|
use a special pair of code units called a <i>surrogate pair</i>. For more information,
|
|
see [<cite><a class='noxref normative' href='references.html#Unicode'>Unicode</a></cite>] or the Unicode Web site.
|
|
<p>
|
|
While indexing by code points as opposed to code units is not
|
|
common in programs, some specifications such as [<cite><a class='noxref informative' href='references.html#XPath10'>XPath 1.0</a></cite>] (and therefore XSLT and [<cite><a class='noxref informative' href='references.html#XPointer'>XPointer</a></cite>]) use code point indices. For
|
|
interfacing with such formats it is recommended that the
|
|
programming language provide string processing methods for
|
|
converting code point indices to code unit indices and back. Some
|
|
languages do not provide these functions natively; for these it is
|
|
recommended that the native <code>String</code> type that is bound
|
|
to <a href='core.html#DOMString'><code>DOMString</code></a> be extended to enable this
|
|
conversion. An example of how such an API might look is supplied
|
|
below.
|
|
<p><b>Note:</b>
|
|
Since these methods are supplied as an illustrative example of the type
|
|
of functionality that is required, the names of the methods,
|
|
exceptions, and interface may differ from those given here.
|
|
</p>
|
|
</div> <!-- div2 i18n-introduction -->
|
|
<div class='div2'><a name='i18n-methods'></a>
|
|
<h2 id='i18n-methods-h2' class='adiv2'>E.2 Methods</h2>
|
|
|
|
<dl>
|
|
<dt><b>Interface <i><a name='i18n-methods-StringExtend'>StringExtend</a></i></b></dt>
|
|
<dd>
|
|
<p>Extensions to a language's native String class or interface
|
|
<dl>
|
|
<dt><br><b>IDL Definition</b></dt>
|
|
<dd>
|
|
<div class='idl-code'>
|
|
<pre>
|
|
interface <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend'>StringExtend</a> {
|
|
int <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset16'>findOffset16</a>(in int offset32)
|
|
raises(StringIndexOutOfBoundsException);
|
|
int <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset32'>findOffset32</a>(in int offset16)
|
|
raises(StringIndexOutOfBoundsException);
|
|
};
|
|
</pre>
|
|
</div><br>
|
|
</dd>
|
|
|
|
<dt><b>Methods</b></dt>
|
|
<dd><dl>
|
|
|
|
<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset16'>findOffset16</a></code></dt>
|
|
<dd>
|
|
<div class='method'>
|
|
Returns the UTF-16 offset that corresponds to a UTF-32 offset.
|
|
Used for random access.<p><b>Note:</b>
|
|
You can always round-trip from a UTF-32 offset to a UTF-16
|
|
offset and back. You can round-trip from a UTF-16 offset to
|
|
a UTF-32 offset and back if and only if the offset16 is not
|
|
in the middle of a surrogate pair. Unmatched surrogates
|
|
count as a single UTF-16 value.
|
|
</p>
|
|
<div class='parameters'>
|
|
<b>Parameters</b>
|
|
<div class='paramtable'>
|
|
<dl>
|
|
<dt><code class='parameter-name'>offset32</code> of type
|
|
<code>int</code></dt><dd>
|
|
|
|
UTF-32 offset.
|
|
<br>
|
|
</dd>
|
|
</dl>
|
|
</div></div> <!-- parameters -->
|
|
|
|
<div class='return'>
|
|
<b>Return Value</b>
|
|
<div class='returntable'>
|
|
<table summary='Layout table: the first cell contains
|
|
the type of the return value, the second contains a short description'
|
|
border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
|
|
<p>
|
|
UTF-16 offset</td></tr></table>
|
|
</div></div> <!-- return -->
|
|
|
|
<div class='exceptions'>
|
|
<b>Exceptions</b>
|
|
<div class='exceptiontable'>
|
|
<table summary='Layout table: the first cell contains
|
|
the type of the exception, the second contains
|
|
the specific error code and a short description'
|
|
border='0'>
|
|
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
|
|
<p>
|
|
if <code>offset32</code> is out of bounds.
|
|
</td></tr>
|
|
</table>
|
|
</div></div> <!-- exceptions -->
|
|
</div> <!-- method -->
|
|
</dd>
|
|
|
|
<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset32'>findOffset32</a></code></dt>
|
|
<dd>
|
|
<div class='method'>
|
|
|
|
Returns the UTF-32 offset corresponding to a UTF-16 offset. Used
|
|
for random access. To find the UTF-32 length of a string, use:
|
|
<div class='eg'>
|
|
<pre>len32 = findOffset32(source, source.length());</pre>
|
|
</div>
|
|
<p><b>Note:</b>
|
|
If the UTF-16 offset is into the middle of a surrogate pair,
|
|
then the UTF-32 offset of the <em>end</em> of the pair is
|
|
returned; that is, the index of the char after the end of the
|
|
pair. You can always round-trip from a UTF-32 offset to a UTF-16
|
|
offset and back. You can round-trip from a UTF-16 offset to a
|
|
UTF-32 offset and back if and only if the offset16 is not in
|
|
the middle of a surrogate pair. Unmatched surrogates count as a
|
|
single UTF-16 value.
|
|
</p>
|
|
<div class='parameters'>
|
|
<b>Parameters</b>
|
|
<div class='paramtable'>
|
|
<dl>
|
|
<dt><code class='parameter-name'>offset16</code> of type
|
|
<code>int</code></dt><dd>
|
|
UTF-16 offset<br>
|
|
</dd>
|
|
</dl>
|
|
</div></div> <!-- parameters -->
|
|
|
|
<div class='return'>
|
|
<b>Return Value</b>
|
|
<div class='returntable'>
|
|
<table summary='Layout table: the first cell contains
|
|
the type of the return value, the second contains a short description'
|
|
border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
|
|
<p>
|
|
UTF-32 offset</td></tr></table>
|
|
</div></div> <!-- return -->
|
|
|
|
<div class='exceptions'>
|
|
<b>Exceptions</b>
|
|
<div class='exceptiontable'>
|
|
<table summary='Layout table: the first cell contains
|
|
the type of the exception, the second contains
|
|
the specific error code and a short description'
|
|
border='0'>
|
|
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
|
|
<p>if offset16 is out of bounds.</td></tr>
|
|
</table>
|
|
</div></div> <!-- exceptions -->
|
|
</div> <!-- method -->
|
|
</dd>
|
|
</dl></dd>
|
|
</dl></dd>
|
|
</dl>
|
|
</div> <!-- div2 i18n-methods --></div> <!-- div1 i18n --><div class='navbar' style='text-align: center'>
|
|
<map id='navbar-bottom' name='navbar-bottom' title='Navigation Bar'><hr title='Navigation area separator'><p>
|
|
[<a title='Configuration Settings' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
|
|
[<a title='IDL Definitions' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] [<a title='Table of Contents' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] [<a title='Index'
|
|
href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
|
|
</map></div>
|
|
</body>
|
|
</html>
|