commons-xml/java/external/xdocs/dom/core/accessing-code-point-boundaries.html

224 lines
9.3 KiB
HTML

<!DOCTYPE html PUBLIC
"-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<!--
Generated: Wed Apr 07 13:11:22 EDT 2004 jfouffa.w3.org
-->
<html lang='en-US'>
<head>
<title>Accessing code point boundaries</title>
<link rel='stylesheet' type='text/css' href='./spec.css'>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel='stylesheet' type='text/css' href='W3C-REC.css'>
<link rel='next' href='idl-definitions.html'>
<link rel='contents' href='Overview.html#contents'>
<link rel='copyright' href='copyright-notice.html'>
<link rel='glossary' href='glossary.html'>
<link rel='Start' href='Overview.html'>
<link rel='index' href='def-index.html'>
<link rel='author' href='mailto:www-dom@w3.org'>
<link rel='help' href='http://www.w3.org/DOM/'>
<link rel='prev' href='configuration-settings.html'>
</head>
<body>
<div class='navbar' style='text-align: center'>
<map id='navbar-top' name='navbar-top' title='Navigation Bar'><p>
[<a title='Configuration Settings' accesskey='p' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
&nbsp; [<a title='IDL Definitions' accesskey='n' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] &nbsp; [<a title='Table of Contents' accesskey='c' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] &nbsp; [<a title='Index'
accesskey='i' href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
<hr title='Navigation area separator'>
</map></div>
<div class='noprint' style='text-align: right'>
<p style='font-family: monospace;font-size:small'>07 April 2004</p>
</div>
<div class='div1'><a name='i18n'></a>
<h1 id='i18n-h1' class='adiv1'>Appendix E: Accessing code point boundaries</h1>
<dl>
<dd>Mark Davis, IBM</dd>
<dd>Lauren Wood, SoftQuad Software Inc.</dd>
</dl>
<div class='noprint'>
<h2 id='table-of-contents'>Table of contents</h2>
<ul class='toc'>
<li class='tocline3'><a class='tocxref' href='#i18n-introduction'>E.1 Introduction</a>
</li>
<li class='tocline3'><a class='tocxref' href='#i18n-methods'>E.2 Methods</a>
<ul class='toc'>
<li class='tocline4'><a href='#i18n-methods-StringExtend'>StringExtend</a></ul></li>
</ul>
</div>
<div class='div2'><a name='i18n-introduction'></a>
<h2 id='i18n-introduction-h2' class='adiv2'>E.1 Introduction</h2>
<p>
This appendix is an informative, not a normative, part of the Level 3 DOM
specification.
<p>
Characters are represented in Unicode by numbers called <i>code
points</i> (also called <i>scalar values</i>). These numbers can range
from 0 up to 1,114,111 = 10FFFF<sub>16</sub> (although some of these values are
illegal). Each code point can be directly encoded with a 32-bit code unit.
This encoding is termed UCS-4 (or UTF-32).
The DOM specification, however, uses UTF-16, in which the most frequent
characters (which have values less than FFFF<sub>16</sub>) are represented
by a single 16-bit code unit, while characters above FFFF<sub>16</sub>
use a special pair of code units called a <i>surrogate pair</i>. For more information,
see [<cite><a class='noxref normative' href='references.html#Unicode'>Unicode</a></cite>] or the Unicode Web site.
<p>
While indexing by code points as opposed to code units is not
common in programs, some specifications such as [<cite><a class='noxref informative' href='references.html#XPath10'>XPath 1.0</a></cite>] (and therefore XSLT and [<cite><a class='noxref informative' href='references.html#XPointer'>XPointer</a></cite>]) use code point indices. For
interfacing with such formats it is recommended that the
programming language provide string processing methods for
converting code point indices to code unit indices and back. Some
languages do not provide these functions natively; for these it is
recommended that the native <code>String</code> type that is bound
to <a href='core.html#DOMString'><code>DOMString</code></a> be extended to enable this
conversion. An example of how such an API might look is supplied
below.
<p><b>Note:</b>
Since these methods are supplied as an illustrative example of the type
of functionality that is required, the names of the methods,
exceptions, and interface may differ from those given here.
</p>
</div> <!-- div2 i18n-introduction -->
<div class='div2'><a name='i18n-methods'></a>
<h2 id='i18n-methods-h2' class='adiv2'>E.2 Methods</h2>
<dl>
<dt><b>Interface <i><a name='i18n-methods-StringExtend'>StringExtend</a></i></b></dt>
<dd>
<p>Extensions to a language's native String class or interface
<dl>
<dt><br><b>IDL Definition</b></dt>
<dd>
<div class='idl-code'>
<pre>
interface <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend'>StringExtend</a> {
int <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset16'>findOffset16</a>(in int offset32)
raises(StringIndexOutOfBoundsException);
int <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset32'>findOffset32</a>(in int offset16)
raises(StringIndexOutOfBoundsException);
};
</pre>
</div><br>
</dd>
<dt><b>Methods</b></dt>
<dd><dl>
<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset16'>findOffset16</a></code></dt>
<dd>
<div class='method'>
Returns the UTF-16 offset that corresponds to a UTF-32 offset.
Used for random access.<p><b>Note:</b>
You can always round-trip from a UTF-32 offset to a UTF-16
offset and back. You can round-trip from a UTF-16 offset to
a UTF-32 offset and back if and only if the offset16 is not
in the middle of a surrogate pair. Unmatched surrogates
count as a single UTF-16 value.
</p>
<div class='parameters'>
<b>Parameters</b>
<div class='paramtable'>
<dl>
<dt><code class='parameter-name'>offset32</code> of type
<code>int</code></dt><dd>
UTF-32 offset.
<br>
</dd>
</dl>
</div></div> <!-- parameters -->
<div class='return'>
<b>Return Value</b>
<div class='returntable'>
<table summary='Layout table: the first cell contains
the type of the return value, the second contains a short description'
border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
<p>
UTF-16 offset</td></tr></table>
</div></div> <!-- return -->
<div class='exceptions'>
<b>Exceptions</b>
<div class='exceptiontable'>
<table summary='Layout table: the first cell contains
the type of the exception, the second contains
the specific error code and a short description'
border='0'>
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
<p>
if <code>offset32</code> is out of bounds.
</td></tr>
</table>
</div></div> <!-- exceptions -->
</div> <!-- method -->
</dd>
<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset32'>findOffset32</a></code></dt>
<dd>
<div class='method'>
Returns the UTF-32 offset corresponding to a UTF-16 offset. Used
for random access. To find the UTF-32 length of a string, use:
<div class='eg'>
<pre>len32 = findOffset32(source, source.length());</pre>
</div>
<p><b>Note:</b>
If the UTF-16 offset is into the middle of a surrogate pair,
then the UTF-32 offset of the <em>end</em> of the pair is
returned; that is, the index of the char after the end of the
pair. You can always round-trip from a UTF-32 offset to a UTF-16
offset and back. You can round-trip from a UTF-16 offset to a
UTF-32 offset and back if and only if the offset16 is not in
the middle of a surrogate pair. Unmatched surrogates count as a
single UTF-16 value.
</p>
<div class='parameters'>
<b>Parameters</b>
<div class='paramtable'>
<dl>
<dt><code class='parameter-name'>offset16</code> of type
<code>int</code></dt><dd>
UTF-16 offset<br>
</dd>
</dl>
</div></div> <!-- parameters -->
<div class='return'>
<b>Return Value</b>
<div class='returntable'>
<table summary='Layout table: the first cell contains
the type of the return value, the second contains a short description'
border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
<p>
UTF-32 offset</td></tr></table>
</div></div> <!-- return -->
<div class='exceptions'>
<b>Exceptions</b>
<div class='exceptiontable'>
<table summary='Layout table: the first cell contains
the type of the exception, the second contains
the specific error code and a short description'
border='0'>
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
<p>if offset16 is out of bounds.</td></tr>
</table>
</div></div> <!-- exceptions -->
</div> <!-- method -->
</dd>
</dl></dd>
</dl></dd>
</dl>
</div> <!-- div2 i18n-methods --></div> <!-- div1 i18n --><div class='navbar' style='text-align: center'>
<map id='navbar-bottom' name='navbar-bottom' title='Navigation Bar'><hr title='Navigation area separator'><p>
[<a title='Configuration Settings' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
&nbsp; [<a title='IDL Definitions' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] &nbsp; [<a title='Table of Contents' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] &nbsp; [<a title='Index'
href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
</map></div>
</body>
</html>