commons-xml/java/external/xdocs/dom/core/accessing-code-point-boundaries.html

<!DOCTYPE html PUBLIC
  "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<!--
 Generated: Wed Apr 07 13:11:22 EDT 2004 jfouffa.w3.org
 -->
<html lang='en-US'>
<head>
  <title>Accessing code point boundaries</title>
  <link rel='stylesheet' type='text/css' href='./spec.css'>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <link rel='stylesheet' type='text/css' href='W3C-REC.css'>
  <link rel='next' href='idl-definitions.html'>
  <link rel='contents' href='Overview.html#contents'>
  <link rel='copyright' href='copyright-notice.html'>
  <link rel='glossary' href='glossary.html'>
  <link rel='Start' href='Overview.html'>
  <link rel='index' href='def-index.html'>
  <link rel='author' href='mailto:www-dom@w3.org'>
  <link rel='help' href='http://www.w3.org/DOM/'>
  <link rel='prev' href='configuration-settings.html'>
</head>
<body>
<div class='navbar' style='text-align: center'>
<map id='navbar-top' name='navbar-top' title='Navigation Bar'><p>
[<a title='Configuration Settings' accesskey='p' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
 &nbsp; [<a title='IDL Definitions' accesskey='n' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] &nbsp; [<a title='Table of Contents' accesskey='c' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] &nbsp; [<a title='Index'
accesskey='i' href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
<hr title='Navigation area separator'>
</map></div>
<div class='noprint' style='text-align: right'>
<p style='font-family: monospace;font-size:small'>07 April 2004</p>
</div>

<div class='div1'><a name='i18n'></a>
<h1 id='i18n-h1' class='adiv1'>Appendix E: Accessing code point boundaries</h1>
<dl>
<dd>Mark Davis, IBM</dd>
<dd>Lauren Wood, SoftQuad Software Inc.</dd>
</dl>
<div class='noprint'>
<h2 id='table-of-contents'>Table of contents</h2>
<ul class='toc'>
<li class='tocline3'><a class='tocxref' href='#i18n-introduction'>E.1 Introduction</a>
</li>
<li class='tocline3'><a class='tocxref' href='#i18n-methods'>E.2 Methods</a>
<ul class='toc'>
<li class='tocline4'><a href='#i18n-methods-StringExtend'>StringExtend</a></ul></li>
</ul>
</div>

<div class='div2'><a name='i18n-introduction'></a>
<h2 id='i18n-introduction-h2' class='adiv2'>E.1 Introduction</h2>
<p>
      This appendix is an informative, not a normative, part of the Level 3 DOM
      specification.
    <p>
      Characters are represented in Unicode by numbers called <i>code
      points</i> (also called <i>scalar values</i>). These numbers can range
      from 0 up to 1,114,111 = 10FFFF<sub>16</sub> (although some of these values are
      illegal). Each code point can be directly encoded with a 32-bit code unit.
      This encoding is termed UCS-4 (or UTF-32).
      The DOM specification, however, uses UTF-16, in which the most frequent
      characters (which have values less than FFFF<sub>16</sub>) are represented
      by a single 16-bit code unit, while characters above FFFF<sub>16</sub>
      use a special pair of code units called a <i>surrogate pair</i>. For more information,
      see [<cite><a class='noxref normative' href='references.html#Unicode'>Unicode</a></cite>] or the Unicode Web site.
    <p>
      While indexing by code points as opposed to code units is not
      common in programs, some specifications such as [<cite><a class='noxref informative' href='references.html#XPath10'>XPath 1.0</a></cite>] (and therefore XSLT and [<cite><a class='noxref informative' href='references.html#XPointer'>XPointer</a></cite>]) use code point indices.  For
      interfacing with such formats it is recommended that the
      programming language provide string processing methods for
      converting code point indices to code unit indices and back. Some
      languages do not provide these functions natively; for these it is
      recommended that the native <code>String</code> type that is bound
      to <a href='core.html#DOMString'><code>DOMString</code></a> be extended to enable this
      conversion. An example of how such an API might look is supplied
      below.
    <p><b>Note:</b>
	Since these methods are supplied as an illustrative example of the type
	of functionality that is required, the names of the methods,
	exceptions, and interface may differ from those given here.
      </p>
</div> <!-- div2 i18n-introduction -->
<div class='div2'><a name='i18n-methods'></a>
<h2 id='i18n-methods-h2' class='adiv2'>E.2 Methods</h2>

<dl>
<dt><b>Interface <i><a name='i18n-methods-StringExtend'>StringExtend</a></i></b></dt>
<dd>
<p>Extensions to a language's native String class or interface
<dl>
<dt><br><b>IDL Definition</b></dt>
<dd>
<div class='idl-code'>
<pre>
interface <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend'>StringExtend</a> {
  int                <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset16'>findOffset16</a>(in int offset32)
                                        raises(StringIndexOutOfBoundsException);
  int                <a class='noxref' href='accessing-code-point-boundaries.html#i18n-methods-StringExtend-findOffset32'>findOffset32</a>(in int offset16)
                                        raises(StringIndexOutOfBoundsException);
};
</pre>
</div><br>
</dd>

<dt><b>Methods</b></dt>
<dd><dl>

<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset16'>findOffset16</a></code></dt>
<dd>
<div class='method'>
Returns the UTF-16 offset that corresponds to a UTF-32 offset.
	      Used for random access.<p><b>Note:</b>
		    You can always round-trip from a UTF-32 offset to a UTF-16
		    offset and back. You can round-trip from a UTF-16 offset to
		    a UTF-32 offset and back if and only if the offset16 is not
		    in the middle of a surrogate pair. Unmatched surrogates
		    count as a single UTF-16 value.
		  </p>
<div class='parameters'>
<b>Parameters</b>
<div class='paramtable'>
<dl>
<dt><code class='parameter-name'>offset32</code> of type
<code>int</code></dt><dd>

		  UTF-32 offset.
		<br>
</dd>
</dl>
</div></div> <!-- parameters -->

<div class='return'>
<b>Return Value</b>
<div class='returntable'>
<table  summary='Layout table: the first cell contains
 the type of the return value, the second contains a short description'
 border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
<p>
UTF-16 offset</td></tr></table>
</div></div> <!-- return -->

<div class='exceptions'>
<b>Exceptions</b>
<div class='exceptiontable'>
<table  summary='Layout table: the first cell contains
 the type of the exception, the second contains
the specific error code and a short description'
border='0'>
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
<p>
		  if <code>offset32</code> is out of bounds.
		</td></tr>
</table>
</div></div> <!-- exceptions -->
</div> <!-- method -->
</dd>

<dt><code class='method-name'><a name='i18n-methods-StringExtend-findOffset32'>findOffset32</a></code></dt>
<dd>
<div class='method'>

	      Returns the UTF-32 offset corresponding to a UTF-16 offset. Used
	      for random access. To find the UTF-32 length of a string, use:
	      <div class='eg'>
<pre>len32 = findOffset32(source, source.length());</pre>
</div>
	    <p><b>Note:</b>
		If the UTF-16 offset is into the middle of a surrogate pair,
		then the UTF-32 offset of the <em>end</em> of the pair is
		returned; that is, the index of the char after the end of the
		pair. You can always round-trip from a UTF-32 offset to a UTF-16
		offset and back. You can round-trip from a UTF-16 offset to a
		UTF-32 offset and back if and only if the offset16 is not in
		the middle of a surrogate pair. Unmatched surrogates count as a
		single UTF-16 value.
	      </p>
<div class='parameters'>
<b>Parameters</b>
<div class='paramtable'>
<dl>
<dt><code class='parameter-name'>offset16</code> of type
<code>int</code></dt><dd>
UTF-16 offset<br>
</dd>
</dl>
</div></div> <!-- parameters -->

<div class='return'>
<b>Return Value</b>
<div class='returntable'>
<table  summary='Layout table: the first cell contains
 the type of the return value, the second contains a short description'
 border='0'><tr><td valign='top'><p><code>int</code></p></td><td>
<p>
UTF-32 offset</td></tr></table>
</div></div> <!-- return -->

<div class='exceptions'>
<b>Exceptions</b>
<div class='exceptiontable'>
<table  summary='Layout table: the first cell contains
 the type of the exception, the second contains
the specific error code and a short description'
border='0'>
<tr><td valign='top'><p><code>StringIndexOutOfBoundsException</code></p></td><td>
<p>if offset16 is out of bounds.</td></tr>
</table>
</div></div> <!-- exceptions -->
</div> <!-- method -->
</dd>
</dl></dd>
</dl></dd>
</dl>
</div> <!-- div2 i18n-methods --></div> <!-- div1 i18n --><div class='navbar' style='text-align: center'>
<map id='navbar-bottom' name='navbar-bottom' title='Navigation Bar'><hr title='Navigation area separator'><p>
[<a title='Configuration Settings' href='configuration-settings.html'><strong><u>p</u></strong>revious</a>]
 &nbsp; [<a title='IDL Definitions' href='idl-definitions.html'><strong><u>n</u></strong>ext</a>] &nbsp; [<a title='Table of Contents' href='Overview.html#contents'><strong><u>c</u></strong>ontents</a>] &nbsp; [<a title='Index'
href='def-index.html'><strong><u>i</u></strong>ndex</a>]</p>
</map></div>
</body>
</html>