<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>unicode Archives | Programming Zen</title>
	<atom:link href="https://programmingzen.com/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>https://programmingzen.com/tag/unicode/</link>
	<description>Meditations on programming, startups, and technology</description>
	<lastBuildDate>Fri, 11 Oct 2019 16:06:56 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
<site xmlns="com-wordpress:feed-additions:1">1397766</site>	<item>
		<title>String Length in Elixir</title>
		<link>https://programmingzen.com/string-length-in-elixir/</link>
					<comments>https://programmingzen.com/string-length-in-elixir/#comments</comments>
		
		<dc:creator><![CDATA[Antonio Cangiano]]></dc:creator>
		<pubDate>Mon, 14 Oct 2019 12:00:11 +0000</pubDate>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[elixir]]></category>
		<category><![CDATA[elixir-cookbook]]></category>
		<category><![CDATA[standard library]]></category>
		<category><![CDATA[strings]]></category>
		<category><![CDATA[unicode]]></category>
		<guid isPermaLink="false">https://programmingzen.com/?p=2286</guid>

					<description><![CDATA[<p>In a previous post, I wrote about Hello World in Elixir. Using such a simple program allowed me to discuss a few concepts about the language. This post explores strings further, by discussing how to find the length of a string in Elixir. Simple enough, but there is more than meets the eye. Elixir String Length In Elixir, you can return the number of characters in a string with the String.length/1 function: String.length(&#34;Antonio&#34;) # 7 String.length(&#34;&#34;) # 0 String.length(&#34;r&#233;sum&#233;&#34;) # 6 Discussion In the case of a string like, &#34;Antonio&#34;, there isn&#8217;t much to discuss. String.length(&#34;Antonio&#34;) returns the number of </p>
<p>The post <a href="https://programmingzen.com/string-length-in-elixir/">String Length in Elixir</a> appeared first on <a href="https://programmingzen.com">Programming Zen</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In a previous post, I wrote about <a href="https://programmingzen.com/elixir-hello-world/">Hello World in Elixir</a>. Using such a simple program allowed me to discuss a few concepts about the language. This post explores strings further, by discussing how to find the <strong>length of a string in Elixir</strong>.</p>



<p>Simple enough, but there is more than meets the eye.</p>



<h2 class="wp-block-heading">Elixir String Length</h2>



<p>In Elixir, you can return the number of characters in a string with the <code>String.length/1</code> function:</p>



<pre class="wp-block-code"><code class="prettyprinted">String.length("Antonio") # 7
String.length("")        # 0
String.length("résumé")  # 6</code></pre>



<h2 class="wp-block-heading">Discussion</h2>



<p>For a string like <code>"Antonio"</code>, there isn&#8217;t much to discuss. <code>String.length("Antonio")</code> returns the number of characters in the string.</p>



<p>The string clearly has 7 characters and since each character in this particular string can be represented with a single byte, its raw representation is 7 bytes as well.</p>



<p>Things become more interesting when a string contains non-ASCII characters.</p>



<p>Consider the string <code>"résumé"</code>. In this case, <code>String.length/1</code> returns 6. You can think of this as the number of &#8220;visible&#8221; or &#8220;user-perceived&#8221; characters in the string. However, let&#8217;s investigate its raw representation:</p>



<pre class="wp-block-code"><code class="prettyprinted">iex(2)&gt; i "résumé"
Term
  "résumé"
Data type
  BitString
Byte size
  8
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded code points in it are printable.
Raw representation
  &lt;&lt;114, 195, 169, 115, 117, 109, 195, 169&gt;&gt;
Reference modules
  String, :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars</code></pre>



<p>You&#8217;ll notice that the string&#8217;s actual data type is BitString. Specifically, this binary is made up of 8 bytes, even though there are only 6 user-perceived characters. This is because each of the two accented é characters takes 2 bytes in UTF-8, while the other four letters take 1 byte each.</p>



<p>If you are interested in the number of bytes within a string, instead of its length, you can use the <code>Kernel.byte_size/1</code> function:</p>



<pre class="wp-block-code"><code class="prettyprinted">iex(3)&gt; string = "résumé"
"résumé"

iex(4)&gt; String.length(string)
6

iex(5)&gt; byte_size(string)
8</code></pre>



<p>From a performance standpoint, it&#8217;s worth noting that <code>Kernel.byte_size/1</code> is more efficient than <code>String.length/1</code>. Unlike the latter, which takes longer as the string grows, <code>Kernel.byte_size/1</code> will return in constant time.</p>
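


<p>A rough way to see the difference for yourself is to time both calls on a large string. The snippet below is only a sketch: the one-million-character test string is an arbitrary choice, and the timings it prints will vary from machine to machine.</p>



<pre class="wp-block-code"><code class="prettyprinted"># Build a large string: one million copies of a 2-byte character.
long = String.duplicate("é", 1_000_000)

# :timer.tc/1 returns {elapsed_microseconds, result}.
{length_us, graphemes} = :timer.tc(fn -&gt; String.length(long) end)
{bytes_us, bytes} = :timer.tc(fn -&gt; byte_size(long) end)

IO.puts("String.length/1: #{graphemes} graphemes in #{length_us} microseconds")
IO.puts("byte_size/1:     #{bytes} bytes in #{bytes_us} microseconds")</code></pre>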



<p><a href="https://hexdocs.pm/elixir/String.html">Strings</a> in Elixir are UTF-8 encoded binaries. You can think of them as sequences of code points. A code point is the number that Unicode assigns to a character, and its UTF-8 representation requires between one and four bytes. For example, the é characters in the string <code>"résumé"</code> are code points whose representation requires two bytes each.</p>
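


<p>To make the relationship between code points and bytes concrete, here is a small sketch that uses only built-in syntax and functions. It writes the é with the <code>\u00E9</code> escape so that there is no ambiguity about which representation is being inspected.</p>



<pre class="wp-block-code"><code class="prettyprinted">e_acute = "\u00E9"   # the precomposed é, code point U+00E9 (233 in decimal)

# One code point...
IO.inspect(String.to_charlist(e_acute))    # [233]

# ...encoded in UTF-8 as two bytes.
IO.inspect(:binary.bin_to_list(e_acute))   # [195, 169]

# The ::utf8 modifier encodes a code point into its UTF-8 bytes.
IO.inspect(&lt;&lt;233::utf8&gt;&gt; == e_acute)       # true</code></pre>



<p>The two bytes, 195 and 169, are the same pair that appears in the raw representation of <code>"résumé"</code> shown earlier.</p>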



<p>The Unicode standard also defines some special characters as the combination of other characters. In other words, even though they appear to the reader as a single character, they are in fact a combination of two or more code points. These are known as grapheme clusters.</p>



<p>For example, the e-acute letter can be represented as a single code point, as we&#8217;ve done so far (this is also known as a precomposed character), or as a combination of two code points (the letter e and a combining acute accent). These look the same, but they are encoded as different byte sequences, and <code>==</code> compares binaries byte by byte; so the two strings below are not equal:</p>



<pre><code class="prettyprinted">iex(6)&gt; "é" == "e&#769;"
false</code></pre>
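


<p>If you need to treat the two forms as the same text, the <code>String</code> module offers <code>String.equivalent?/2</code> and <code>String.normalize/2</code>. The following is a minimal sketch; it uses <code>\u</code> escapes so that the two representations are explicit in the source:</p>



<pre class="wp-block-code"><code class="prettyprinted">precomposed = "\u00E9"    # é as a single code point
combined    = "e\u0301"   # e followed by a combining acute accent

# Byte-by-byte comparison: the binaries differ.
IO.inspect(precomposed == combined)                           # false

# Canonical equivalence: both are normalized before comparing.
IO.inspect(String.equivalent?(precomposed, combined))         # true

# Normalizing to the composed (NFC) or decomposed (NFD) form.
IO.inspect(String.normalize(combined, :nfc) == precomposed)   # true
IO.inspect(String.normalize(precomposed, :nfd) == combined)   # true</code></pre>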



<p>When working with strings, you&#8217;ll often want to consider them in terms of the user-perceived characters, rather than their code points or the binary they actually represent.</p>



<p>To help us out, the Elixir <code>String</code> module provides <code>String.graphemes/1</code>, which returns a list of graphemes without splitting grapheme clusters into their underlying code points. If you need the code points, you can always use <code>String.codepoints/1</code>.</p>



<p>To see the distinction between code points and graphemes in action, consider the following string (using the grapheme cluster built from two code points):</p>



<pre class="wp-block-code"><code class="prettyprinted">iex(7)&gt; String.codepoints("cliche&#769;")
["c", "l", "i", "c", "h", "e", "&#769;"]

iex(8)&gt; String.graphemes("cliche&#769;")
["c", "l", "i", "c", "h", "e&#769;"]</code></pre>



<p>As you can see, <code>String.codepoints/1</code> shows us a list of code points in the string, and the special e-acute gets split into two code points. If you look closely, you&#8217;ll notice the accent as the last code point in the list.</p>



<p><code>String.graphemes/1</code> simply returns the list of graphemes and is the closest thing that we have to a function that provides a list of user-perceived characters.</p>



<p>Now, consider its length and byte size:</p>



<pre class="wp-block-code"><code class="prettyprinted">iex(9)&gt; String.length("cliche&#769;")
6

iex(10)&gt; byte_size("cliche&#769;")
8</code></pre>



<p><code>String.length/1</code> returns 6, the number of user-perceived characters in the string. <code>Kernel.byte_size/1</code> returns 8 because it takes 3 bytes to represent the special character/grapheme (1 for the letter e, and 2 for the combining acute accent).</p>



<p>We used the expression &#8220;visible&#8221; or &#8220;user-perceived&#8221; characters to give us an intuitive understanding. Now that you know more about bytes, code points, and graphemes, we can be a little more precise and say that <code>String.length/1</code> returns the number of graphemes in a string.</p>



<p>Finally, if you want to know the number of code points, you can compose the <code>Kernel.length/1</code> function with the <code>String.codepoints/1</code> function:</p>



<pre class="wp-block-code"><code class="prettyprinted">"cliche&#769;"
  |&gt; String.codepoints()
  |&gt; length()</code></pre>



<p>Note that if you are trying this in IEx, you&#8217;ll need to use the backslash character to continue on new lines:</p>



<pre class="wp-block-code"><code>iex(11)> "cliche&#769;" \
...(11)> |> String.codepoints() \
...(11)> |> length()
7</code></pre>



<p>So in summary, our string contains 6 graphemes, 7 code points, and 8 bytes.</p>
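


<p>If you find yourself checking all three numbers often, you could wrap them in a small helper. The <code>StringStats</code> module and its <code>measure/1</code> function below are hypothetical names made up for this sketch; they are not part of Elixir&#8217;s standard library.</p>



<pre class="wp-block-code"><code class="prettyprinted">defmodule StringStats do
  # Hypothetical helper, not part of Elixir's standard library.
  # Returns the three sizes discussed in this post as a map.
  def measure(string) when is_binary(string) do
    %{
      graphemes: String.length(string),
      codepoints: string |&gt; String.codepoints() |&gt; length(),
      bytes: byte_size(string)
    }
  end
end

# "cliche" followed by a combining acute accent (U+0301).
# The match succeeds only if all three counts are as expected.
%{graphemes: 6, codepoints: 7, bytes: 8} = StringStats.measure("cliche\u0301")</code></pre>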
]]></content:encoded>
					
					<wfw:commentRss>https://programmingzen.com/string-length-in-elixir/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2286</post-id>	</item>
	</channel>
</rss>
