{"id":23,"date":"2023-08-12T04:32:36","date_gmt":"2023-08-12T04:32:36","guid":{"rendered":"https:\/\/nocoffei.com\/?page_id=23"},"modified":"2023-09-17T16:41:58","modified_gmt":"2023-09-17T16:41:58","slug":"itanic-part-1-running-a-dead-architecture-on-modern-hardware","status":"publish","type":"post","link":"https:\/\/nocoffei.com\/?p=23","title":{"rendered":"Itanic, part 1\u2014 running a dead architecture on modern hardware"},"content":{"rendered":"\n<p>Let\u2019s embark on a tale back to the era of yore, when smartphones not yet ruled the earth, Pentiums were king, and Intel believed that x86 had no future.<\/p>\n\n\n\n<p>I wasn\u2019t much into technology at the time Itanium came out, so I only properly heard of it years later. Sometime in late 2021, I was introduced to Raymond Chen\u2019s <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/devblogs.microsoft.com\/oldnewthing\/20150727-00\/?p=90821\">Old New Thing<\/a> article series on the subject. I was instantly hooked. In March 2022, I started a thorough investigation of what I could accomplish with this knowledge.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">If You Were Around When Itanium Was A Thing<\/h2>\n\n\n\n<p>Please skip to the end of this article and read the section \u201cStay Tuned\u201d, and keep it in mind as you read this article.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Anything you can do, I can do more of (at the same time)<\/h2>\n\n\n\n<p>There are two ways to make a series of sequential tasks faster\u2013 do each one quicker, or do multiple in parallel. 
Operations continue to get faster and faster (by way of clock speed increases and general optimizations), but if you have a string of instructions, how do you make them work in parallel?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img decoding=\"async\" src=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113522-1024x570.png\" alt=\"An image of a series of pseudo-assembly instructions. The instructions are: &quot;load a, load b, load c, load d, add e = a + b, add f = c + d, store e, store f&quot;\" class=\"wp-image-65\" style=\"width:900px\" width=\"900\" srcset=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113522-1024x570.png 1024w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113522-300x167.png 300w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113522-768x428.png 768w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113522.png 1336w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Today\u2019s computers do what\u2019s called Out of Order\/Superscalar execution. The idea is, most execution streams that <em>appear<\/em> linear aren\u2019t necessarily dependent on each other. You might notice that the calculations of <code>e<\/code> and <code>f<\/code> aren\u2019t actually dependent on each other\u2013 you load <code>a<\/code> and <code>b<\/code> to calculate <code>e<\/code>, and <code>c<\/code> and <code>d<\/code> to calculate <code>f<\/code>. 
So that\u2019s what modern computers do\u2013 they look ahead in the instruction stream and figure out what needs to be done, then analyze data dependencies, re-order the independent instructions, and finally execute those independent streams in parallel.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"570\" src=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113611-1024x570.png\" alt=\"An image meant to convey how OoO\/SS execution works. The image shows how the linear stream of instructions listed above is re-ordered into independent instruction streams-- &quot;load a, load b, add e = a + b, store e&quot;, and &quot;load c, load d, add f = c + d, store f&quot;.\" class=\"wp-image-66\" srcset=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113611-1024x570.png 1024w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113611-300x167.png 300w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113611-768x428.png 768w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113611.png 1336w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Sound complicated? That\u2019s because it is. These principles work, but out-of-order execution was only just reaching mainstream processors in the 90s, when Itanium was being developed. At the time, Intel had a very different idea of what parallel execution would look like\u2026<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"570\" src=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113636-1024x570.png\" alt=\"Example of how the above instructions would be re-ordered in the Itanium system. 
It reads: &quot;load a -- load b -- no-op;;, load c -- load d --  add e = a + b;;, store e -- add f = c + d -- no op;;, store f -- no op -- no op&quot;\" class=\"wp-image-67\" srcset=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113636-1024x570.png 1024w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113636-300x167.png 300w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113636-768x428.png 768w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113636.png 1336w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Itanium instructions are always grouped into blocks of three, called \u201cbundles\u201d. Each bundle has a \u201ctemplate\u201d that defines which kind of execution unit each instruction slot in the bundle uses (for example, MII means the bundle has one Memory instruction and two Integer instructions.) Instructions are executed in parallel unless the bundle encodes a \u201cstop\u201d (written as <code>;;<\/code>), which marks that the instructions after it depend on the instructions before it. As long as no stops are present, the CPU can continue dispatching bundles in parallel, making a fully parallelized \u201cinstruction group\u201d. For instance, the original Itanium 2 can dispatch up to six instructions (two bundles) in parallel if there are no stops.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"872\" height=\"115\" src=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113712-1.png\" alt=\"Image of what Itanium bundle format looks like. 
The image denotes how bundles are 128 bits long, with three 41-bit instruction slots and a five-bit &quot;template type&quot;.\" class=\"wp-image-69\" srcset=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113712-1.png 872w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113712-1-300x40.png 300w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113712-1-768x101.png 768w\" sizes=\"auto, (max-width: 872px) 100vw, 872px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"872\" height=\"282\" src=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113836.png\" alt=\"Image of a series of example template types. One example shows &quot;M-unit, (WHITE BAR), M-unit, I-unit&quot;. Another shows &quot;M-unit, F-unit, I-unit&quot;. Another shows &quot;M-unit, I-unit, B-unit, (WHITE BAR)&quot;.\" class=\"wp-image-70\" srcset=\"https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113836.png 872w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113836-300x97.png 300w, https:\/\/nocoffei.com\/wp-content\/uploads\/2023\/09\/Screenshot_20230917_113836-768x248.png 768w\" sizes=\"auto, (max-width: 872px) 100vw, 872px\" \/><figcaption class=\"wp-element-caption\"><em>The Itanium instruction format and examples of templates that fit into it. The white bars indicate stops. Taken from the Intel\u00ae Itanium\u00ae Architecture Software Developer\u2019s Manual Volume 3: Intel\u00ae Itanium\u00ae Instruction Set Reference, Chapter 4.<\/em><\/figcaption><\/figure>\n\n\n\n<div class=\"wp-site-blocks\"><main id=\"wp--skip-link--target\" class=\"wp-block-group is-layout-flow\">\n<div class=\"entry-content wp-block-post-content has-global-padding is-layout-constrained\">\n<p>Notice that this is <em>much<\/em> less work for the silicon of the device. 
The <em>compiler<\/em> is the entity responsible for generating parallelism. So instead of carefully analyzing and tracking instructions while they\u2019re in flight, the CPU just has to do dumb execution of the program, knowing that it\u2019s getting massive speedups by executing code in parallel!<\/p>\n<p>Now I must point out that Itanium and its ideas were a failure. This should be obvious, given that it required significant introduction. It struggled to take off in the server space early on, then AMD\u2019s 64-bit extensions to x86 killed off any remaining chance of success it had. More importantly, however, the fundamental idea of Itanium\u2013 parallel instruction decoding\u2013 <em>does not work<\/em>. There are two main reasons for this:<\/p>\n<ol>\n<li>It\u2019s not always possible to fit instructions into groupings of three. Notice that the code I wrote above, which simply calculates two numbers, had to have \u201cno-ops\u201d inserted to make a complete bundle. That\u2019s both because dependencies between instructions had to be accounted for, and because the number of templates is very limited (the five-bit template field allows at most 32), so even if parallelism is present it may not always be possible to encode it as such. 
Compare this to other architectures, which have none of these limitations\u2013 just write instructions and you\u2019re set. If it seems that this is a contrived example meant to forcibly showcase the issue, let\u2019s take a look at the first instructions found in an <code>objdump<\/code> of libc:<\/li>\n<\/ol>\n<\/div>\n<\/main><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code>000000000002e580 &lt;.plt&gt;:\n   2e580:    0b 10 00 1c 00 21   &#91;MMI]       mov r2=r14;;\n   2e586:    e0 60 28 63 4a 00               addl r14=1218700,r2\n   2e58c:    00 00 04 00                     nop.i 0x0;;\n   2e590:    0b 80 20 1c 18 14   &#91;MMI]       ld8 r16=&#91;r14],8;;\n   2e596:    10 41 38 30 28 00               ld8 r17=&#91;r14],8\n   2e59c:    00 00 04 00                     nop.i 0x0;;\n   2e5a0:    11 08 00 1c 18 10   &#91;MIB]       ld8 r1=&#91;r14]\n   2e5a6:    60 88 04 80 03 00               mov b6=r17\n   2e5ac:    60 00 80 00                     br.few b6;;\n   2e5b0:    11 78 00 00 00 24   &#91;MIB]       mov r15=0\n   2e5b6:    00 00 00 02 00 00               nop.i 0x0\n   2e5bc:    d0 ff ff 48                     br.few 2e580 &lt;__h_errno@@GLIBC_PRIVATE+0x2e50c&gt;;;\n   2e5c0:    11 78 04 00 00 24   &#91;MIB]       mov r15=1\n   2e5c6:    00 00 00 02 00 00               nop.i 0x0\n   2e5cc:    c0 ff ff 48                     br.few 2e580 &lt;__h_errno@@GLIBC_PRIVATE+0x2e50c&gt;;;<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>Cache hierarchies. 
With a multi-core, multi-process, modern operating system computing environment, it\u2019s practically impossible to know how physically close your data is to your execution unit.<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Is it in the L1 cache, which is right next to the execution units?<\/li>\n\n\n\n<li>Is it in the L2 cache, which is shared by multiple nearby cores?<\/li>\n\n\n\n<li>Is it in the L3 cache, which is shared by the entire CPU?<\/li>\n\n\n\n<li>Or is it all the way out in DRAM, which is so slow it might as well be on another planet?<br><br>You simply can\u2019t know how long your data accesses will take because the interaction of multiple cores running multiple processes which all need to access data, and a modern OS scheduling those processes based on incredibly detailed minutiae, makes it so that data could be <em>anywhere<\/em> in that massive hierarchy. Out of Order CPUs can absorb unexpected massive latency by tracking another execution stream where data <em>is<\/em> ready and executing that instead. The Itanium approach can\u2019t do this. It instead stalls on <em>all<\/em> execution until <em>one<\/em> load in a bundle is finished, even if successive bundles don\u2019t depend on that load.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Dredging Up The Itanic<\/h2>\n\n\n\n<p>Itanium has been dead and forgotten to most for almost 20 years. Obviously, I need to dig it up and make something cool with it. But what? How about a shellcoding challenge?<\/p>\n\n\n\n<p>If I want to make a shellcoding challenge, I\u2019ll have to spin up an environment where I can develop and run code for Itanium.<\/p>\n\n\n\n<p>This ended up taking longer than actually writing the challenge.<\/p>\n\n\n\n<p>The obvious first step is finding a working Itanium system. 
After all, The Register <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/www.theregister.com\/2021\/07\/30\/end_of_itanium_shipments\/\">wrote an article<\/a> about finding many hundreds of Itanium CPUs on eBay. This is true! You too can own your very own Itanium CPU for a mere $30 in the U.S. However, that silicon won\u2019t be especially useful without a motherboard to put it in.<\/p>\n\n\n\n<p>It is not currently possible to purchase a standalone Itanium motherboard.<\/p>\n\n\n\n<p>The next step was looking at purchasing blades for an Itanium server. (Itanium primarily targeted the server market.) With a fair amount of effort searching on eBay, it was possible in April 2022 to find an Itanium blade for $300.<\/p>\n\n\n\n<p>Thankfully, before I dropped $300 on one of these things (quite a lot of money for me to spend on anything, let alone a side project) an IT friend who I was discussing this asinine project with pointed out an important issue to me\u2013 blades typically use bizarre, proprietary power supply connections and fit into a blade chassis which delivers that power. Without a blade chassis ($1500) I would be purchasing a very heavy and expensive paperweight. Thanks, Travis!<\/p>\n\n\n\n<p>As an aside, it seems that at the time of writing (May 2023) it\u2019s become easier to find those blades at that price or even slightly lower (there\u2019s one at $145 right now.) I\u2019m also seeing a single listing for a complete Itanium 2 server (with a standard power supply jack) that was probably released around 2003 at $425. Either way, this wasn\u2019t the case last year, so I moved on to emulation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Ski<\/h2>\n\n\n\n<p>A resource I stumbled upon early on was Sergei Trofimovich\u2019s post on the <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/trofi.github.io\/posts\/199-ia64-machine-emulation.html\">Ski emulator<\/a>, which described how he used it to fix a bug in the kernel. 
Ski was an emulator developed by HP back when they were building Itanium to test the software they were making on their \u201clegacy\u201d x86 machines. Ski was made in an era when multi-core processors didn\u2019t exist and the Pentium brand name was still the gold standard for quality\u2013 so naturally, it\u2019s a little dated. Thankfully, Trofi had managed to get a version working on modern Linux, and Ski is good enough to bring up an entire OS image! I was able to compile and boot it on my Fedora desktop.<\/p>\n\n\n\n<p>The next step was to get a working OS image. I needed something good enough to develop my shellcoding challenge\u2013 it needed Vim, GCC and GDB. Trofi suggested cross-compiling Gentoo for this task in his article\u2026 But I didn\u2019t have a working knowledge of Gentoo, so I turned to digging through old archives for OS images. I found some CentOS and Debian ISOs, but there was a big problem with them\u2013 Ski has no understanding of ISO images. It can only boot a kernel binary with a raw hard disk image attached, so these DVDs aren\u2019t useful. I ended up reaching out to Trofi and got some excellent, highly detailed instructions on how <em>exactly<\/em> to build the system by creating a Gentoo chroot and setting up the environment within it, including some updates for 2022 without which the finished build would have failed to boot. One particular piece of advice is that <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/git.kernel.org\/pub\/scm\/linux\/kernel\/git\/torvalds\/linux.git\/commit\/?id=fc5bad03709f9c79ea299f554b6f99fc6f4fe31c\">support for Ski was removed from the kernel in 2019<\/a>, so we\u2019ll have to build a kernel from before that time\u2013 4.19.241 will do.<\/p>\n\n\n\n<p>Out of the box, the image that Gentoo built for me had <code>gcc<\/code> but no other tooling, so I set out to build the tools myself. 
Ski is, as is to be expected, very slow, and building <code>vim<\/code> for starters didn\u2019t go so well. On a Zen 2 desktop, I set it to compile\u2026 and after <em>six hours<\/em> attempting to compile a single file it ran out of memory. Attempting to add more memory to the emulated processor made it lock up, and I had no idea why, but it was clear that \u201cnative\u201d compilation wasn\u2019t the way to go anyway.<\/p>\n\n\n\n<p>Another friend who had been following along with this story, <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/github.com\/xxc3nsoredxx\">xxc3nsoredxx<\/a>, was a full-time Gentoo user and was able to point out to me within minutes that one of Gentoo\u2019s strengths is cross-compilation for obscure architectures. A quick tutorial on <code>emerge<\/code> later, and I was able to successfully build <code>vim<\/code> and <code>gdb<\/code> and run them within Ski.<\/p>\n\n\n\n<p>The complete instructions for how to build a working Itanium environment on Linux can be found <a href=\"https:\/\/web.archive.org\/web\/20230726120257\/https:\/\/nocoffei.com\/?page_id=23\">here<\/a> or in the website header.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stay Tuned<\/h2>\n\n\n\n<p>In the second part of this article, I\u2019ll discuss using these newfound tools to create an extremely irritating shellcoding challenge.<\/p>\n\n\n\n<p>A lot of people worked on Itanium, from the original idea of EPIC to its slow and painful death. The biggest, and most surprising, lesson I took away from reconstructing a working development environment in 2023 is that almost all of those stories have been forgotten. I\u2019d find one person talking about supporting Itanium in disused comments sections, old, forgotten documentation that had to be carefully coaxed out of the Internet Archive, and some still-living docs on Intel\u2019s website\u2026 But for all the labor that had to have gone into the Itanic I found little evidence that <em>people<\/em> worked on this. 
So if you worked on Itanium, or know someone who did, please reach out and tell me your story in the comments below.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let\u2019s embark on a tale back to the era of yore, when smartphones not yet ruled the earth, Pentiums were king, and Intel believed that x86 had no future.<\/p>\n","protected":false},"author":2,"featured_media":63,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-23","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/posts\/23","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nocoffei.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=23"}],"version-history":[{"count":0,"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/posts\/23\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nocoffei.com\/index.php?rest_route=\/wp\/v2\/media\/63"}],"wp:attachment":[{"href":"https:\/\/nocoffei.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=23"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nocoffei.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=23"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nocoffei.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=23"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}