Something has been weighing heavily on my conscience for a while. Readium suffers from a serious lack of technical documentation. The closest I have come to putting anything into writing was creating this wiki page over two months ago. If we want to attract contributors to the project, we need to make diving into the source code as streamlined and painless as possible. Our architecture wiki is a pretty pathetic attempt in this regard.
Unpacking an EPUB
I felt that the most natural place to start diving into Readium is the beginning. Readium currently ships with an empty library, so the first thing that every new user of our system does is import a file into their library.
How do you wanna do this?
As of writing, Readium offers users three ways to import content into their library: from a remote URL, from their local filesystem in packed format, and from their local filesystem as an unpacked directory. Of these three, importing from the local filesystem in packed form is the most complex (the other two are really just subcases of this problem), so it is the use case we will explore for the rest of the post.
Pick a file, any file
To enable a user to select a file from their local filesystem, Readium makes use of a plain HTML
<input type='file'> element. But if you have done any web development before, you will notice that what happens next is completely non-traditional. In a classic web application, an HTML form is really a user-friendly tool for adding parameters to an HTTP request. When you click the submit button of an HTML form, the browser serializes whatever you typed into the form’s fields as
application/x-www-form-urlencoded data, tacks it on as the payload of an HTTP request (usually a POST), and sends it to the server.
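For contrast, that serialization step can be sketched in a few lines of JavaScript (the field names here are just examples):

```javascript
// What the browser does on submit, in miniature: serialize the form's
// fields into application/x-www-form-urlencoded data. (A real browser
// encodes spaces as '+'; encodeURIComponent uses '%20'.)
const fields = { title: "Moby Dick", format: "epub" };
const body = Object.entries(fields)
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
console.log(body); // "title=Moby%20Dick&format=epub"
```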
But Readium is 100% client side; there is no server to send the data to. Instead, we use the HTML5 File API to read the selected file directly in the browser and decompress it with JavaScript. This is slow going: JavaScript has no true integer type, so every bitwise operation in the decompression loop must convert the 64-bit floating point
Number type to the equivalent 32-bit signed integer, perform the operation, and then convert back to its original floating point representation.
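You can watch this 32-bit coercion happen in any JavaScript console:

```javascript
// Bitwise operators coerce the 64-bit float Number to a 32-bit signed
// integer, operate, and then convert back. The coercion is observable:
console.log(2.9 | 0);             // 2  (the fractional part is dropped)
console.log(Math.pow(2, 31) | 0); // -2147483648 (wraps around as a signed 32-bit int)
```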
This performance constraint is what motivated us to start allowing users to load an unpacked directory into Readium. To get this working, we use the nifty
webkitdirectory attribute on the form’s
<input>. If you have never seen this in action before (or never knew of its existence), there is a pretty good demo available here; or better yet, check it out in the Readium source code.
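In markup, the difference between the packed and unpacked import paths comes down to a single attribute (this is a generic sketch, not Readium’s actual form):

```html
<!-- packed: the user picks a single .epub file -->
<input type="file" accept=".epub">

<!-- unpacked: the user picks a whole directory; the browser hands the page
     every file inside it, each carrying a webkitRelativePath property -->
<input type="file" webkitdirectory>
```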
All of Readium’s logic for importing an EPUB from both zipped and unpacked formats is contained in /scripts/extractBook.js. For those of you who speak the Unified Modeling Language, here is a rundown of what is in there:
Perched at the top of the hierarchy is
Backbone.Model. This is not actually defined in Readium. It is an object provided by Backbone.js, a framework used widely throughout Readium. I am planning to explain our decision to use Backbone in more depth in a future post, but for now the most important part to understand is its Events module. Basically, Backbone makes triggering and subscribing to custom events super simple. Here is a quick rundown of it in code:
```javascript
var model = new Backbone.Model();

// subscribe to a custom event...
model.on("unpacked", function () {
    console.log("all done!");
});

// ...and fire it whenever you like
model.trigger("unpacked"); // logs "all done!"

// observable attributes fire change events automatically
model.on("change:title", function (model, title) {
    console.log("title is now " + title);
});
model.set("title", "Moby-Dick"); // logs "title is now Moby-Dick"
```
Backbone.Models also have a set of observable attributes; that is to say, they automagically fire
change events when they are changed. The entire control flow of the unpacking logic is driven by Backbone events. Event firing happens both explicitly (i.e.
this.trigger("something")) and implicitly, by changing an observable attribute (i.e.
this.set("something", newValue)).
The next level in the hierarchy is
BookExtractorBase. This is where all of the shared logic for parsing, managing and writing the imported EPUB to disk lives. I am going to come back to this in the next section.
At the bottom of the hierarchy are
ZipBookExtractor and
UnpackedBookExtractor. Each of these contains the logic for stepping through either a ZIP archive or an unpacked EPUB and calling the necessary extraction / importing logic (located primarily in
BookExtractorBase). Let’s consider
ZipBookExtractor, since the packed case is the more complex of the two.
Constructor logic is added to a
Backbone.Model by overriding its
initialize() method. The
Readium.ZipBookExtractor.initialize method does a few things:
- It makes sure that a URL has been passed in to the constructor; if not, it throws an exception.
- If a URL has been passed in, it takes an MD5 hash of the string formed by concatenating a timestamp onto the URL. This hash serves as a unique identifier, which Readium uses to keep track of the file, and also as the name of the root filesystem directory that Readium will write the book’s contents to.
- It captures the source filename of the user’s upload. This is captured only for the purpose of displaying it later in the UI, and is not needed for any technical reason.
Once a
ZipBookExtractor instance has been created, the importing process is initiated by calling its
extract() method. This method sets up the order in which the rest of the extraction process should execute by hooking up each method of the process to the event that will be fired upon completion of the previous step. Finally, it calls the asynchronous constructor of
Readium.FileSystemApi, which is our own very thin wrapper around the HTML5 filesystem API.
Once the filesystem is opened, the process continues by calling the constructor of our ZIP library. This constructor is also asynchronous. When initialized, it executes a callback with the
Zip object as its
this context. The
Zip has not been decompressed at this stage; only the archive’s data structures have been parsed, so it does contain information about all of the archive entries. We decompress individual entries as needed (the code for this is in the
What to unpack first
After we perform some very rudimentary validations on the
Zip object (mostly checking for the existence of a few essential items), we get into the brunt of unpacking the book. In general, when unpacking an EPUB, what you want to get your hands on is the OPF package document, but first you need to find it. To do that, you have to parse another XML file first:
META-INF/container.xml. All EPUBs must have this file, and it must be located at this absolute path in the archive. The
container.xml specifies the absolute path of the
package document within the archive on an XML
<rootfile> node. This is the only information that we take from this file.
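For reference, a minimal container.xml looks like this; the path lives in the full-path attribute (OEBPS/package.opf is just a common convention, since the package document can sit anywhere in the archive):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <!-- full-path is the one piece of information we take from this file -->
    <rootfile full-path="OEBPS/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
```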
Parsing the package document
Once we are able to find the
package document within the archive, the next step is to parse it. The logic for this is contained in the PackageDocumentBase class definition. It is kept separate from the unpacking code because we also parse package documents right before we display a publication.
At this stage, when parsing the
package document, we are really only interested in the metadata it contains. Essentially, what we are looking for is everything we need to show the publication in the library view (title, author, cover image, etc.). Once we have this information, we add in a bit more of our own (most importantly, where we are going to write the package document on disk) and then save it to the browser’s local storage for quick access. We then iterate through the remaining archive contents, extracting them one by one and writing them to the HTML5 filesystem.
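The metadata in question lives in the package document’s <metadata> block, which looks roughly like this (a trimmed-down illustrative example, not a complete package document):

```xml
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:isbn:0000000000000</dc:identifier>
    <dc:title>Moby-Dick</dc:title>
    <dc:creator>Herman Melville</dc:creator>
    <!-- points at the manifest item for the cover image -->
    <meta name="cover" content="cover-image"/>
  </metadata>
  <!-- manifest and spine omitted -->
</package>
```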
The last step of the unpacking process is something that I have termed monkeypatching the URLs. There are a couple of reasons for this, but the main motivation is that when we open an EPUB in the viewer, there is no guarantee that it will be loaded in a document whose URL points to where the content actually exists on disk. The implication is that in order to ensure that URLs in individual spine items are handled correctly by the browser (for example, consider the
src attribute of an
<img> tag), we need to convert them into fully qualified, absolute URLs pointing to where we saved the file in the HTML5 filesystem. This logic is contained in scripts/models/path_resolver.js. This process is a little muddy right now and there are a few bugs left to iron out (so if you are looking for a way to contribute :) this might be a good spot to jump in).
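The core of the conversion is ordinary relative-URL resolution against the spine item’s real location on disk. This sketch uses the standard URL class and a plain https base for illustration; in Readium the base would be a filesystem: URL, and the paths here are made up:

```javascript
// Resolve a spine item's relative src against the absolute location
// where the file was actually written.
const itemUrl = "https://example.com/books/abc123/OEBPS/chapter1.xhtml";
const monkeypatched = new URL("images/figure1.png", itemUrl).href;
console.log(monkeypatched);
// "https://example.com/books/abc123/OEBPS/images/figure1.png"
```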
Time to open
Once the book is fully unpacked, monkeypatched, and written to disk, we fire off an event to let the library code know that we are done. In response, we open the file in the viewer. How this happens, I will save for a future post.