Chapter 1: Core concepts

The goal of this book is to use technology to meet a series of requirements in the context of securing information that is stored in documents. These requirements include:

  • Integrity: we want the assurance that a document is genuine. We want a way to discover if the content was changed somewhere in the workflow after the document was created,

  • Authenticity: we want assurance about who created the document and when the document was created,

  • Non-repudiation: we want the assurance that the author can't deny his authorship.

Other requirements will be discussed later on, such as the requirement for different people to approve the content of a document, or the requirement for the document integrity to be preserved for the long-term.

But let's focus on integrity, authenticity, and non-repudiation first.

Description of the requirements

The issues we're going to solve aren't new. They started when human history started; that is: when the first words were written. How do we know who wrote those words? How do we know when those words were written? How do we know those words weren't changed over time?

Figure 1.1: Authenticity
Figure 1.1: Authenticity

Did Abraham Lincoln really say that "the problem with quotes on the Internet is that it is hard to verify their authenticity" as shown in figure 1.1? Obviously not, but some of these "who wrote what when?" questions are quite interesting for history books. We can ask similar questions in a business context, e.g. in the context of a contract: who signed the contract? When was it signed? Did someone change the contract after it was signed?

Let's take a look at some other examples.

Integrity

Meet Eddy Haerens (figure 1.2). Eddy was looking forward to move to his new appartment. He paid the contractor who built the appartment in installments. Everything went well until he received the (paper) invoice for the eight installment. A crook intercepted that invoice, and forged it in such a way that it was identical to the original, except for the wiring information. The crook even made sure to use an account at the same bank of the contractor. Unaware of the fraud, Eddy asked his bank to execute the payment. Two weeks later, he was informed that his payment never reached the contractor. In the meantime, the money had been transferred to another account in Thailand. The bank was not to blame; Eddy's instructions had been executed to the letter. The money –30,000 euro!– was lost.

Figure 1.2: News paper article about forged invoices
Figure 1.2: News paper article about forged invoices

In this chapter, we're going to study technical concepts that can be used to ensure the integrity of a (digital) document. If such a document is changed somewhere in the workflow –either by mistake or deliberately–, we want to get some kind of warning, saying: Watch out! The content of this document has been altered since it was originally issued.

Authenticity

In a 13th century fresco (figure 1.3), we see how Emperor Constantine the Great transferred the authority over Rome and the western part of the Roman Empire to pope Sylvester the first. The deed of this "Donation of Constantine" is dated in the early 4th century. It was mentioned for the first time in the 8th century, and in the 11th century, its content was used by pope Leo IX in a debate that would lead to the schism between the East and the West of the former Roman Empire. In the ages that followed, it was cited frequently in the context of conflicts between the papacy and the secular powers.

During the Middle Ages, the Donation was widely accepted as authentic. It wasn't until the mid-15th century, with the revival of Classical scholarship and textual criticism, that humanists, and eventually the papal bureaucracy, began to realize that the document could not possibly be genuine. In the meantime, the decree had caused nothing but trouble. In his Divine Comedy (1308-1320), Dante wrote: "Ah, Constantine, how much evil was born, / not from your conversion, but from that donation / that the first wealthy Pope received from you!"

Figure 1.3: The donation of Constantine
Figure 1.3: The donation of Constantine

In this chapter, we're going to search for a way to ensure the authenticity of a (digital) document. We want to know who signed the document, and when it was signed.

Non-repudiation

In figure 1.4, we see how Bart Simpson admires the name he just wrote in wet cement. We know that Bart can be mischievous, although he doesn't always admit it. He even released a song entitled "I didn't do it!" in 1994.

In the case of the "signature" in the wet concrete, Bart Simpson could argue that it was put there by someone else, by some other Bart, or by someone pretending to be Bart. However, if we look closely, we see a long shadow covering part of Bart and his name in the concrete. We recognize the hairstyle of Bart's mother, Marge. Marge was probably standing behind Bart while he put his name there, and if Bart denies that fact, Marge can testify that he was indeed the culprit.

Figure 1.4: Bart's name in concrete
Figure 1.4: Bart's name in concrete

In this chapter, we're going to look how we can make sure that the signer of a (digital) document can't deny that he actually signed the document.

People have tried to counter these three issues in many different ways in the past:

  • In ancient Rome, the last will and testament of wealthy people were stored in a temple, guarded by virgins, known as the Vestals. The Vestal Virgins made sure that no unauthorized third party had access to those documents, for instance in an attempt to change them.

  • In more recent times, we also know the concept of notarizing documents, where a notary witnesses the act of signing in person. This notary is a person of trust, someone with legal training who is licensed by the government.

  • ...

But let's not make this a history book, let's take a look at some technical concepts to achieve the same or –preferably– a better result.

Core concepts

When we described the requirements we want to meet, we always used physical presentations of a document: a paper invoice, a parchment deed, and even a large tile of concrete. The core concepts we are going to discuss next, assume that all the documents are available in digital form. There will be references to the physical world in chapter 3 when we discuss HSM's, USB tokens, and smart cards, and in chapter 6 when we talk about registering tangible objects such as a driver's license. But from now on, we'll assume that the data, the information, the document content that we want to protect, can be represented in a series of bits and bytes. Starting with the next chapter, we'll focus on using the Portable Document Format (PDF) to represent that data, information, or content.

Cryptographic hash functions

We'll start with an example in which only a small piece of information is involved: a password.

Use case 1: hashing passwords

Many online services allow you to log in using credentials that consist of your user name –which can be public– and your password –which is private. Ideally, you should be the only person who knows your password, but if that's the case, how can the online service verify your credentials? Storing usernames and the corresponding passwords in a secure database isn't an option, as there's no such thing as a secure database. Databases with user information get hacked, and when that happens, passwords stored in clear text are exposed.

One way to avoid passwords to be exposed, is not to store the actual password, but a message digest of the password. Such a message digest can be created with a cryptographic hash function. In the PasswordMD example, we have two member-variables, digest and md. These member-variables are instantiated in the constructor:

protected byte[] digest;
protected MessageDigest md;
 
protected PasswordMD(String password, String algorithm, String provider)
    throws GeneralSecurityException {
    if (provider == null)
        md = MessageDigest.getInstance(algorithm);
    else
        md = MessageDigest.getInstance(algorithm, provider);
    digest = md.digest(password.getBytes());
}

The MessageDigest class is a Java class in the java.security package. It uses a cryptographic hash function to turn an arbitrary block of data into a fixed-size bit string.

The block of data is often called the message, and the resulting value after executing the function is referred to as the message digest or hash. The result of the function is always identical, provided the message hasn’t been altered. Any accidental or intentional change to the data will result in a different hash value. It’s also a one-way algorithm: it's impossible to derive the original message from a message digest.

In the next couple of methods, we ask the PasswordMD class for the length of the digest in bytes, and for a hexadecimal representation of the message digest. We'll use these methods to retrieve some information about the hash of a password.

public int getDigestSize() {
    return digest.length;
}
public String getDigestAsHexString() {
    return new BigInteger(1, digest).toString(16);
}

In our example, the block of data is a password. We convert it to a hash that can be stored in a database. It's impossible to derive the password from its corresponding hash, but the next time we execute the cryptographic hash function on the password, the result will be the same hash as stored in the database.

In this case, we have created a situation where only one person –you– knows your password. Your password isn't stored anywhere else. Verifying your credentials can be done by creating a hash of your password, and comparing that hash with the hash stored in the database. If there's a match, the verification succeeds.

This is done in the checkPassword() method:

public boolean checkPassword(String password) {
    return MessageDigest.isEqual(digest, md.digest(password.getBytes()));
}

There were two parameters we haven't discussed before: the algorithm and the provider parameter.

The algorithm parameter is used to select a specific cryptographic hash function, for instance: "SHA-256":

showTestDefault("SHA-256");

In the showTestDefault() method, we don't pass a provider value yet; instead we create a new PasswordMD instance with the provider set to null.

  1. public static void showTestDefault(String algorithm) {
  2. try {
  3. PasswordMD app = new PasswordMD("password", algorithm, null);
  4. System.out.println("Digest using " + algorithm + ": "
  5. + app.getDigestSize());
  6. System.out.println("Digest: " + app.getDigestAsHexString());
  7. System.out.println("Is the password 'password'? "
  8. + app.checkPassword("password"));
  9. System.out.println("Is the password 'secret'? "
  10. + app.checkPassword("secret"));
  11. } catch (NoSuchAlgorithmException e) {
  12. System.out.println(e.getMessage());
  13. } catch (GeneralSecurityException e) {
  14. System.out.println(e.getMessage());
  15. }
  16. }

This is what happens, when we execute the showTestDefault() method:

  • First, we create a PasswordMD instance for the password "password", using a specific algorithm algorithm, but no provider (line 3).

  • Then, we write the size and the hexadecimal representation of the message digest of the password to the System.out (line 4-6).

  • Finally, we check the password: is the password "password"? (line 7-8) Or is the password "secret"?

The output of this method will look like this:

Digest using SHA-256: 32
Digest: 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
Is the password 'password'? true
Is the password 'secret'? false

If you run this method with different values for the password, you will see that the size of the Digest is always 32 bytes long. You'll also see that the hexadecimal value that is shown in the output changes in a very random way if you make small changes to the password.

By default, the standard Java SDK supports the following values for the hashing algorithm: "MD2", "MD5", "SHA" (used as a synonym for "SHA-1"), "SHA-224", "SHA-256", "SHA-384", and "SHA-512". This is a very limited set, and some of these algorithms (such as MD5 and SHA-1) are seriously flawed.

You may want to extend the list of supported algorithms by introducing another security provider.

Using a different security provider

BouncyCastle is an example of such a provider, and this is how we add BouncyCastle as a security provider:

Security.addProvider(new BouncyCastleProvider());

The showTestBC() method is identical to the showTestDefault() method, except for this line:

PasswordMD app = new PasswordMD("password", algorithm, "BC");

We now tell the PasswordMD constructor to use BouncyCastle as security provider. We can now execute this method using "SHA-256" as parameter for the algorithm:

showTestBC("SHA-256");

The output of this method will be identical to what we had before:

Digest using SHA-256: 32
Digest: 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
Is the password 'password'? true
Is the password 'secret'? false

If we had used "RIPEMD160" as parameter for the showTestDefault() method, we would have had the following output:

RIPEMD160 MessageDigest not available

With the showTestBC() method, the output looks like this:

Digest using RIPEMD160: 20
Digest: 2c08e8f5884750a7b99f6f2f342fc638db25ff31
Is the password 'password'? true
Is the password 'secret'? false

These small examples were meant as simple illustrations of the concept of hashing.

If a hacker obtains a database containing hashes of passwords, he can still use brute force to recover passwords. Nowadays, there are other ways to provide secure access to a web site. In cases where this concept is still used for password protection, at least salting the hash and applying multiple iterations is necessary for security, but that’s outside the scope of this paper.

What matters is what we have learned from these examples: we can use the concept of hashing to check the integrity of a message. Performing a cryptographic hash function on a message, should always result in the same message digest. If not, the message was changed.

Use case 2: checking the integrity of a document

Suppose that you were a student at University A, and you want to enroll in University B. It's not unusual for University B to ask you to share some information about the grades you obtained at University A. In the past, you'd show University B your transcript of records or your diploma, and based on those documents you'd get admitted, or not.

Once in a while, a dishonest student would present a forged credit of records or a forged diploma to get admitted.

  • One way to avoid this, would be for University B to contact University A to check if the documents that were presented are genuine. This happens in some cases, but it's quite a cumbersome process, especially if the Universities are in different countries or even on different continents.

  • Another way would be for University A to put all those documents online, so that they can be consulted by University B. If that ever happened, it would be a serious infringement of privacy of the students. You can imagine that not every student wants his or her grades to be publicly available for everyone to see.

A better solution would be for University to put a hash of the documents online. That way no one could retrieve which grades a student obtained, nor any other personal information about the student for that matter. That hash is only useful for whoever has access to a digital copy of the document. It can be used to verify the integrity of the document. This is shown in figure 1.5.

Figure 1.5: Checking a document on a secure server
Figure 1.5: Checking a document on a secure server

University B has access to a document with a specific ID. University A provides a secure server from which a hash can be obtained for a document with this ID. University B can create a hash from the document in its possession, retrieve the hash of the corresponding server, and compare both hashes. If these hashes are identical, University B knows that the document that was provided by a student exists at University A, and that the document hasn't been altered.

Obviously, this requires that University B uses the same cryptographic hash function as was used by University A. Let's take a look at the different families of cryptographic hash functions.

Cryptographic Hash function overview

There are three major families of cryptographic hash functions:

  • The Message Digest (MD) algorithm: this is a series of message digest algorithms designed by Professor Ron Rivest of MIT of which MD5 is the most widely known one. It was designed in 1991. The MD5 algorithm allows you to create a 128-bit (16-byte) digest of a message. MD5 is still widely used, but it’s no longer considered secure.

  • The Secure Hash Algorithm (SHA): this is a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS):

    • SHA-1 was designed by the US National Security Agency (NSA) and released in 1995. The 160-bit (20 bytes) hash was considered safer than MD5, but nowadays, it's also no longer considered secure.

    • SHA-2 is a set of hash functions first released in 2001: SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. The length of the message digest in SHA-2 can be 224 bits (28 bytes), 256 bits (32 bytes), 384 bits (48 bytes), or 512 bits (64 bytes).

    • SHA-3 is the latest member of the Secure Hash Algorithm family of standards, released in 2015. It's a subset of the broader cryptographic primitive family Keccak. It includes SHA3-224, SHA3-256, SHA3-384, SHA3-512, SHAKE128 and SHAKE256.

  • The RACE Integrity Primitives Evaluation Message Digest (RIPEMD): This algorithm was developed at the Catholic University of Leuven, and first released in 1996. There's a 128, 160, 256, and 320 bit version (16, 20, 32, and 40 bytes).

If you dig deeper into the matter, you'll discover that there are some other algorithms, but these are the most important ones.

When looking at the list, you see that some of the older algorithms are no longer considered secure, and that new algorithms emerge, replacing the older ones. What's the problem with the old algorithms?

Collision resistance

When you have a function that converts an arbitrary block of data into a fixed-size bit string, it's always possible for two different messages to result in the same message digest.

Let's take MD5 as an example: there are 2128 possible hashes for an infinite possible number of messages.

2128 equals 340,282,366,920,938,463,463,374,607,431,768,211,456.

That's 340 trillion, 282 billion, 366 million, 920 thousand, 938, followed by 24 zeroes, or in more correct –but less understandable– words: three hundred and forty undecillion, two hundred and eighty-two decillion, three hundred and sixty-six nonillion, nine hundred and twenty octillion, nine hundred and thirty-eight septillion, four hundred and sixty-three sextillion, four hundred and sixty-three quintillion, three hundred and seventy-four quadrillion, six hundred and seven trillion, four hundred and thirty-one billion, seven hundred and sixty-eight million, two hundred and eleven thousand, four hundred and fifty-six.

Although this is a huge number, there is still a chance that two different blocks of data (documents, passwords,...) result in the same digest. This is known as hash collision. The quality of a hash function is determined by how easy it is to create a collision. An algorithm is vulnerable if you can create a different document with the same hash value in a way that is faster than using brute force.

For instance, suppose that you have a digital transcript of records from University A with plenty of C grades, and that University A has stored a hash of that document on a server using an algorithm that isn't collision resistant. In that case, it would be feasible for you to change the document where the C's are changed into A's without resorting to brute force.

This is the case with algorithms such as MD5 and SHA-1. They should no longer be used.