December 10, 2015

URL Encoding

Building for the web, in any language, inevitably means working with URIs: compact sequences of characters that identify an abstract or physical resource. Or do web developers really work with URLs? What’s the difference?

URIs identify and URLs locate; however, locators are also identifiers, so every URL is also a URI, but there are URIs which are not URLs.

Roger Pate

Wow, that was confusing! But this post is really about URL encoding, not the differences between URIs, URLs, or even URNs. Let’s get to the meat of things: how do we build a URL that is safe to use over HTTP across different browsers?

The end goal is a way to transform a URL (which could have various paths, unusual domain names, and even endless query strings) that might contain unsafe or reserved characters into one that is safe to use. We cannot do this by simply removing the unsafe or reserved characters; we must encode them.

We need to encode them because URLs have a well-defined specification (dating back to 1994) that the modern web often does not follow. The specification has been updated several times to accommodate changes in technology, but consumer and business requirements still oblige us developers to encode unsafe and reserved characters.

Classification	Sample Characters	Encoding Required?
Safe Characters	Alphanumeric characters `[0-9a-zA-Z]`, special characters `$-_.+!*'()`, and reserved characters used for their reserved purposes	No
ASCII Control Characters	Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal)	Yes
Non-ASCII Characters	Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal)	Yes
Reserved Characters	`$` `&` `+` `,` `/` `:` `;` `=` `?` `@` not used for their reserved purposes	Yes
Unsafe Characters	`"` `<` `>` `#` `%` `{` `}` `\` `^` `~` `[` `]` `|`, the backtick, and the space character	Yes

How do we encode?

Reserved characters (when not used for their reserved purposes) and unsafe characters shouldn’t appear in URLs without being encoded, as far as HTTP and RFC 3986 are concerned. So if we want to use these characters, we must encode them.
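Encoding here means percent-encoding: each byte of the character is written as `%` followed by two hexadecimal digits. As a minimal sketch, assuming UTF-8 as the byte encoding, the hypothetical helper `percentEncodeChar` below (not a built-in) encodes a single character unconditionally, whether or not it actually needs it:

```javascript
// Hypothetical helper: percent-encode one character unconditionally,
// regardless of whether it is safe, reserved, or unsafe.
function percentEncodeChar(char) {
  // TextEncoder always produces UTF-8 bytes.
  const bytes = new TextEncoder().encode(char);
  return Array.from(bytes)
    .map((byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0'))
    .join('');
}

console.log(percentEncodeChar(' ')); // %20
console.log(percentEncodeChar('/')); // %2F
console.log(percentEncodeChar('?')); // %3F
console.log(percentEncodeChar('é')); // %C3%A9
```

In practice the built-in functions covered later decide which characters to leave alone; this only illustrates what an encoded character looks like.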

Encoding a URL is not simply a search and replace of each character with its encoded form. To escape the right characters, we have to know which reserved set is active for each part we want to encode. So it is impossible to write an algorithm that takes a URL string and returns an encoded URL string without knowing about its specific parts.
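One way to respect this is to keep the parts separate until the very end and encode each one on its own. The following is only a sketch: `buildUrl` and its parameter names are hypothetical, and it assumes the scheme and host are already valid.

```javascript
// Hypothetical helper: build a URL from parts that are still separate,
// encoding every path segment, query name, and query value individually.
function buildUrl({ scheme, host, pathSegments = [], query = {} }) {
  const path = pathSegments.map((segment) => encodeURIComponent(segment)).join('/');
  const queryString = Object.entries(query)
    .map(([name, value]) => `${encodeURIComponent(name)}=${encodeURIComponent(value)}`)
    .join('&');
  return `${scheme}://${host}/${path}` + (queryString ? `?${queryString}` : '');
}

const builtUrl = buildUrl({
  scheme: 'http',
  host: 'chrisrng.svbtle.com',
  pathSegments: ['Harry/Potter?chosen one'], // one segment containing "/" and "?"
  query: { wand: 'Elder & Holly' },
});

console.log(builtUrl);
// http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen%20one?wand=Elder%20%26%20Holly
```

Because the `/`, `?` and `&` inside the data are escaped before the delimiters are added, the only unescaped delimiters in the result are the ones the function inserted itself.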

@gpellen URL-encoding the query string parameter values is safest in terms of compatibility.
— Mathias Bynens (@mathias) October 28, 2014

A URL cannot be analyzed after decoding

After a URL is decoded, it is no longer guaranteed to retain its syntactic meaning. Reserved or unsafe characters may appear that change the meaning of the URL.

Example: http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one

Part	Value
Scheme	http
Host	chrisrng.svbtle.com
Path segment	Harry%2FPotter%3Fchosen+one
Decoded Path segment	Harry/Potter?chosen+one

If we decode it without any idea of its syntactic meaning, we get this URL:

Decoded URL: http://chrisrng.svbtle.com/Harry/Potter?chosen+one

Part	Value
Scheme	http
Host	chrisrng.svbtle.com
Path segment	Harry
Path segment	Potter
Query parameter name	chosen one

The analysis of the decoded URL is clearly wrong: we must split the URL into its parts on the reserved characters before we decode it. Rewrite rules must take care never to decode a URL before attempting to match it if reserved characters are allowed to be URL-encoded (which may or may not be the case, depending on your application).
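The two analyses can be reproduced with the standard `URL` class available in browsers and Node.js. This is a sketch of the "parse first, decode second" order rather than a full router; the variable names are just for illustration.

```javascript
const rawUrl = 'http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one';

// Correct order: parse the still-encoded URL, then decode each part.
const parsed = new URL(rawUrl);
const segments = parsed.pathname
  .split('/')
  .slice(1) // drop the empty string before the leading "/"
  .map((segment) => decodeURIComponent(segment));

console.log(segments);      // [ 'Harry/Potter?chosen+one' ]
console.log(parsed.search); // '' (there is no query string)

// Wrong order: decode the whole string, then parse it.
const decodedFirst = new URL(decodeURIComponent(rawUrl));

console.log(decodedFirst.pathname);                  // /Harry/Potter
console.log([...decodedFirst.searchParams.keys()]);  // [ 'chosen one' ]
```

The first analysis finds one path segment and no query; the second finds two path segments and a query parameter that was never there.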

Decoded URLs cannot be reencoded to the same form

Using the same example, the URL http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one when decoded becomes http://chrisrng.svbtle.com/Harry/Potter?chosen+one. However, the decoded URL http://chrisrng.svbtle.com/Harry/Potter?chosen+one cannot be reencoded to the original form: it is already a valid URL, so an encoder has nothing left to escape. It is simply a very different URL from the one we started with.
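A quick round trip illustrates the loss. This sketch decodes with `decodeURIComponent` and reencodes with `encodeURI`, which is one plausible pairing, not the only one:

```javascript
const original = 'http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one';

const decoded = decodeURIComponent(original);
const reencoded = encodeURI(decoded);

console.log(decoded);                // http://chrisrng.svbtle.com/Harry/Potter?chosen+one
console.log(reencoded);              // http://chrisrng.svbtle.com/Harry/Potter?chosen+one
console.log(reencoded === original); // false
```

Note that `decodeURI` avoids this particular trap: it deliberately leaves escape sequences for reserved characters such as `%2F` and `%3F` untouched.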

Handling URLs in JavaScript

We could use a regular expression to validate a URL and then split it into its parts, but that is really hard to get right.

Do not encode the whole URL String using encodeURIComponent

Encoding the whole URL string with encodeURIComponent gives an incorrect result.

encodeURIComponent('http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One')

"http%3A%2F%2Fwww.chrisrng.svbtle.com%2FHarry%2FPotter%3FChosen%2BOne"

Whereas encodeURI('http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One') gives us:

"http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One"

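Do not construct URLs without encoding each part

This is an important thing to remember when creating query strings. If we run encodeURIComponent over an already assembled query string such as ?wand=elder&scar=true, we obtain %3Fwand%3Delder%26scar%3Dtrue: the delimiters ?, = and & are escaped along with everything else, so the result no longer works as a query string. Encode each name and value separately, then join them with the delimiters.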
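To see why each part needs encoding on its own, here is a small sketch in which a value happens to contain `&`. The value `elder&holly` is made up for illustration, and `URLSearchParams` stands in for whatever parses the query on the receiving side.

```javascript
const wand = 'elder&holly'; // illustrative value containing a delimiter

// Built by plain concatenation: the "&" inside the value acts as a delimiter.
const unsafeQuery = '?wand=' + wand + '&scar=true';

// Built with the value encoded first: the "&" becomes %26.
const safeQuery = '?wand=' + encodeURIComponent(wand) + '&scar=true';

console.log([...new URLSearchParams(unsafeQuery).keys()]); // [ 'wand', 'holly', 'scar' ]
console.log(new URLSearchParams(safeQuery).get('wand'));   // elder&holly
```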

Do not use escape()

escape() has been deprecated since ECMAScript v3. Use encodeURI() or encodeURIComponent() instead.
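Deprecation aside, a practical reason to avoid it is that `escape()` does not produce UTF-8 percent-encoding for non-ASCII characters. A short comparison, assuming an environment where the legacy `escape` global still exists (browsers and Node.js do provide it):

```javascript
console.log(escape('é'));             // %E9
console.log(encodeURIComponent('é')); // %C3%A9

console.log(escape('✓'));             // %u2713
console.log(encodeURIComponent('✓')); // %E2%9C%93
```

The `%uXXXX` form is not valid percent-encoding as far as URLs are concerned.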

Use encodeURI()

Use encodeURI when you want a working URL.
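For instance, given a URL that is complete but contains spaces (the query parameter here is made up for illustration), `encodeURI` escapes the spaces and leaves the structural characters alone:

```javascript
const urlWithSpaces = 'http://chrisrng.svbtle.com/Harry Potter?chosen=the one';

console.log(encodeURI(urlWithSpaces));
// http://chrisrng.svbtle.com/Harry%20Potter?chosen=the%20one
```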

Use encodeURIComponent()

Use encodeURIComponent when you want to encode a URL parameter.
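A typical case is passing one URL as a parameter of another. In this sketch the `example.com/login` endpoint and the `next` parameter name are hypothetical:

```javascript
const next = 'http://chrisrng.svbtle.com/Harry/Potter?Chosen+One';

// Encode only the parameter value, not the URL that carries it.
const loginUrl = 'http://example.com/login?next=' + encodeURIComponent(next);

console.log(loginUrl);
// http://example.com/login?next=http%3A%2F%2Fchrisrng.svbtle.com%2FHarry%2FPotter%3FChosen%2BOne

// The receiving side gets the original value back intact.
console.log(new URL(loginUrl).searchParams.get('next') === next); // true
```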

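Differences between encodeURI() and encodeURIComponent()

The two functions differ on only 11 characters, all of which encodeURIComponent() encodes and encodeURI() leaves alone: ; , / ? : @ & = + $ #

These were found by running the following code: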

let arr = [];
for(let i = 0; i < 256; i++) {
  const char = String.fromCharCode(i);
  if (encodeURI(char) !== encodeURIComponent(char)) {
    arr.push({
      character: char,
      encodeURI: encodeURI(char),
      encodeURIComponent: encodeURIComponent(char)
    });
  }
}
console.table(arr);

How to create a Query String

We start with an object that represents the query string as a key-value store. Then we encode each key and value separately and join the pairs to form the complete query string.

const queryStringData = {
  'wand': 'Elder',
  'scar': 'lightning'
}

const queryStringKeys = Object.keys(queryStringData);

const queryStringResult = queryStringKeys.reduce(
  (queryStringArray, currKey) => {
    const encodedKey = encodeURIComponent(currKey);
    const encodedValue = encodeURIComponent(queryStringData[currKey]);
    queryStringArray.push(`${encodedKey}=${encodedValue}`);
    return queryStringArray;
  }, []).join('&');

console.log(`Query String is: ${queryStringResult}`);

// Query String is: wand=Elder&scar=lightning
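As an alternative sketch, the standard `URLSearchParams` class can do the flattening and encoding in one step. The values below are made up to include a space and an `&`. One caveat: it follows form encoding, so spaces become `+` rather than `%20` as with encodeURIComponent.

```javascript
const params = new URLSearchParams({
  wand: 'Elder Wand',
  scar: 'lightning & thunder',
});

console.log(params.toString());
// wand=Elder+Wand&scar=lightning+%26+thunder
```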
