URL Encoding
Building for the web (in any language) inevitably forces you to work with URIs: compact sequences of characters that identify an abstract or physical resource. Or does every web-facing developer really work with URLs? What’s the difference?
URIs identify and URLs locate; however, locators are also identifiers, so every URL is also a URI, but there are URIs which are not URLs.
Wow, that was confusing! But really this blog is about URL encoding, not the differences between URIs and URLs or even URNs. Let’s get to the meat of things: how do we build a URL that is safe to use over HTTP across different browsers?
The end goal is to take a URL (which could have various paths, weird domain names, and even endless query strings) that might contain unsafe or reserved characters and transform it into a safe one. Of course, the goal is to do this without simply removing the unsafe or reserved characters; we must encode them.
We need to encode them because URLs have a very well-defined specification (dating back to 1994), which the way we use the web and URLs today often does not follow. The specification has been updated repeatedly to accommodate changes in technology, but consumer and business requirements still require us developers to encode unsafe and reserved characters.
Classification | Sample Characters | Encoding Required? |
---|---|---|
Safe Characters | Alphanumeric characters [0-9a-zA-Z], special characters $-_.+!*'(), and reserved characters used for their reserved purposes | No |
ASCII Control Characters | Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal) | Yes |
Non-ASCII Characters | Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal) | Yes |
Reserved Characters | $ & + , / : ; = ? @ not used for their reserved purposes | Yes |
Unsafe Characters | " < > # % { } \ ^ ~ [ ] \| ` (including a space) | Yes |
How do we encode? #
Control, non-ASCII, reserved (if not used for their reserved purposes), and unsafe characters shouldn’t be used in URLs without encoding them, as far as HTTP and RFC 3986 are concerned. So if we want to use these characters, we must encode them.
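To make "encoding" concrete: a percent-encoded character is just its UTF-8 bytes, each written as % followed by two hex digits. The helper below is a hypothetical sketch for illustration only (`percentEncodeAll` is not a built-in, and unlike the real encoders it encodes every character, even safe ones). It assumes an environment with `TextEncoder`, such as a modern browser or Node.js.

```javascript
// Hypothetical helper: percent-encode EVERY character of a string,
// one %XX group per UTF-8 byte, to make the mechanism visible.
function percentEncodeAll(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the input
  return Array.from(bytes)
    .map((byte) => '%' + byte.toString(16).toUpperCase().padStart(2, '0'))
    .join('');
}

console.log(percentEncodeAll(' '));        // %20
console.log(percentEncodeAll('<'));        // %3C
console.log(percentEncodeAll('\u00e9'));   // %C3%A9 (two bytes for "é")
console.log(encodeURIComponent('\u00e9')); // %C3%A9 (the built-in agrees)
```

In practice the built-in functions covered later do this work; the sketch only shows what they produce for the characters they decide to encode.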
Encoding a URL is not simply a search and replace from each character to its encoded form. To escape the characters that need it, we have to know which reserved set is active for each part we want to encode. So it is impossible to have an algorithm that takes a URL string and spits out an encoded URL string without knowing about its specific parts.
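A small sketch of why the parts matter, using hypothetical path segments: a "/" that separates segments must stay as it is, while a "/" that belongs to a segment's data must be encoded. Once the string is joined, nothing can tell the two apart.

```javascript
// Hypothetical data: the second segment contains "/", "?" and a space,
// all of which mean something else in other parts of a URL.
const segments = ['books', 'Harry/Potter?chosen one'];

// Encode each segment on its own, then join with the real delimiter.
const encodedPath = segments.map((s) => encodeURIComponent(s)).join('/');
console.log(encodedPath); // books/Harry%2FPotter%3Fchosen%20one

// Encoding the already-joined string cannot tell data from delimiters.
const encodedWhole = encodeURI(segments.join('/'));
console.log(encodedWhole); // books/Harry/Potter?chosen%20one
```

The second result looks like three path segments and a query string, which is not what the data described.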
@gpellen URL-encoding the query string parameter values is safest in terms of compatibility.
— Mathias Bynens (@mathias) October 28, 2014
A URL cannot be analyzed after decoding #
After a URL gets decoded, it is not guaranteed to retain its syntactic meaning. Reserved or unsafe characters may appear that change the meaning of the URL.
Example: http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one
Part | Value |
---|---|
Scheme | http |
Host | chrisrng.svbtle.com |
Path segment | Harry%2FPotter%3Fchosen+one |
Decoded Path segment | Harry/Potter?chosen+one |
If we decode it without regard for its syntactic meaning, we get this URL:
Decoded URL: http://chrisrng.svbtle.com/Harry/Potter?chosen+one
Part | Value |
---|---|
Scheme | http |
Host | chrisrng.svbtle.com |
Path segment | Harry |
Path segment | Potter |
Query parameter name | chosen one |
The analysis of the decoded URL is clearly wrong: we must split the URL into its parts, based on the path and reserved characters, before we decode it. Rewrite rules must take care never to decode a URL before attempting to match it if reserved characters are allowed to be URL-encoded (which may or may not be the case, depending on your application).
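The two analyses above can be reproduced with the standard `URL` class, assuming a browser or Node.js environment where it is available as a global. This is only a sketch of the "parse first, decode second" order, not a complete URL handling routine.

```javascript
const original = 'http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one';

// Parse first, then decode each path segment individually.
const parsed = new URL(original);
const pathSegments = parsed.pathname
  .split('/')
  .filter((s) => s !== '')
  .map((s) => decodeURIComponent(s));
console.log(pathSegments);  // [ 'Harry/Potter?chosen+one' ]
console.log(parsed.search); // '' (there is no query string)

// Decode first, then parse: the structure of the URL changes.
const decodedFirst = new URL(decodeURIComponent(original));
console.log(decodedFirst.pathname);                  // /Harry/Potter
console.log(decodedFirst.search);                    // ?chosen+one
console.log([...decodedFirst.searchParams.keys()]);  // [ 'chosen one' ]
```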
Decoded URLs cannot be reencoded to the same form #
Using the same example, the URL http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one decodes to http://chrisrng.svbtle.com/Harry/Potter?chosen+one. However, the decoded URL cannot be re-encoded back to the original, because it is already a valid URL and an encoder has no way of knowing that the / and ? were once data rather than delimiters. It is simply a very different URL from the one we started with.
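A quick way to see the lost round trip, as a sketch using only the built-in functions:

```javascript
const original = 'http://chrisrng.svbtle.com/Harry%2FPotter%3Fchosen+one';

const decoded = decodeURIComponent(original);
console.log(decoded); // http://chrisrng.svbtle.com/Harry/Potter?chosen+one

// encodeURI sees a perfectly valid URL and finds nothing to encode.
const reencoded = encodeURI(decoded);
console.log(reencoded === decoded);  // true
console.log(reencoded === original); // false
```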
Handling URLs in JavaScript #
We can try to use a regex to split a valid URL into its parts, but it’s really hard to get right.
Do not encode the whole URL String using encodeURIComponent #
Encoding the whole URL string using encodeURIComponent gives an incorrect result, because the delimiters are encoded along with everything else:
encodeURIComponent('http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One')
"http%3A%2F%2Fwww.chrisrng.svbtle.com%2FHarry%2FPotter%3FChosen%2BOne"
Whereas encodeURI('http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One')
gives us
"http://www.chrisrng.svbtle.com/Harry/Potter?Chosen+One"
Do not construct URLs without encoding each part #
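This is important to remember when creating query strings. When we encode the complete query string ?wand=elder&scar=true with encodeURIComponent, we obtain %3Fwand%3Delder%26scar%3Dtrue, which is no longer a usable query string because the ?, = and & delimiters were encoded along with the data. Encode each key and value separately, then join them with = and &.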
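The opposite mistake, not encoding the parts at all, breaks just as easily. The sketch below uses a hypothetical value that contains an "&" and checks the result with the standard `URLSearchParams` class (assumed to be available as a global, as in browsers and Node.js).

```javascript
// Hypothetical value that contains a delimiter character.
const wand = 'Elder & Holly';

// Naive concatenation: the "&" inside the value is read as a separator.
const naive = `?wand=${wand}&scar=true`;
console.log(new URLSearchParams(naive).get('wand')); // 'Elder '

// Encoding only the value keeps the real delimiters intact.
const safe = `?wand=${encodeURIComponent(wand)}&scar=true`;
console.log(safe);                                  // ?wand=Elder%20%26%20Holly&scar=true
console.log(new URLSearchParams(safe).get('wand')); // 'Elder & Holly'
```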
Do not use escape() #
escape() has been deprecated since ECMAScript v3.
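Beyond being deprecated, escape() also produces different output from the URI functions. A short comparison, assuming an environment that still ships the legacy escape() global (browsers and Node.js do):

```javascript
// escape() leaves "+" untouched, so it can later be misread as a space.
console.log(escape('+'));             // +
console.log(encodeURIComponent('+')); // %2B

// escape() encodes "é" as a single Latin-1 byte instead of UTF-8.
console.log(escape('\u00e9'));             // %E9
console.log(encodeURIComponent('\u00e9')); // %C3%A9

// Characters above 0xFF get a non-standard %uXXXX form.
console.log(escape('\u20ac')); // %u20AC
```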
Use encodeURI() #
Use encodeURI when you want a working URL.
Use encodeURIComponent() #
Use encodeURIComponent when you want to encode a URL parameter.
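As a rough illustration of the two recommendations side by side (the /login path and the next parameter are made up for this example):

```javascript
// encodeURI: for a complete URL whose delimiters must survive.
const page = encodeURI('http://www.chrisrng.svbtle.com/Harry Potter?Chosen One');
console.log(page); // http://www.chrisrng.svbtle.com/Harry%20Potter?Chosen%20One

// encodeURIComponent: for a single parameter value, here a whole URL
// passed along inside another URL's query string.
const next = encodeURIComponent('http://www.chrisrng.svbtle.com/?a=1&b=2');
const login = `http://www.chrisrng.svbtle.com/login?next=${next}`;
console.log(login);
// http://www.chrisrng.svbtle.com/login?next=http%3A%2F%2Fwww.chrisrng.svbtle.com%2F%3Fa%3D1%26b%3D2
```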
Differences between encodeURI() and encodeURIComponent() #
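The two functions differ on only 11 characters: ; , / ? : @ & = + $ #. encodeURI() leaves them as they are, while encodeURIComponent() encodes them.
This can be checked by running the following code: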
// Collect every character in the range 0-255 that the two functions encode differently.
const arr = [];
for (let i = 0; i < 256; i++) {
  const char = String.fromCharCode(i);
  if (encodeURI(char) !== encodeURIComponent(char)) {
    arr.push({
      character: char,
      encodeURI: encodeURI(char),
      encodeURIComponent: encodeURIComponent(char)
    });
  }
}
// Print the differences as a table.
console.table(arr);
How to create a Query String #
We start with an object that represents the query string as a key-value store. Then we flatten it out, encoding each key and each value separately before joining the pairs into the complete query string.
// The query string parameters as a plain key-value object.
const queryStringData = {
  'wand': 'Elder',
  'scar': 'lightning'
};
const queryStringKeys = Object.keys(queryStringData);
const queryStringResult = queryStringKeys.reduce(
  (queryStringArray, currKey) => {
    // Encode the key and the value separately so the "=" and "&" delimiters stay intact.
    const encodedKey = encodeURIComponent(currKey);
    const encodedValue = encodeURIComponent(queryStringData[currKey]);
    queryStringArray.push(`${encodedKey}=${encodedValue}`);
    return queryStringArray;
  }, []).join('&');
console.log(`Query String is: ${queryStringResult}`);
// Query String is: wand=Elder&scar=lightning