Tuesday, January 13, 2015

Allow external URLs in my site?

If you run a website that lets end users create content, which is almost any interactive site nowadays, you have to be very careful about how you handle user-supplied URLs.

For example, if you are Facebook, you allow users to paste URLs on their walls.
You might even allow those URLs to drive a JavaScript or meta tag redirection.

Some developers might opt to use an HTML encoder, breaking many URLs.

More user-conscious developers might craft very thorough regular expressions that allow most common characters, but those can still prevent legitimate users from entering real URLs.

So I decided to try to solve this problem. I googled a bit but couldn't find the answer.
I found some discussions on Stack Overflow, and people even asked how Stack Overflow itself does it, but there was no reference to a library or code.

In the past I've used OWASP ESAPI, and I usually recommend it in my classes.
It has an encoder I had never used, but I thought the name, "encodeForUrl", was self-descriptive.

Even the description sounds promising:

"Encode for use in a URL. This method performs URL encoding on the entire string."

So I tried it.
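The call itself is a one-liner; roughly what I ran looked like this (a minimal sketch: the class name is just a throwaway of mine, and it assumes the ESAPI jar and an ESAPI.properties configuration are on the classpath):

import org.owasp.esapi.ESAPI;

public class EsapiEncodeTest {
    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "http://www.example.com/";
        // ESAPI's encodeForURL() percent-encodes the entire input string.
        System.out.println(ESAPI.encoder().encodeForURL(url));
    }
}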

What I got after feeding it this sample URL:

"http://www.google.com/dir/?test=test&test2=test<script>alert(1);<ScRiPt>"

was:


http%3A%2F%2Fwww.google.com%2Fdir%2F%3Ftest%3Dtest%26test2%3Dtest%3Cscript%3Ealert%281%29%3B%3CScRiPt%3E

Of course, if that link is used in HTML such as

<a href="http%3A%2F%2Fwww.google.com%2Fdir%2F%3Ftest%3Dtest%26test2%3Dtest%3Cscript%3Ealert%281%29%3B%3CScRiPt%3E">

it won't work.

If you look at the source code of the method, it's just a call to URLEncoder.encode.
It's not a bug; I just misunderstood what that method is for.
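In other words, the result above is roughly what you get from calling java.net.URLEncoder directly (a quick sketch; the class name is just a throwaway):

import java.net.URLEncoder;

public class PlainEncodeDemo {
    public static void main(String[] args) throws Exception {
        // URLEncoder.encode() percent-encodes the separators as well,
        // so the scheme, slashes, and query delimiters all get escaped.
        System.out.println(URLEncoder.encode("http://www.google.com/dir/?a=b", "UTF-8"));
        // prints: http%3A%2F%2Fwww.google.com%2Fdir%2F%3Fa%3Db
    }
}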

I could have continued googling or looked at the source code of an open source forum such as phpBB, but I decided to take matters into my own hands.

I decided to first validate the URL, being strict about how it is composed up to the end of the domain name. I might be leaving out some legitimate URLs (I haven't even considered IDNs).


If the URL passes this regex validation:

^(https?:\\/\\/)([a-zA-Z0-9-_\\.]+)(:[0-9]{1,5})?((\\?|\\/)(.*))?$

the code then URL-encodes everything after the domain and decodes back certain special characters such as / ? = + & . ,

This double work might not be efficient, but I would rather blacklist (encode) everything and then whitelist what I allow back, than do it the other way around, where I could forget to blacklist some dangerous character.

The code can be found here or below.
You are encouraged to use it and change it, but I'm not responsible for it.

This code assumes that the input has been canonicalized first, so there is no URL-encoded input.


import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;


public class encodeUrl {

    public static void main(String[] args) {

        String testUrl = "http://www.google.com/dir/?test=test&test2=test+1,1.<script>alert(1);<ScRiPt>";

        if (args.length > 0) {
            testUrl = args[0];
        }

        System.out.println("in:  " + testUrl);

        // Strict validation up to the end of the domain name (and optional port).
        // Group 5 is the first "/" or "?", group 6 is everything after it.
        String pattern = "^(https?:\\/\\/)([a-zA-Z0-9-_\\.]+)(:[0-9]{1,5})?((\\?|\\/)(.*))?$";

        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(testUrl);

        if (m.matches()) {
            String out = "";
            out += (m.group(1) != null) ? m.group(1) : "";  // scheme
            out += (m.group(2) != null) ? m.group(2) : "";  // domain
            out += (m.group(3) != null) ? m.group(3) : "";  // port
            out += (m.group(5) != null) ? m.group(5) : "";  // first "/" or "?"

            try {
                if (m.group(6) != null) {
                    // URL-encode the rest, then decode back the whitelisted characters.
                    out += URLEncoder.encode(m.group(6), "UTF-8")
                            .replace("%2F", "/").replace("%3F", "?")
                            .replace("%23", "#").replace("%3D", "=")
                            .replace("%26", "&").replace("%2B", "+")
                            .replace("%2C", ",").replace("%2E", ".");
                }
                System.out.println("out: " + out);
                System.out.println("just encode: " + URLEncoder.encode(testUrl, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                System.err.println(e);
            }

        } else {
            System.out.println("out: not a valid url");
        }
    }
}


After writing this post, I started wondering whether all of this is even needed: why not just escape the single and double quote characters and make sure HTML attributes are enclosed in them?
At first that seemed to solve the problem in a much easier way.
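That simpler idea would look something like this (a hypothetical sketch, not part of the code above; the class, the escapeQuotes helper, and the example URL are mine):

public class QuoteEscapeOnly {

    // Hypothetical helper: neutralize only the quote characters so the URL
    // cannot close a quoted href attribute early.
    static String escapeQuotes(String url) {
        return url.replace("\"", "%22").replace("'", "%27");
    }

    public static void main(String[] args) {
        String url = "http://example.com/?q=\"><script>alert(1)</script>";
        // The encoded quote keeps the href attribute closed where we expect it.
        System.out.println("<a href=\"" + escapeQuotes(url) + "\">link</a>");
    }
}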

But then I thought of using the newline character %0A in the input (Firefox on OS X doesn't need the %0D for this attack to work).
Something like this:
"http://www.victim.com?param=%0A</script><script>alert(1);<script>"

In the response I got something like this:


<a href="http://www.victim.com?param=
</script><script>alert(1);<script>" >link</a>

(The browser renders the newline; that's why it spans two lines here.)

Which triggered the "alert(1)".

I didn't test my encodeForUrl with the canonicalized version of
"http://www.victim.com?param=%0A</script><script>alert(1);<script>"
but the output should look something like this:

http://www.victim.com?param=%0A%3C/script%3E%3Cscript%3Ealert%281%29%3B%3Cscript%3E

which is harmless.
(There is no newline this time; it only looks like one because of the post width.)


