Introduction
I have used Beautiful Soup with Python in the past for screen scraping, so I was immediately excited at the possibilities of a similar library for Java.
jsoup is a Java API for extracting data from HTML and manipulating its DOM.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
I did a quick proof of concept just to see what it would do with my "dirty" code.
It results in an interesting output that could be useful if used properly. If you put in garbage, you will get "less" garbage out. It is better than nothing.
I decided that this could still be really useful, especially combined with Hibernate Validator and JSF.
Hibernate Validator - @SafeHtml
I was looking at the Hibernate Validators to see about cleaning up some input from users to prevent XSS issues. I noticed that there was a validator called @SafeHtml(whitelistType=, additionalTags=, additionalTagsWithAttributes=). It uses the jsoup HTML parser.
Alas, I am full of sorrow. I cannot seem to get the @SafeHtml annotation to work. GlassFish vomits and complains it cannot find it. I even tried to add it to every lib directory in GlassFish without success. Failing to succeed, I tried Tomcat 8 next. Again, nothing but bitterness and disappointment. It just will not get picked up.
I tried looking for a working example of the validator, and didn't find any that worked. I am not sure what is going on, but if I can't figure it out, I imagine I am not alone. So I just blog about it. ;-)
Undeterred
Well, I decided that I didn't need Hibernate anyway! I feel like I should be in Aesop's Fables. I mentioned my Proof of Concept (POC) earlier. I figured I would try to remove some <script /> tags from my code, and I even encoded some of them to see what it would do. The whole point here is to help prevent XSS.
Here is my Apache Maven project on Bitbucket: jsoup-cleaner
Note: See the project source for a more complete representation of the code I am trying to strip. SyntaxHighlighter is having issues with the nested script tags. The same applies to the output.
I was actually surprised by the result. It stripped out the <script /> tags, but totally missed the encoded tags. That is a major issue.
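To see why encoded tags slip past a tag-based cleaner, here is a minimal sketch that uses only the JDK. It is not jsoup and not the project code: the class and method names (EncodedTagDemo, containsScriptTag, decodeOnce, decodeFully) are hypothetical, the substring check is a deliberately crude stand-in for a real cleaner, and the decoder only handles a handful of entities. The point is simply that an encoded tag contains no literal angle bracket until something decodes it.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: JDK only, not jsoup's or any library's implementation.
final class EncodedTagDemo {

    // Decimal (&#60;) and hexadecimal (&#x3c;) numeric character references.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Crude stand-in for a tag-based check: looks for a literal "<script".
    static boolean containsScriptTag(String html) {
        return html.toLowerCase(Locale.ROOT).contains("<script");
    }

    // Decodes numeric references and a few named entities in a single pass.
    static String decodeOnce(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            if (Character.isValidCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                sb.append(m.group()); // leave invalid references untouched
            }
            last = m.end();
        }
        sb.append(input, last, input.length());
        // Replace &amp; last so "&amp;lt;" needs a second pass to become "<".
        return sb.toString()
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    // Repeats decoding until the text stops changing (capped at 5 passes).
    static String decodeFully(String input) {
        String current = input;
        for (int i = 0; i < 5; i++) {
            String next = decodeOnce(current);
            if (next.equals(current)) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String encoded = "&lt;script&gt;alert(1)&lt;/script&gt;";
        System.out.println(containsScriptTag(encoded));              // false
        System.out.println(containsScriptTag(decodeFully(encoded))); // true
    }
}
```

The check only sees the script tag after the input has been decoded, which is why a decoding step has to run before any tag stripping. The five-pass cap is an arbitrary choice for this sketch; input nested more deeply than that would not be fully decoded.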
Improvements
I was looking for some solutions for the encoded JavaScript issue when I discovered a blog post called Jersey Cross-Site Scripting XSS Filter for Java Web Apps.
This was not exactly what I needed, but it did contain a method which used jsoup and another framework called ESAPI.
Enterprise Security API (ESAPI) was developed by OWASP to enhance the security of Enterprise applications. OWASP has a lot more than this framework.
ESAPI can strip out the encoded bits to help prevent XSS.
I shamelessly used the following method from the blog post.
This does effectively remove any encoded <script /> tags from the output. It does not, however, prevent errors in judgement on the part of the developer, for example taking the results of the output and using them directly in an HTML JavaScript event attribute like onmouseover or onclick.
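The event attribute problem can be shown with a small JDK-only sketch. Everything here is hypothetical and for illustration: the class AttributeContextDemo, its methods, the regex-based stripTags (a crude stand-in for a real cleaner), and the JavaScript function greet. The payload contains no tags at all, so tag stripping leaves it untouched, yet it still breaks out of the JavaScript string once it is placed inside onclick.

```java
// Hypothetical demo class: JDK only; a vetted encoder library is preferable in real code.
final class AttributeContextDemo {

    // Crude stand-in for a tag-stripping cleaner (not how jsoup works internally).
    static String stripTags(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    // Unsafe: drops the value straight into a JavaScript string inside onclick.
    static String unsafeButton(String value) {
        return "<button onclick=\"greet('" + value + "')\">Hi</button>";
    }

    // Escapes every non-alphanumeric UTF-16 unit as a backslash-u sequence,
    // so the value cannot close the JavaScript string or the HTML attribute.
    static String jsStringEscape(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean alphanumeric = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alphanumeric) {
                sb.append(c);
            } else {
                String hex = Integer.toHexString(c);
                sb.append("\\u");
                for (int pad = hex.length(); pad < 4; pad++) {
                    sb.append('0');
                }
                sb.append(hex);
            }
        }
        return sb.toString();
    }

    // Safer: escapes the value for the JavaScript string context first.
    static String saferButton(String value) {
        return "<button onclick=\"greet('" + jsStringEscape(value) + "')\">Hi</button>";
    }

    public static void main(String[] args) {
        String payload = "'); alert(1); //";
        String cleaned = stripTags(payload); // unchanged: there are no tags to strip
        System.out.println(unsafeButton(cleaned));
        System.out.println(saferButton(cleaned));
    }
}
```

The first line printed contains a working alert(1) call inside the attribute. The second keeps the whole payload inside the string literal. Cleaning removes markup, but the escaping still has to match the context where the value ends up.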
I created an example project called XSS Scripter's Delight which I demonstrated at the Greenville Java Users Group. It demonstrates what happens when you don't validate inputs from users. The name is satirical, but the project does demonstrate in a non-malicious way what can be done to your application if you are not careful.
The Apache Maven project developed with NetBeans can be found on Bitbucket here: xss-scripters-delight.