I have used Beautiful Soup with Python in the past for screen scraping. I was immediately excited at the possibilities. JSoup is a Java API for extracting data, and manipulating the DOM in HTML.
I did a quick proof of concept just to see what it would do with my "dirty" code.
jsoupimplements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
It results in an interesting output that could be useful if used properly. If you put in garbage, you will get "less" garbage out. It is better than nothing.
I decided that this still could be really useful especially combined with Hibernate Validators and JSF.
Hibernate Validator - @SafeHtmlI was looking at the Hibernate Validators to see about cleaning up some input from users to prevent XSS issues. I noticed that there was a validator called
@SafeHtml(whitelistType=, additionalTags=, additionalTagsWithAttributes=). It uses the JSoup HTML parser.
Alas, I am full of sorrow. I can not seem to get the <code>@SafeHtml</code> annotation to work. GlassFish vomits and complains it can not find it. I even tried to add it to every
libdirectory in GlassFish without success. Failing to succeed, I tried Tomcat 8 next. Again, nothing but bitterness and disappointment. It just will not get picked up.
I tried looking for a working example of the validator, and didn't find any that worked. I am not sure of the what is going on, but if I can't figure it out. I imagine I am not alone. I just blog about it. ;-)
UndeterredWell I decided that I didn't need Hibernate anyway! I feel like I should be in Aesop's Fables. I mentioned my Proof of Concept (POC) earlier. I figured I would look at trying to remove some
<script />tags from my code and even encoded them too to see what it would do. The whole point here is to help prevent XSS.
Here is my Apache Maven project on BitBucket: jsoup-cleaner
Note: See the actual code for a more complete representation of the actual code I am trying to strip. The Syntaxhighlighter is having issues with the nested script tags. The same applies to the output.
I was surprised by the result actually. It stripped out the
<script />tags, but totally missed the encoded tags. That is a major issue.
This was not exactly what I needed, but it did contain a method which used JSoup and another framework called ESAPI. Enterprise Security API (ESAPI) was developed by OWASP to enhance the security of Enterprise applications. OWASP has a lot more than this framework. ESAPI can strip out the encoded bits to help prevent XSS.
I shamelessly used the following method from the blog post.
This does effectively remove any encoded
I created an example project called XSS Scripter's Delight which I demonstrated at the Greenville Java Users Group. It demonstrates what happens when you don't validate inputs from users. The name is satirical, but does demonstrate in a non-malicious way what you can do if you are not careful.
The Apache Maven project developed with NetBeans can be found on Bitbucket here: xss-scripters-delight.