One of the most important things when developing a website is making sure that it is easy for people to find the information they need. Site maps and site searches are probably the most commonly implemented features for making a site's content easily accessible. Whenever I build a site that is more than just a few pages, I usually create a site map that dynamically generates links to every page on the site. Then I use the script below, which reads the site map, crawls the whole site, and indexes the content into a Verity collection to power my search functionality.


<!--- Create a function to remove HTML from a string --->
<cfscript>
function RemoveHTML(source){
   // Remove HTML development formatting
   // Replace line breaks with space
   var result = Replace(source,chr(13), " ","ALL");
   // Remove repeating spaces because browsers ignore them
   result = ReReplace(result, "( )+", " ","ALL");
   // Tag names are matched without regard to case (<BR> and <br> are the same tag)
   // Remove the header (prepare first by clearing attributes)
   result = ReReplaceNoCase(result, "<( )*head([^>])*>","<head>", "ALL");
   result = ReReplaceNoCase(result, "(<( )*(/)( )*head( )*>)","</head>", "ALL");
   result = ReReplace(result, "(<head>).*?(</head>)","", "ALL");
   // remove all scripts (prepare first by clearing attributes)
   // .*? is non-greedy so text between two separate script blocks is kept
   result = ReReplaceNoCase(result, "<( )*script([^>])*>","<script>", "ALL");
   result = ReReplaceNoCase(result, "(<( )*(/)( )*script( )*>)","</script>", "ALL");
   result = ReReplace(result, "(<script>).*?(</script>)","", "ALL");
   // remove all styles (prepare first by clearing attributes)
   result = ReReplaceNoCase(result, "<( )*style([^>])*>","<style>", "ALL");
   result = ReReplaceNoCase(result, "(<( )*(/)( )*style( )*>)","</style>", "ALL");
   result = ReReplace(result, "(<style>).*?(</style>)","", "ALL");
   // insert tabs in place of <td> tags
   result = ReReplaceNoCase(result, "<( )*td([^>])*>",chr(9), "ALL");
   // insert line breaks in place of <BR> and <LI> tags
   result = ReReplaceNoCase(result, "<( )*br( )*>",chr(13), "ALL");
   result = ReReplaceNoCase(result, "<( )*li( )*>",chr(13), "ALL");
   // insert line breaks in place
   // of <P>, <DIV> and <TR> tags
   result = ReReplaceNoCase(result, "<( )*div([^>])*>",chr(13), "ALL");
   result = ReReplaceNoCase(result, "<( )*tr([^>])*>",chr(13), "ALL");
   result = ReReplaceNoCase(result, "<( )*p([^>])*>",chr(13), "ALL");
   // Remove remaining tags like <a>, links, images,
   // comments etc - anything that's enclosed inside < >
   result = ReReplace(result, "<[^>]*>","", "ALL");
   // replace special characters:
   result = ReReplace(result, "&nbsp;"," ", "ALL");
   result = ReReplace(result, "&bull;"," * ", "ALL");
   result = ReReplace(result, "&lsaquo;","<", "ALL");
   result = ReReplace(result, "&rsaquo;",">", "ALL");
   result = ReReplace(result, "&trade;","(tm)", "ALL");
   result = ReReplace(result, "&frasl;","/", "ALL");
   result = ReReplace(result, "&lt;","<", "ALL");
   result = ReReplace(result, "&gt;",">", "ALL");
   result = ReReplace(result, "&copy;","(c)", "ALL");
   result = ReReplace(result, "&reg;","(r)", "ALL");
   // Remove all others. More special character conversions
   // can be added above if needed
   result = ReReplace(result, "&(.{2,6});", "", "ALL");
   // That's it.
   return result;
}
</cfscript>

<!--- Create a function to find URLs in a string --->
<cffunction name="FindURLs" output="true" returntype="array">

<cfargument name="text" type="string" required="yes">
<!--- Define local variables --->
<cfset var results=ArrayNew(1)>
<cfset var pos=1>
<cfset var subex="">
<cfset var done=false>

<cfloop condition="not done">

   <!--- Perform search --->
   <cfset subex=reFind("href=""http://(.*?)""", arguments.text, pos, true)>
   <!--- Anything matched? --->
   <cfif subex.len[1] is 0>
      <cfset done=true>
   <cfelse>
      <!--- Got one, add to array (skip the leading href=" and the closing quote) --->
      <cfif not listfind(arraytolist(results),mid(arguments.text,subex.pos[1]+6,subex.len[1]-7))>
         <cfset arrayappend(results,mid(arguments.text,subex.pos[1]+6,subex.len[1]-7))>
      </cfif>
      <!--- Reposition start point --->
      <cfset pos=subex.pos[1]+subex.len[1]>
   </cfif>

</cfloop>

<!--- and return results --->
<cfreturn results>

</cffunction>


<!--- Get the sitemap source code from my site --->
<cfhttp url="http://www.mywebsite.com/sitemap.cfm" method="GET"></cfhttp>

<!--- create an array of all the URLs in the site map --->
<cfset URLArray = FindURLs(cfhttp.FileContent)>

<!--- create a query to hold the data I want to put into Verity --->
<cfset SearchData = querynew("title,key,body,custom1,custom2,URLpath")>

<!--- Loop through the URLs --->
<cfloop from="1" to="#arraylen(URLArray)#" index="i">

<!--- Don't index the login page --->
<cfif not URLArray[i] contains "checkLogin.cfm">
   <cftry>
   <!--- get the HTML source via HTTP --->
   <cfif URLArray[i] contains "?">
      <cfhttp url="#URLArray[i]#&search=Y" method="GET"></cfhttp>
   <cfelse>
      <cfhttp url="#URLArray[i]#?search=Y" method="GET"></cfhttp>
   </cfif>
   <!--- Get Title --->
   <cfset startpos = find("<title>",cfhttp.filecontent,1)>
   <cfset endpos = find("</title>",cfhttp.filecontent,startpos)>
   <cfset tmpTitle = mid(cfhttp.filecontent,startpos+7,endpos-startpos-7)>
   <!--- add the data I need to the query --->
   <cfset queryaddrow(SearchData)>
   <cfset querysetcell(SearchData, "title", "#tmpTitle#")>
   <cfset querysetcell(SearchData, "key", "#URLArray[i]#")>
   <cfset querysetcell(SearchData, "body", "#RemoveHTML(cfhttp.filecontent)#")>
   <cfset querysetcell(SearchData, "custom1", "")>
   <cfset querysetcell(SearchData, "custom2", "")>
   <cfset querysetcell(SearchData, "URLpath", "#URLArray[i]#")>
   <!--- dump any errors --->
   <cfcatch type="Any">
   <cfdump var="#cfcatch#">
   </cfcatch>
   </cftry>
</cfif>

</cfloop>

<!--- Lock the collection to prevent searching while the collection is updated --->
<cflock name="MyVerityLock" type="EXCLUSIVE" timeout="5">
   <cftry>
      <cfindex action="PURGE" collection="MyCollection">
      <cfindex action="UPDATE" collection="MyCollection" query="SearchData" type="CUSTOM" title="title" body="body" key="key">
   <cfcatch type="Any">
   Indexing Error
   </cfcatch>
   </cftry>
</cflock>

You will see in the code that as the script crawls each page of the site, it adds "search=Y" to the URL's query string. I set up my sites so that if URL.Search equals "Y", the pages do not display the site's header, footer, or side navigation. This way my Verity index only contains the content in the body of the page, and Verity searches return more accurate results. However, you do want to make sure that the <title> is still there, as that is used in the collection. You will also notice that I strip the HTML out of the content before putting it into the body field of my query, so Verity only indexes the actual text on the page. Otherwise the Verity collection would index the HTML tags too, and a search for "img" would return every page with an <img> tag.
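
A page template could honor the search flag roughly as follows. This is only a minimal sketch: header.cfm, sidenav.cfm, and footer.cfm are hypothetical include names used for illustration, not files from my actual sites, and the title and content are placeholders.

```cfml
<!--- Default the flag so normal visitors get the full layout --->
<cfparam name="URL.search" default="N">

<html>
<head>
   <!--- Keep the title in both modes; the indexing script reads it --->
   <title>About Us</title>
</head>
<body>
<!--- Hypothetical includes: skipped when the crawler passes search=Y --->
<cfif URL.search neq "Y">
   <cfinclude template="header.cfm">
   <cfinclude template="sidenav.cfm">
</cfif>

<!--- Main page content: the only part that should reach the index --->
<p>Page content goes here.</p>

<cfif URL.search neq "Y">
   <cfinclude template="footer.cfm">
</cfif>
</body>
</html>
```

Requesting the page with ?search=Y appended should then return only the title and the main content.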

Also, you will see that I used an exclusive named cflock when updating the collection. I also put a read-only cflock (see code sample below) with the same name around the cfsearch tag on my site's search page. This way people can't search while the collection is being updated, which preserves the integrity of the index. Verity collections can easily get corrupted when you read and write to them at the same time.

   <cftry>
   <cflock name="MyVerityLock" type="READONLY" timeout="1" throwontimeout="Yes">
      <cftry>
         <cfsearch name = "searchResults" collection = "MyCollection" criteria = "#variables.crit#">
         <cfcatch type="Any">
            <b>The search criteria you entered contains invalid characters and/or parameters.</b>
            <cfset searcherror = 1>
         </cfcatch>
      </cftry>
   </cflock>
   <cfcatch type="Lock">
   <b>Our search index is currently being updated. Please try again in a few moments.</b>
   <cfset searcherror = 1>
   </cfcatch>
   </cftry>

The script usually needs a little tweaking to tailor it to a particular site. For example, you may have noticed in the code that I had a conditional statement preventing the login page from being indexed. Once you have the script indexing your site the way you want it, you would then add a ColdFusion scheduled task to execute this script as often as is necessary for your site.
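
The scheduled task can be created in the ColdFusion Administrator, or from code with the cfschedule tag. The sketch below assumes the indexing script is saved as indexsite.cfm and should run once a day; the task name, start date, start time, and timeout are placeholder values to adjust for your own site.

```cfml
<!--- Create or update a daily task that requests the indexing script --->
<!--- Task name, dates, times and timeout are assumed example values --->
<cfschedule action="update"
   task="IndexSiteForVerity"
   operation="HTTPRequest"
   url="http://www.mywebsite.com/indexsite.cfm"
   startdate="1/1/2008"
   starttime="2:00 AM"
   interval="daily"
   requesttimeout="600">
```

A generous request timeout is probably worthwhile, since the script makes one HTTP request per page listed in the site map.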

Comments
Thanks for this article.

This part especially:
You will see in the code that as the script is crawling each page of the site, it adds "search=Y" to the URLs query string. I set up my sites so that if URL.Search equals "Y", the pages do not display the sites header, footer, or side navigation. This way my verity index only contains the content in the body of the page.
is SO smart!

I found a similar solution for the indexing via sitemap but ended up jumping through a few hoops to strip out everything before and after the main content area of the page, using a comment in the code. Of course, without that comment I'd be out of luck on any given page. Very cool solution here.

Also appreciate the info about the lock and possible corruption. I will be sure to revisit this code when I do my next verity-via-sitemap setup... soon!
# Posted By Michael Evangelista | 11/30/07 3:05 PM
This code looks great and I would really like to try it out as a standard in my coding for verity search.

What should the sitemap.cfm page contain?

Just a list of links to pages on the site like below?

eg: <a href=index.cfm>home</a>
<a href=index.cfm?pageid=2>about us</a>
<a href=index.cfm?pageid=3>products</a>
<a href=index.cfm?pageid=4>services</a>
<a href=index.cfm?pageid=5>contact us</a>

I look forward to your reply.
Many thanks in advance.
# Posted By Jason | 11/3/08 6:55 AM
@Jason,

This script is set up to read a sitemap where all the href attributes in the links contain full urls.

<a href='http://www.mysite.com/index.cfm' >home</a>
<a href='http://www.mysite.com/index.cfm?pageid=2' >about us</a>
<a href='http://www.mysite.com/index.cfm?pageid=3' >products</a>
<a href='http://www.mysite.com/index.cfm?pageid=4' >services</a>
<a href='http://www.mysite.com/index.cfm?pageid=5' >contact us</a>

However, in indexsite.cfm you can change the cfhttp tag that reads the sitemap to set the resolveurl attribute to "yes", and then cfhttp will change all your relative links into full URLs.

<cfhttp url="http://www.mywebsite.com/sitemap.cfm" resolveurl="Yes" method="GET"></cfhttp>
# Posted By Scott Bennett | 11/3/08 1:08 PM
Cool! Thanks Scott.

Hopefully this provides me with an effective solution.
I'll post back and let you know how it works or if I have any other questions.
# Posted By Jason | 11/3/08 8:16 PM
Hey Scott,

When testing the search, it doesn't seem to be searching the contents/body of the pages. It only returns results where the search term used matches what is in the page's <title></title>.

How do I get it to search the body as well?

Also, I cannot display what is stored as "body" and "URLpath".

Sorry I am sounding like such a newbie... this is my first time using Verity, normally I use queries across multiple tables, which is fairly slow.

Thanks in advance for all your help.
# Posted By Jason | 11/4/08 11:49 PM