Tables explained

The DB consists of three different tables: states, screenshots, websites.

Table States

Each phishing website was assigned a state during manual verification. The state is used to categorize the websites, in particular those that were no longer a phish when the bot visited them. The states table explains the states used in the websites table. The most important state is "3": all websites with this state are real phishing websites that have been assigned a parent website.

Table Screenshots

The screenshots table contains information about all the screenshot image files that were made. All those files are in the subfolder "screenshots" that is part of the distribution of this database.

Field: Description
id: a unique id for each screenshot
website: the website this screenshot belongs to
type: the type of the screenshot taken (1=FULL, all content of the website; 2=CONTENT, cropped to the visible portion of the website; 3=WINDOW, showing the whole browser window as it was rendered)
name: the name of the file as it is stored in the screenshots folder
width: image width in pixels
height: image height in pixels
md5: an md5 hash of the file contents. This makes it possible to find files that have exactly the same content
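
The md5 field can be recomputed from the files themselves to find screenshots with identical content. The following is only a minimal sketch in Python: the function names file_md5 and group_identical_files are made up for illustration, and it assumes that the md5 column stores the lowercase hexadecimal digest of the raw file bytes, which this document does not state.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical_files(folder):
    """Map each MD5 digest to the file names that share it.

    Only digests shared by more than one file are returned.
    Assumption: 'folder' is the "screenshots" subfolder of the distribution.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            groups[file_md5(path)].append(path.name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

Calling group_identical_files("screenshots") would return one entry per group of byte-identical files. Screenshots that only look alike but differ in a single byte will not be grouped.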

Table Websites

The websites table contains all the information that was collected for the websites. Not all fields are applicable to all websites.

Field: Description
id: a unique id for the website
alexarank: for websites parsed from the top 1000 of alexa.com this is the rank of the website, otherwise null
isPhish: whether this webpage url is from a phishing list (1) or non-fraudulent (0)
parent: the parent website of this website (for phishes this contains the verified original website), otherwise null
parentCount: a counter of how many websites have this website as their parent (this is redundant, as it could be calculated by counting all websites that have their parent field set to the id of this website)
url: the url that was originally provided for the scan
urlHash: an md5 hash of this url for quicker finding of identical urls
urlBasedomain: the base domain of this url. This usually means the top-level domain plus the domain part in front of it, e.g. "google.com"; for some special domain names it may include more, e.g. "google.co.uk"
finalUrl: the url that was finally parsed after following all redirect requests
finalUrlBasedomain: same as urlBasedomain, but for finalUrl
name: a name identifier that is sometimes assigned to websites (otherwise null)
quality: reserved for future use (always null)
scanned: a UNIX timestamp of when this url was visited
rescan: used internally for rescanning already scanned websites (always 0)
statusCode: the status code that was returned for this website. Only websites with status code 200 were manually checked and assigned a state
htmlContent: the HTML content of the page that was loaded under finalUrl
loadTime: the time it took to query the content and take all screenshots
phishtank_XYZ: fields parsed from the phishtank.com list of fraudulent websites
state: the state of the phishing website (3 is the most important here; see the states table for more information)
duplicatedFrom: in some cases websites were classified more quickly by marking them as duplicates of another phishing website; that website is stored here. This field does NOT mean that all duplicates of a website can be found through it, nor that the websites look 100 percent similar
created: a UNIX timestamp of when this website was created and stored in the table (usually prior to scanned)
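
The relations between the parent, parentCount, isPhish and state fields can be illustrated with a small query sketch. This is not the real database: the example below builds a throwaway in-memory SQLite table with only a few of the columns listed above, the column types and all rows are invented for illustration, and the database engine actually used for the distribution may differ. The state value 1 in the last row is a placeholder, not a documented state.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the websites table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE websites ("
        "id INTEGER PRIMARY KEY, url TEXT, isPhish INTEGER, "
        "parent INTEGER, parentCount INTEGER, state INTEGER)"
    )
    rows = [
        # id, url, isPhish, parent, parentCount, state (all values invented)
        (1, "https://bank.example/", 0, None, 2, None),
        (2, "http://phish-one.example/login", 1, 1, 0, 3),
        (3, "http://phish-two.example/login", 1, 1, 0, 3),
        (4, "http://gone.example/", 1, None, 0, 1),
    ]
    conn.executemany("INSERT INTO websites VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def recomputed_parent_counts(conn):
    """Recompute parentCount by counting rows whose parent is the given id."""
    query = (
        "SELECT w.id, COUNT(c.id) FROM websites AS w "
        "LEFT JOIN websites AS c ON c.parent = w.id "
        "GROUP BY w.id ORDER BY w.id"
    )
    return dict(conn.execute(query).fetchall())


def verified_phish_pairs(conn):
    """Return (phish url, original url) pairs for phishes in state 3."""
    query = (
        "SELECT p.url, o.url FROM websites AS p "
        "JOIN websites AS o ON o.id = p.parent "
        "WHERE p.isPhish = 1 AND p.state = 3 ORDER BY p.id"
    )
    return conn.execute(query).fetchall()
```

With the demo rows, recomputed_parent_counts yields a count of 2 for website 1, matching its stored parentCount, and verified_phish_pairs returns the two phishes together with the url of their verified original.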