Tables explained
The DB consists of three different tables: states, screenshots, websites.
Table States
Each phishing website was assigned a state when it was manually verified. This state is used to categorize websites, in particular those that were no longer a phish when the bot visited them. The states table contains explanations of the states used in the websites table. The most important state is "3": all websites with this state are real phishing websites that have been assigned a parent website.
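As a minimal sketch of how state "3" might be used to select the verified phishing websites, the example below builds a small in-memory SQLite table. The database engine, the reduced column set, the sample rows and the placeholder state value 1 are all assumptions made for illustration; only the column names `id`, `url`, `isPhish`, `parent` and `state` come from the websites table described in this document.

```python
import sqlite3

# Hypothetical, simplified copy of the websites table (assumption: SQLite,
# and only the columns needed for this example).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE websites ("
    "id INTEGER PRIMARY KEY, url TEXT, isPhish INTEGER, "
    "parent INTEGER, state INTEGER)"
)
conn.executemany(
    "INSERT INTO websites (id, url, isPhish, parent, state) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "http://original.example/", 0, None, None),   # legitimate site
        (2, "http://phish-a.example/login", 1, 1, 3),     # verified phish of site 1
        (3, "http://phish-b.example/gone", 1, None, 1),   # placeholder non-3 state
    ],
)


def verified_phishes(connection):
    """Return (id, url, parent) for every website whose state is 3."""
    return connection.execute(
        "SELECT id, url, parent FROM websites WHERE state = 3 ORDER BY id"
    ).fetchall()


rows = verified_phishes(conn)
print(rows)
```

The meaning of the other state values should be looked up in the states table rather than inferred from this sketch.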
Table Screenshots
The screenshots table contains information about all the screenshot image files that were taken. All those files are in the subfolder "screenshots" that is part of the distribution of this database.
Field | Description |
---|---|
id: | A unique id for each screenshot |
website: | the website this screenshot belongs to |
type: | the type of the screenshot taken (1=FULL: all content of the website; 2=CONTENT: cropped to the visible portion of the website; 3=WINDOW: the whole browser window as it was rendered) |
name: | the name of the file as it is stored in the screenshot folder |
width: | image width in pixels |
height: | image height in pixels |
md5: | an MD5 hash of the file contents, which makes it possible to find files that have exactly the same content |
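
The md5 field can be reproduced from the image files themselves. The following sketch computes an MD5 digest per file and groups files whose contents are byte-for-byte identical. It assumes the stored hash is the usual lowercase hex digest, which this document does not state, and the helper names `file_md5` and `group_identical` as well as the demo files are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Map each MD5 digest to the sorted list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(path)
    return {digest: sorted(members) for digest, members in groups.items()}


# Demo with throwaway files standing in for the "screenshots" subfolder.
with tempfile.TemporaryDirectory() as folder:
    contents = {"a.png": b"same bytes", "b.png": b"same bytes", "c.png": b"other bytes"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(folder, name)
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    groups = group_identical(paths)
    # Keep only groups with more than one file, reduced to their file names.
    duplicates = [
        [os.path.basename(p) for p in members]
        for members in groups.values()
        if len(members) > 1
    ]

print(duplicates)
```

Matching digests only indicate identical file contents; screenshots that merely look alike will generally have different hashes.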
Table Websites
The websites table contains all the information that was collected for the websites. Not all fields apply to every website.
Field | Description |
---|---|
id: | a unique id for the website |
alexarank: | for websites parsed from the top 1000 of alexa.com, the rank of the website; otherwise null |
isPhish: | whether this webpage url comes from a phishing list (1) or is non-fraudulent (0) |
parent: | the parent website of this website (for phishes, this contains the verified original website); otherwise null |
parentCount: | a counter of how many websites have been found with this website as their parent (redundant, as it could be calculated by counting all websites that have their parent field set to the id of this website) |
url: | the url that was originally provided for the scan |
urlHash: | an md5 hash of this url for quicker finding of identical urls |
urlBasedomain: | the base domain of this url (usually the top-level domain plus the domain part in front of it, e.g. "google.com"; for some special domain names it may include more, e.g. "google.co.uk") |
finalUrl: | the url that was finally parsed after following all redirect requests |
finalUrlBasedomain: | same as urlBasedomain, but for finalUrl |
name: | a name identifier that is sometimes assigned to websites (otherwise null) |
quality: | reserved for future use (always null) |
scanned: | UNIX timestamp of when this url was visited |
rescan: | used internally for rescanning already scanned websites (always 0) |
statusCode: | the status code that was returned for this website. Only websites with status code 200 were manually checked and assigned a state |
htmlContent: | the HTML content of the page that was loaded under finalUrl |
loadTime: | the time it took to query the content and take all screenshots |
phishtank_XYZ: | fields parsed from the phishtank.com list of fraudulent websites. |
state: | the state of the phishing website (3 is the most important here; see the states table for more information) |
duplicatedFrom: | in some cases websites were assigned more quickly by marking them as a duplicate of another phishing website; that website is stored here. This field does NOT denote that all duplicates of that website are recorded, nor does it denote that the websites look 100 percent similar. |
created: | UNIX timestamp of when this website was created and stored in the table (usually prior to scanned) |
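
Two of the fields above lend themselves to a short worked example: parentCount, which is described as derivable from the parent field, and the UNIX timestamps in scanned and created. The sketch below recounts parentCount with a self-join and converts one timestamp to a readable UTC date. It assumes an SQLite database with a reduced, hypothetical column set and made-up sample rows; the helper name `recount_parents` is likewise an assumption.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical, simplified copy of the websites table (assumption: SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE websites ("
    "id INTEGER PRIMARY KEY, parent INTEGER, parentCount INTEGER, scanned INTEGER)"
)
conn.executemany(
    "INSERT INTO websites (id, parent, parentCount, scanned) VALUES (?, ?, ?, ?)",
    [
        (1, None, 2, 1400000000),  # original website, referenced by two phishes
        (2, 1, 0, 1400000100),     # phish pointing at website 1
        (3, 1, 0, 1400000200),     # phish pointing at website 1
    ],
)


def recount_parents(connection):
    """Return (id, stored parentCount, recounted value) for every website."""
    return connection.execute(
        """
        SELECT w.id, w.parentCount, COUNT(c.id) AS recounted
        FROM websites AS w
        LEFT JOIN websites AS c ON c.parent = w.id
        GROUP BY w.id, w.parentCount
        ORDER BY w.id
        """
    ).fetchall()


counts = recount_parents(conn)
print(counts)

# Convert the scanned timestamp of website 1 to an aware UTC datetime.
scanned_at = datetime.fromtimestamp(1400000000, tz=timezone.utc)
print(scanned_at.isoformat())
```

Rows where the stored and recounted values differ would point to an inconsistency between parentCount and the parent field.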