I'm working on a web site classification method. Papers in the field (text classification / web page classification) usually use one of:
- a) Ancient, but well-known and widely adopted, datasets like WebKB 7-sectors
- b) Their own datasets, collected from domains found in a public categorized web directory
What puzzles me:
- After reviewing recent research, I found that in the b) cases, authors are reluctant to make the dataset (corpus) publicly available.
They clearly state where they harvested the web domains and they do provide the parameters of the crawl. Still, I assume it would make scientific sense to publish the exact corpus, so that future comparisons stand on more solid ground. Yet I rarely find a dataset link in a paper. (Interestingly, in a few cases I recall, the authors published the corpus but never mentioned it in the article; I found it by accident while searching GitHub / Google for a PDF of the original source.)
Now I'm worried that there are legal concerns here that I might not be aware of. I would very much appreciate thoughts and feedback from your practice / experience:
Should we allow ourselves to interpret the "all rights reserved" copyright notice (present on ~100% of pages in the sample) less rigidly in our case, the case of scientific research?
- But how can I sleep well if I have published gigabytes of web pages carrying such a copyright notice in a dataset repository under a Creative Commons or GPLv3 licence? I mean, it is hard to play dumb if I am ever called out for such a "disagreement" between the repository content and the licence.
Is it just me, or do you agree that it sounds insane to ask for written permission from each of the 1400 legal entities (in my current sample) running the business web sites being researched?
How do you handle this dilemma?
Or would you stick only with the ancient/mainstream datasets?
My particular research focuses more on the "web site" than on the "web page / text" aspect of classification. Hence, I have a strong motive to crawl an additional dataset (from domains harvested from https://dmoztools.net/), besides using the mainstream ones (since these rarely provide web-site context and mostly contain web pages / single documents). If I take that methodological path, I have to make the dataset available. But are there hidden problems on such a path?
Thanks in advance for all your thoughts. I know this is not an easy one :)