I'm working on a web site classification method. Papers in the field (Text Classification / Web Page Classification) usually use:

  • a) Ancient, but well-known and widely adopted, datasets like WebKB 7-sectors
  • b) Their own datasets, collected from domains found in a public categorized web directory

What puzzles me:

  • After reviewing recent research, I found that in b) cases, authors are reluctant to make the dataset (corpus) publicly available.

They clearly state where they harvested the web domains and do provide the crawl parameters. Still, I assume it would make scientific sense to publish the exact corpus, so that future comparisons stand on more solid ground. Yet I rarely find a dataset link in a paper. (Interestingly, in a few cases I recall, the authors published the corpus but never mentioned it in the article; I found it by accident while searching GitHub / Google for a PDF of the original source.)

Now I'm worried that there are legal concerns here that I might not be aware of. I would very much appreciate thoughts and feedback from your practice / experience:

  • Should we allow ourselves to interpret the "all rights reserved" copyright notice (present on ~100% of pages in the sample) less rigidly in our case, the case of scientific research?

    • But how would I sleep well after publishing gigabytes of web pages carrying such a copyright notice in a dataset repository under a Creative Commons or GPLv3 licence? I mean, it is hard to play dumb if ever called out for such a "disagreement" between the repository content and the licence.
  • Is it just me, or do you agree that it sounds really insane to have to ask for written permission from each of the 1400 legal entities (in my current sample) running the business web sites being researched?

  • How do you guys handle this dilemma?

  • Or would you stick only with ancient/mainstream datasets?

My particular research focuses more on the "web site" than on the "web page / text" classification aspect. Hence, I have a strong motive to crawl an additional dataset (from domains harvested from https://dmoztools.net/), besides using the mainstream ones (since these rarely provide web-site context and mostly contain web pages / single documents). If I take that methodological path, I have to make the dataset available. But are there hidden problems on such a path?

Thanks in advance for all thoughts; this is not an easy one :)

  • You're asking several questions. Would you be able to break your question down into smaller questions?
    Commented Sep 13, 2018 at 12:25
  • Sorry (for the confusing style). The dilemma is about two conflicting "interests" (if you will); as Buffy broke it down, there are two issues: copyright vs reproducibility. It could be presented better, but essentially it is a problem with several variables :)
    – hardyVeles
    Commented Sep 13, 2018 at 13:05
  • By breaking down into smaller questions, I think @RichardErickson meant making smaller, separate posts.
    – Scientist
    Commented Sep 13, 2018 at 14:04
  • Copyright: I have no issue with that. Making the dataset open source: again, no issue there. But copyright vs open-sourcing the dataset, that's the dilemma. (Everything else, about the particular research, the datasets used etc., is just for illustration. Again, sorry for the style, I'll do better next time.)
    – hardyVeles
    Commented Sep 13, 2018 at 14:21

1 Answer

There seem to me to be two issues here: copyright and reproducibility of research. You pose them as being in conflict, but I don't think that is the case.

First, in most jurisdictions, publishing someone's copyrighted work without permission is a violation. "Just for research purposes" probably won't save you these days, though once it might have.

However, the reproducibility issue isn't quite as you pose it here. Suppose I publish an algorithm (I'm assuming it actually is an algorithm, i.e. deterministic). Suppose I also publish the dataset on which it is run. If someone else runs my algorithm on my dataset, does anyone expect that it won't produce the same result? Even a randomized "algorithm" should produce the same results within statistical limits when run on the same dataset.

But that isn't really the meaning of reproducible. I make a hypothesis about what should occur in a situation. I test it. Others should be able to test it under similar constraints. But I doubt that my hypothesis was "X happens in Y dataset with Z methodology". More likely Y here is treated as a category of datasets, not a specific one.

So, if you publish the mechanism by which you gather the dataset, rather than the dataset itself (along with Z), then others can test your work and the results will be more meaningful than if you publish the dataset itself. Thus you avoid the copyright problem altogether.

One reason that authors don't want to publish some datasets is to avoid stepping in quagmires such as copyright. Others involve human subject privacy, etc. But, that shouldn't matter if the hypothesis isn't overly tied to a specific dataset, but to a category of similarly gathered datasets.
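
As a minimal sketch of what "publishing the mechanism" could look like in practice, one could release a manifest holding the crawl parameters plus, per page, the URL, fetch time, and a hash of the content, without the pages themselves. Everything here is an assumption made for illustration: the names `build_manifest` and `unchanged_fraction`, the manifest layout, and the choice of SHA-256 over raw bytes (which will undercount pages with dynamic content). Whether hashes alone sidestep the copyright question is not something this sketch settles.

```python
import hashlib
import json


def build_manifest(pages, crawl_params):
    """Describe a crawl without storing page content.

    pages: iterable of (url, body_bytes, fetched_at_iso8601) tuples.
    crawl_params: dict of whatever settings the crawl used (hypothetical keys).
    """
    entries = []
    for url, body, fetched_at in pages:
        entries.append({
            "url": url,
            "fetched_at": fetched_at,
            "sha256": hashlib.sha256(body).hexdigest(),  # digest only, no content
            "n_bytes": len(body),
        })
    return {"crawl_params": crawl_params, "entries": entries}


def unchanged_fraction(manifest, recrawl):
    """Share of manifest URLs whose re-crawled bytes hash identically.

    recrawl: dict mapping url -> body_bytes from a later crawl.
    Missing URLs count as changed.
    """
    entries = manifest["entries"]
    if not entries:
        return 0.0
    same = sum(
        1 for e in entries
        if e["url"] in recrawl
        and hashlib.sha256(recrawl[e["url"]]).hexdigest() == e["sha256"]
    )
    return same / len(entries)


if __name__ == "__main__":
    pages = [
        ("http://example.com/a", b"hello", "2018-09-13T12:00:00Z"),
        ("http://example.com/b", b"world", "2018-09-13T12:00:05Z"),
    ]
    manifest = build_manifest(pages, {"max_depth": 2, "max_pages_per_site": 50})
    print(json.dumps(manifest, indent=2))
    later = {"http://example.com/a": b"hello", "http://example.com/b": b"changed"}
    print(unchanged_fraction(manifest, later))
```

Someone repeating the crawl a year later could then report how much of their corpus still matches the original, which at least quantifies the drift between the two datasets.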

  • But we (primarily) publish datasets so that "your" algorithm can be run on "my" dataset (too). I mean, that's the most widely followed logic in papers proposing an enhancement or a completely novel method. I agree on the main point: mechanism, not dataset. My concern is the time offset: a list of domains + crawl parameters/method would produce a very different set of documents after 1+ year. However, I agree with your rationale, and it seems to be general practice in the field.
    – hardyVeles
    Commented Sep 13, 2018 at 13:02
  • On the whole "what is reproducibility" issue, there is an excellent paper by Benureau et al. with a lot of examples of what can go wrong: arxiv.org/abs/1708.08205. In their terms, reproducible means "someone else can take your code and data, and obtain the same results." I also think that this is a very useful property, because it allows me to directly compare two algorithms and make sure that you actually obtained the results you claim on your own dataset. Plus, if I find errors in your code, I can get new results on the original dataset. So, it's not a fringe concern, I think.
    – malexmave
    Commented Sep 13, 2018 at 13:19
