Task overview:
Collecting the training corpus of laptop descriptions for the 4 languages of the consortium
178 web sites about laptops provided by the consortium members
79 for English, 25 for French, 40 for Greek, 34 for Italian
330 pages about laptop descriptions collected
101 for English, 72 for French, 86 for Greek, 71 for Italian
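As a quick consistency check, the per-language figures above can be tallied against the reported totals. This is only an illustrative sketch: the dictionary names and the pages-per-site ratio are assumptions introduced here, not part of the original task.

```python
# Per-language counts as reported on the slide.
sites = {"English": 79, "French": 25, "Greek": 40, "Italian": 34}
pages = {"English": 101, "French": 72, "Greek": 86, "Italian": 71}

# Totals should match the reported 178 sites and 330 pages.
total_sites = sum(sites.values())
total_pages = sum(pages.values())

# Collected pages per provided site, by language
# (a rough yield indicator, introduced here for illustration only).
pages_per_site = {lang: pages[lang] / sites[lang] for lang in sites}

for lang, ratio in pages_per_site.items():
    print(f"{lang}: {ratio:.2f} pages per provided site")
```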
Corpus collection important parameters:
Site accessibility
Site productivity
Site accuracy
Site accessibility:
178 web sites were tested with the cc-tool
1/3 were unreachable for several reasons:
the site no longer exists (or is temporarily unreachable)
the site uses JavaScript or Flash multimedia, which stops the prototype
the site rejects softbots
Site productivity:
Of the 123 reachable sites, about 1/2 (65) produce useful pages
The others offer no laptop descriptions, only:
computer components,
computer goods,
electronic goods.
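The accessibility and productivity figures form a simple funnel, and the fractions quoted on the slides ("1/3", "1/2") can be recomputed from the raw counts. The sketch below uses only the numbers reported above; the variable names are illustrative assumptions.

```python
# Counts reported on the slides.
tested = 178       # sites tested with the cc-tool
reachable = 123    # sites the tool could actually crawl
productive = 65    # reachable sites that yielded useful pages

unreachable = tested - reachable

# Fractions behind the "1/3 unreachable" and "1/2 productive" statements.
unreachable_rate = unreachable / tested
productive_rate = productive / reachable

# Share of all provided sites that ended up being useful.
overall_yield = productive / tested

print(f"unreachable: {unreachable} sites ({unreachable_rate:.1%})")
print(f"productive among reachable: {productive_rate:.1%}")
print(f"productive among all tested: {overall_yield:.1%}")
```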
Site accuracy:
Out of 2014 pages containing more than 20 words in common with the ontology, only 330 were about product descriptions
The other pages were off-target for several reasons:
Components of laptops (hard disks, batteries)
Computer goods for laptops (desktops, suitcases, printers)
Computer goods similar to laptops (palm pilots, cell phones)
About laptops but not about product descriptions (“How to choose a laptop?”, “What is a laptop?”).
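The page filter described above (more than 20 words in common with the ontology) could be sketched as follows. This is a minimal sketch under stated assumptions, not the cc-tool's actual implementation: the function names are hypothetical, and it assumes "words in common" means distinct, case-insensitive word matches, which the slides do not specify.

```python
import re

def common_word_count(page_text, ontology_terms):
    """Count distinct ontology terms that occur as words in the page.

    Assumption: matching is case-insensitive and counts each term once.
    """
    words = set(re.findall(r"\w+", page_text.lower()))
    return len(words & {term.lower() for term in ontology_terms})

def is_candidate(page_text, ontology_terms, threshold=20):
    """Keep a page only if it shares MORE than `threshold` words with the ontology."""
    return common_word_count(page_text, ontology_terms) > threshold

# Precision of the filter as reported: 330 relevant pages out of 2014 retained.
precision = 330 / 2014
print(f"filter precision: {precision:.1%}")
```

A word-overlap filter of this kind cannot tell a laptop description from a page about laptop batteries or a buying guide, since all of them share the same vocabulary, which is consistent with the low precision reported.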
Site accessibility/productivity:
What the task has taught us:
Some sites are unreachable or require heavy work for small results
Many pages contain more than one product description (1/3)
Ranking pages needs a precise evaluation function: using NERC could be a solution (see WP2)
We should be careful when choosing the second product description