ICDC WP1 Rome


Collection of on-line laptop descriptions for the training corpus: 

CROSSMARC Project IST-2000-25366
Third meeting, Rome, November 2001

Task overview: 

Collecting the training corpus of laptop descriptions for the 4 languages of the consortium
- 178 web sites about laptops provided by the consortium members: 79 for English, 25 for French, 40 for Greek, 34 for Italian
- 330 pages of laptop descriptions collected: 101 for English, 72 for French, 86 for Greek, 71 for Italian

Corpus collection important parameters: 

- Site accessibility
- Site productivity
- Site accuracy

Site accessibility: 

178 web sites were tested with the cc-tool. About 1/3 were unreachable, for several reasons:
- the site no longer exists (or is temporarily unreachable)
- the site uses JavaScript or multimedia Flash content that stops the prototype
- the site rejects softbots
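One of the causes listed above, a site rejecting softbots, is often signalled through a robots.txt file. The slides do not say how the rejection was detected, so the following is only a sketch under that assumption. The use of `cc-tool` as a user-agent string, the example rules, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that bans every crawler from the product pages.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /products/
"""

if __name__ == "__main__":
    for url in ("http://shop.example/products/laptop1.html",
                "http://shop.example/about.html"):
        print(url, may_fetch(EXAMPLE_RULES, "cc-tool", url))
```

A check of this kind would let a collection tool record "rejects softbots" as a distinct outcome rather than a generic failure.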

Site productivity: 

Of the 123 reachable sites, about 1/2 (65) produce useful pages. The others contain no laptop descriptions, only:
- computer components
- computer goods
- electronic goods
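The site counts reported so far form a simple funnel. A short sketch of the arithmetic, using only the figures from the slides (the variable and function names are illustrative):

```python
# Site counts taken from the slides.
tested = 178       # sites provided by the consortium and tested with the cc-tool
reachable = 123    # sites reported as reachable
productive = 65    # reachable sites that produced useful pages

unreachable = tested - reachable


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(f"unreachable: {unreachable} of {tested} ({share(unreachable, tested)}%)")
    print(f"productive:  {productive} of {reachable} reachable ({share(productive, reachable)}%)")
    print(f"productive:  {productive} of {tested} tested ({share(productive, tested)}%)")
```

The fractions on the slides are therefore rounded: about 31% of the tested sites were unreachable, and about 53% of the reachable sites were productive.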

Site accuracy: 

Out of 2014 pages containing more than 20 words in common with the ontology, only 330 were product descriptions. The rest were rejected for several reasons:
- components of laptops (hard disks, batteries)
- computer goods for laptops (desktops, suitcases, printers)
- computer goods similar to laptops (palm pilots, cell phones)
- pages about laptops but not product descriptions (“How to choose a laptop?”, “What is a laptop?”)
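The selection criterion mentioned above, more than 20 words in common with the ontology, can be sketched as a vocabulary-overlap filter. This is not the cc-tool's actual implementation, which the slides do not describe. The tokenisation, the function names, and the tiny ontology vocabulary are assumptions made for illustration, and the example lowers the threshold so that it fits the toy vocabulary.

```python
import re

# Hypothetical stand-in for the ontology's vocabulary; the real one is not given.
ONTOLOGY_TERMS = {"laptop", "notebook", "processor", "ram", "battery",
                  "screen", "price", "weight", "modem", "warranty"}


def shared_terms(text: str, vocabulary: set[str]) -> set[str]:
    """Return the distinct vocabulary words that occur in text (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens & vocabulary


def is_candidate(text: str, vocabulary: set[str], threshold: int = 20) -> bool:
    """Keep a page only if it shares MORE than `threshold` words with the vocabulary."""
    return len(shared_terms(text, vocabulary)) > threshold


if __name__ == "__main__":
    product_page = "Notebook with 256 MB RAM, 14 inch screen, 2.9 kg weight, price 1500"
    buying_guide = "How to choose a laptop? Check the battery, screen, weight and price."
    for page in (product_page, buying_guide):
        print(sorted(shared_terms(page, ONTOLOGY_TERMS)),
              is_candidate(page, ONTOLOGY_TERMS, threshold=3))
```

Both toy pages pass the filter, although only the first is a product description. This mirrors the observation on the slide: only 330 of the 2014 matching pages (about 16%) were product descriptions.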

Site accessibility/productivity: 

[Slide content not captured in the transcript]

What the task has taught us: 

- Some sites are unreachable or require heavy work for small results
- Many pages (1/3) contain more than one product description
- Ranking pages requires a precise evaluation function; using NERC could be a solution (see WP2)
- We should be careful when choosing the second product description