Presses universitaires de Louvain
onixsuitesupport@onixsuite.com
20240329
eng
COM.ONIXSUITE.9782874630828
03
01
Presses universitaires de Louvain
01
SKU
76399
02
2874630829
03
9782874630828
15
9782874630828
10
BC
01
Cahiers du CENTAL
Traitement automatisé du langage
Numéro 4
01
Building and Exploring Web Corpora (WAC3 - 2007)
Proceedings of the 3rd web as corpus workshop, incorporating cleaneval
01
GCOI
29303100002220
1
B01
Cédrick Fairon
Fairon, Cédrick
Cédrick
Fairon
<p>Cédrick Fairon is a professor at the Université catholique de Louvain, where he directs the Centre de traitement automatique du langage (CENTAL).</p>
2
B01
Hubert Naets
Naets, Hubert
Hubert
Naets
3
B01
Adam Kilgarriff
Kilgarriff, Adam
Adam
Kilgarriff
4
B01
Gilles-Maurice de Schryver
de Schryver, Gilles-Maurice
Gilles-Maurice
de Schryver
1
01
eng
02
eng
182
00
182
03
LAN009000
29
2012
3147
Linguistique, Sciences du langage
93
C
01
06
01
<P><STRONG>WAC</STRONG></P><P>More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.</P><P><STRONG>CLEANEVAL</STRONG></P><P>Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.</P>
03
<P><STRONG>WAC</STRONG></P><P>More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.</P><P><STRONG>CLEANEVAL</STRONG></P><P>Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.</P>
02
WAC: More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of...
01
<P><STRONG>WAC</STRONG></P><P>More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.</P><P><STRONG>CLEANEVAL</STRONG></P><P>Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.</P>
03
<P><STRONG>WAC</STRONG></P><P>More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the talks presented at the 3rd WAC in Louvain-la-Neuve (Belgium). The focus is on the description of Web corpus collection projects, the exploration of Web data characteristics from a linguistics/NLP perspective, and on the use of crawled Web data for NLP purposes.</P><P><STRONG>CLEANEVAL</STRONG></P><P>Any use of Web data requires that it be cleaned in order to get rid of unwanted material including, for example, HTML markup, navigation bars, advertisements. To date there has been no sharing of resources or expertise in this particular domain and the cleaning has often been done minimally. Cleaneval was an exercise aimed at promoting collaboration and improving our understanding of the issues. Results and perspectives are presented in this book.</P>
02
WAC: More and more people are using Web data for linguistic and NLP research. The Web as Corpus workshop (WAC) provides a venue for exploring how we can use it effectively and the advancements to which this could lead. This book is a collection of the...
04
<p>Table of Contents, p. vii</p>
<p>Preface, p. 1</p>
<p>WAC3, p. 3</p>
<p>Kevin P. SCANNELL, <i>The Crúbadán Project: Corpus building for under-resourced languages</i>, p. 5</p>
<p>Sebastian BLOHM, Philipp CIMIANO, <i>A Human Evaluation of Filtering Functions for Pattern-based Extraction of Arbitrary Relations from the Web</i>, p. 17</p>
<p>Emmanuel CARTIER, <i>TextBox, a Written Corpus Tool for Linguistic Analysis</i>, p. 33</p>
<p>William H. FLETCHER, <i>Implementing a BNC-Compare-able Web Corpus</i>, p. 43</p>
<p>Fabrice ISSAC, <i>Yet Another Web Crawler</i>, p. 57</p>
<p>Igor LETURIA, Antton GURRUTXAGA, Iñaki ALEGRIA, Aitzol EZEIZA, <i>CorpEus, a 'web as corpus' tool designed for the agglutinative nature of Basque</i>, p. 69</p>
<p>Serge SHAROFF, <i>Classifying Web corpora into domain and genre using automatic feature identification</i>, p. 83</p>
<p>Anil Kumar SINGH, Jagadeesh GORLA, <i>Identification of Languages and Encodings in a Multilingual Document</i>, p. 95</p>
<p>CLEANEVAL, p. 109</p>
<p>Daniel BAUER, Judith DEGEN, Xiaoye DENG, Priska HERGER, Jan GASTHAUS, Eugenie GIESBRECHT, Lina JANSEN, Christin KALINA, Thorben KRÜGER, Robert MÄRTIN, Martin SCHMIDT, Simon SCHOLLER, Johannes STEGER, Egon STEMLE, Stefan EVERT, <i>FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück</i>, p. 111</p>
<p>Stefan EVERT, <i>StupidOS: A high-precision approach to boilerplate removal</i>, p. 123</p>
<p>Weizheng GAO, Tony ABOU-ASSALEH, <i>GenieKnows Web Page Cleaning System</i>, p. 135</p>
<p>Christian GIRARDI, <i>Htmcleaner: Extracting the Relevant Text from the Web Pages</i>, p. 141</p>
<p>Katja HOFMANN, Wouter WEERKAMP, <i>Web Corpus Cleaning using Content and Structure</i>, p. 145</p>
<p>Michal MAREK, Pavel PECINA, Miroslav SPOUSTA, <i>Web Page Cleaning with Conditional Random Fields</i>, p. 155</p>
<p>Xabier SARALEGI, Igor LETURIA, <i>Kimatu, a tool for cleaning non-content text parts from HTML docs</i>, p. 163</p>
43
The series is published by the Centre de traitement automatique du langage. Its main aims are to help disseminate work in linguistics and computational linguistics and to promote these disciplines.
44
<p>The Cahiers du CENTAL series is published by the Centre de traitement automatique du langage. Its main aims are to help disseminate work in linguistics and computational linguistics and to promote these disciplines.</p>
99
BE
07
03
01
https://pul.uclouvain.be/resources/titles/29303100002220/images/f5e536083a438cec5b64a4954abc17f1/THUMBNAIL/9782874630828.jpg
20160216
02
https://pul.uclouvain.be/book/?GCOI=29303100002220
06
3052405007518
Presses universitaires de Louvain
01
06
3052405007518
Presses universitaires de Louvain
Louvain-la-Neuve
BE
04
20070101
2007
01
WORLD
01
9.45
in
02
6.30
in
03
0.41
in
08
10.69
oz
01
24
cm
02
16
cm
03
1.05
cm
08
303
gr
27
03
9782874635045
002
PDF
06
3012405004818
CIACO - DUC
03
WORLD 01 2
20
1
02
00
02
02
STD
02
19.70
EUR
BE
R
6.00
18.58
1.12
06
3019000200508
Librairie Wallonie-Bruxelles
33
www.librairiewb.com/
http://www.librairiewb.com/
02
FR 01 2
20
1
04
00
02
02
STD
02
19.70
EUR
R
5.50
18.67
1.03