This document summarizes a project exploring challenges in mining historical text. The project aims to help historians discover patterns and explore hypotheses about global commodity trading from 1850-1914 using archival text. Key challenges discussed include preprocessing historical texts, due to issues such as low-quality OCR, texts in different languages, and extracting information from tables. Improvements to OCR reduced word errors, language identification was added (most texts were in English), and a feasibility study for a table-mining algorithm is planned.
1. Exploring Challenges in Mining Historical Text
Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein
Working with text: Tools, techniques and approaches for text mining
Edinburgh - 07/07/2012
2. Overview
‣ Project
‣ Data
‣ Preprocessing historical text
‣ Improvements to OCR
‣ Language identification
‣ Text mining tables
‣ Text-mining
‣ Improved commodity identification
‣ Ports-based geo-grounding
‣ Relation extraction
3. Project (01/2012-12/2014)
‣ Funded by Digging into Data (round 2)
‣ Partners
Ewan Klein, Claire Grover, Bea Alex (text mining)
Colin Coates, Jim Clifford (historical analysis)
James Reid (data integration)
Aaron Quigley, Uta Hinrichs (information
visualisation)
4. Trading Consequences
‣ What does archival text say about the
economic and environmental consequences of
global commodity trading during the
nineteenth century?
‣ Help historians to discover novel patterns and
explore new hypotheses.
‣ Example questions:
‣ What were the routes and volumes of international
trade in resource commodities 1850-1914?
‣ What were the local environmental consequences of
this demand for these resources?
5. Geolocating Cinchona
6. Trading Consequences
‣ Scope: global but with focus on Canadian
natural resource flows to test reliability and
efficacy of our methods
‣ Methods:
‣ Text mining and geo-parsing to transform the text
into structured data, e.g. relational database
‣ Query interface targeted at historians
‣ Information visualisation for interactive exploration
7. Historical Data
‣ Digitised sources from the 19th century
British Empire, currently processing
‣ Early Canadiana Online: 83,038 files
‣ JSTOR data: 1,000 XML files
‣ House of Commons Parliamentary Papers: 4,135
files
‣ Books: selected books on nineteenth century trade
‣ Further sources:
‣ ProQuest data
‣ Encyclopaedia Britannica, Jstor Plants, Forestry
Journals?, The Botanist?
8. Processing Historical Data
‣ Challenges so far:
‣ Different formats
‣ Low-quality OCRed text
‣ Old/low-quality prints, quality of OCR
technology
‣ Historical English: historical word variants,
ſ (long s) characters mixed up with f by OCR
‣ Artefacts in original documents: headers/footers,
page numbers, notes in margins, end-of-line
hyphenation
‣ Text in different languages
‣ Information in tables
10. Improvements to OCR
‣ Normalisation and post-correction
‣ Fixed end-of-line hyphenation
‣ De-hyphenate all token-splitting hyphens using a
dictionary-based approach (dictionary is the system
dictionary + the text of the current document)
‣ Added f-to-s conversion
‣ Convert all false f characters to s using a corpus-
based approach (corpus is a collection of historical
documents from the Gutenberg Project)
‣ Example: reduced number of words
unrecognised by spell checker from 61 to 21 -> approx. 67% improvement
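The dictionary-based de-hyphenation step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the small word set stands in for the system dictionary plus the current document's vocabulary.

```python
import re

# Stand-in for the system dictionary + the text of the current document.
dictionary = {"colonists", "commodity", "timber"}

def dehyphenate(text, dictionary):
    """Join an end-of-line hyphenation only when the merged word is known."""
    def join(match):
        merged = match.group(1) + match.group(2)
        if merged.lower() in dictionary:
            return merged
        # Keep the hyphen for genuine compounds split over a line break.
        return match.group(1) + "-" + match.group(2)
    return re.sub(r"(\w+)-\n(\w+)", join, text)

print(dehyphenate("the French colo-\nnists traded timber", dictionary))
# -> "the French colonists traded timber"
```

A compound such as "well-\nknown" would keep its hyphen, since "wellknown" is not in the dictionary.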
11. Improvements to OCR
12. Improvements to OCR
13. Improvements to OCR
‣ Extensive evaluation of both tools against
human corrected/normalised gold standard
‣ Reduce word error rate by 12.5% in a random
Canadiana sample (word acc: 0.776 -> 0.804)
‣ Improvements have an effect on later text
mining steps and would also be beneficial for
searching text in any IR system (e.g. Jstor
database search for “French colonifts”)
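The reported figures are consistent with each other: a word accuracy improvement from 0.776 to 0.804 corresponds to the stated 12.5% relative reduction in word error rate.

```python
# Word error rate is 1 - word accuracy; the relative reduction follows.
wer_before = 1 - 0.776   # 0.224
wer_after  = 1 - 0.804   # 0.196
relative_reduction = (wer_before - wer_after) / wer_before
print(round(relative_reduction * 100, 1))  # -> 12.5
```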
14. Language Identification
‣ Most sources do not contain language information like Canadiana does
‣ The table displays the number of text elements in Canadiana per language, ignoring notes and titles

ISO Code  Language         Frequency
eng       English          2,677,498
fra       French           1,208,811
deu       German               2,886
chn       Chinook jargon       2,488
moh       Mohawk               1,547
oji       Ojibwa               1,395
emg       Eastern Meohang        835
enb       Markweeta              666
cre       Cree                   501
iro       Iroquoian              324
alg       Algonquian             210
nge       Ngemba                 157
nld       Dutch                  131
lat       Latin                  119
mic       Micmac                  61
gla       Scottish Gaelic         22
15. Language Identification
‣ Make use of automatic language
identification using TextCat, especially for the
JSTOR data which is also multi-lingual.
‣ LID is done for each paragraph and for the
entire document by taking the most frequent
language tag assigned.
‣ Can limit processing to English (and French)
documents only.
‣ 740 English documents (out of 1,000)
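The per-paragraph tagging with a document-level majority vote can be sketched as below. The `identify` function is a toy stand-in for the real classifier (TextCat, which uses character n-gram profiles); only the voting logic reflects the approach described above.

```python
from collections import Counter

def identify(paragraph):
    # Toy stand-in for TextCat: flags obvious French function words.
    return "fra" if " le " in f" {paragraph} " else "eng"

def document_language(paragraphs):
    tags = [identify(p) for p in paragraphs]
    # The document tag is the most frequent paragraph-level tag.
    return Counter(tags).most_common(1)[0][0], tags

doc_lang, para_langs = document_language([
    "the trade in timber grew rapidly",
    "le commerce du bois",
    "exports of wheat and timber",
])
print(doc_lang)  # -> "eng"
```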
16. Text Mining Tables
17. Text Mining Tables
‣ Tables contain a lot of relevant information
but are difficult to mine.
‣ HCPP documents contain coordinates for
each table entry.
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
...
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
‣ Planning to do a feasibility study for a table
mining algorithm.
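A first step such a feasibility study might take is to use the per-word coordinates to recover table rows. The sketch below parses `<w p="left,top,right,bottom">` elements like those shown above and groups words into rows by vertical position; the tolerance value and the grouping heuristic are assumptions, not the project's algorithm.

```python
import xml.etree.ElementTree as ET

xml = '''<row>
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="961,1892,1087,1921" v="n">Culcutta</w>
</row>'''

words = []
for w in ET.fromstring(xml):
    x1, y1, x2, y2 = map(int, w.get("p").split(","))  # left, top, right, bottom
    words.append((y1, x1, w.text))

# Group words whose top edges fall within a small vertical tolerance into rows,
# then order the cells of each row left to right.
rows, tolerance = [], 10
for y1, x1, text in sorted(words):
    if rows and abs(rows[-1][0] - y1) <= tolerance:
        rows[-1][1].append((x1, text))
    else:
        rows.append((y1, [(x1, text)]))

table = [[t for _, t in sorted(cells)] for _, cells in rows]
print(table)  # -> [['Rio', 'Janeiro', 'Wood', '338'], ['Culcutta']]
```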
18. Text Mining Pipeline
‣ Steps after the OCR improvements and LID:
‣ Tokenisation
‣ Part-of-speech tagging
‣ Lemmatisation
‣ Wordnet lookup to find commodities
‣ Named-entity recognition including commodity
lexicon lookup
‣ Port-based Geo-grounding
‣ Chunking
20. Commodities Identification
‣ WordNet lookup using an approximation of
commodity named entities:
‣ Noun phrases with hypernyms such as substance,
physical matter, plant or animal in WordNet.
‣ Each NP which leads to a match is assigned a
wn=”true” attribute.
‣ Commodities gazetteer lookup using a list of
commodities derived by historians.
‣ Strings matching the entries in the gazetteer are
assigned a commlex=”true” attribute.
‣ Words/phrases with wn=”true” and
commlex=”true” are good candidates.
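Combining the two evidence sources can be sketched as below. The WordNet hypernym lookup is stubbed with a toy dict, and the contents of both resources are illustrative, not the project's real WordNet or historian-derived gazetteer.

```python
# Hypernyms that approximate commodity-hood, per the slide.
COMMODITY_HYPERNYMS = {"substance", "physical matter", "plant", "animal"}

# Toy stand-in for a WordNet hypernym-chain lookup.
TOY_WORDNET = {"cotton": ["plant", "fibre"], "whisky": ["substance"],
               "notice": ["communication"]}

# Toy stand-in for the historian-derived commodities gazetteer.
GAZETTEER = {"cotton", "cinchona", "whisky"}

def annotate(np):
    attrs = {}
    if any(h in COMMODITY_HYPERNYMS for h in TOY_WORDNET.get(np, [])):
        attrs["wn"] = "true"
    if np in GAZETTEER:
        attrs["commlex"] = "true"
    # NPs with both attributes are the strongest commodity candidates.
    attrs["candidate"] = (attrs.get("wn") == "true"
                          and attrs.get("commlex") == "true")
    return attrs
```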
21. Ports-based Geo-grounding
‣ Started with non-optimised geo-resolution.
‣ Incorporated the list of ports.
‣ Locations are assigned an is_port="1" or an
is_port="0" attribute.
‣ Grounding now ignores non-port candidates in case
of ambiguous location mentions.
‣ is_port locations are also given a higher weight in
the scoring.
‣ Hypothesis: ports are more likely to be
significant locations in historic documents
about trade.
‣ Not tested yet, as gold-standard data is needed.
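A sketch of the port-based re-ranking under these assumptions: non-port candidates are dropped whenever a port reading exists, and a hypothetical PORT_BOOST weight plus a toy population score otherwise rank the pool (the real resolver's scoring is not shown on the slides).

```python
PORT_BOOST = 2.0   # illustrative weight for is_port candidates

def resolve(candidates):
    """Pick one gazetteer entry for an ambiguous location mention.
    candidates: list of dicts with 'gazref', 'pop_size', 'is_port'."""
    ports = [c for c in candidates if c["is_port"]]
    pool = ports if ports else candidates   # ignore non-ports when a port exists
    def score(c):
        s = 1.0 + c["pop_size"] / 1_000_000
        return s * (PORT_BOOST if c["is_port"] else 1.0)
    return max(pool, key=score)

# The two GeoNames readings of "Dalhousie" from the example slide:
dalhousie = [
    {"gazref": "geonames:1273648", "pop_size": 7601, "is_port": False},  # IN
    {"gazref": "geonames:6943599", "pop_size": 0,    "is_port": True},   # CA port
]
```

With these candidates, `resolve(dalhousie)` returns the Canadian port entry, matching the ports-dependent resolver's output.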
22. Ports-based Geo-grounding
‣ Example:
Dalhousie is in the list of ports as:
DALHOUSIE -66.4 48.1
Geo-grounding in non-optimised resolver:
<ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN" gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
<parts>
<part ew="w136" sw="w136">Dalhousie</part>
</parts>
</ent>
Geo-grounding in ports-dependent resolver:
<ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA" gazref="geonames:6943599" feat-type="ppl" pop-size="0">
<parts>
<part ew="w97" sw="w97">Dalhousie</part>
</parts>
</ent>
23. Ports-based Geo-grounding
‣ Geo-grounding assumes that each text is a
coherent whole. All locations contribute to
the resolution of all others. May have to
change that.
‣ Segmentation (e.g. of books) into smaller
units might improve the resolution.
‣ Need to consider old spellings of place
names.
24. Relation Extraction
‣ Crude way to identify commodity-location
relations:
‣ Sentences (s) containing words (w) with the
commlex="true" and wn="true" and a location.
Good: The quantity of raw cotton imported annually into the United Kingdom—take for
example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States
supplied 722,154,101 lbs.
Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the
Cordillera, which produces more sulphate than the common cinchona; and as the cinchona
grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in
the lands of Gualaquiza and Canelos.
Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old
whisky is sold there. OR
This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that
way.
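The crude heuristic amounts to pairing every commodity candidate with every location in the same sentence; a sketch over simplified token dicts follows (the attribute names match the XML, the dict shape is an assumption).

```python
def cooccurrence_relations(sentences):
    """Pair each commodity candidate (wn="true" and commlex="true")
    with each location mentioned in the same sentence."""
    relations = []
    for sent in sentences:
        commodities = [t["text"] for t in sent
                       if t.get("wn") == "true" and t.get("commlex") == "true"]
        locations = [t["text"] for t in sent if t.get("type") == "location"]
        relations += [(c, l) for c in commodities for l in locations]
    return relations
```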
26. Relation Extraction
‣ Need to improve the relation extraction.
‣ Will look at pattern-based relation extraction
exploiting vocabulary like "import", "export",
"ship", "shipment", "trade", “manufacture”,
“grow” etc.
‣ Will annotate a small test corpus for
evaluation.
‣ Need to distinguish between irrelevant or
false commodity-location relations and
commodity-location relations referring to
trade.
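The planned trigger-word filter might start as a simple lexical check like the sketch below; the stems follow the vocabulary listed above, while the regex details and windowing are assumptions.

```python
import re

# Keep a commodity-location pair only if its sentence also contains a
# trade-related trigger; stems cover inflections (imported, shipments, grown).
TRIGGERS = re.compile(
    r"\b(import|export|ship|shipment|trade|manufactur|grow|grew)\w*\b", re.I)

def is_trade_relation(sentence):
    return bool(TRIGGERS.search(sentence))
```

On the earlier examples, this keeps the "Good" sentence ("imported") and rejects the refreshment-room "Bad" one.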
27. Thank You
‣ Questions?
28. Example Input
‣ Different sources converted into common
XML format