1. The document introduces an English word similarity calculation method based on WordNet. It extracts synonyms from WordNet and uses a vector-based approach to calculate word similarity.
2. The vector includes three types of information from WordNet: synonyms, word classes, and sense explanations.
3. Experiments show this is a feasible way to quantitatively study the semantic relationships between English words and provide methods and insights for information processing and machine translation.
English Word Similarity Calculation Based on WordNet
Yan Wei, Xun Endong
Language Information Processing Center, Beijing Language and Culture University, Beijing 100083
E-mail: {yanwei,edxun}@blcu.edu.cn
Abstract: This paper presents a method for computing English word similarity based on WordNet. The algorithm extracts synonyms from WordNet and uses a vector space method to calculate the similarity between English words. The vector combines three kinds of information: (1) the WordNet synonym set (Synset), (2) class information (Class), and (3) the sense explanation. Experimental results show that this is a feasible way to calculate the similarity of English words.
Keywords: WordNet, word similarity
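1 Introduction
With the development of natural language processing technology, semantic research, and lexical semantics in particular, has become a hot and frontier topic in the field. This paper presents an implementation of English word similarity calculation based on WordNet, in the hope of supporting quantitative study of the semantic relations between English words. We also hope it can offer methods and a point of reference for Chinese information processing and bilingual translation.
Methods for computing the distance between words fall broadly into two classes. The first computes it from some body of world knowledge (an ontology), chiefly a semantic dictionary organized by the hierarchical relations between concepts; word similarity is derived from the hypernym-hyponym and coordinate relations between concepts in such a linguistic resource. Many researchers have done a great deal of work of this kind based on WordNet. The second class gathers statistics from large-scale corpora; these statistical methods mainly take the probability distribution of context information as the reference for lexical semantic similarity. The first class rests on the premise that two words are semantically related if and only if a path connects them in the hierarchical network of concepts …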
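3.2.1 Feature extraction
Using the interface functions provided by WordNet, we draw candidate synonyms from three sets, namely the WordNet synonym set (Synset), the class words (Class word), and the sense explanation, and then perform feature extraction to compute feature(SW):
feature(SW) = {{Ws}, {Wc}, {We}}
{Ws}: all synonyms of sense W in WordNet;
{Wc}: all classes related to sense W;
{We}: all content words in the explanation of sense W.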
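The three feature sets can be pictured with a small sketch. It does not call WordNet; the synonyms and gloss are copied from the first noun sense of "doctor" quoted later in this paper, while the class words, the stop-word list, and the names `SenseFeatures` and `extract_features` are hypothetical choices made only for illustration (the paper does not say how content words are selected).

```python
from dataclasses import dataclass

# Hypothetical stop-word list used to keep only content words of a gloss.
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "with", "by"}


@dataclass(frozen=True)
class SenseFeatures:
    synonyms: frozenset     # {Ws}: synonyms of the sense
    classes: frozenset      # {Wc}: class words related to the sense
    gloss_words: frozenset  # {We}: content words of the sense explanation


def extract_features(synonyms, classes, gloss):
    """Build feature(SW) = {{Ws}, {Wc}, {We}} for a single sense."""
    tokens = (t.strip(".,;()\"'").lower() for t in gloss.split())
    content = frozenset(t for t in tokens if t and t not in STOP_WORDS)
    return SenseFeatures(frozenset(synonyms), frozenset(classes), content)


# First noun sense of "doctor"; the class words are assumed for illustration.
doctor_n1 = extract_features(
    synonyms=["doc", "physician", "MD", "Dr.", "medico"],
    classes=["medical practitioner", "medical man"],
    gloss="a licensed medical practitioner",
)
print(sorted(doctor_n1.gloss_words))  # ['licensed', 'medical', 'practitioner']
```

Keeping the three sets separate, instead of merging them into one bag of words, is what lets each set carry its own weight in the similarity calculation.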
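3.2.2 Calculating sense similarity and word similarity
Given the description of lexical semantic features above, the similarity between two senses can be obtained by computing their distance in the three sense feature spaces: the smaller the distance, the greater the similarity. From the sense similarity, the similarity between two words in WordNet follows easily.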
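• Sense similarity
$$
\mathrm{Similarity}(SW_i, SW_j) = \frac{1}{No(SW_i) \times No(SW_j)} \times \frac{\sum\limits_{w_i \in \{Ws_i\} \cap \{Ws_j\}} Ks \times IDF(w_i)^2 + \sum\limits_{w_i \in \{Wc_i\} \cap \{Wc_j\}} Kc \times IDF(w_i)^2 + \sum\limits_{w_i \in \{We_i\} \cap \{We_j\}} Ke \times IDF(w_i)^2}{\sum\limits_{i \in Q_U,\ K \in \{Ks, Kc, Ke\}} K \times IDF(w_i)^2 \times \sum\limits_{j \in Q_V,\ K \in \{Ks, Kc, Ke\}} K \times IDF(w_j)^2}
$$
where:
No(SW): the rank of the sense of W, e.g. the first sense = 1, the second sense = 2, ...
IDF(w_i): the reciprocal of the number of documents in which w_i occurs among those used to build WordNet, trained from WordNet;
Ks = 1.5: the weight of the synonym feature;
Kc = 1: the weight of the class feature;
Ke = 0.5: the weight of the sense explanation;
Q_U: the index set over which w_i occurs;
Q_V: the index set over which w_j occurs.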
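As a rough illustration, the sense similarity can be sketched in Python. The sketch takes the formula literally as it is printed here: a weighted overlap divided by the product of the two weighted sums and by the product of the sense ranks. Whether the denominator sums carry square roots, as in a cosine measure, cannot be recovered from the text, so that reading is not implemented. The feature dictionaries, the default `idf` of 1.0 for every word, and all function names are assumptions made for the example.

```python
# Feature weights Ks, Kc, Ke as given in the paper.
WEIGHTS = {"syn": 1.5, "cls": 1.0, "exp": 0.5}


def weighted_mass(features, idf):
    """Sum of K * IDF(w)^2 over every word of every feature set of one sense."""
    return sum(
        k * sum(idf(w) ** 2 for w in features[name])
        for name, k in WEIGHTS.items()
    )


def sense_similarity(feat_i, rank_i, feat_j, rank_j, idf=lambda w: 1.0):
    """Similarity of two senses; rank_i and rank_j play the role of No(SW)."""
    overlap = sum(
        k * sum(idf(w) ** 2 for w in feat_i[name] & feat_j[name])
        for name, k in WEIGHTS.items()
    )
    denom = weighted_mass(feat_i, idf) * weighted_mass(feat_j, idf)
    if denom == 0:
        return 0.0  # a sense without any features shares nothing
    return overlap / denom / (rank_i * rank_j)


# Toy senses with placeholder words (not WordNet data).
sense_a = {"syn": {"a", "b"}, "cls": {"c"}, "exp": {"d", "e"}}
sense_b = {"syn": {"a", "b"}, "cls": {"c"}, "exp": {"d", "f"}}

# overlap = 1.5*2 + 1.0*1 + 0.5*1 = 4.5; each weighted sum = 5.0
print(round(sense_similarity(sense_a, 1, sense_b, 1), 4))  # 0.18
print(round(sense_similarity(sense_a, 1, sense_b, 2), 4))  # 0.09
```

Dividing by the sense ranks means that a match between two first senses counts more than the same match between rarer, lower-ranked senses.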
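• Word similarity
$$
\mathrm{Similarity}(W_1, W_2) = \frac{\sum\limits_{i \in \{1,\dots,|SW_1|\}} \max\limits_{j \in \{1,\dots,|SW_2|\}} \mathrm{Similarity}(SW_{1i}, SW_{2j}) + \sum\limits_{i \in \{1,\dots,|SW_2|\}} \max\limits_{j \in \{1,\dots,|SW_1|\}} \mathrm{Similarity}(SW_{2i}, SW_{1j})}{|SW_1| + |SW_2|}
$$
where:
|SW1|: the number of senses of W1;
|SW2|: the number of senses of W2.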
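The word-level formula averages, over all senses of both words, each sense's best match among the senses of the other word. A minimal sketch follows; the sense identifiers, the lookup table of sense similarities, and the function names are hypothetical, and any sense-level similarity function could be passed in instead.

```python
def word_similarity(senses_1, senses_2, sense_sim):
    """Average best-match sense similarity, taken in both directions."""
    if not senses_1 or not senses_2:
        return 0.0
    forward = sum(max(sense_sim(s1, s2) for s2 in senses_2) for s1 in senses_1)
    backward = sum(max(sense_sim(s2, s1) for s1 in senses_1) for s2 in senses_2)
    return (forward + backward) / (len(senses_1) + len(senses_2))


# Hypothetical sense similarities: W1 has two senses, W2 has one.
TABLE = {("a1", "b1"): 0.4, ("a2", "b1"): 0.1}


def lookup(x, y):
    """Symmetric lookup into the toy table; unknown pairs score 0."""
    return TABLE.get((x, y), TABLE.get((y, x), 0.0))


# forward = 0.4 + 0.1, backward = 0.4, divided by 2 + 1 senses
print(round(word_similarity(["a1", "a2"], ["b1"], lookup), 4))  # 0.3
```

Because every sense contributes to the average, a word with many unrelated senses scores lower than a word whose senses all match well.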
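4 Experimental results and analysis
We evaluated the experimental results manually, mainly by comparing the sequence ranked by computed semantic similarity with a manually produced ranking. The comparison shows that the results of the method above are largely consistent with the human ranking by semantic similarity. In follow-up work we plan to make similarity-based retrieval a component of an information retrieval system and to examine concretely how much word similarity calculation contributes to our work.
In semantic similarity calculation, the specific similarity value between a single word and the head word is not important; it is merely a number with statistical meaning. What matters is that the words can be compared with one another by their similarity to the head word and arranged into a sequence from high to low similarity. Our goal is to develop a practical information retrieval system, and semantic similarity has real practical value for improving both precision and recall.
For example, suppose a user wants to retrieve articles containing "sanctity" but the word does not occur in our documents. Based on the similarity results (see Table 1), we can go through the sequence of similar words from the highest similarity to the lowest, retrieve the relevant documents, and return them to the user.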
Table 1: Similar-word sequence for sanctity (partial)
Head word   Similar word         Similarity
sanctity    holiness             1.000
sanctity    sacredness           0.230
sanctity    expressiveness       0.199
sanctity    insolubility         0.199
sanctity    counterfactuality    0.194
sanctity    constructiveness     0.189
sanctity    unpopularity         0.185
sanctity    unholiness           0.169
sanctity    humanness            0.161
sanctity    parental quality     0.161
sanctity    particularity        0.161
sanctity    inaccuracy           0.158
sanctity    ethnicity            0.155
sanctity    measurability        0.154
sanctity    quantifiability      0.154
sanctity    destructiveness      0.151
sanctity    nativeness           0.150
sanctity    simpleness           0.149
sanctity    wholesomeness        0.148
sanctity    unlawfulness         0.148
sanctity    incredibility        0.148
sanctity    incredibleness       0.148
sanctity    worldliness          0.146
sanctity    factuality           0.141
sanctity    factualness          0.141
sanctity    popularity           0.140
sanctity    lawfulness           0.139
sanctity    unsatisfactoriness   0.139
sanctity    finitude             0.136
sanctity    boundedness          0.136
sanctity    finiteness           0.136
sanctity    satisfactoriness     0.135
sanctity    ordinariness         0.126
sanctity    negativism           0.121
……          …………                 ……
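The retrieval fallback described before the table can be sketched as follows. The three similarity values are taken from the table for "sanctity"; the inverted index, the document identifiers, and the function name `retrieve` are hypothetical.

```python
# Top of the similar-word ranking for "sanctity", from the table.
SIMILAR = {
    "sanctity": [
        ("holiness", 1.000),
        ("sacredness", 0.230),
        ("expressiveness", 0.199),
    ],
}


def retrieve(term, index, similar=SIMILAR):
    """Return (matched term, documents); fall back to similar words in rank order."""
    if term in index:
        return term, index[term]
    ranked = sorted(similar.get(term, []), key=lambda pair: pair[1], reverse=True)
    for candidate, _score in ranked:
        if candidate in index:
            return candidate, index[candidate]
    return None, []


# Hypothetical inverted index in which "sanctity" and "holiness" never occur.
index = {"sacredness": ["doc-7"], "expressiveness": ["doc-2"]}
print(retrieve("sanctity", index))  # ('sacredness', ['doc-7'])
```

Only the order of the candidates matters here, which matches the point made above that the ranking is more important than the individual similarity values.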
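Our method yields, for a given head word, a sequence of words ordered from high to low similarity, and it can also separate the similar-word sequences for the different parts of speech of the same word. This is valuable in English information retrieval. If the user's query is a sentence, we can retrieve or expand the query according to the part of speech of each word. For example, "doctor" is both a noun and a verb in WordNet, so we can choose the appropriate similar-word sequence for retrieval or query expansion according to the part of speech.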
The noun "doctor":
1. doctor, doc, physician, MD, Dr., medico -- (a licensed medical practitioner; "I felt so bad I went to see my doctor")
2. Doctor of the Church, Doctor -- ((Roman Catholic Church) a title conferred on 33 saints who distinguished themselves through the orthodoxy of their theological teaching; "the Doctors of the Church greatly influenced Christian thought down to the late Middle Ages")
3. doctor -- (children take the roles of doctor or patient or nurse and pretend they are at the doctor's office; "the children explored each other's bodies by playing the game of doctor")
4. doctor, Dr. -- (a person who holds Ph.D. degree from an academic institution; "she is a doctor of philosophy in physics")
Table 2: Similar-word sequence for doctor (noun) (partial)
Head word   Similar word   Similarity   Pattern
doctor doc 0.400 <N N>
doctor physician 0.400 <N N>
doctor medical practitioner 0.270 <N N>
doctor medical man 0.270 <N N>
doctor Doctor 0.250 <N N>
doctor health professional 0.223 <N N>
doctor health care provider 0.223 <N N>
doctor medical specialist 0.209 <N N>
doctor caregiver 0.186 <N N>
doctor professional person 0.150 <N N>
doctor professional 0.107 <N N>
doctor grownup 0.099 <N N>
doctor adult 0.083 <N N>
doctor sawbones 0.081 <N N>
doctor operating surgeon 0.081 <N N>
doctor surgeon 0.081 <N N>
doctor psychoanalyst 0.077 <N N>
doctor alienist 0.077 <N N>
doctor horse doctor 0.076 <N N>
doctor pathologist 0.075 <N N>
doctor diagnostician 0.075 <N N>
doctor brain doctor 0.074 <N N>
doctor neurologist 0.074 <N N>
…… ………… …… ……
("Pattern" gives the parts of speech of the head word and the similar word.)
The verb "doctor":
1. sophisticate, doctor, doctor up -- (alter and make impure, as with the intention to deceive; "Sophisticate rose water with geraniol") => adulterate, stretch, dilute, debase -- (corrupt, debase, or make impure by adding a foreign or inferior substance; often by replacing valuable ingredients with inferior ones; "adulterate liquor")
2. doctor -- (give medical treatment to) => treat, care for -- (provide treatment for; "The doctor treated my broken leg"; "The nurses cared for the bomb victims"; "The patient must be treated right away or she will die"; "Treat the infection with antibiotics")
3. repair, mend, fix, bushel, doctor, furbish up, restore, touch on -- (restore by replacing a part or putting together what is torn or broken; "She repaired her TV set"; "Repair my shoes please") => better, improve, amend, ameliorate, meliorate -- (to make better; "The editor improved the manuscript with his changes")
Table 3: Similar-word sequence for doctor (verb) (partial)
Head word   Similar word   Similarity   Pattern
doctor doctor up 0.500 <V V>
doctor adulterate 0.400 <V V>
doctor Doctor 0.333 <V V>
doctor adulterate 0.065 <V A>
doctor sophisticate 0.047 <V V>
doctor furbish up 0.040 <V V>
doctor bushel 0.040 <V V>
doctor repair 0.040 <V V>
doctor sophisticate 0.040 <V N>
doctor mend 0.032 <V V>
doctor Dr. 0.032 <V N>
doctor darn 0.031 <V V>
doctor trouble-shoot 0.031 <V V>
doctor sole 0.028 <V V>
doctor reheel 0.028 <V V>
doctor repoint 0.028 <V V>
doctor resole 0.028 <V V>
doctor revamp 0.027 <V V>
doctor patch up 0.023 <V V>
doctor restore 0.017 <V V>
doctor fix 0.012 <V V>
…… ………… …… ……
("Pattern" gives the parts of speech of the head word and the similar word.)
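The part-of-speech pattern attached to each row makes it possible to pick expansion terms that match the role of a query word. The sketch below uses a few rows copied from the verb table for "doctor"; the function name `expansions`, the `top_n` cut-off, and the idea of filtering on an exact pattern string are assumptions for illustration.

```python
# (similar word, similarity, pattern) rows copied from the verb table.
ROWS = [
    ("doctor up", 0.500, "V V"),
    ("adulterate", 0.400, "V V"),
    ("adulterate", 0.065, "V A"),
    ("sophisticate", 0.047, "V V"),
    ("sophisticate", 0.040, "V N"),
    ("Dr.", 0.032, "V N"),
]


def expansions(rows, pattern, top_n=3):
    """Return the top_n similar words whose pattern matches, best first."""
    matching = [(word, score) for word, score, pat in rows if pat == pattern]
    matching.sort(key=lambda row: row[1], reverse=True)
    return [word for word, _score in matching[:top_n]]


print(expansions(ROWS, "V V"))  # ['doctor up', 'adulterate', 'sophisticate']
print(expansions(ROWS, "V N"))  # ['sophisticate', 'Dr.']
```

A query in which "doctor" is tagged as a verb would then be expanded with verb candidates only, leaving noun senses such as "physician" out of the expansion.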
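5 Conclusion
For natural language processing, the first task facing semantic analysis is to quantify the semantic relations between words, that is, to choose suitable methods and models to describe them. As a preliminary study, mapping the various relations between words to a single value that expresses their degree of similarity, and using that value to produce a similarity ranking, simplifies lexical semantic relations that are complex and hard to grasp, and offers an entry point for studying them. Chinese information processing research can borrow many useful methods and techniques from English natural language processing. Chinese semantic analysis ultimately depends on building a large lexical concept network and on injecting linguistic knowledge. No such network yet exists in the field of Chinese semantic analysis. When designing one, we should take into account practical issues such as the application domain and the granularity of processing, so that the design and the investment of labor and resources are worthwhile.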