Asking for help with a few questions about backing up a wiki
-
1. I want to create a local backup of just a small part of a certain wiki site on my computer, but I went through this manual,
https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki/zh#
and couldn't find any way to back up only a part of it. Is this something that can be done simply by writing a specific script? What would I need to learn?
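For what it's worth, here is a minimal sketch of what such a script might look like, assuming the target wiki exposes the standard MediaWiki api.php endpoint and that exporting the pages of one category as XML is what you want; the API URL and category name below are placeholders:

```python
import requests

API_URL = "https://example.org/w/api.php"   # placeholder: the wiki's standard api.php endpoint
CATEGORY = "Category:Some_topic"            # placeholder: the subset you want to back up

session = requests.Session()

# Step 1: list the pages in the category (paginated via the API's 'continue' mechanism).
titles = []
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": CATEGORY,
    "cmlimit": "500",
    "format": "json",
}
while True:
    data = session.get(API_URL, params=params).json()
    titles += [m["title"] for m in data["query"]["categorymembers"]]
    if "continue" not in data:
        break
    params.update(data["continue"])

# Step 2: export those pages as the same XML that Special:Export / dumpBackup.php produce,
# in batches of 50 titles per request.
for i, start in enumerate(range(0, len(titles), 50)):
    chunk = titles[start:start + 50]
    resp = session.get(API_URL, params={
        "action": "query",
        "titles": "|".join(chunk),
        "export": 1,
        "exportnowrap": 1,
    })
    with open(f"partial_backup_{i:03d}.xml", "w", encoding="utf-8") as f:
        f.write(resp.text)
```

The exported XML can later be re-imported with Special:Import or importDump.php, so the things to learn are essentially the MediaWiki Action API (Special:Export, list=categorymembers) plus a little Python or any language with an HTTP client.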
2. As I understand it, the backup file is stored in database form, so it can't be browsed locally the way you browse the site online in a browser. Is there a way to browse it locally just like in a browser? I searched around but couldn't find a solution. Wikipedia has dedicated software for this, but presumably it can't be used on other wiki sites?
After searching a bit more, Kiwix looks like a piece of software that fits the bill; I'll read up on it.
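On the "browse it locally like in a browser" point, one option besides the Kiwix desktop app is kiwix-serve from kiwix-tools, which serves a ZIM file over local HTTP so any browser can read it offline. A small sketch, assuming kiwix-serve is installed and on PATH and that the backup is named wiki.zim (both are assumptions):

```python
import subprocess
import webbrowser

ZIM_FILE = "wiki.zim"   # assumed filename of the backup
PORT = 8080

# kiwix-serve exposes the ZIM's pages over plain HTTP on localhost,
# so the backup can be browsed in any ordinary web browser, fully offline.
proc = subprocess.Popen(["kiwix-serve", "--port", str(PORT), ZIM_FILE])
try:
    webbrowser.open(f"http://localhost:{PORT}/")
    proc.wait()          # keep serving until interrupted
except KeyboardInterrupt:
    proc.terminate()
```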
3. Kiwix provides an online tool,
https://youzim.it/
which can crawl a website and generate a file readable by the Kiwix reader. You can also set rules, but I can't make sense of them, and the GitHub repository seems to describe a different version? To use this, do I need to know some basic concepts about web crawlers?
-
The PKU unofficial-history wiki, I take it... (padding this out to the 8-character minimum)
-
@pku_jerry Not really, it's a different niche foreign wiki.
-
Let me roughly explain the Zimit advanced settings.
Language: judging by the official description below, this is an ISO-639-3 (3-letter) language code defaulting to eng, not a text encoding, so it would be something like eng rather than UTF-8.
Depth: "A website's crawl depth refers to the extent to which a search engine indexes the site's content. A site with high crawl depth will get a lot more indexed than a site with low crawl depth." For Zimit it seems to mean how many links deep the crawl follows from the seed URL; the default is -1, presumably meaning no limit. I entered -7.
Extra hops: literally "extra hops"; I'm not sure what it does. Going by the description below, it is the number of extra link hops followed beyond the current scope.
Crawl scope: the range of the crawl.
When defining a web application in the wizard, you must select a crawl scope setting. In case of authenticated scan, ensure that you always put the login link as the first link. The following settings are available.
Limit to URL hostname (abc.xyz): Select this setting to limit crawling to the hostname within the URL, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links discovered in www.example.org domain will be crawled. Also all links discovered in http://www.example.org/support and https://www.example.org:8080/logout will be crawled. No links will be followed from subdomains of www.example.org. This means http://www2.example.org and http://cdn.www.example.org/ will not be crawled.
Limit to content located at or below URL subdirectory: Select this setting to crawl all links starting with a URL subdirectory using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/. All links starting with http://www.example.org/news/ will be crawled. Also http://www.example.org/news/headlines and https://www.example.org:8080/news/ will be crawled. Links like http://www.example.org/agenda and http://www2.example.org will not be crawled.
Limit to URL hostname and specified sub-domain: Select this setting to crawl only the URL hostname and one specified sub-domain, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the sub-domain is cdn.example.org. All links discovered in www.example.org and in cdn.example.org and any of its subdomains will be crawled. Also these domains will be crawled: http://www.example.org/support, https://www.example.org:8080/logout, http://cdn.example.org/images/ and http://videos.cdn.example.org. Links whose domain does not match the web application URL hostname or is not a sub-domain of cdn.example.org will not be followed. This means http://videos.example.org will not be crawled.
Limit to URL hostname and specified domains: Select this setting to crawl only the URL hostname and specified domains, using HTTP or HTTPS and any port. Let's say your starting URL is http://www.example.org/news/ and the specified domains are cdn.example.org and site.example.org. All links discovered in www.example.org and in cdn.example.org and all other domains specified will be crawled. This means these domains will be crawled: http://www.example.org/support, https://www.example.org:8080/logout and http://cdn.example.org/images/. Links whose domain does not match the web application URL hostname or one of the domains specified will not be followed. This means http://videos.example.org and http://videos.cdn.example.org will not be crawled.
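To make the scope options above a bit more concrete, here is a rough sketch (my own illustration, not Zimit's actual code) of how an "is this page in scope?" check differs between limiting to the hostname, limiting to a subdirectory prefix, and allowing the whole domain including subdomains; the URLs are the same example.org placeholders used above:

```python
from urllib.parse import urlsplit

SEED = "http://www.example.org/news/"

def in_scope(url: str, mode: str) -> bool:
    """Toy scope check mirroring the options above; not Zimit's real logic."""
    seed, cand = urlsplit(SEED), urlsplit(url)
    if mode == "host":      # "Limit to URL hostname"
        return cand.hostname == seed.hostname
    if mode == "prefix":    # "Limit to content located at or below URL subdirectory"
        return cand.hostname == seed.hostname and cand.path.startswith(seed.path)
    if mode == "domain":    # hostname plus all of its subdomains
        root = seed.hostname.removeprefix("www.")
        return cand.hostname == seed.hostname or (
            cand.hostname is not None and cand.hostname.endswith("." + root)
        )
    raise ValueError(f"unknown mode: {mode}")

for u in ("http://www.example.org/support",
          "http://www.example.org/news/headlines",
          "http://cdn.www.example.org/logo.png"):
    print(u, {m: in_scope(u, m) for m in ("host", "prefix", "domain")})
```

Running this shows, for example, that /support is in scope for "host" and "domain" but not for "prefix", and that the cdn subdomain is only in scope for "domain".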
The rest probably don't matter much; I don't really understand crawlers either. Below are all the parameters with their brief official descriptions. Calling on the crawler experts for help.
Language
ISO-639-3 (3 chars) language code of content. Defaults to eng
Title
Custom title for ZIM. Defaults to title of main page
Description
Description for ZIM
Illustration
URL for Illustration. If unspecified, will attempt to use favicon from main page.
ZIM filename
ZIM file name (based on --name if not provided). Make sure to end with _{period}.zim
ZIM Tags
List of Tags for the ZIM file.
Content Creator
Name of content creator.
Content Source
Source name/URL of content
New Context
The context for each new capture. Defaults to page
WaitUntil
Puppeteer page.goto() condition to wait for before continuing. Defaults to load
Depth
The depth of the crawl for all seeds. Defaults to -1
Extra Hops
Number of extra 'hops' to follow, beyond the current scope. Defaults to 0
Scope Type
A predefined scope of the crawl. For more customization, use 'custom' and set include regexes. Defaults to prefix.
Include
Regex of page URLs that should be included in the crawl (defaults to the immediate directory of URL)
Exclude
Regex of page URLs that should be excluded from the crawl
Allow Hashtag URLs
Allow Hashtag URLs, useful for single-page-application crawling or when different hashtags load dynamic content
As device
Device to crawl as. Defaults to iPhone X. See Puppeteer's DeviceDescriptors.
User Agent
Override user-agent with specified
Use sitemap
Use as sitemap to get additional URLs for the crawl (usually at /sitemap.xml)
Behaviors
Which background behaviors to enable on each page. Defaults to autoplay,autofetch,siteSpecific.
Behavior Timeout
If >0, the timeout (in seconds) that in-page behaviors may run on each page. If 0, a behavior can run until it finishes. Defaults to 90
Size Limit
If set, save state and exit if size limit exceeds this value, in bytes
Time Limit
If set, save state and exit after time limit, in seconds
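For the Include / Exclude fields above, which take regular expressions over page URLs, here is a small sketch (my own example; the host and paths are placeholders for a MediaWiki-style site) of how you might test a pattern locally before pasting it into the form:

```python
import re

# Hypothetical patterns: keep everything on the target host,
# but skip edit/history pages exposed through index.php.
include = re.compile(r"^https?://www\.example\.org/")
exclude = re.compile(r"^https?://www\.example\.org/index\.php\?.*action=(edit|history)")

def keep(url: str) -> bool:
    """True if the crawler should fetch this URL under the toy rules above."""
    return bool(include.match(url)) and not exclude.match(url)

print(keep("https://www.example.org/wiki/Main_Page"))                          # True
print(keep("https://www.example.org/index.php?title=Main_Page&action=edit"))   # False
print(keep("https://cdn.example.org/logo.png"))                                # False
```

Whether Zimit combines Include and Exclude in exactly this way is an assumption on my part; the point is just to sanity-check the regexes against real page URLs first.
-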
@kgdjcb46158 Thanks for the explanation, I'll give it a try another day.
-
https://s3.us-west-1.wasabisys.com/org-kiwix-zimit/other/www.pkuanvil.com_c59aa3b1.zim
This time the zim file can open internal links, come try it out.
The above is the download link. It doesn't work well in some Kiwix clients; the Kiwix PWA is decent: https://pwa.kiwix.org
-
The browser extension should also work.
-
@admin I can't find where to submit it; it looks like you have to fork first and then open a pull request, so I'll just put it in my own repository and you can fork it directly: https://github.com/pkej1236/pkuanvil_zim
-
@kgdjcb46158 Is this the current zim of this site?
-
The zim file generated by Zimit relies on the Kiwix PWA, so it needs an internet connection, and it's a bit slow if you're not using a proxy, but it does pull together all of the site's links.
-
Crawl scope should be set to domain; that way the resources under the entire pkuanvil domain get crawled.